pyspark apply schema to csv - returns only null values

Posted 2023-04-15

Asked: 2020-07-28 12:20:27

Question:

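I have many "wide" CSV files (100+ columns) in one directory. I believe I read somewhere that by applying a schema I can preselect the columns that should be read. Unfortunately, my code returns only NULL values.

Does anyone know whether my assumption about the schema is wrong? The path in the read statement in the code below is fine.

Here is the code: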

from pyspark.sql import functions as F
from pyspark.sql import types as T

DCU_schema = T.StructType([
  T.StructField("consistId", T.StringType(), True),
  T.StructField("subsystemId", T.StringType(), True),
  T.StructField("E13", T.BooleanType(), True),
  T.StructField("E40", T.BooleanType(), True),
  T.StructField("Strom_links", T.DoubleType(), True),
  T.StructField("Strom_rechts", T.DoubleType(), True),
  T.StructField("Spannung_links", T.DoubleType(), True),
  T.StructField("Spannung_rechts", T.DoubleType(), True),
  T.StructField("Position_links", T.IntegerType(), True),
  T.StructField("Position_rechts", T.IntegerType(), True),
  T.StructField("canTimeStamp", T.LongType(), True),
  T.StructField("latitude", T.DoubleType(), True),
  T.StructField("longitude", T.DoubleType(), True),
  T.StructField("fileName", T.StringType(), True)
])

first_kb_df = (spark.read.csv(path=path, schema=DCU_schema, inferSchema=False, header=True, sep=";")
              .orderBy("canTimeStamp"))
display(first_kb_df)

A screenshot of the result is attached.

Thanks in advance for your help, and regards, Alex

Screenshot of Returned Data

Screenshot of Input Data

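Comments:

Can you add some sample input data?

Can you check the input files? Try reading them with inferSchema=True and header=True.

Input file -> see the new screenshot in the original post.

Answer 1:

From the Microsoft documentation: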

import org.apache.spark.sql.types._

val schema = new StructType()
  .add("_c0",IntegerType,true)
  .add("carat",DoubleType,true)
  .add("cut",StringType,true)
  .add("color",StringType,true)
  .add("clarity",StringType,true)
  .add("depth",DoubleType,true)
  .add("table",DoubleType,true)
  .add("price",IntegerType,true)
  .add("x",DoubleType,true)
  .add("y",DoubleType,true)
  .add("z",DoubleType,true)

val diamonds_with_schema = spark.read.format("csv")
  .option("header", "true")
  .schema(schema)
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

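Comments:

Or your types differ too much from the actual types, e.g. you are trying to convert a float (a value containing a ".") to an integer.

This is Scala, though that is not the issue; I had already found this example. What differs in my question is that this schema includes all columns. I have 100+ columns and only need to read 14 of them. The types are 100% correct.

Using a schema will not help here. What I would do is add a .select(*lCols) to the read and put the required columns in a list called lCols.

Hi, but that does not solve the described problem of applying a predefined schema. I found out that for CSV files it is not possible to apply a schema to a subset of the data. I have now solved it as follows: I read the .csv (inferSchema=False), select the columns (as you suggested), and then use withColumn("E13", F.col("E13").cast("boolean")) to get the desired Spark types. Thanks for your comments and have a nice day.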
