Why are there two options to read a CSV file in PySpark? Which one should I use?

Posted 2023-04-18

【Posted】: 2019-10-06 22:42:20

【Question】:

Spark 2.4.4:

I want to import a CSV file, but there are two options. Why is that? Which one is better, and which one should I use?

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .config('spark.cores.max', '3') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.driver.memory','1g') \
    .getOrCreate()

Option 1

df = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/myfile.csv")

Option 2

df = spark.read.load("data/myfile.csv", format="csv", inferSchema="true", header="true")

【Comments】:

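They both do the same thing. There is also a third option: spark.read.csv("data/myfile.csv", inferSchema=True, header=True)

Thanks! Why are there so many ways to do the same thing? One more question: is the Python syntax the same as Scala's, for example when calling functions? For example here: spark.apache.org/docs/latest/… When I switch between Scala and Python, it is always basically the same code, just as they write it...

No, Python does not have the same syntax, for example if you call .map() on a DataFrame.

But it is roughly 90-95% the same? Can you explain why the syntax for calling .map() differs? For example here: spark.apache.org/examples.html the syntax is the same in Python and Scala: textFile.flatMap(...).map(...).reduceByKey(...). Or do you mean rdd.map()?

【Answer 1】: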

Since Spark 2, com.databricks.spark.csv no longer needs to be written out in full, because a CSV reader is built in. Option 2 is therefore preferred.

Or, slightly shorter:

spark.read.csv("data/myfile.csv", inferSchema=True, header=True)

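However, if the input format is pulled out into a configuration file, option 2 is the better fit, because the format is passed as a plain string.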
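The configuration point can be sketched as follows. This is only an illustration, not part of the original answer: the `source_config` dictionary and the `load_from_config` helper are hypothetical names, and the session is passed in as an argument. The only Spark call used is `spark.read.load(path, format=..., **options)`, the same call as in option 2.

```python
# Hypothetical configuration; in practice it might be parsed from a JSON or YAML file.
source_config = {
    "path": "data/myfile.csv",
    "format": "csv",
    "options": {"header": "true", "inferSchema": "true"},
}


def load_from_config(spark, config):
    """Read a dataset whose format and reader options come from configuration.

    `spark` is an existing SparkSession. Switching the input to another
    format then only requires a config change, not a code change.
    """
    return spark.read.load(
        config["path"],
        format=config["format"],
        **config.get("options", {}),
    )
```

With the session from the question, `df = load_from_config(spark, source_config)` would behave like option 2.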

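【Comments】:

【Answer 2】:

In every language, whether a programming language or a spoken one, there are always several ways to achieve the same thing.

Options when reading CSV files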

Spark's CSV data source provides multiple options for working with CSV files. The main ones are described below.

delimiter

The delimiter option specifies the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to any character using this option.


val df2 = spark.read.options(Map("delimiter"->","))
  .csv("src/main/resources/zipcodes.csv")

inferSchema

This option defaults to false. When set to true, Spark infers the column types from the data, which requires reading the data one extra time.


val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->","))
  .csv("src/main/resources/zipcodes.csv")

header

This option tells Spark to read the first line of the CSV file as column names. It defaults to false. Unless inferSchema is enabled, every column is read as a string.


val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true"))
  .csv("src/main/resources/zipcodes.csv")

quote

When a column value contains the character used as the column delimiter, use the quote option to specify the quote character. By default it is the double quote ("), and delimiters inside quotes are ignored. With this option you can set any other character.

nullValue

The nullValue option specifies a string in the CSV that should be treated as null. For example, use it if you want a date column with the value "1900-01-01" to be set to null in the DataFrame.

dateFormat

The dateFormat option sets the format of input DateType columns. TimestampType columns are controlled by the separate timestampFormat option. In Spark 2.x, the version used in the question, the pattern follows java.text.SimpleDateFormat.
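The three options above have no example in the answer, so here is a minimal PySpark sketch of how they might be passed. The `CSV_OPTIONS` dictionary and the `read_csv_with_options` helper are hypothetical names introduced for illustration, and the option values are examples only. Note that the reader option names are `quote` and `nullValue` (singular).

```python
# Illustrative option values; adjust them to the file being read.
CSV_OPTIONS = {
    "header": "true",
    "quote": '"',               # delimiters inside quoted values are not split
    "nullValue": "1900-01-01",  # this exact string is read as null
    "dateFormat": "yyyy-MM-dd", # pattern used to parse DateType columns
}


def read_csv_with_options(spark, path, options=CSV_OPTIONS):
    """Read a CSV file with the given reader options.

    `spark` is an existing SparkSession; `options` is passed through
    DataFrameReader.options(**options) before calling .csv(path).
    """
    return spark.read.options(**options).csv(path)
```

For example, `read_csv_with_options(spark, "src/main/resources/zipcodes.csv")` would apply all four options in one call.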

Note: Besides the options above, the Spark CSV data source supports many other options.

Reading CSV files with a user-specified custom schema

If you know the schema of the file ahead of time and do not want to rely on the inferSchema option for column names and types, supply your own column names and types with the schema option.


    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

    // Explicit schema: column name, type, nullable
    val schema = new StructType()
      .add("RecordNumber",IntegerType,true)
      .add("Zipcode",IntegerType,true)
      .add("City",StringType,true)
      .add("State",StringType,true)
      .add("Notes",StringType,true)
    // Apply the schema instead of inferring it from the data
    val df_with_schema = spark.read.format("csv")
      .option("header", "true")
      .schema(schema)
      .load("src/main/resources/zipcodes.csv")
    df_with_schema.printSchema()
    df_with_schema.show(false)
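Since the question is about PySpark, a rough Python counterpart of the Scala snippet above may be useful. It is a sketch under two assumptions: Spark 2.3 or later, where `DataFrameReader.schema` accepts a DDL-formatted string, and the same column layout as the Scala example. `COLUMNS`, `ddl_schema` and `read_with_schema` are hypothetical names.

```python
# Column names and Spark SQL types, mirroring the Scala StructType above.
COLUMNS = [
    ("RecordNumber", "INT"),
    ("Zipcode", "INT"),
    ("City", "STRING"),
    ("State", "STRING"),
    ("Notes", "STRING"),
]


def ddl_schema(columns):
    """Build a DDL-formatted schema string such as 'a INT, b STRING'."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in columns)


def read_with_schema(spark, path):
    """Read a CSV with an explicit schema, so no inference pass is needed.

    `spark` is an existing SparkSession.
    """
    return (
        spark.read.format("csv")
        .option("header", "true")
        .schema(ddl_schema(COLUMNS))
        .load(path)
    )
```

A `StructType` built from `pyspark.sql.types` could be passed to `.schema(...)` instead of the string; the DDL form is used here only because it is shorter.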

https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/

【Comments】:
