Why are there two options to read a CSV file in PySpark? Which one should I use?
Posted: 2019-10-06 22:42:20

Spark 2.4.4:
I want to import a CSV file, but there are two options. Why is that? Which one is better? Which one should I use?
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .config('spark.cores.max', '3') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.driver.memory', '1g') \
    .getOrCreate()
Option 1
df = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/myfile.csv")
Option 2
df = spark.read.load("data/myfile.csv", format="csv", inferSchema="true", header="true")
Comments:
They both do the same thing. There is also a third option: spark.read.csv("data/myfile.csv", inferSchema=True, header=True)
Thanks! Why are there so many ways to do the same thing? One more question: is Python's syntax the same as Scala's, for example when calling functions? For example here: spark.apache.org/docs/latest/… when I switch between Scala and Python, it is always basically the same code.
As they wrote... No, Python does not have the same syntax, for example if you call .map() on a DataFrame.
But is it like 90-95% the same? Can you explain why the syntax for calling the .map() function differs? For example here: spark.apache.org/examples.html the Python and Scala syntax are the same: textFile.flatMap(...).map(...).reduceByKey(...)
Or do you mean rdd.map()?
Answer 1:
Since Spark 2, com.databricks.spark.csv does not need to be written out in full, because the CSV reader is built in. So Option 2 would be preferred.
Or, slightly shorter:
spark.read.csv("data/myfile.csv", inferSchema=True, header=True)
But Option 2 works better if you extract the input format into a config file somewhere.
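The point about extracting the format into a config file can be sketched roughly like this (the dict name and helper function are hypothetical, not from the answer): because spark.read.load takes the format and reader options as keyword arguments, all the settings can live in one dict that could be loaded from a config file.

```python
# Hypothetical sketch: reader settings kept in one dict (which could be
# loaded from a config file), passed straight to spark.read.load as kwargs.
READER_CONF = {"format": "csv", "inferSchema": "true", "header": "true"}

def load_with_conf(spark, path, conf=READER_CONF):
    # spark.read.load accepts `format` and reader options as keyword arguments
    return spark.read.load(path, **conf)

# df = load_with_conf(spark, "data/myfile.csv")
```

Switching the whole job from CSV to, say, Parquet then only means editing the config, not the code.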
Answer 2:
In every language (whether programming or spoken), there are always several different ways to achieve the same thing.
Options when reading a CSV file
The Spark CSV data source provides multiple options for working with CSV files; all of these options are set on the reader, as shown below.
delimiter
The delimiter option is used to specify the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to any character using this option.
val df2 = spark.read.options(Map("delimiter" -> ","))
  .csv("src/main/resources/zipcodes.csv")
inferSchema
The default value of this option is false; when set to true, it automatically infers the column types based on the data. This requires reading the data one more time to infer the schema.
val df2 = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> ","))
  .csv("src/main/resources/zipcodes.csv")
header
This option is used to read the first line of the CSV file as column names. By default its value is false, and all column types are assumed to be string.
val df2 = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
  .csv("src/main/resources/zipcodes.csv")
quotes
When a column value contains the delimiter character used to split the columns, use the quote option to specify the quote character. By default it is the double quote ("), and delimiters inside quotes are ignored; using this option you can set it to any character.
nullValues
Using the nullValue option you can specify a string in the CSV that should be read as null. For example, if you want a date column with the value "1900-01-01" to be set to null on the DataFrame, set nullValue to that string.
dateFormat
The dateFormat option is used to set the format of input DateType and TimestampType columns. It supports all java.text.SimpleDateFormat formats.
Note: Besides the above, the Spark CSV data source also supports many other options; please refer to the article linked below for details.
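Since the question is about PySpark, note that the same options map onto keyword arguments of spark.read.csv there (PySpark calls the delimiter sep). A small hypothetical helper, just to show how the names line up; it only assembles the keyword arguments and leaves the actual read to an existing SparkSession:

```python
def csv_reader_kwargs(sep=",", infer_schema=True, header=True,
                      null_value=None, date_format=None):
    """Hypothetical helper: build keyword args for spark.read.csv (PySpark names)."""
    kwargs = {"sep": sep, "inferSchema": infer_schema, "header": header}
    if null_value is not None:
        kwargs["nullValue"] = null_value    # string to interpret as null
    if date_format is not None:
        kwargs["dateFormat"] = date_format  # SimpleDateFormat pattern
    return kwargs

# assuming an existing SparkSession `spark` and the file from the article:
# df = spark.read.csv("src/main/resources/zipcodes.csv",
#                     **csv_reader_kwargs(date_format="yyyy-MM-dd"))
```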
Reading a CSV file with a user-specified custom schema
If you know the schema of the file ahead of time and do not want to use the inferSchema option for column names and types, specify user-defined custom column names and types using the schema option.
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("RecordNumber", IntegerType, true)
  .add("Zipcode", IntegerType, true)
  .add("City", StringType, true)
  .add("State", StringType, true)
  .add("Notes", StringType, true)

val df_with_schema = spark.read.format("csv")
  .option("header", "true")
  .schema(schema)
  .load("src/main/resources/zipcodes.csv")
df_with_schema.printSchema()
df_with_schema.show(false)
https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/