Reading a dataframe after converting to csv file renders incorrect dataframe in Scala
Posted: 2018-07-15 22:24:45

I am trying to write the following dataframe to a csv file:
df:
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
| title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser| _id| author| description| genre|price|publish_date|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
|XML Developer's G...| _CONFIG_CONTEXT| #id13| qwe| 18|bk101|Gambardella, Matthew|An in-depth look ...|Computer|44.95| 2000-10-01|
| Midnight Rain| _CONFIG_CONTEXT| #id13| dfdfrt| 19|bk102| Ralls, Kim|A former architec...| Fantasy| 5.95| 2000-12-16|
| Maeve Ascendant| _CONFIG_CONTEXT| #id13| dfdf| 20|bk103| Corets, Eva|After the collaps...| Fantasy| 5.95| 2000-11-17|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
I am using this code to write the csv file:
df.write.format("com.databricks.spark.csv").option("header", "true").save("hdfsOut")
This creates 3 different csv files inside the hdfsOut folder. But when I try to read the data back with
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv("hdfsOut")
csvdf.show()
it displays the dataframe in an incorrect form, as shown below:
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
| title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser| _id| author| description|genre|price|publish_date|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
| Maeve Ascendant| _CONFIG_CONTEXT| #id13| dfdf| 20|bk103| Corets, Eva|After the collaps...| null| null| null|
| society in ...| the young surviv...| null| null| null| null| null| null| null| null| null|
| foundation ...| Fantasy| 5.95| 2000-11-17| null| null| null| null| null| null| null|
| Midnight Rain| _CONFIG_CONTEXT| #id13| dfdfrt| 19|bk102| Ralls, Kim|A former architec...| null| null| null|
| an evil sor...| and her own chil...| null| null| null| null| null| null| null| null| null|
| of the world."| Fantasy| 5.95| 2000-12-16| null| null| null| null| null| null| null|
|XML Developer's G...| _CONFIG_CONTEXT| #id13| qwe| 18|bk101|Gambardella, Matthew|An in-depth look ...| null| null| null|
| with XML...| Computer| 44.95| 2000-10-01| null| null| null| null| null| null| null|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
I need this csv file so that I can feed it to Amazon Athena, and Athena also renders the data in the same broken format as the second output. Ideally, reading back from the converted csv file should show just the 3 rows.
Any idea why this happens, and how to fix it so that the csv data comes back in the correct form, as in the first output?
Comments:

What are the characters immediately before "society in", "foundation", etc., prior to writing the csv?
Those are basically the contents of the description field, like this: After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.
Answer 1:
The data in your description column must contain newline characters and commas, like this:
"After the collapse of a nanotechnology \nsociety in England, the young survivors lay the \nfoundation for a new society"
So, for testing purposes, I created a dataframe:
// toDF on a local Seq needs the implicits from the active SparkSession
import spark.implicits._

val df = Seq(
  ("Maeve Ascendant", "_CONFIG_CONTEXT", "#id13", "dfdf", "20", "bk103", "Corets, Eva", "After the collapse of a nanotechnology \nsociety in England, the young survivors lay the \nfoundation for a new society", "Fantasy", "5.95", "2000-11-17")
).toDF("title", "UserData.UserValue._title", "UserData.UserValue._valueRef", "UserData.UserValue._valuegiven", "UserData._idUser", "_id", "author", "description", "genre", "price", "publish_date")
df.show()
which displayed the same dataframe format as in your question:
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+
| title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser| _id| author| description| genre|price|publish_date|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+
|Maeve Ascendant| _CONFIG_CONTEXT| #id13| dfdf| 20|bk103|Corets, Eva|After the collaps...|Fantasy| 5.95| 2000-11-17|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+
But df.show(false) reveals the exact values:
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+
|title |UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser|_id |author |description |genre |price|publish_date|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+
|Maeve Ascendant|_CONFIG_CONTEXT |#id13 |dfdf |20 |bk103|Corets, Eva|After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society|Fantasy|5.95 |2000-11-17 |
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+
When you save this as csv, Spark writes it out as plain text, keeping the newline characters and the commas, and the result is treated as a simple text csv file. In csv, a newline starts a new record and a comma starts a new field, so those characters in the data are the culprit.
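You can confirm this by reading the written folder back as plain text; a quick sketch, assuming the same spark session and the hdfsOut folder from the question:

// Each physical line of the csv files becomes one row of this Dataset[String],
// so a record with embedded newlines shows up split across several rows.
spark.read.textFile("hdfsOut").show(false)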
Solution 1
You can save the dataframe in parquet format, which preserves the dataframe's structure, and read it back as parquet:
df.write.parquet("hdfsOut")
var csvdf = spark.read.parquet("hdfsOut")
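Parquet sidesteps the problem entirely: it is a binary columnar format that stores the schema and the values directly, so embedded newlines in a string column can never split a record. Athena can also query parquet files natively, which makes this the simplest route for your use case.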
Solution 2
Save it in csv format as before, and use the multiLine option when reading it back:
df.write.format("com.databricks.spark.csv").option("header", "true").save("hdfsOut")
var csvdf = spark.read.format("org.apache.spark.csv").option("multiLine", "true").option("header", true).csv("hdfsOut")
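One caveat worth adding (my note, not part of the original answer): the multiLine csv option is only available in Spark 2.2 and later, and even when Spark reads the files back correctly, Athena's default csv serde may still treat every physical line as a record. If the file has to stay csv for Athena, a third option is to strip the newlines from the offending column before writing; a minimal sketch, where hdfsOutFlat is a hypothetical output path:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace embedded newlines and carriage returns in description with a single
// space so that every record stays on one physical line of the csv output.
val flatDf = df.withColumn("description",
  regexp_replace(col("description"), "[\\r\\n]+", " "))
flatDf.write.format("com.databricks.spark.csv").option("header", "true").save("hdfsOutFlat")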
Hope the answer is helpful.
Comments:
Thanks for your answer. Yes, the data was incorrectly formatted and contained newline characters. Marked as accepted.