Spark can't read CSV file and convert to Dataset

【Title】: Spark can't read CSV file and convert to Dataset 【Posted】: 2018-12-05 08:47:23 【Question】:

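When I use Spark to read a CSV file and convert it to a Dataset, I get the error below, and I can't figure out why. My code is provided below. The CSV file can be downloaded from http://eforexcel.com/wp/wp-content/uploads/2017/07/10000-Sales-Records.zip.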

I am using Scala 2.12.3 and Spark 2.4.0.

Error message:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`itemType`' given input columns: [Order ID, Total Profit, Country, Total Revenue, Ship Date, Unit Cost, Sales Channel, Unit Price, Total Cost, Units Sold, Order Date, Order Priority, Region, Item Type];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:110)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:278)
...
...

Here is my code:

import spark.implicits._

case class Sales(region: String,
                 country: String,
                 itemType: String,
                 salesChannel: String,
                 orderPriority: String,
                 orderDate: String,
                 orderId: Long,
                 shipDate: String,
                 unitsSold: Integer,
                 unitsPrice: Double,
                 unitCost: Double,
                 totalRevenue: Double,
                 totalCost: Double,
                 totalProfit: Double)

val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/datasets/10000 Sales Records.csv")
  .as[Sales]

【Answer 1】:

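Your CSV header columns do not match the fields of the case class. You need to rename the columns so that they match: remove the spaces and capitalize the second word. Spark resolves column names case-insensitively by default, so a column named ItemType is matched to the field itemType. The following workaround should work for you.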

Note that I renamed unitsPrice: Double to unitPrice in your case class.

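// Assumes `spark`, `import spark.implicits._` and the Sales case class from the
// question (with unitsPrice renamed to unitPrice) are in scope.
val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("in/10000_Sales_Records.csv")

ds.printSchema()

// Rename the header columns: join two-word names ("Item Type" -> "ItemType")
// and lower-case single-word names ("Region" -> "region").
val sch1 = ds.columns.map {
  case a if a.contains(" ") =>
    val q = a.split(" ")
    q(0) + q(1).capitalize
  case a => a.toLowerCase
}
val ds2 = ds.toDF(sch1: _*)
ds2.printSchema()

val ds3 = ds2.as[Sales]
ds3.show(false)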

Result:

root
 |-- Region: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Item Type: string (nullable = true)
 |-- Sales Channel: string (nullable = true)
 |-- Order Priority: string (nullable = true)
 |-- Order Date: string (nullable = true)
 |-- Order ID: integer (nullable = true)
 |-- Ship Date: string (nullable = true)
 |-- Units Sold: integer (nullable = true)
 |-- Unit Price: double (nullable = true)
 |-- Unit Cost: double (nullable = true)
 |-- Total Revenue: double (nullable = true)
 |-- Total Cost: double (nullable = true)
 |-- Total Profit: double (nullable = true)

root
 |-- region: string (nullable = true)
 |-- country: string (nullable = true)
 |-- ItemType: string (nullable = true)
 |-- SalesChannel: string (nullable = true)
 |-- OrderPriority: string (nullable = true)
 |-- OrderDate: string (nullable = true)
 |-- OrderID: integer (nullable = true)
 |-- ShipDate: string (nullable = true)
 |-- UnitsSold: integer (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- UnitCost: double (nullable = true)
 |-- TotalRevenue: double (nullable = true)
 |-- TotalCost: double (nullable = true)
 |-- TotalProfit: double (nullable = true)

+---------------------------------+--------------------------------+---------------+------------+-------------+----------+---------+----------+---------+---------+--------+------------+----------+-----------+
|region                           |country                         |ItemType       |SalesChannel|OrderPriority|OrderDate |OrderID  |ShipDate  |UnitsSold|UnitPrice|UnitCost|TotalRevenue|TotalCost |TotalProfit|
+---------------------------------+--------------------------------+---------------+------------+-------------+----------+---------+----------+---------+---------+--------+------------+----------+-----------+
|Sub-Saharan Africa               |Chad                            |Office Supplies|Online      |L            |1/27/2011 |292494523|2/12/2011 |4484     |651.21   |524.96  |2920025.64  |2353920.64|566105.0   |
|Europe                           |Latvia                          |Beverages      |Online      |C            |12/28/2015|361825549|1/23/2016 |1075     |47.45    |31.79   |51008.75    |34174.25  |16834.5    |
|Middle East and North Africa     |Pakistan                        |Vegetables     |Offline     |C            |1/13/2011 |141515767|2/1/2011  |6515     |154.06   |90.93   |1003700.9   |592408.95 |411291.95  |
|Sub-Saharan Africa               |Democratic Republic of the Congo|Household      |Online      |C            |9/11/2012 |500364005|10/6/2012 |7683     |668.27   |502.54  |5134318.41  |3861014.82|1273303.59 |
|Europe                           |Czech Republic                  |Beverages      |Online      |C            |10/27/2015|127481591|12/5/2015 |3491     |47.45    |31.79   |165647.95   |110978.89 |54669.06   |
|Sub-Saharan Africa               |South Africa                    |Beverages      |Offline     |H            |7/10/2012 |482292354|8/21/2012 |9880     |47.45    |31.79   |468806.0    |314085.2  |154720.8   |
|Asia                             |Laos                            |Vegetables     |Online      |L            |2/20/2011 |844532620|3/20/2011 |4825     |154.06   |90.93   |743339.5    |438737.25 |304602.25  |
|Asia                             |China                           |Baby Food      |Online      |C            |4/10/2017 |564251220|5/12/2017 |3330     |255.28   |159.42  |850082.4    |530868.6  |319213.8   |
|Sub-Saharan Africa               |Eritrea                         |Meat           |Online      |L            |11/21/2014|411809480|1/10/2015 |2431     |421.89   |364.69  |1025614.59  |886561.39 |139053.2   |
|Central America and the Caribbean|Haiti                           |Office Supplies|Online      |C            |7/4/2015  |327881228|7/20/2015 |6197     |651.21   |524.96  |4035548.37  |3253177.12|782371.25  |
|Sub-Saharan Africa               |Zambia                          |Cereal         |Offline     |M            |7/26/2016 |773452794|8/24/2016 |724      |205.7    |117.11  |148926.8    |84787.64  |64139.16   |
|Europe                           |Bosnia and Herzegovina          |Baby Food      |Offline     |M            |10/20/2012|479823005|11/15/2012|9145     |255.28   |159.42  |2334535.6   |1457895.9 |876639.7   |
|Europe                           |Germany                         |Office Supplies|Online      |C            |2/22/2015 |498603188|2/27/2015 |6618     |651.21   |524.96  |4309707.78  |3474185.28|835522.5   |
|Asia                             |India                           |Household      |Online      |C            |8/27/2016 |151717174|9/2/2016  |5338     |668.27   |502.54  |3567225.26  |2682558.52|884666.74  |
|Middle East and North Africa     |Algeria                         |Clothes        |Offline     |C            |6/21/2011 |181401288|7/21/2011 |9527     |109.28   |35.84   |1041110.56  |341447.68 |699662.88  |
|Australia and Oceania            |Palau                           |Snacks         |Offline     |L            |9/19/2013 |500204360|10/4/2013 |441      |152.58   |97.44   |67287.78    |42971.04  |24316.74   |
|Central America and the Caribbean|Cuba                            |Beverages      |Online      |H            |11/15/2015|640987718|11/30/2015|1365     |47.45    |31.79   |64769.25    |43393.35  |21375.9    |
|Europe                           |Vatican City                    |Beverages      |Online      |L            |4/6/2015  |206925189|4/27/2015 |2617     |47.45    |31.79   |124176.65   |83194.43  |40982.22   |
|Middle East and North Africa     |Lebanon                         |Personal Care  |Offline     |H            |4/12/2010 |221503102|5/19/2010 |6545     |81.73    |56.67   |534922.85   |370905.15 |164017.7   |
|Europe                           |Lithuania                       |Snacks         |Offline     |H            |9/26/2011 |878520286|10/2/2011 |2530     |152.58   |97.44   |386027.4    |246523.2  |139504.2   |
+---------------------------------+--------------------------------+---------------+------------+-------------+----------+---------+----------+---------+---------+--------+------------+----------+-----------+
only showing top 20 rows
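
The renaming step in the answer above can be generalized. What follows is only a minimal sketch and not part of the original answer: `HeaderNames` and `toCamelCase` are hypothetical names, and it assumes the header words are separated by whitespace. Unlike the answer's snippet, it lower-cases the first word and handles headers with more than two words, so `Item Type` becomes `itemType` and matches the case class field exactly instead of relying on case-insensitive column resolution.

```scala
object HeaderNames {
  // Convert a header such as "Item Type" into lower camel case ("itemType").
  // Assumption: words in the header are separated by whitespace.
  def toCamelCase(header: String): String = {
    val words = header.trim.split("\\s+").filter(_.nonEmpty).map(_.toLowerCase)
    if (words.isEmpty) "" else words.head + words.tail.map(_.capitalize).mkString
  }

  def main(args: Array[String]): Unit = {
    // A few of the column names listed in the error message.
    val header = Seq("Region", "Item Type", "Order ID", "Units Sold", "Total Profit")
    // Prints region, itemType, orderId, unitsSold, totalProfit (one per line).
    header.map(toCamelCase).foreach(println)
  }
}
```

With the DataFrame from the answer, this would presumably be applied as `ds.toDF(ds.columns.map(HeaderNames.toCamelCase): _*).as[Sales]`, where `: _*` expands the array into the varargs parameter of `toDF`. The case class still needs the `unitPrice` rename noted in the answer.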

【Comments】:

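Can you explain the sch1:_* in val ds2 = ds.toDF(sch1:_*)?

@stack0114106 It is a varargs expansion: sch1 contains the reformatted column names from the CSV header, and this syntax passes them to toDF as the new column names. It is very concise and easy to use.

【Answer 2】: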

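The input file probably contains this header: Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, Total Profit.

Edit the header in either the input file or the case class so that the names agree.

For example, the column is named Item Type (with a space), whereas the case class field itemType has no space.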
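
One way to see which names disagree is to compare the header with the case class fields directly. The sketch below is plain Scala and makes a few assumptions: `HeaderCheck` and `unmatchedFields` are hypothetical names, both lists are copied by hand from the error message and from the `Sales` case class in the question, and the comparison ignores case because Spark resolves column names case-insensitively by default.

```scala
object HeaderCheck {
  // Return the fields that have no column with the same name, ignoring case.
  def unmatchedFields(columns: Seq[String], fields: Seq[String]): Seq[String] = {
    val known = columns.map(_.toLowerCase).toSet
    fields.filterNot(f => known.contains(f.toLowerCase))
  }

  def main(args: Array[String]): Unit = {
    // Column names as listed in the AnalysisException.
    val columns = Seq("Region", "Country", "Item Type", "Sales Channel",
      "Order Priority", "Order Date", "Order ID", "Ship Date", "Units Sold",
      "Unit Price", "Unit Cost", "Total Revenue", "Total Cost", "Total Profit")
    // Field names as declared in the question's Sales case class.
    val fields = Seq("region", "country", "itemType", "salesChannel",
      "orderPriority", "orderDate", "orderId", "shipDate", "unitsSold",
      "unitsPrice", "unitCost", "totalRevenue", "totalCost", "totalProfit")
    // Every field except region and country is reported as unmatched.
    unmatchedFields(columns, fields).foreach(println)
  }
}
```

Only `region` and `country` find a match, which is consistent with the exception naming `itemType`, the third field, as the one it could not resolve.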
