Spark can't read CSV file and convert to Dataset
【Title】: Spark can't read CSV file and convert to Dataset 【Posted】: 2018-12-05 08:47:23 【Question】: When I use Spark to read a CSV file and convert it to a Dataset, I get the error below, and I can't figure out the cause. My code is provided below. The CSV file can be downloaded from http://eforexcel.com/wp/wp-content/uploads/2017/07/10000-Sales-Records.zip.
I am using Scala 2.12.3 and Spark 2.4.0.
Error message:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`itemType`' given input columns: [Order ID, Total Profit, Country, Total Revenue, Ship Date, Unit Cost, Sales Channel, Unit Price, Total Cost, Units Sold, Order Date, Order Priority, Region, Item Type];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:110)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:107)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:278)
...
...
Here is my code:
import spark.implicits._

case class Sales(region: String,
                 country: String,
                 itemType: String,
                 salesChannel: String,
                 orderPriority: String,
                 orderDate: String,
                 orderId: Long,
                 shipDate: String,
                 unitsSold: Integer,
                 unitsPrice: Double,
                 unitCost: Double,
                 totalRevenue: Double,
                 totalCost: Double,
                 totalProfit: Double
                )

val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/datasets/10000 Sales Records.csv")
  .as[Sales]
【Comments】:
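【Answer 1】: The column names in your CSV header do not match the field names of your case class. You need to rename the header columns so that they line up with the case class: remove the spaces and capitalize the second word. The following workaround should work for you.
Note that I changed unitsPrice: Double to unitPrice in your case class.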
val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("in/10000_Sales_Records.csv")
ds.printSchema()
// Rename the headers: join two-word names without the space ("Item Type" -> "ItemType")
// and lower-case single-word names ("Region" -> "region").
val sch1 = ds.columns.map {
  case a if a.contains(" ") =>
    val q = a.split(" ")
    q(0) + q(1).capitalize
  case a => a.toLowerCase
}
val ds2 = ds.toDF(sch1: _*)
ds2.printSchema()
val ds3 = ds2.as[Sales]
ds3.show(false)
Result:
root
|-- Region: string (nullable = true)
|-- Country: string (nullable = true)
|-- Item Type: string (nullable = true)
|-- Sales Channel: string (nullable = true)
|-- Order Priority: string (nullable = true)
|-- Order Date: string (nullable = true)
|-- Order ID: integer (nullable = true)
|-- Ship Date: string (nullable = true)
|-- Units Sold: integer (nullable = true)
|-- Unit Price: double (nullable = true)
|-- Unit Cost: double (nullable = true)
|-- Total Revenue: double (nullable = true)
|-- Total Cost: double (nullable = true)
|-- Total Profit: double (nullable = true)
root
|-- region: string (nullable = true)
|-- country: string (nullable = true)
|-- ItemType: string (nullable = true)
|-- SalesChannel: string (nullable = true)
|-- OrderPriority: string (nullable = true)
|-- OrderDate: string (nullable = true)
|-- OrderID: integer (nullable = true)
|-- ShipDate: string (nullable = true)
|-- UnitsSold: integer (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- UnitCost: double (nullable = true)
|-- TotalRevenue: double (nullable = true)
|-- TotalCost: double (nullable = true)
|-- TotalProfit: double (nullable = true)
+---------------------------------+--------------------------------+---------------+------------+-------------+----------+---------+----------+---------+---------+--------+------------+----------+-----------+
|region |country |ItemType |SalesChannel|OrderPriority|OrderDate |OrderID |ShipDate |UnitsSold|UnitPrice|UnitCost|TotalRevenue|TotalCost |TotalProfit|
+---------------------------------+--------------------------------+---------------+------------+-------------+----------+---------+----------+---------+---------+--------+------------+----------+-----------+
|Sub-Saharan Africa |Chad |Office Supplies|Online |L |1/27/2011 |292494523|2/12/2011 |4484 |651.21 |524.96 |2920025.64 |2353920.64|566105.0 |
|Europe |Latvia |Beverages |Online |C |12/28/2015|361825549|1/23/2016 |1075 |47.45 |31.79 |51008.75 |34174.25 |16834.5 |
|Middle East and North Africa |Pakistan |Vegetables |Offline |C |1/13/2011 |141515767|2/1/2011 |6515 |154.06 |90.93 |1003700.9 |592408.95 |411291.95 |
|Sub-Saharan Africa |Democratic Republic of the Congo|Household |Online |C |9/11/2012 |500364005|10/6/2012 |7683 |668.27 |502.54 |5134318.41 |3861014.82|1273303.59 |
|Europe |Czech Republic |Beverages |Online |C |10/27/2015|127481591|12/5/2015 |3491 |47.45 |31.79 |165647.95 |110978.89 |54669.06 |
|Sub-Saharan Africa |South Africa |Beverages |Offline |H |7/10/2012 |482292354|8/21/2012 |9880 |47.45 |31.79 |468806.0 |314085.2 |154720.8 |
|Asia |Laos |Vegetables |Online |L |2/20/2011 |844532620|3/20/2011 |4825 |154.06 |90.93 |743339.5 |438737.25 |304602.25 |
|Asia |China |Baby Food |Online |C |4/10/2017 |564251220|5/12/2017 |3330 |255.28 |159.42 |850082.4 |530868.6 |319213.8 |
|Sub-Saharan Africa |Eritrea |Meat |Online |L |11/21/2014|411809480|1/10/2015 |2431 |421.89 |364.69 |1025614.59 |886561.39 |139053.2 |
|Central America and the Caribbean|Haiti |Office Supplies|Online |C |7/4/2015 |327881228|7/20/2015 |6197 |651.21 |524.96 |4035548.37 |3253177.12|782371.25 |
|Sub-Saharan Africa |Zambia |Cereal |Offline |M |7/26/2016 |773452794|8/24/2016 |724 |205.7 |117.11 |148926.8 |84787.64 |64139.16 |
|Europe |Bosnia and Herzegovina |Baby Food |Offline |M |10/20/2012|479823005|11/15/2012|9145 |255.28 |159.42 |2334535.6 |1457895.9 |876639.7 |
|Europe |Germany |Office Supplies|Online |C |2/22/2015 |498603188|2/27/2015 |6618 |651.21 |524.96 |4309707.78 |3474185.28|835522.5 |
|Asia |India |Household |Online |C |8/27/2016 |151717174|9/2/2016 |5338 |668.27 |502.54 |3567225.26 |2682558.52|884666.74 |
|Middle East and North Africa |Algeria |Clothes |Offline |C |6/21/2011 |181401288|7/21/2011 |9527 |109.28 |35.84 |1041110.56 |341447.68 |699662.88 |
|Australia and Oceania |Palau |Snacks |Offline |L |9/19/2013 |500204360|10/4/2013 |441 |152.58 |97.44 |67287.78 |42971.04 |24316.74 |
|Central America and the Caribbean|Cuba |Beverages |Online |H |11/15/2015|640987718|11/30/2015|1365 |47.45 |31.79 |64769.25 |43393.35 |21375.9 |
|Europe |Vatican City |Beverages |Online |L |4/6/2015 |206925189|4/27/2015 |2617 |47.45 |31.79 |124176.65 |83194.43 |40982.22 |
|Middle East and North Africa |Lebanon |Personal Care |Offline |H |4/12/2010 |221503102|5/19/2010 |6545 |81.73 |56.67 |534922.85 |370905.15 |164017.7 |
|Europe |Lithuania |Snacks |Offline |H |9/26/2011 |878520286|10/2/2011 |2530 |152.58 |97.44 |386027.4 |246523.2 |139504.2 |
+---------------------------------+--------------------------------+---------------+------------+-------------+----------+---------+----------+---------+---------+--------+------------+----------+-----------+
only showing top 20 rows
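As the second schema above shows, the renaming in this answer yields names such as ItemType and OrderID rather than itemType and orderId. The conversion with .as[Sales] still succeeds, presumably because Spark resolves column names case-insensitively by default (the spark.sql.caseSensitive setting is false unless changed). If exact camelCase names are preferred, or a header has more than two words, a helper along the following lines could be used instead. This is only a sketch: HeaderNames and toCamelCase are hypothetical names introduced here, not part of Spark or of the original answer.

```scala
object HeaderNames {
  // Hypothetical helper: turns a space-separated header such as "Item Type"
  // into a camelCase identifier such as "itemType".
  def toCamelCase(header: String): String = {
    // Split on runs of whitespace, drop empty pieces, and lower-case every word.
    val words = header.trim.split("\\s+").filter(_.nonEmpty).map(_.toLowerCase)
    // Keep the first word lower-case and capitalize each following word.
    if (words.isEmpty) "" else words.head + words.tail.map(_.capitalize).mkString
  }

  def main(args: Array[String]): Unit = {
    val headers = Seq("Region", "Item Type", "Order ID", "Total Profit")
    headers.foreach(h => println(s"$h -> ${toCamelCase(h)}"))
  }
}
```

With the DataFrame from this answer, the helper would be applied as ds.toDF(ds.columns.map(HeaderNames.toCamelCase): _*) before calling .as[Sales].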
【Comments】:
Can you explain the sch1:_* in val ds2 = ds.toDF(sch1:_*)? @stack0114106
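For readers unfamiliar with the syntax being asked about: toDF is declared with a variable-length parameter (colNames: String*), and writing sch1: _* tells the Scala compiler to pass the elements of the collection as separate arguments. The sketch below shows the same mechanism without Spark. VarargsDemo and describe are hypothetical names used only for illustration, with describe standing in for a varargs method such as toDF.

```scala
object VarargsDemo {
  // Hypothetical stand-in for a varargs method like toDF(colNames: String*).
  def describe(colNames: String*): String = colNames.mkString(", ")

  def main(args: Array[String]): Unit = {
    val names = Array("region", "country", "ItemType")
    // Passing the arguments one by one.
    println(describe("region", "country", "ItemType"))
    // `names: _*` expands the array into individual arguments,
    // so this call is equivalent to the one above.
    println(describe(names: _*))
  }
}
```

Without the : _* annotation, describe(names) would not compile, because an Array[String] is not a String.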
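It is a varargs expansion. sch1 holds the reformatted column names from the CSV header, and this syntax passes them to toDF as individual arguments, replacing the original column names. It is concise and easy to use.
【Answer 2】: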
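The input file probably contains the header: Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, Total Profit
Edit the header either in the input file or in the case class so that the names match.
For example, the column is named Item Type (with a space), whereas the case class field is itemType (without a space).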
【Comments】: