ArrayIndexOutOfBoundsException exception while writing data into Hive Spark SQL
Posted: 2018-03-30 07:11:19

[Question]: I am trying to process text and write it into a Hive table. The following error occurs during the insert:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 4, 127.0.0.1, executor 0): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
... 8 more
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object maintenance {

  case class event(Entity_Status_Code: String, Entity_Status_Description: String, Status: String,
                   Event_Date: String, Event_Date2: String, Event_Date3: String, Event_Description: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("maintenance").setMaster("local")
    conf.set("spark.debug.maxToStringFields", "10000000")
    val context = new SparkContext(conf)
    val sqlContext = new SQLContext(context)
    val hiveContext = new HiveContext(context)
    sqlContext.clearCache()
    //hiveContext.clearCache()
    //sqlContext.clearCache()
    import hiveContext.implicits._
    val rdd = context.textFile("file:///Users/hadoop/Downloads/sample.txt")
      .map(line => line.split(" "))
      .map(x => event(x(0), x(1), x(2), x(3), x(4), x(5), x(6)))
    val personDF = rdd.toDF()
    personDF.show(10)
    personDF.registerTempTable("Maintenance")
    hiveContext.sql("insert into table default.maintenance select Entity_Status_Code,Entity_Status_Description,Status,Event_Date,Event_Date2,Event_Date3,Event_Description from Maintenance")
  }
}
When I comment out all the hiveContext-related lines and run locally (that is, just personDF.show()), it works fine. But when I run it with spark-submit and hiveContext enabled, the error above appears.
Here is my sample data:
4287053 06218896 N 19801222 19810901 19881222 M171
4287053 06218896 N 19801222 19810901 19850211 M170
4289713 06222552 Y 19810105 19810915 19930330 SM02
4289713 06222552 Y 19810105 19810915 19930303 M285
4289713 06222552 Y 19810105 19810915 19921208 RMPN
4289713 06222552 Y 19810105 19810915 19921208 ASPN
4289713 06222552 Y 19810105 19810915 19881116 ASPN
4289713 06222552 Y 19810105 19810915 19881107 M171
[Comments]:

This is clearly a developer problem: "Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)"

[Answer 1]: Add -1 to the split (on the line where you compute val rdd = ...); that should solve your problem:

line.split(" ", -1)
Without the -1 limit, split drops trailing empty fields, which is what leads to the ArrayIndexOutOfBoundsException.
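To see why the limit matters, here is a quick sketch of Java/Scala String.split behavior (the input strings are hypothetical, not taken from the poster's file):

// Without a limit, split silently drops trailing empty strings:
"a b  c ".split(" ")        // Array(a, b, "", c)     -> trailing empty field lost
// With limit -1, trailing empty fields are preserved:
"a b  c ".split(" ", -1)    // Array(a, b, "", c, "") -> trailing empty field kept

So a data line whose last field is empty produces fewer than 7 tokens with split(" "), and x(6) then throws.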
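Note that the limit alone does not guard against completely blank lines: "".split(" ", -1) still returns a one-element array (which would match the index 1 in the trace). A defensive variant of the parsing step (a sketch, not part of the original answer) skips any row that does not have exactly 7 fields:

val rdd = context.textFile("file:///Users/hadoop/Downloads/sample.txt")
  .map(line => line.split(" ", -1))
  .filter(x => x.length == 7)   // drop blank or malformed lines instead of crashing
  .map(x => event(x(0), x(1), x(2), x(3), x(4), x(5), x(6)))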