ArrayIndexOutOfBoundsException while writing data into Hive with Spark SQL

Posted: 2018-03-30 07:11:19

I am trying to parse a text file and write it into a Hive table. During the insert I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 4, 127.0.0.1, executor 0): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
    at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    ... 8 more

Here is my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object maintenance {
  case class event(Entity_Status_Code: String, Entity_Status_Description: String, Status: String,
                   Event_Date: String, Event_Date2: String, Event_Date3: String, Event_Description: String)

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("maintenance").setMaster("local")
    conf.set("spark.debug.maxToStringFields", "10000000")
    val context = new SparkContext(conf)
    val sqlContext = new SQLContext(context)
    val hiveContext = new HiveContext(context)
    sqlContext.clearCache()
    //hiveContext.clearCache()
    //sqlContext.clearCache()

    import hiveContext.implicits._
    // Split each line on a single space and map the fields into the case class.
    val rdd = context.textFile("file:///Users/hadoop/Downloads/sample.txt")
      .map(line => line.split(" "))
      .map(x => event(x(0), x(1), x(2), x(3), x(4), x(5), x(6)))

    val personDF = rdd.toDF()
    personDF.show(10)
    personDF.registerTempTable("Maintenance")
    hiveContext.sql("insert into table default.maintenance select Entity_Status_Code,Entity_Status_Description,Status,Event_Date,Event_Date2,Event_Date3,Event_Description from Maintenance")
  }
}

When I comment out all the hiveContext-related lines and run it locally (I mean just personDF.show()), it works fine. But when I run it with spark-submit and the hiveContext enabled, the error above appears.
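
One way to narrow this down is to count how many fields each line actually produces after the split; a minimal diagnostic sketch (the path and separator are taken from the code above, everything else is illustrative):

// Print any lines that split into fewer than 7 fields --
// these are the rows that make x(6) throw.
context.textFile("file:///Users/hadoop/Downloads/sample.txt")
  .map(line => (line.split(" ").length, line))
  .filter { case (n, _) => n < 7 }
  .take(10)
  .foreach { case (n, line) => println(s"$n fields: '$line'") }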

Here is my sample data:

4287053 06218896 N 19801222 19810901 19881222 M171 
4287053 06218896 N 19801222 19810901 19850211 M170 
4289713 06222552 Y 19810105 19810915 19930330 SM02 
4289713 06222552 Y 19810105 19810915 19930303 M285 
4289713 06222552 Y 19810105 19810915 19921208 RMPN 
4289713 06222552 Y 19810105 19810915 19921208 ASPN 
4289713 06222552 Y 19810105 19810915 19881116 ASPN 
4289713 06222552 Y 19810105 19810915 19881107 M171

Comments:

This is clearly a developer issue: "Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)"

Answer 1:

Add -1 to the split (on the line where you compute val rdd = ...); that should solve your problem: line.split(" ", -1)

Without the -1 limit, split omits trailing empty fields, so a row with empty fields yields an array shorter than seven elements, and indexing x(6) throws the ArrayIndexOutOfBoundsException.
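
A minimal sketch of the difference, using a hypothetical row whose trailing fields are empty:

// String.split with the default limit drops trailing empty strings;
// a negative limit keeps them.
val row = "4287053 06218896 N 19801222   "   // three trailing blanks
row.split(" ").length      // 4 -- trailing empties dropped, x(6) would throw
row.split(" ", -1).length  // 7 -- all fields kept, empties included

With line.split(" ", -1), every row yields exactly seven elements, so x(0) through x(6) are always valid.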

