Executor Heartbeat Timed Out: Error in Spark Job


Posted: 2020-05-06 23:00:47

Question:

I am getting an error in a Spark job written in Python. The error says "Executor heartbeat timed out". The error log is attached below:

Py4JJavaError: An error occurred while calling o152.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:549)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Here is the root-cause error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 9.0 failed 4 times, most recent failure: Lost task 38.3 in stage 9.0 (TID 532, alp-pos-005.unix.cosng.net, executor 24): ExecutorLostFailure (executor 24 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 154863 ms

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
    ... 31 more

After setting spark.network.timeout to 10000000, the following error appears:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 59 in stage 19.0 failed 4 times, most recent failure: Lost task 59.3 in stage 19.0 (TID 3614, alp-pos-004.unix.cosng.net, executor 29): ExecutorLostFailure (executor 29 exited caused by one of the running tasks) Reason: Container marked as failed: container_e150_1579619385046_0042_01_000038 on host: alp-pos-004.unix.cosng.net. Exit status: 143. Diagnostics: [2020-01-23 12:42:08.482]Container killed on request. Exit code is 143
[2020-01-23 12:42:08.482]Container exited with a non-zero exit code 143. 
[2020-01-23 12:42:08.482]Killed by external signal

Any help at this point would be highly appreciated.

Thanks


Answer 1:

The problem usually associated with this situation is memory, but a quick way to work around it is to increase spark.network.timeout. This helps, but it is not a long-term solution.

So try this:

spark-submit --conf spark.network.timeout=10000000 python_script.py
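If you launch pyspark from a notebook such as Zeppelin instead of spark-submit, the same setting can also be applied when the session is built. A minimal sketch, assuming PySpark; the app name and the timeout and interval values are illustrative, not recommendations:

from pyspark.sql import SparkSession

# Illustrative values only; tune them to your workload.
spark = (
    SparkSession.builder
    .appName("heartbeat-timeout-example")  # hypothetical app name
    # Give slow executors more time before they are declared dead.
    .config("spark.network.timeout", "800s")
    # The heartbeat interval must stay well below spark.network.timeout.
    .config("spark.executor.heartbeatInterval", "60s")
    .getOrCreate()
)

Note that these properties are read when the SparkContext is created, so they have no effect on an already running session.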

Discussion:

I am using pyspark on Zeppelin; is it OK if I add this property in the Zeppelin properties file?

Yes, you can use it to set the default configuration for the Spark timeout. It is not recommended to do this for every job, because it can hide future problems you may face during Spark execution.

Ran into another error, please check above: [2020-01-23 12:42:08.482]Container exited with a non-zero exit code 143. [2020-01-23 12:42:08.482]Killed by external signal
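A note on the exit code 143 error reported in the discussion: 143 corresponds to SIGTERM, and on YARN it usually means the resource manager killed the container, most often for exceeding its memory limit. That points back to the memory problem mentioned above, so the longer-term fix is on the memory side. A minimal sketch, assuming YARN and Spark 2.3+; the sizes are hypothetical and should be derived from your container limits:

from pyspark.sql import SparkSession

# Hypothetical sizes for illustration; align them with your YARN container limits.
spark = (
    SparkSession.builder
    .appName("memory-tuning-example")  # hypothetical app name
    # Heap available to each executor.
    .config("spark.executor.memory", "8g")
    # Off-heap headroom; YARN kills the container (SIGTERM, exit code 143)
    # when executor memory plus overhead exceeds the container limit.
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate()
)

If executors still die, repartitioning the data into more, smaller tasks is often a more reliable way to reduce per-executor memory pressure than raising the limits.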
