Spark keeps submitting tasks to dead executor

Posted: 2020-09-30 07:43:57

Question:

I am working with Apache Spark and I am facing a very strange issue. One of the executors failed with an OOM, and its shutdown hook cleared all of its storage (memory and disk), but the driver apparently kept submitting the failed tasks to that same executor because they were PROCESS_LOCAL tasks.

Since the storage on that machine had been cleared, all the retried tasks failed as well, which caused the whole stage to fail (after 4 retries).

What I don't understand is how the driver does not know that the executor is shutting down and cannot execute any tasks.
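For background (my own reading of Spark's scheduling, not something stated in the original post): PROCESS_LOCAL means the task's input, here cached RDD blocks, is registered as living in one specific executor, so the scheduler keeps preferring that executor until the locality wait expires. A minimal sketch, assuming the standard spark.locality.wait properties, of how that stickiness could be loosened; the values are illustrative only:

```scala
import org.apache.spark.SparkConf

// Sketch: shorten the locality fallback so retried tasks are not held
// for the executor that registered the (now-lost) cached blocks.
val conf = new SparkConf()
  .set("spark.locality.wait", "1s")         // global locality fallback wait (default 3s)
  .set("spark.locality.wait.process", "0s") // give up the PROCESS_LOCAL preference immediately
```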

Configuration:

Heartbeat interval: 60s

Network timeout: 600s
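A minimal sketch of how this configuration might be expressed, assuming these two settings map to the standard spark.executor.heartbeatInterval and spark.network.timeout properties (an assumption on my part; the post only gives the values):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "60s") // executor-to-driver heartbeat
  .set("spark.network.timeout", "600s")           // default timeout for all network interactions

val spark = SparkSession.builder().config(conf).getOrCreate()
```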

Logs confirming that the executor was still accepting tasks after shutdown:

20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] Executor: Exception in task 6391.0 in stage 17.0 (TID 138513)
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 138513,5,main]
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 INFO [pool-8-thread-1] DiskBlockManager: Shutdown hook called
20/09/29 20:26:35 ERROR [Executor task launch worker for task 138295] Executor: Exception in task 6239.0 in stage 17.0 (TID 138295)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/35/temp_shuffle_b5df90ac-78de-48e3-9c2d-891f8b2ce1fa (No such file or directory)
20/09/29 20:26:36 ERROR [Executor task launch worker for task 139484] Executor: Exception in task 6587.0 in stage 17.0 (TID 139484)
org.apache.spark.SparkException: Block rdd_3861_6587 was not found even though it's read-locked
20/09/29 20:26:42 WARN [Thread-2] ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_172]
20/09/29 20:26:44 ERROR [Executor task launch worker for task 140256] Executor: Exception in task 6576.3 in stage 17.0 (TID 140256)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/30/rdd_3861_6576 (No such file or directory)
20/09/29 20:26:44 INFO [dispatcher-event-loop-0] Executor: Executor is trying to kill task 6866.1 in stage 17.0 (TID 140329), reason: stage cancelled
20/09/29 20:26:47 INFO [pool-8-thread-1] ShutdownHookManager: Shutdown hook called
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: removing client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: Stopping client
20/09/29 20:26:47 DEBUG [Thread-2] ShutdownHookManager: ShutdownHookManger complete shutdown
20/09/29 20:26:55 INFO [dispatcher-event-loop-14] CoarseGrainedExecutorBackend: Got assigned task 141510
20/09/29 20:26:55 INFO [Executor task launch worker for task 141510] Executor: Running task 545.1 in stage 26.0 (TID 141510)

(I have trimmed the stack traces, since they are just Spark RDD shuffle read methods.)

If we check the timestamps, the shutdown starts at 20/09/29 20:26:32 and ends at 20/09/29 20:26:47. During this window the driver sent all the retried tasks to the same executor, and they all failed, causing the stage to be cancelled.

Can someone help me understand this behavior? Please let me know if any other details are needed.

Answer 1:

Spark has a configuration, spark.task.maxFailures, which defaults to 4, so Spark retries a task when it fails. The TaskRunner reports the task's status back to the driver, and the driver forwards that status to TaskSchedulerImpl. In your case only the executor hit an OOM and its DiskBlockManager was shut down; the driver was still alive. I believe the executor process was also still alive (your log shows CoarseGrainedExecutorBackend still getting assigned tasks), so the same TaskSetManager kept retrying the task on it. Once the failure count for that task reached 4, the stage was cancelled and the executor was killed.
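To illustrate the knobs involved (a hedged sketch, not a fix verified against this job): raising spark.task.maxFailures buys more retries, and the blacklist mechanism (Spark 2.x/3.0 property names; renamed to spark.excludeOnFailure.* in Spark 3.1) stops the scheduler from retrying on the same failing executor:

```scala
import org.apache.spark.SparkConf

// Sketch only; values are illustrative.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")     // default 4; the stage is aborted after this many failures of one task
  .set("spark.blacklist.enabled", "true") // exclude executors/nodes that keep failing tasks
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1") // retry a failed task on a different executor
```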
