Why does Spark job fail with "Exit code: 52"

Posted: 2016-02-17 09:04:04

My Spark job is failing with a trace like this:

./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-Container id: container_1455622885057_0016_01_000008
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-Exit code: 52
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr:Stack trace: ExitCodeException exitCode=52: 
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.util.Shell.run(Shell.java:456)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-      at java.lang.Thread.run(Thread.java:745)
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-
./containers/application_1455622885057_0016/container_1455622885057_0016_01_000001/stderr-Container exited with a non-zero exit code 52

It took me a while to figure out what "exit code 52" means, so I'm posting it here for the benefit of anyone else who might be searching for the same thing.

【Comments】:

OK, I hit the same error today. I checked my Spark configuration file and spark.memory.fraction was set to 0 (the default is 0.2), so I removed that line. Alternatively, try raising it to 0.8 (see the sketch below).

That shouldn't matter in Spark 1.6.0 - it should tune the memory fraction automatically.
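As a minimal sketch of what the first comment suggests (spark.memory.fraction is a real Spark setting; the app name and the 0.8 value here are illustrative, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: override the unified memory fraction when building the context.
// "exit-code-52-demo" and 0.8 are illustrative values only.
val conf = new SparkConf()
  .setAppName("exit-code-52-demo")
  .set("spark.memory.fraction", "0.8")
val sc = new SparkContext(conf)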

【Answer 1】:

Exit code 52 comes from org.apache.spark.util.SparkExitCode, where it is val OOM=52 - i.e. an OutOfMemoryError. That makes sense, since I also found this in the container logs:

16/02/16 17:09:59 ERROR executor.Executor: Managed memory leak detected; size = 4823704883 bytes, TID = 3226
16/02/16 17:09:59 ERROR executor.Executor: Exception in task 26.0 in stage 2.0 (TID 3226)
java.lang.OutOfMemoryError: Unable to acquire 1248 bytes of memory, got 0
        at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
        at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:354)
        at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:375)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

(Note that at this point I'm not sure whether the problem is in my own code or due to a Tungsten memory leak, but that's a separate question.)
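For context, the relevant constants in org.apache.spark.util.SparkExitCode look roughly like this (paraphrased from the Spark 1.6 source from memory; check your version for the exact definitions):

private[spark] object SparkExitCode {
  /** The default uncaught exception handler was reached. */
  val UNCAUGHT_EXCEPTION = 50

  /** The default uncaught exception handler was called and an exception was
    * encountered while logging the exception. */
  val UNCAUGHT_EXCEPTION_TWICE = 51

  /** The default uncaught exception handler was reached, and the uncaught
    * exception was an OutOfMemoryError. */
  val OOM = 52
}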

【Discussion】:

Use less memory, or a different (bigger) machine :)

Increasing the number of partitions worked for me. The default of 200 SQL shuffle partitions was too small for my dataset. See ***.com/questions/45428145/… for details (a sketch follows below).

:) Thanks
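A minimal sketch of the partition-count fix mentioned in that comment, assuming a Spark 1.6-style SQLContext (the input path and the value 2000 are hypothetical; size the count to your data volume):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("partition-tuning-demo"))
val sqlContext = new SQLContext(sc)

// Raise the SQL shuffle partition count above the default of 200 so each
// task handles a smaller slice of the data. 2000 is illustrative.
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

// For plain RDD shuffles, the partition count can be passed explicitly:
val counts = sc.textFile("hdfs:///path/to/input") // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _, 2000)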
