Apache Spark: Hangs on Broadcast

Posted: 2017-04-01 03:16:56

Question:

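I am having a hard time debugging my Spark 1.6.2 application on YARN. It runs in client mode. Essentially, it locks up without crashing, and when it locks up the console log looks like this: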

    17/03/31 20:12:02 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh007.prod.phx3.gdg:47579 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:03 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on p3plcdsh011.prod.phx3.gdg:63228 (size: 5.4 KB, free: 511.1 MB)
    17/03/31 20:12:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on p3plcdsh015.prod.phx3.gdg:9377 (size: 5.4 KB, free: 511.1 MB)
    17/03/31 20:12:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on p3plcdsh015.prod.phx3.gdg:61897 (size: 5.4 KB, free: 511.1 MB)
    17/03/31 20:12:03 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh002.prod.phx3.gdg:23170 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:03 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on p3plcdsh016.prod.phx3.gdg:16649 (size: 5.4 KB, free: 511.1 MB)
    17/03/31 20:12:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh003.prod.phx3.gdg:55147 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:04 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on p3plcdsh008.prod.phx3.gdg:7619 (size: 5.4 KB, free: 511.1 MB)
    17/03/31 20:12:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh003.prod.phx3.gdg:40830 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh011.prod.phx3.gdg:20056 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh008.prod.phx3.gdg:47385 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh003.prod.phx3.gdg:2063 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh011.prod.phx3.gdg:63228 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:04 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh008.prod.phx3.gdg:64036 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh016.prod.phx3.gdg:16649 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh013.prod.phx3.gdg:31979 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh013.prod.phx3.gdg:18407 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh004.prod.phx3.gdg:45536 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh008.prod.phx3.gdg:50826 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:06 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh015.prod.phx3.gdg:36247 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:06 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh015.prod.phx3.gdg:22848 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:06 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh015.prod.phx3.gdg:9377 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:06 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh015.prod.phx3.gdg:61897 (size: 26.7 KB, free: 511.1 MB)
    17/03/31 20:12:07 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on p3plcdsh008.prod.phx3.gdg:7619 (size: 26.7 KB, free: 511.1 MB)

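In the Spark UI, the lock-up occurs at a map or filter function.

Has anyone seen this before, or does anyone know how to debug it?

It looks like it might be a memory or disk space problem, but there is no clear indication. I can try increasing memory to see if that helps, but does anyone have debugging tips?

Thanks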

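Comments:

What are you broadcasting?

It looks like a fairly large Java object (something backed by a 300 MB uncompressed file)... but it does serialize, otherwise I would see a crash about serialization issues, @Vidya. Is there a limit on the size of an object that can be serialized, or a way to increase the maximum object size?

Seeing the same issue... the broadcast object is small in my case.

Answer 1: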

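Simply being serializable is not enough. There could be many problems: your serialization mechanism (Java serialization is slow and bulky; Kryo is better), the memory available on your machines, whether your closures read the data through the broadcast variable rather than capturing the original wrapped object, and so on.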
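The two code-level points above (Kryo, and reading through the broadcast variable) can be sketched as follows. This is a minimal sketch, not the asker's code: the `lookup` map is a hypothetical stand-in for the large object in the question, and `local[2]` is assumed only so the example runs on one machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def run(): Array[(String, Int)] = {
    val conf = new SparkConf()
      .setAppName("broadcast-sketch")
      .setMaster("local[2]") // assumption: local mode, for illustration only
      // Use Kryo instead of the default Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for the large lookup object
      val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)
      val lookupBc = sc.broadcast(lookup)

      val keys = sc.parallelize(Seq("a", "b", "c"))
      // Read through lookupBc.value inside the closure. Referencing `lookup`
      // directly would ship the whole map with every task instead.
      keys.map(k => k -> lookupBc.value.getOrElse(k, -1)).collect()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

On a real cluster the master would come from `spark-submit` rather than `setMaster`.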

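There is also the Spark configuration spark.sql.autoBroadcastJoinThreshold:

"Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run."

This defaults to 10 MB, serialized.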
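As a rough illustration, the threshold can be changed at runtime. The sketch below assumes the Spark 1.6-style `SQLContext` used in the question (on Spark 2 and later, `spark.conf.set` on a `SparkSession` plays the same role). Note that this setting governs automatic broadcast joins in Spark SQL; it does not limit explicit `sc.broadcast` calls. The object and method names are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BroadcastThresholdSketch {
  val Key = "spark.sql.autoBroadcastJoinThreshold"

  // Disables automatic broadcast joins and returns the value now in effect.
  def disableAutoBroadcast(sqlContext: SQLContext): String = {
    sqlContext.setConf(Key, "-1")
    sqlContext.getConf(Key)
  }

  def run(): String = {
    // assumption: local mode, for illustration only
    val conf = new SparkConf().setAppName("threshold-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      disableAutoBroadcast(new SQLContext(sc))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

To raise the limit instead of disabling it, pass a byte count as the value, for example `(50L * 1024 * 1024).toString` for 50 MB.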

If you raise this default limit and have memory to spare, you still want the broadcast to be smaller than your largest RDD/DataFrame, which you can check with SizeEstimator:

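import org.apache.spark.util.SizeEstimator

// estimate returns a Long (bytes), while logInfo expects a String message
logInfo(s"Estimated size: ${SizeEstimator.estimate(rdd)} bytes")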
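It may be more informative to run the same check on the object you intend to broadcast, before calling `sc.broadcast`. As far as I understand, `SizeEstimator.estimate` walks the in-memory object graph it is given, so calling it on an RDD handle measures the driver-side RDD object rather than the distributed data. A minimal sketch, where the `lookup` map and the `SizeCheck` object are hypothetical:

```scala
import org.apache.spark.util.SizeEstimator

object SizeCheck {
  // Approximate in-memory (JVM heap) size of an object, in bytes.
  def estimatedBytes(obj: AnyRef): Long = SizeEstimator.estimate(obj)

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the object that will be broadcast
    val lookup: Map[String, Int] = (1 to 100000).map(i => s"key-$i" -> i).toMap
    val megabytes = estimatedBytes(lookup) / (1024.0 * 1024.0)
    println(f"Estimated in-memory size: $megabytes%.1f MB")
  }
}
```

The result is a heap estimate, not the serialized size, so treat it as a rough guide when comparing against executor memory.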

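Finally, if worse comes to worst, I would consider doing lookups against a very fast cached data store from within your transformations instead of broadcasting this file.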
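One way such a lookup could be structured is to open one client per partition inside `mapPartitions`. The sketch below is an assumption about the shape of that code, not something from the answer: `LookupClient` and `InMemoryClient` are hypothetical, and a real client would wrap whatever cache you actually use.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical client interface for an external key-value cache
trait LookupClient extends AutoCloseable {
  def get(key: String): Option[Int]
}

// Stub implementation so the sketch runs without an external service
class InMemoryClient extends LookupClient {
  private val data = Map("a" -> 1, "b" -> 2)
  def get(key: String): Option[Int] = data.get(key)
  def close(): Unit = ()
}

object ExternalLookupSketch {
  def run(): Array[(String, Option[Int])] = {
    // assumption: local mode, for illustration only
    val conf = new SparkConf().setAppName("lookup-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val keys = sc.parallelize(Seq("a", "b", "c"))
      keys.mapPartitions { iter =>
        // One client per partition, created on the executor, so nothing
        // large is serialized from the driver.
        val client = new InMemoryClient
        try {
          // Materialize the results before the client is closed
          iter.map(k => k -> client.get(k)).toList.iterator
        } finally {
          client.close()
        }
      }.collect()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

Materializing each partition with `toList` keeps the example simple; for very large partitions a lazier approach that closes the client once the iterator is exhausted would use less memory.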
