Spark table joins - resource allocation problem


I am stuck on an issue while using a Spark cluster (deployed on YARN) to process Hive tables. I have 7 tables that need to be joined, after which I replace some null values and write the result back to a final Hive table.

I use Spark SQL (Scala): I first create seven dataframes, one per table, then chain six left joins and write the result back to a Hive table.

After about five minutes my code throws the errors below; I believe this is because resource allocation is not set up correctly. (Exit code 143 is 128 + SIGTERM, i.e. the container was killed, typically by YARN.)

19/10/13 06:46:53 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from /100.66.0.1:36467 is closed
19/10/13 06:46:53 ERROR cluster.YarnScheduler: Lost executor 401 on aaaa-bd10.pq.internal.myfove.com: Container container_e33_1570683426425_4555_01_000414 exited from explicit termination request.
19/10/13 06:47:02 ERROR cluster.YarnScheduler: Lost executor 391 on aaaa-bd10.pq.internal.myfove.com: Container marked as failed: container_e33_1570683426425_4555_01_000403 on host: aaaa-bd10.pq.internal.myfove.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

My hardware specifications:

HostName    Memory (GB)    CPUs    Memory for YARN (GB)    CPUs for YARN
Node 1      126            32      90                      26
Node 2      126            32      90                      26
Node 3      126            32      90                      26
Node 4      126            32      90                      26

How should I set the variables below so that my code no longer throws the error ("Container marked as failed ... Exit code is 143")?

I have been trying different configurations, but nothing has helped.

val spark = (SparkSession.builder
             .appName("Final Table")
             .config("spark.driver.memory", "5g") 
             .config("spark.executor.memory", "15g") 
             .config("spark.dynamicAllocation.maxExecutors","6")
             .config("spark.executor.cores", "5")
             .enableHiveSupport()
             .getOrCreate())
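
For context, this cluster gives YARN 4 × 90 GB = 360 GB of memory and 4 × 26 = 104 cores in total. A common sizing heuristic is ~5 cores per executor and leaving headroom for spark.executor.memoryOverhead, which YARN charges on top of spark.executor.memory (on Spark < 2.3 the property is spark.yarn.executor.memoryOverhead). A minimal sketch of such a configuration follows; the concrete numbers are illustrative assumptions derived from this arithmetic, not values from the original post:

import org.apache.spark.sql.SparkSession

// Illustrative sizing for 4 nodes x (90 GB, 26 cores) under YARN:
// 5 cores/executor -> 5 executors/node -> ~20 executors cluster-wide;
// 90 GB / 5 executors = 18 GB per container, split between heap and overhead.
val spark = SparkSession.builder
  .appName("Final Table")
  .config("spark.driver.memory", "5g")
  .config("spark.executor.memory", "15g")          // executor heap
  .config("spark.executor.memoryOverhead", "2g")   // off-heap cushion YARN enforces
  .config("spark.executor.cores", "5")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.sql.shuffle.partitions", "200")   // default; raise if shuffles spill
  .enableHiveSupport()
  .getOrCreate()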



val df1 = spark.sql("select * from table_1") // 1.4 million records, 10 vars
val df2 = spark.sql("select * from table_2") // 1.4 million records, 3000 vars
val df3 = spark.sql("select * from table_3") // 1.4 million records, 300 vars
val df4 = spark.sql("select * from table_4") // 1.4 million records, 600 vars
val df5 = spark.sql("select * from table_5") // 1.4 million records, 150 vars
val df6 = spark.sql("select * from table_6") // 1.4 million records, 2 vars
val df7 = spark.sql("select * from table_7") // 1.4 million records, 12 vars


val joinDF1 = df1.join(df2, df1("number") === df2("number"), "left_outer").drop(df2("number")) 
val joinDF2 = joinDF1.join(df3,joinDF1("number") === df3("number"), "left_outer").drop(df3("number")) 
val joinDF3 = joinDF2.join(df4,joinDF2("number") === df4("number"), "left_outer").drop(df4("number")) 
val joinDF4 = joinDF3.join(df5,joinDF3("number") === df5("number"), "left_outer").drop(df5("number")) 
val joinDF5 = joinDF4.join(df6,joinDF4("number") === df6("number"), "left_outer").drop(df6("number")).drop("Dt") 
val joinDF6 = joinDF5.join(df7,joinDF5("number") === df7("number"), "left_outer").drop(df7("number")).drop("Dt") 
joinDF6.createOrReplaceTempView("joinDF6")

spark.sql("create table hive table as select * from joinDF6")
Answer

If you are using Ambari, check yarn.nodemanager.log-dirs in Ambari. If not, locate this property wherever your YARN configuration lives, and if it points to a directory with very little free space, change it to a directory with more space.

While tasks are running, the blocks created by the containers are stored under the yarn.nodemanager.log-dirs location; if there is not enough space there to hold these blocks, containers start to fail.
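
If you want to confirm what the property resolves to without hunting through config files, you can read it from the Hadoop configuration Spark loads, assuming yarn-site.xml is on the driver's classpath (normally true for a spark-shell started on a cluster node):

// Prints the configured NodeManager log directories, or null if
// yarn-site.xml is not on the classpath.
println(spark.sparkContext.hadoopConfiguration.get("yarn.nodemanager.log-dirs"))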
