Spark SQL Timeout

Posted: 2014-12-05 00:41:59

Question:

I'm trying to run a relatively simple Spark SQL query on a Spark standalone cluster:

select a.name, b.name, s.score
from score s
inner join A a on a.id = s.a_id
inner join B b on b.id = s.b_id
where pmod(a.id, 3) != 3 and pmod(b.id, 3) != 0

The table sizes are as follows:

A: 25,000
B: 2,500,000
score: 25,000,000

So I expect the result to be 25,000,000 rows. I want to run this query with Spark SQL and then process each row. Here is the relevant Spark code:

val sqlContext = new HiveContext(sc) // Hive-backed SQL context over the existing SparkContext
val sql = "<above SQL>"
sqlContext.sql(sql).first // first triggers execution as a take(1), per the stack trace below

The command ran fine when the score table had 200,000 rows, but now it does not. Here are the relevant logs:

14/12/04 16:35:14 WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
14/12/04 16:35:43 WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
14/12/04 16:36:24 WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
14/12/04 16:37:11 WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
14/12/04 16:38:13 WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
14/12/04 16:39:19 WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
14/12/04 16:39:48 WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
14/12/04 16:40:08 WARN MemoryStore: Not enough space to store block broadcast_12 in memory! Free memory is 1938057068 bytes.
14/12/04 16:40:08 WARN MemoryStore: Persisting block broadcast_12 to disk instead.
java.util.concurrent.TimeoutException: Futures timed out after [5 minutes]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.sql.execution.BroadcastHashJoin.execute(joins.scala:431)
    at org.apache.spark.sql.execution.Project.execute(basicOperators.scala:42)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:111)
    at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
    at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:440)
    at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:103)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1092)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
    at $iwC$$iwC$$iwC.<init>(<console>:25)
    at $iwC$$iwC.<init>(<console>:27)
    at $iwC.<init>(<console>:29)
    at <init>(<console>:31)
    at .<init>(<console>:35)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
    at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

My first thought was to increase this timeout, but it looks like that isn't possible without recompiling the source, as shown here. In the parent directory I also saw a few different joins, but I'm not sure how to make Spark use one of the other join types.

I also tried to fix the first warning, about persisting to disk, by increasing spark.executor.memory to 10g, but that didn't solve the problem.
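
A minimal sketch of how that setting is typically applied, assuming Spark 1.x on a standalone cluster; note that spark.executor.memory only takes effect if set before the SparkContext is created (in the shell it is usually passed as --conf spark.executor.memory=10g on the command line instead). The app name and master URL below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory must be set on the SparkConf (or on the command
// line) before the context exists; setting it afterwards has no effect
val conf = new SparkConf()
  .setAppName("score-join")         // hypothetical app name
  .setMaster("spark://master:7077") // placeholder standalone master URL
  .set("spark.executor.memory", "10g")
val sc = new SparkContext(conf)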

Does anyone know how I can actually get this query to run?

Comments:

Seeing a similar problem - did you ever find a solution for this?

Answer 1:

Perhaps you are hitting the timeout of the broadcast join. For some reason it is an undocumented configuration option, called spark.sql.broadcastTimeout, which defaults to 300 seconds (the 5 minutes in your stack trace).

So you can try increasing it (that worked for us), or make Spark avoid the broadcast join (even though broadcasting is the recommended approach when joining a small table to a large one, see https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL%20%26%20DataFrames/05%20BroadcastHashJoin%20-%20scala.html).
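
A minimal sketch of both options, assuming Spark 1.x and a HiveContext as in the question (the 1200-second value is only an illustration, not a recommendation):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Option 1: raise the broadcast join timeout (value is in seconds;
// spark.sql.broadcastTimeout defaults to 300)
sqlContext.setConf("spark.sql.broadcastTimeout", "1200")

// Option 2: disable broadcast joins entirely; setting the auto-broadcast
// size threshold to -1 makes Spark fall back to a shuffle join instead
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")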

Comments:
