具有不同窗口规范的链式火花列表达式产生低效的 DAG

Posted 2023-02-23

技术标签:

【中文标题】具有不同窗口规范的链式火花列表达式产生低效的 DAG【英文标题】：Chained spark column expressions with distinct windows specs produce inefficient DAG 【发布时间】：2020-05-04 10:17:57 【问题描述】：

上下文

假设您处理时间序列数据。您想要的结果依赖于具有不同窗口规格的多个窗口函数。结果可能类似于单个 spark 列表达式，例如间隔标识符。

现状

通常，我不使用 df.withColumn 存储中间结果，而是使用链/堆栈列表达式并相信 Spark 可以找到最有效的 DAG（在处理 DataFrame 时）。

可重现的例子

但是，在以下示例中（独立的 PySpark 2.4.4），使用 df.withColumn 存储中间结果会降低 DAG 复杂性。让我们考虑以下测试设置：

import pandas as pd
import numpy as np

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dfp = pd.DataFrame(
    
        "col1": np.random.randint(0, 5, size=100),
        "col2": np.random.randint(0, 5, size=100),
        "col3": np.random.randint(0, 5, size=100),
        "col4": np.random.randint(0, 5, size=100),        
    
)

df = spark.createDataFrame(dfp)
df.show(5)

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   2|   4|   1|
|   0|   2|   3|   0|
|   2|   0|   1|   0|
|   4|   1|   1|   2|
|   1|   3|   0|   4|
+----+----+----+----+
only showing top 5 rows

计算是任意的。基本上我们有 2 个窗口规格和 3 个计算步骤。这 3 个计算步骤相互依赖并使用交替的窗口规格：

w1 = Window.partitionBy("col1").orderBy("col2")
w2 = Window.partitionBy("col3").orderBy("col4")

# first step, arbitrary window func over 1st window
step1 = F.lag("col3").over(w1)

# second step, arbitrary window func over 2nd window with step 1
step2 = F.lag(step1).over(w2)

# third step, arbitrary window func over 1st window with step 2
step3 = F.when(step2 > 1, F.max(step2).over(w1))

df_result = df.withColumn("result", step3)

通过df_result.explain()查看实物计划会发现4个交换和排序！但是，这里只需要 3 个，因为我们只更改了两次窗口规格。

df_result.explain()

== Physical Plan ==
*(7) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (_we0#25L > 1) THEN _we1#26L END AS result#22L]
+- Window [lag(_w0#23L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#25L], [col3#2L], [col4#3L ASC NULLS FIRST]
   +- *(6) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col3#2L, 200)
         +- *(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _we1#26L]
            +- Window [max(_w1#24L) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS _we1#26L], [col1#0L], [col2#1L ASC NULLS FIRST]
               +- *(4) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(col1#0L, 200)
                     +- *(3) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _w1#24L]
                        +- Window [lag(_w0#27L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w1#24L], [col3#2L], [col4#3L ASC NULLS FIRST]
                           +- *(2) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
                              +- Exchange hashpartitioning(col3#2L, 200)
                                 +- Window [lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#27L, lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#23L], [col1#0L], [col2#1L ASC NULLS FIRST]
                                    +- *(1) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
                                       +- Exchange hashpartitioning(col1#0L, 200)
                                          +- Scan ExistingRDD[col1#0L,col2#1L,col3#2L,col4#3L]

改进

为了得到更好的 DAG，我们稍微修改了代码，将 step2 的列表达式存储为 withColumn 并传递此列的引用。新的逻辑计划确实只需要 3 次洗牌！

w1 = Window.partitionBy("col1").orderBy("col2")
w2 = Window.partitionBy("col3").orderBy("col4")

# first step, arbitrary window func
step1 = F.lag("col3").over(w1)

# second step, arbitrary window func over 2nd window with step 1
step2 = F.lag(step1).over(w2)

# save temporary
df = df.withColumn("tmp_variable", step2)
step2 = F.col("tmp_variable")

# third step, arbitrary window func over 1st window with step 2
step3 = F.when(step2 > 1, F.max(step2).over(w1))

df_result = df.withColumn("result", step3).drop("tmp_variable")
df_result.explain()

== Physical Plan ==
*(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (tmp_variable#33L > 1) THEN _we0#42L END AS result#41L]
+- Window [max(tmp_variable#33L) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS _we0#42L], [col1#0L], [col2#1L ASC NULLS FIRST]
   +- *(4) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col1#0L, 200)
         +- *(3) Project [col1#0L, col2#1L, col3#2L, col4#3L, tmp_variable#33L]
            +- Window [lag(_w0#34L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS tmp_variable#33L], [col3#2L], [col4#3L ASC NULLS FIRST]
               +- *(2) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(col3#2L, 200)
                     +- Window [lag(col3#2L, 1, null) windowspecdefinition(col1#0L, col2#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _w0#34L], [col1#0L], [col2#1L ASC NULLS FIRST]
                        +- *(1) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], false, 0
                           +- Exchange hashpartitioning(col1#0L, 200)
                              +- Scan ExistingRDD[col1#0L,col2#1L,col3#2L,col4#3L]

问题

有人对这种奇怪的行为有答案吗？我认为堆叠/链接列表达式是最佳实践，因为它允许 Spark 最有效地优化中间步骤（与为中间结果创建引用相反）。

【问题讨论】：

“我认为堆叠/链接列表达式是最佳实践，因为它允许 Spark 最有效地优化中间步骤” - 这不是真的。 withColumn 相当于子查询 - 在某些情况下（非分析查询）它没有区别。然而，分析查询是另一回事。在许多情况下，子查询是强制性的。 Spark 应该能够对此进行优化吗？可能......为什么结果不同？因为局部表达式和跨逻辑节点（这里是子查询）的优化是不一样的。 @10465355saysReinstateMonica 感谢您的回复。你有关于这个主题的进一步文档/解释的提示吗？我发现很难在网上找到合适的最新资源。我认为它实际上并没有记录在案，不包括 SQL 标准（Spark 试图遵循的标准）之类的东西。最好的办法是检查源代码，因为它是内部组件，它可以从一个版本到另一个版本。 【参考方案1】：

如果我们查看 Analyzed Logical Plan (by=df_result.explain(True)) 我们可以看到，虽然我们没有 tmp_variable，但由于 **lazy evaluation** 上的数据集/数据帧/表创建逻辑计划的方式，Analyzer 假设该列存在（惰性）对该列执行分析。由于这个假设，现在它需要比以前的情况少建造 2 个腋窗才能达到相同的结果。实际上，通过遵循Parsed Logical Plan，我们看到分析器在创建tmp_variable 时需要构建较少未评估的窗口tmp_variable，而不是在它的下推方式上构建窗口，它主要执行简单的项目（选择）。

【讨论】：

以上是关于具有不同窗口规范的链式火花列表达式产生低效的 DAG的主要内容，如果未能解决你的问题，请参考以下文章

具有不同窗口规范的链式火花列表达式产生低效的 DAG

上下文

现状

可重现的例子

改进

相关性

问题