How Spark SQL turns Grouping Sets into Expand (at the logical plan level)
Posted by 鸿乃江边鸟
Background
This article is based on Spark 3.1.2. While debugging a bug I ran into the Expand operator, so I am writing the analysis down here.
Analysis
Run the following SQL:
create table test_a_pt(col1 int, col2 int,pt string) USING parquet PARTITIONED BY (pt);
insert into table test_a_pt values(1,2,'20220101'),(3,4,'20220101'),(1,2,'20220101'),(3,4,'20220101'),(1,2,'20220101'),(3,4,'20220101');
select count(*),col1 as alias
from test_a_pt
group by col1,col2
grouping sets (col1,col2)
order by col1,col2 ;
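Before looking at the plan, it helps to recall the semantics: grouping sets (col1, col2) is equivalent to the union of a "group by col1" and a "group by col2", each with the other column nulled out. A minimal Python sketch of that semantics on the sample rows above (illustrative code, not a Spark API):

```python
from collections import Counter

# the six rows inserted above, as (col1, col2) pairs
rows = [(1, 2), (3, 4)] * 3

# grouping sets (col1, col2) == group by col1  UNION ALL  group by col2
by_col1 = Counter(c1 for c1, _ in rows)   # counts per col1; col2 is null
by_col2 = Counter(c2 for _, c2 in rows)   # counts per col2; col1 is null

result = [(c1, None, n) for c1, n in by_col1.items()] + \
         [(None, c2, n) for c2, n in by_col2.items()]
# four result rows: (1, None, 3), (3, None, 3), (None, 2, 3), (None, 4, 3)
```

The rest of this post shows how Spark reaches the same result with a single scan, via Expand, instead of a real union of two aggregations.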
The logical plan changes as follows (only the grouping-sets-related part is shown):
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
'Sort ['col1 ASC NULLS FIRST], true 'Sort ['col1 ASC NULLS FIRST], true
+- 'GroupingSets [ArrayBuffer('col1), ArrayBuffer('col2)], ['col1, 'col2], ['col1, 'count(1) AS alias#221] +- 'GroupingSets [ArrayBuffer('col1), ArrayBuffer('col2)], ['col1, 'col2], ['col1, 'count(1) AS alias#221]
! +- 'UnresolvedRelation [test_table], [], false +- 'SubqueryAlias spark_catalog.default.test_table
! +- 'UnresolvedCatalogRelation `default`.`test_table`, [], false
Here is what the pieces of the GroupingSets node mean:
'GroupingSets [ArrayBuffer('col1), ArrayBuffer('col2)], ['col1, 'col2], ['col1, 'count(1) AS alias#221]
- the leading ' (single quote) marks a plan or expression that has not been resolved yet;
- [ArrayBuffer('col1), ArrayBuffer('col2)] holds the two grouping sets, col1 and col2;
- ['col1, 'col2] holds the group by columns, col1 and col2;
- ['col1, 'count(1) AS alias#221] holds the aggregate expressions, i.e. the select list count(*), col1 as alias.
Next comes the ResolveGroupingAnalytics rule:
06:49:07.323 WARN org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics ===
'Sort ['col1 ASC NULLS FIRST], true 'Sort ['col1 ASC NULLS FIRST], true
!+- 'GroupingSets [ArrayBuffer(col1#223), ArrayBuffer(col2#224)], [col1#223, col2#224], [col1#223, count(1) AS alias#221L] +- Aggregate [col1#229, col2#230, spark_grouping_id#228L], [col1#229, count(1) AS alias#221L]
! +- SubqueryAlias spark_catalog.default.test_table +- Expand [List(col1#223, col2#224, pt#225, col1#226, null, 1), List(col1#223, col2#224, pt#225, null, col2#227, 2)], [col1#223, col2#224, pt#225, col1#229, col2#230, spark_grouping_id#228L]
! +- Relation[col1#223,col2#224,pt#225] parquet +- Project [col1#223, col2#224, pt#225, col1#223 AS col1#226, col2#224 AS col2#227]
! +- SubqueryAlias spark_catalog.default.test_table
! +- Relation[col1#223,col2#224,pt#225] parquet
You can read the code yourself; here we reason about it at the logical level:
'GroupingSets [ArrayBuffer(col1#223), ArrayBuffer(col2#224)], [col1#223, col2#224], [col1#223, count(1) AS alias#221L]
||
\/
+- Aggregate [col1#229, col2#230, spark_grouping_id#228L], [col1#229, count(1) AS alias#221L]
+- Expand [List(col1#223, col2#224, pt#225, col1#226, null, 1), List(col1#223, col2#224, pt#225, null, col2#227, 2)], [col1#223, col2#224, pt#225, col1#229, col2#230, spark_grouping_id#228L]
+- Project [col1#223, col2#224, pt#225, col1#223 AS col1#226, col2#224 AS col2#227]
Extracting the most important transformations and explaining them one at a time:
+- Project [col1#223, col2#224, pt#225, col1#223 AS col1#226, col2#224 AS col2#227]
- The first three expressions, col1#223, col2#224, pt#225, come straight from the Relation (that is, from the table test_a_pt, matching the table's columns).
- The trailing expressions, col1#223 AS col1#226 and col2#224 AS col2#227, are assembled from the grouping sets and group by columns (aliases are added so that Expand can reference them). If there is no group by clause, these expressions are taken from the grouping sets values; otherwise they are taken from the group by list (in Spark 3.1.2 the group by attributes must contain every attribute used in the grouping sets; SPARK-33229 lifts this restriction):
  e.g. group by col1,col2 grouping sets (col1,col2) takes col1,col2
  e.g. grouping sets (col1,col2) takes col1,col2
For the Expand node:
Expand [List(col1#223, col2#224, pt#225, col1#226, null, 1), List(col1#223, col2#224, pt#225, null, col2#227, 2)], [col1#223, col2#224, pt#225, col1#229, col2#230, spark_grouping_id#228L]
List(col1#223, col2#224, pt#225, col1#226, null, 1) and List(col1#223, col2#224, pt#225, null, col2#227, 2)
are Expand's input expressions. Inside List(col1#223, col2#224, pt#225, col1#226, null, 1):
- col1#223, col2#224, pt#225 are again the columns taken directly from table test_a_pt, matching the table's columns;
- col1#226 comes from the Project's col1#223 AS col1#226 (used as an input expression of Expand);
- null is a value injected by the grouping sets semantics, because col2 does not belong to this grouping set;
- 1 is another injected literal, the grouping id.
List(col1#223, col2#224, pt#225, null, col2#227, 2) reads the same way, except that the null moves to the col1 slot and the 1 becomes 2, so that the later aggregation can tell the two grouping sets apart.
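The way these projections are built can be sketched as follows: for each grouping set, keep the group-by attributes that are in the set, emit null for the ones that are not, and append a bitmask grouping id. This is a simplified illustration of what ResolveGroupingAnalytics produces, not the actual Spark code; the function and variable names are mine:

```python
def build_projections(child_cols, group_by, grouping_sets):
    """For each grouping set, emit the child columns unchanged, then the
    group-by columns (nulled out when absent from the set), then an id
    whose bit (counted from the left) is set for each absent column."""
    projections = []
    for gset in grouping_sets:
        gid = 0
        keyed = []
        for i, col in enumerate(group_by):
            if col in gset:
                keyed.append(col)
            else:
                keyed.append(None)
                gid |= 1 << (len(group_by) - 1 - i)
        projections.append(child_cols + keyed + [gid])
    return projections

projs = build_projections(
    ["col1", "col2", "pt"],          # columns from the table
    ["col1", "col2"],                # group by list
    [["col1"], ["col2"]],            # grouping sets
)
# projs[0] == ["col1", "col2", "pt", "col1", None, 1]
# projs[1] == ["col1", "col2", "pt", None, "col2", 2]
```

The ids 1 and 2 reproduce the literals seen in the plan above: for the set (col1) the absent column is col2, setting the low bit; for the set (col2) the absent column is col1, setting the high bit.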
[col1#223, col2#224, pt#225, col1#229, col2#230, spark_grouping_id#228L]
are Expand's output expressions:
- col1#223, col2#224, pt#225 match the columns of table test_a_pt;
- col1#229, col2#230, spark_grouping_id#228L are the columns Expand adds. Because col1 and col2 may be null in the expanded rows, their exprIds differ from the table's, and spark_grouping_id#228L is a purely synthetic column.
Note that Expand's input expressions form a Seq of Seqs: in ExpandExec every input row is multiplied, producing one output row per inner Seq.
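That row multiplication can be modeled in a few lines. This is a simplified sketch of ExpandExec's behavior, assuming each projection entry is either a column name (a string) or a literal; it is not the real Spark implementation:

```python
def expand(rows, projections):
    """Each input row yields len(projections) output rows: one per
    projection, where a string picks a column from the row and any
    other value (None, a grouping id) is emitted as a literal."""
    out = []
    for row in rows:          # row is a dict: column name -> value
        for proj in projections:
            out.append([row[e] if isinstance(e, str) else e for e in proj])
    return out

rows = [{"col1": 1, "col2": 2, "pt": "20220101"}]
projections = [
    ["col1", "col2", "pt", "col1", None, 1],
    ["col1", "col2", "pt", None, "col2", 2],
]
out = expand(rows, projections)
# one input row becomes two output rows:
# [1, 2, "20220101", 1, None, 1]
# [1, 2, "20220101", None, 2, 2]
```

With two grouping sets the row count doubles; in general it grows by a factor equal to the number of inner Seqs, which is why large grouping sets can be expensive.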
For the Aggregate node:
Aggregate [col1#229, col2#230, spark_grouping_id#228L], [col1#229, count(1) AS alias#221L]
- [col1#229, col2#230, spark_grouping_id#228L] groups Expand's output rows by these three expressions;
- [col1#229, count(1) AS alias#221L] is the aggregate expression list, covering the grouped columns and the aggregate functions, i.e. the select list count(*), col1 as alias.
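Putting the pieces together: aggregating the expanded rows by (col1, col2, spark_grouping_id) reproduces exactly the grouping-sets result. A sketch over the six sample rows (illustrative Python, not Spark code):

```python
from collections import Counter

rows = [(1, 2), (3, 4)] * 3   # the six (col1, col2) rows inserted above

# Expand: each row becomes two rows, (col1, null, 1) and (null, col2, 2)
expanded = [r for c1, c2 in rows for r in [(c1, None, 1), (None, c2, 2)]]

# Aggregate: group by (col1, col2, spark_grouping_id) and count(1)
counts = Counter(expanded)
# e.g. counts[(1, None, 1)] == 3 and counts[(None, 2, 2)] == 3
```

Because the grouping id differs between the two sets, a row grouped by col1 can never collide with a row grouped by col2, even when both columns happen to be null.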
That completes the walk-through of how Grouping Sets is turned into Expand.