Hive中的Predicate Pushdown Rules(谓词下推规则)
Posted strongyoung88
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Hive中的Predicate Pushdown Rules(谓词下推规则)相关的知识,希望对你有一定的参考价值。
谓词下推概念
谓词下推 Predicate Pushdown(PPD)
:简而言之,就是在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,减少了map端的输出,降低了数据在集群上传输的量,节约了集群的资源,也提升了任务的性能。
PPD 配置
PPD
控制参数:hive.optimize.ppd
- Default Value: true
- Added In: Hive 0.4.0
相关定义
- Preserved Row table
The table in an Outer Join that must return all rows.
For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.
- Null Supplying table
This is the table that has nulls filled in for its columns in unmatched rows.
In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also Null Supplying tables.
- During Join predicate
A predicate that is in the JOIN ON clause.
For example, in ‘R1 join R2 on R1.x = 5’ the predicate ‘R1.x = 5’ is a During Join predicate.
- After Join predicate
A predicate that is in the WHERE clause.
PPD规则:
规则的逻辑描述如下:
- During Join predicates cannot be pushed past Preserved Row tables.
- After Join predicates cannot be pushed past Null Supplying tables.
以表格的形式描述如下:
- | Preserved Row tables | Null Supplying tables |
---|---|---|
Join Predicate | Case J1: Not Pushed | Case J2: Pushed |
Where Predicate | Case W1: Pushed | Case W2: Not Pushed |
Push
:谓词下推,可以理解为被优化
Not Push
:谓词没有下推,可以理解为没有被优化
实验
实验结果列表形式:
Pushed or Not | SQL |
---|---|
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Pushed | select ename,dept_name from E join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Pushed | select ename,dept_name from E join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Not Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Pushed | select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Not Pushed | select ename,dept_name from E left outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Not Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Not Pushed | select ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Pushed | select ename,dept_name from E right outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') ; |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where E.eid='HZ001' ; |
Not Pushed | select ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001') ; |
Not Pushed | select ename,dept_name from E full outer join D on E.dept_id = D.dept_id where D.dept_id='D001' ; |
实验结果表格形式:
Join(inner join) | Left Outer Join | Right Outer Join | Full Outer Join | |||||
---|---|---|---|---|---|---|---|---|
Left Table | Right Table | Left Table | Right Table | Left Table | Right Table | Left Table | Right Table | |
Join Predicate | Pushed | Pushed | Not Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Not Pushed |
Where Predicate | Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Pushed | Not Pushed | Not Pushed |
此表实际上就是上述PPD规则表
结论
1、对于Join(Inner Join)、Full outer Join,条件写在on后面,还是where后面,性能上面没有区别;
2、对于Left outer Join ,右侧的表写在on后面、左侧的表写在where后面,性能上有提高;
3、对于Right outer Join,左侧的表写在on后面、右侧的表写在where后面,性能上有提高;
4、当条件分散在两个表时,谓词下推可按上述结论2和3自由组合,情况如下:
SQL | 过滤时机 |
---|---|
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001' and D.dept_id = 'D001'); | dept_id在map端过滤,eid在reduce端过滤 |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id = 'D001') where E.eid='HZ001'; | dept_id,eid都在map端过滤 |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') where D.dept_id = 'D001'; | dept_id,eid都在reduce端过滤 |
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id ) where E.eid='HZ001' and D.dept_id = 'D001'; | dept_id在reduce端过滤,eid在map端过滤 |
注意:如果在表达式中含有不确定函数,整个表达式的谓词将不会被pushed,例如
select a.*
from a join b on a.id = b.id
where a.ds = '2019-10-09' and a.create_time = unix_timestamp();
因为unix_timestamp
是不确定函数,在编译的时候无法得知,所以,整个表达式不会被pushed,即ds='2019-10-09'
也不会被提前过滤。类似的不确定函数还有rand()
等。
参考文献:
[1] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior
以上是关于Hive中的Predicate Pushdown Rules(谓词下推规则)的主要内容,如果未能解决你的问题,请参考以下文章
Kylin 下压查询 (Pushdown) 到 Impala
hive 错误 FAILED: SemanticException [Error 10041]: No partition predicate found for
MySQL 5.6 Index Condition Pushdown
为啥 LINQ 中的 LastOrDefault(predicate) 比 FirstOrDefault(predicate) 快