Hive中的Predicate Pushdown Rules(谓词下推规则)

Posted strongyoung88

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Hive中的Predicate Pushdown Rules(谓词下推规则)相关的知识,希望对你有一定的参考价值。

谓词下推概念

谓词下推 Predicate Pushdown(PPD):简而言之,就是在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,减少了map端的输出,降低了数据在集群上传输的量,节约了集群的资源,也提升了任务的性能。

PPD 配置

PPD控制参数:hive.optimize.ppd

  • Default Value: true
  • Added In: Hive 0.4.0

相关定义

  • Preserved Row table

The table in an Outer Join that must return all rows.
For left outer joins this is the Left table, for right outer joins it is the Right table, and for full outer joins both tables are Preserved Row tables.

  • Null Supplying table

This is the table that has nulls filled in for its columns in unmatched rows.
In the non-full outer join case, this is the other table in the Join. For full outer joins both tables are also Null Supplying tables.

  • During Join predicate

A predicate that is in the JOIN ON clause.
For example, in ‘R1 join R2 on R1.x = 5’ the predicate ‘R1.x = 5’ is a During Join predicate.

  • After Join predicate

A predicate that is in the WHERE clause.

PPD规则:

规则的逻辑描述如下:

  • During Join predicates cannot be pushed past Preserved Row tables.
  • After Join predicates cannot be pushed past Null Supplying tables.

以表格的形式描述如下:

-Preserved Row tablesNull Supplying tables
Join PredicateCase J1: Not PushedCase J2: Pushed
Where PredicateCase W1: PushedCase W2: Not Pushed

Push:谓词下推,可以理解为被优化
Not Push:谓词没有下推,可以理解为没有被优化

实验

实验结果列表形式:

Pushed or NotSQL
Pushedselect ename,dept_name from E join D on ( E.dept_id = D.dept_id and E.eid='HZ001');
Pushedselect ename,dept_name from E join D on E.dept_id = D.dept_id where E.eid='HZ001';
Pushedselect ename,dept_name from E join D on ( E.dept_id = D.dept_id and D.dept_id='D001');
Pushedselect ename,dept_name from E join D on E.dept_id = D.dept_id where D.dept_id='D001';
Not Pushedselect ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001');
Pushedselect ename,dept_name from E left outer join D on E.dept_id = D.dept_id where E.eid='HZ001';
Pushedselect ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001');
Not Pushedselect ename,dept_name from E left outer join D on E.dept_id = D.dept_id where D.dept_id='D001';
Pushedselect ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001');
Not Pushedselect ename,dept_name from E right outer join D on E.dept_id = D.dept_id where E.eid='HZ001';
Not Pushedselect ename,dept_name from E right outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001');
Pushedselect ename,dept_name from E right outer join D on E.dept_id = D.dept_id where D.dept_id='D001';
Not Pushedselect ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001');
Not Pushedselect ename,dept_name from E full outer join D on E.dept_id = D.dept_id where E.eid='HZ001';
Not Pushedselect ename,dept_name from E full outer join D on ( E.dept_id = D.dept_id and D.dept_id='D001');
Not Pushedselect ename,dept_name from E full outer join D on E.dept_id = D.dept_id where D.dept_id='D001';

实验结果表格形式:

Join(inner join)Left Outer JoinRight Outer JoinFull Outer Join
Left TableRight TableLeft TableRight TableLeft TableRight TableLeft TableRight Table
Join PredicatePushedPushedNot PushedPushedPushedNot PushedNot PushedNot Pushed
Where PredicatePushedPushedPushedNot PushedNot PushedPushedNot PushedNot Pushed

此表实际上就是上述PPD规则表

结论

1、对于Join(Inner Join)、Full outer Join,条件写在on后面,还是where后面,性能上面没有区别;
2、对于Left outer Join ,右侧的表写在on后面、左侧的表写在where后面,性能上有提高;
3、对于Right outer Join,左侧的表写在on后面、右侧的表写在where后面,性能上有提高;
4、当条件分散在两个表时,谓词下推可按上述结论2和3自由组合,情况如下:

SQL过滤时机
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001' and D.dept_id = 'D001');dept_id在map端过滤,eid在reduce端过滤
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and D.dept_id = 'D001') where E.eid='HZ001';dept_id,eid都在map端过滤
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id and E.eid='HZ001') where D.dept_id = 'D001';dept_id,eid都在reduce端过滤
select ename,dept_name from E left outer join D on ( E.dept_id = D.dept_id ) where E.eid='HZ001' and D.dept_id = 'D001';dept_id在reduce端过滤,eid在map端过滤

注意:如果在表达式中含有不确定函数,整个表达式的谓词将不会被pushed,例如

select a.* 
from a join b on a.id = b.id
where a.ds = '2019-10-09' and a.create_time = unix_timestamp();

因为unix_timestamp是不确定函数,在编译的时候无法得知,所以,整个表达式不会被pushed,即ds='2019-10-09'也不会被提前过滤。类似的不确定函数还有rand()等。

参考文献:
[1] https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

以上是关于Hive中的Predicate Pushdown Rules(谓词下推规则)的主要内容,如果未能解决你的问题,请参考以下文章

谓词下推 vs On 子句

Kylin 下压查询 (Pushdown) 到 Impala

hive 错误 FAILED: SemanticException [Error 10041]: No partition predicate found for

MySQL 5.6 Index Condition Pushdown

为啥 LINQ 中的 LastOrDefault(predicate) 比 FirstOrDefault(predicate) 快

MySQL Index Condition Pushdown