Predicate Pushdown

Posted 2023-03-20

Predicate pushdown: the concept

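Predicate pushdown (PPD) means, in short, applying filter conditions as early as possible without changing the query result. Once a predicate is pushed down, the filter runs on the map side, which reduces the map output, cuts the amount of data transferred across the cluster, saves cluster resources, and improves job performance.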
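As a rough illustration of the idea, the two queries below return the same rows for an inner join; the second one applies the filter by hand before the join, which is roughly what the optimizer arranges when it pushes the predicate down. This is only a sketch: it borrows the E and D tables and their columns from the examples later in this article, and the alias t is made up for illustration.

```sql
-- As written by the user: the filter appears after the join.
select E.ename, D.dept_name
from E join D on (E.dept_id = D.dept_id)
where E.eid = 'HZ001';

-- Hand-pushed equivalent: filter E first, then join the smaller input.
-- (t is a hypothetical alias used only in this sketch.)
select t.ename, D.dept_name
from (select ename, dept_id from E where eid = 'HZ001') t
join D on (t.dept_id = D.dept_id);
```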

PPD configuration: pushdown is controlled by the parameter hive.optimize.ppd (enabled by default).
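A minimal sketch of toggling the parameter for the current session, for example to compare execution plans with and without pushdown. It assumes a Hive CLI or Beeline session.

```sql
-- Print the current value of the PPD switch.
set hive.optimize.ppd;

-- Turn PPD off for this session (e.g. to compare plans), then back on.
set hive.optimize.ppd=false;
set hive.optimize.ppd=true;
```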

Push: the predicate is pushed down, i.e. the filter is optimized

Not Push: the predicate is not pushed down, i.e. the filter is not optimized

Experimental results:

This table is effectively the PPD rule table described above.

Conclusions

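1. For Join (Inner Join) and Full Outer Join, it makes no difference to performance whether a condition is written in ON or in WHERE: for an inner join the predicate is pushed down from either position, and for a full outer join from neither;

2. For Left Outer Join, performance improves when conditions on the right table are written in ON and conditions on the left table are written in WHERE, because those are the positions from which the predicate is pushed down;

3. For Right Outer Join, performance improves when conditions on the left table are written in ON and conditions on the right table are written in WHERE;

4. When the conditions are spread over both tables, pushdown follows conclusions 2 and 3 independently for each predicate, giving the combinations below: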

SQL, followed by when each filter runs:

select ename, dept_name from E left outer join D on (E.dept_id = D.dept_id and E.eid = 'HZ001' and D.dept_id = 'D001');
    D.dept_id is filtered on the map side; E.eid is filtered on the reduce side

select ename, dept_name from E left outer join D on (E.dept_id = D.dept_id and D.dept_id = 'D001') where E.eid = 'HZ001';
    D.dept_id and E.eid are both filtered on the map side

select ename, dept_name from E left outer join D on (E.dept_id = D.dept_id and E.eid = 'HZ001') where D.dept_id = 'D001';
    D.dept_id and E.eid are both filtered on the reduce side

select ename, dept_name from E left outer join D on (E.dept_id = D.dept_id) where E.eid = 'HZ001' and D.dept_id = 'D001';
    D.dept_id is filtered on the reduce side; E.eid is filtered on the map side
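One caveat worth keeping in mind when applying conclusions 2 and 3: in an outer join, moving a predicate between ON and WHERE generally changes the result set, not just the plan. The sketch below uses the same E and D tables to show the difference for a left-table predicate; the comments describe standard SQL outer join semantics rather than anything specific to Hive.

```sql
-- Predicate in ON: every row of E is returned. Only rows with
-- eid = 'HZ001' can match D; all other E rows get NULL for dept_name.
select E.ename, D.dept_name
from E left outer join D
  on (E.dept_id = D.dept_id and E.eid = 'HZ001');

-- Predicate in WHERE: only E rows with eid = 'HZ001' are returned.
select E.ename, D.dept_name
from E left outer join D
  on (E.dept_id = D.dept_id)
where E.eid = 'HZ001';
```

So the placement should first be chosen for the result that is wanted, and only then checked against the pushdown rules.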

Note: if an expression contains a non-deterministic function, none of the predicates in that expression are pushed down. For example:

select a.*
from a join b on a.id = b.id
where a.ds = '2019-10-09' and a.create_time = unix_timestamp();

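Because unix_timestamp() is a non-deterministic function whose value cannot be known at compile time, the whole expression is not pushed down, which means that ds = '2019-10-09' is not filtered early either. rand() is another non-deterministic function of this kind.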
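One possible workaround, sketched here as an assumption rather than a documented recipe, is to move the deterministic filter into a subquery so that it sits directly on the table scan regardless of what happens to the outer WHERE clause. The alias t is hypothetical, and the resulting plan should be checked with EXPLAIN on the Hive version in use.

```sql
-- The ds filter is applied inside the subquery, before the join;
-- only the non-deterministic comparison stays in the outer WHERE.
select t.*
from (select * from a where ds = '2019-10-09') t
join b on t.id = b.id
where t.create_time = unix_timestamp();
```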

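A few words on predicate pushdown

For data warehouse developers, writing a good SQL statement calls for a close reading of the Hive source code.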

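FilterPPD extracts the predicates that can be pushed down and stores them in OpWalkerInfo.opToPushdownPredMap.pushdownPreds.

JoinPPD mainly separates the predicates that can be pushed down from those that cannot, and regenerates a FilterOperator for the ones that cannot -> FIL[8].

TableScanPPD turns the predicates that can be pushed down into FIL[9] and places it right after TS[0].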
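The effect of these steps can be inspected without reading the source by asking Hive for the query plan. The statement below is a sketch that reuses the E and D tables from the earlier examples. No output is shown because the operator names and numbering (TS, FIL, JOIN and their indexes) depend on the Hive version and on the query itself.

```sql
-- Print the plan instead of running the query. A filter that appears
-- directly under a table scan was pushed down to that table; a filter
-- that appears after the join operator was not.
explain
select E.ename, D.dept_name
from E left outer join D
  on (E.dept_id = D.dept_id and D.dept_id = 'D001')
where E.eid = 'HZ001';
```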
