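Understanding Predicate Pushdown in Hive (ON vs. WHERE)

Scenario

Data warehouse development routinely involves multi-table joins, and with them the choice between ON and WHERE. If the roles of the two are still unclear to you, this article should settle it.

First, the most visible differences between WHERE and ON in SQL:

In an outer join, ON keeps every row of the preserved table whether or not it satisfies the condition (rows without a match are padded with NULL), whereas WHERE keeps only the rows that satisfy the condition.

The effect of ON depends on the join type (inner or outer), whereas WHERE behaves the same for every join type: it simply filters the joined result.

These points are not demonstrated one by one here; the focus is on how WHERE and ON affect performance in Hive. If you have an environment available, you can try it yourself with the following data: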

-- Two small partitioned tables used throughout the article
CREATE TABLE a (id STRING, name STRING) PARTITIONED BY (dt STRING);
CREATE TABLE b (id STRING, dept STRING) PARTITIONED BY (dt STRING);
-- Load one partition per table
INSERT INTO TABLE a PARTITION (dt = '2022-09-08') VALUES ("1", "Daniel");
INSERT INTO TABLE a PARTITION (dt = '2022-09-08') VALUES ("2", "Andy");
INSERT INTO TABLE a PARTITION (dt = '2022-09-08') VALUES ("3", "Marc");
INSERT INTO TABLE b PARTITION (dt = '2022-09-08') VALUES ("1", "BD");
INSERT INTO TABLE b PARTITION (dt = '2022-09-08') VALUES ("2", "BE");
-- Check the loaded rows
SELECT * FROM a WHERE dt = '2022-09-08';
SELECT * FROM b WHERE dt = '2022-09-08';

Start with a realistic requirement: join tables a and b, taking only the latest-date data from a.

SELECT *
FROM a
JOIN b ON a.id = b.id
WHERE a.dt = '2022-09-08';

Most people would write it this way. The conclusion first: there is nothing wrong with it.

The Problem

Some readers might try this instead:

SELECT *
FROM a
JOIN b ON a.id = b.id
AND a.dt = '2022-09-08';

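For an inner join this is equivalent to the query above, and it is also fine. So where is the problem?

Suppose a is the main table and b is joined to it, that is, a left outer join. Now the two forms are no longer interchangeable.

Form 1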

SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
WHERE a.dt = '2022-09-08';

Efficient: Hive reads only the data for the specified date.

Form 2

SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
AND a.dt = '2022-09-08';

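Slow: the dt condition is not pushed down to the scan of a, so Hive reads all of a and evaluates the condition only during the join. Note that this form can also return different rows from Form 1: in a left outer join every row of a is kept, and rows of a that fail the ON condition simply get NULL in the b columns.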
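The row-level difference between Form 1 and Form 2 can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 module instead of Hive (an assumption made only so that it runs anywhere; dt is an ordinary column here because SQLite has no partitions), and it adds one hypothetical row with dt = '2022-09-07' that is not part of the article's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# dt is a plain column in this demo; SQLite has no partitioned tables.
cur.execute("CREATE TABLE a (id TEXT, name TEXT, dt TEXT)")
cur.execute("CREATE TABLE b (id TEXT, dept TEXT, dt TEXT)")
cur.executemany("INSERT INTO a VALUES (?, ?, ?)", [
    ("1", "Daniel", "2022-09-08"),
    ("2", "Andy", "2022-09-08"),
    ("3", "Marc", "2022-09-08"),
    ("4", "Old", "2022-09-07"),  # hypothetical row, not in the article's data
])
cur.executemany("INSERT INTO b VALUES (?, ?, ?)", [
    ("1", "BD", "2022-09-08"),
    ("2", "BE", "2022-09-08"),
])

# Form 1: the dt condition is in WHERE, so rows of a from other dates are dropped.
form1 = cur.execute(
    "SELECT a.id, a.dt, b.dept FROM a LEFT JOIN b ON a.id = b.id "
    "WHERE a.dt = '2022-09-08' ORDER BY a.id").fetchall()

# Form 2: the dt condition is in ON, so every row of a is kept.
form2 = cur.execute(
    "SELECT a.id, a.dt, b.dept FROM a LEFT JOIN b ON a.id = b.id "
    "AND a.dt = '2022-09-08' ORDER BY a.id").fetchall()

print(len(form1), form1)
print(len(form2), form2)
```

With this data Form 1 returns three rows and Form 2 returns four, the extra one being the 2022-09-07 row with NULL in the b column. With only the single partition loaded in the article, the two forms happen to return the same rows.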

Form 3

SELECT *
FROM
  (SELECT *
   FROM a
   WHERE dt = '2022-09-08') t1
LEFT JOIN b ON t1.id = b.id;

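Efficient: Hive reads only the data for the specified date. The query looks clumsy, but it behaves the same as Form 1. To write SQL that is less clumsy, we first need to introduce predicate pushdown in Hive.

The execution plans of Form 1 and Form 2 back this up. The engine used here is Hive on Spark.

Form 1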

Explain	
STAGE DEPENDENCIES:	
  Stage-2 is a root stage	
  Stage-1 depends on stages: Stage-2	
  Stage-0 depends on stages: Stage-1	
	
STAGE PLANS:	
  Stage: Stage-2	
    Spark	
      DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53374	
      Vertices:	
        Map 2 	
            Map Operator Tree:	
                TableScan	
                  alias: b	
                  Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
                  // No filter needed
                  Spark HashTable Sink Operator	
                    keys:	
                      0 id (type: string)	
                      1 id (type: string)	
            Local Work:	
              Map Reduce Local Work	
	
  Stage: Stage-1	
    Spark	
      DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53373	
      Vertices:	
        Map 1 	
            Map Operator Tree:	
                TableScan	
                  alias: a
                  // The filter is applied during the table scan, so the HashTable Sink Operator does not need to filter again
                  filterExpr: (dt = '2022-09-08') (type: boolean)	
                  Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE	
                  Filter Operator	
                    predicate: (dt = '2022-09-08') (type: boolean)	
                    Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE	
                    Map Join Operator	
                      condition map:	
                           Left Outer Join0 to 1	
                      keys:	
                        0 id (type: string)	
                        1 id (type: string)	
                      outputColumnNames: _col0, _col1, _col6, _col7, _col8	
                      input vertices:	
                        1 Map 2	
                      Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                      Select Operator	
                        expressions: _col0 (type: string), _col1 (type: string), '2022-09-08' (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)	
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5	
                        Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                        File Output Operator	
                          compressed: false	
                          Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                          table:	
                              input format: org.apache.hadoop.mapred.SequenceFileInputFormat	
                              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat	
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
            Local Work:	
              Map Reduce Local Work	
	
  Stage: Stage-0	
    Fetch Operator	
      limit: -1	
      Processor Tree:	
        ListSink

Form 2

Explain	
STAGE DEPENDENCIES:	
  Stage-2 is a root stage	
  Stage-1 depends on stages: Stage-2	
  Stage-0 depends on stages: Stage-1	
	
STAGE PLANS:	
  Stage: Stage-2	
    Spark	
      DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53395	
      Vertices:	
        Map 2 	
            Map Operator Tree:	
                TableScan	
                  alias: b	
                  Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE	
                  Spark HashTable Sink Operator	
                  // Filter applied here (first time)
                    filter predicates:	
                      0 {(dt = '2022-09-08')}	
                      1 	
                    keys:	
                      0 id (type: string)	
                      1 id (type: string)	
            Local Work:	
              Map Reduce Local Work	
	
  Stage: Stage-1	
    Spark	
      DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53394	
      Vertices:	
        Map 1 	
            Map Operator Tree:	
                TableScan
                  // No filter during the table scan, so the filter has to be applied in each stage (HashTable Sink Operator and Map Join Operator)
                  alias: a	
                  Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE	
                  Map Join Operator	
                    condition map:	
                         Left Outer Join0 to 1	
                    // Filter applied again (second time)
                    filter predicates:	
                      0 {(dt = '2022-09-08')}	
                      1 	
                    keys:	
                      0 id (type: string)	
                      1 id (type: string)	
                    outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8	
                    input vertices:	
                      1 Map 2	
                    Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE	
                    Select Operator	
                      expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)	
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5	
                      Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE	
                      File Output Operator	
                        compressed: false	
                        Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE	
                        table:	
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat	
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat	
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
            Local Work:	
              Map Reduce Local Work	
	
  Stage: Stage-0	
    Fetch Operator	
      limit: -1	
      Processor Tree:	
        ListSink	

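As the comments show, in Form 1 the predicate is pushed down and the data is filtered as soon as it is scanned. In Form 2 the predicate is not pushed down: all of the data is read, and the filter is then applied more than once.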
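A quick way to check this without reading a whole plan is to look for a filterExpr line under each TableScan. The following is a rough sketch that assumes plan text laid out like the output above; the function name scan_filters is made up for illustration, and other Hive versions or engines may format their plans differently:

```python
def scan_filters(plan_text):
    """Map each TableScan alias to its filterExpr, or None if the scan is unfiltered."""
    filters = {}
    alias = None
    in_scan = False
    for raw in plan_text.splitlines():
        line = raw.strip()
        if line == "TableScan":
            in_scan, alias = True, None
        elif in_scan and line.startswith("alias:"):
            alias = line.split(":", 1)[1].strip()
            filters[alias] = None
        elif in_scan and alias and line.startswith("filterExpr:"):
            filters[alias] = line.split(":", 1)[1].strip()
        elif in_scan and alias and line.endswith("Operator"):
            in_scan = False  # the first operator below the scan ends the scan header
    return filters


# Shortened excerpt in the style of the Form 1 plan.
sample = """
TableScan
  alias: b
  Statistics: Num rows: 2
  Spark HashTable Sink Operator
TableScan
  alias: a
  filterExpr: (dt = '2022-09-08') (type: boolean)
  Filter Operator
"""
print(scan_filters(sample))
```

An alias that maps to None was scanned without a pushed-down filter, which is what the scan of a looks like in the Form 2 plan.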

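Hive Predicate Pushdown

The Concept of Predicate Pushdown

Predicate pushdown (PPD) means, in short, applying filter conditions as early as possible without changing the result. Once a predicate is pushed down, the filter runs on the map side, which reduces map output, lowers the volume of data transferred across the cluster, saves cluster resources, and improves job performance.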
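To make the idea concrete, here is a toy model in plain Python (not Hive code; the row data and function names are illustrative assumptions). It runs an inner join once with the filter applied after the join and once with the filter applied before it, and records how many rows of the left table reach the join:

```python
def join_then_filter(left, right, dt):
    # No pushdown: every left row enters the join, the filter runs afterwards.
    joined = [(l, r) for l in left for r in right if l["id"] == r["id"]]
    result = [(l, r) for l, r in joined if l["dt"] == dt]
    return result, len(left)


def filter_then_join(left, right, dt):
    # Pushdown: the filter runs at scan time, so fewer rows enter the join.
    kept = [l for l in left if l["dt"] == dt]
    result = [(l, r) for l in kept for r in right if l["id"] == r["id"]]
    return result, len(kept)


left = [
    {"id": "1", "dt": "2022-09-08"},
    {"id": "2", "dt": "2022-09-08"},
    {"id": "1", "dt": "2022-09-07"},
    {"id": "2", "dt": "2022-09-07"},
]
right = [{"id": "1", "dept": "BD"}, {"id": "2", "dept": "BE"}]

late, late_rows = join_then_filter(left, right, "2022-09-08")
early, early_rows = filter_then_join(left, right, "2022-09-08")
print(late == early, late_rows, early_rows)
```

Both variants produce the same joined rows, but the pushed-down variant feeds two rows into the join instead of four. On a real cluster that difference is what shrinks the map output and the data moved between nodes.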

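PPD Configuration

PPD is controlled by the parameter hive.optimize.ppd, which is enabled by default.

Basic Concepts

Term	Explanation
Preserved Row table	In an outer join, the table whose rows must all be returned. In a left outer join the left table is the preserved row table; in a right outer join the right table is; in a full outer join both tables must return all of their rows, so both are preserved row tables.
Null Supplying table	In an outer join, the table whose columns are filled with NULL for rows that have no match. In a left outer join all rows of the left table are returned and the right table's columns are NULL where there is no match, so the right table is the null supplying table; in a right outer join the left table is; in a full outer join both tables are null supplying tables, because either side may be filled with NULL.
During Join predicate	A predicate in the JOIN ... ON clause. For example, in a join b on a.id = 1, a.id = 1 is a During Join predicate.
After Join predicate	A predicate in the WHERE clause.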

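Official Explanation

The logic can be summarized by these two rules:

During Join predicates cannot be pushed past Preserved Row tables. (A predicate on the preserved row table written in the ON clause cannot be pushed down.)
After Join predicates cannot be pushed past Null Supplying tables. (A predicate on the null supplying table written in the WHERE clause cannot be pushed down.)

This is captured in the following table: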

	Preserved Row Table	Null Supplying Table
Join Predicate	Case J1: Not Pushed	Case J2: Pushed
Where Predicate	Case W1: Pushed	Case W2: Not Pushed

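See the official wiki for the concrete cases; it includes fairly detailed execution plan analysis: https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

Rule Summary

A predicate on the preserved row table cannot be pushed down when written in the join (ON) clause; it has to go in WHERE.
A predicate on the null supplying table cannot be pushed down when written after the join (WHERE); it has to go in ON.
For an inner join, predicates are pushed down whether they are in ON or in WHERE.
For a full outer join, predicates are not pushed down whether they are in ON or in WHERE.

Concrete Cases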

Pushed or Not	SQL
Pushed	select * from a join b on a.id = b.id and a.dt = '2022-09-08';
Pushed	select * from a join b on a.id = b.id where a.dt = '2022-09-08';
Pushed	select * from a join b on a.id = b.id and b.dt = '2022-09-08';
Pushed	select * from a join b on a.id = b.id where b.dt = '2022-09-08';
Not Pushed	select * from a left join b on a.id = b.id and a.dt = '2022-09-08';
Pushed	select * from a left join b on a.id = b.id where a.dt = '2022-09-08';
Pushed	select * from a left join b on a.id = b.id and b.dt = '2022-09-08';
Not Pushed	select * from a left join b on a.id = b.id where b.dt = '2022-09-08';
Pushed	select * from a right join b on a.id = b.id and a.dt = '2022-09-08';
Not Pushed	select * from a right join b on a.id = b.id where a.dt = '2022-09-08';
Not Pushed	select * from a right join b on a.id = b.id and b.dt = '2022-09-08';
Pushed	select * from a right join b on a.id = b.id where b.dt = '2022-09-08';
Not Pushed	select * from a full join b on a.id = b.id and a.dt = '2022-09-08';
Not Pushed	select * from a full join b on a.id = b.id where a.dt = '2022-09-08';
Not Pushed	select * from a full join b on a.id = b.id and b.dt = '2022-09-08';
Not Pushed	select * from a full join b on a.id = b.id where b.dt = '2022-09-08';

Rule Table

	join(inner join)		left outer join		right outer join		full outer join
	left table	right table	left table	right table	left table	right table	left table	right table
join	Pushed	Pushed	Not Pushed	Pushed	Pushed	Not Pushed	Not Pushed	Not Pushed
where	Pushed	Pushed	Pushed	Not Pushed	Not Pushed	Pushed	Not Pushed	Not Pushed
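The rule table does not have to be memorized, because it follows mechanically from the two rules quoted from the Hive wiki. A minimal sketch in Python (the names PRESERVED, NULL_SUPPLYING and is_pushed are made up for this illustration and are not part of Hive):

```python
# Which side(s) act as preserved row / null supplying tables for each join type.
PRESERVED = {
    "inner": set(),
    "left": {"left"},
    "right": {"right"},
    "full": {"left", "right"},
}
NULL_SUPPLYING = {
    "inner": set(),
    "left": {"right"},
    "right": {"left"},
    "full": {"left", "right"},
}


def is_pushed(join_type, clause, side):
    """clause is 'on' or 'where'; side is the table ('left' or 'right') the predicate refers to."""
    if clause == "on":
        # Rule 1: During Join predicates cannot be pushed past Preserved Row tables.
        return side not in PRESERVED[join_type]
    if clause == "where":
        # Rule 2: After Join predicates cannot be pushed past Null Supplying tables.
        return side not in NULL_SUPPLYING[join_type]
    raise ValueError(f"unknown clause: {clause!r}")


for clause in ("on", "where"):
    row = [is_pushed(j, clause, s)
           for j in ("inner", "left", "right", "full")
           for s in ("left", "right")]
    print(clause, row)
```

Each printed row lines up with the corresponding row of the rule table, with True standing for Pushed and False for Not Pushed.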

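Special Notes

Predicates that use non-deterministic functions such as rand() cannot be pushed down. unix_timestamp() is an exception: its execution plan shows that it is pushed down, with the call already replaced by a constant in the filter expression.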

EXPLAIN
SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
WHERE a.dt = unix_timestamp();

Explain	
STAGE DEPENDENCIES:	
  Stage-2 is a root stage	
  Stage-1 depends on stages: Stage-2	
  Stage-0 depends on stages: Stage-1	
	
STAGE PLANS:	
  Stage: Stage-2	
    Spark	
      DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53432	
      Vertices:	
        Map 2 	
            Map Operator Tree:	
                TableScan	
                  alias: b	
                  Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE	
                  Spark HashTable Sink Operator	
                    // No filter needed
                    keys:	
                      0 id (type: string)	
                      1 id (type: string)	
            Local Work:	
              Map Reduce Local Work	
	
  Stage: Stage-1	
    Spark	
      DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53431	
      Vertices:	
        Map 1 	
            Map Operator Tree:	
                TableScan	
                  alias: a	
                  // Already filtered during the table scan
                  filterExpr: (dt = 1662522398) (type: boolean)	
                  Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE	
                  Filter Operator	
                    predicate: (dt = 1662522398) (type: boolean)	
                    Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE	
                    Map Join Operator	
                      condition map:	
                           Left Outer Join0 to 1	
                      keys:	
                        0 id (type: string)	
                        1 id (type: string)	
                      outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8	
                      input vertices:	
                        1 Map 2	
                      Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                      Select Operator	
                        expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)	
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5	
                        Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                        File Output Operator	
                          compressed: false	
                          Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                          table:	
                              input format: org.apache.hadoop.mapred.SequenceFileInputFormat	
                              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat	
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
            Local Work:	
              Map Reduce Local Work	
	
  Stage: Stage-0	
    Fetch Operator	
      limit: -1	
      Processor Tree:	
        ListSink

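Conclusion

1. For Join (Inner Join) and Full Outer Join, it makes no difference to performance whether a condition is written in ON or in WHERE.
2. For Left Outer Join, performance improves when conditions on the right table are written in ON and conditions on the left table are written in WHERE.
3. For Right Outer Join, performance improves when conditions on the left table are written in ON and conditions on the right table are written in WHERE.
4. When conditions are spread across both tables, conclusions 2 and 3 apply to each condition independently, as follows:

SQL	When the filter runs
select * from a left outer join b on ( a.id = b.id and a.dt='2022-09-08' and b.id = '2022-09-08');	id is filtered on the map side, dt on the reduce side; inefficient
select * from a left outer join b on ( a.id = b.id and b.id = '2022-09-08') where a.dt='2022-09-08';	id and dt are both filtered on the map side; efficient
select * from a left outer join b on ( a.id = b.id and a.dt='2022-09-08') where b.id = '2022-09-08';	id and dt are both filtered on the reduce side; very inefficient
select * from a left outer join b on ( a.id = b.id ) where a.dt='2022-09-08' and b.id = '2022-09-08';	id is filtered on the reduce side, dt on the map side; inefficient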
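When choosing among these placements, it is worth checking that the result is still the one you want, because in an outer join a predicate on the null supplying table means different things in ON and in WHERE. A small sketch with Python's built-in sqlite3 module (not Hive; the filter on dept is an illustrative choice rather than the b.id filter used in the table above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id TEXT, name TEXT)")
cur.execute("CREATE TABLE b (id TEXT, dept TEXT)")
cur.executemany("INSERT INTO a VALUES (?, ?)",
                [("1", "Daniel"), ("2", "Andy"), ("3", "Marc")])
cur.executemany("INSERT INTO b VALUES (?, ?)", [("1", "BD"), ("2", "BE")])

# Predicate on the null supplying table b, written in ON:
# every row of a survives, unmatched rows get NULL for dept.
in_on = cur.execute(
    "SELECT a.id, b.dept FROM a LEFT JOIN b "
    "ON a.id = b.id AND b.dept = 'BD' ORDER BY a.id").fetchall()

# Same predicate written in WHERE: rows whose dept is NULL or not 'BD' are dropped.
in_where = cur.execute(
    "SELECT a.id, b.dept FROM a LEFT JOIN b "
    "ON a.id = b.id WHERE b.dept = 'BD' ORDER BY a.id").fetchall()

print(in_on)
print(in_where)
```

The ON version keeps all three rows of a, while the WHERE version keeps only the row that actually matched. The faster placement is therefore only a valid rewrite when both versions express the intended result.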

Original article: https://blog.csdn.net/a805814077/article/details/126777345