• Understanding Predicate Pushdown in Hive (the Difference Between on and where)


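    Scenario

    Multi-table joins come up constantly in real data warehouse development, and with them the choice between on and where. If the roles of the two are still fuzzy to you, this article should clear them up.

    First, the most visible differences between where and on in SQL:

    1. In an outer join, a condition in on does not drop rows of the preserved table: rows that satisfy it and rows that do not are both returned, and the unmatched ones get NULL for the other table's columns. where keeps only the rows that satisfy the condition.

    2. The effect of on depends on the join type (inner vs. outer). where does not: it simply filters the joined result, whatever the join type.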
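
    The two points above can be checked quickly outside Hive. The following is only a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine, not Hive itself. The tables have the same shape as the sample tables a and b used in the rest of this article; the row with id 4 in each table is a hypothetical addition, there only to make the difference visible.

```python
import sqlite3

# SQLite is used here only as a convenient stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id TEXT, name TEXT, dt TEXT);
    CREATE TABLE b (id TEXT, dept TEXT, dt TEXT);
    INSERT INTO a VALUES ('1', 'Daniel', '2022-09-08'),
                         ('2', 'Andy',   '2022-09-08'),
                         ('3', 'Marc',   '2022-09-08'),
                         ('4', 'Eve',    '2022-09-07');  -- hypothetical older row
    INSERT INTO b VALUES ('1', 'BD', '2022-09-08'),
                         ('2', 'BE', '2022-09-08'),
                         ('4', 'BF', '2022-09-08');      -- hypothetical match for id 4
""")

# Condition on the left table written in ON: no row of a is dropped.
on_rows = conn.execute("""
    SELECT a.id, a.dt, b.dept
    FROM a LEFT JOIN b ON a.id = b.id AND a.dt = '2022-09-08'
    ORDER BY a.id
""").fetchall()

# Same condition written in WHERE: rows of a that fail it are removed.
where_rows = conn.execute("""
    SELECT a.id, a.dt, b.dept
    FROM a LEFT JOIN b ON a.id = b.id
    WHERE a.dt = '2022-09-08'
    ORDER BY a.id
""").fetchall()

print(on_rows)     # 4 rows; id 4 is kept, with dept set to NULL (None)
print(where_rows)  # 3 rows; id 4 is gone
```

    With on, the row of a dated 2022-09-07 is still returned, just without a match from b. With where it disappears from the result.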

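    I won't verify these two points case by case in Hive here; the focus of this article is how where and on affect performance in Hive. If you have an environment at hand, try it yourself with the following sample data: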

    CREATE TABLE a (id string,name string) PARTITIONED BY (dt STRING);
    CREATE TABLE b (id string,dept string) PARTITIONED BY (dt STRING);
    INSERT INTO TABLE a PARTITION(dt='2022-09-08')VALUES ("1","Daniel");
    INSERT INTO TABLE a PARTITION(dt='2022-09-08')VALUES ("2","Andy");
    INSERT INTO TABLE a PARTITION(dt='2022-09-08')VALUES ("3","Marc");
    INSERT INTO TABLE b PARTITION(dt='2022-09-08')VALUES ("1","BD");
    INSERT INTO TABLE b PARTITION(dt='2022-09-08')VALUES ("2","BE");
    SELECT * from a where dt = '2022-09-08';
    SELECT * from b where dt = '2022-09-08';
    

    Start with a real requirement: join tables a and b and take only the latest date's data from a.

    SELECT *
    FROM a
    JOIN b ON a.id = b.id
    WHERE a.dt = '2022-09-08';
    

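    I suspect most people would write it this way. Conclusion first: there is nothing wrong with it.


    Problem description

    Some of you may try this instead: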

    SELECT *
    FROM a
    JOIN b ON a.id = b.id
    AND a.dt = '2022-09-08';
    

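    For an inner join this is equivalent to the query above and is also fine. So where is the problem?

    Suppose a must be the driving table and b is looked up from it, i.e. a left outer join. Now the two forms are no longer interchangeable:

    • Form 1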
    SELECT *
    FROM a
    LEFT JOIN b ON a.id = b.id
    WHERE a.dt = '2022-09-08';
    

    Efficient: Hive reads only the specified date's data from a.

    • Form 2
    SELECT *
    FROM a
    LEFT JOIN b ON a.id = b.id
    AND a.dt = '2022-09-08';
    

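    Slow: Hive reads all of a and feeds every row into the join, applying the date condition only during the join. Because a is the preserved table of the left join, this form is also not equivalent to Form 1 in general: rows of a from other dates are still returned, with b's columns set to NULL.

    • Form 3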
    SELECT *
    FROM
      (SELECT *
       FROM a
       WHERE dt = '2022-09-08') t1
    LEFT JOIN b ON t1.id = b.id;
    

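    Efficient: Hive reads only the specified date's data. The query looks clumsy, but it has the same effect as Form 1. To write SQL that is less clumsy, we first need to look at predicate pushdown in Hive.

    The execution plans of Form 1 and Form 2 back up the claims above. My engine here is Hive on Spark.

    • Form 1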
    Explain	
    STAGE DEPENDENCIES:	
      Stage-2 is a root stage	
      Stage-1 depends on stages: Stage-2	
      Stage-0 depends on stages: Stage-1	
    	
    STAGE PLANS:	
      Stage: Stage-2	
        Spark	
          DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53374	
          Vertices:	
            Map 2 	
                Map Operator Tree:	
                    TableScan	
                      alias: b	
                      Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
                      // no filter needed
                      Spark HashTable Sink Operator	
                        keys:	
                          0 id (type: string)	
                          1 id (type: string)	
                Local Work:	
                  Map Reduce Local Work	
    	
      Stage: Stage-1	
        Spark	
          DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53373	
          Vertices:	
            Map 1 	
                Map Operator Tree:	
                    TableScan	
                      alias: a
                      // the filter is applied right at the table scan, so the later HashTable Sink Operator needs no filter
                      filterExpr: (dt = '2022-09-08') (type: boolean)	
                      Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE	
                      Filter Operator	
                        predicate: (dt = '2022-09-08') (type: boolean)	
                        Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE	
                        Map Join Operator	
                          condition map:	
                               Left Outer Join0 to 1	
                          keys:	
                            0 id (type: string)	
                            1 id (type: string)	
                          outputColumnNames: _col0, _col1, _col6, _col7, _col8	
                          input vertices:	
                            1 Map 2	
                          Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                          Select Operator	
                            expressions: _col0 (type: string), _col1 (type: string), '2022-09-08' (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)	
                            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5	
                            Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                            File Output Operator	
                              compressed: false	
                              Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                              table:	
                                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat	
                                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat	
                                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
                Local Work:	
                  Map Reduce Local Work	
    	
      Stage: Stage-0	
        Fetch Operator	
          limit: -1	
          Processor Tree:	
            ListSink
    
    • Form 2
    Explain	
    STAGE DEPENDENCIES:	
      Stage-2 is a root stage	
      Stage-1 depends on stages: Stage-2	
      Stage-0 depends on stages: Stage-1	
    	
    STAGE PLANS:	
      Stage: Stage-2	
        Spark	
          DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53395	
          Vertices:	
            Map 2 	
                Map Operator Tree:	
                    TableScan	
                      alias: b	
                      Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE	
                      Spark HashTable Sink Operator	
                      // filtered once here
                        filter predicates:	
                          0 {(dt = '2022-09-08')}	
                          1 	
                        keys:	
                          0 id (type: string)	
                          1 id (type: string)	
                Local Work:	
                  Map Reduce Local Work	
    	
      Stage: Stage-1	
        Spark	
          DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53394	
          Vertices:	
            Map 1 	
                Map Operator Tree:	
                    TableScan
                      // the table scan has no filter here, so the filter has to be applied in the HashTable Sink Operator / Map Join Operator of each stage
                      alias: a	
                      Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE	
                      Map Join Operator	
                        condition map:	
                             Left Outer Join0 to 1	
                        // filtered a second time here
                        filter predicates:	
                          0 {(dt = '2022-09-08')}	
                          1 	
                        keys:	
                          0 id (type: string)	
                          1 id (type: string)	
                        outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8	
                        input vertices:	
                          1 Map 2	
                        Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE	
                        Select Operator	
                          expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)	
                          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5	
                          Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE	
                          File Output Operator	
                            compressed: false	
                            Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE	
                            table:	
                                input format: org.apache.hadoop.mapred.SequenceFileInputFormat	
                                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat	
                                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
                Local Work:	
                  Map Reduce Local Work	
    	
      Stage: Stage-0	
        Fetch Operator	
          limit: -1	
          Processor Tree:	
            ListSink	
    

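    As the annotations show, in Form 1 the predicate is pushed down, so rows are filtered out right at the table scan. In Form 2 the predicate is not pushed down: all rows are read and fed into the join, and the filter is then applied more than once along the way.

    Hive predicate pushdown

    The concept

    Predicate Pushdown (PPD): in short, apply filter conditions as early as possible without changing the result. After pushdown the filter runs on the map side, which reduces the map output, cuts the amount of data moved across the cluster, saves cluster resources, and improves job performance.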
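
    To make the idea concrete, here is a toy model in plain Python, not Hive code. It joins two in-memory lists and compares filtering after the join with filtering before it. The helper hash_join, the function wanted, and the nine sample rows are all made up for illustration.

```python
def hash_join(left, right):
    """Inner-join two lists of dicts on the 'id' key (toy hash join)."""
    index = {}
    for row in right:
        index.setdefault(row["id"], []).append(row)
    return [(left_row, right_row)
            for left_row in left
            for right_row in index.get(left_row["id"], [])]


# Hypothetical data: three ids on each of three dates, i.e. 9 rows in a.
a_rows = [{"id": str(i), "dt": dt}
          for dt in ("2022-09-06", "2022-09-07", "2022-09-08")
          for i in (1, 2, 3)]
b_rows = [{"id": "1", "dept": "BD"}, {"id": "2", "dept": "BE"}]


def wanted(row):
    """The predicate: keep only the latest date."""
    return row["dt"] == "2022-09-08"


# Without pushdown: all 9 rows of a enter the join, the filter runs afterwards.
late = [(l_row, r_row) for (l_row, r_row) in hash_join(a_rows, b_rows) if wanted(l_row)]

# With pushdown: a is filtered first, so only 3 rows enter the join.
a_filtered = [row for row in a_rows if wanted(row)]
early = hash_join(a_filtered, b_rows)

print(len(a_rows), len(a_filtered))  # rows entering the join: 9 vs 3
print(late == early)                 # same result either way for an inner join
```

    The result is identical, but the pushed-down version hands the join a third of the rows. On a cluster that difference is data that never has to be shuffled.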

    PPD configuration

    PPD is controlled by the parameter hive.optimize.ppd, which is enabled by default.

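    Basic concepts

    • Preserved Row table: in an outer join, the table that must return all of its rows.
      In a left outer join, the left table is the preserved table;
      in a right outer join, the right table is the preserved table;
      in a full outer join, both tables must return all of their rows, so both are preserved tables.

    • Null Supplying table: in an outer join, the table whose columns are filled with NULL for rows that find no match.
      In a left outer join, all rows of the left table are returned, and for left rows with no match the right table's columns are NULL, so the right table is the null-supplying table;
      in a right outer join, the left table is the null-supplying table;
      in a full outer join, both tables are null-supplying tables, because either side may be filled with NULL for unmatched rows.

    • During Join predicate: a predicate in the Join On clause. For example, in a join b on a.id=1, the predicate a.id=1 is a during-join predicate.

    • After Join predicate: a predicate in the where clause.
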
    Official documentation

    The logic can be summarized by these two rules:

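    1. During Join predicates cannot be pushed past Preserved Row tables. (A predicate on the preserved table that is written in the join condition cannot be pushed down.)
    2. After Join predicates cannot be pushed past Null Supplying tables. (A predicate on the null-supplying table that is written after the join, i.e. in where, cannot be pushed down.)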

    This is captured in the following table:

                    | Preserved Row Table | Null Supplying Table
    ----------------|---------------------|---------------------
    Join Predicate  | Case J1: Not Pushed | Case J2: Pushed
    Where Predicate | Case W1: Pushed     | Case W2: Not Pushed

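    For the individual cases see the official wiki, which analyzes the execution plans in some detail: https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

    Summary of the rules
    1. A predicate on the preserved table cannot be pushed down when written in the join condition; put it in where.
    2. A predicate on the null-supplying table cannot be pushed down when written after the join; put it in on.
    3. For a plain join (inner join), pushdown works whether the filter is in the join condition or in where.
    4. For a full join, pushdown does not work whether the filter is in the join condition or in where.

    Concrete cases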

    Pushed or Not | SQL
    --------------|----
    Pushed        | select * from a join b on a.id = b.id and a.dt = '2022-09-08';
    Pushed        | select * from a join b on a.id = b.id where a.dt = '2022-09-08';
    Pushed        | select * from a join b on a.id = b.id and b.dt = '2022-09-08';
    Pushed        | select * from a join b on a.id = b.id where b.dt = '2022-09-08';
    Not Pushed    | select * from a left join b on a.id = b.id and a.dt = '2022-09-08';
    Pushed        | select * from a left join b on a.id = b.id where a.dt = '2022-09-08';
    Pushed        | select * from a left join b on a.id = b.id and b.dt = '2022-09-08';
    Not Pushed    | select * from a left join b on a.id = b.id where b.dt = '2022-09-08';
    Pushed        | select * from a right join b on a.id = b.id and a.dt = '2022-09-08';
    Not Pushed    | select * from a right join b on a.id = b.id where a.dt = '2022-09-08';
    Not Pushed    | select * from a right join b on a.id = b.id and b.dt = '2022-09-08';
    Pushed        | select * from a right join b on a.id = b.id where b.dt = '2022-09-08';
    Not Pushed    | select * from a full join b on a.id = b.id and a.dt = '2022-09-08';
    Not Pushed    | select * from a full join b on a.id = b.id where a.dt = '2022-09-08';
    Not Pushed    | select * from a full join b on a.id = b.id and b.dt = '2022-09-08';
    Not Pushed    | select * from a full join b on a.id = b.id where b.dt = '2022-09-08';

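    Rule table

    Join type         | Table       | Predicate in join (on) | Predicate in where
    ------------------|-------------|------------------------|-------------------
    join (inner join) | left table  | Pushed                 | Pushed
    join (inner join) | right table | Pushed                 | Pushed
    left outer join   | left table  | Not Pushed             | Pushed
    left outer join   | right table | Pushed                 | Not Pushed
    right outer join  | left table  | Pushed                 | Not Pushed
    right outer join  | right table | Not Pushed             | Pushed
    full outer join   | left table  | Not Pushed             | Not Pushed
    full outer join   | right table | Not Pushed             | Not Pushed
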
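
    The rule table is regular enough to fit in a few lines of code. The function below is a hypothetical helper, not part of Hive: it only encodes the four summary rules of this article, so that a given combination can be looked up or the whole table regenerated.

```python
def is_pushed(join_type, table_side, clause):
    """Return True if, per the rules in this article, the predicate is pushed down.

    join_type:  "inner", "left", "right" or "full"
    table_side: "left" or "right" (the table the predicate filters)
    clause:     "on" or "where"
    """
    if join_type == "inner":
        return True   # rule 3: pushed wherever it is written
    if join_type == "full":
        return False  # rule 4: both tables are preserved and null-supplying
    # A left join preserves the left table, a right join the right table.
    preserved = (table_side == join_type)
    if clause == "on":
        return not preserved  # rule 1: on-predicates stop at the preserved table
    return preserved          # rule 2: where-predicates stop at the null-supplying table


for join_type in ("inner", "left", "right", "full"):
    for table_side in ("left", "right"):
        for clause in ("on", "where"):
            verdict = "Pushed" if is_pushed(join_type, table_side, clause) else "Not Pushed"
            print(f"{join_type:5} join, {table_side:5} table, {clause:5}: {verdict}")
```

    For example, is_pushed("left", "left", "on") is False, which is exactly the slow Form 2 from the beginning of the article.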
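    Special notes

    Predicates that use non-deterministic functions such as rand() cannot be pushed down. unix_timestamp() is an exception: its execution plan shows that it can be pushed down (in the plan below it already appears as a constant in filterExpr at the table scan).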

    EXPLAIN
    SELECT *
    FROM a
    LEFT JOIN b ON a.id = b.id
    WHERE a.dt = unix_timestamp();
    
    Explain	
    STAGE DEPENDENCIES:	
      Stage-2 is a root stage	
      Stage-1 depends on stages: Stage-2	
      Stage-0 depends on stages: Stage-1	
    	
    STAGE PLANS:	
      Stage: Stage-2	
        Spark	
          DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53432	
          Vertices:	
            Map 2 	
                Map Operator Tree:	
                    TableScan	
                      alias: b	
                      Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE	
                      Spark HashTable Sink Operator	
                        // no filter needed
                        keys:	
                          0 id (type: string)	
                          1 id (type: string)	
                Local Work:	
                  Map Reduce Local Work	
    	
      Stage: Stage-1	
        Spark	
          DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53431	
          Vertices:	
            Map 1 	
                Map Operator Tree:	
                    TableScan	
                      alias: a	
                      // already filtered at the table scan
                      filterExpr: (dt = 1662522398) (type: boolean)	
                      Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE	
                      Filter Operator	
                        predicate: (dt = 1662522398) (type: boolean)	
                        Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE	
                        Map Join Operator	
                          condition map:	
                               Left Outer Join0 to 1	
                          keys:	
                            0 id (type: string)	
                            1 id (type: string)	
                          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8	
                          input vertices:	
                            1 Map 2	
                          Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                          Select Operator	
                            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)	
                            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5	
                            Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                            File Output Operator	
                              compressed: false	
                              Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE	
                              table:	
                                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat	
                                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat	
                                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
                Local Work:	
                  Map Reduce Local Work	
    	
      Stage: Stage-0	
        Fetch Operator	
          limit: -1	
          Processor Tree:	
            ListSink
    

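    Conclusion

    1. For Join (Inner Join) and Full outer Join, it makes no performance difference whether the condition is written after on or after where.
    2. For Left outer Join, performance improves when conditions on the right table are written after on and conditions on the left table are written after where.
    3. For Right outer Join, performance improves when conditions on the left table are written after on and conditions on the right table are written after where.
    4. When the conditions are spread over both tables, pushdown follows conclusions 2 and 3 independently for each condition, as shown below: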

    SQL | When the filters run
    ----|---------------------
    select * from a left outer join b on (a.id = b.id and a.dt = '2022-09-08' and b.id = '2022-09-08'); | id filtered on the map side, dt on the reduce side; inefficient
    select * from a left outer join b on (a.id = b.id and b.id = '2022-09-08') where a.dt = '2022-09-08'; | id and dt both filtered on the map side; efficient
    select * from a left outer join b on (a.id = b.id and a.dt = '2022-09-08') where b.id = '2022-09-08'; | id and dt both filtered on the reduce side; very inefficient
    select * from a left outer join b on (a.id = b.id) where a.dt = '2022-09-08' and b.id = '2022-09-08'; | id filtered on the reduce side, dt on the map side; inefficient
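
    One caveat when picking among these forms: in an outer join, moving a condition between on and where can also change the rows that come back, so the choice is not purely about speed. The sketch below checks this for a condition on the right table of a left join. It again uses Python's sqlite3 as a stand-in for Hive, loads the article's sample rows, and the predicate b.dept = 'BD' is picked only for illustration.

```python
import sqlite3

# SQLite stands in for Hive here (assumption); the rows are the article's sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id TEXT, name TEXT, dt TEXT);
    CREATE TABLE b (id TEXT, dept TEXT, dt TEXT);
    INSERT INTO a VALUES ('1', 'Daniel', '2022-09-08'),
                         ('2', 'Andy',   '2022-09-08'),
                         ('3', 'Marc',   '2022-09-08');
    INSERT INTO b VALUES ('1', 'BD', '2022-09-08'),
                         ('2', 'BE', '2022-09-08');
""")

# Right-table condition in ON: every row of a survives, unmatched ones get NULL.
in_on = conn.execute("""
    SELECT a.id, b.dept
    FROM a LEFT JOIN b ON a.id = b.id AND b.dept = 'BD'
    ORDER BY a.id
""").fetchall()

# Same condition in WHERE: rows whose b.dept is NULL or different are dropped.
in_where = conn.execute("""
    SELECT a.id, b.dept
    FROM a LEFT JOIN b ON a.id = b.id
    WHERE b.dept = 'BD'
    ORDER BY a.id
""").fetchall()

print(in_on)     # three rows, two of them with dept None
print(in_where)  # a single row
```

    So first decide which result the requirement actually calls for, and only then use the pushdown rules to choose the faster of the forms that produce it.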
  • Original article: https://blog.csdn.net/a805814077/article/details/126777345