• HiveQL中left join误区和多join的执行顺序


    1. join中的on条件和where条件的区别

    --建表
    create table t1(id int, value int) partitioned by (ds string);
    create table t2(id int, value int) partitioned by (ds string);
    create table t3(c1 int, c2 int, c3 int);
    
    --数据装载,t1表
    insert overwrite table t1 partition(ds='20220120') select '1','2022';
    insert overwrite table t1 partition(ds='20220121') select '2','2022';
    insert overwrite table t1 partition(ds='20220122') select '2','2022';
    
    --数据装载,t2表
    insert overwrite table t2 partition(ds='20220120') select '1','120';
    insert overwrite table t2 partition(ds='20220121') select '1','120';
    insert into table t2 partition(ds='20220121') select '3','120';
    
    --数据装载,t3表
    insert into table t3 select '1','33','33';
    insert into table t3 select '1','34','33';
    insert into table t3 select '3','33','33';
    insert into table t3 select '4','33','33';
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    1.1. left join

    SQL案例

    SELECT  *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id 
    where t1.ds = '20220120';
    -- 执行结果为
    +--------+-----------+-----------+--------+-----------+-----------+
    | t1.id  | t1.value  |   t1.ds   | t2.id  | t2.value  |   t2.ds   |
    +--------+-----------+-----------+--------+-----------+-----------+
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220121  |
    +--------+-----------+-----------+--------+-----------+-----------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    对于上述执行结果相信并没有任何的问题,实际执行的结果和我们的预期是完全一致的。

    SELECT  *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id AND t1.ds = '20220120';
    -- 执行结果为,t1.ds = '20220120'条件未生效
    +--------+-----------+-----------+--------+-----------+-----------+
    | t1.id  | t1.value  |   t1.ds   | t2.id  | t2.value  |   t2.ds   |
    +--------+-----------+-----------+--------+-----------+-----------+
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220121  |
    | 2      | 2022      | 20220121  | NULL   | NULL      | NULL      |
    | 2      | 2022      | 20220122  | NULL   | NULL      | NULL      |
    +--------+-----------+-----------+--------+-----------+-----------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13

    执行结果中返回了t1表中的全部数据(是不是和预期的结果并不一样),从结果看 join 条件中的 AND t1.ds = '20220120' 并没有“生效”,为什么这样?

    而如下所示,在on条件中对t2表进行筛选 t2.ds = '20220120',则返回结果中过滤了指定的数据。

    SELECT  *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id AND t2.ds = '20220120';
    -- 执行结果为,t2.ds = '20220120'条件生效
    +--------+-----------+-----------+--------+-----------+-----------+
    | t1.id  | t1.value  |   t1.ds   | t2.id  | t2.value  |   t2.ds   |
    +--------+-----------+-----------+--------+-----------+-----------+
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  |
    | 2      | 2022      | 20220121  | NULL   | NULL      | NULL      |
    | 2      | 2022      | 20220122  | NULL   | NULL      | NULL      |
    +--------+-----------+-----------+--------+-----------+-----------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    分别查看上述三个SQL的执行计划

    SELECT  *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id 
    where t1.ds = '20220120';
    
    == Physical Plan ==
    *(2) BroadcastHashJoin [id#24], [id#27], LeftOuter, BuildRight
    :- *(2) FileScan orc zhanglei.t1[id#24,value#25,ds#26] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t1/ds=20220120], PartitionCount: 1, PartitionFilters: [isnotnull(ds#26), (ds#26 = 20220120)], PushedFilters: [], ReadSchema: struct
    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
       +- *(1) Project [id#27, value#28, ds#29]
          +- *(1) Filter isnotnull(id#27)
             +- *(1) FileScan orc zhanglei.t2[id#27,value#28,ds#29] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t2], PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13

    从上述执行计划中得知where条件中t1.ds = '20220120'生效,体现在对表t1的读取上只读取了ds=20220120分区。

    SELECT  *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id AND t1.ds = '20220120';
    
    == Physical Plan ==
    *(2) BroadcastHashJoin [id#32], [id#35], LeftOuter, BuildRight, (ds#34 = 20220120)
    :- *(2) FileScan orc zhanglei.t1[id#32,value#33,ds#34] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t1], PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
       +- *(1) Project [id#35, value#36, ds#37]
          +- *(1) Filter isnotnull(id#35)
             +- *(1) FileScan orc zhanglei.t2[id#35,value#36,ds#37] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t2], PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    从上述执行计划中得知on条件中t1.ds = '20220120'并没有生效,因为并没有过滤条件,仅仅在BroadcastHashJoin中进行了体现。

    SELECT  *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id AND t2.ds = '20220120';
    
    == Physical Plan ==
    *(2) BroadcastHashJoin [id#40], [id#43], LeftOuter, BuildRight
    :- *(2) FileScan orc zhanglei.t1[id#40,value#41,ds#42] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t1], PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
       +- *(1) Project [id#43, value#44, ds#45]
          +- *(1) Filter isnotnull(id#43)
             +- *(1) FileScan orc zhanglei.t2[id#43,value#44,ds#45] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://bdptest/warehouse/tablespace/managed/hive/zhanglei.db/t2/ds=20220120], PartitionCount: 1, PartitionFilters: [isnotnull(ds#45), (ds#45 = 20220120)], PushedFilters: [IsNotNull(id)], ReadSchema: struct
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    从上述执行计划中得知where条件中t2.ds = '20220120'生效,体现在对表t2的读取上只读取了ds=20220120分区。

    1.2. right join

    SELECT  *
    FROM t1
    RIGHT JOIN t2
    ON t1.id = t2.id AND t1.ds = '20220121';
    -- 执行结果,t1.ds = '20220121'的条件生效
    +--------+-----------+--------+--------+-----------+-----------+
    | t1.id  | t1.value  | t1.ds  | t2.id  | t2.value  |   t2.ds   |
    +--------+-----------+--------+--------+-----------+-----------+
    | NULL   | NULL      | NULL   | 1      | 120       | 20220120  |
    | NULL   | NULL      | NULL   | 1      | 120       | 20220121  |
    | NULL   | NULL      | NULL   | 3      | 120       | 20220121  |
    +--------+-----------+--------+--------+-----------+-----------+
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    SELECT  *
    FROM t1
    RIGHT JOIN t2
    ON t1.id = t2.id AND t2.ds = '20220120';
    -- 执行结果为,t2.ds = '20220120' 的条件未生效
    +--------+-----------+-----------+--------+-----------+-----------+
    | t1.id  | t1.value  |   t1.ds   | t2.id  | t2.value  |   t2.ds   |
    +--------+-----------+-----------+--------+-----------+-----------+
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  |
    | NULL   | NULL      | NULL      | 1      | 120       | 20220121  |
    | NULL   | NULL      | NULL      | 3      | 120       | 20220121  |
    +--------+-----------+-----------+--------+-----------+-----------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12

    !!! note “”
    从上述执行结果中看,left join的on中t1(左表)的条件并不会生效,但是t2(右表)的条件会生效。right join的on中(右表)的条件并不会生效,但是左表的条件会生效

    问题:在left join的on条件中为什么左表的筛选条件不会生效,而右表的筛选条件会生效???

    1.3. inner join

    SELECT  *
    FROM t1
    JOIN t2
    ON t1.id = t2.id 
    where t1.ds = '20220120';
    --
    SELECT  *
    FROM t1
    JOIN t2
    ON t1.id = t2.id AND t1.ds = '20220120';
    -- 以上两个SQL执行结果相同
    +--------+-----------+-----------+--------+-----------+-----------+
    | t1.id  | t1.value  |   t1.ds   | t2.id  | t2.value  |   t2.ds   |
    +--------+-----------+-----------+--------+-----------+-----------+
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220121  |
    +--------+-----------+-----------+--------+-----------+-----------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    在inner join中 on 和 where 条件中的条件都会正常生效。

    SELECT  *
    FROM t1
    JOIN t2
    ON t1.id = t2.id AND t2.ds = '20220120';
    -- 执行结果如下
    +--------+-----------+-----------+--------+-----------+-----------+
    | t1.id  | t1.value  |   t1.ds   | t2.id  | t2.value  |   t2.ds   |
    +--------+-----------+-----------+--------+-----------+-----------+
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  |
    +--------+-----------+-----------+--------+-----------+-----------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10

    !!! note “”
    在inner join中on中的条件无论左右表的条件都会生效。

    2. 多个left join执行顺序

    继续给t2和t3表中插入测试数据

    insert into table t2 partition(ds='20220121') select '3','120';
    
    insert into table t3 select '1','33','33';
    insert into table t3 select '1','34','33';
    insert into table t3 select '3','33','33';
    insert into table t3 select '4','33','33';
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    案例SQL

    SELECT  *
    FROM t1
    LEFT JOIN t2
    ON t1.id = t2.id
    LEFT JOIN t3
    ON t2.id = c1;
    -- 执行结果为
    +--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+
    | t1.id  | t1.value  |   t1.ds   | t2.id  | t2.value  |   t2.ds   | t3.c1  | t3.c2  | t3.c3  |
    +--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  | 1      | 33     | 33     |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  | 1      | 34     | 33     |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220121  | 1      | 33     | 33     |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220121  | 1      | 34     | 33     |
    | 2      | 2022      | 20220122  | NULL   | NULL      | NULL      | NULL   | NULL   | NULL   |
    | 2      | 2022      | 20220121  | NULL   | NULL      | NULL      | NULL   | NULL   | NULL   |
    +--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    从结果得知,多个表进行left join时,是按照join顺序进行的。即先由t1和t2表left join形成一个虚拟表后,再和t3表进行left join成最终结果

    3. 同时有inner join和left join时的执行顺序和结果

    SELECT  *
    FROM t2
    JOIN t3
    ON t2.id = t3.c1;
    -- 执行结果如下
    +--------+-----------+-----------+--------+--------+--------+
    | t2.id  | t2.value  |   t2.ds   | t3.c1  | t3.c2  | t3.c3  |
    +--------+-----------+-----------+--------+--------+--------+
    | 1      | 120       | 20220121  | 1      | 33     | 33     |
    | 1      | 120       | 20220120  | 1      | 33     | 33     |
    | 1      | 120       | 20220121  | 1      | 34     | 33     |
    | 1      | 120       | 20220120  | 1      | 34     | 33     |
    | 3      | 120       | 20220121  | 3      | 33     | 33     |
    +--------+-----------+-----------+--------+--------+--------+
    
    SELECT  *
    FROM t1
    left join t2
    on t1.id = t2.id
    JOIN t3
    ON t2.id = t3.c1;
    -- 执行结果如下
    +--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+
    | t1.id  | t1.value  |   t1.ds   | t2.id  | t2.value  |   t2.ds   | t3.c1  | t3.c2  | t3.c3  |
    +--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  | 1      | 33     | 33     |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220121  | 1      | 33     | 33     |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220120  | 1      | 34     | 33     |
    | 1      | 2022      | 20220120  | 1      | 120       | 20220121  | 1      | 34     | 33     |
    +--------+-----------+-----------+--------+-----------+-----------+--------+--------+--------+
    
    SELECT  *
    FROM t2
    JOIN t3
    ON t2.id = t3.c1
    left join t1
    on t1.id = t2.id;
    -- 执行结果如下
    +--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+
    | t2.id  | t2.value  |   t2.ds   | t3.c1  | t3.c2  | t3.c3  | t1.id  | t1.value  |   t1.ds   |
    +--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+
    | 1      | 120       | 20220120  | 1      | 33     | 33     | 1      | 2022      | 20220120  |
    | 1      | 120       | 20220121  | 1      | 33     | 33     | 1      | 2022      | 20220120  |
    | 1      | 120       | 20220120  | 1      | 34     | 33     | 1      | 2022      | 20220120  |
    | 1      | 120       | 20220121  | 1      | 34     | 33     | 1      | 2022      | 20220120  |
    | 3      | 120       | 20220121  | 3      | 33     | 33     | NULL   | NULL      | NULL      |
    +--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+
    
    SELECT  *
    FROM t2
    JOIN t3
    ON t2.id = t3.c1
    right join t1
    on t1.id = t2.id;
    -- 执行结果如下
    +--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+
    | t2.id  | t2.value  |   t2.ds   | t3.c1  | t3.c2  | t3.c3  | t1.id  | t1.value  |   t1.ds   |
    +--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+
    | 1      | 120       | 20220120  | 1      | 33     | 33     | 1      | 2022      | 20220120  |
    | 1      | 120       | 20220120  | 1      | 34     | 33     | 1      | 2022      | 20220120  |
    | 1      | 120       | 20220121  | 1      | 33     | 33     | 1      | 2022      | 20220120  |
    | 1      | 120       | 20220121  | 1      | 34     | 33     | 1      | 2022      | 20220120  |
    | NULL   | NULL      | NULL      | NULL   | NULL   | NULL   | 2      | 2022      | 20220121  |
    | NULL   | NULL      | NULL      | NULL   | NULL   | NULL   | 2      | 2022      | 20220122  |
    +--------+-----------+-----------+--------+--------+--------+--------+-----------+-----------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64
    • 65

    4. SQL中存在多个join时的join执行顺序和结果

    从上述的示例中可以得知,当SQL中存在多个join(无论join类型))时,执行时按照join的前后顺序,前两个表join出一个虚拟的表,再和第三个表进行join,依次往后执行

  • 相关阅读:
    企业如何安全跨国传输30T文件数据
    【JavaWeb】数据库相关
    《Node.js 学习笔记 之 切换node版本》
    设计模式-工厂设计模式
    PGP软件安装文件加密&解密&签名实践记录
    EMQX Cloud更新:数据集成新增 HStreamDB & Tablestore
    深度学习中的激活函数 – 完整概述
    基于PTP的同步时钟同步
    PDF处理控件aspose.PDF功能演示:将 PDF 转换为Excel
    最新版本vscode 真正解决用vscode + unity搭配开发没有代码智能提示 OmniSharp服务启动 vscode调试unity准备
  • 原文地址:https://blog.csdn.net/qq_33446500/article/details/126448817