• MaxCompute(ODPS)实现笛卡尔积


    笛卡尔积会产生大量的重复记录,是引起主键重复的主要原因,但是对有些业务或者数据的处理笛卡尔积又是必不可少的技巧。如何在MaxCompute(ODPS)对表实现笛卡尔积,几种方案记录如下:

    方法1:条件内连接

    无条件内连接结果与笛卡尔积(交叉连接)结果一致。但在odps里无条件内连接的话会认为是语法错误,且增加连接条件on 1=1 也是不被通过的。

    比如程序1:

    SELECT  *
    FROM    (
                SELECT  grid_id
                        ,business_district_name
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896')
            ) a INNER
    JOIN    (
                SELECT  grid_id
                        ,business_district_name
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('125475','125603','125459','125595')
            ) b
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18

    会报错,这种情况则需要创造一个连接条件:

    ● Step1:在被关联的两个表同时加上一个link字段,用一个相同的值来赋值。
    表A -> 表A’ A’比A多一个link字段,赋值1
    表B -> 表B’ B’比B多一个link字段,赋值1
    ● Step2:将 A’表和B’表关联,关联条件为 A’.link = B’.link

    对程序1就可以进行修改,得到程序2:

    SELECT  *
    FROM    (
                SELECT  grid_id
                        ,business_district_name
                        ,'1' AS orderID
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896')
            ) a INNER
    JOIN    (
                SELECT  grid_id
                        ,business_district_name
                        ,'1' AS orderID
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('125475','125603','125459','125595')
            ) b
    ON      a.orderID = b.orderID
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21

    可以得到笛卡尔积后的表。

    特别的,对于一些已经有关联条件的,比如程序3。
    无条件内连接结果与笛卡尔积(交叉连接)结果一致。但在odps里无条件内连接的话会认为是语法错误,所以该方法仅适合有关联条件的。比如对于程序3:

    SELECT  business_district_id
            ,grid_id
    FROM    table
    WHERE   ds = '20211124'
    AND     business_district_id IN ('21673','21593','21705')
    AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    对应表输出为:
    business_district_id business_district_name grid_id
    21593 1商圈 3335
    21593 1商圈 1896
    21673 2商圈 17412
    21673 2商圈 24541
    21673 2商圈 21707
    21705 3商圈 125475
    21705 3商圈 125603
    21705 3商圈 125459
    21705 3商圈 125595

    需求是将属于同一business_district_id下的grid_id进行交叉连接,求笛卡尔积同样可以采用上述方法,见程序4:

    SELECT  *
    FROM    (
                SELECT  business_district_id
                        ,business_district_name
                        ,grid_id
                        ,'1' AS orderID
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
            ) a INNER
    JOIN    (
                SELECT  business_district_id
                        ,business_district_name
                        ,grid_id
                        ,'1' AS orderID
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
            ) b
    ON      a.business_district_id = b.business_district_id
    AND     a.orderID = b.orderID
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24

    因为已经有了关联条件a.business_district_id = b.business_district_id,所以也可以直接进行条件内连接,见程序5:

    SELECT  *
    FROM    (
                SELECT  business_district_id
                        ,grid_id
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
            ) a INNER
    JOIN    (
                SELECT  business_district_id
                        ,grid_id
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
            ) b
    ON      a.business_district_id = b.business_district_id
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19

    程序4和程序5均可以得到相同的笛卡尔积后表

    方法2:采用MAPJOIN

    采用MAPJOIN实现笛卡尔积的情况适用有小表的存在,对小表进行MAPJOIN。
    还是对程序2,可以采用MAPJOIN实现笛卡尔积:

    SELECT /*+ MAPJOIN(a) */
            a.*
            ,b.*
    FROM    (
                SELECT  grid_id
                        ,business_district_name
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896')
            ) a INNER
    JOIN    (
                SELECT  grid_id
                        ,business_district_name
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('125475','125603','125459','125595')
            ) b
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    得到表:

    grid_id	business_district_name	grid_id2	business_district_name2
    3335	1商圈	125475	2商圈
    3335	1商圈	125603	2商圈
    3335	1商圈	125459	2商圈
    3335	1商圈	125595	2商圈
    1896	1商圈	125475	2商圈
    1896	1商圈	125603	2商圈
    1896	1商圈	125459	2商圈
    1896	1商圈	125595	2商圈
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
  • 相关阅读:
    ffmpeg-android studio创建jni项目
    【微服务】RestClient操作文档
    Android 音乐播放器悬浮窗
    【Linux】yum及vim
    Go 语言接口及使用接口实现链表插入
    Lecture 3 Process Concept(进程概念)
    duilib 入门基础十二 样式类
    2022年最新前端面试题,持续更新
    计算机毕业设计Java健康食谱系统服务器端(源码+系统+mysql数据库+lw文档)
    distinct 和 group by有什么区别
  • 原文地址:https://blog.csdn.net/yawei_liu1688/article/details/126563273