• MaxCompute(ODPS)实现笛卡尔积


    笛卡尔积会产生大量的重复记录,是引起主键重复的主要原因,但是对有些业务或者数据的处理笛卡尔积又是必不可少的技巧。如何在MaxCompute(ODPS)对表实现笛卡尔积,几种方案记录如下:

    方法1:条件内连接

    无条件内连接结果与笛卡尔积(交叉连接)结果一致。但在odps里无条件内连接的话会认为是语法错误,且增加连接条件on 1=1 也是不被通过的。

    比如程序1:

    SELECT  *
    FROM    (
                SELECT  grid_id
                        ,business_district_name
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896')
            ) a INNER
    JOIN    (
                SELECT  grid_id
                        ,business_district_name
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('125475','125603','125459','125595')
            ) b
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18

    会报错,这种情况则需要创造一个连接条件:

    ● Step1:在被关联的两个表同时加上一个link字段,用一个相同的值来赋值。
    表A -> 表A’ A’比A多一个link字段,赋值1
    表B -> 表B’ B’比B多一个link字段,赋值1
    ● Step2:将 A’表和B’表关联,关联条件为 A’.link = B’.link

    对程序1就可以进行修改,得到程序2:

    SELECT  *
    FROM    (
                SELECT  grid_id
                        ,business_district_name
                        ,'1' AS orderID
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896')
            ) a INNER
    JOIN    (
                SELECT  grid_id
                        ,business_district_name
                        ,'1' AS orderID
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('125475','125603','125459','125595')
            ) b
    ON      a.orderID = b.orderID
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21

    可以得到笛卡尔积后的表。

    特别的,对于一些已经有关联条件的,比如程序3。
    无条件内连接结果与笛卡尔积(交叉连接)结果一致。但在odps里无条件内连接的话会认为是语法错误,所以该方法仅适合有关联条件的。比如对于程序3:

    SELECT  business_district_id
            ,grid_id
    FROM    table
    WHERE   ds = '20211124'
    AND     business_district_id IN ('21673','21593','21705')
    AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    对应表输出为:
    business_district_id business_district_name grid_id
    21593 1商圈 3335
    21593 1商圈 1896
    21673 2商圈 17412
    21673 2商圈 24541
    21673 2商圈 21707
    21705 3商圈 125475
    21705 3商圈 125603
    21705 3商圈 125459
    21705 3商圈 125595

    需求是将属于同一business_district_id下的grid_id进行交叉连接,求笛卡尔积同样可以采用上述方法,见程序4:

    SELECT  *
    FROM    (
                SELECT  business_district_id
                        ,business_district_name
                        ,grid_id
                        ,'1' AS orderID
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
            ) a INNER
    JOIN    (
                SELECT  business_district_id
                        ,business_district_name
                        ,grid_id
                        ,'1' AS orderID
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
            ) b
    ON      a.business_district_id = b.business_district_id
    AND     a.orderID = b.orderID
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24

    因为已经有了关联条件a.business_district_id = b.business_district_id,所以也可以直接进行条件内连接,见程序5:

    SELECT  *
    FROM    (
                SELECT  business_district_id
                        ,grid_id
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
            ) a INNER
    JOIN    (
                SELECT  business_district_id
                        ,grid_id
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
            ) b
    ON      a.business_district_id = b.business_district_id
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19

    程序4和程序5均可以得到相同的笛卡尔积后表

    方法2:采用MAPJOIN

    采用MAPJOIN实现笛卡尔积的情况适用有小表的存在,对小表进行MAPJOIN。
    还是对程序2,可以采用MAPJOIN实现笛卡尔积:

    SELECT /*+ MAPJOIN(a) */
            a.*
            ,b.*
    FROM    (
                SELECT  grid_id
                        ,business_district_name
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('3335','1896')
            ) a INNER
    JOIN    (
                SELECT  grid_id
                        ,business_district_name
                FROM    table
                WHERE   ds = '20211124'
                AND     business_district_id IN ('21673','21593','21705')
                AND     grid_id IN ('125475','125603','125459','125595')
            ) b
    ;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    得到表:

    grid_id	business_district_name	grid_id2	business_district_name2
    3335	1商圈	125475	2商圈
    3335	1商圈	125603	2商圈
    3335	1商圈	125459	2商圈
    3335	1商圈	125595	2商圈
    1896	1商圈	125475	2商圈
    1896	1商圈	125603	2商圈
    1896	1商圈	125459	2商圈
    1896	1商圈	125595	2商圈
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
  • 相关阅读:
    [附源码]计算机毕业设计JAVA火车票预订系统2022
    rabbitmq消息投递失败
    淘宝/天猫api 收货地址列表 API接口
    MySQL—约束—演示(基础)
    微信小程序:炫酷恶趣制作神器小程序源码
    Nacos注册中心1-nacos源码环境搭建
    解决go-micro与其它gRPC框架之间的通信问题
    6.qml中js的object,array数据更新不通知页面刷新问题解析
    2.NVIDIA Deepstream开发指南中文版--自述文件
    乔布斯时代的“老人”,一个个都离开苹果了
  • 原文地址:https://blog.csdn.net/yawei_liu1688/article/details/126563273