笛卡尔积会产生大量的重复记录,是引起主键重复的主要原因,但是对有些业务或者数据的处理笛卡尔积又是必不可少的技巧。如何在MaxCompute(ODPS)对表实现笛卡尔积,几种方案记录如下:
无条件内连接结果与笛卡尔积(交叉连接)结果一致。但在odps里无条件内连接的话会认为是语法错误,且增加连接条件on 1=1 也是不被通过的。
比如程序1:
SELECT *
FROM (
SELECT grid_id
,business_district_name
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('3335','1896')
) a INNER
JOIN (
SELECT grid_id
,business_district_name
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('125475','125603','125459','125595')
) b
;
会报错,这种情况则需要创造一个连接条件:
● Step1:在被关联的两个表同时加上一个link字段,用一个相同的值来赋值。
表A -> 表A’ A’比A多一个link字段,赋值1
表B -> 表B’ B’比B多一个link字段,赋值1
● Step2:将 A’表和B’表关联,关联条件为 A’.link = B’.link
对程序1就可以进行修改,得到程序2:
SELECT *
FROM (
SELECT grid_id
,business_district_name
,'1' AS orderID
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('3335','1896')
) a INNER
JOIN (
SELECT grid_id
,business_district_name
,'1' AS orderID
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('125475','125603','125459','125595')
) b
ON a.orderID = b.orderID
;
可以得到笛卡尔积后的表。
特别的,对于一些已经有关联条件的,比如程序3。
无条件内连接结果与笛卡尔积(交叉连接)结果一致。但在odps里无条件内连接的话会认为是语法错误,所以该方法仅适合有关联条件的。比如对于程序3:
SELECT business_district_id
,grid_id
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
对应表输出为:
business_district_id business_district_name grid_id
21593 1商圈 3335
21593 1商圈 1896
21673 2商圈 17412
21673 2商圈 24541
21673 2商圈 21707
21705 3商圈 125475
21705 3商圈 125603
21705 3商圈 125459
21705 3商圈 125595
需求是将属于同一business_district_id下的grid_id进行交叉连接,求笛卡尔积同样可以采用上述方法,见程序4:
SELECT *
FROM (
SELECT business_district_id
,business_district_name
,grid_id
,'1' AS orderID
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
) a INNER
JOIN (
SELECT business_district_id
,business_district_name
,grid_id
,'1' AS orderID
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
) b
ON a.business_district_id = b.business_district_id
AND a.orderID = b.orderID
;
因为已经有了关联条件a.business_district_id = b.business_district_id,所以也可以直接进行条件内连接,见程序5:
SELECT *
FROM (
SELECT business_district_id
,grid_id
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
) a INNER
JOIN (
SELECT business_district_id
,grid_id
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('3335','1896','17412','24541','21707','125475','125603','125459','125595')
) b
ON a.business_district_id = b.business_district_id
;
程序4和程序5均可以得到相同的笛卡尔积后表
采用MAPJOIN实现笛卡尔积的情况适用有小表的存在,对小表进行MAPJOIN。
还是对程序2,可以采用MAPJOIN实现笛卡尔积:
SELECT /*+ MAPJOIN(a) */
a.*
,b.*
FROM (
SELECT grid_id
,business_district_name
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('3335','1896')
) a INNER
JOIN (
SELECT grid_id
,business_district_name
FROM table
WHERE ds = '20211124'
AND business_district_id IN ('21673','21593','21705')
AND grid_id IN ('125475','125603','125459','125595')
) b
;
得到表:
grid_id business_district_name grid_id2 business_district_name2
3335 1商圈 125475 2商圈
3335 1商圈 125603 2商圈
3335 1商圈 125459 2商圈
3335 1商圈 125595 2商圈
1896 1商圈 125475 2商圈
1896 1商圈 125603 2商圈
1896 1商圈 125459 2商圈
1896 1商圈 125595 2商圈