• pg distinct 改写递归优化(德哥的思路)


    德哥的优化思路巨牛逼,这种递归思维真的太吊了,我目前就缺递归思路。

     

    下面SQL1000W行数据,列的选择性很低,只有两个值('1'和'11')都是字符串类型,'1'只有一条数据,'11'有9999999行数据。

    慢SQL:

    复制代码
    select distinct col from tt;
    
                                                          QUERY PLAN                                                      
    ----------------------------------------------------------------------------------------------------------------------
     HashAggregate  (cost=169247.11..169247.12 rows=1 width=3) (actual time=5082.733..5082.735 rows=2 loops=1)
       Group Key: col
       ->  Seq Scan on tt  (cost=0.00..144247.29 rows=9999929 width=3) (actual time=0.005..275.906 rows=10000000 loops=1)
     Planning Time: 0.365 ms
     Execution Time: 5082.772 ms
    (5 行记录)
    复制代码

    CTE递归优化:

    复制代码
    WITH RECURSIVE t AS (
       (SELECT col FROM tt ORDER BY col LIMIT 1)  
       UNION ALL
       SELECT (SELECT col FROM tt WHERE col > t.col ORDER BY col LIMIT 1)
       FROM t
       WHERE t.col IS NOT NULL
       )
    SELECT col FROM t WHERE col IS NOT NULL;
    
                                                                            QUERY PLAN                                                                        
    ----------------------------------------------------------------------------------------------------------------------------------------------------------
     CTE Scan on t  (cost=50.84..52.86 rows=100 width=38) (actual time=0.024..0.079 rows=2 loops=1)
       Filter: (col IS NOT NULL)
       Rows Removed by Filter: 1
       CTE t
         ->  Recursive Union  (cost=0.43..50.84 rows=101 width=38) (actual time=0.022..0.076 rows=3 loops=1)
               ->  Limit  (cost=0.43..0.46 rows=1 width=3) (actual time=0.021..0.021 rows=1 loops=1)
                     ->  Index Only Scan using idx_1_2_tt on tt tt_1  (cost=0.43..260443.37 rows=9999929 width=3) (actual time=0.020..0.020 rows=1 loops=1)
                           Heap Fetches: 0
               ->  WorkTable Scan on t t_1  (cost=0.00..4.84 rows=10 width=38) (actual time=0.017..0.017 rows=1 loops=3)
                     Filter: (col IS NOT NULL)
                     Rows Removed by Filter: 0
                     SubPlan 1
                       ->  Limit  (cost=0.43..0.46 rows=1 width=3) (actual time=0.024..0.024 rows=0 loops=2)
                             ->  Index Only Scan using idx_1_2_tt on tt  (cost=0.43..95149.36 rows=3333310 width=3) (actual time=0.024..0.024 rows=0 loops=2)
                                   Index Cond: (col > (t_1.col)::text)
                                   Heap Fetches: 0
     Planning Time: 0.096 ms
     Execution Time: 0.096 ms
    (18 行记录)
    复制代码

     

    里面的逻辑是:

    (SELECT col FROM tt ORDER BY col LIMIT 1)

      根节点通过order by 升序 找到最小的一条数据作为起点。

     

    递归查询:

    SELECT (SELECT col FROM tt WHERE col > t.col ORDER BY col LIMIT 1)
    FROM t
    WHERE t.col IS NOT NULL

      在第一次迭代中,CTE t 包含值'1'。这个查询将在tt表中寻找col大于'1'的最小值。在数据集中,这将是'11'。

      在第二次迭代,CTE t 将包含'11'。此时,查询将尝试找到大于'11'的最小值,但没有这样的值,所以返回NULL。

     

    递归结束:
      当递归查询返回NULL时,递归结束。这时,CTE t 将包含'1'和'11',返回和distinct 一样逻辑的数据。

     

    理解了整个逻辑后我都吓尿了,就一道算法题,确实要跟巨佬学习才行,加深递归思维。

     

  • 相关阅读:
    数字称重传感器——电子秤方案传感器应用
    Ceph Iscsi GW
    Centos7 安装 Nginx及启动命令
    Redis 事务
    java高级--SpringBoot篇
    CAA DMU模块仿真
    Google Earth Engine(GEE)——火灾数据下载(添加图例信息)
    MySQL进阶4,常见函数
    【Apifox】国产测试工具雄起
    HDLbits : Module addsub
  • 原文地址:https://www.cnblogs.com/yuzhijian/p/18067341