• POSTGRESQL 一个“大” SQL 的优化历险记


    afff0cd90ef76f1aa6afd9f06e794d15.png

    POSTGRESQL 在SQL 的处理能力上,在开源数据库上面,难逢敌手, 而开始为难POSTGRESQL的语句也就开始层出不穷。下面的SQL可以称之为一个"BIG" SQL。(基于信息保护机制,部分语句中的名字被改变)

      WITH wmsj AS (SELECT wod.de_from AS deFrom,
                    SUM(wop.income) AS incomeMoney
                    FROM o2o_public AS wop INNER JOIN wuu_o2o_delivery AS wod ON wop.de_id = wod.de_id
                    WHERE wop.modify_time BETWEEN CAST('2022-xx-xx 06:00:00' AS TIMESTAMP) AND CAST('2022-xx-xx 13:23:05.35' AS TIMESTAMP) AND
                    wop.create_shop_id = 'xxxx' AND wop.receive_pos_id = '246000000000000001' AND
                    (wop.state NOT IN(3,6,11) AND
                    wop.refund_state   <>   2)
                    GROUP BY wod.de_from
                    UNION ALL
                    SELECT wod.de_from AS deFrom,
                    - SUM(wopb.pr_money) AS incomeMoney
                    FROM wuu_o2o_partback AS wopb INNER JOIN o2o_public AS wop ON wopb.de_id = wop.de_id AND wop.receive_pos_id = '246000000000000001' AND
                    (wop.state NOT IN(3,6,11) AND
                    wop.refund_state <> 2)
                    INNER JOIN o2o_delivery AS wod ON wop.de_id = wod.de_id
                    WHERE wopb.modify_time BETWEEN CAST('2022-07-07 06:00:00' AS TIMESTAMP) AND CAST('2022-07-07 13:23:05.35' AS TIMESTAMP) AND
                    wopb.create_shop_id = '2460' AND
                    wopb.pr_state = 2 AND
                    wopb.delflg = 0
                    GROUP BY wod.de_from)
                    SELECT wmsj.deFrom AS deFrom,
                    bos.name AS deName,
                    SUM(wmsj.incomeMoney) AS incomeMoney
                    FROM o2o_setting AS bos RIGHT OUTER JOIN wmsj ON bos.type = wmsj.deFrom
                    GROUP BY wmsj.deFrom,
                    bos.name
                    ORDER BY wmsj.deFrom;

    面对这样的SQL如何进行分析优化,手段在哪里,我们来捋一捋。

    1  拆

    SQL 都是一堆堆的逻辑进行组合的,通过组合达到整体的逻辑需求,而这样凭着逻辑撰写的SQL ,有一个问题,思路的问题,什么问题,一条SQL 解决所有问题的思路。那么优化时,必须将这些SQL 拆解,通过拆解,将问题缩小化,逐个核对和击破问题。

    1

    SELECT wod.de_from AS deFrom,
                    SUM(wop.income) AS incomeMoney
                    FROM o2o_public AS wop INNER JOIN wuu_o2o_delivery AS wod ON wop.de_id = wod.de_id
                    WHERE wop.modify_time BETWEEN CAST('2022-07-07 06:00:00' AS TIMESTAMP) AND CAST('2022-07-07 13:23:05.35' AS TIMESTAMP) AND
                    wop.create_shop_id = '2460' AND wop.receive_pos_id = '246000000000000001' AND
                    (wop.state NOT IN(3,6,11) AND
                    wop.refund_state   <>   2)
                    GROUP BY wod.de_from;
      2             
                    SELECT wod.de_from AS deFrom,
                    - SUM(wopb.pr_money) AS incomeMoney
                    FROM o2o_partback AS wopb INNER JOIN wuu_o2o_public AS wop ON wopb.de_id = wop.de_id AND wop.receive_pos_id = '246000000000000001' AND
                    (wop.state NOT IN(3,6,11) AND
                    wop.refund_state <> 2)
                    INNER JOIN o2o_delivery AS wod ON wop.de_id = wod.de_id
                    WHERE wopb.modify_time BETWEEN CAST('2022-07-07 06:00:00' AS TIMESTAMP) AND CAST('2022-07-07 13:23:05.35' AS TIMESTAMP) AND
                    wopb.create_shop_id = '2460' AND
                    wopb.pr_state = 2 AND
                    wopb.delflg = 0
                    GROUP BY wod.de_from;

    先拆出两个SQL 。

    然后我们开始针对这两个SQL 开始看他们的执行计划

    c5eaf9f74d1f7a4f176b698a0f8b8b1a.png

    GroupAggregate  (cost=532.70..532.72 rows=1 width=36)
       Group Key: wod.de_from
       ->  Sort  (cost=532.70..532.70 rows=1 width=10)
             Sort Key: wod.de_from
             ->  Nested Loop  (cost=1.00..532.69 rows=1 width=10)
                   ->  Index Scan using idx_o2o_public_3 on o2o_public wop  (cost=0.56..530.02 rows=1 width=10)
                         Index Cond: ((create_shop_id = 2460) AND (receive_pos_id = '246000000000000001'::bigint))
                         Filter: ((modify_time >= '2022-xx-xx 06:00:00'::timestamp without time zone) AND (modify_time <= '2022-07-07 

    13:23:05.35'::timestam
    p without time zone) AND (refund_state <> 2) AND (state <> ALL ('{3,6,11}'::integer[])))
                   ->  Index Scan using de_id_indix on wuu_o2o_delivery wod  (cost=0.44..2.66 rows=1 width=8)
                         Index Cond: (de_id = wop.de_id)


    2  拿到SQL 中表的数据量,以及索引,和采样率。

    f6aa49b89d437a308930601762d6b75f.png

    2000万的表

    表中的索引

        "pk_o2o_public" PRIMARY KEY, btree (id)
        "de_id2_indix" btree (de_id)
        "idx_op_createdate" btree (create_date)
        "idx_public_de_no" btree (de_no)
        "idx_o2o_public_1" btree (create_shop_id, create_time, de_id, state, refund_state, delflg)
        "idx_o2o_public_2" btree (modify_time, receive_pos_id, create_shop_id)
        "idx_o2o_public_3" btree (create_shop_id, receive_pos_id, create_time)
        "idx_o2o_public_de_id" btree (de_id)

    这里关注两个索引  idx_wuu_o2o_public_2  和  idx_wuu_o2o_public_3

    在分析索引前,我们先看看查询条件里面的字段有哪些,这些字段的分布情况如何

     modify_time  
     create_shop_id
     receive_pos_id
     state
     refund_state
     de_from

    1  modify_time

    correlation            | 0.999602

    e0e99ea7208418af14ff96701e8f42f3.png

    2  create_shop_id

     correlation            | 0.786221

    1e5c7cf8d958a52dd11208368498de78.png

    3  receive_pos_id 

    correlation            | 0.784747

    99d3a1af7f90ac179507f94dcf3d68ea.png

    4  state

    correlation            | 0.858308

    83b4fea566626a8d17f5283a522c7ab0.png

    5  refund_state 

    correlation            | 0.963647

    1b1b7b3c603de66c67d354591a37bb96.png

    ecee8139a15b64640f6389fbf00a0ebd.png

    3 发现问题解决问题

    按照原理,我们希望我们的cost 是低的,因为查询中 “开发” 的同学建立索引是按照查询语句的条件的顺序建立的,但是并没有走他们要的索引。随即我们根据POSTGRESQL 查询语句与索引之间的规范,添加了更合适的索引,整体的COST 值降低了不少。

    create index idx_wuu_o2o_public_create_receive_modify on wuu_o2o_public(create_shop_id,receive_pos_id,modify_time);

    cb4dd02a5b2a1277668c2577f9f46775.png

    验证结果

    打开timing 我们查看添加索引和取消索引后的执行情况。我们分别针对,添加索引,和去掉我们索引的情况进行了分析,整体的结果在不使用合适的索引的情况下,整体的查询在100毫秒左右 ,在使用了对应的索引后,整体查询的时间在9 毫秒左右,性能提高了10倍。

    15b351336a571b559e1d55f1a0a6cf7a.png

    6f9b2d92799a9dd84fdebf277fa6f1fa.png

    探究原因

    按照开发的思路,他建立的索引 idx_wuu_o2o_public_2 是符合他思路的索引    "idx_wuu_o2o_public_2" btree (modify_time, receive_pos_id, create_shop_id)  而实际上走的索引,缺失下面的索引    "idx_wuu_o2o_public_3" btree (create_shop_id, receive_pos_id, create_time)

    这里提出一个问题,为什么不走 2 号索引,研其原因,

    1  对于一个算法来说,一个不确定的事情,是不可能被使用的,范围这样的字段就是一个不确定的事情,你的范围可大可小,算法无法判断你每次的查询的范围,所以这不能作为一个 “恒定的量”

    2  基于数据库的分析器,是要算成本的,基于范围的成本估算困难,这里建议,产出索引不是简单就凭着,查询的顺序就可以建立的,一个索引的建立

    1  查看统计信息中的 n_distinct

    2  查看统计信息中的correlation 

    3  根据你查询的符号方式来确认

    基于以上三个部分,来初步确认你的POSTGRESQL 的索引该怎么建立。

    73927f09cb923b56e0e9a65240ee9c67.png

  • 相关阅读:
    TBOX开发需求说明
    【JDBC笔记】获取数据库连接
    小米躺枪,联发科被曝芯片漏洞:全球37%安卓用户被监听
    高防IP是什么意思?
    入门C++程序员需要怎么下手?
    基于ssm框架的农产品扶农商农平台的设计与实现
    元宇宙到底是什么?全球各行业纷纷入局,为什么会成为一种趋势?
    【前端】yarn介绍和使用
    JS的加法规则
    基于STM32单片机的蓝牙智能手环系统
  • 原文地址:https://blog.csdn.net/liuhuayang/article/details/126093162