• 记录一些 PostgreSQL问题分析思路


    原文来自微信公众号文章 生产案例 | 损坏的索引

    本文主要学习记录一些其中的问题分析思路和方法。

    一、 问题现象

    "有个SQL在数据库中执行1ms就能出结果,但是业务压测发起的SQL就会一直执行,查不出结果,进程也没有锁,CPU占用也很高"

    二、 文中排查思路

    1. pg_stat_activity查看

    主要看是否处于活跃状态及等待事件,即state 和 wait_event两个字段,一般还会看看执行计划。

    select * from pg_stat_activity where pid = '105430';

    2. GDB抓运行堆栈

    步骤是:

    • gdb 跟踪问题SQL进程 105430(问题SQL需要正在运行中)
    • gdb执行bt命令显示堆栈,这里 LWLockAcquire 是轻量锁的获取函数
    1. #0 0x000000000071f5e3 in pg_atomic_read_u32_impl (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics/generic.h:48
    2. #1 pg_atomic_read_u32 (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics.h:247
    3. #2 LWLockAttemptLock (lock=lock@antry=0x2b7667e03e00,mode=mode@entry=LW_SHARED) at lwlock.c:750
    4. #3 0x00000000007201b7 in LWLockAcquire (lock=lock@antry=0x2b7667e03e00,mode=mode@entry=LW_SHARED) at lwlock.c:1187

    根据这个步骤可以自己模拟抓取一下慢SQL运行堆栈

    会话1

     会话2

    gdb -p 13411

    会话1执行慢sql

    select count(*) from t_insert a,t_insert b;

    会话2,执行c及bt。就可以看到该会话当前正在执行ExecNestLoop函数,以及整个调用栈。

    3. strace分析

     常用命令参考 strace分析sqlplus登录慢问题_Hehuyi_In的博客-CSDN博客

    1. strace -p 进程号
    2. # 输出如下
    3. futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
    4. futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
    5. futex(0x2b7667dfbd38,FUTEX_WAKE,1) = 0
    6. futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
    7. futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
    8. futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
    9. futex(0x2b7667dfbd38,FUTEX_WAKE,1) = 0

           可以看到此时在不断地在唤醒等待,唤醒等待,如此重复,仿佛陷入到了一个死循环里面,自旋锁自死锁。而自旋锁本身又没有等待队列、死锁检测机制,需要手动释放,因此看现象貌似遇到了死锁。

    1. //uaddr指向一个地址,val代表这个地址期待的值,当*uaddr==val时,才会进行wait
    2. int futex_wait(int *uaddr, int val);
    3. //唤醒n个在uaddr指向的锁变量上挂起等待的进程
    4. int futex_wake(int *uaddr, int n);

    4. perf抓取热点函数

    perf record -g -p pid 抓取30s数据

           可以看到热点函数有PinBuffer、LWLockAttempLock、LWLockRelease等,这些函数都会涉及到spinlock。但是到目前为止,我们只发现了基本现象,但是不知道根因,即便知道了锁的特征,也无法知悉是哪一个环节出了问题。

    5. 开启全日志 log_statement = all

    观察发现,同一类SQL有些SQL可以正确执行,但是唯独有一个参数无法执行,只有BIND的过程,而没有EXECUTE的过程!

    1. 2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6102,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
    2. 2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6103,"SELECT",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
    3. 2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2648,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,""
    4. # 就是下面这条
    5. 2022-08-05 17:36:29.235
    6. CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2649,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,",""
    7. 2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6104,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
    8. 2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2650,"UPDATE",",,,,,,,,""
    9. 2022-08-05 17:36:29.236 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2651,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
    10. 2022-08-05 17:36:29.236 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2652,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"'
    11. 2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2653,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
    12. 2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2654,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""

           可以看到,当最后一次17:36:29.235执行了这个SQL之后,241835进程经过BIND之后,就没有后续日志了!而其他同类SQL都可以正常执行,于是把这个参数抓取到数据库中去执行一下,可以看到其他参数都可以瞬间执行,唯独这一个SQL一跑就死,ctrl + c都停止不了!

    可以看到第一次51.927ms就出来了,而第二次一直无法结束,执行计划显示是走的索引,根据我之前的经验,隐约感觉是索引出了问题。

    6. amcheck插件检测索引

    1. SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
    2. c.relname,
    3. c.relpages
    4. FROM pg_index i
    5. JOIN pg_opclass op ON i.indclass[0] = op.oid
    6. JOIN pg_am am ON op.opcmethod = am.oid
    7. JOIN pg_class c ON i.indexrelid = c.oid
    8. JOIN pg_namespace n ON c.relnamespace = n.oid
    9. WHERE am.amname = 'btree' AND n.nspname = 'public'
    10. -- Don't check temp tables, which may be from another session:
    11. AND c.relpersistence != 't'
    12. AND c.relname = '索引名'
    13. -- Function may throw an error when this is omitted:
    14. AND c.relkind = 'i' AND i.indisready AND i.indisvalid
    15. ORDER BY c.relpages DESC LIMIT 10;

    果然。。索引又双叒叕出错了!

    另外这里还显示了报错的函数  static void  bt_target_page_check(BtreeCheckState *state)

    三、 小结

           这个问题的现象很费解,当传入了某个值之后,就会一直运行停不下来,现象就是一直在自旋,等待给数据页加锁。

           根据结果反推的话,查询读取到了这个错误的索引页之后,尝试加锁,但是数据异常,加锁失败,加之自旋锁显著的特点就是"死等",需要获取SpinLock的进程会一直主动检查能否获取得到锁,直到获取到锁后才会继续执行下面的逻辑,这把锁会一直被这个线程持有,直到线程自己主动释放,这也是这个进程停不下来的原因,以及传入相同参数的进程都在等待,越堆越多!当然解决方式也好解决,reindex重建索引即可。

    参考

    生产案例 | 损坏的索引

  • 相关阅读:
    详解memcpy和memmove函数的使用
    FLASK 模板,渲染变量
    华为MateBook X Pro笔记本录屏办法
    java-php-python-ssm点餐系统计算机毕业设计
    基于FPGA的图像坏点像素修复算法实现,包括tb测试文件和MATLAB辅助验证
    requests 技术问题与解决方案:解决字典值中列表在URL编码时的问题
    【深度学习】图像修复的一些模型
    MySQL数据库期末考试试题及参考答案(01)
    【蓝桥】健身
    Arxiv 2209 | Switchable Self-attention Module
  • 原文地址:https://blog.csdn.net/Hehuyi_In/article/details/126196003