原文来自微信公众号文章 生产案例 | 损坏的索引
本文主要学习记录一些其中的问题分析思路和方法。
"有个SQL在数据库中执行1ms就能出结果,但是业务压测发起的SQL就会一直执行,查不出结果,进程也没有锁,CPU占用也很高"
主要看是否处于活跃状态及等待事件,即state 和 wait_event两个字段,一般还会看看执行计划。
select * from pg_stat_activity where pid = '105430';
步骤是:
- #0 0x000000000071f5e3 in pg_atomic_read_u32_impl (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics/generic.h:48
- #1 pg_atomic_read_u32 (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics.h:247
- #2 LWLockAttemptLock (lock=lock@antry=0x2b7667e03e00,mode=mode@entry=LW_SHARED) at lwlock.c:750
- #3 0x00000000007201b7 in LWLockAcquire (lock=lock@antry=0x2b7667e03e00,mode=mode@entry=LW_SHARED) at lwlock.c:1187
根据这个步骤可以自己模拟抓取一下慢SQL运行堆栈
会话1
会话2
gdb -p 13411
会话1执行慢sql
select count(*) from t_insert a,t_insert b;
会话2,执行c及bt。就可以看到该会话当前正在执行ExecNestLoop函数,以及整个调用栈。
常用命令参考 strace分析sqlplus登录慢问题_Hehuyi_In的博客-CSDN博客
- strace -p 进程号
-
- # 输出如下
- futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
- futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
- futex(0x2b7667dfbd38,FUTEX_WAKE,1) = 0
- futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
- futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
- futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
- futex(0x2b7667dfbd38,FUTEX_WAKE,1) = 0
可以看到此时在不断地在唤醒等待,唤醒等待,如此重复,仿佛陷入到了一个死循环里面,自旋锁自死锁。而自旋锁本身又没有等待队列、死锁检测机制,需要手动释放,因此看现象貌似遇到了死锁。
- //uaddr指向一个地址,val代表这个地址期待的值,当*uaddr==val时,才会进行wait
- int futex_wait(int *uaddr, int val);
-
- //唤醒n个在uaddr指向的锁变量上挂起等待的进程
- int futex_wake(int *uaddr, int n);
perf record -g -p pid 抓取30s数据
可以看到热点函数有PinBuffer、LWLockAttempLock、LWLockRelease等,这些函数都会涉及到spinlock。但是到目前为止,我们只发现了基本现象,但是不知道根因,即便知道了锁的特征,也无法知悉是哪一个环节出了问题。
观察发现,同一类SQL有些SQL可以正确执行,但是唯独有一个参数无法执行,只有BIND的过程,而没有EXECUTE的过程!
- 2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6102,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
- 2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6103,"SELECT",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
- 2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2648,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,""
-
- # 就是下面这条
- 2022-08-05 17:36:29.235
- CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2649,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,",""
- 2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6104,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
- 2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2650,"UPDATE",",,,,,,,,""
- 2022-08-05 17:36:29.236 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2651,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
- 2022-08-05 17:36:29.236 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2652,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"'
- 2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2653,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
- 2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2654,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
可以看到,当最后一次17:36:29.235执行了这个SQL之后,241835进程经过BIND之后,就没有后续日志了!而其他同类SQL都可以正常执行,于是把这个参数抓取到数据库中去执行一下,可以看到其他参数都可以瞬间执行,唯独这一个SQL一跑就死,ctrl + c都停止不了!
可以看到第一次51.927ms就出来了,而第二次一直无法结束,执行计划显示是走的索引,根据我之前的经验,隐约感觉是索引出了问题。
- SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
- c.relname,
- c.relpages
- FROM pg_index i
- JOIN pg_opclass op ON i.indclass[0] = op.oid
- JOIN pg_am am ON op.opcmethod = am.oid
- JOIN pg_class c ON i.indexrelid = c.oid
- JOIN pg_namespace n ON c.relnamespace = n.oid
- WHERE am.amname = 'btree' AND n.nspname = 'public'
- -- Don't check temp tables, which may be from another session:
- AND c.relpersistence != 't'
- AND c.relname = '索引名'
- -- Function may throw an error when this is omitted:
- AND c.relkind = 'i' AND i.indisready AND i.indisvalid
- ORDER BY c.relpages DESC LIMIT 10;
果然。。索引又双叒叕出错了!
另外这里还显示了报错的函数 static void bt_target_page_check(BtreeCheckState *state)
这个问题的现象很费解,当传入了某个值之后,就会一直运行停不下来,现象就是一直在自旋,等待给数据页加锁。
根据结果反推的话,查询读取到了这个错误的索引页之后,尝试加锁,但是数据异常,加锁失败,加之自旋锁显著的特点就是"死等",需要获取SpinLock的进程会一直主动检查能否获取得到锁,直到获取到锁后才会继续执行下面的逻辑,这把锁会一直被这个线程持有,直到线程自己主动释放,这也是这个进程停不下来的原因,以及传入相同参数的进程都在等待,越堆越多!当然解决方式也好解决,reindex重建索引即可。
参考