• 某客户管理系统Oracle RAC节点异常重启问题详细分析记录


    一、故障概述

            某日10:58分左右客户管理系统数据库节点1所有实例异常重启,重启后业务恢复正常。经过分析发现,此次实例异常重启数据库节点1

    二、故障原因分析

    1、数据库日志分析

            从节点1的数据库日志来看,10:58:49的时候数据库进程开始被abort,最终PMON进程因为481错误而终止实例,这个报错一般表示网络问题

    alert_reportdb1.log:

    ***********************************************************************

    Sat Dec 07 10:58:49 XXXX

    ***********************************************************************

    Fatal NI connect error 12537, connecting to:

     (LOCAL=NO)

    Fatal NI connect error 12537, connecting to:

     (LOCAL=NO)

    Fatal NI connect error 12537, connecting to:

     (LOCAL=NO)

      VERSION INFORMATION:

    TNS for Linux: Version 11.2.0.4.0 - Production

    Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

    TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

    TNS-12537: TNS:connection closed

        ns main err code: 12537

        ns secondary err code: 12560

    TNS-12537: TNS:connection closed  

        nt main err code: 0

        ns secondary err code: 12560

        nt secondary err code: 0

        nt main err code: 0

    TNS-12537: TNS:connection closed

        nt OS err code: 0

        nt secondary err code: 0

        ns secondary err code: 12560

        nt OS err code: 0

        nt main err code: 0

        nt secondary err code: 0

        nt OS err code: 0

    opiodr aborting process unknown ospid (36742) as a result of ORA-609

    opiodr aborting process unknown ospid (36722) as a result of ORA-609

    opiodr aborting process unknown ospid (36738) as a result of ORA-609

    Sat Dec 07 10:58:49 2023

    ***********************************************************************

    Fatal NI connect error 12537, connecting to:

     (LOCAL=NO)

      VERSION INFORMATION:

    TNS for Linux: Version 11.2.0.4.0 - Production

    Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

    TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

      Time: 07-DEC-XXXX 10:58:49

      Tracing not turned on.

      Tns error struct:

        ns main err code: 12537

    TNS-12537: TNS:connection closed

        ns secondary err code: 12560

        nt main err code: 0

        nt secondary err code: 0

    Sat Dec 07 10:58:49 2023

    ***********************************************************************

        nt OS err code: 0

    Fatal NI connect error 12537, connecting to:

     (LOCAL=NO)

      VERSION INFORMATION:

    TNS for Linux: Version 11.2.0.4.0 - Production

    Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

    TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production

      Time: 07-DEC-XXXX 10:58:49

      Tracing not turned on.

      Tns error struct:

        ns main err code: 12537

    TNS-12537: TNS:connection closed

    opiodr aborting process unknown ospid (36751) as a result of ORA-609

        ns secondary err code: 12560

        nt main err code: 0

        nt secondary err code: 0

        nt OS err code: 0

    opiodr aborting process unknown ospid (36761) as a result of ORA-609

    Sat Dec 07 10:58:49 2023

    。。。。。

    opiodr aborting process unknown ospid (36746) as a result of ORA-609

    opiodr aborting process unknown ospid (36777) as a result of ORA-609opiodr aborting process unknown ospid (36807) as a result of ORA-609

    opiodr aborting process unknown ospid (36819) as a result of ORA-609

    Sat Dec 07 10:58:49 2023

    PMON (ospid: 48234): terminating the instance due to error 481

    2、Crs alert日志分析

            从crsalertlog信息中可以知道,10:58:49的时候,所有数据库资源监测失败,这个和数据库实例abort时间点一致,应该是数据库中止后的表现。

    ------------------------------节点1 crs alert trace文件----------------------

    xxxx- 12-07 10:58:49.068 [CRSD(46493)]CRS-5825: Agent '/u01/app/grid/12.1.0.2/bin/oraagent_grid' is unresponsive and will be restarted. Details at (:CRSAGF00131:) {1:44542:2} in /u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd.trc.

    xxxx- 12-07 10:58:49.094 [ORAAGENT(47263)]CRS-5832: Agent '/u01/app/grid/12.1.0.2/bin/oraagent_grid' was unable to process commands. Details at (:CRSAGF00128:) {1:44542:2} in /u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc.

    xxxx- 12-07 10:58:49.094 [ORAAGENT(47263)]CRS-5818: Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {1:44542:2} in /u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc.

    xxxx- 12-07 10:58:50.173 [ORAAGENT(47494)]CRS-5011: Check of resource "reportdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:50.298 [ORAAGENT(47494)]CRS-5011: Check of resource "managedb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:51.029 [ORAAGENT(47494)]CRS-5011: Check of resource "hwxddb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:51.222 [ORAAGENT(47494)]CRS-5011: Check of resource "hwwlxtdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:51.284 [ORAAGENT(47494)]CRS-5011: Check of resource "hwyyxtdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:51.285 [ORAAGENT(47494)]CRS-5011: Check of resource "yxgldb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:51.297 [ORAAGENT(47494)]CRS-5011: Check of resource "mhlwyxdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:51.298 [ORAAGENT(47494)]CRS-5011: Check of resource "boarddb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:52.273 [ORAAGENT(47494)]CRS-5011: Check of resource "tyjgdb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:52.285 [ORAAGENT(47494)]CRS-5011: Check of resource "obsadb" failed: details at "(:CLSN00007:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_oracle.trc"

    xxxx- 12-07 10:58:52.969 [ORAAGENT(36712)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 36712

    xxxx- 12-07 10:58:54.741 [ORAAGENT(41064)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/ohasd_oraagent_grid.trc"

    xxxx- 12-07 10:58:55.406 [ORAAGENT(36712)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc"

    xxxx- 12-07 10:58:55.424 [ORAAGENT(36712)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/crsd_oraagent_grid.trc"

    xxxx- 12-07 10:58:55.455 [ORAAGENT(41064)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/12.1.0.2/diag/crs/mpc01dbadm01/crs/trace/ohasd_oraagent_grid.trc"

    xxxx- 12-07 10:58:55.527 [ORAAGENT(36712)]CRS-5017: The resource action "ora.RECOC1.dg start" encountered the following error:

    xxxx- 12-07 10:58:55.527+ORA-01092: ORACLE instance terminated. Disconnection forced

    3、Asm trace日志分析

            从1#asm的alertlog可以看到10:58:48,2#实例发起对1#asm实例的abort,需要通过2#alert和LMON trace分析,同时还伴有IPC Send timeout的信息,这个一般是心跳网络超时的报错。

    2#ASM的alertlog可以看到10:56:35就发生了2#核心后台进程发给1#LMD0(44844)的超时报错,随即判断1#asm实例僵死而发起kill!

             再看2#ASM的LMON日志,10:56:35开始尝试reconfig,并设置100s超时vote:

    ......

            随后在10:58:28通过选举驱逐1#asm实例:

    4、节点1 diag日志分析

            看看1#asm实例crash时的diag文件+ASM1_diag_44836_20231207105849.trc,看看2#asm进程接受进程LMD0(44844)的状态:

            可以看到故障前的等待都是“ges remote message”,最后1个历史等待7分29s,这个是典型的IPC网络等待:

            查看系统和网络丢包有关的参数,发现2个节点都会有大量的“packet reassembles failed”丢包发生:

    [root@mpc01dbadm01 trace]# netstat -s

    Ip:

    36764567053 total packets received

    70116 with invalid addresses

    0 forwarded

    0 incoming packets discarded

    24572526733 incoming packets delivered

    21770066525 requests sent out

    692241 outgoing packets dropped

    30980 fragments dropped after timeout

    15457160506 reassemblies required

    3265291587 packets reassembled ok

    226816 packet reassembles failed

    1796293625 fragments received ok

    664 fragments failed

    7885036302 fragments created

    [root@mpc01dbadm02 trace]# netstat -s

    Ip:

    30349664623 total packets received

    79036 with invalid addresses

    0 forwarded

    0 incoming packets discarded

    23893920057 incoming packets delivered

    23820631106 requests sent out

    295480 outgoing packets dropped

    186 dropped because of missing route

    28255 fragments dropped after timeout

    8368295089 reassemblies required

    1912747085 packets reassembled ok

    202513 packet reassembles failed

    3389250826 fragments received ok

    3337 fragments failed

    16013866546 fragments created

    5、OS内核设置

            当前数据库系统计算节点为RHEL6.8,存储节点为RHEL7.2,查看ipfrag参数为默认值:

            MOS有篇相关文档:RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (文档 ID 2008933.1),现象和当前故障匹配,workaound是加大ipfrag相关参数:

              根据REDHAT官方文章说明,这种现象发生在如下场景:

    1. RHEL6.6/6.7,根据我们经验RHEL6/7都有类似故障发生;
    2. CPU较多(本机为56个);
    3. Oracle RAC环境

    三、结论

    1. 本次故障由于ASM进程间通讯超时,导致2#实例发起了对1#asm实例的驱逐;
    2. 发现2个节点网络均存在大量“packet reassembles failed”丢包,根据MOS文档RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (文档 ID 2008933.1),这个是由于RHEL6/7在主机存在大量CPU时,IP分片组包超出分片缓存区导致,处理方案是使用巨桢(jumbo frame)或者调整IPFRAG相关系统配置。

    四、处理建议

            1、所有节点按如下最佳实践调整系统内核参数:

            net.ipv4.ipfrag_high_thresh = 41943040

            net.ipv4.ipfrag_low_thresh = 40894464

            net.ipv4.ipfrag_time = 120

            net.ipv4.ipfrag_secret_interval = 600

            net.ipv4.ipfrag_max_dist = 1024

            2、为便于故障分析,所有节点部署OSW

  • 相关阅读:
    Netty源码剖析之Netty启动流程
    Nginx.conf设置nginx优化(二)
    【计算机组成&体系结构】整数的表示和运算 | 原码、反码和补码
    Java Maven项目打jar包方法
    YOLOv8蒸馏 | 知识蒸馏 | 利用模型蒸馏改进YOLOv8进行无损涨点 | MimicLoss(在线蒸馏 + 离线蒸馏)
    Qt中的单例模式
    EGL函数翻译--eglChooseConfig
    CAN bus总线静电保护方案
    Tomcat部署war包项目请求404
    网络原理(五):IP 协议
  • 原文地址:https://blog.csdn.net/DBDeep/article/details/140363807