• Oracle特殊恢复:异常掉电导致的ORA-600 [kfrValAcd30]故障处理


    一、 问题描述

    现象:硬件掉电后,oracle集群无法启动。

    [root@rac2 ~]# crsctl stat res -t
    CRS-4535: Cannot communicate with Cluster Ready Services
    CRS-4000: Command Status failed, or completed with errors.
    [root@rac2 ~]# crsctl start crs
    CRS-4640: Oracle High Availability Services is already active
    CRS-4000: Command Start failed, or completed with errors.
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    二、 故障处理

    查看集群组件发现ora.asm状态为offline

    root@rac2 ~]# crsctl stat res -t -init
    --------------------------------------------------------------------------------
    NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
    --------------------------------------------------------------------------------
    Cluster Resources
    --------------------------------------------------------------------------------
    ora.asm
          1        ONLINE  OFFLINE                               Instance Shutdown   
    ora.cluster_interconnect.haip
          1        ONLINE  ONLINE       rac2                                         
    ora.crf
          1        ONLINE  ONLINE       rac2                                         
    ora.crsd
          1        ONLINE  OFFLINE                                                   
    ora.cssd
          1        ONLINE  ONLINE       rac2                                         
    ora.cssdmonitor
          1        ONLINE  ONLINE       rac2                                         
    ora.ctssd
          1        ONLINE  ONLINE       rac2                     OBSERVER            
    ora.diskmon
          1        OFFLINE OFFLINE                                                   
    ora.drivers.acfs
          1        ONLINE  ONLINE       rac2                                         
    ora.evmd
          1        ONLINE  INTERMEDIATE rac2                                         
    ora.gipcd
          1        ONLINE  ONLINE       rac2                                         
    ora.gpnpd
          1        ONLINE  ONLINE       rac2                                         
    ora.mdnsd
          1        ONLINE  ONLINE       rac2                                         
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32

    查看grid alert日志发现磁盘组没有mount

    [ohasd(4329)]CRS-2769:Unable to failover resource 'ora.diskmon'.
    2018-05-08 04:12:24.940:
    [cssd(4576)]CRS-1707:Lease acquisition for node rac2 number 2 completed
    2018-05-08 04:12:26.188:
    [cssd(4576)]CRS-1605:CSSD voting file is online: /dev/asmdisk/oraasm-OCR_0000; details in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
    2018-05-08 04:12:28.723:
    [cssd(4576)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
    2018-05-08 04:12:30.617:
    [ctssd(4660)]CRS-2401:The Cluster Time Synchronization Service started on host rac2.
    2018-05-08 04:12:30.617:
    [ctssd(4660)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1.
    2018-05-08 04:12:32.348:
    [ohasd(4329)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
    2018-05-08 04:12:32.348:
    [ohasd(4329)]CRS-2769:Unable to failover resource 'ora.diskmon'.
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15

    查看asm_alert,出现ORA-00600 [kfrValAcd30]的报错

    NOTE: GMON heartbeating for grp 2
    GMON querying group 2 at 6 for pid 23, osid 5727
    NOTE: cache opening disk 0 of grp 2: DATA_0000 path:/dev/asmdisk/oraasm-ASM_0000
    NOTE: F1X0 found on disk 0 au 2 fcn 0.0
    NOTE: cache opening disk 1 of grp 2: DATA_0001 path:/dev/asmdisk/oraasm-ASM_0001
    NOTE: F1X0 found on disk 1 au 2 fcn 0.0
    NOTE: cache opening disk 2 of grp 2: DATA_0002 path:/dev/asmdisk/oraasm-ASM_0002
    NOTE: F1X0 found on disk 2 au 2 fcn 0.0
    NOTE: cache opening disk 3 of grp 2: DATA_0003 path:/dev/asmdisk/oraasm-ASM_0003
    NOTE: cache mounting (first) normal redundancy group 2/0x877A96CD (DATA)
    * allocate domain 2, invalid = TRUE
    NOTE: attached to recovery domain 2
    NOTE: starting recovery of thread=1 ckpt=8.390 group=2 (DATA)
    Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_5727.trc  (incident=50111):
    ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
    ORA-15017: diskgroup "ASM" cannot be mounted
    ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
    Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_50111/+ASM2_ora_5727_i50111.trc
    Use ADRCI or Support Workbench to package the incident.
    See Note 411.1 at My Oracle Support for error and packaging details.
    Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_5727.trc:
    ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
    ORA-15017: diskgroup "ASM" cannot be mounted
    ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
    Abort recovery for domain 2
    NOTE: crash recovery signalled OER-600
    ERROR: ORA-600 signalled during mount of diskgroup DATA
    NOTE: cache dismounting (clean) group 2/0x877A96CD (DATA)
    NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29

    查看ASM2_ora_5727_i50111.trc

    kfdp_query(DATA): 6
    ----- Abridged Call Stack Trace -----
    ksedsts()+465<-kfdp_query()+530<-kfdPstSyncPriv()+585<-kfgFinalizeMount()+1630<-kfgscFinalize()+1433<-kfgForEachKfgsc()+285<-kfgsoFinalize()+135<-kfgFinalize()+398<-kfxdrvMount()+5558<-kfxdrvEntry()+2207<-opiexe()+20624<-opiosq0()+3932<-kpooprx()+274<-kpoal8()+842
    <-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
    ----- End of Abridged Call Stack Trace -----
    2018-05-17 21:54:44.321623 : Start recovery for domain=2, valid=0, flags=0x4
    NOTE: starting recovery of thread=1 ckpt=8.390 group=2 (DATA)
    Incident 50111 created, dump file: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_50111/+ASM2_ora_5727_i50111.trc
    ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
    ORA-15017: diskgroup "ASM" cannot be mounted
    ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
    
    recovery of group DATA failed due to the following error(s):
    ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8], [390], [9], [390], [], [], [], [], []
    ORA-15017: diskgroup "ASM" cannot be mounted
    ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ASM"
    2018-05-17 21:54:45.016907 : Abort recovery for domain 2, flags = 0x4
    2018-05-17 21:54:45.017799 : kjb_abort_recovery: domain flags=0x0, valid=0
    NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)
    kfdp_dismount(): 7
    ----- Abridged Call Stack Trace -----
    ksedsts()+465<-kfdp_dismountInt()+389<-kfdp_dismount()+11<-kfgTermCache()+347<-kfgRecoverDismount()+373<-kfgRecoverMount()+277<-kfgscDelete()+2742<-kssdel()+155<-kfgscFinalize()+1211<-kfgForEachKfgsc()+285<-kfgsoFinalize()+135<-kfgFinalize()+398<-kfxdrvMount()+5558
    <-kfxdrvEntry()+2207<-opiexe()+20624<-opiosq0()+3932<-kpooprx()+274<-kpoal8()+842<-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
    ----- End of Abridged Call Stack Trace -----
    ASM name of disk 0x9f1d9488 (2:0:DATA_0000:/dev/asmdisk/oraasm-ASM_0000) is being cleared
    ASM name of disk 0x9f1d9808 (2:1:DATA_0001:/dev/asmdisk/oraasm-ASM_0001) is being cleared
    ASM name of disk 0x9f1d9108 (2:2:DATA_0002:/dev/asmdisk/oraasm-ASM_0002) is being cleared
    ASM name of disk 0x9f1d8d88 (2:3:DATA_0003:/dev/asmdisk/oraasm-ASM_0003) is being cleared
    NOTE: messaging CKPT to quiesce pins Unix process pid: 5727, image: oracle@rac2 (TNS V1-V3)
    kfdp_dismount(): 8
    ----- Abridged Call Stack Trace -----
    ksedsts()+465<-kfdp_dismountInt()+389<-kfdp_dismount()+11<-kfgTermCache()+347<-kfgRecoverDismount()+373<-kfgRecoverMount()+277<-kfgscDelete()+2742<-kssdct()+258<-kfgsoDelete()+721<-kssdel()+155<-kssdch()+6849<-ksures()+52<-opiosq0()+4721<-kpooprx()+274<-kpoal8()+842
    <-opiodr()+917<-ttcpip()+2183<-opitsk()+1710<-opiino()+969<-opiodr()+917<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
    ----- End of Abridged Call Stack Trace -----
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34

    搜索mos也发现了几个Bug:

    Bug 27288230 - ORA-600: [kfrvalacd30] in ASM SMON (Doc ID 27288230.8)	
    Bug 19064132 : ORA-600 [KFRVALACD30] MOUNTING DISKGROUP AFTER RESTARTING SERVERS
    ORA-600 [KFRVALACD30] in ASM (Doc ID 2123013.1)
    bug在11.2.0.1/11.2.0.3/112.0.4/12.1/18.1 都存在。
    
    • 1
    • 2
    • 3
    • 4

    Oracle认为这是存储或者OS问题导致asm acd block的元数据不一致了
    可能导致ASM元数据的主辅扩展区都发生损坏。这个损坏会导致rebalance挂起或不断尝试失败,或者阻止磁盘组被挂载。

    strace出现ORA 600 [kfrValAcd30]报错,但没发现其他可用的信息

    write(25, "string_value", 12)           = 12
    write(25, ">", 1)                       = 1
    write(25, "ORA 600 [kfrValAcd30]", 21)  = 21
    write(25, ", 2)                      = 2
    write(25, "string_value", 12)           = 12
    write(25, ">", 1)                       = 1
    write(25, ", 2)                      = 2
    write(25, "pkey_dbgristih", 14)         = 14
    write(25, ">", 1)                       = 1
    write(25, "\n", 1)                      = 1
    write(25, "<", 1)                       = 1
    write(25, "creatm_dbgristih", 16)       = 16
    write(25, ">", 1)                       = 1
    write(25, "<", 1)                       = 1
    write(25, "time", 4)                    = 4
    write(25, ">", 1)                       = 1
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16

    ASM的Active Change Directory(ACD)简单来说就相当于asm元数据重做记录,需要说明的是,每一个asm实例,都用有其自己的ACD目录,也就说,如果你是双节点的rac,那么就有84m的ACD 目录信息。
    ACD主要是记录了这样一些数据:线程号,block号,opcode,sequences,记录长度等信息。
    当处于运行的asm实例,突然crash后,那么重启asm实例以后,asm可以根据ACD信息去进行”instance recover”,从而保证能够正常启动asm实例。
    如果acd的信息出现损坏,磁盘组将不能够被mount。

    使用kfed来读取元数据,首先定位到active change directory 所在AU

    [grid@rac2 ~]$  kfed read  /dev/dm-0  aun=2 blkn=3|grep .xptr.au |head -1
    kfffde[0].xptr.au:                    4 ; 0x4a0: 0x00000004
    
    • 1
    • 2

    从上面信息可以看出,ACD元数据和数据应该包含在其中一个AU 4

    查看au 4的信息

    grid@rac2 ~]$  kfed read /dev/dm-0 aun=4 blkn=1|more
    kfbh.endian:                          1 ; 0x000: 0x01
    kfbh.hard:                          130 ; 0x001: 0x82
    kfbh.type:                            8 ; 0x002: KFBTYP_CHNGDIR
    kfbh.datfmt:                          1 ; 0x003: 0x01
    kfbh.block.blk:                       1 ; 0x004: blk=1
    kfbh.block.obj:                       3 ; 0x008: file=3
    kfbh.check:                    17400326 ; 0x00c: 0x01098206
    kfbh.fcn.base:                        0 ; 0x010: 0x00000000
    kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
    kfbh.spare1:                          0 ; 0x018: 0x00000000
    kfbh.spare2:                          0 ; 0x01c: 0x00000000
    kfracdb.aba.seq:                      2 ; 0x000: 0x00000002
    kfracdb.aba.blk:                      0 ; 0x004: 0x00000000
    kfracdb.ents:                         2 ; 0x008: 0x0002
    kfracdb.ub2spare:                     0 ; 0x00a: 0x0000
    kfracdb.lge[0].valid:                 1 ; 0x00c: V=1 B=0 M=0
    kfracdb.lge[0].chgCount:              1 ; 0x00d: 0x01
    kfracdb.lge[0].len:                  52 ; 0x00e: 0x0034
    kfracdb.lge[0].kfcn.base:             1 ; 0x010: 0x00000001
    kfracdb.lge[0].kfcn.wrap:             0 ; 0x014: 0x00000000
    kfracdb.lge[0].bcd[0].kfbl.blk:       0 ; 0x018: blk=0
    kfracdb.lge[0].bcd[0].kfbl.obj:       4 ; 0x01c: file=4
    kfracdb.lge[0].bcd[0].kfcn.base:      0 ; 0x020: 0x00000000
    kfracdb.lge[0].bcd[0].kfcn.wrap:      0 ; 0x024: 0x00000000
    kfracdb.lge[0].bcd[0].oplen:          4 ; 0x028: 0x0004  ---表示长度,类似logfile dump的LEN
    kfracdb.lge[0].bcd[0].blkIndex:       0 ; 0x02a: 0x0000
    kfracdb.lge[0].bcd[0].flags:         28 ; 0x02c: F=0 N=0 F=1 L=1 V=1 A=0 C=0
    kfracdb.lge[0].bcd[0].opcode:       212 ; 0x02e: 0x00d4  --opcode,类似数据库实例中的update/delete/insert操作的opcode编号
    kfracdb.lge[0].bcd[0].kfbtyp:         9 ; 0x030: KFBTYP_COD_BGO  --操作类型,类似数据库实例中的update/delete/insert等类型
    kfracdb.lge[0].bcd[0].redund:        17 ; 0x031: SCHE=0x1 NUMB=0x1  --这里表示冗余级别,17是unport,18是mirror,19表示high
    kfracdb.lge[0].bcd[0].pad:        63903 ; 0x032: 0xf99f
    kfracdb.lge[0].bcd[0].KFRCOD_CRASH:   1 ; 0x034: 0x00000001
    kfracdb.lge[0].bcd[0].au[0]:         46 ; 0x038: 0x0000002e
    kfracdb.lge[0].bcd[0].disks[0]:       0 ; 0x03c: 0x0000
    kfracdb.lge[1].valid:                 1 ; 0x040: V=1 B=0 M=0
    kfracdb.lge[1].chgCount:              1 ; 0x041: 0x01
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38

    check等信息属于hash值,每隔3s都会更新一次,可能是由于突然掉电,cache里的信息没有更新到磁盘中导致。

    [grid@rac2 ~]$  kfed read /dev/dm-2 aun=4 blkn=0|grep ckpt
    kfracdc.ckpt.seq:                     8 ; 0x018: 0x00000008
    kfracdc.ckpt.blk:                   390 ; 0x01c: 0x00000186
    [grid@rac2 ~]$  kfed read /dev/dm-0 aun=4 blkn=0|grep ckpt
    kfracdc.ckpt.seq:                    70 ; 0x018: 0x00000046
    kfracdc.ckpt.blk:                   255 ; 0x01c: 0x000000f
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6

    三、 问题解决

    通过kfed merge手工修改的方法恢复

    kfed read /dev/dm-2 aun=4 blkn=0 >acd.txt
    kfed merge /dev/dm-2 aun=4 blkn=0  text=acd.txt
    
    • 1
    • 2

    手工mount磁盘组

    SQL>  alter diskgroup data mount;
     alter diskgroup data mount
    *
    ERROR at line 1:
    ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [8],
    [390], [9], [390], [], [], [], [], []
    
    SQL> alter diskgroup data mount;
    Diskgroup altered.
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    如果kfed merge无法修复,请及时联系,以通过odu对asm底层进行恢复数据文件,从而最大限度恢复数据.

    作者介绍:云和恩墨资深Oracle dba,专注于数据库运维、架构和行业发展,有12年左右的金融、保险、政府、地税、运营商等业务关键型系统的运维经验,曾担任公司异常恢复东区接口人,负责紧急异常恢复工作,技术二线专家。目前负责PG、openGauss/MogDB运维、国产化MogDB数据库的推广工作。

  • 相关阅读:
    Perl基础入门指南:从零开始掌握Perl编程
    UniApp video 使用(自定义进度条,及微信无法暂停播放设置进度问题)
    每日学习总结20240313
    dreamweaver网页设计作业制作 小米商城官网商城购物网页设计 WEB静态网页作业模板 大学生个人购物商城网页代码 dw个人网页作业
    微信小程序毕业设计健康食谱菜谱系统微信小程序+后台管理系统|前后分离VUE.js
    细说RTSP、RTMP和GB28181区别
    Flink日志采集-ELK可视化实现
    3Darray 修改array值然后保存图片
    JS高级:作用域与作用域链
    android中actionbar使用示例
  • 原文地址:https://blog.csdn.net/ORACLE_BBED/article/details/138114218