稳定性问题从表现来看有: 死机重启, 自动关机, 无法开机,冻屏,黑屏以及闪退, 无响应等情况;
从技术层面来划分无外乎两大类: 长时间无法执行完成(Timeout) 以及异常崩溃(crash). 主要分类如下:
kernel panic相关错误信息代码处理:
arch/arm64/mm/fault.c
arch/arm64/kernel/traps.c
kernel/panic.c
一些常见的错误信息打印如下:
"Data abort info:\n"
"Mem abort info:\n"
"Unable to handle kernel %s at virtual address %016lx\n"
"Kernel panic - not syncing: %s\n"
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[2022/2/27 21:26:18] [254094.838383] !(0)[675:tosMsgAdvTask] list_add corruption. prev->next should be next (ffffff802e905648), but was ffffff8016ca8540. (prev=ffffff8016ca81c0).
[2022/2/27 21:26:18] [254094.851569] -(0)[675:tosMsgAdvTask] ------------[ cut here ]------------
[2022/2/27 21:26:18] [254094.854755] -(0)[675:tosMsgAdvTask] kernel BUG at lib/list_debug.c:28!
[2022/2/27 21:26:18] [254094.861344] -(0)[675:tosMsgAdvTask] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
soft lockup是指内核中有BUG导致在内核模式下一直循环的时间超过n s(n为配置参数),而其他进程得不到运行的机会;实现方式:内核对于每一个cpu都有一个监控进程watchdog/x 每秒钟会对比进程时间戳,对比时间戳就可以知道运行情况了,进程长时间没有运行,时间戳没有更新,超过一定时间就报错。hard lockup的发生是由于禁止了CPU的所有中断超过一定时间(几秒)这种情况下,外部设备发生的中断无法处理,内核认为此时发生了所谓的hard lockup
Kernel panic - not syncing: workqueue lockup !!!
行 6243: 4,80762,12483088774,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:60527ms
行 6244: 4,80763,12484088772,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:61525ms
行 6245: 4,80764,12485088769,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:62522ms
行 6251: 4,80770,12486088775,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:996ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:63518ms
行 6252: 4,80771,12487088772,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:64515ms
行 6253: 4,80772,12488088771,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:65512ms
行 6258: 4,80777,12490088772,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:996ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:67506ms
行 6259: 4,80778,12491088776,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:68504ms
行 6264: 4,80783,12492088774,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:69501ms
行 6270: 4,80789,12495088774,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:1001ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:72492ms
行 6271: 4,80790,12496088770,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:998ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:73489ms
行 6292: 4,80811,12497088773,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:74487ms
行 6295: 4,80814,12498088770,-;-(0)[0:swapper/0] RT throttling on cpu:3 rt_time:997ms, curr:HwBinder:344_3/1408 prio:98 sum_runtime:75484ms
page allocation failure: order:0
Bad mode in Synchronous Abort handler detected on CPU2, code 0x96000006 -- DABT (current EL)
watchdog: BUG: soft lockup - CPU#0 stuck for 11s!
案列分析1:
从Unable to handle kernel NULL pointer dereference at virtual address 00000000000002c0和Mem abort info这两个地方看出panic的原因为空指针引起
ESR = 0x96000006, ESR为ARM异常寄存器错误信息
96000006 [#1] 后面的数字1表示发生的次数
Unable to handle kernel NULL pointer dereference at virtual address 00000000000002c0
Mem abort info:
ESR = 0x96000006
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000006
CM = 0, WnR = 0
user pgtable: 4k pages, 39-bit VAs, pgdp=0000000065733000
pgd=0000000062020003, pud=0000000062020003, pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Modules linked in 后面是动态加载的ko模块,如果panic是模块中问题,会在堆栈后面打印对应的模块名,从下面这个log中发现出错的地方为pm_kara_release模块 ,同时会把PC等相关的寄存器信息打印出来
Modules linked in: adc_keypad(O) ad82584f(O) snd_soc_aml_t9015(O) snd_soc(O) snd_soc_dummy_codec(O) aml_media(O) aml_dvb_extern(O) aml_spicc snd_soc_notify(O) meson_clk_measure aml_gki_tool(O) aml_mailbox(O)
PC : ffffffc009c329f0, PFN:8d4a4 V
SP : ffffff809612ad70, PFN:9612a L
FAR : 00000000000002c0 U
R0 : ffffffc009c38a5f, PFN:8d4aa V
R1 : ffffffc009c38000, PFN:8d4aa V
R3 : ffffffc01649b000, PFN:***** V
R9 : ffffffc0113a5cb0, PFN: 23a5 V
R15 : ffffffc0110e56d7, PFN: 20e5 V
R16 : ffffffc010debb00, PFN:***** V
R18 : ffffff809610b840, PFN:9610b L
R19 : ffffffc009c53000, PFN:8d4c5 V
R29 : ffffff809612ada0, PFN:9612a L
R30 : ffffffc009c329d0, PFN:8d4a4 V
CPU: 1 PID: 2286 Comm: log_out Tainted: P C O 5.4.125-android11-2-ge3776401faa0-dirty-ab97 #1
Hardware name: Amlogic (DT)
pstate: 20400089 (nzCv daIf +PAN -UAO)
pc : _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]
lr : _0d503df0a3108831179ecc8bd8eab8d7+0xfb0/0x1294 [pm_kara_release]
Call trace将panic的堆栈信息打印出来,panic出现的地方大概有两种情况,一种是出现在进程上下文和中断上下文中,进程上下文中发生的panic为用户空间通过系统调用驱动程序引起,中断上下文中发生的panic为处理中断程序时发生
下面的el0_svc_compat->el0_svc_compat_handler->el0_svc_common->这个过程发现引发的panic为用户空间调用write系统调用时触发
[ffffffc02162be20+ 32][<ffffffc01030b8bc>] __arm64_sys_write+0x24/0x34
[ffffffc02162be40+ 96][<ffffffc0100d0f40>] el0_svc_common+0xc8/0x258
[ffffffc02162bea0+ 336][<ffffffc0100d10fc>] el0_svc_compat_handler+0x2c/0x3c
[ffffffc02162bff0+ 0][<ffffffc010083524>] el0_svc_compat+0x8/0x24
[ffffff809612ada0+ 96][] _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]
从上面这个信息可以清楚知道发生的模块为pm_kara_release,不能用add2line工具解析[ffffff809612ada0+ 96]这个地址,这个地址为vmlinux中的地址,如果直接转换可能会是?号或者行数不对应等
通过addr2line工具解析_pm_kara_release模块 中_0d503df0a3108831179ecc8bd8eab8d7这个地址,解析出对应的函数,大致发生的地方在哪个位置
然后通过objdump 反汇编出来,在0d503df0a3108831179ecc8bd8eab8d7地址的基础上+0xfd0为对应的行,这个地方为最终出问题的地方
也可以通过nm工具来搜索0d503df0a3108831179ecc8bd8eab8d7这个地址
Call trace:
[ffffff809612ada0+ 96][<ffffffc009c329f0>] _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]
[ffffff809612ae00+ 32][<ffffffc009c31650>] _d0571ed82a857ce697fd4a1931c7f2f9+0xad8/0xb18 [pm_kara_release]
[ffffff809612ae20+ 112][<ffffffc01018ef08>] __run_hrtimer+0x1b0/0x2ec
[ffffff809612ae90+ 96][<ffffffc01018db34>] hrtimer_interrupt+0x16c/0x3e4
[ffffff809612aef0+ 32][<ffffffc0109dfb30>] arch_timer_handler_phys+0x44/0x58
[ffffff809612af10+ 96][<ffffffc01016c754>] handle_percpu_devid_irq+0x134/0x2bc
[ffffff809612af70+ 64][<ffffffc010164508>] __handle_domain_irq+0x88/0xcc
[ffffff809612afb0+ 0][<ffffffc010081044>] gic_handle_irq+0x70/0xe8
[ffffffc02162bb40+ 32][<ffffffc01008305c>] el1_irq+0x15c/0x240
[ffffffc02162bb60+ 96][<ffffffc0105b7cc0>] fault_in_pages_readable+0xb4/0x324
[ffffffc02162bbc0+ 80][<ffffffc0105b7bb8>] iov_iter_fault_in_readable+0x50/0xa4
[ffffffc02162bc10+ 96][<ffffffc01027f444>] generic_perform_write+0x5c/0x1b0
[ffffffc02162bc70+ 80][<ffffffc01027f660>] __generic_file_write_iter+0xc8/0x170
[ffffffc02162bcc0+ 160][<ffffffc01027f81c>] generic_file_write_iter+0x114/0x184
[ffffffc02162bd60+ 48][<ffffffc01030b440>] __vfs_write+0x160/0x184
[ffffffc02162bd90+ 80][<ffffffc01030b608>] vfs_write+0xe8/0x1ac
[ffffffc02162bde0+ 64][<ffffffc01030b844>] ksys_write+0x78/0xcc
[ffffffc02162be20+ 32][<ffffffc01030b8bc>] __arm64_sys_write+0x24/0x34
[ffffffc02162be40+ 96][<ffffffc0100d0f40>] el0_svc_common+0xc8/0x258
[ffffffc02162bea0+ 336][<ffffffc0100d10fc>] el0_svc_compat_handler+0x2c/0x3c
[ffffffc02162bff0+ 0][<ffffffc010083524>] el0_svc_compat+0x8/0x24
Code: f9408908 d0000020 d0000021 91297c00 (b942c104)
---[ end trace 589d34c66e567ecd ]---
Kernel panic - not syncing: Fatal exception in interrupt
解析ko地址时,首先要确认ko上基于32位编译的还是64位编译的,否则会报如下错误,ko文件需要使用带有符号的ko或者lib
arm-linux-androideabi-addr2line: out/android11-5.4/common/drivers/amlogic/media/aml_media.ko: File format not recognized
kernel panic常用的工具:
arm-linux-objdump
arm-linux-nm
arm-linux-add2line
aarch64-linux-gnu-addr2line
aarch64-none-elf
pid: 438, tid: 562, name: HwBinder:438_1 >>> /vendor/tvos/bin/sitatvservice <<<
pid : tombstoned的进程ID
tid :线程ID
name: 线程名
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x1
Cause: null pointer dereference
进程异常对出的原因为空指针异常
backtrace为出错时的对应的堆栈打印信息:最后调用libamlhalcore.so中的removeCommandValue+146函数挂掉了,可以通过addr2line objdump nm等工具来debug
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'TCL/t982_ar31a8/t982_ar31a8:11/RQ3A.210705.001/15:userdebug/release-keys'
Revision: '0'
ABI: 'arm'
Timestamp: 2022-03-02 09:27:08+0800
pid: 438, tid: 562, name: HwBinder:438_1 >>> /vendor/tvos/bin/sitatvservice <<<
uid: 0
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x1
Cause: null pointer dereference
r0 eae36ee8 r1 eae36ee0 r2 00000000 r3 00000200
r4 00000006 r5 eae36ee0 r6 eae36edc r7 00000001
r8 eae36ed8 r9 eadef2d8 r10 eadf0ced r11 eadf0cda
ip eca8da30 sp e8a5fd40 lr eca10a85 pc eae1495e
backtrace:
NOTE: Function names and BuildId information is missing for some frames due
NOTE: to unreadable libraries. For unwinds of apps, only shared libraries
NOTE: found under the lib/ directory are readable.
NOTE: On this device, run setenforce 0 to make the libraries readable.
#00 pc 0004895e /vendor/lib/libamlhalcore.so (removeCommandValue+146) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
#01 pc 000491a9 /vendor/lib/libamlhalcore.so (Aml_Hal_SetMasterVol+68) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
#02 pc 0005191f /vendor/lib/libamlhalcore.so (AML_HAL_CEC_setDeviceConnectionState+678) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
#03 pc 000107a7 /vendor/lib/libsystemcontrolservice.so
tombstoned除了会将出问题的线程堆栈信息打印出来外,还会把其它的所有线程的线程堆栈信息打印出来吗
pid: 447, tid: 447, name: sitatvservice >>> /vendor/tvos/bin/sitatvservice <<<
uid: 0
r0 fff58968 r1 00000000 r2 000003e8 r3 f6099444
r4 f5e052b4 r5 00000001 r6 f5f90180 r7 000000a2
r8 fff58a54 r9 00000000 r10 00000000 r11 fff58a18
ip fff58998 sp fff58968 lr f5dba409 pc f5de65b0
backtrace:
NOTE: Function names and BuildId information is missing for some frames due
NOTE: to unreadable libraries. For unwinds of apps, only shared libraries
NOTE: found under the lib/ directory are readable.
NOTE: On this device, run setenforce 0 to make the libraries readable.
#00 pc 0009b5b0 /apex/com.android.runtime/lib/bionic/libc.so (nanosleep+8) (BuildId: 29bc0f9d4101e570b3fe12148d8fe2)
#01 pc 0006f405 /apex/com.android.runtime/lib/bionic/libc.so (usleep+52) (BuildId: 29bc0f9d4101e570b3fe12148d8fe2)
#02 pc 001cb680 /vendor/tvos/bin/sitatvservice
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
pid: 447, tid: 492, name: user_timer >>> /vendor/tvos/bin/sitatvservice <<<
uid: 0
r0 f5d47190 r1 00000000 r2 000003e8 r3 00000001
r4 f5e052b4 r5 00000000 r6 000001f4 r7 000000a2
r8 00000000 r9 f5ff53b8 r10 024a17a0 r11 f5d471b0
ip 00003573 sp f5d47190 lr f5dba409 pc f5de65b0
backtrace:
NOTE: Function names and BuildId information is missing for some frames due
NOTE: to unreadable libraries. For unwinds of apps, only shared libraries
NOTE: found under the lib/ directory are readable.
NOTE: On this device, run setenforce 0 to make the libraries readable.
#00 pc 0009b5b0 /apex/com.android.runtime/lib/bionic/libc.so (nanosleep+8) (BuildId: 29bc0f9d4101e570b3fe12148d8fe2)
#01 pc 0006f405 /apex/com.android.runtime/lib/bionic/libc.so (usleep+52) (BuildId: 29bc0f9d4101e570b3fe12148d8fe2)
#02 pc 00775158 /vendor/tvos/bin/sitatvservice
常见的错误码如下:
出现问题时,一帮是signal 11 (SIGSEGV) 和 signal 6 (SIGABRT)
#define SIGHUP 1
#define SIGINT 2
#define SIGQUIT 3
#define SIGILL 4
#define SIGTRAP 5
#define SIGABRT 6
#define SIGIOT 6
#define SIGBUS 7
#define SIGFPE 8
#define SIGKILL 9
#define SIGUSR1 10
#define SIGSEGV 11
#define SIGUSR2 12
#define SIGPIPE 13
#define SIGALRM 14
#define SIGTERM 15
#define SIGSTKFLT 16
#define SIGCHLD 17
#define SIGCONT 18
#define SIGSTOP 19
#define SIGTSTP 20
#define SIGTTIN 21
#define SIGTTOU 22
#define SIGURG 23
#define SIGXCPU 24
#define SIGXFSZ 25
#define SIGVTALRM 26
#define SIGPROF 27
#define SIGWINCH 28
#define SIGIO 29
#define SIGPOLL SIGIO
debuggerd -b [pid],可输出Native进程的调用栈
Android中Watchdog用来看护system_server进程,system_server进程运行着系统最终要的服务,譬如AMS、PKMS、WMS等, 当这些服务不能正常运转时,Watchdog可能会杀掉system_server,让系统重启。Watchdog的实现利用了锁和消息队列机制。当system_server发生死锁或消息队列一直处于忙碌状态时,则认为系统已经没有响应了(System Not Responding)。在分析Watchdog问题的时候,首先要有详尽的日志,其次要能定位出导致Watchdog超时的直接原因,最重要的是能还原出问题发生的场景。watchdog不是引发重启的原因,其它模块获取锁等相关导致系统卡住,才导致引发watchdog.
dropboxTag | 含义 |
---|---|
system_server_anr | system进程无响应 |
system_server_watchdog | system进程发生watchdog |
system_server_crash | system进程崩溃 |
system_server_native_crash | system进程native出现崩溃 |
system_server_wtf | system进程发生严重错误 |
system_server_lowmem | system进程内存不足 |
watchdog异常打印
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
system_server异常日志收集
/data/anr/*
/data/tombstones/*
/data/system/dropbox/*
bugreport也是收集android系统异常信息的小工具,常见用法如下:
bugreport bugreportz
adb bugreport > Bugreport.zip
adb bugreport > Bugreport.txt
ChkBugReport是索尼开发一个jar包用来解析bugreport的工具
Where to obtain it · sonyxperiadev/ChkBugReport Wiki (github.com)
https://github.com/sonyxperiadev/ChkBugReport
git clone git://github.com/sonyxperiadev/ChkBugReport.git
// 导出内存分配信息
echo m > /proc/sysrq-trigger
// 导出当前任务状态信息
echo t > /proc/sysrq-trigger
// 导出当前CPU寄存器和标志位信息
echo p > /proc/sysrq-trigger