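I. Background
When a native crash occurs, hard problems such as invalid memory accesses and memory corruption often cannot be pinpointed, because a tombstone does not carry enough information.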
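II. Introduction to coredump
2.1 What is a coredump
When a user program hits an exception at run time and exits abnormally, the Linux kernel saves the program's current memory state (run-time memory, register state, stack pointer, function call stacks, and so on) to a core file. This process is called a coredump.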
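2.2 What coredump is for
Coredump is mainly used to resolve NE (native exception) problems. When a user process crashes natively, the tombstone captures only a brief backtrace. For hard problems such as invalid memory accesses or corrupted memory, that is not enough to pinpoint the cause, and a coredump is needed to analyze them.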
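2.3 When a coredump is triggered
By exception type: a coredump is triggered when a native process performs an out-of-bounds memory access, overflows its stack, dereferences an invalid pointer, or does something similar.
By signal type: a coredump is triggered when a native process receives a signal whose default action is to dump core, such as SIGQUIT, SIGABRT, SIGSEGV, or SIGTRAP.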
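The four signals named above are examples rather than the full set. As a rough illustration, the check below encodes the signals that the Linux signal(7) manual page lists with the default action "Core". The function name dumps_core_by_default is a hypothetical helper introduced here, not part of any tool discussed in this article.

```shell
# Return 0 if the default action of the given signal is "terminate and dump core"
# according to signal(7) on Linux. Accepts names with or without the SIG prefix.
dumps_core_by_default() {
    case "${1#SIG}" in
        ABRT|BUS|FPE|ILL|IOT|QUIT|SEGV|SYS|TRAP|XCPU|XFSZ) return 0 ;;
        *) return 1 ;;
    esac
}

for sig in SIGSEGV SIGABRT SIGTERM; do
    if dumps_core_by_default "$sig"; then
        echo "$sig: dumps core by default"
    else
        echo "$sig: does not dump core by default"
    fi
done
```

A signal in this list still produces no core file if the process catches it or if the core size limit is 0.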
III. How to use coredump
Coredump is disabled by default on Android and must be enabled manually.
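1. Enable coredump
1) Check whether coredump is enabled:
    ulimit -c              // 0 means coredump is disabled
2) Enable coredump:
    ulimit -c 1024         // cap core files at 1024 blocks (the block size depends on the shell, typically 512 or 1024 bytes)
or
    ulimit -c unlimited    // no size limit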
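2. Set the path where core files are generated
// If no path is set, the kernel's default pattern is "core", so the core file is written to the crashing process's current working directory
echo "/data/corefile/core-%e-%p-%t" > /proc/sys/kernel/core_pattern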
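3. When a process exits abnormally, a core file (ELF format) is generated under the configured path and can be analyzed with gdb
1) Put the executable and the core file in the same directory
2) Run gdb {binary_name} {core_file_name} to load the core file and locate the problem
See the demo in Section V for details.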
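The steps above can be collected into a small script. This is a minimal sketch, not a vendor-provided tool: the function names describe_core_limit and setup_coredump are hypothetical, and setup_coredump assumes a root shell because it writes to /proc/sys/kernel/core_pattern.

```shell
# Translate the value printed by `ulimit -c` into a readable status.
describe_core_limit() {
    case "$1" in
        0)         echo "disabled" ;;
        unlimited) echo "enabled, no size limit" ;;
        *)         echo "enabled, limited to $1 blocks" ;;
    esac
}

# Steps 1 and 2 from above: lift the size limit and set the core file path.
# Requires root; defined here but not called.
setup_coredump() {
    core_dir=$1
    mkdir -p "$core_dir" || return 1
    ulimit -c unlimited || return 1
    echo "$core_dir/core-%e-%p-%t" > /proc/sys/kernel/core_pattern
}

echo "coredump is currently: $(describe_core_limit "$(ulimit -c)")"
```

The limit set by ulimit applies to the current shell and to the processes started from it, so after running setup_coredump /data/corefile the crashing program has to be launched from that same shell.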
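IV. How coredump works
4.1 Basic principle
When a user program hits certain errors or exceptions, the Linux kernel catches the exception and sends a signal to the user process. The process handles the signal before returning to user space; at that point the kernel's coredump path is invoked, an ELF-format core file is generated, and it is saved to the configured path.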
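The kernel side of this flow cannot be reproduced in a script, but its user-space effect can be observed. In the sketch below, a child shell sends SIGSEGV to itself and the parent inspects the exit status. The core size limit is set to 0 inside the child so that the demo leaves no file behind; this is a choice made for the example, not part of the flow described above.

```shell
# Child shell: disable core files for this demo, then send SIGSEGV to itself.
status=0
sh -c 'ulimit -c 0; kill -s SEGV $$' || status=$?

# A shell reports "terminated by signal N" as exit status 128 + N.
# SIGSEGV is signal 11 on Linux, so the expected status is 139.
if [ "$status" -eq 139 ]; then
    echo "child was terminated by SIGSEGV"
else
    echo "unexpected exit status: $status"
fi
```

With a non-zero limit in place of ulimit -c 0, the same termination would also go through the core file generation step.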

4.2 Core code
do_coredump is called to generate the core file, as follows:
void do_coredump(const kernel_siginfo_t *siginfo)
{
    ......
    binfmt = mm->binfmt;
    if (!binfmt || !binfmt->core_dump)
        goto fail;
    if (!__get_dumpable(cprm.mm_flags))
        goto fail;
    ......
    // 1. Build the core file name
    ispipe = format_corename(&cn, &cprm, &argv, &argc);
    ......
    // 2. Create the core file
    cprm.file = file_open_root(&root, cn.corename, open_flags, 0600);
    ......
    // 3. Write the process's memory to the core file
    core_dumped = binfmt->core_dump(&cprm);
    ......
}
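Step 1 in do_coredump turns the configured core_pattern into a concrete file name. The sketch below imitates that expansion for the three specifiers used in this article (%e executable name, %p pid, %t time of the dump in seconds since the epoch). It is a simplified stand-in for format_corename, not a port of it: expand_core_pattern is a hypothetical name, the sample values are made up, and other specifiers and %% are not handled.

```shell
# Expand %e, %p and %t in a core_pattern string.
# $1 = pattern, $2 = executable name, $3 = pid, $4 = time in epoch seconds.
# Simplification: the values must not contain '|' or '&', which sed treats specially.
expand_core_pattern() {
    printf '%s\n' "$1" | sed -e "s|%e|$2|g" -e "s|%p|$3|g" -e "s|%t|$4|g"
}

# Sample values for illustration only.
expand_core_pattern "/data/corefile/core-%e-%p-%t" "demo" 1234 1700000000
# prints /data/corefile/core-demo-1234-1700000000
```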
The elf_core_dump function writes the process's memory state to an ELF-format core file for later debugging and analysis with gdb, as follows:
// kernel_platform/msm-kernel/fs/binfmt_elf.c
static int elf_core_dump(struct coredump_params *cprm)
{
    ......
    /*
     * Collect all the non-memory information about the process for the
     * notes. This also sets up the file header.
     */
    // 1. Fill in the ELF header and the notes
    if (!fill_note_info(&elf, e_phnum, &info, cprm))
        goto end_coredump;

    has_dumped = 1;

    // 2. Compute the sizes of the ELF header, program headers and notes, and allocate the corresponding memory
    offset += sizeof(elf);                      /* Elf header */
    offset += segs * sizeof(struct elf_phdr);   /* Program headers */
    ......
    // 3. Write the ELF header and the program headers (the ELF header itself is written in the elided code above)
    /* Write program headers for segments dump */
    for (i = 0; i < cprm->vma_count; i++) {
        struct core_vma_metadata *meta = cprm->vma_meta + i;
        struct elf_phdr phdr;

        phdr.p_type = PT_LOAD;
        phdr.p_offset = offset;
        phdr.p_vaddr = meta->start;
        phdr.p_paddr = 0;
        phdr.p_filesz = meta->dump_size;
        phdr.p_memsz = meta->end - meta->start;
        offset += phdr.p_filesz;
        phdr.p_flags = 0;
        if (meta->flags & VM_READ)
            phdr.p_flags |= PF_R;
        if (meta->flags & VM_WRITE)
            phdr.p_flags |= PF_W;
        if (meta->flags & VM_EXEC)
            phdr.p_flags |= PF_X;
        phdr.p_align = ELF_EXEC_PAGESIZE;

        if (!dump_emit(cprm, &phdr, sizeof(phdr)))
            goto end_coredump;
    }

    if (!elf_core_write_extra_phdrs(cprm, offset))
        goto end_coredump;

    /* write out the notes section */
    // 4. Write the notes
    if (!write_note_info(&info, cprm))
        goto end_coredump;

    /* For cell spufs */
    if (elf_coredump_extra_notes_write(cprm))
        goto end_coredump;

    /* Align to page */
    dump_skip_to(cprm, dataoff);

    // 5. Write the memory contents of each dumped segment
    for (i = 0; i < cprm->vma_count; i++) {
        struct core_vma_metadata *meta = cprm->vma_meta + i;

        if (!dump_user_range(cprm, meta->start, meta->dump_size))
            goto end_coredump;
    }

    // 6. Write the extended numbering
    if (!elf_core_write_extra_data(cprm))
        goto end_coredump;

    if (e_phnum == PN_XNUM) {
        if (!dump_emit(cprm, shdr4extnum, sizeof(*shdr4extnum)))
            goto end_coredump;
    }

end_coredump:
    free_note_info(&info);
    kfree(shdr4extnum);
    kfree(phdr4note);
    return has_dumped;
}
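4.3 Code sequence
The code sequence for exception capture, signal handling, and core file generation is as follows:
exception -> signal sent to the process -> signal handled before returning to user space -> do_coredump() -> binfmt->core_dump(), i.e. elf_core_dump() -> core file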

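4.4 Core file format and contents
The core file produced by a coredump is in ELF format and can be debugged with gdb to locate and analyze problems.
The contents of a core file look like this: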
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         138
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000001e70 0x0000000000000000 0x0000000000000000
                 0x00000000000018a8 0x0000000000000000         0x0
  LOAD           0x0000000000004000 0x000000560ca89000 0x0000000000000000
                 0x0000000000000000 0x0000000000002000  R      0x1000
  LOAD           0x0000000000004000 0x000000560ca8b000 0x0000000000000000
                 0x0000000000000000 0x0000000000003000  R E    0x1000
  LOAD           0x0000000000004000 0x000000560ca8e000 0x0000000000000000
                 0x0000000000001000 0x0000000000001000  R      0x1000
...
Displaying notes found at file offset 0x00001e70 with length 0x000018a8:
  Owner                Data size        Description
  CORE                 0x00000188       NT_PRSTATUS (prstatus structure)
  CORE                 0x00000088       NT_PRPSINFO (prpsinfo structure)
  CORE                 0x00000080       NT_SIGINFO (siginfo_t data)
  CORE                 0x00000150       NT_AUXV (auxiliary vector)
  CORE                 0x00000f6e       NT_FILE (mapped files)
    Page size: 4096
                 Start                 End         Page Offset
    0x000000560ca89000  0x000000560ca8b000  0x0000000000000000
        /system/bin/coredump-test-bin
    0x000000560ca8b000  0x000000560ca8e000  0x0000000000000002
        /system/bin/coredump-test-bin
...
  CORE                 0x00000210       NT_FPREGSET (floating point registers)
  LINUX                0x00000010       NT_ARM_TLS (AArch TLS registers)
   description data: 00 10 e4 45 7e 00 00 00 00 00 00 00 00 00 00 00
  LINUX                0x00000108       NT_ARM_HW_BREAK (AArch hardware breakpoint registers)
   description data: 06 09 00 00 ... (all remaining bytes are 00)
  LINUX                0x00000108       NT_ARM_HW_WATCH (AArch hardware watchpoint registers)
   description data: 04 09 00 00 ... (all remaining bytes are 00)
  LINUX                0x00000004       Unknown note type: (0x00000404)
   description data: ff ff ff ff
  LINUX                0x00000010       Unknown note type: (0x00000406)
   description data: 00 00 00 00 80 ff 7f 00 00 00 00 00 80 ff 7f 00
  LINUX                0x00000008       Unknown note type: (0x0000040a)
   description data: 0f 00 00 00 00 00 00 00
  LINUX                0x00000008       Unknown note type: (0x00000409)
   description data: 01 00 00 00 00 00 00 00
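A core file mainly consists of the ELF header, the program headers, and the NOTE segment.
ELF Header: records the basic information and structure of the core file.
Program Headers: describe the memory mappings recorded in the file, together with the permissions and attributes of each segment.
NOTE segment: records the process state at the time of the crash, the registers, the signal information, the auxiliary vector, and the details of the mapped files. With this information, gdb can rebuild the memory layout at the time of the crash, analyze the cause, and help developers pinpoint the problem.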
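The offsets in the program headers shown above can be cross-checked against the offset arithmetic in elf_core_dump. The sketch below takes the header size, program header size, program header count, and NOTE size from that output and recomputes where the NOTE segment and the first LOAD segment start. It assumes the data area begins at the next page boundary after the notes, which is what the "Align to page" step in the code suggests; the variable names are chosen for this example.

```shell
# Values copied from the ELF header and the NOTE program header shown above.
ehdr_size=64            # Size of this header
phdr_size=56            # Size of program headers
phnum=138               # Number of program headers
note_size=$((0x18a8))   # FileSiz of the NOTE segment
page_size=4096          # Align of the LOAD segments (0x1000)

# offset += sizeof(elf); offset += segs * sizeof(struct elf_phdr);
note_off=$((ehdr_size + phnum * phdr_size))

# Round the end of the notes up to the next page boundary.
data_off=$(( (note_off + note_size + page_size - 1) / page_size * page_size ))

printf 'NOTE offset:       0x%x\n' "$note_off"
printf 'first LOAD offset: 0x%x\n' "$data_off"
```

The two results, 0x1e70 and 0x4000, match the Offset column of the NOTE entry and of the first LOAD entry. The first two LOAD entries have a FileSiz of 0, so offset += phdr.p_filesz leaves the offset unchanged, which is why all three LOAD entries shown start at 0x4000.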
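V. Demo
1) Demo program
After the process crashes, capture both the tombstone and the core file.
2) The generated tombstone
The tombstone shows only the approximate cause. It cannot pinpoint the root cause or the line of code that crashed the process, so a coredump is needed to capture a core file for precise analysis.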
Cmdline: ../../system/bin/coredump-test-bin use-after-free
pid: 11966, tid: 11966, name: coredump-test-b >>> ../../system/bin/coredump-test-bin <<<
uid: 0
...
backtrace:
      #01 pc 0000000000090088  /system/lib64/libc.so (__vfprintf+10416) (BuildId: 567e41669f1cb528e72fe319cd09033b)
      #02 pc 00000000000ac06c  /system/lib64/libc.so (vsnprintf+192) (BuildId: 567e41669f1cb528e72fe319cd09033b)
      #03 pc 0000000000006afc  /system/lib64/liblog.so (__android_log_print+184) (BuildId: 87ba6a9314f00fab650fb8fad7913d58)
      #04 pc 00000000000010a4  /system/bin/coredump-test-bin (main+80) (BuildId: c97bade065c198c12dcca74f107c513c)
      #05 pc 0000000000048768  /system/lib64/libc.so (__libc_init+96) (BuildId: 567e41669f1cb
...
3) The generated core file
With coredump enabled, a core file is captured. The core file is in ELF format and can be debugged with gdb.
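Before loading a file into gdb, it can be worth confirming that it really is an ELF core file. The sketch below reads the ELF magic and the e_type field (2 bytes at offset 16, value 4 for a core file) with od. The function name is_elf_core is hypothetical, and the check assumes a little-endian file, as in the AArch64 header shown in section 4.4.

```shell
# Return 0 if the file starts with the ELF magic and has e_type == ET_CORE (4).
# Assumes little-endian encoding, so the two e_type bytes read "04 00".
is_elf_core() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    etype=$(od -An -tx1 -j16 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] && [ "$etype" = "0400" ]
}

# Usage:
#   is_elf_core ./core-coredump-test-bin-11966-1720526041 && echo "looks like a core file"
```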
Debug the demo program together with the generated core file by running gdb ./coredump-test-bin ./core-coredump-test-bin-11966-1720526041. gdb can pinpoint the exact source line that failed, as shown below:
...
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000040053c in square (a=1, b=2) at test.c:7
7           *p = 666;          // the fault is at line 7 of test.c
(gdb) backtrace                // type backtrace
#0  0x000000000040053c in square (a=1, b=2) at test.c:7     // the fault is at line 7 of test.c
#1  0x0000000000400564 in doCalc (num1=1, num2=2) at test.c:14
#2  0x0000000000400591 in main () at test.c:22
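The same inspection can be run without an interactive session, which is convenient when many core files need a first look. The sketch below wraps gdb's batch mode; analyze_core is a hypothetical helper name, and it assumes a gdb build that can read the target's architecture is installed on the host.

```shell
# Print the backtrace of every thread and the register contents, then exit.
# $1 = executable, $2 = core file.
analyze_core() {
    gdb -batch -ex "thread apply all bt" -ex "info registers" "$1" "$2"
}

# Usage:
#   analyze_core ./coredump-test-bin ./core-coredump-test-bin-11966-1720526041 > report.txt
```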
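VI. Risks and mitigations
Enabling coredump carries the following risk:
1) If a native process keeps crashing and restarting, which is common during development, core files are generated continuously and the disk fills up quickly.
Mitigation: use the quota mechanism. Assign a project_id to the directory that stores core files and set a quota threshold (an upper bound on storage); once the threshold is exceeded, the oldest files are overwritten automatically.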
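The "drop the oldest files once a threshold is exceeded" policy can be illustrated in user space. The sketch below is not the quota mechanism itself but a simpler stand-in for the same idea: it sums the sizes of the core-* files in a directory and removes the oldest ones until the total is back under a byte limit. The function names core_bytes and prune_cores, and the byte-based limit, are assumptions made for this example.

```shell
# Sum the sizes (in bytes) of all core-* files in a directory.
core_bytes() {
    total=0
    for f in "$1"/core-*; do
        [ -f "$f" ] || continue
        total=$((total + $(wc -c < "$f")))
    done
    echo "$total"
}

# Delete the oldest core-* files until the directory holds at most $2 bytes of them.
# Simplification: file names are assumed not to contain newlines.
prune_cores() {
    dir=$1
    max_bytes=$2
    while [ "$(core_bytes "$dir")" -gt "$max_bytes" ]; do
        # ls -tr sorts by modification time, oldest first.
        oldest=$(ls -1tr "$dir"/core-* 2>/dev/null | head -n 1)
        [ -n "$oldest" ] || break
        rm -f -- "$oldest" || break
    done
}

# Usage: keep at most 1 GiB of core files under /data/corefile
#   prune_cores /data/corefile 1073741824
```

Unlike a quota, this only takes effect when it is run, so it would have to be triggered periodically or after each crash.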