Android 12: Analysis of the Process Native Crash Flow

Posted by pecuyu

This article is hosted on gitee (Android Notes) and mirrored on CSDN.
The analysis is based on Android 12.

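Overview

In Android, crashes can be roughly classified as follows:

  • Java crash: usually occurs at or above the Java virtual machine layer, for example a Java crash in system_server or an app.
  • Native crash: occurs mainly in C/C++ code. system_server and apps can also crash natively, because they are all forked from zygote, and zygote itself is started by running the native program app_process.
  • Kernel crash: usually triggers a kernel panic and hangs the device; typically caused by a driver or hardware fault.

This article focuses on how logs are captured for a native crash.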

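Mechanism

The implementation is built on signals and ptrace, as follows:

  • Every Android app or native program first loads the linker at startup, which performs some initialization before control passes to the process's own logic. The linker's initialization is therefore a convenient place to set up native crash logging: during linker init, handlers are registered for a set of signals (whose default Linux action is usually to kill the process directly).
  • When a faulting process receives one of these signals, the handler intercepts it. The faulting process forks a new process, crash_dump, which attaches to the faulting process with ptrace, collects its call stacks, memory, and other information, and writes the output to an fd provided by tombstoned (obtained by connecting to tombstoned over a socket).
  • After the dump completes, the signal is re-sent to kill the faulting process (the signal handler is reset to the default before this).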
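The intercept-then-resend pattern described in the last bullet can be sketched in a few lines. This is a minimal illustration, not the debuggerd implementation: the names crash_handler, install_crash_handler, and run_crashing_child are hypothetical, the dump step is left as a comment, and the crash is simulated with raise() in a forked child so the sketch can be run safely.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical handler: a real one would collect the dump before this point.
static void crash_handler(int signo) {
  // Restore the default disposition so the re-sent signal kills the process.
  signal(signo, SIG_DFL);
  // The signal is blocked while the handler runs, so it stays pending and is
  // delivered with the default action as soon as the handler returns.
  raise(signo);
}

// Installs crash_handler for one signal, blocking all signals while it runs.
static bool install_crash_handler(int signo) {
  struct sigaction action = {};
  sigfillset(&action.sa_mask);
  action.sa_handler = crash_handler;
  action.sa_flags = SA_RESTART;
  return sigaction(signo, &action, nullptr) == 0;
}

// Forks a child that installs the handler and raises `signo`. Returns the
// signal that terminated the child, or -1 if it did not die from a signal.
int run_crashing_child(int signo) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    if (!install_crash_handler(signo)) _exit(1);
    raise(signo);
    _exit(0);  // Not reached when the re-sent signal terminates the child.
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling run_crashing_child(SIGSEGV) should report SIGSEGV as the terminating signal, which shows that the parent of the crashed process still observes the original signal even though a handler ran first.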

Flow overview

[Figure: overview of the native crash handling flow]

Flow analysis

We start from the linker's entry point, _start. For how to locate the entry point, see the references.

begin.S

// bionic/linker/arch/arm64/begin.S
ENTRY(_start)
  // Force unwinds to end in this function.
  .cfi_undefined x30

  mov x0, sp
  bl __linker_init  // call __linker_init

  /* linker init returns the _entry address in the main image */
  br x0
END(_start)

__linker_init

/// bionic/linker/linker_main.cpp
/*
 * This is the entry point for the linker, called from begin.S. This
 * method is responsible for fixing the linker's own relocations, and
 * then calling __linker_init_post_relocation().
 *
 * Because this method is called before the linker has fixed it's own
 * relocations, any attempt to reference an extern variable, extern
 * function, or other GOT reference will generate a segfault.
 */
extern "C" ElfW(Addr) __linker_init(void* raw_args) {
  // Initialize TLS early so system calls and errno work.
  KernelArgumentBlock args(raw_args);
  bionic_tcb temp_tcb __attribute__((uninitialized));
  linker_memclr(&temp_tcb, sizeof(temp_tcb));
  __libc_init_main_thread_early(args, &temp_tcb);

  ...
  // Prelink the linker so we can access linker globals.
  if (!tmp_linker_so.prelink_image()) __linker_cannot_link(args.argv[0]);
  if (!tmp_linker_so.link_image(SymbolLookupList(&tmp_linker_so), &tmp_linker_so, nullptr, nullptr)) __linker_cannot_link(args.argv[0]);

  return __linker_init_post_relocation(args, tmp_linker_so);  // continues here
}

__linker_init_post_relocation

This function performs further linker initialization; the part that matters here is the call to linker_main.

/// bionic/linker/linker_main.cpp
/*
 * This code is called after the linker has linked itself and fixed its own
 * GOT. It is safe to make references to externs and other non-local data at
 * this point. The compiler sometimes moves GOT references earlier in a
 * function, so avoid inlining this function (http://b/80503879).
 */
static ElfW(Addr) __attribute__((noinline))
__linker_init_post_relocation(KernelArgumentBlock& args, soinfo& tmp_linker_so) {
  // Finish initializing the main thread.
  __libc_init_main_thread_late();

  // We didn't protect the linker's RELRO pages in link_image because we
  // couldn't make system calls on x86 at that point, but we can now...
  if (!tmp_linker_so.protect_relro()) __linker_cannot_link(args.argv[0]);

  // And we can set VMA name for the bss section now
  set_bss_vma_name(&tmp_linker_so);

  // Initialize the linker's static libc's globals
  __libc_init_globals();

  // Initialize the linker's own global variables
  tmp_linker_so.call_constructors();

  // Setting the linker soinfo's soname can allocate heap memory, so delay it until here.
  for (const ElfW(Dyn)* d = tmp_linker_so.dynamic; d->d_tag != DT_NULL; ++d) {
    if (d->d_tag == DT_SONAME) {
      tmp_linker_so.set_soname(tmp_linker_so.get_string(d->d_un.d_val));
    }
  }

  // When the linker is run directly rather than acting as PT_INTERP, parse
  // arguments and determine the executable to load. When it's instead acting
  // as PT_INTERP, AT_ENTRY will refer to the loaded executable rather than the
  // linker's _start.
  const char* exe_to_load = nullptr;
  if (getauxval(AT_ENTRY) == reinterpret_cast<uintptr_t>(&_start)) {  // linker executed directly
    if (args.argc == 3 && !strcmp(args.argv[1], "--list")) {
      // We're being asked to behave like ldd(1).
      g_is_ldd = true;
      exe_to_load = args.argv[2];
    } else if (args.argc <= 1 || !strcmp(args.argv[1], "--help")) {
      async_safe_format_fd(STDOUT_FILENO,
         "Usage: %s [--list] PROGRAM [ARGS-FOR-PROGRAM...]\n"
         "       %s [--list] path.zip!/PROGRAM [ARGS-FOR-PROGRAM...]\n"
         "\n"
         "A helper program for linking dynamic executables. Typically, the kernel loads\n"
         "this program because it's the PT_INTERP of a dynamic executable.\n"
         "\n"
         "This program can also be run directly to load and run a dynamic executable. The\n"
         "executable can be inside a zip file if it's stored uncompressed and at a\n"
         "page-aligned offset.\n"
         "\n"
         "The --list option gives behavior equivalent to ldd(1) on other systems.\n",
         args.argv[0], args.argv[0]);
      _exit(EXIT_SUCCESS);
    } else {
      exe_to_load = args.argv[1];
      __libc_shared_globals()->initial_linker_arg_count = 1;
    }
  }

  // store argc/argv/envp to use them for calling constructors
  g_argc = args.argc - __libc_shared_globals()->initial_linker_arg_count;
  g_argv = args.argv + __libc_shared_globals()->initial_linker_arg_count;
  g_envp = args.envp;
  __libc_shared_globals()->init_progname = g_argv[0];

  // Initialize static variables. Note that in order to
  // get correct libdl_info we need to call constructors
  // before get_libdl_info().
  sonext = solist = solinker = get_libdl_info(tmp_linker_so);
  g_default_namespace.add_soinfo(solinker);
  // Enter linker_main.
  ElfW(Addr) start_address = linker_main(args, exe_to_load);

  if (g_is_ldd) _exit(EXIT_SUCCESS);

  INFO("[ Jumping to _start (%p)... ]", reinterpret_cast<void*>(start_address));

  // Return the address that the calling assembly stub should jump to.
  return start_address;
}

linker_main

/// bionic/linker/linker_main.cpp
static ElfW(Addr) linker_main(KernelArgumentBlock& args, const char* exe_to_load) {
  ...
  // Sanitize the environment.
  __libc_init_AT_SECURE(args.envp);

  // Initialize system properties
  __system_properties_init(); // may use 'environ'

  // Initialize platform properties.
  platform_properties_init();

  // Register the debuggerd signal handler.
  linker_debuggerd_init(); // sets up the signal handlers

  ...
}

linker_debuggerd_init

/// bionic/linker/linker_debuggerd_android.cpp
void linker_debuggerd_init() {
  // There may be a version mismatch between the bootstrap linker and the crash_dump in the APEX,
  // so don't pass in any process info from the bootstrap linker.
  debuggerd_callbacks_t callbacks = {
#if defined(__ANDROID_APEX__)
      .get_process_info = get_process_info,
#endif
      .post_dump = notify_gdb_of_libraries,
  };
  debuggerd_init(&callbacks);  // continues here; provided by the libdebuggerd_handler_fallback library
}

debuggerd_init

/// system/core/debuggerd/handler/debuggerd_handler.cpp
void debuggerd_init(debuggerd_callbacks_t* callbacks) {
  if (callbacks) {
    g_callbacks = *callbacks;
  }
  // Pre-allocate the debuggerd pseudothread stack and set its page protections. The pseudothread
  // created later shares its parent's address space but not its file descriptor table, which
  // ensures file descriptors are available for logging and for connecting to tombstoned.
  size_t thread_stack_pages = 8;
  void* thread_stack_allocation = mmap(nullptr, PAGE_SIZE * (thread_stack_pages + 2), PROT_NONE,
                                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  if (thread_stack_allocation == MAP_FAILED) {
    fatal_errno("failed to allocate debuggerd thread stack");
  }

  char* stack = static_cast<char*>(thread_stack_allocation) + PAGE_SIZE;
  if (mprotect(stack, PAGE_SIZE * thread_stack_pages, PROT_READ | PROT_WRITE) != 0) {
    fatal_errno("failed to mprotect debuggerd thread stack");
  }

  // Stack grows negatively, set it to the last byte in the page...
  stack = (stack + thread_stack_pages * PAGE_SIZE - 1);
  // and align it.
  stack -= 15;
  pseudothread_stack = stack; // used later as the stack passed to clone

  // Initialize the sigaction.
  struct sigaction action;
  memset(&action, 0, sizeof(action));
  sigfillset(&action.sa_mask);
  action.sa_sigaction = debuggerd_signal_handler; // the signal handler
  action.sa_flags = SA_RESTART | SA_SIGINFO;

  // Use the alternate signal stack if available so we can catch stack overflows.
  action.sa_flags |= SA_ONSTACK; // run on a separate stack so stack overflows can be caught

#define SA_EXPOSE_TAGBITS 0x00000800
  // Request that the kernel set tag bits in the fault address. This is necessary for diagnosing MTE
  // faults.
  action.sa_flags |= SA_EXPOSE_TAGBITS;

  debuggerd_register_handlers(&action);  // install action for the crash signals
}
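The stack setup in debuggerd_init reserves two extra pages and leaves them PROT_NONE, so an overflow in either direction faults instead of silently corrupting neighboring memory. The following is a standalone sketch of that layout only; GuardedStack, allocate_guarded_stack, stack_top, and release_guarded_stack are hypothetical names, and sysconf(_SC_PAGESIZE) stands in for the PAGE_SIZE macro used by the original.

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

struct GuardedStack {
  void* base = nullptr;    // Start of the whole mapping (the low guard page).
  char* usable = nullptr;  // First usable byte, just above the low guard page.
  size_t usable_size = 0;  // Size of the read/write region in bytes.
  size_t total_size = 0;   // Usable region plus the two guard pages.
};

// Maps (pages + 2) pages as PROT_NONE, then opens up only the middle `pages`.
bool allocate_guarded_stack(size_t pages, GuardedStack* out) {
  const size_t page_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  const size_t total = page_size * (pages + 2);
  void* base = mmap(nullptr, total, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  if (base == MAP_FAILED) return false;

  char* usable = static_cast<char*>(base) + page_size;
  if (mprotect(usable, page_size * pages, PROT_READ | PROT_WRITE) != 0) {
    munmap(base, total);
    return false;
  }
  out->base = base;
  out->usable = usable;
  out->usable_size = page_size * pages;
  out->total_size = total;
  return true;
}

// The stack grows downward, so the initial stack pointer is the highest
// 16-byte-aligned address inside the usable region (last byte minus 15).
char* stack_top(const GuardedStack& stack) {
  return stack.usable + stack.usable_size - 16;
}

void release_guarded_stack(GuardedStack* stack) {
  if (stack->base != nullptr) munmap(stack->base, stack->total_size);
  *stack = GuardedStack{};
}
```

Because the usable region starts on a page boundary, subtracting 16 from its end gives the same address as the original's "last byte, then minus 15" arithmetic.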

debuggerd_register_handlers

/// @system/core/debuggerd/include/debuggerd/handler.h
// DEBUGGER_ACTION_DUMP_TOMBSTONE and DEBUGGER_ACTION_DUMP_BACKTRACE are both
// triggered via BIONIC_SIGNAL_DEBUGGER. The debugger_action_t is sent via si_value
// using sigqueue(2) or equivalent. If no si_value is specified (e.g. if the
// signal is sent by kill(2)), the default behavior is to print the backtrace
// to the log.
  // The debuggerd signal is used to dump traces  ---   35 (__SIGRTMIN + 3)        debuggerd
#define DEBUGGER_SIGNAL BIONIC_SIGNAL_DEBUGGER 

static void __attribute__((__unused__)) debuggerd_register_handlers(struct sigaction* action) {
  char value[PROP_VALUE_MAX] = "";
  bool enabled =
      !(__system_property_get("ro.debuggable", value) > 0 && !strcmp(value, "1") &&
        __system_property_get("debug.debuggerd.disable", value) > 0 && !strcmp(value, "1"));
  if (enabled) {  // A switch: on a debuggable build with debug.debuggerd.disable=1, the handlers below are not registered.
    sigaction(SIGABRT, action, nullptr);
    sigaction(SIGBUS, action, nullptr);
    sigaction(SIGFPE, action, nullptr);
    sigaction(SIGILL, action, nullptr);
    sigaction(SIGSEGV, action, nullptr);
    sigaction(SIGSTKFLT, action, nullptr);
    sigaction(SIGSYS, action, nullptr);
    sigaction(SIGTRAP, action, nullptr);
  }

  sigaction(BIONIC_SIGNAL_DEBUGGER, action, nullptr);  // always install the handler for the debuggerd signal
}
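The double negation in the enabled expression is easy to misread, so here is the same condition as a small pure function. This is only a sketch: crash_handlers_enabled is a hypothetical name, and the property values are passed in as strings (an empty string standing for an unset property) instead of being read with __system_property_get.

```cpp
#include <string>

// Returns true when the fatal-signal handlers should be installed.
// They are skipped only if BOTH properties are set to "1":
//   ro.debuggable == "1"  and  debug.debuggerd.disable == "1".
bool crash_handlers_enabled(const std::string& ro_debuggable,
                            const std::string& debuggerd_disable) {
  const bool debuggable = ro_debuggable == "1";
  const bool disabled = debuggerd_disable == "1";
  return !(debuggable && disabled);
}
```

On a non-debuggable build the disable property therefore has no effect, and in every case the handler for BIONIC_SIGNAL_DEBUGGER is still installed, because that sigaction call sits outside the if block.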

Android reserves several real-time signals for internal use; their definitions are shown below:

/// @bionic/libc/platform/bionic/reserved_signals.h
// Realtime signals reserved for internal use:
//   32 (__SIGRTMIN + 0)        POSIX timers
//   33 (__SIGRTMIN + 1)        libbacktrace
//   34 (__SIGRTMIN + 2)        libcore
//   35 (__SIGRTMIN + 3)        debuggerd
//   36 (__SIGRTMIN + 4)        platform profilers (heapprofd, traced_perf)
//   37 (__SIGRTMIN + 5)        coverage (libprofile-extras)
//   38 (__SIGRTMIN + 6)        heapprofd ART managed heap dumps
//   39 (__SIGRTMIN + 7)        fdtrack
//   40 (__SIGRTMIN + 8)        android_run_on_all_threads (bionic/pthread_internal.cpp)

#define BIONIC_SIGNAL_POSIX_TIMERS (__SIGRTMIN + 0)
#define BIONIC_SIGNAL_BACKTRACE (__SIGRTMIN + 1)
#define BIONIC_SIGNAL_DEBUGGER (__SIGRTMIN + 3)
#define BIONIC_SIGNAL_PROFILER (__SIGRTMIN + 4)
#define BIONIC_SIGNAL_ART_PROFILER (__SIGRTMIN + 6)
#define BIONIC_SIGNAL_FDTRACK (__SIGRTMIN + 7)
#define BIONIC_SIGNAL_RUN_ON_ALL_THREADS (__SIGRTMIN + 8)
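As the handler.h comment quoted earlier notes, a request sent on the debuggerd signal carries its action in si_value, delivered with sigqueue(2). The sketch below shows that delivery mechanism in isolation on a generic Linux system. It is not debuggerd code: roundtrip_signal_value is a hypothetical helper, the caller picks an arbitrary real-time signal, and the payload is an opaque integer. It assumes a single-threaded caller, since it uses sigprocmask.

```cpp
#include <signal.h>
#include <unistd.h>

// Queues `signo` to the calling process with an integer payload, then
// synchronously dequeues it. Returns the received payload, or -1 on failure.
int roundtrip_signal_value(int signo, int payload) {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, signo);

  // Block the signal so it stays pending until sigwaitinfo collects it.
  sigset_t old_mask;
  if (sigprocmask(SIG_BLOCK, &set, &old_mask) != 0) return -1;

  union sigval value;
  value.sival_int = payload;

  int result = -1;
  if (sigqueue(getpid(), signo, value) == 0) {
    siginfo_t info;
    // si_code is SI_QUEUE for signals sent with sigqueue; kill(2) would
    // produce SI_USER and carry no payload.
    if (sigwaitinfo(&set, &info) == signo && info.si_code == SI_QUEUE) {
      result = info.si_value.sival_int;
    }
  }
  sigprocmask(SIG_SETMASK, &old_mask, nullptr);
  return result;
}
```

A call such as roundtrip_signal_value(SIGRTMIN + 1, 1) returns the payload unchanged. The SI_QUEUE check is the same kind of test debuggerd_signal_handler later applies before it trusts the value in si_ptr.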

Signals

In a Linux shell, the following command lists each signal's number and meaning:

# kill -l
 1    HUP Hangup                           23    URG Urgent I/O condition             45     45 Signal 45
 2    INT Interrupt                        24   XCPU CPU time limit exceeded          46     46 Signal 46
 3   QUIT Quit                             25   XFSZ File size limit exceeded         47     47 Signal 47
 4    ILL Illegal instruction              26 VTALRM Virtual timer expired            48     48 Signal 48
 5   TRAP Trap                             27   PROF Profiling timer expired          49     49 Signal 49
 6   ABRT Aborted                          28  WINCH Window size changed              50     50 Signal 50
 7    BUS Bus error                        29     IO I/O possible                     51     51 Signal 51
 8    FPE Floating point exception         30    PWR Power failure                    52     52 Signal 52
 9   KILL Killed                           31    SYS Bad system call                  53     53 Signal 53
10   USR1 User signal 1                    32     32 Signal 32                        54     54 Signal 54
11   SEGV Segmentation fault               33     33 Signal 33                        55     55 Signal 55
12   USR2 User signal 2                    34     34 Signal 34                        56     56 Signal 56
13   PIPE Broken pipe                      35     35 Signal 35                        57     57 Signal 57
14   ALRM Alarm clock                      36     36 Signal 36                        58     58 Signal 58
15   TERM Terminated                       37     37 Signal 37                        59     59 Signal 59
16 STKFLT Stack fault                      38     38 Signal 38                        60     60 Signal 60
17   CHLD Child exited                     39     39 Signal 39                        61     61 Signal 61
18   CONT Continue                         40     40 Signal 40                        62     62 Signal 62
19   STOP Stopped (signal)                 41     41 Signal 41                        63     63 Signal 63
20   TSTP Stopped                          42     42 Signal 42                        64     64 Signal 64
21   TTIN Stopped (tty input)              43     43 Signal 43
22   TTOU Stopped (tty output)             44     44 Signal 44

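The most common fatal signals are:

  • 11 SEGV Segmentation fault
    • Dereferencing a null, uninitialized, or already-freed pointer
    • Accessing misaligned memory
    • Writing to read-only memory
    • Reading or writing outside an allocated region
    • Other memory corruption
  • 6 ABRT Aborted: usually the program called abort itself; the tombstone file generally contains the abort message
  • 7 BUS Bus error: for example, a memory alignment problem
  • 4 ILL Illegal instruction
  • 8 FPE Floating point exception: an illegal arithmetic operation, such as dividing by zero
  • 13 PIPE Broken pipe: for example, writing to a socket that has already been closed
  • 3 QUIT Quit: Android intercepts this signal for app processes so that a trace dump can be taken by running kill -3 $pid
  • 35 debuggerd signal: Android-specific, used to dump traces

When a process crashes, it receives the corresponding signal, and the handler installed earlier processes it.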

debuggerd_signal_handler

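The handler proceeds as follows:

  • Log a summary of the crash signal
  • Use clone to create a child thread that performs the dump
  • Wait for the dump to finish
  • Re-send the signal to kill the process itself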
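The "clone a child, then wait for it" steps in the list above can be illustrated with a small Linux sketch. It is a simplification under stated assumptions, not the debuggerd code: DumpRequest, dump_worker, and dump_in_cloned_child are hypothetical names, the child is created with only CLONE_VM | SIGCHLD so it can be reaped with waitpid, and the stack comes from malloc. A real signal handler cannot call malloc, which is not async-signal-safe; that is why debuggerd_init preallocates pseudothread_stack.

```cpp
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <cstddef>
#include <cstdlib>

struct DumpRequest {
  int signal_number;
  bool dumped;
};

// Runs in the cloned child. With CLONE_VM the child shares the parent's
// address space, so this write is visible to the parent.
static int dump_worker(void* arg) {
  auto* request = static_cast<DumpRequest*>(arg);
  request->dumped = true;  // Placeholder for the real dump work.
  return 0;
}

// Clones a child that shares memory with the caller, waits for it to exit,
// and reports whether the worker ran to completion.
bool dump_in_cloned_child(int signal_number) {
  constexpr size_t kStackSize = 64 * 1024;
  char* stack = static_cast<char*>(malloc(kStackSize));
  if (stack == nullptr) return false;

  DumpRequest request{signal_number, false};
  // The stack grows downward, so pass the top of the buffer. CLONE_FILES is
  // deliberately omitted: the child gets its own copy of the fd table.
  pid_t pid = clone(dump_worker, stack + kStackSize, CLONE_VM | SIGCHLD, &request);

  bool ok = false;
  if (pid > 0) {
    int status = 0;
    ok = waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
         WEXITSTATUS(status) == 0 && request.dumped;
  }
  free(stack);
  return ok;
}
```

Leaving out CLONE_FILES matches the comment in debuggerd_init: the child shares the address space but not the file descriptor table, so descriptors it opens or closes do not disturb the crashing process.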
/// @system/core/debuggerd/handler/debuggerd_handler.cpp
// Handler that does crash dumping by forking and doing the processing in the child.
// Do this by ptracing the relevant thread, and then execing debuggerd to do the actual dump.
static void debuggerd_signal_handler(int signal_number, siginfo_t* info, void* context) {
  // Make sure we don't change the value of errno, in case a signal comes in between the process
  // making a syscall and checking errno.
  ErrnoRestorer restorer;

  auto *ucontext = static_cast<ucontext_t*>(context);

  // It's possible somebody cleared the SA_SIGINFO flag, which would mean
  // our "info" arg holds an undefined value.
  if (!have_siginfo(signal_number)) {
    info = nullptr;
  }

  struct siginfo dummy_info = {};
  if (!info) {  // build the summary info, i.e. the first line that gets logged
    memset(&dummy_info, 0, sizeof(dummy_info));
    dummy_info.si_signo = signal_number;
    dummy_info.si_code = SI_USER;
    dummy_info.si_pid = __getpid();
    dummy_info.si_uid = getuid();
    info = &dummy_info;
  } else if (info->si_code >= 0 || info->si_code == SI_TKILL) {
    // rt_tgsigqueueinfo(2)'s documentation appears to be incorrect on kernels
    // that contain commit 66dd34a (3.9+). The manpage claims to only allow
    // negative si_code values that are not SI_TKILL, but 66dd34a changed the
    // check to allow all si_code values in calls coming from inside the house.
  }

  debugger_process_info process_info = {};
  uintptr_t si_val = reinterpret_cast<uintptr_t>(info->si_ptr);
  if (signal_number == BIONIC_SIGNAL_DEBUGGER) {  // is this the debuggerd signal?
    if (info->si_code == SI_QUEUE && info->si_pid == __getpid()) {
      // Allow for the abort message to be explicitly specified via the sigqueue value.
      // Keep the bottom bit intact for representing whether we want a backtrace or a tombstone.
      if (si_val != kDebuggerdFallbackSivalUintptrRequestDump) {
        process_info.abort_msg = reinterpret_cast<void*>(si_val & ~1);
        info->si_ptr = reinterpret_cast<void*>(si_val & 1);
      }
    }
  } else if (g_callbacks.get_process_info) {
    process_info = g_callbacks.get_process_info();
  }

  // If sival_int is ~0, it means that the fallback handler has been called
  // once before and this function is being called again to dump the stack
  // of a specific thread. It is possible that the prctl call might return 1,
  // then return 0 in subsequent calls, so check the sival_int to determine if
  // the fallback handler should be called first.
  if (si_val == kDebuggerdFallbackSivalUintptrRequestDump ||
      prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1) {
    // This check might be racy if another thread sets NO_NEW_PRIVS, but this should be unlikely,
    // you can only set NO_NEW_PRIVS to 1, and the effect should be at worst a single missing
    // ANR trace.
    debuggerd_fallback_handler(info, ucontext, process_info.abort_msg);
    resend_signal(info);
    return;
  }
  // Only allow one thread to handle 
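The si_val handling in the handler above packs two things into one pointer-sized value: an abort message pointer in the upper bits and a one-bit flag in the lowest bit. The sketch below shows only that bit manipulation. DecodedSival, encode_sival, and decode_sival are hypothetical names, and the sketch assumes the message pointer is at least 2-byte aligned so that its lowest bit is free.

```cpp
#include <cstdint>

struct DecodedSival {
  void* abort_msg;    // Pointer with the lowest bit cleared.
  uintptr_t low_bit;  // The flag bit (0 or 1) that stays in si_ptr.
};

// Splits a sigqueue value the same way the handler does:
// abort_msg = si_val & ~1, and si_ptr keeps si_val & 1.
DecodedSival decode_sival(uintptr_t si_val) {
  return {reinterpret_cast<void*>(si_val & ~uintptr_t{1}), si_val & uintptr_t{1}};
}

// Inverse operation: a sender would combine an aligned pointer with the flag.
uintptr_t encode_sival(const void* abort_msg, bool low_bit) {
  return reinterpret_cast<uintptr_t>(abort_msg) | (low_bit ? uintptr_t{1} : uintptr_t{0});
}
```

Clearing the lowest bit before dereferencing is what makes the trick safe: the pointer itself is never modified in a way that loses information, provided the alignment assumption holds.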
