Android系统服务死锁Anr检测机制

Posted Nipuream

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Android系统服务死锁Anr检测机制相关的知识,希望对你有一定的参考价值。

android系统服务死锁、ANR检测机制

Android系统运行以后,System_server中可能有成百上千个线程在运行,各种服务之间调用很频繁,也很复杂,难免会出现死锁和长时间未响应的问题。这个问题对于系统来说是非常严重的,因为一旦出现这种情况,会导致一系列的并发症,最终会导致界面卡死,手机耗电急剧上升,发热严重。当然,我们要做的第一步是尽量避免此情况的发生,这种需要大量的测试和实践,Android系统现在已经做的很不错了,但是也要考虑一旦出现这种情况,系统对此的处理。本文主要来回顾下framework层 Watchdog、anr检测、处理相关的知识。

Watchdog检测原理

watchdog主要对系统重要的服务进行检测和处理,下来从源码的角度来分析它如何实现的。watchdog首先本身是一个线程,继承于Thread,在system_server初始化的过程中启动。

    private Watchdog() 
        super("watchdog");

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
    

首先,在它初始化过程中,将几个重要的线程添加到mHandlerCheckers中,这些线程全都是事件驱动线程,继承于HandlerThread,而HandlerChecker本身是个Runnable对象。前台线程也是最主要的检测者,外界服务添加monitor check都是添加到mMonitorChecker中。

    public void addMonitor(Monitor monitor) 
        synchronized (this) 
            if (isAlive()) 
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            
            mMonitorChecker.addMonitor(monitor);
        
    

接下来看看Watchdog运行之后做了什么事情:

 @Override
    public void run() 
        boolean waitedHalf = false;
        while (true) 
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;  //可动态设置,当发生死锁,系统是否需要重启
            int debuggerWasConnected = 0;
            synchronized (this) 
                long timeout = CHECK_INTERVAL; // 30s
                //会调用每个线程对应的HandlerCheckers的scheduleCheckLocked方法
                //HandlerChecker中又持有该线程Handler引用,Handler又能获取到Looper
                for (int i=0; i<mHandlerCheckers.size(); i++) 
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                

                if (debuggerWasConnected > 0) 
                    debuggerWasConnected--;
                

                //记录开始时间
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) 
                    if (Debug.isDebuggerConnected()) 
                        debuggerWasConnected = 2;
                    
                    try 
                        wait(timeout);  //等待30s
                     catch (InterruptedException e) 
                        Log.wtf(TAG, e);
                    
                    if (Debug.isDebuggerConnected()) 
                        debuggerWasConnected = 2;
                    
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                

                //这个方法稍后分析,waitState 是执行完获取HandlerCheck检测结果
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED)  //代表没有死锁的发生,重新开始
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                 else if (waitState == WAITING) //还是等待中
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                 else if (waitState == WAITED_HALF)  
                    //如果30s内HandleCheck未执行完,则打印native进程状态
                    if (!waitedHalf) 
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    
                    continue;
                

                //如果1分钟还未执行完,则获取哪些HandlerChecker堵塞了。
                blockedCheckers = getBlockedCheckersLocked();
                //将堵塞详细信息打印出来
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            

            //记录到EventLog中
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            //打印核心native进程堆栈信息
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            //等待两秒
            SystemClock.sleep(2000);

            //打印kernel线程执行堆栈信息
            if (RECORD_KERNEL_THREADS) 
                dumpKernelStackTraces();
            

            //触发kernel打印所有堵塞线程调用栈信息
            try 
                FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
                sysrq_trigger.write("w");
                sysrq_trigger.close();
             catch (IOException e) 
                Slog.e(TAG, "Failed to write to /proc/sysrq-trigger");
                Slog.e(TAG, e.getMessage());
            

            //给两秒时间记录到 dropbox中 (data/system/dropbox)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") 
                    public void run() 
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    
                ;
            dropboxThread.start();
            try 
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
             catch (InterruptedException ignored) 

            //...

            //这里在调试模式中和当allowRestart为false的情况下,不允许杀死进程
            if (debuggerWasConnected >= 2) 
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
             else if (debuggerWasConnected > 0) 
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
             else if (!allowRestart) 
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
             else 
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) 
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) 
                        Slog.w(TAG, "    at " + element);
                    
                
                //杀死system_server进程,导致zygote挂掉了,从导致Android重新进入Java世界
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            

            waitedHalf = false;
        
    

上个方法代码比较多,做的事情就三点:

  1. 执行HandlerChecker中的scheduleCheckLocked方法,通过handler引用的looper对象,将自己丢入对应线程的消息队列中,执行死锁检测。
  2. while循环中,每过30s会查看下HandlerChecker的检测结果,如果没有发生堵塞,则从新开始,如果堵塞了,则进入第三步。
  3. 将堵塞线程调用堆栈打印出来,搜集各类日志,包括kernel堵塞线程堆栈,核心native进程 dump信息,并持久化,最后杀死自己,让init进程重启自己。

下面分别学习下第一步、第二步分别做了什么事情:

        public void scheduleCheckLocked() 
            if (mMonitors.size() == 0 && mHandler.getLooper().isIdling()) 
                mCompleted = true;
                return;
            

            if (!mCompleted) 
                // we already have a check in flight, so no need
                return;
            

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            //将自己丢入MessageQueue中
            mHandler.postAtFrontOfQueue(this);
        

        //当线程执行到这个消息的时候,进来
        @Override
        public void run() 
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) 
                synchronized (Watchdog.this) 
                    mCurrentMonitor = mMonitors.get(i);
                
                //其实就是执行每个Monitor.monitor方法
                mCurrentMonitor.monitor();
            

            //如果没有发生堵塞,则完成检测,否则就卡在上面了。
            synchronized (Watchdog.this) 
                mCompleted = true;
                mCurrentMonitor = null;
            
        

//下面是检测AMS的例子,其他每个服务都是如此实现得。
public final class ActivityManagerService extends ActivityManagerNative
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback 
    ...
     //如果发生死锁,则无法获取到锁对象,注意外界调用AMS的方法,同步都是使用AMS实例这把“锁”
     public void monitor() 
        synchronized (this)  
    
    ...


根据上文分析,Watchdog执行完HandlerChecker的scheduleCheckLocked()方法后,会等待30s,然后执行getBlockedCheckersLocked方法:

    private ArrayList<HandlerChecker> getBlockedCheckersLocked() 
        ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
        for (int i=0; i<mHandlerCheckers.size(); i++) 
            HandlerChecker hc = mHandlerCheckers.get(i);
            if (hc.isOverdueLocked()) 
                checkers.add(hc);
            
        
        return checkers;
    

    public boolean isOverdueLocked() 
         return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
    

ANR检测机制和处理

首先来看看Android系统在哪些情况会触发anr:

  • 前台服务20s内未执行完成
  • 前台广播10s内未执行完成,后台广播20s内未执行完成
  • 内容提供者执行publishProvider,超时10s
  • 输入事件超时5s

系统触发Anr的原因主要是因为影响超时,影响到用户使用和体验。虽然系统触发anr的地方有好几种,但是检测机制和处理机制其实是差不多的,都是在操作前记录时间,然后向消息队列中丢向一个触发Anr的消息,再执行完响应的操作的时候将消息移除,如果超过指定时间没有移除,那么则会触发anr操作,下面来具体看下广播执行超时触发anr的过程。

private final class BroadcastHandler extends Handler 
    public BroadcastHandler(Looper looper) 
        super(looper, null, true);
    

    @Override
    public void handleMessage(Message msg) 
        switch (msg.what) 
                //接受intent处理
            case BROADCAST_INTENT_MSG: 
                if (DEBUG_BROADCAST) Slog.v(
                    TAG, "Received BROADCAST_INTENT_MSG");
                processNextBroadcast(true);
             break;
                //消息超时处理
            case BROADCAST_TIMEOUT_MSG: 
                synchronized (mService) 
                    broadcastTimeoutLocked(true);
                
             break;
        
    
;

//获取下一条广播
int recIdx = r.nextReceiver++;

//记录当时时间
r.receiverTime = SystemClock.uptimeMillis();
if (recIdx == 0) 
    r.dispatchTime = r.receiverTime;
    r.dispatchClockTime = System.currentTimeMillis();
    if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG, "Processing ordered broadcast ["
                                      + mQueueName + "] " + r);

if (! mPendingBroadcastTimeoutMessage) 
    long timeoutTime = r.receiverTime + mTimeoutPeriod;
    if (DEBUG_BROADCAST) Slog.v(TAG,
                                "Submitting BROADCAST_TIMEOUT_MSG ["
                                + mQueueName + "] for " + r + " at " + timeoutTime);
    //向消息队列中丢向Anr触发的延时消息
    setBroadcastTimeoutLocked(timeoutTime);


final void setBroadcastTimeoutLocked(long timeoutTime) 
    if (! mPendingBroadcastTimeoutMessage) 
        Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
        mHandler.sendMessageAtTime(msg, timeoutTime);
        mPendingBroadcastTimeoutMessage = true;
    

下面是取消Anr触发的延时消息代码:

if (r.receivers == null || r.nextReceiver >= numReceivers
    || r.resultAbort || forceReceive) 
    // No more receivers for this broadcast!  Send the final
    // result if requested...
    if (r.resultTo != null) 
        try 
            if (DEBUG_BROADCAST) 
                int seq = r.intent.getIntExtra("seq", -1);
                Slog.i(TAG, "Finishing broadcast ["
                       + mQueueName + "] " + r.intent.getAction()
                       + " seq=" + seq + " app=" + r.callerApp);
            
            //处理事件
            performReceiveLocked(r.callerApp, r.resultTo,
                                 new Intent(r.intent), r.resultCode,
                                 r.resultData, r.resultExtras, false, false, r.userId);
            // Set this to null so that the reference
            // (local and remote) isn't kept in the mBroadcastHistory.
            r.resultTo = null;
         catch (RemoteException e) 
            r.resultTo = null;
            Slog.w(TAG, "Failure ["
                   + mQueueName + "] sending broadcast result of "
                   + r.intent, e);
        
    

    //处理完事件,取消消息
    cancelBroadcastTimeoutLocked();

    // ... and on to the next...
    addBroadcastToHistoryLocked(r);
    mOrderedBroadcasts.remove(0);
    r = null;
    looped = true;
    continue;

Android 进阶——Framework 核心ANR( Applicatipon No Response)机制设计思想详解

Android 进阶——Framework 核心ANR( Applicatipon No Response)机制设计思想详解

Android 进阶——Framework 核心ANR( Applicatipon No Response)机制设计思想详解

NFC问题分析之死锁引起的ANR

Android带你细看Android input系统中ANR的机制

Android带你细看Android input系统中ANR的机制