Android 12 Watchdog Workflow

Posted 2022-05-02 pecuyu

This article is hosted on gitee (Android Notes) and mirrored on CSDN.

Watchdog workflow overview

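As described in the previous article, calling Watchdog#start starts the Watchdog's internal worker thread. That thread then invokes the run method, where Watchdog executes its monitoring logic.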

The work Watchdog does can be summarized as follows:

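Every CHECK_INTERVAL, Watchdog iterates over all elements of the HandlerChecker list and calls HandlerChecker#scheduleCheckLocked on each one to run the check. Concretely, each HandlerChecker posts itself as a Runnable to the front of its Handler's message queue and waits for that Runnable to be executed.
If the Runnable gets executed, the Handler thread is not blocked and the Runnable's run method is called. It then iterates over the HandlerChecker's Monitor list and calls Monitor#monitor (for example AMS#monitor) to determine whether each monitored object is in a healthy state. If the Runnable is not executed for a long time, or a Monitor#monitor call does not return for a long time, the Handler thread is probably stuck or the monitored object is in an abnormal state and can no longer take on new work.
Watchdog sleeps for CHECK_INTERVAL (normally 30s) and then calls evaluateCheckerCompletionLocked to evaluate the results. The possible states are:
- 0 COMPLETED: all tasks have completed
- 1 WAITING: every pending task has been waiting for less than CHECK_INTERVAL
- 2 WAITED_HALF: some task has been waiting for more than CHECK_INTERVAL but less than DEFAULT_TIMEOUT
- 3 OVERDUE: some task has been waiting for more than DEFAULT_TIMEOUT
If the evaluated result is COMPLETED or WAITING, the next monitoring round starts. If it is WAITED_HALF, Watchdog dumps traces of the interesting processes and then starts the next round. If it is OVERDUE, Watchdog dumps the traces again along with kernel and binder information and then restarts the system (at the framework level); in some cases, such as a monkey run, it may not restart.
The policy of an IActivityController can affect this workflow: on OVERDUE it may tell the system to keep waiting or let the kill-system path proceed.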
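The per-round decision described above can be condensed into one small function. The following is only a sketch: the class RoundDecision, the Action enum and the decide method are hypothetical names used for illustration and do not exist in Watchdog, which expresses the same logic inline in its run loop.

```java
// Hypothetical sketch: maps an evaluated completion state to what the
// Watchdog loop does next. Names are illustrative, not taken from AOSP.
final class RoundDecision {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    enum Action {
        NEXT_ROUND,                  // nothing to report, schedule the next check
        DUMP_TRACES_THEN_NEXT_ROUND, // first trace dump, then keep waiting
        DUMP_AND_MAYBE_RESTART       // second dump; restart unless something vetoes it
    }

    static Action decide(int state, boolean alreadyDumpedHalf) {
        switch (state) {
            case COMPLETED:
            case WAITING:
                return Action.NEXT_ROUND;
            case WAITED_HALF:
                // The half-way dump happens only once per stall.
                return alreadyDumpedHalf
                        ? Action.NEXT_ROUND
                        : Action.DUMP_TRACES_THEN_NEXT_ROUND;
            default:
                return Action.DUMP_AND_MAYBE_RESTART;
        }
    }
}
```

The "maybe" in the last action stands for the IActivityController, debugger and allow-restart checks covered later in the article.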

HandlerChecker implementation

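This is a key class where most of the functionality is implemented, so it is described in some detail below. As its class comment says, it checks the status of handler threads and schedules the monitor callbacks, that is, it executes Monitor#monitor. The class implements Runnable so that it can be posted as the check task.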

/**
 * Used for checking status of handle threads and scheduling monitor callbacks.
 */
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
    ...
}

HandlerChecker constructor

Parameters:

handler: the Handler of the thread being monitored
name: the name of the monitored thread
waitMaxMillis: the maximum time to wait for completion

HandlerChecker(Handler handler, String name, long waitMaxMillis) {
    mHandler = handler;
    mName = name;
    mWaitMax = waitMaxMillis;
    mCompleted = true;
}

HandlerChecker#scheduleCheckLocked

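scheduleCheckLocked performs the actual check by posting a Runnable to the front of the message queue of the monitored Handler. This serves two purposes:

Whether the message gets executed reveals the state of the monitored thread: if the thread is stuck, the Runnable will not be executed for a long time.
When the Runnable does execute, it calls Monitor#monitor, so that work runs on the monitored thread rather than on the Watchdog thread. The call also checks whether the monitored objects are in a healthy state.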

public void scheduleCheckLocked() {
    if (mCompleted) { // Move monitors from mMonitorQueue into mMonitors; mMonitors then stays unchanged while a check is in flight, which is a deliberate safety measure
        // Safe to update monitors in queue, Handler is not in the middle of work
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    // Do not schedule if there are no monitors and the message queue is polling
    // (i.e. waiting for new messages), or if this checker has been paused
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) {
        // Don't schedule until after resume OR
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked.  This avoid having
        // to do a context switch to check the thread. Note that we
        // only do this if we have no monitors since those would need to
        // be executed at this point.
        mCompleted = true;
        return;
    }
    if (!mCompleted) { // a check has already been scheduled; return
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this); // post this Runnable to the front of the message queue
}
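To see why posting to the front of the queue matters, here is a minimal plain-Java sketch. It uses an ArrayDeque as a stand-in for the Handler's message queue, which is an assumption made for illustration: the real MessageQueue is time-ordered and considerably more involved. FrontOfQueueDemo and drain are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: an ArrayDeque stands in for the message queue.
final class FrontOfQueueDemo {
    /** Enqueues two pending tasks, then a check at the front, and runs them all. */
    static List<String> drain() {
        Deque<Runnable> queue = new ArrayDeque<>();
        List<String> order = new ArrayList<>();

        // Work that was already waiting on the thread.
        queue.addLast(() -> order.add("pending-1"));
        queue.addLast(() -> order.add("pending-2"));

        // The check jumps the backlog, like postAtFrontOfQueue.
        queue.addFirst(() -> order.add("watchdog-check"));

        while (!queue.isEmpty()) {
            queue.pollFirst().run();
        }
        return order;
    }
}
```

With this ordering the check's latency reflects only the message currently being executed, not the backlog behind it, so a busy but healthy thread is less likely to be mistaken for a stuck one.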

HandlerChecker#run

When its run method executes, the monitored thread is not blocked.

@Override
public void run() {
    // Once we get here, we ensure that mMonitors does not change even if we call
    // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
    // move them to mMonitors on the next schedule when mCompleted is true, at which
    // point we have completed execution of this method.
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) { // iterate over mMonitors and call monitor() on each
        synchronized (mLock) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (mLock) { // all monitors have returned; set mCompleted to true
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
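A monitor() implementation usually does nothing except acquire and release the lock of the object it guards; AMS#monitor, for instance, is an empty synchronized block. If another thread holds that lock indefinitely, monitor() never returns and the checker never completes. The sketch below demonstrates this lock-probe pattern in plain Java. LockProbeDemo, Service and probeBlocksWhileLockHeld are hypothetical names, and the Monitor interface is redeclared locally so the example is self-contained.

```java
// Hypothetical sketch of the lock-probe pattern behind Monitor#monitor.
final class LockProbeDemo {
    interface Monitor {
        void monitor();
    }

    static final class Service implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) {
                // Nothing to do: being able to take the lock is the check.
            }
        }
    }

    /** Returns true if monitor() is still blocked after waitMillis while the lock is held. */
    static boolean probeBlocksWhileLockHeld(long waitMillis) {
        Service service = new Service();
        Thread checker = new Thread(service::monitor, "checker");
        boolean blocked;
        try {
            synchronized (service.lock) {
                checker.start();
                checker.join(waitMillis); // waits on the thread, keeps service.lock held
                blocked = checker.isAlive();
            }
            checker.join(); // lock released, so monitor() can now return
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return blocked;
    }
}
```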

HandlerChecker#getCompletionStateLocked

When Watchdog evaluates whether the tasks have completed, it calls getCompletionStateLocked:

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) { // waited less than half of mWaitMax
            return WAITING;
        } else if (latency < mWaitMax) { // waited at least half of mWaitMax
            return WAITED_HALF;
        }
    }
    return OVERDUE; // exceeded the maximum wait time
}
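The thresholds are easier to see with concrete numbers. The sketch below restates the same comparison as a pure function so it can be run outside Android; CompletionState and classify are hypothetical names, and the 60000 ms used in the usage note is only an example value consistent with the 30s CHECK_INTERVAL mentioned earlier.

```java
// Hypothetical sketch: the completion-state comparison as a pure function.
final class CompletionState {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    static int classify(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;      // less than half of the limit used up
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;  // past the half-way mark, not yet at the limit
        }
        return OVERDUE;          // at or beyond the limit
    }
}
```

With waitMaxMs = 60000, a pending check is WAITING at 10000 ms, WAITED_HALF at 30000 ms and 40000 ms, and OVERDUE from 60000 ms onward.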

Pausing and resuming checks

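In some special cases the check needs to be paused. For example, starting PackageManagerService can take quite a long time, but that time is expected and must not be treated as an error, so the check is paused for its duration.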

t.traceBegin("StartPackageManagerService");
try {
    Watchdog.getInstance().pauseWatchingCurrentThread("packagemanagermain");
    mPackageManagerService = PackageManagerService.main(mSystemContext, installer,
            domainVerificationService, mFactoryTestMode != FactoryTest.FACTORY_TEST_OFF,
            mOnlyCore);
} finally {
    Watchdog.getInstance().resumeWatchingCurrentThread("packagemanagermain");
}

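Watchdog#pauseWatchingCurrentThread pauses checking for the current thread, and Watchdog#resumeWatchingCurrentThread resumes it. The implementation iterates over all HandlerCheckers, finds those whose monitored thread is the current thread, and calls HandlerChecker#pauseLocked on them when pausing. Resuming similarly calls HandlerChecker#resumeLocked.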

/**
 * Pauses Watchdog action for the currently running thread. Useful before executing long running
 * operations that could falsely trigger the watchdog. Each call to this will require a matching
 * call to {@link #resumeWatchingCurrentThread}.
 *
 * <p>If the current thread has not been added to the Watchdog, this call is a no-op.
 *
 * <p>If the Watchdog is already paused for the current thread, this call
 * adds another pause and will require an additional {@link #resumeCurrentThread} to resume.
 *
 * <p>Note: Use with care, as any deadlocks on the current thread will be undetected until all
 * pauses have been resumed.
 */
public void pauseWatchingCurrentThread(String reason) {
    synchronized (mLock) {
        for (HandlerChecker hc : mHandlerCheckers) {
            if (Thread.currentThread().equals(hc.getThread())) {
                hc.pauseLocked(reason);
            }
        }
    }
}

Pausing and resuming are in fact implemented by adjusting mPauseCount: in scheduleCheckLocked, if mPauseCount > 0 the check for this HandlerChecker is not scheduled.

    /** Pause the HandlerChecker. */
    public void pauseLocked(String reason) {
        mPauseCount++; // incremented on every call to pauseLocked
        // Mark as completed, because there's a chance we called this after the watchdog
        // thread loop called Object#wait after 'WAITED_HALF'. In that case we want to ensure
        // the next call to #getCompletionStateLocked for this checker returns 'COMPLETED'
        mCompleted = true;  // pausing marks the check as completed right away
        Slog.i(TAG, "Pausing HandlerChecker: " + mName + " for reason: "
                + reason + ". Pause count: " + mPauseCount);
    }

    /** Resume the HandlerChecker from the last {@link #pauseLocked}. */
    public void resumeLocked(String reason) {
        if (mPauseCount > 0) {
            mPauseCount--;  // decremented on every call to resumeLocked; checking resumes when it reaches 0
            Slog.i(TAG, "Resuming HandlerChecker: " + mName + " for reason: "
                    + reason + ". Pause count: " + mPauseCount);
        } else {
            Slog.wtf(TAG, "Already resumed HandlerChecker: " + mName);
        }
    }
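Because pauses are counted rather than stored as a flag, nested pause/resume pairs behave correctly: checking resumes only after the outermost resume. A minimal sketch of that counting behaviour follows; PauseCounter and its methods are hypothetical names, and the logging of the original is omitted.

```java
// Hypothetical sketch of the nested pause counting used by HandlerChecker.
final class PauseCounter {
    private int pauseCount;

    void pause() {
        pauseCount++;
    }

    /** Returns false for an unmatched resume (the original logs a wtf in that case). */
    boolean resume() {
        if (pauseCount > 0) {
            pauseCount--;
            return true;
        }
        return false;
    }

    /** While this is true, no check would be scheduled. */
    boolean isPaused() {
        return pauseCount > 0;
    }
}
```

This is also why the calling code wraps the long operation in try/finally: a missed resume would leave the count above zero and silently disable deadlock detection for that thread.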

Watchdog#run

This method is fairly long, so it is explained in sections.

Scheduling the checks

private void run() {
    boolean waitedHalf = false;
    while (true) { // everything runs inside this while loop
        List<HandlerChecker> blockedCheckers = Collections.emptyList();
        String subject = "";
        boolean allowRestart = true;
        int debuggerWasConnected = 0;
        boolean doWaitedHalfDump = false;
        final ArrayList<Integer> pids;
        synchronized (mLock) {
            long timeout = CHECK_INTERVAL; // check once per CHECK_INTERVAL; CHECK_INTERVAL = DEFAULT_TIMEOUT / 2
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked(); // schedule a check on every HandlerChecker in the list
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) { // this loop makes sure we wait the full CHECK_INTERVAL
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    mLock.wait(timeout);
                    // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
    ...
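The inner loop exists because Object#wait can return early, either through a notify on mLock or through a spurious wakeup, so the remaining time is recomputed after every return. The sketch below shows the same pattern in plain Java. It is an approximation under stated assumptions: System.nanoTime replaces SystemClock.uptimeMillis, which is only available on Android, and FullWait and waitFully are hypothetical names.

```java
// Hypothetical sketch of the "wait for the full interval" loop.
final class FullWait {
    /**
     * Waits on lock until at least intervalMs have elapsed according to the
     * clock readings taken here, and returns the elapsed milliseconds.
     */
    static long waitFully(Object lock, long intervalMs) {
        final long start = System.nanoTime() / 1_000_000;
        long elapsed = 0;
        long timeout = intervalMs;
        synchronized (lock) {
            while (timeout > 0) {
                try {
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // Mirrors the original loop, which logs the interruption
                    // and keeps waiting for the rest of the interval.
                }
                // Recompute what is left; an early wakeup just loops again.
                elapsed = System.nanoTime() / 1_000_000 - start;
                timeout = intervalMs - elapsed;
            }
        }
        return elapsed;
    }
}
```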

Evaluating the result

...
synchronized (mLock) {
   // Schedule the checks
   // Wait CHECK_INTERVAL for the tasks to complete

   final int waitState = evaluateCheckerCompletionLocked(); // evaluate the completion state of all tasks
   if (waitState == COMPLETED) { // completed; go on to the next round
       // The monitors have returned; reset
       waitedHalf = false;
       continue;
   } else if (waitState == WAITING) { // every task has waited < CHECK_INTERVAL
       // still waiting but within their configured intervals; back off and recheck
       continue;
   } else if (waitState == WAITED_HALF) { // waited > CHECK_INTERVAL but < DEFAULT_TIMEOUT
       if (!waitedHalf) {
           Slog.i(TAG, "WAITED_HALF");
           waitedHalf = true;
           // We've waited half, but we'd need to do the stack trace dump w/o the lock.
           pids = new ArrayList<>(mInterestingJavaPids);
           doWaitedHalfDump = true; // set the flag; traces of the interesting processes will be dumped
       } else {
           continue;
       }
   } else { // waited > DEFAULT_TIMEOUT
       // something is overdue!
       blockedCheckers = getBlockedCheckersLocked();
       subject = describeCheckersLocked(blockedCheckers); // description of what is blocked
       allowRestart = mAllowRestart;
       pids = new ArrayList<>(mInterestingJavaPids);
   }
} // END synchronized (mLock)

evaluateCheckerCompletionLocked

The result of each HandlerChecker is evaluated, and the maximum over all completion states is taken.

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        // get the completion state of this HandlerChecker and keep the largest value
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
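Taking the maximum works because the state constants are ordered by lateness, so the largest value is the state of the worst checker. A small runnable sketch, with the hypothetical names WorstState and evaluate and the states passed in as plain ints:

```java
// Hypothetical sketch: the overall state is the state of the worst checker.
final class WorstState {
    static final int COMPLETED = 0;

    static int evaluate(int[] checkerStates) {
        int state = COMPLETED; // with no checkers at all, everything counts as completed
        for (int checkerState : checkerStates) {
            state = Math.max(state, checkerState);
        }
        return state;
    }
}
```

For example, three checkers in states 0, 1 and 2 yield 2 (WAITED_HALF): a single slow thread is enough to trigger the half-way dump even if every other thread is healthy.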

HandlerChecker#getCompletionStateLocked

// These are temporally ordered: larger values as lateness increases
private static final int COMPLETED = 0;
private static final int WAITING = 1;
private static final int WAITED_HALF = 2;
private static final int OVERDUE = 3;

public int getCompletionStateLocked() {
    if (mCompleted) { // all tasks have completed
        return COMPLETED;
    } else { // compute the elapsed time and map it to a completion state
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) { // waited less than half
            return WAITING;
        } else if (latency < mWaitMax) { // waited at least half
            return WAITED_HALF;
        }
    }
    return OVERDUE; // waited longer than the maximum wait time
}

Dumping diagnostic information

if (doWaitedHalfDump) { // waited more than half; dump the first set of traces
    // We've waited half the deadlock-detection interval.  Pull a stack
    // trace and wait another half.
    ActivityManagerService.dumpStackTraces(pids, null, null,
            getInterestingNativePids(), null, subject);
    continue;
}

// The code below handles the fully timed-out case
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject); // write the Watchdog event to the event log

// Log the atom as early as possible since it is used as a mechanism to trigger
// Perfetto. Ideally, the Perfetto trace capture should happen as close to the
// point in time when the Watchdog happens as possible.
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);

long anrTime = SystemClock.uptimeMillis();
StringBuilder report = new StringBuilder();
report.append(MemoryPressureUtil.currentPsiState());
ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
StringWriter tracesFileException = new StringWriter();
final File stack = ActivityManagerService.dumpStackTraces(  // dump the second set of traces
        pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
        tracesFileException, subject);

// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(5000);

processCpuTracker.update();
report.append(processCpuTracker.printCurrentState(anrTime));
report.append(tracesFileException.getBuffer());

// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w'); // write kernel information to the kernel log
doSysRq('l');

// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked.  (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            // If a watched thread hangs before init() is called, we don't have a
            // valid mActivity. So we can't log the error to dropbox.
            if (mActivity != null) { // add the Watchdog information to dropbox
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null, null,
                        null, report.toString(), stack, null, null, null,
                        errorId);
            }
        }
    };
dropboxThread.start();
try {
    dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}

Timeout restart flow

while(true) {
    // Schedule the checks
    // Wait CHECK_INTERVAL for the tasks to complete
    // Evaluate the completion result
    // Dump traces and related information

    IActivityController controller;
    synchronized (mLock) {
        controller = mController;
    }
    if (controller != null) { // let the IActivityController decide; it is registered with Watchdog through AMS
        Slog.i(TAG, "Reporting stuck state to activity controller");
        try {
            Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
            // 1 = keep waiting, -1 = kill system
            int res = controller.systemNotResponding(subject); // the IActivityController decides whether to wait or restart
            if (res >= 0) {
                Slog.i(TAG, "Activity controller requested to coninue to wait");
                waitedHalf = false;
                continue;
            }
        } catch (RemoteException e) {
        }
    }

    // Only kill the process if the debugger is not attached.
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    if (debuggerWasConnected >= 2) { // a debugger is connected; do not restart
        Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    } else if (debuggerWasConnected > 0) {
        Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    } else if (!allowRestart) { // restart not allowed; set through setAllowRestart
        Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    } else { // restart: killing the system process makes zygote restart, followed by the framework restart flow
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        if (!Build.IS_USER && isCrashLoopFound()
                && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
            breakCrashLoop();
        }
        Process.killProcess(Process.myPid()); // kill the system process
        System.exit(10); // exit the process
    }

    waitedHalf = false; // reset the flag so that the half-way dump can happen again
}
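The order of the vetoes in the OVERDUE path can be summarized as a pure function. This is a sketch for illustration only: RestartDecision, Outcome and decide are hypothetical names, and a null controllerResult is an assumed encoding for "no controller registered, or the call failed with a RemoteException".

```java
// Hypothetical sketch of the veto order on the OVERDUE path.
final class RestartDecision {
    enum Outcome {
        KEEP_WAITING,          // controller asked to continue waiting
        SKIP_KILL_DEBUGGER,    // a debugger is or was recently connected
        SKIP_KILL_NOT_ALLOWED, // restart disabled through setAllowRestart
        KILL_SYSTEM_SERVER     // no veto applied
    }

    /**
     * @param controllerResult value of systemNotResponding, or null when no
     *                         controller answered (assumption of this sketch)
     */
    static Outcome decide(Integer controllerResult, int debuggerWasConnected,
            boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Outcome.KEEP_WAITING;
        }
        if (debuggerWasConnected > 0) {
            return Outcome.SKIP_KILL_DEBUGGER;
        }
        if (!allowRestart) {
            return Outcome.SKIP_KILL_NOT_ALLOWED;
        }
        return Outcome.KILL_SYSTEM_SERVER;
    }
}
```

Note that the controller is consulted first, so a controller that keeps answering 1 can hold off the kill regardless of the other two conditions.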

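Handling the reboot broadcast

Watchdog's init method registers a receiver for the ACTION_REBOOT broadcast. When ACTION_REBOOT is received and it carries a non-zero int extra named nowait, the system is rebooted.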

final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) { // the nowait extra is present and non-zero
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}

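The rebootSystem method is shown below. It calls the reboot method of the power manager service (through IPowerManager), which causes a full reboot of the device.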

/**
 * Perform a full reboot of the system.
 */
void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}

That concludes the walkthrough of the basic Watchdog flow.

Example log of the workflow

// waitedHalf log
NX709J_CNCommon_V2.35-system.txt:447: 02-26 11:22:45.135451  2235  2495 I Watchdog: WAITED_HALF
// timeout log, printing the names of the first and second trace files
NX709J_CNCommon_V2.35-system.txt:515: 02-26 11:23:21.299433  2235  2495 E Watchdog: First set of traces taken from /data/anr/anr_2022-02-26-11-22-45-149
NX709J_CNCommon_V2.35-system.txt:516: 02-26 11:23:21.311599  2235  2495 E Watchdog: Second set of traces taken from /data/anr/anr_2022-02-26-11-23-15-848
// kill system_server
NX709J_CNCommon_V2.35-system.txt:517: 02-26 11:23:21.323736  2235  2495 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main)
NX709J_CNCommon_V2.35-system.txt:518: 02-26 11:23:21.324030  2235  2495 W Watchdog: main annotated stack trace:
NX709J_CNCommon_V2.35-system.txt:519: 02-26 11:23:21.324064  2235  2495 W Watchdog:     at com.android.server.am.BatteryStatsService.initPowerManagement(BatteryStatsService.java:510)
NX709J_CNCommon_V2.35-system.txt:520: 02-26 11:23:21.324220  2235  2495 W Watchdog:     - waiting to lock <0x0102e69c> (a com.android.internal.os.BatteryStatsImpl)
NX709J_CNCommon_V2.35-system.txt:521: 02-26 11:23:21.324235  2235  2495 W Watchdog:     at com.android.server.am.ActivityManagerService.initPowerManagement(ActivityManagerService.java:2641)
NX709J_CNCommon_V2.35-system.txt:522: 02-26 11:23:21.324244  2235  2495 W Watchdog:     at com.android.server.SystemServer.startBootstrapServices(SystemServer.java:1190)
NX709J_CNCommon_V2.35-system.txt:523: 02-26 11:23:21.324249  2235  2495 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:961)
NX709J_CNCommon_V2.35-system.txt:524: 02-26 11:23:21.324254  2235  2495 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:641)
NX709J_CNCommon_V2.35-system.txt:525: 02-26 11:23:21.324259  2235  2495 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
NX709J_CNCommon_V2.35-system.txt:526: 02-26 11:23:21.324269  2235  2495 W Watchdog:     at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:567)
NX709J_CNCommon_V2.35-system.txt:527: 02-26 11:23:21.324274  2235  2495 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:996)
NX709J_CNCommon_V2.35-system.txt:528: 02-26 11:23:21.324280  2235  2495 W Watchdog: *** GOODBYE!
