Android 12 Watchdog Workflow

Posted by pecuyu

This article is hosted on gitee (Android Notes) and mirrored on CSDN.

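Overview of the Watchdog workflow

As described in the previous article, calling Watchdog#start starts the Watchdog's internal worker thread. That thread then invokes the run method, and the Watchdog executes its monitoring logic there.

In outline, the Watchdog works as follows:

  • Every CHECK_INTERVAL (currently 30 s) it iterates over all elements of the HandlerChecker list and calls HandlerChecker#scheduleCheckLocked to run a check. Concretely, each HandlerChecker posts itself as a Runnable to the front of the message queue of the Handler it monitors and waits for that Runnable to be executed.
  • If the Runnable runs, the thread that owns the Handler is not blocked. Its run method then iterates over the HandlerChecker's Monitor list and calls each Monitor#monitor method (for example AMS#monitor) to determine whether the monitored object is in a healthy state. This monitor task currently runs on the fg thread, which is why most Watchdog traces contain fg thread information. If the Runnable is not executed for a long time, or a Monitor#monitor call does not return for a long time, the Handler thread may be stuck or the monitored object may be in an abnormal state, so no new task can make progress.
  • It sleeps for CHECK_INTERVAL and then calls evaluateCheckerCompletionLocked to evaluate the results. The possible states are:
    • 0 COMPLETED: all tasks have completed
    • 1 WAITING: every pending task has been waiting for less than CHECK_INTERVAL
    • 2 WAITED_HALF: some task has been waiting longer than CHECK_INTERVAL but less than DEFAULT_TIMEOUT
    • 3 OVERDUE: some task has been waiting longer than DEFAULT_TIMEOUT
  • If the result is COMPLETED or WAITING, the next monitoring round starts. If it is WAITED_HALF, traces of the interesting processes are dumped and the loop continues. If it is OVERDUE, traces are dumped again together with kernel and binder information, and the system is then restarted at the framework level. In some situations, such as a monkey run, the restart may be skipped.
  • The policy of a registered IActivityController can affect the Watchdog flow: on OVERDUE it may tell the system to keep waiting or to go ahead and kill the system process.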
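
The probe idea behind the first two steps can be sketched in plain Java, outside Android. This is only an illustration under assumptions, not the framework code: ProbeDemo and probe are hypothetical names, a single-thread ExecutorService stands in for a Handler thread, and the probe is appended to the end of the queue rather than posted to the front.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it is not part of the Android framework.
class ProbeDemo {
    /** Queues a probe on the worker and reports whether it ran within timeoutMs. */
    static boolean probe(ExecutorService worker, long timeoutMs) {
        CountDownLatch ran = new CountDownLatch(1);
        worker.execute(ran::countDown); // plays the role of the checker Runnable
        try {
            return ran.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService healthy = Executors.newSingleThreadExecutor();
        ExecutorService stuck = Executors.newSingleThreadExecutor();
        CountDownLatch gate = new CountDownLatch(1);
        // Block the second worker until the gate opens, simulating a hung thread.
        stuck.execute(() -> {
            try {
                gate.await();
            } catch (InterruptedException ignored) {
                // fall through and let the worker continue
            }
        });
        System.out.println("healthy worker responded: " + probe(healthy, 1000));
        System.out.println("stuck worker responded: " + probe(stuck, 200));
        gate.countDown(); // release the worker so the JVM can exit
        healthy.shutdown();
        stuck.shutdown();
    }
}
```

A probe that never runs says nothing about why the thread is stuck, which is why the real Watchdog follows up with trace dumps.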

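HandlerChecker implementation

HandlerChecker is a key class, and most of the Watchdog's functionality lives in it. As its class comment says, it checks the state of handler threads and schedules the monitor callbacks, that is, the calls to Monitor#monitor. The class implements Runnable so that it can post itself as the check task.

/**
 * Used for checking status of handle threads and scheduling monitor callbacks.
 */
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
    // ... remaining fields and methods are covered in the sections below
}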

HandlerChecker constructor

Parameters:

  • handler: the Handler of the thread being monitored
  • name: the name of the monitored thread
  • waitMaxMillis: the maximum time to wait for a check to complete

HandlerChecker(Handler handler, String name, long waitMaxMillis) {
    mHandler = handler;
    mName = name;
    mWaitMax = waitMaxMillis;
    mCompleted = true;
}

HandlerChecker#scheduleCheckLocked

scheduleCheckLocked performs the actual check by posting a Runnable to the front of the message queue of the monitored Handler. This serves two purposes:

  • Whether the message gets executed shows the state of the monitored thread. If the thread is stuck, the Runnable will not run for a long time.
  • When the Runnable does run, it calls Monitor#monitor, so that work happens on the monitored thread rather than on the Watchdog thread. The call also shows whether the monitored objects themselves are in a healthy state.

public void scheduleCheckLocked() {
    if (mCompleted) { // move monitors from mMonitorQueue into mMonitors; mMonitors then stays unchanged while a check runs, a deliberate safety measure
        // Safe to update monitors in queue, Handler is not in the middle of work
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    // Do not schedule if there are no monitors and the message queue is polling (idle, waiting for new messages),
    // or if this checker has been paused
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) {
        // Don't schedule until after resume OR
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked.  This avoid having
        // to do a context switch to check the thread. Note that we
        // only do this if we have no monitors since those would need to
        // be executed at this point.
        mCompleted = true;
        return;
    }
    if (!mCompleted) { // a check is already scheduled, so return
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this); // post this Runnable to the front of the message queue
}
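
Why the probe goes to the front of the queue can be shown with a plain deque. This is a toy model under assumptions, not Android's MessageQueue: FrontOfQueueDemo and runOrder are hypothetical names, and an ArrayDeque of Runnables stands in for the pending messages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical demo class; a Deque stands in for a Handler's message queue.
class FrontOfQueueDemo {
    /** Queues three ordinary tasks, then a probe at the front, and returns the run order. */
    static List<String> runOrder() {
        List<String> log = new ArrayList<>();
        Deque<Runnable> queue = new ArrayDeque<>();
        for (int i = 1; i <= 3; i++) {
            final int id = i;
            queue.addLast(() -> log.add("task" + id)); // ordinary messages go to the tail
        }
        queue.addFirst(() -> log.add("probe")); // the checker jumps the backlog
        Runnable next;
        while ((next = queue.pollFirst()) != null) {
            next.run();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runOrder()); // prints [probe, task1, task2, task3]
    }
}
```

Because the probe does not wait behind the backlog, a late probe points at the message currently being processed rather than at a merely long queue.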

HandlerChecker#run

When run executes, the corresponding handler thread is processing messages normally and is not blocked. For the fg thread, run additionally iterates over every Monitor in mMonitors and calls its monitor method.

@Override
public void run() {
    // Once we get here, we ensure that mMonitors does not change even if we call
    // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
    // move them to mMonitors on the next schedule when mCompleted is true, at which
    // point we have completed execution of this method.
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) { // iterate over mMonitors and call each monitor
        synchronized (mLock) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (mLock) { // all monitors returned, so mark the check as completed
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
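
A monitor callback usually does little more than try to take the lock of the object it guards, so a held lock shows up as a call that never returns. The sketch below illustrates that idea in plain Java. MonitorDemo, Service and returnsWithin are hypothetical names, and the helper thread with a timed join is an assumption made for the demo, not how the Watchdog measures latency.

```java
// Hypothetical demo classes; the names are illustrative and not from the Android sources.
class MonitorDemo {
    interface Monitor {
        void monitor();
    }

    /** A service whose monitor() returns only once its lock can be acquired. */
    static class Service implements Monitor {
        final Object lock = new Object();

        @Override
        public void monitor() {
            synchronized (lock) {
                // acquiring the lock is the whole check
            }
        }
    }

    /** Runs monitor() on a helper thread and reports whether it returned in time. */
    static boolean returnsWithin(Monitor m, long timeoutMs) {
        Thread t = new Thread(m::monitor, "monitor-probe");
        t.setDaemon(true);
        t.start();
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }

    public static void main(String[] args) {
        Service service = new Service();
        System.out.println("lock free, monitor returned: " + returnsWithin(service, 1000));
        synchronized (service.lock) { // hold the lock, as a stuck caller would
            System.out.println("lock held, monitor returned: " + returnsWithin(service, 200));
        }
    }
}
```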

HandlerChecker#getCompletionStateLocked

When the Watchdog evaluates whether a check has completed, it calls the checker's getCompletionStateLocked method.

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) { // waited less than half of the limit
            return WAITING;
        } else if (latency < mWaitMax) { // waited at least half of the limit
            return WAITED_HALF;
        }
    }
    return OVERDUE; // exceeded the maximum wait time
}


Pausing and resuming checks

In some special cases checking has to be paused. Starting PackageManagerService, for example, can take quite a long time, but that time is expected and must not be treated as a fault, so checking is paused while it runs.

t.traceBegin("StartPackageManagerService");
try {
    Watchdog.getInstance().pauseWatchingCurrentThread("packagemanagermain");
    mPackageManagerService = PackageManagerService.main(mSystemContext, installer,
            domainVerificationService, mFactoryTestMode != FactoryTest.FACTORY_TEST_OFF,
            mOnlyCore);
} finally {
    Watchdog.getInstance().resumeWatchingCurrentThread("packagemanagermain");
}

Watchdog#pauseWatchingCurrentThread pauses checking for the current thread, and Watchdog#resumeWatchingCurrentThread resumes it. The implementation iterates over all HandlerCheckers, finds those whose monitored thread is the current thread, and calls HandlerChecker#pauseLocked on them. Resuming works the same way and calls HandlerChecker#resumeLocked.

/**
 * Pauses Watchdog action for the currently running thread. Useful before executing long running
 * operations that could falsely trigger the watchdog. Each call to this will require a matching
 * call to {@link #resumeWatchingCurrentThread}.
 *
 * <p>If the current thread has not been added to the Watchdog, this call is a no-op.
 *
 * <p>If the Watchdog is already paused for the current thread, this call adds
 * adds another pause and will require an additional {@link #resumeCurrentThread} to resume.
 *
 * <p>Note: Use with care, as any deadlocks on the current thread will be undetected until all
 * pauses have been resumed.
 */
public void pauseWatchingCurrentThread(String reason) {
    synchronized (mLock) {
        for (HandlerChecker hc : mHandlerCheckers) {
            if (Thread.currentThread().equals(hc.getThread())) {
                hc.pauseLocked(reason);
            }
        }
    }
}

Pausing and resuming are implemented through mPauseCount. In scheduleCheckLocked, a HandlerChecker whose mPauseCount is greater than 0 is not scheduled.

    /** Pause the HandlerChecker. */
    public void pauseLocked(String reason) {
        mPauseCount++; // incremented on every call to pauseLocked
        // Mark as completed, because there's a chance we called this after the watchdog
        // thread loop called Object#wait after 'WAITED_HALF'. In that case we want to ensure
        // the next call to #getCompletionStateLocked for this checker returns 'COMPLETED'
        mCompleted = true;  // a paused checker is reported as completed
        Slog.i(TAG, "Pausing HandlerChecker: " + mName + " for reason: "
                + reason + ". Pause count: " + mPauseCount);
    }

    /** Resume the HandlerChecker from the last {@link #pauseLocked}. */
    public void resumeLocked(String reason) {
        if (mPauseCount > 0) {
            mPauseCount--;  // decremented on every call to resumeLocked; checking resumes when it reaches 0
            Slog.i(TAG, "Resuming HandlerChecker: " + mName + " for reason: "
                    + reason + ". Pause count: " + mPauseCount);
        } else {
            Slog.wtf(TAG, "Already resumed HandlerChecker: " + mName);
        }
    }
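
The counter makes nested pauses safe: checking resumes only after every pause has been matched by a resume. A minimal sketch of that counting behaviour follows. PauseCounter is a hypothetical class, and unlike pauseLocked it neither logs nor touches a completion flag.

```java
// Hypothetical demo class that models only the pause counter.
class PauseCounter {
    private int pauseCount;

    void pause() {
        pauseCount++;
    }

    /** Undoes one pause; returns false if there was no outstanding pause. */
    boolean resume() {
        if (pauseCount == 0) {
            return false;
        }
        pauseCount--;
        return true;
    }

    boolean isPaused() {
        return pauseCount > 0;
    }

    public static void main(String[] args) {
        PauseCounter counter = new PauseCounter();
        counter.pause();  // outer long-running operation
        counter.pause();  // nested long-running operation
        counter.resume(); // nested operation finished
        System.out.println("paused after one resume: " + counter.isPaused());  // prints true
        counter.resume(); // outer operation finished
        System.out.println("paused after two resumes: " + counter.isPaused()); // prints false
    }
}
```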

Watchdog#run

This method is fairly long, so it is explained in segments.

Scheduling the checks

private void run() {
    boolean waitedHalf = false;
    while (true) { // everything runs inside this loop
        List<HandlerChecker> blockedCheckers = Collections.emptyList();
        String subject = "";
        boolean allowRestart = true;
        int debuggerWasConnected = 0;
        boolean doWaitedHalfDump = false;
        final ArrayList<Integer> pids;
        synchronized (mLock) {
            long timeout = CHECK_INTERVAL; // check once every CHECK_INTERVAL; CHECK_INTERVAL = DEFAULT_TIMEOUT / 2
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked(); // schedule a check on every HandlerChecker in the list
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) { // this loop makes sure we wait the full CHECK_INTERVAL
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    mLock.wait(timeout);
                    // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            ...
        }
    }
}
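
The inner loop recomputes the remaining time after every wakeup, so a notify or a spurious wakeup cannot shorten the interval. The same pattern can be tried in plain Java as sketched below. IntervalWait and waitFullInterval are hypothetical names, System.nanoTime replaces SystemClock.uptimeMillis, and unlike the Watchdog this sketch gives up when interrupted.

```java
// Hypothetical demo class showing the "wait for the full interval" loop.
class IntervalWait {
    /**
     * Waits on lock until intervalMs has elapsed, re-waiting after early wakeups.
     * Returns how many times wait() returned. Stops early only if interrupted.
     */
    static int waitFullInterval(Object lock, long intervalMs) {
        int wakeups = 0;
        long start = System.nanoTime();
        long remaining = intervalMs;
        synchronized (lock) {
            // The loop condition keeps remaining positive, so wait(0),
            // which would block indefinitely, is never called.
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return wakeups;
                }
                wakeups++;
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMs - elapsedMs;
            }
        }
        return wakeups;
    }

    public static void main(String[] args) {
        Object lock = new Object();
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                return;
            }
            synchronized (lock) {
                lock.notifyAll(); // early wakeup that must not end the wait
            }
        });
        notifier.setDaemon(true);
        notifier.start();
        long t0 = System.nanoTime();
        int wakeups = waitFullInterval(lock, 200);
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("wakeups=" + wakeups + ", elapsedMs=" + elapsedMs);
    }
}
```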
  

Evaluating the results

...
synchronized (mLock) {
    // schedule the checks
    // wait for CHECK_INTERVAL so the tasks have time to complete

    final int waitState = evaluateCheckerCompletionLocked(); // evaluate the completion state of all tasks
    if (waitState == COMPLETED) { // all done, continue with the next round
        // The monitors have returned; reset
        waitedHalf = false;
        continue;
    } else if (waitState == WAITING) { // every task has waited less than CHECK_INTERVAL
        // still waiting but within their configured intervals; back off and recheck
        continue;
    } else if (waitState == WAITED_HALF) { // waited longer than CHECK_INTERVAL but less than DEFAULT_TIMEOUT
        if (!waitedHalf) {
            Slog.i(TAG, "WAITED_HALF");
            waitedHalf = true;
            // We've waited half, but we'd need to do the stack trace dump w/o the lock.
            pids = new ArrayList<>(mInterestingJavaPids);
            doWaitedHalfDump = true; // set the flag; traces of the interesting processes will be dumped
        } else {
            continue;
        }
    } else { // waited longer than DEFAULT_TIMEOUT
        // something is overdue!
        blockedCheckers = getBlockedCheckersLocked();
        subject = describeCheckersLocked(blockedCheckers); // description of what is blocked
        allowRestart = mAllowRestart;
        pids = new ArrayList<>(mInterestingJavaPids);
    }
} // END synchronized (mLock)

evaluateCheckerCompletionLocked

The result of each HandlerChecker is evaluated, and the maximum of all completion states is returned.

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        // get the completion state of this HandlerChecker and keep the largest value
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

HandlerChecker#getCompletionStateLocked

// These are temporally ordered: larger values as lateness increases
private static final int COMPLETED = 0;
private static final int WAITING = 1;
private static final int WAITED_HALF = 2;
private static final int OVERDUE = 3;

public int getCompletionStateLocked() {
    if (mCompleted) { // all tasks have completed
        return COMPLETED;
    } else { // compute the elapsed time and map it to a completion state
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) { // waited less than half of the limit
            return WAITING;
        } else if (latency < mWaitMax) { // waited at least half of the limit
            return WAITED_HALF;
        }
    }
    return OVERDUE; // waited longer than the maximum wait time
}
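
A worked example makes the thresholds concrete. The sketch below re-implements the two pieces as pure functions so they can be run outside Android. CompletionStates, stateOf and worst are hypothetical names, and the 60000 ms limit is an assumed value matching the 30 s CHECK_INTERVAL mentioned earlier.

```java
// Hypothetical demo class; pure-function versions of the state logic shown above.
class CompletionStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Maps one checker's progress to a state, using the same thresholds as above. */
    static int stateOf(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Returns the worst (largest) state among all checkers. */
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // assumed limit in milliseconds
        int a = stateOf(true, 0, waitMax);       // 0: finished
        int b = stateOf(false, 10_000, waitMax); // 1: 10 s is below the 30 s half-way mark
        int c = stateOf(false, 40_000, waitMax); // 2: between 30 s and 60 s
        int d = stateOf(false, 70_000, waitMax); // 3: beyond 60 s
        System.out.println(a + " " + b + " " + c + " " + d); // prints 0 1 2 3
        System.out.println(worst(a, b, c));                  // prints 2
    }
}
```

Taking the maximum means a single slow checker determines the outcome of the round, however many others have finished.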

Dumping diagnostic information

if (doWaitedHalfDump) { // waited more than half of the limit, dump the first set of traces
    // We've waited half the deadlock-detection interval.  Pull a stack
    // trace and wait another half.
    ActivityManagerService.dumpStackTraces(pids, null, null,
            getInterestingNativePids(), null, subject);
    continue;
}

// The code below handles a full timeout
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject); // write the Watchdog event to the event log

// Log the atom as early as possible since it is used as a mechanism to trigger
// Perfetto. Ideally, the Perfetto trace capture should happen as close to the
// point in time when the Watchdog happens as possible.
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);

long anrTime = SystemClock.uptimeMillis();
StringBuilder report = new StringBuilder();
report.append(MemoryPressureUtil.currentPsiState());
ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
StringWriter tracesFileException = new StringWriter();
final File stack = ActivityManagerService.dumpStackTraces(  // dump the second set of traces
        pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
        tracesFileException, subject);

// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(5000);

processCpuTracker.update();
report.append(processCpuTracker.printCurrentState(anrTime));
report.append(tracesFileException.getBuffer());

// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w'); // write the information to the kernel log
doSysRq('l');

// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked.  (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
        public void run() {
            // If a watched thread hangs before init() is called, we don't have a
            // valid mActivity. So we can't log the error to dropbox.
            if (mActivity != null) { // add the Watchdog report to the dropbox
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null, null,
                        null, report.toString(), stack, null, null, null,
                        errorId);
            }
        }
    };
dropboxThread.start();
try {
    dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) { }

Restarting after a timeout

while (true) {
    // schedule the checks
    // wait for CHECK_INTERVAL so the tasks have time to complete
    // evaluate the completion state
    // dump traces and related information

    IActivityController controller;
    synchronized (mLock) {
        controller = mController;
    }
    if (controller != null) { // let the IActivityController decide; it is registered with the Watchdog through AMS
        Slog.i(TAG, "Reporting stuck state to activity controller");
        try {
            Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
            // 1 = keep waiting, -1 = kill system
            int res = controller.systemNotResponding(subject); // the IActivityController decides whether to wait or restart
            if (res >= 0) {
                Slog.i(TAG, "Activity controller requested to coninue to wait");
                waitedHalf = false;
                continue;
            }
        } catch (RemoteException e) {
        }
    }

    // Only kill the process if the debugger is not attached.
    if (Debug.isDebuggerConnected()) {
        debuggerWasConnected = 2;
    }
    if (debuggerWasConnected >= 2) { // a debugger is attached, do not restart
        Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    } else if (debuggerWasConnected > 0) {
        Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    } else if (!allowRestart) { // restart not allowed; configured through setAllowRestart
        Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    } else { // restart: killing the system process makes zygote restart, which triggers the framework restart flow
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        if (!Build.IS_USER && isCrashLoopFound()
                && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
            breakCrashLoop();
        }
        Process.killProcess(Process.myPid()); // kill the system process
        System.exit(10); // exit the process
    }

    waitedHalf = false; // reset the flag so the half-way dump can happen again
}
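
The order of the checks in this segment can be condensed into one decision function. This is a simplified model under assumptions: OverdueDecision, Action and decide are hypothetical names, a null controllerResult stands for "no controller registered or the call failed", and logging, diagnostics and crash-loop handling are left out.

```java
// Hypothetical demo class; models only the order of the checks shown above.
class OverdueDecision {
    enum Action { KEEP_WAITING, SKIP_KILL, KILL_SYSTEM_PROCESS }

    /**
     * @param controllerResult result of systemNotResponding, or null if no
     *                         controller is registered or the call failed
     * @param debuggerWasConnected the debugger counter kept by the loop
     * @param allowRestart whether restarting is allowed
     */
    static Action decide(Integer controllerResult, int debuggerWasConnected,
            boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING; // the controller asked to continue waiting
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_KILL; // a debugger is or recently was attached
        }
        if (!allowRestart) {
            return Action.SKIP_KILL; // restart has been disabled
        }
        return Action.KILL_SYSTEM_PROCESS;
    }

    public static void main(String[] args) {
        System.out.println(decide(1, 0, true));     // prints KEEP_WAITING
        System.out.println(decide(-1, 0, true));    // prints KILL_SYSTEM_PROCESS
        System.out.println(decide(null, 2, true));  // prints SKIP_KILL
        System.out.println(decide(null, 0, false)); // prints SKIP_KILL
    }
}
```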

Handling the reboot broadcast

Watchdog's init method registers a receiver for the ACTION_REBOOT broadcast. When ACTION_REBOOT arrives with a non-zero int extra named nowait, the system is rebooted.

final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) { // the nowait extra is present and non-zero
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}

rebootSystem is shown below. It calls the reboot method of the power manager service (obtained as IPowerManager), which performs a full reboot of the device.

/**
 * Perform a full reboot of the system.
 */
void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}

This completes the walkthrough of the basic Watchdog flow.

Example log of the workflow

// waitedHalf log
NX709J_CNCommon_V2.35-system.txt:447: 02-26 11:22:45.135451  2235  2495 I Watchdog: WAITED_HALF
// timeout log; prints the file names of the first and second trace dumps
NX709J_CNCommon_V2.35-system.txt:515: 02-26 11:23:21.299433  2235  2495 E Watchdog: First set of traces taken from /data/anr/anr_2022-02-26-11-22-45-149
NX709J_CNCommon_V2.35-system.txt:516: 02-26 11:23:21.311599  2235  2495 E Watchdog: Second set of traces taken from /data/anr/anr_2022-02-26-11-23-15-848
// kill system_server
NX709J_CNCommon_V2.35-system.txt:517: 02-26 11:23:21.323736  2235  2495 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main)
NX709J_CNCommon_V2.35-system.txt:518: 02-26 11:23:21.324030  2235  2495 W Watchdog: main annotated stack trace:
NX709J_CNCommon_V2.35-system.txt:519: 02-26 11:23:21.324064  2235  2495 W Watchdog:     at com.android.server.am.BatteryStatsService.initPowerManagement(BatteryStatsService.java:510)
NX709J_CNCommon_V2.35-system.txt:520: 02-26 11:23:21.324220  2235  2495 W Watchdog:     - waiting to lock <0x0102e69c> (a com.android.internal.os.BatteryStatsImpl)
NX709J_CNCommon_V2.35-system.txt:521: 02-26 11:23:21.324235  2235  2495 W Watchdog:     at com.android.server.am.ActivityManagerService.initPowerManagement(ActivityManagerService.java:2641)
NX709J_CNCommon_V2.35-system.txt:522: 02-26 11:23:21.324244  2235  2495 W Watchdog:     at com.android.server.SystemServer.startBootstrapServices(SystemServer.java:1190)
NX709J_CNCommon_V2.35-system.txt:523: 02-26 11:23:21.324249  2235  2495 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:961)
NX709J_CNCommon_V2.35-system.txt:524: 02-26 11:23:21.324254  2235  2495 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:641)
NX709J_CNCommon_V2.35-system.txt:525: 02-26 11:23:21.324259  2235  2495 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
NX709J_CNCommon_V2.35-system.txt:526: 02-26 11:23:21.324269  2235  2495 W Watchdog:     at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:567)
NX709J_CNCommon_V2.35-system.txt:527: 02-26 11:23:21.324274  2235  2495 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:996)
NX709J_CNCommon_V2.35-system.txt:528: 02-26 11:23:21.324280  2235  2495 W Watchdog: *** GOODBYE!
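
The timestamps in this log can be checked against the flow described above. The sketch below computes the gaps with java.time. LogTiming and millisBetween are hypothetical names, the times are copied from the log lines and trace file names, and reading the file-name suffixes as hh-mm-ss-millis is an assumption.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical demo class for comparing timestamps taken from the log above.
class LogTiming {
    /** Milliseconds between two same-day times written as HH:mm:ss.fraction. */
    static long millisBetween(String from, String to) {
        return Duration.between(LocalTime.parse(from), LocalTime.parse(to)).toMillis();
    }

    public static void main(String[] args) {
        // WAITED_HALF log line to the "KILLING SYSTEM PROCESS" log line
        System.out.println(millisBetween("11:22:45.135451", "11:23:21.323736")); // prints 36188
        // first trace file name to second trace file name
        System.out.println(millisBetween("11:22:45.149", "11:23:15.848"));       // prints 30699
    }
}
```

Under that reading, the two trace files are about 30.7 s apart, in line with one more CHECK_INTERVAL between the half-way dump and the overdue dump. The remaining time of roughly 5.5 s before the kill is consistent with the 5 s sleep that follows the second dump.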
