Android 12 Watchdog Workflow
Posted by pecuyu
This article is hosted on Gitee (Android Notes) and mirrored on CSDN.
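Watchdog workflow overview
As described in the previous article, calling Watchdog#start starts Watchdog's internal worker thread, which then invokes the run method. All monitoring logic runs on this thread.
In outline, Watchdog works as follows:
- Every CHECK_INTERVAL (currently 30 s) it iterates over the HandlerChecker list and calls HandlerChecker#scheduleCheckLocked on each element. The check works by posting a Runnable to the front of the message queue of the checker's Handler and then waiting for that Runnable to run.
- If the Runnable runs, the Handler's thread is not blocked. Its run method then iterates over the HandlerChecker's Monitor list and calls each Monitor#monitor (for example AMS#monitor) to verify that the monitored object is healthy. These monitor checks currently run on the fg thread, which is why most Watchdog traces contain fg thread information. If the Runnable is not executed for a long time, or a Monitor#monitor call does not return for a long time, the Handler thread is probably stuck or the monitored object is in an abnormal state and cannot take on new work.
- It sleeps for CHECK_INTERVAL and then calls evaluateCheckerCompletionLocked to evaluate the results. The possible states are:
  - 0 COMPLETED: all tasks have finished
  - 1 WAITING: every pending task has been waiting for less than CHECK_INTERVAL
  - 2 WAITED_HALF: some task has been waiting longer than CHECK_INTERVAL but less than DEFAULT_TIMEOUT
  - 3 OVERDUE: some task has been waiting longer than DEFAULT_TIMEOUT
- If the result is COMPLETED or WAITING, the next monitoring round starts. If it is WAITED_HALF, the traces of the interesting processes are dumped and the loop continues. If it is OVERDUE, the traces are dumped again together with kernel and binder information, and the system is then restarted at the framework level. In some situations, such as monkey runs, the restart may be skipped.
- The policy of a registered IActivityController also affects the flow: on OVERDUE it can ask Watchdog either to keep waiting or to go ahead and kill the system process.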
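The steps above can be approximated in plain Java outside Android. This is only a minimal sketch under stated assumptions, not the framework code: a single-thread ExecutorService stands in for the watched Handler thread, System.nanoTime stands in for SystemClock.uptimeMillis, and the names WatchdogLoopSketch, checkOnce, and demo are hypothetical. Unlike Handler#postAtFrontOfQueue, the executor appends the probe to the tail of its queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class WatchdogLoopSketch {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    /** Posts a probe to the watched thread, sleeps, then grades how the probe fared. */
    static int checkOnce(ExecutorService watched, long waitMaxMillis, long sleepMillis) {
        AtomicBoolean completed = new AtomicBoolean(false);
        long startNanos = System.nanoTime();
        watched.execute(() -> completed.set(true)); // the probe task
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (completed.get()) {
            return COMPLETED;
        }
        long latency = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        if (latency < waitMaxMillis / 2) {
            return WAITING;
        } else if (latency < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Returns the states observed for a healthy thread and for a blocked one. */
    static int[] demo() {
        ExecutorService healthy = Executors.newSingleThreadExecutor();
        ExecutorService stuck = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // Block the "stuck" thread so that the probe queued behind it cannot run.
        stuck.execute(() -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        });
        try {
            return new int[] { checkOnce(healthy, 200, 300), checkOnce(stuck, 200, 300) };
        } finally {
            release.countDown();
            healthy.shutdown();
            stuck.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] states = demo();
        System.out.println("healthy=" + states[0] + " stuck=" + states[1]);
    }
}
```

The sleep of 300 ms against a limit of 200 ms is chosen so that the blocked thread is graded OVERDUE in a single pass; the real Watchdog reaches that state over two CHECK_INTERVAL rounds.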
HandlerChecker implementation
HandlerChecker is a key class, and most of the core logic lives in it. As its class comment says, it checks the status of a thread and schedules the monitor callbacks, that is, the calls to Monitor#monitor. It implements Runnable so that it can post itself as the check task.
/**
 * Used for checking status of handle threads and scheduling monitor callbacks.
 */
public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
    // ... (excerpt; the methods are shown in the following sections)
}
HandlerChecker constructor
Parameters:
- handler: the Handler of the thread being monitored
- name: the name of the thread being monitored
- waitMaxMillis: the maximum time to wait for a check to complete
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
    mHandler = handler;
    mName = name;
    mWaitMax = waitMaxMillis;
    mCompleted = true;
}
HandlerChecker#scheduleCheckLocked
scheduleCheckLocked performs the actual check by posting a Runnable to the front of the message queue of the monitored Handler. This serves two purposes:
- Whether the message gets executed reveals the state of the monitored thread: if the thread is stuck, the Runnable will not run for a long time.
- When the Runnable does run, it calls Monitor#monitor, so that work happens on the monitored thread rather than on the Watchdog thread, and the call itself shows whether the monitored objects are healthy.
public void scheduleCheckLocked() {
    if (mCompleted) {
        // Move the monitors from mMonitorQueue into mMonitors. mMonitors then stays
        // unchanged while a check is in flight, which is a deliberate safety measure.
        // Safe to update monitors in queue, Handler is not in the middle of work
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    // If there are no monitors and the message queue is polling (idle, waiting for new
    // messages), or this checker has been paused, do not schedule a check.
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
            || (mPauseCount > 0)) {
        // Don't schedule until after resume OR
        // If the target looper has recently been polling, then
        // there is no reason to enqueue our checker on it since that
        // is as good as it not being deadlocked. This avoid having
        // to do a context switch to check the thread. Note that we
        // only do this if we have no monitors since those would need to
        // be executed at this point.
        mCompleted = true;
        return;
    }
    if (!mCompleted) { // a check has already been scheduled, so return
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this); // post this Runnable to the front of the message queue
}
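To illustrate why the probe is posted to the front of the queue, here is a minimal plain-Java sketch. It assumes an ArrayDeque of labels as a stand-in for the Looper's message queue; FrontOfQueueSketch, drain, and demo are hypothetical names, and nothing here is Android API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class FrontOfQueueSketch {
    /** Drains the queue from the head and returns the labels in execution order. */
    static List<String> drain(Deque<String> queue) {
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.pollFirst());
        }
        return order;
    }

    static List<String> demo() {
        Deque<String> queue = new ArrayDeque<>();
        queue.addLast("msg-A");           // ordinary messages already pending
        queue.addLast("msg-B");
        queue.addFirst("watchdog-probe"); // analogous to Handler#postAtFrontOfQueue
        return drain(queue);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [watchdog-probe, msg-A, msg-B]
    }
}
```

Because the probe jumps ahead of the pending messages, its latency reflects how long the message currently being handled takes, not the length of the backlog.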
HandlerChecker#run
When run executes, the corresponding Handler thread is able to process messages and is not blocked. For the fg thread, run additionally iterates over every Monitor in mMonitors and calls its monitor method.
@Override
public void run() {
    // Once we get here, we ensure that mMonitors does not change even if we call
    // #addMonitorLocked because we first add the new monitors to mMonitorQueue and
    // move them to mMonitors on the next schedule when mCompleted is true, at which
    // point we have completed execution of this method.
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) { // iterate over mMonitors and call monitor() on each
        synchronized (mLock) {
            mCurrentMonitor = mMonitors.get(i);
        }
        mCurrentMonitor.monitor();
    }

    synchronized (mLock) { // all monitors returned, so mark the check as completed
        mCompleted = true;
        mCurrentMonitor = null;
    }
}
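A monitor callback usually does nothing more than acquire and release the lock of the object it guards, so it returns only when that lock is free. The sketch below shows the idea in plain Java under that assumption; MonitorSketch, Service, and blocksWhileLockHeld are hypothetical names, and the Monitor interface here is a local stand-in for Watchdog.Monitor.

```java
final class MonitorSketch {
    /** Local stand-in for the Watchdog.Monitor callback interface. */
    interface Monitor {
        void monitor();
    }

    static final class Service implements Monitor {
        @Override
        public void monitor() {
            // Returns only once the service lock can be acquired.
            synchronized (this) { }
        }
    }

    /** Returns true if monitor() is still blocked after waitMillis while the lock is held. */
    static boolean blocksWhileLockHeld(long waitMillis) {
        Service service = new Service();
        Thread checker = new Thread(service::monitor, "checker");
        boolean stillBlocked;
        synchronized (service) { // simulate another thread that never releases the lock
            checker.start();
            try {
                checker.join(waitMillis); // waits on the Thread object, not on the service lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            stillBlocked = checker.isAlive();
        }
        try {
            checker.join(); // lock released, monitor() can now return
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return stillBlocked;
    }

    public static void main(String[] args) {
        System.out.println("blocked while lock held: " + blocksWhileLockHeld(100));
    }
}
```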
HandlerChecker#getCompletionStateLocked
getCompletionStateLocked is called when Watchdog evaluates whether a check has completed.
public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) { // waited less than half of the limit
            return WAITING;
        } else if (latency < mWaitMax) { // waited more than half of the limit
            return WAITED_HALF;
        }
    }
    return OVERDUE; // exceeded the maximum wait time
}
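The thresholds are easier to see with concrete numbers. The following is a small sketch that restates the method as a pure function; CompletionStateSketch and classify are hypothetical names, and the 60 000 ms limit is an assumption derived from CHECK_INTERVAL being 30 s and equal to DEFAULT_TIMEOUT / 2.

```java
final class CompletionStateSketch {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    /** Same branching as getCompletionStateLocked, with the inputs passed in explicitly. */
    static int classify(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        } else if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // assumed DEFAULT_TIMEOUT in milliseconds
        System.out.println(classify(true, 45_000, waitMax));  // 0: completed, latency ignored
        System.out.println(classify(false, 10_000, waitMax)); // 1: under half the limit
        System.out.println(classify(false, 30_000, waitMax)); // 2: exactly half counts as WAITED_HALF
        System.out.println(classify(false, 60_000, waitMax)); // 3: exactly the limit counts as OVERDUE
    }
}
```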
Pausing and resuming checks
In some special cases the check has to be paused. Starting PackageManagerService, for example, can take a long time, but that delay is expected and must not be treated as a hang, so the check is paused for its duration.
t.traceBegin("StartPackageManagerService");
try {
    Watchdog.getInstance().pauseWatchingCurrentThread("packagemanagermain");
    mPackageManagerService = PackageManagerService.main(mSystemContext, installer,
            domainVerificationService, mFactoryTestMode != FactoryTest.FACTORY_TEST_OFF,
            mOnlyCore);
} finally {
    Watchdog.getInstance().resumeWatchingCurrentThread("packagemanagermain");
}
Watchdog#pauseWatchingCurrentThread pauses checking for the current thread, and Watchdog#resumeWatchingCurrentThread resumes it. The implementation iterates over all HandlerCheckers, finds the ones whose monitored thread is the current thread, and calls HandlerChecker#pauseLocked on them. Resuming works the same way through HandlerChecker#resumeLocked.
/**
 * Pauses Watchdog action for the currently running thread. Useful before executing long running
 * operations that could falsely trigger the watchdog. Each call to this will require a matching
 * call to {@link #resumeWatchingCurrentThread}.
 *
 * <p>If the current thread has not been added to the Watchdog, this call is a no-op.
 *
 * <p>If the Watchdog is already paused for the current thread, this call
 * adds another pause and will require an additional {@link #resumeCurrentThread} to resume.
 *
 * <p>Note: Use with care, as any deadlocks on the current thread will be undetected until all
 * pauses have been resumed.
 */
public void pauseWatchingCurrentThread(String reason) {
    synchronized (mLock) {
        for (HandlerChecker hc : mHandlerCheckers) {
            if (Thread.currentThread().equals(hc.getThread())) {
                hc.pauseLocked(reason);
            }
        }
    }
}
Pausing and resuming are implemented by adjusting mPauseCount. scheduleCheckLocked skips the check for a HandlerChecker while its mPauseCount is greater than 0.
/** Pause the HandlerChecker. */
public void pauseLocked(String reason) {
    mPauseCount++; // incremented on every pauseLocked call
    // Mark as completed, because there's a chance we called this after the watchdog
    // thread loop called Object#wait after 'WAITED_HALF'. In that case we want to ensure
    // the next call to #getCompletionStateLocked for this checker returns 'COMPLETED'
    mCompleted = true; // a paused checker is reported as completed
    Slog.i(TAG, "Pausing HandlerChecker: " + mName + " for reason: "
            + reason + ". Pause count: " + mPauseCount);
}

/** Resume the HandlerChecker from the last {@link #pauseLocked}. */
public void resumeLocked(String reason) {
    if (mPauseCount > 0) {
        mPauseCount--; // decremented on every resumeLocked call; checks resume at 0
        Slog.i(TAG, "Resuming HandlerChecker: " + mName + " for reason: "
                + reason + ". Pause count: " + mPauseCount);
    } else {
        Slog.wtf(TAG, "Already resumed HandlerChecker: " + mName);
    }
}
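Because the pause state is a counter rather than a flag, nested pauses need matching resumes. A minimal sketch of that counting behaviour follows; PauseCountSketch and its method names are hypothetical and only mirror the mPauseCount logic shown above.

```java
final class PauseCountSketch {
    private int pauseCount;

    void pause() {
        pauseCount++;
    }

    /** Returns false when there was no outstanding pause to undo. */
    boolean resume() {
        if (pauseCount > 0) {
            pauseCount--;
            return true;
        }
        return false;
    }

    boolean isPaused() {
        return pauseCount > 0;
    }

    public static void main(String[] args) {
        PauseCountSketch checker = new PauseCountSketch();
        checker.pause();
        checker.pause();                        // nested pause
        checker.resume();
        System.out.println(checker.isPaused()); // true: one pause is still outstanding
        checker.resume();
        System.out.println(checker.isPaused()); // false: checks would be scheduled again
        System.out.println(checker.resume());   // false: unmatched resume
    }
}
```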
Watchdog#run
This method is long, so it is explained in sections.
Scheduling the checks
private void run() {
    boolean waitedHalf = false;
    while (true) { // everything below runs inside this loop
        List<HandlerChecker> blockedCheckers = Collections.emptyList();
        String subject = "";
        boolean allowRestart = true;
        int debuggerWasConnected = 0;
        boolean doWaitedHalfDump = false;
        final ArrayList<Integer> pids;
        synchronized (mLock) {
            // check once per CHECK_INTERVAL; CHECK_INTERVAL = DEFAULT_TIMEOUT / 2
            long timeout = CHECK_INTERVAL;
            // Make sure we (re)spin the checkers that have become idle within
            // this wait-and-check interval
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked(); // schedule a check on every HandlerChecker
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            // NOTE: We use uptimeMillis() here because we do not want to increment the time we
            // wait while asleep. If the device is asleep then the thing that we are waiting
            // to timeout on is asleep as well and won't have a chance to run, causing a false
            // positive on when to kill things.
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) { // this loop guarantees a full CHECK_INTERVAL wait
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    mLock.wait(timeout);
                    // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            ...
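The inner while loop exists because Object#wait can return before the timeout, for instance on a notify or an interrupt. Below is a minimal plain-Java sketch of the same "recompute the remaining time and wait again" pattern. It assumes System.nanoTime as a stand-in for SystemClock.uptimeMillis, and FullIntervalWaitSketch, waitFullInterval, wakeEarly, and demo are hypothetical names.

```java
import java.util.concurrent.TimeUnit;

final class FullIntervalWaitSketch {
    private final Object lock = new Object();

    /** Waits on the lock until at least intervalMillis have elapsed; returns the elapsed time. */
    long waitFullInterval(long intervalMillis) {
        final long startNanos = System.nanoTime();
        synchronized (lock) {
            long remaining = intervalMillis;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the Watchdog loop, tolerate the interruption and keep waiting.
                }
                // An early wakeup leaves time on the clock, so wait for the remainder.
                remaining = intervalMillis - elapsedMillis(startNanos);
            }
        }
        return elapsedMillis(startNanos);
    }

    void wakeEarly() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    private static long elapsedMillis(long startNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    /** Wakes the waiter after about 20 ms and returns how long it actually waited. */
    static long demo() {
        FullIntervalWaitSketch sketch = new FullIntervalWaitSketch();
        Thread waker = new Thread(() -> {
            try { Thread.sleep(20); } catch (InterruptedException ignored) { }
            sketch.wakeEarly();
        });
        waker.start();
        return sketch.waitFullInterval(100);
    }

    public static void main(String[] args) {
        System.out.println("waited at least 100 ms: " + (demo() >= 100));
    }
}
```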
Evaluating the results
            ...
        synchronized (mLock) {
            // schedule the checks
            // wait CHECK_INTERVAL for the tasks to complete
            final int waitState = evaluateCheckerCompletionLocked(); // evaluate all checkers
            if (waitState == COMPLETED) { // done, continue with the next round
                // The monitors have returned; reset
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) { // every task has waited < CHECK_INTERVAL
                // still waiting but within their configured intervals; back off and recheck
                continue;
            } else if (waitState == WAITED_HALF) { // waited > CHECK_INTERVAL but < DEFAULT_TIMEOUT
                if (!waitedHalf) {
                    Slog.i(TAG, "WAITED_HALF");
                    waitedHalf = true;
                    // We've waited half, but we'd need to do the stack trace dump w/o the lock.
                    pids = new ArrayList<>(mInterestingJavaPids);
                    doWaitedHalfDump = true; // flag: dump the traces of the interesting processes
                } else {
                    continue;
                }
            } else { // waited > DEFAULT_TIMEOUT
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers); // description of what is blocked
                allowRestart = mAllowRestart;
                pids = new ArrayList<>(mInterestingJavaPids);
            }
        } // END synchronized (mLock)
evaluateCheckerCompletionLocked
This method evaluates the result of every HandlerChecker and returns the maximum, that is, the worst, of their completion states.
private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        // get the completion state of this HandlerChecker and keep the largest value
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}
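Taking the maximum works because the state constants are ordered by increasing lateness, so one late checker determines the overall result. A tiny sketch of that aggregation, with WorstStateSketch and worst as hypothetical names:

```java
final class WorstStateSketch {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    /** Returns the largest state, or COMPLETED when there are no checkers. */
    static int worst(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        // Three healthy checkers and one that has waited half: the round is WAITED_HALF.
        System.out.println(worst(COMPLETED, WAITING, COMPLETED, WAITED_HALF)); // prints 2
    }
}
```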
HandlerChecker#getCompletionStateLocked
// These are temporally ordered: larger values as lateness increases
private static final int COMPLETED = 0;
private static final int WAITING = 1;
private static final int WAITED_HALF = 2;
private static final int OVERDUE = 3;

public int getCompletionStateLocked() {
    if (mCompleted) { // all tasks have finished
        return COMPLETED;
    } else { // derive the state from how long the check has been pending
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) { // waited less than half of the limit
            return WAITING;
        } else if (latency < mWaitMax) { // waited more than half of the limit
            return WAITED_HALF;
        }
    }
    return OVERDUE; // waited longer than the maximum wait time
}
Dumping diagnostic information
        if (doWaitedHalfDump) { // waited more than half of the limit: write the first set of traces
            // We've waited half the deadlock-detection interval. Pull a stack
            // trace and wait another half.
            ActivityManagerService.dumpStackTraces(pids, null, null,
                    getInterestingNativePids(), null, subject);
            continue;
        }

        // From here on: handling of a full timeout
        // If we got here, that means that the system is most likely hung.
        // First collect stack traces from all threads of the system process.
        // Then kill this process so that the system will restart.
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject); // write the Watchdog event

        // Log the atom as early as possible since it is used as a mechanism to trigger
        // Perfetto. Ideally, the Perfetto trace capture should happen as close to the
        // point in time when the Watchdog happens as possible.
        FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);

        long anrTime = SystemClock.uptimeMillis();
        StringBuilder report = new StringBuilder();
        report.append(MemoryPressureUtil.currentPsiState());
        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
        StringWriter tracesFileException = new StringWriter();
        final File stack = ActivityManagerService.dumpStackTraces( // write the second set of traces
                pids, processCpuTracker, new SparseArray<>(), getInterestingNativePids(),
                tracesFileException, subject);

        // Give some extra time to make sure the stack traces get written.
        // The system's been hanging for a minute, another second or two won't hurt much.
        SystemClock.sleep(5000);

        processCpuTracker.update();
        report.append(processCpuTracker.printCurrentState(anrTime));
        report.append(tracesFileException.getBuffer());

        // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
        doSysRq('w'); // dump information to the kernel log
        doSysRq('l');

        // Try to add the error to the dropbox, but assuming that the ActivityManager
        // itself may be deadlocked. (which has happened, causing this statement to
        // deadlock and the watchdog as a whole to be ineffective)
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    // If a watched thread hangs before init() is called, we don't have a
                    // valid mActivity. So we can't log the error to dropbox.
                    if (mActivity != null) { // add the Watchdog report to the dropbox
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null, null,
                                null, report.toString(), stack, null, null, null,
                                errorId);
                    }
                }
            };
        dropboxThread.start();
        try {
            dropboxThread.join(2000); // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) { }
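The dropbox write is delegated to a helper thread and joined with a timeout so that a deadlocked ActivityManager cannot also hang the watchdog. A minimal plain-Java sketch of this bounded-join pattern follows; BoundedJoinSketch, runWithTimeout, and demo are hypothetical names, and marking the helper as a daemon thread is an addition for the demo that the original code does not make.

```java
import java.util.concurrent.CountDownLatch;

final class BoundedJoinSketch {
    /** Runs task on a helper thread and waits at most timeoutMillis; true if it finished. */
    static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        Thread helper = new Thread(task, "helper");
        helper.setDaemon(true); // demo-only: a stuck helper must not keep the JVM alive
        helper.start();
        try {
            helper.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !helper.isAlive();
    }

    /** Returns {quickTaskFinished, stuckTaskFinished}. */
    static boolean[] demo() {
        CountDownLatch never = new CountDownLatch(1);
        boolean quick = runWithTimeout(() -> { }, 2000);
        boolean stuck = runWithTimeout(() -> {
            try { never.await(); } catch (InterruptedException ignored) { }
        }, 100);
        never.countDown(); // let the stuck helper finish after the measurement
        return new boolean[] { quick, stuck };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("quick finished=" + result[0] + ", stuck finished=" + result[1]);
    }
}
```

The caller moves on after the timeout either way, which is the property the watchdog needs before it proceeds to the kill decision.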
Restart on timeout
    while (true) {
        // schedule the checks
        // wait CHECK_INTERVAL for the tasks to complete
        // evaluate the results
        // dump traces and related information
        IActivityController controller;
        synchronized (mLock) {
            controller = mController;
        }
        if (controller != null) { // an IActivityController was registered with Watchdog through AMS
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                // the IActivityController decides whether to wait or to restart the system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) { // a debugger is connected: do not restart
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) { // restart disallowed through setAllowRestart
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            // Restart: killing the system process makes zygote restart, which then
            // goes through the framework restart sequence.
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            if (!Build.IS_USER && isCrashLoopFound()
                    && !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
                breakCrashLoop();
            }
            Process.killProcess(Process.myPid()); // kill the system process
            System.exit(10); // exit the process
        }

        waitedHalf = false; // reset the flag so that a half-way dump can happen again
    }
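The order of the checks in this section can be condensed into a small decision function. This is an illustrative sketch only: RestartDecisionSketch, Action, and decide are hypothetical names, and a null controllerResult is an assumption used here to model "no controller registered or the call failed".

```java
final class RestartDecisionSketch {
    enum Action { KEEP_WAITING, SKIP_DEBUGGER, SKIP_NOT_ALLOWED, KILL_SYSTEM_PROCESS }

    /**
     * @param controllerResult value from systemNotResponding, or null if there is no
     *                         controller or the call threw RemoteException (assumption)
     * @param debuggerWasConnected 2 if connected now, 1 if connected recently, else 0
     * @param allowRestart the value set through setAllowRestart
     */
    static Action decide(Integer controllerResult, int debuggerWasConnected, boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING; // the controller asked to continue waiting
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_DEBUGGER; // never kill while a debugger is or was just attached
        }
        if (!allowRestart) {
            return Action.SKIP_NOT_ALLOWED;
        }
        return Action.KILL_SYSTEM_PROCESS;
    }

    public static void main(String[] args) {
        System.out.println(decide(1, 0, true));     // KEEP_WAITING
        System.out.println(decide(-1, 2, true));    // SKIP_DEBUGGER
        System.out.println(decide(null, 0, false)); // SKIP_NOT_ALLOWED
        System.out.println(decide(null, 0, true));  // KILL_SYSTEM_PROCESS
    }
}
```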
Handling the reboot broadcast
Watchdog's init method registers a receiver for the ACTION_REBOOT broadcast. When ACTION_REBOOT arrives with a non-zero int extra named nowait, the system is rebooted.
final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) { // the nowait extra is present and non-zero
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}
The rebootSystem method is shown below. It calls the reboot method of the power manager service (through IPowerManager), which reboots the whole device.
/**
 * Perform a full reboot of the system.
 */
void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}
This completes the walkthrough of the basic Watchdog flow.
Example log of the workflow
// WAITED_HALF log
NX709J_CNCommon_V2.35-system.txt:447: 02-26 11:22:45.135451 2235 2495 I Watchdog: WAITED_HALF
// timeout log, printing the file names of the first and second trace dumps
NX709J_CNCommon_V2.35-system.txt:515: 02-26 11:23:21.299433 2235 2495 E Watchdog: First set of traces taken from /data/anr/anr_2022-02-26-11-22-45-149
NX709J_CNCommon_V2.35-system.txt:516: 02-26 11:23:21.311599 2235 2495 E Watchdog: Second set of traces taken from /data/anr/anr_2022-02-26-11-23-15-848
// system_server is killed
NX709J_CNCommon_V2.35-system.txt:517: 02-26 11:23:21.323736 2235 2495 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on main thread (main)
NX709J_CNCommon_V2.35-system.txt:518: 02-26 11:23:21.324030 2235 2495 W Watchdog: main annotated stack trace:
NX709J_CNCommon_V2.35-system.txt:519: 02-26 11:23:21.324064 2235 2495 W Watchdog: at com.android.server.am.BatteryStatsService.initPowerManagement(BatteryStatsService.java:510)
NX709J_CNCommon_V2.35-system.txt:520: 02-26 11:23:21.324220 2235 2495 W Watchdog: - waiting to lock <0x0102e69c> (a com.android.internal.os.BatteryStatsImpl)
NX709J_CNCommon_V2.35-system.txt:521: 02-26 11:23:21.324235 2235 2495 W Watchdog: at com.android.server.am.ActivityManagerService.initPowerManagement(ActivityManagerService.java:2641)
NX709J_CNCommon_V2.35-system.txt:522: 02-26 11:23:21.324244 2235 2495 W Watchdog: at com.android.server.SystemServer.startBootstrapServices(SystemServer.java:1190)
NX709J_CNCommon_V2.35-system.txt:523: 02-26 11:23:21.324249 2235 2495 W Watchdog: at com.android.server.SystemServer.run(SystemServer.java:961)
NX709J_CNCommon_V2.35-system.txt:524: 02-26 11:23:21.324254 2235 2495 W Watchdog: at com.android.server.SystemServer.main(SystemServer.java:641)
NX709J_CNCommon_V2.35-system.txt:525: 02-26 11:23:21.324259 2235 2495 W Watchdog: at java.lang.reflect.Method.invoke(Native Method)
NX709J_CNCommon_V2.35-system.txt:526: 02-26 11:23:21.324269 2235 2495 W Watchdog: at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:567)
NX709J_CNCommon_V2.35-system.txt:527: 02-26 11:23:21.324274 2235 2495 W Watchdog: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:996)
NX709J_CNCommon_V2.35-system.txt:528: 02-26 11:23:21.324280 2235 2495 W Watchdog: *** GOODBYE!
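As a quick sanity check, the timestamps embedded in the two trace file names from the log above can be compared with CHECK_INTERVAL. The sketch below assumes the file-name suffix follows the pattern yyyy-MM-dd-HH-mm-ss-SSS, which matches the two names in this log but is not confirmed as a general format; TraceGapSketch and gapMillis are hypothetical names.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class TraceGapSketch {
    // Assumed timestamp layout of the anr_* file names seen in the log above.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss-SSS");

    /** Returns the time between two trace dumps in milliseconds. */
    static long gapMillis(String firstTracePath, String secondTracePath) {
        return Duration.between(parse(firstTracePath), parse(secondTracePath)).toMillis();
    }

    private static LocalDateTime parse(String path) {
        String stamp = path.substring(path.lastIndexOf("anr_") + "anr_".length());
        return LocalDateTime.parse(stamp, FORMAT);
    }

    public static void main(String[] args) {
        long gap = gapMillis(
                "/data/anr/anr_2022-02-26-11-22-45-149",
                "/data/anr/anr_2022-02-26-11-23-15-848");
        System.out.println(gap + " ms"); // prints 30699 ms
    }
}
```

The gap of roughly 30.7 s is consistent with the second dump being taken one CHECK_INTERVAL after the WAITED_HALF dump.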