突然弹出王者荣耀停止运行，GC超时导致的后台应用崩溃问题分析

Posted 2022-07-09 River_ly

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了突然弹出王者荣耀停止运行，GC超时导致的后台应用崩溃问题分析相关的知识，希望对你有一定的参考价值。

写在前面

这个问题之所以会拿出来仔细分析，一方面是因为这个问题不是简单的应用崩溃而是框架层的报错，另一方面是因为希望通过这个问题梳理下后台GC的超时检测机制怎样的，这样我们后面在应用层如果重写finalize方法回收时会考虑的更加全面点。

问题背景

复现概率: 偶现
问题版本: android R
问题现象: 处于微信界面，突然弹出王者荣耀停止运行

初步分析

拿到问题日志后，先看下报错的堆栈。

09-02 20:53:26.679  2073  2089 E AndroidRuntime: FATAL EXCEPTION: FinalizerWatchdogDaemon
09-02 20:53:26.679  2073  2089 E AndroidRuntime: Process: com.tencent.tmgp.sgame:xg_vip_service, PID: 2073
09-02 20:53:26.679  2073  2089 E AndroidRuntime: java.util.concurrent.TimeoutException: android.database.BulkCursorToCursorAdaptor.finalize() timed out after 10 seconds
//省略部分堆栈
09-02 20:53:26.679  2073  2089 E AndroidRuntime:at android.database.AbstractCursor.finalize(AbstractCursor.java:524)
09-02 20:53:26.679  2073  2089 E AndroidRuntime:at java.lang.Daemons$FinalizerDaemon.doFinalize(Daemons.java:291)
09-02 20:53:26.679  2073  2089 E AndroidRuntime:at java.lang.Daemons$FinalizerDaemon.runInternal(Daemons.java:278)

单单从这段堆栈看的话，BulkCursorToCursorAdaptor执行finalize超过了10s，导致FinalizerWatchdogDaemon报错，FinalizerWatchdogDaemon字面上看像是监测回收超时的守护线程。
看下FinalizerWatchdogDaemon代码中的作用解释。

/**
 * The watchdog exits the VM if the finalizer ever gets stuck. We consider
 * the finalizer to be stuck if it spends more than MAX_FINALIZATION_MILLIS
 * on one instance.
 */
private static class FinalizerWatchdogDaemon extends Daemon 
    @UnsupportedAppUsage
    private static final FinalizerWatchdogDaemon INSTANCE = new FinalizerWatchdogDaemon();

    private boolean needToWork = true;  // Only accessed in synchronized methods.

    private long finalizerTimeoutNs = 0;  // Lazily initialized.

    FinalizerWatchdogDaemon() 
        super("FinalizerWatchdogDaemon");

简单解释下就是：如果对象的finalize出现阻塞超时了会导致进程退出

这个问题中对应的是数据库的关闭，当然也可以发生在其它场景下，只要重写了成员函数finalize的对象都有可能会遇到这个问题，所以如果再遇到GC超时的报错，报错堆栈AndroidRuntime:at java.lang.Daemons$上面的内容可能会不一样。
那么对于重写了成员函数finalize的对象，当它们被GC决定要被回收时，会立刻回收吗？
其实不会马上被回收，而是被放入到一个队列中，等待FinalizerDaemon守护线程去调用它们的成员函数finalize后再被回收。

/**
 * This heap management thread moves elements from the garbage collector's
 * pending list to the managed reference queue.
 */
private static class ReferenceQueueDaemon extends Daemon 
    @UnsupportedAppUsage
    private static final ReferenceQueueDaemon INSTANCE = new ReferenceQueueDaemon();

    ReferenceQueueDaemon() 
        super("ReferenceQueueDaemon");
    

    @Override public void runInternal() 
        while (isRunning()) 
            Reference<?> list;
            try 
                synchronized (ReferenceQueue.class) 
                    while (ReferenceQueue.unenqueued == null) 
                        ReferenceQueue.class.wait();
                    
                    list = ReferenceQueue.unenqueued;
                    ReferenceQueue.unenqueued = null;
                
             catch (InterruptedException e) 
                continue;
             catch (OutOfMemoryError e) 
                continue;
            
            ReferenceQueue.enqueuePending(list);

超时阈值

// This used to be final. IT IS NOW ONLY WRITTEN. We now update it when we look at the command
// line argument, for the benefit of mis-behaved apps that might read it.  SLATED FOR REMOVAL.
// There is no reason to use this: Finalizers should not rely on the value. If a finalizer takes
// appreciable time, the work should be done elsewhere.  Based on disassembly of Daemons.class,
// the value is effectively inlined, so changing the field never did have an effect.
// DO NOT USE. FOR ANYTHING. THIS WILL BE REMOVED SHORTLY.
@UnsupportedAppUsage
private static long MAX_FINALIZE_NANOS = 10L * 1000 * NANOS_PER_MILLI;

注释中对于该值的说明是它很快将被移除，实际这个值在代码中并没有起到真正的作用了，更新它的值是为了方便在外边读取到。
真正的超时阈值是通过VMRuntime.getFinalizerTimeoutMs获取，默认值是10s.

finalizer_timeout_ms_ = runtime_options.GetOrDefault(Opt::FinalizerTimeoutMs);
RUNTIME_OPTIONS_KEY (unsigned int, FinalizerTimeoutMs, 10000u)

超时检测

通过watchdog机制检测finalizer在超时时间内有没有成功析构回收对象。

* The watchdog exits the VM if the finalizer ever gets stuck. We consider
 * the finalizer to be stuck if it spends more than MAX_FINALIZATION_MILLIS
 * on one instance.
 */
private static class FinalizerWatchdogDaemon extends Daemon 
    @UnsupportedAppUsage
    private static final FinalizerWatchdogDaemon INSTANCE = new FinalizerWatchdogDaemon();

    private boolean needToWork = true;  // Only accessed in synchronized methods.

    private long finalizerTimeoutNs = 0;  // Lazily initialized.

    FinalizerWatchdogDaemon() 
        super("FinalizerWatchdogDaemon");
    

    @Override public void runInternal() 
        while (isRunning()) 
            if (!sleepUntilNeeded())  (1)
                // We have been interrupted, need to see if this daemon has been stopped.
                continue;
            
            final Object finalizing = waitForFinalization();(2)
            if (finalizing != null && !VMDebug.isDebuggerConnected()) 
                finalizerTimedOut(finalizing);(3)
                break;

Step1 GC前的检查

/**
        * Notify daemon that it's OK to sleep until notified that something is ready to be
        * finalized.
        */
       private synchronized void goToSleep() 
           needToWork = false;
       

       /**
        * Notify daemon that there is something ready to be finalized.
        */
       private synchronized void wakeUp() 
           needToWork = true;
           notify();

开启回收之前，needToWork会被置为true，此时sleepUntilNeeded返回的是true，所以线程不会wait

@Override public void runInternal() 
           // This loop may be performance critical, since we need to keep up with mutator
           // generation of finalizable objects.
           // We minimize the amount of work we do per finalizable object. For example, we avoid
           // reading the current time here, since that involves a kernel call per object.  We
           // limit fast path communication with FinalizerWatchDogDaemon to what's unavoidable: A
           // non-volatile store to communicate the current finalizable object, e.g. for
           // reporting, and a release store (lazySet) to a counter.
           // We do stop the  FinalizerWatchDogDaemon if we have nothing to do for a
           // potentially extended period.  This prevents the device from waking up regularly
           // during idle times.

           // Local copy of progressCounter; saves a fence per increment on ARM and MIPS.
           int localProgressCounter = progressCounter.get();

           while (isRunning()) 
               try 
                   // Use non-blocking poll to avoid FinalizerWatchdogDaemon communication
                   // when busy.
                   FinalizerReference<?> finalizingReference = (FinalizerReference<?>)queue.poll();
                   if (finalizingReference != null) 
                       finalizingObject = finalizingReference.get();
                       progressCounter.lazySet(++localProgressCounter);
                    else 
                       finalizingObject = null;
                       progressCounter.lazySet(++localProgressCounter);
                       // Slow path; block.
                       FinalizerWatchdogDaemon.INSTANCE.goToSleep();
                       finalizingReference = (FinalizerReference<?>)queue.remove();
                       finalizingObject = finalizingReference.get();
                       progressCounter.set(++localProgressCounter);
                       //回收之前先唤醒看门狗线程
                       FinalizerWatchdogDaemon.INSTANCE.wakeUp();
                   
                   //开始回收的流程
                   doFinalize(finalizingReference);
                catch (InterruptedException ignored) 
                catch (OutOfMemoryError ignored)

如果此时线程处于wait，被中断了或者有OOME发生时，这个时候回到开头判断下isRunning()，也就是看下回收对象这个线程是否为空，如果该线程为空的话，这个循环体就没有必要再继续执行下去了。

/**
        * Wait until something is ready to be finalized.
        * Return false if we have been interrupted
        * See also http://code.google.com/p/android/issues/detail?id=22778.
        */
       private synchronized boolean sleepUntilNeeded() 
           while (!needToWork) 
               try 
                   wait();
                catch (InterruptedException e) 
                   // Daemon.stop may have interrupted us.
                   return false;
                catch (OutOfMemoryError e) 
                   return false;
               
           
           return true;

Step2 等待GC完成

这一步是等待回收结束的过程，这个睡眠过程中如果被中断，说明在这个周期内完成了析构，直接返回null

/**
        * Return an object that took too long to finalize or return null.
        * Wait VMRuntime.getFinalizerTimeoutMs.  If the FinalizerDaemon took essentially the
        * whole time processing a single reference, return that reference.  Otherwise return
        * null.  Only called from a single thread.
        */
       private Object waitForFinalization() 
           if (finalizerTimeoutNs == 0) 
               finalizerTimeoutNs =
                       NANOS_PER_MILLI * VMRuntime.getRuntime().getFinalizerTimeoutMs();
               // Temporary app backward compatibility. Remove eventually.
               MAX_FINALIZE_NANOS = finalizerTimeoutNs;
           
           long startCount = FinalizerDaemon.INSTANCE.progressCounter.get();
           // Avoid remembering object being finalized, so as not to keep it alive.
           //如果回收对象没有超时的话，这里会返回null
           if (!sleepForNanos(finalizerTimeoutNs)) 
               // Don't report possibly spurious timeout if we are interrupted.
               return null;
           
           if (getNeedToWork() && FinalizerDaemon.INSTANCE.progressCounter.get() == startCount) 
               // We assume that only remove() and doFinalize() may take time comparable to
               // the finalizer timeout.
               // We observed neither the effect of the gotoSleep() nor the increment preceding a
               // later wakeUp. Any remove() call by the FinalizerDaemon during our sleep
               // interval must have been followed by a wakeUp call before we checked needToWork.
               // But then we would have seen the counter increment.  Thus there cannot have
               // been such a remove() call.
               // The FinalizerDaemon must not have progressed (from either the beginning or the
               // last progressCounter increment) to either the next increment or gotoSleep()
               // call.  Thus we must have taken essentially the whole finalizerTimeoutMs in a
               // single doFinalize() call.  Thus it's OK to time out.  finalizingObject was set
               // just before the counter increment, which preceded the doFinalize call.  Thus we
               // are guaranteed to get the correct finalizing value below, unless doFinalize()
               // just finished as we were timing out, in which case we may get null or a later
               // one.  In this last case, we are very likely to discard it below.
               Object finalizing = FinalizerDaemon.INSTANCE.finalizingObject;
               sleepForNanos(500 * NANOS_PER_MILLI);
               // Recheck to make it even less likely we report the wrong finalizing object in
               // the case which a very slow finalization just finished as we were timing out.
               if (getNeedToWork()
                       && FinalizerDaemon.INSTANCE.progressCounter.get() == startCount) 
                   return finalizing;
               
           
           return null;

sleepForNanos对应的函数很简单，如果在超时时间内完成GC，就会计算传进来的超时阈值减去当前已经睡眠的时间，如果这个差值小于0，说明睡眠的时间超过了阈值。

/**
        * Sleep for the given number of nanoseconds, or slightly longer.
        * @return false if we were interrupted.
        */
       private boolean sleepForNanos(long durationNanos) 
           // It's important to base this on nanoTime(), not currentTimeMillis(), since
           // the former stops counting when the processor isn't running.
           long startNanos = System.nanoTime();
           while (true) 
               long elapsedNanos = System.nanoTime() - startNanos;
               long sleepNanos = durationNanos - elapsedNanos;
               if (sleepNanos <= 0) 
                   return true;
               
               // Ensure the nano time is always rounded up to the next whole millisecond,
               // ensuring the delay is >= the requested delay.
               long sleepMillis = (sleepNanos + NANOS_PER_MILLI - 1) / NANOS_PER_MILLI;
               try 
                   Thread.sleep(sleepMillis);
                catch (InterruptedException e) 
                   if (!isRunning()) 
                       return false;
                   
                catch (OutOfMemoryError ignored) 
                   if (!isRunning()) 
                       return false;

Step3 GC处理超时
如果第二步中的超时时间内析构没有完成，则返回析构的对象，触发finalizerTimedOut。
到了这一步是最不希望看到的结局，此时系统会弹出应用停止运行的报错框。

注意这个时候并没有立刻杀死进程，杀死进程的选择权交给了用户，即通过弹窗展示给用户，但对于用户来说会一头雾水。

分析结论

这种问题其实还是比较常见的，特别是低内存的机器上。RootCasue就是对象回收超时了，一般是由于队列中等待FinalizerDaemon线程回收的对象太多导致，或者此时系统资源异常紧张比如CPU负载过高或者低内存环境下。

场景实测

模拟还原现场

通过模拟GC时耗时操作，应用退到后台后10s会弹出报错框，堆栈如下。

验证了超时时间的确是10s，同时也验证了GC时耗时的操作确实会可能触发这个现象

对比机情况

在手头的小米note9 pro上进行场景模拟测试，模拟GC耗时100s的情况。

在小米的机器上，到了默认的10s后并不会有弹窗，说明小米肯定修改了超时时间，第一次是等待了全部的100s后竟然正常回收，说明超时时间设置的比较大。紧接着下一次在达到了近80s时，进程收到signal 9直接被kill了，此时再点击应用是冷启动。

小米修改了超时阈值(超过100s)，通过直接sig 9杀掉了进程，没有报错弹窗，所以用户无感知

测试机情况

同样的在我们的机器上模拟GC耗时100s的情况
退出应用到后台，此时系统触发GC回收，达到十秒钟时，界面上直接弹出停止运行的报错框，此时只有点击了关闭应用，才会去kill进程

修改策略

在GC规定的超时时间内如果没有完成析构，直接sig 9给对应进程

以上是关于突然弹出王者荣耀停止运行，GC超时导致的后台应用崩溃问题分析的主要内容，如果未能解决你的问题，请参考以下文章