Scheduling restart of crashed service: solution and source code analysis

Posted 2021-08-16 爱炒饭

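Testing uncovered a bug: a null pointer in one of a service's methods crashed the process, which triggered the app's keep-alive mechanism and restarted it. The faulty service was started first, however, and accessed resources that had not been initialized yet, so the app kept crashing and restarting in a loop.
The code below simulates the problem by showing a Toast from a worker thread inside the service, which crashes the process.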

public class MyService extends Service {
    @Override
    public int onStartCommand(Intent intent, int flags, int startId) {
        LogUtil.d("onStartCommand,flags="+super.onStartCommand(intent, flags, startId)+",START_NOT_STICKY="+START_NOT_STICKY);
        LogUtil.d("onStartCommand,super.onStartCommand(intent, flags, startId)="+super.onStartCommand(intent, flags, startId)+",super.onStartCommand(intent, Service.START_NOT_STICKY, startId)="+super.onStartCommand(intent, Service.START_NOT_STICKY, startId));
        // return super.onStartCommand(intent, flags, startId);
        //super.onStartCommand(intent, Service.START_NOT_STICKY, startId);
        return START_NOT_STICKY;
    }
    @Override
    public void onCreate() {
        LogUtil.d("onCreate");
        super.onCreate();
        new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    Thread.sleep(10_000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                LogUtil.d("run crash before");
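                // Crashes on purpose: this worker thread has no Looper, so a Toast cannot be shown from it.
                Toast.makeText(MyService.this, "Demo: updating the UI from a worker thread crashes", Toast.LENGTH_SHORT).show();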
                LogUtil.d("run crash after");
            }
        }).start();
    }

    public MyService() {
        LogUtil.d("MyService");
    }

    @Override
    public void onDestroy() {
        LogUtil.d("onDestroy");
        super.onDestroy();
    }
}

One message in the log is the key clue.

07-17 09:52:57.674  1022  1037 I ActivityManager: Process com.shan.mvvm (pid 13678) has died: prcp SVC 
07-17 09:52:57.675  1022  1037 W ActivityManager: Scheduling restart of crashed service com.shan.mvvm/.MyService in 1000ms for start-requested

1. Solution

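The system restarted the service because the app had asked for it to be started (hence "for start-requested" in the log). This brings us to the start modes that a Service reports through the return value of onStartCommand.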

/**
 * Constant to return from {@link #onStartCommand}: compatibility
 * version of {@link #START_STICKY} that does not guarantee that
 * {@link #onStartCommand} will be called again after being killed.
 */
public static final int START_STICKY_COMPATIBILITY = 0;

/**
 * Constant to return from {@link #onStartCommand}: if this service's
 * process is killed while it is started (after returning from
 * {@link #onStartCommand}), then leave it in the started state but
 * don't retain this delivered intent.  Later the system will try to
 * re-create the service.  Because it is in the started state, it will
 * guarantee to call {@link #onStartCommand} after creating the new
 * service instance; if there are not any pending start commands to be
 * delivered to the service, it will be called with a null intent
 * object, so you must take care to check for this.
 * 
 * <p>This mode makes sense for things that will be explicitly started
 * and stopped to run for arbitrary periods of time, such as a service
 * performing background music playback.
 */
public static final int START_STICKY = 1;

/**
 * Constant to return from {@link #onStartCommand}: if this service's
 * process is killed while it is started (after returning from
 * {@link #onStartCommand}), and there are no new start intents to
 * deliver to it, then take the service out of the started state and
 * don't recreate until a future explicit call to
 * {@link Context#startService Context.startService(Intent)}.  The
 * service will not receive a {@link #onStartCommand(Intent, int, int)}
 * call with a null Intent because it will not be restarted if there
 * are no pending Intents to deliver.
 * 
 * <p>This mode makes sense for things that want to do some work as a
 * result of being started, but can be stopped when under memory pressure
 * and will explicit start themselves again later to do more work.  An
 * example of such a service would be one that polls for data from
 * a server: it could schedule an alarm to poll every N minutes by having
 * the alarm start its service.  When its {@link #onStartCommand} is
 * called from the alarm, it schedules a new alarm for N minutes later,
 * and spawns a thread to do its networking.  If its process is killed
 * while doing that check, the service will not be restarted until the
 * alarm goes off.
 */
public static final int START_NOT_STICKY = 2;

/**
 * Constant to return from {@link #onStartCommand}: if this service's
 * process is killed while it is started (after returning from
 * {@link #onStartCommand}), then it will be scheduled for a restart
 * and the last delivered Intent re-delivered to it again via
 * {@link #onStartCommand}.  This Intent will remain scheduled for
 * redelivery until the service calls {@link #stopSelf(int)} with the
 * start ID provided to {@link #onStartCommand}.  The
 * service will not receive a {@link #onStartCommand(Intent, int, int)}
 * call with a null Intent because it will only be restarted if
 * it is not finished processing all Intents sent to it (and any such
 * pending events will be delivered at the point of restart).
 */
public static final int START_REDELIVER_INTENT = 3;

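There are four modes:
START_STICKY (1): after the service's process is killed, the system re-creates the service automatically but does not retain the previously delivered intent. START_STICKY_COMPATIBILITY (0) is the compatibility version of START_STICKY; it does not guarantee that onStartCommand will be called again after the service is killed.
START_NOT_STICKY (2): after the service's process is killed, the system does not restart it automatically (unless there are pending start intents to deliver).
START_REDELIVER_INTENT (3): after the service's process is killed, the system restarts it automatically and re-delivers the last intent.
Knowing what the four values mean, I passed START_NOT_STICKY to onStartCommand, but the service was still restarted. Why? It turned out that although I had passed START_NOT_STICKY to onStartCommand, I had done it the wrong way. My first, incorrect attempt looked like this: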

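public int onStartCommand(Intent intent, int flags, int startId) {
    // Wrong: START_NOT_STICKY is passed in the flags position, and super still returns START_STICKY.
    return super.onStartCommand(intent, Service.START_NOT_STICKY, startId);
}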

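Experienced readers will already see the mistake: the second parameter of onStartCommand is flags, not the start mode, and the default implementation ignores it. As a result super.onStartCommand(intent, Service.START_NOT_STICKY, startId) still returns START_STICKY, as the log confirms. The start mode is whatever the method returns, so it is enough to return START_NOT_STICKY directly.
The correct version looks like this: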

public int onStartCommand(Intent intent, int flags, int startId) {       
    LogUtil.d("onStartCommand,flags="+super.onStartCommand(intent, flags, startId)+",START_NOT_STICKY="+START_NOT_STICKY);
    LogUtil.d("onStartCommand,super.onStartCommand(intent, flags, startId)="+super.onStartCommand(intent, flags, startId)+",super.onStartCommand(intent, Service.START_NOT_STICKY, startId)="+super.onStartCommand(intent, Service.START_NOT_STICKY, startId));
    // return super.onStartCommand(intent, flags, startId);
    //super.onStartCommand(intent, Service.START_NOT_STICKY, startId);
    return START_NOT_STICKY;
}

The log output is:

onStartCommand,flags=1,START_NOT_STICKY=2
onStartCommand,super.onStartCommand(intent, flags, startId)=1,super.onStartCommand(intent, Service.START_NOT_STICKY, startId)=1
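
The two log lines above can be reproduced off-device with a small plain-Java model. This is only a sketch: BaseService is a hypothetical stand-in for android.app.Service, and it assumes the default onStartCommand ignores its flags argument and returns START_STICKY (the compatibility case for very old target SDKs is left out).

```java
public class StartModeDemo {
    // Values copied from the constants quoted earlier in this article.
    static final int START_STICKY = 1;
    static final int START_NOT_STICKY = 2;

    /** Hypothetical stand-in for android.app.Service, for illustration only. */
    static class BaseService {
        // Assumed default behaviour: "flags" is ignored, START_STICKY is returned.
        public int onStartCommand(Object intent, int flags, int startId) {
            return START_STICKY;
        }
    }

    /** The mistake: the constant travels in the "flags" slot and super's result is returned. */
    static class WrongService extends BaseService {
        @Override
        public int onStartCommand(Object intent, int flags, int startId) {
            return super.onStartCommand(intent, START_NOT_STICKY, startId);
        }
    }

    /** The fix: the constant is the return value itself. */
    static class FixedService extends BaseService {
        @Override
        public int onStartCommand(Object intent, int flags, int startId) {
            return START_NOT_STICKY;
        }
    }

    public static void main(String[] args) {
        System.out.println("wrong=" + new WrongService().onStartCommand(null, 0, 1)); // wrong=1
        System.out.println("fixed=" + new FixedService().onStartCommand(null, 0, 1)); // fixed=2
    }
}
```

The point of the model is that the start mode only reaches the system through the return value; nothing placed in the flags argument can change it.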

2. Source code analysis

As described in the article on the startService flow, starting a service eventually reaches realStartServiceLocked, which calls sendServiceArgsLocked to deliver the arguments for onStartCommand.

2.1 ActiveServices.realStartServiceLocked

 //ActiveServices.java
private final void realStartServiceLocked(ServiceRecord r,
        ProcessRecord app, boolean execInFg) throws RemoteException {
……
        app.thread.scheduleCreateService(r, r.serviceInfo,
                mAm.compatibilityInfoForPackageLocked(r.serviceInfo.applicationInfo),
                app.repProcState); // create the service
 ……
    sendServiceArgsLocked(r, execInFg, true); // deliver the service's start arguments
……
}

2.2 ActiveServices.sendServiceArgsLocked

sendServiceArgsLocked calls scheduleServiceArgs in the app process through r.app.thread.

 //ActiveServices.java
private final void sendServiceArgsLocked(ServiceRecord r, boolean execInFg,
        boolean oomAdjusted) throws TransactionTooLargeException {
……
        r.app.thread.scheduleServiceArgs(r, slice);
……
}

2.3 ActivityThread.scheduleServiceArgs

scheduleServiceArgs lives in ApplicationThread, an inner class of ActivityThread.java. It takes the list of start arguments for the service, iterates over it, and sends an H.SERVICE_ARGS message through the Handler for each entry.

//ActivityThread$ApplicationThread
public final void scheduleServiceArgs(IBinder token, ParceledListSlice args) {
    List<ServiceStartArgs> list = args.getList();

    for (int i = 0; i < list.size(); i++) {
        ServiceStartArgs ssa = list.get(i);
        ServiceArgsData s = new ServiceArgsData();
        s.token = token;
        s.taskRemoved = ssa.taskRemoved;
        s.startId = ssa.startId;
        s.flags = ssa.flags;
        s.args = ssa.args;

        sendMessage(H.SERVICE_ARGS, s);
    }
}

2.4 H.handleMessage

H is an inner class of ActivityThread.java and extends Handler. The H.SERVICE_ARGS message is processed in H.handleMessage, which then calls handleServiceArgs.

//ActivityThread$H
   public void handleMessage(Message msg) {
        ……
            case SERVICE_ARGS:
                if (Trace.isTagEnabled(Trace.TRACE_TAG_ACTIVITY_MANAGER)) {
                    Trace.traceBegin(Trace.TRACE_TAG_ACTIVITY_MANAGER,
                            ("serviceStart: " + String.valueOf(msg.obj)));
                }
                handleServiceArgs((ServiceArgsData)msg.obj);
                Trace.traceEnd(Trace.TRACE_TAG_ACTIVITY_MANAGER);
                break;
            ……
    }

2.5 ActivityThread.handleServiceArgs

handleServiceArgs in ActivityThread.java first stores the return value of the service's onStartCommand in the local int variable res, then passes res as an argument to AMS's serviceDoneExecuting method.

//ActivityThread.java
private void handleServiceArgs(ServiceArgsData data) {
    Service s = mServices.get(data.token);
    if (s != null) {
        try {
            if (data.args != null) {
                data.args.setExtrasClassLoader(s.getClassLoader());
                data.args.prepareToEnterProcess();
            }
            int res;
            if (!data.taskRemoved) {
                // Capture the value returned by the service's onStartCommand
                res = s.onStartCommand(data.args, data.flags, data.startId); 
            } else {
                s.onTaskRemoved(data.args);
                res = Service.START_TASK_REMOVED_COMPLETE;
            }

            QueuedWork.waitToFinish();

            try {
                // Report the onStartCommand return value to AMS
                ActivityManager.getService().serviceDoneExecuting(
                        data.token, SERVICE_DONE_EXECUTING_START, data.startId, res);
            } catch (RemoteException e) {
                throw e.rethrowFromSystemServer();
            }
        } catch (Exception e) {
            if (!mInstrumentation.onException(s, e)) {
                throw new RuntimeException(
                        "Unable to start service " + s
                        + " with " + data.args + ": " + e.toString(), e);
            }
        }
    }
}

2.5 ActivityManagerService.serviceDoneExecuting

AMS's serviceDoneExecuting method calls serviceDoneExecutingLocked in ActiveServices.java.

//ActivityManagerService.java
public void serviceDoneExecuting(IBinder token, int type, int startId, int res) {
    synchronized(this) {
        if (!(token instanceof ServiceRecord)) {
            Slog.e(TAG, "serviceDoneExecuting: Invalid service token=" + token);
            throw new IllegalArgumentException("Invalid service token");
        }
        mServices.serviceDoneExecutingLocked((ServiceRecord)token, type, startId, res);
    }
}

2.6 ActiveServices.serviceDoneExecutingLocked

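serviceDoneExecutingLocked in ActiveServices.java handles each possible onStartCommand return value. The variable to watch is r.stopIfKilled: for START_STICKY it is set to false, meaning the service is restarted after being killed; for START_NOT_STICKY it is set to true (when the start ID is the last one issued), meaning the service stays stopped once killed. The ServiceRecord is then passed on to the overloaded serviceDoneExecutingLocked method, which is elided from the listing.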

//ActiveServices.java
void serviceDoneExecutingLocked(ServiceRecord r, int type, int startId, int res) {
    boolean inDestroying = mDestroyingServices.contains(r);
    if (r != null) {
        // For the start phase, only the ActivityThread.SERVICE_DONE_EXECUTING_START type matters
        if (type == ActivityThread.SERVICE_DONE_EXECUTING_START) {
            // This is a call from a service start...  take care of
            // book-keeping.
            r.callStart = true;
            switch (res) {
                case Service.START_STICKY_COMPATIBILITY:
                case Service.START_STICKY: {
                    // We are done with the associated start arguments.
                    r.findDeliveredStart(startId, false, true);
                    // Don't stop if killed.
                    // START_STICKY: stopIfKilled is false, so the service is restarted after being killed
                    r.stopIfKilled = false;
                    break;
                }
                case Service.START_NOT_STICKY: {
                    // We are done with the associated start arguments.
                    r.findDeliveredStart(startId, false, true);
                    if (r.getLastStartId() == startId) {
                        // There is no more work, and this service
                        // doesn't want to hang around if killed.
                        // START_NOT_STICKY: stopIfKilled is true, so the service stays stopped after being killed
                        r.stopIfKilled = true;
                    }
                    break;
                }
                case Service.START_REDELIVER_INTENT: {
                    // We'll keep this item until they explicitly
                    // call stop for it, but keep track of the fact
                    // that it was delivered.
                    ServiceRecord.StartItem si = r.findDeliveredStart(startId, false, false);
                    if (si != null) {
                        si.deliveryCount = 0;
                        si.doneExecutingCount++;
                        // Don't stop if killed.
                        r.stopIfKilled = true;
                    }
                    break;
                }
                case Service.START_TASK_REMOVED_COMPLETE: {
                    // Special processing for onTaskRemoved().  Don't
                    // impact normal onStartCommand() processing.
                    r.findDeliveredStart(startId, true, true);
                    break;
                }
                default:
                    throw new IllegalArgumentException(
                            "Unknown service start result: " + res);
            }
            if (res == Service.START_STICKY_COMPATIBILITY) {
                r.callStart = false;
            }
        }
   ……
}
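
The switch above boils down to a small mapping from the onStartCommand result to the stopIfKilled flag. The following plain-Java model is a sketch of that mapping only: Record is a hypothetical, stripped-down stand-in for ServiceRecord, the START_TASK_REMOVED_COMPLETE case is left out, and the START_REDELIVER_INTENT branch assumes the delivered start item was found.

```java
public class StopIfKilledModel {
    static final int START_STICKY_COMPATIBILITY = 0;
    static final int START_STICKY = 1;
    static final int START_NOT_STICKY = 2;
    static final int START_REDELIVER_INTENT = 3;

    /** Hypothetical stand-in for ServiceRecord; only the fields the switch touches. */
    static class Record {
        boolean stopIfKilled;
        boolean callStart;
        int lastStartId;
    }

    /** Mirrors the bookkeeping done for a SERVICE_DONE_EXECUTING_START report. */
    static void onStartDone(Record r, int startId, int res) {
        r.callStart = true;
        switch (res) {
            case START_STICKY_COMPATIBILITY:
            case START_STICKY:
                r.stopIfKilled = false; // keep the service: restart it if killed
                break;
            case START_NOT_STICKY:
                // Only the most recent start request may mark the service as stoppable.
                if (r.lastStartId == startId) {
                    r.stopIfKilled = true;
                }
                break;
            case START_REDELIVER_INTENT:
                // Assumption: the delivered start item exists, as in the normal case.
                r.stopIfKilled = true;
                break;
            default:
                throw new IllegalArgumentException("Unknown service start result: " + res);
        }
        if (res == START_STICKY_COMPATIBILITY) {
            r.callStart = false;
        }
    }

    public static void main(String[] args) {
        Record r = new Record();
        r.lastStartId = 1;
        onStartDone(r, 1, START_STICKY);
        System.out.println("START_STICKY -> stopIfKilled=" + r.stopIfKilled);
        onStartDone(r, 1, START_NOT_STICKY);
        System.out.println("START_NOT_STICKY -> stopIfKilled=" + r.stopIfKilled);
    }
}
```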

2.7 ActivityManagerService.appDiedLocked

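Searching for uses of r.stopIfKilled turns up, among others, the canStopIfKilled method in ServiceRecord.java, whose name suggests it is involved in the restart decision. The first line of the crash-restart log shown earlier, "Process com.shan.mvvm (pid 13678) has died: prcp SVC", is printed by AMS's appDiedLocked method. From there, follow the call to handleAppDiedLocked: its third argument, allowRestart, is true, which means a restart is allowed.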

//ActivityManagerService.java
final void appDiedLocked(ProcessRecord app, int pid, IApplicationThread thread,
        boolean fromBinderDied, String reason) {
   ……
    // Clean up already done if the process has been re-started.
    if (app.pid == pid && app.thread != null &&
            app.thread.asBinder() == thread.asBinder()) {
        boolean doLowMem = app.getActiveInstrumentation() == null;
        boolean doOomAdj = doLowMem;
        if (!app.killedByAm) {
            // Log that the app process has died
            reportUidInfoMessageLocked(TAG,
                    "Process " + app.processName + " (pid " + pid + ") has died: "
                            + ProcessList.makeOomAdjString(app.setAdj, true) + " "
                            + ProcessList.makeProcStateString(app.setProcState), app.info.uid);
            mAllowLowerMemLevel = true;
        } else {
            // Note that we always want to do oom adj to update our state with the
            // new number of procs.
            mAllowLowerMemLevel = false;
            doLowMem = false;
        }
        EventLogTags.writeAmProcDied(app.userId, app.pid, app.processName, app.setAdj,
                app.setProcState);
        if (DEBUG_CLEANUP) Slog.v(TAG_CLEANUP,
            "Dying app: " + app + ", pid: " + pid + ", thread: " + thread.asBinder());
        // Handle the app's death
        handleAppDiedLocked(app, false, true);
……
}

2.8 ActivityManagerService.handleAppDiedLocked

handleAppDiedLocked calls cleanUpApplicationRecordLocked to clean up the application record; allowRestart is still true at this point.

//ActivityManagerService.java
final void handleAppDiedLocked(ProcessRecord app,
        boolean restarting, boolean allowRestart) {
    int pid = app.pid;
    boolean kept = cleanUpApplicationRecordLocked(app, restarting, allowRestart, -1,
            false /*replacingPid*/);
……
}

2.9 ActivityManagerService.cleanUpApplicationRecordLocked

cleanUpApplicationRecordLocked in turn calls killServicesLocked in ActiveServices.

//ActivityManagerService.java
final boolean cleanUpApplicationRecordLocked(ProcessRecord app,
        boolean restarting, boolean allowRestart, int index, boolean replacingPid) {
    ……
    mServices.killServicesLocked(app, allowRestart);
	……
}

2.10 ActiveServices.killServicesLocked

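killServicesLocked in ActiveServices.java takes the service's crash count into account. Because allowRestart is passed in as true here, as long as the crash count is below 16 (and the service's user is still running) the code falls through to the else branch and calls scheduleServiceRestartLocked.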

//ActiveServices.java
final void killServicesLocked(ProcessRecord app, boolean allowRestart) {
    // Report disconnected services.
    ……
        // Any services running in the application may need to be placed
        // back in the pending list.
        // allowRestart is true here; BOUND_SERVICE_MAX_CRASH_RETRY is 16
        if (allowRestart && sr.crashCount >= mAm.mConstants.BOUND_SERVICE_MAX_CRASH_RETRY
                && (sr.serviceInfo.applicationInfo.flags
                    &ApplicationInfo.FLAG_PERSISTENT) == 0) {
            Slog.w(TAG, "Service crashed " + sr.crashCount
                    + " times, stopping: " + sr);
            EventLog.writeEvent(EventLogTags.AM_SERVICE_CRASHED_TOO_MUCH,
                    sr.userId, sr.crashCount, sr.shortInstanceName, app.pid);
            bringDownServiceLocked(sr);
        } else if (!allowRestart
                || !mAm.mUserController.isUserRunning(sr.userId, 0)) {
            bringDownServiceLocked(sr);
        } else {
            // Try to schedule a restart of the service here
            final boolean scheduled = scheduleServiceRestartLocked(sr, true /* allowCancel */);
          ……
        }
    }
……
}
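
The three-way branch in killServicesLocked can be written as a pure function, which makes the boundary cases easier to check. This is a sketch, not the framework code: the Action enum and the decide method are hypothetical names, and the threshold of 16 is the value quoted above for BOUND_SERVICE_MAX_CRASH_RETRY.

```java
public class RestartDecisionModel {
    // Value quoted in this article for BOUND_SERVICE_MAX_CRASH_RETRY.
    static final int MAX_CRASH_RETRY = 16;

    /** Hypothetical labels for the three branches of the if / else-if / else chain. */
    enum Action { BRING_DOWN_TOO_MANY_CRASHES, BRING_DOWN, SCHEDULE_RESTART }

    static Action decide(boolean allowRestart, int crashCount,
                         boolean persistentApp, boolean userRunning) {
        if (allowRestart && crashCount >= MAX_CRASH_RETRY && !persistentApp) {
            // Crashed too often and the app is not persistent: give up.
            return Action.BRING_DOWN_TOO_MANY_CRASHES;
        } else if (!allowRestart || !userRunning) {
            // Restart not permitted, or the owning user is no longer running.
            return Action.BRING_DOWN;
        } else {
            // The case analysed in this article: try to restart the service.
            return Action.SCHEDULE_RESTART;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(true, 0, false, true));
        System.out.println(decide(true, 16, false, true));
        System.out.println(decide(false, 0, false, true));
    }
}
```

Note that a persistent app never takes the first branch, so in this model it keeps being scheduled for restart regardless of its crash count.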

2.11 ActiveServices.scheduleServiceRestartLocked

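scheduleServiceRestartLocked uses the canStopIfKilled method. As noted above, canStopIfKilled returns false for a START_STICKY service and true for a START_NOT_STICKY one. For a START_STICKY service the restart logic therefore continues and the "Scheduling restart of crashed service" log line is printed.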
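
The outcome described above can be modelled in a few lines of plain Java. This is a deliberately simplified sketch and not the framework implementation: scheduleRestart is a hypothetical helper, and it assumes the service was start-requested, has no pending starts and no auto-create bindings, so that the stopIfKilled flag alone decides whether a restart is scheduled. The message format is taken from the log line quoted at the top of this article.

```java
import java.util.Optional;

public class RestartLogModel {
    /**
     * Returns the log line for a scheduled restart, or an empty Optional when the
     * service is allowed to stay stopped (stopIfKilled == true in this model).
     */
    static Optional<String> scheduleRestart(String serviceName, boolean stopIfKilled, long delayMs) {
        if (stopIfKilled) {
            // START_NOT_STICKY case: nothing to restart.
            return Optional.empty();
        }
        // START_STICKY case: the restart is scheduled and logged.
        return Optional.of("Scheduling restart of crashed service " + serviceName
                + " in " + delayMs + "ms for start-requested");
    }

    public static void main(String[] args) {
        System.out.println(scheduleRestart("com.shan.mvvm/.MyService", false, 1000));
        System.out.println(scheduleRestart("com.shan.mvvm/.MyService", true, 1000));
    }
}
```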
