spark--job和DAGScheduler源码

Posted parent-absent-son

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了spark--job和DAGScheduler源码相关的知识,希望对你有一定的参考价值。

 

 一个job对应一个action操作,action执行会有先后顺序;

每个job执行会先构建一个DAG路径,一个job会含有多个stage,主要逻辑在DAGScheduler。

spark提交job的源码见(SparkContext.scala的runJob方法):

  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD‘s recursive dependencies:
" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

DAGScheduler--job调度的核心入口:

private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
//创建finalStage
var finalStage: ResultStage = null try { // New stage creation may throw an exception if, for example, jobs are run on a // HadoopRDD whose underlying HDFS files have been deleted.
//创建一个stage对象,并且将stage加入到DAGScheduler内存缓存中
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite) } catch { case e: Exception => logWarning("Creating new stage failed due to exception - job: " + jobId, e) listener.jobFailed(e) return } //创建job val job = new ActiveJob(jobId, finalStage, callSite, listener, properties) clearCacheLocs() logInfo("Got job %s (%s) with %d output partitions".format( job.jobId, callSite.shortForm, partitions.length)) logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")") logInfo("Parents of final stage: " + finalStage.parents) logInfo("Missing parents: " + getMissingParentStages(finalStage)) val jobSubmissionTime = clock.getTimeMillis()
//将job加入到内存缓存中 jobIdToActiveJob(jobId)
= job activeJobs += job finalStage.setActiveJob(job) val stageIds = jobIdToStageIds(jobId).toArray val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo)) listenerBus.post( SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties)) //使用submitStage() 方法提交finalStage
submitStage(finalStage) }

 

以上是关于spark--job和DAGScheduler源码的主要内容,如果未能解决你的问题,请参考以下文章

一个Spark job的生命历程

spark job提交执行流程

spark job提交执行流程

spark job提交执行流程

spark job提交执行流程

Spark: Job in detail