Demystifying SparkContext, Spark's Gate of Heaven (DT大数据梦工厂)


Contents:

1. Spark's gate of heaven: SparkContext

2. A SparkContext example

3. Inside SparkContext

4. SparkContext source code walkthrough

SparkContext is the first object created in any Spark program, and it takes a SparkConf as its constructor parameter.
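As a minimal sketch of that idea (the app name and master URL are placeholders, not taken from the original post):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")     // placeholder application name
  .setMaster("local[*]")   // placeholder; could be spark://host:port, yarn, etc.
val sc = new SparkContext(conf) // the gate opens: schedulers are built, the app registers
// ... create RDDs and run jobs ...
sc.stop()                       // the gate closes: the application ends here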


==========Spark's Gate of Heaven: SparkContext!!!============

1. A running Spark program is split into the Driver and the Executors.

2. Spark programming is built on SparkContext, in two concrete senses:

1) RDDs, the core abstraction of Spark programming, originate from SparkContext (the first RDD in any program is always created by SparkContext);

2) the scheduling and optimization of a Spark program are likewise driven through SparkContext.

3. A Spark program registers itself with the cluster through an object instantiated inside SparkContext (the SchedulerBackend performs the registration).

4. At runtime a Spark program obtains its concrete computing resources from the Cluster Manager, and these, too, are requested through an object created by SparkContext (again the SchedulerBackend).

5. When SparkContext crashes or stops, the entire Spark program ends!

In summary:

SparkContext opens the gate of heaven: a Spark program is submitted to the Spark cluster through SparkContext.

SparkContext directs the world beyond the gate: everything a Spark program does runs under the schedulers built around SparkContext.

SparkContext closes the gate of heaven: when SparkContext crashes or stops, the entire Spark program ends!

==========A SparkContext Example============

Run the earlier WordCount and observe its log (the log follows the sketch below):
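The WordCount source itself is not reproduced in the original post, but the log pins down its shape: textFile at line 37, map at line 49, reduceByKey at line 54, and foreach at line 57 of WordCount.scala. A plausible reconstruction under those assumptions (the input path is a placeholder; the actual run read README.md from a local Spark distribution):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(conf)              // the gate opens
    val lines = sc.textFile("README.md")         // textFile: creates the first RDD
    val words = lines.flatMap(_.split(" "))      // split each line into words
    val pairs = words.map(word => (word, 1))     // map: build (word, 1) pairs
    val counts = pairs.reduceByKey(_ + _)        // reduceByKey: triggers the shuffle
    counts.foreach(pair => println(pair._1 + ":" + pair._2)) // foreach: the action
    sc.stop()                                    // the gate closes
  }
}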

16/02/14 14:03:46 INFO Executor: Starting executor ID driver on host localhost

16/02/14 14:03:46 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 56954.

16/02/14 14:03:46 INFO NettyBlockTransferService: Server created on 56954

16/02/14 14:03:46 INFO BlockManagerMaster: Trying to register BlockManager

16/02/14 14:03:46 INFO BlockManagerMasterEndpoint: Registering block manager localhost:56954 with 2.4 GB RAM, BlockManagerId(driver, localhost, 56954)

16/02/14 14:03:46 INFO BlockManagerMaster: Registered BlockManager

16/02/14 14:03:48 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)

16/02/14 14:03:48 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.5 KB)

16/02/14 14:03:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:56954 (size: 13.9 KB, free: 2.4 GB)

16/02/14 14:03:48 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:37

16/02/14 14:03:50 WARN : Your hostname, fengwei-pc resolves to a loopback/non-reachable address: fe80:0:0:0:10a6:7a9f:c570:4f85%24, but we couldn't find any external IP address!

16/02/14 14:03:51 INFO FileInputFormat: Total input paths to process : 1

16/02/14 14:03:51 INFO SparkContext: Starting job: foreach at WordCount.scala:57

16/02/14 14:03:52 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:49)

16/02/14 14:03:52 INFO DAGScheduler: Got job 0 (foreach at WordCount.scala:57) with 1 output partitions

16/02/14 14:03:52 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at WordCount.scala:57)

16/02/14 14:03:52 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)

16/02/14 14:03:52 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)

16/02/14 14:03:52 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:49), which has no missing parents

16/02/14 14:03:52 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 171.6 KB)

16/02/14 14:03:52 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 173.9 KB)

16/02/14 14:03:52 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:56954 (size: 2.3 KB, free: 2.4 GB)

16/02/14 14:03:52 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006

16/02/14 14:03:52 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:49)

16/02/14 14:03:52 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks

16/02/14 14:03:52 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2161 bytes)

16/02/14 14:03:52 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

16/02/14 14:03:52 INFO HadoopRDD: Input split: file:/F:/安装文件/操作系统/spark-1.6.0-bin-hadoop2.6/README.md:0+3359

16/02/14 14:03:52 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

16/02/14 14:03:52 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

16/02/14 14:03:52 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap

16/02/14 14:03:52 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition

16/02/14 14:03:52 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id

16/02/14 14:03:53 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver

16/02/14 14:03:53 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 828 ms on localhost (1/1)

16/02/14 14:03:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

16/02/14 14:03:53 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:49) finished in 0.879 s

16/02/14 14:03:53 INFO DAGScheduler: looking for newly runnable stages

16/02/14 14:03:53 INFO DAGScheduler: running: Set()

16/02/14 14:03:53 INFO DAGScheduler: waiting: Set(ResultStage 1)

16/02/14 14:03:53 INFO DAGScheduler: failed: Set()

16/02/14 14:03:53 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:54), which has no missing parents

16/02/14 14:03:53 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.5 KB, free 176.4 KB)

16/02/14 14:03:53 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1581.0 B, free 177.9 KB)

16/02/14 14:03:53 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:56954 (size: 1581.0 B, free: 2.4 GB)

16/02/14 14:03:53 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006

16/02/14 14:03:53 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:54)

16/02/14 14:03:53 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks

16/02/14 14:03:53 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes)

16/02/14 14:03:53 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)

16/02/14 14:03:53 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks

16/02/14 14:03:53 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms

["Building:1

shell::2

Scala,:1

and:10

command,:2

./dev/run-tests:1

sample:1

16/02/14 14:03:53 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver

16/02/14 14:03:53 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 233 ms on localhost (1/1)

16/02/14 14:03:53 INFO DAGScheduler: ResultStage 1 (foreach at WordCount.scala:57) finished in 0.234 s

16/02/14 14:03:53 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

16/02/14 14:03:53 INFO DAGScheduler: Job 0 finished: foreach at WordCount.scala:57, took 1.777176 s

16/02/14 14:03:53 INFO SparkUI: Stopped Spark web UI at http://192.168.145.1:4040

16/02/14 14:03:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

16/02/14 14:03:53 INFO MemoryStore: MemoryStore cleared

16/02/14 14:03:53 INFO BlockManager: BlockManager stopped

16/02/14 14:03:53 INFO BlockManagerMaster: BlockManagerMaster stopped

16/02/14 14:03:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

16/02/14 14:03:53 INFO SparkContext: Successfully stopped SparkContext

16/02/14 14:03:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

16/02/14 14:03:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

16/02/14 14:03:53 INFO ShutdownHookManager: Shutdown hook called

16/02/14 14:03:53 INFO ShutdownHookManager: Deleting directory C:\Temp\spark-9596fafa-5bfe-4d34-b3cf-b7daa2cd86c7

==========Inside SparkContext============

1. When SparkContext is created, it builds three top-level core components: DAGScheduler, TaskScheduler, and SchedulerBackend, where:

1) DAGScheduler is the high-level, stage-oriented scheduler for jobs;

2) TaskScheduler is an interface whose implementation depends on the Cluster Manager; in Standalone mode the concrete implementation is TaskSchedulerImpl;

3) SchedulerBackend is an interface; in Standalone mode the concrete implementation is SparkDeploySchedulerBackend.

2. From the perspective of the running program as a whole, SparkContext drives four core objects: DAGScheduler, TaskScheduler, SchedulerBackend, and MapOutputTrackerMaster.

First the taskScheduler is created (this chiefly means instantiating it):

// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

In Standalone mode, createTaskScheduler matches the master URL; a spark://... URL hits the SPARK_REGEX branch, which creates a TaskSchedulerImpl and hands it a SparkDeploySchedulerBackend:

case SPARK_REGEX(sparkUrl) =>
  val scheduler = new TaskSchedulerImpl(sc)
  val masterUrls = sparkUrl.split(",").map("spark://" + _)
  val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
  scheduler.initialize(backend)
  (backend, scheduler)

When TaskSchedulerImpl.initialize runs, it does the following (creating the scheduler pool):

def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  schedulableBuilder.buildPools()
}
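The schedulingMode matched above comes from configuration. A minimal sketch of flipping the pool from the default FIFO to FAIR (spark.scheduler.mode is a standard Spark setting; the app name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("FairSchedulingDemo")    // placeholder application name
  .setMaster("spark://master:7077")    // placeholder Standalone master URL
  .set("spark.scheduler.mode", "FAIR") // default is FIFO
val sc = new SparkContext(conf)        // initialize now picks FairSchedulableBuilder

With FAIR selected, FairSchedulableBuilder can additionally read per-pool weights from an XML file given via spark.scheduler.allocation.file.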

SparkDeploySchedulerBackend has three core responsibilities:

1) connecting to the Master and registering the current program with it;

2) receiving, and then managing, the registrations of the Executors that the cluster allocates to the current application;

3) sending Tasks to concrete Executors for execution.

One more point worth noting: SparkDeploySchedulerBackend is itself managed by TaskSchedulerImpl!

Then the taskScheduler is started:

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

Starting it causes SparkDeploySchedulerBackend's start to run. The key code in that start method shows that when SparkDeploySchedulerBackend registers the program with the Master, it submits the command below; when the Master later instructs a Worker to launch the process that will host an executor, the entry class whose main method is loaded is the CoarseGrainedExecutorBackend named in this command. (You could in fact plug in your own executor backend just by changing the class name carried in the command.) Inside CoarseGrainedExecutorBackend the executor is started (the executor registers first and is instantiated afterwards), and the executor then executes its tasks concurrently through a thread pool, as the sketch after the next two snippets illustrates:

val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
  args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)

override def receive: PartialFunction[Any, Unit] = {
  case RegisteredExecutor(hostname) =>
    logInfo("Successfully registered with driver")
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
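Since the original only asserts that the executor runs tasks concurrently through a thread pool, here is a self-contained sketch of that pattern. The names (ExecutorPoolSketch, launchTask) are hypothetical and the model is deliberately simplified; it is not Spark's actual Executor class, which wraps each task in a TaskRunner before submitting it to its pool:

import java.util.concurrent.{ConcurrentHashMap, Executors}

// Hypothetical model of the executor's task-launch pattern: every task
// becomes a Runnable handed to a cached thread pool, so many tasks can
// run at once, one per pool thread.
object ExecutorPoolSketch {
  private val threadPool   = Executors.newCachedThreadPool()
  private val runningTasks = new ConcurrentHashMap[Long, Runnable]()

  def launchTask(taskId: Long, body: () => Unit): Unit = {
    val runner: Runnable = new Runnable {
      override def run(): Unit =
        try body() finally runningTasks.remove(taskId) // clean up when the task ends
    }
    runningTasks.put(taskId, runner)
    threadPool.execute(runner) // the task starts on a pool thread
  }

  def main(args: Array[String]): Unit = {
    (1L to 4L).foreach { id =>
      launchTask(id, () => println(s"task $id ran on ${Thread.currentThread().getName}"))
    }
    threadPool.shutdown()
  }
}

A cached pool fits here because the number of concurrently running tasks varies with how many cores the executor was granted.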

Starting SparkDeploySchedulerBackend is effectively launching an application of its own; inside it lives a ClientEndpoint:

def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}

override def onStart(): Unit = {
  try {
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master"e)
      markDisconnected()
      stop()
  }
}

/**
 * Register with all masters asynchronously. It will call `registerWithMaster` every
 * REGISTRATION_TIMEOUT_SECONDS seconds until exceeding REGISTRATION_RETRIES times.
 * Once we connect to a master successfully, all scheduling work and Futures will be cancelled.
 *
 * nthRetry means this is the nth attempt to register with master.
 */
private def registerWithMaster(nthRetry: Int) {
  registerMasterFutures.set(tryRegisterAllMasters())
  registrationRetryTimer.set(registrationRetryThread.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      Utils.tryOrExit {
        if (registered.get) {
          registerMasterFutures.get.foreach(_.cancel(true))
          registerMasterThreadPool.shutdownNow()
        } else if (nthRetry >= REGISTRATION_RETRIES) {
          markDead("All masters are unresponsive! Giving up.")
        } else {
          registerMasterFutures.get.foreach(_.cancel(true))
          registerWithMaster(nthRetry + 1)
        }
      }
    }
  }, REGISTRATION_TIMEOUT_SECONDS, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
}

Registration is carried out on a thread: the application registers with the Master, the Master sends Workers the instructions to launch Executors, and all the Executors then register back with SparkDeploySchedulerBackend:

/**
 *  Register with all masters asynchronously and returns an array `Future`s for cancellation.
 */
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  for (masterAddress <- masterRpcAddresses) yield {
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = try {
        if (registered.get) {
          return
        }
        logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
        val masterRef =
          rpcEnv.setupEndpointRef(Master.SYSTEM_NAME, masterAddress, Master.ENDPOINT_NAME)
        masterRef.send(RegisterApplication(appDescription, self))
      } catch {
        case ie: InterruptedException => // Cancelled
        case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
      }
    })
  }
}
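For completeness, the driver-side half of that handshake looks roughly like the following (abridged from Spark 1.6's AppClient; names may differ slightly in other versions). Once any Master replies with RegisteredApplication, the registered flag flips to true and the retry timer shown earlier becomes a no-op:

override def receive: PartialFunction[Any, Unit] = {
  case RegisteredApplication(appId_, masterRef) =>
    appId.set(appId_)             // the application id the Master assigned
    registered.set(true)          // makes the retry loop above stop
    master = Some(masterRef)      // remember which Master answered
    listener.connected(appId.get)
  // ... other cases elided ...
}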




This post originally appeared on the "一枝花傲寒" blog.
