Tracing a Spark Job's Execution Through Its Log File

Posted yjyyjy


Spark job log analysis:

Below is the log file from a Spark rating-statistics job. We walk through it top to bottom to trace the job's execution (Spark local mode).
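For context: the log comes from a PySpark script (Ratings-counter.py) that counts how many ratings fall into each score bucket of the MovieLens ml-100k u.data file. The script itself is not shown in the log, so the following pure-Python sketch of the same counting logic is an assumption, reconstructed from the textFile/countByValue calls that appear later in the log:

```python
from collections import Counter

def count_ratings(lines):
    """Count occurrences of each rating score.

    Mirrors what the PySpark job presumably does with
    lines.map(lambda x: x.split()[2]).countByValue().
    Each line of u.data is: userID  movieID  rating  timestamp.
    """
    ratings = (line.split()[2] for line in lines)  # lazy, like an RDD map
    return Counter(ratings)

# Tiny stand-in for a few u.data lines (tab-separated fields)
sample = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
]
print(count_ratings(sample))  # Counter({'3': 2, '1': 1})
```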

 

Starting the SparkContext

Submitting the RatingsHistogram job

20/05/04 18:02:20 INFO SparkContext: Running Spark version 2.4.4
20/05/04 18:02:20 INFO SparkContext: Submitted application: RatingsHistogram

 

Starting the sparkDriver service

20/05/04 18:02:20 INFO SecurityManager: Changing view acls to: xxx
20/05/04 18:02:20 INFO SecurityManager: Changing modify acls to: xxx
20/05/04 18:02:20 INFO SecurityManager: Changing view acls groups to:
20/05/04 18:02:20 INFO SecurityManager: Changing modify acls groups to:
20/05/04 18:02:20 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
20/05/04 18:02:21 INFO Utils: Successfully started service sparkDriver on port 4911.

 

Registering the block storage manager and starting the MemoryStore

Spark processes data in memory, so it necessarily ships components to manage and track that memory.

The conventional way to reach the current MemoryStore is through SparkEnv's block manager: SparkEnv.get.blockManager.memoryStore

DefaultTopologyMapper:

  • A TopologyMapper that assumes all nodes are on the same rack.
  • When choosing executors for block replication, topology information can be used to tell whether a candidate executor sits on the same rack as the source executor or on a different one.
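The "default" mapper simply reports no topology information for any host, which is why every host appears to be on the same rack. A minimal hypothetical sketch of that behavior in Python (Spark's actual implementation is in Scala):

```python
class DefaultTopologyMapper:
    """Mimics org.apache.spark.storage.DefaultTopologyMapper:
    it returns no topology info for any host, so every host
    appears to sit on the same rack and block replication
    cannot be rack-aware."""

    def get_topology_for_host(self, hostname: str):
        return None  # no rack information for any host

mapper = DefaultTopologyMapper()
print(mapper.get_topology_for_host("worker-1"))  # None
```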

 

20/05/04 18:02:22 INFO SparkEnv: Registering MapOutputTracker
20/05/04 18:02:22 INFO SparkEnv: Registering BlockManagerMaster
20/05/04 18:02:22 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/05/04 18:02:22 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/05/04 18:02:22 INFO DiskBlockManager: Created local directory at C:\Users\xxx\AppData\Local\Temp\blockmgr-168e9bad-f66f-4834-a7c7-6d4dadcc00c4
20/05/04 18:02:22 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/05/04 18:02:22 INFO SparkEnv: Registering OutputCommitCoordinator
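The logged 366.3 MB capacity follows from Spark's unified memory formula: (JVM max heap − 300 MB reserved) × spark.memory.fraction (default 0.6). With the default 1 GB driver heap, the JVM typically reports a usable max heap of roughly 910 MB, which yields exactly the logged figure. A quick check (the 910.5 MB heap value is an assumption; the exact number varies by JVM):

```python
RESERVED_MB = 300.0      # Spark's fixed reserved memory
MEMORY_FRACTION = 0.6    # default spark.memory.fraction

def unified_memory_mb(max_heap_mb: float) -> float:
    """Usable (storage + execution) memory under Spark's
    UnifiedMemoryManager: (heap - reserved) * fraction."""
    return (max_heap_mb - RESERVED_MB) * MEMORY_FRACTION

# Runtime.getRuntime().maxMemory() for -Xmx1g is typically ~910 MB
print(round(unified_memory_mb(910.5), 1))  # 366.3, matching the log
```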

 

Starting the SparkUI web service

20/05/04 18:02:22 INFO Utils: Successfully started service SparkUI on port 4040.
20/05/04 18:02:22 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://localhost:4040
20/05/04 18:02:22 INFO Executor: Starting executor ID driver on host localhost

 

Starting the block transfer service

Used to transfer data blocks across the network (this is what shuffle relies on).

It fetches blocks a set at a time using Netty (Netty is an asynchronous event-driven network application framework for rapid development of maintainable high-performance protocol servers and clients - https://netty.io/).

 

20/05/04 18:02:22 INFO Utils: Successfully started service org.apache.spark.network.netty.NettyBlockTransferService on port 4920.
20/05/04 18:02:22 INFO NettyBlockTransferService: Server created on localhost:4920

 

Registering the BlockManager and its memory

RandomBlockReplicationPolicy: places block replicas on different machines
Memory capacity: 366.3 MB RAM
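The random policy chooses replica targets by randomly ordering the candidate peers and taking as many as it needs. A hypothetical Python sketch of that selection (not Spark's actual Scala code, which seeds the shuffle with the block's hash):

```python
import random

def prioritize_peers(peers, num_replicas, seed):
    """Randomly order candidate peers and take the first
    num_replicas -- roughly what RandomBlockReplicationPolicy
    does when placing block replicas on different machines."""
    rng = random.Random(seed)   # Spark derives the seed from the block ID
    shuffled = peers[:]
    rng.shuffle(shuffled)
    return shuffled[:num_replicas]

peers = ["exec-1", "exec-2", "exec-3", "exec-4"]
print(prioritize_peers(peers, 2, seed=42))
```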
 
20/05/04 18:02:22 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/05/04 18:02:22 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, localhost, 4920, None)
20/05/04 18:02:22 INFO BlockManagerMasterEndpoint: Registering block manager localhost:4920 with 366.3 MB RAM, BlockManagerId(driver, localhost, 4920, None)
20/05/04 18:02:22 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, localhost, 4920, None)
20/05/04 18:02:22 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, localhost, 4920, None)
20/05/04 18:02:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 236.7 KB, free 366.1 MB)
20/05/04 18:02:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.9 KB, free 366.0 MB)
20/05/04 18:02:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:4920 (size: 22.9 KB, free: 366.3 MB)

 

Identifying the input file

Note that no split processing has happened yet
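This is because textFile is a transformation, and transformations in Spark are lazy: at this point Spark has only recorded the input path and its metadata; nothing is read or computed until an action (here, countByValue) runs. A toy illustration of the same lazy pattern in plain Python (not Spark's API):

```python
class LazyDataset:
    """Toy lazy pipeline: transformations only record work,
    and the action triggers it -- analogous to RDD.map (lazy)
    vs RDD.countByValue (an action)."""

    def __init__(self, data):
        self._data = data
        self._ops = []          # recorded, not executed

    def map(self, fn):          # "transformation": just remember fn
        self._ops.append(fn)
        return self

    def collect(self):          # "action": run the whole pipeline now
        out = self._data
        for fn in self._ops:
            out = [fn(x) for x in out]
        return out

ds = LazyDataset([1, 2, 3]).map(lambda x: x * 10)
# nothing has executed yet; only the action below does the work
print(ds.collect())  # [10, 20, 30]
```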

20/05/04 18:02:23 INFO SparkContext: Created broadcast 0 from textFile at <unknown>:0
20/05/04 18:02:23 INFO FileInputFormat: Total input paths to process : 1

 

An action triggers the DAGScheduler to schedule the job

20/05/04 18:02:24 INFO SparkContext: Starting job: countByValue at C:/Learning/lessons/spark_code_resource/Ratings-counter.py:17
20/05/04 18:02:24 INFO DAGScheduler: Got job 0 (countByValue at C:/Learning/lessons/spark_code_resource/Ratings-counter.py:17) with 1 output partitions
20/05/04 18:02:24 INFO DAGScheduler: Final stage: ResultStage 0 (countByValue at C:/Learning/lessons/spark_code_resource/Ratings-counter.py:17)
20/05/04 18:02:24 INFO DAGScheduler: Parents of final stage: List()
20/05/04 18:02:24 INFO DAGScheduler: Missing parents: List()
20/05/04 18:02:24 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] at countByValue at C:/Learning/lessons/spark_code_resource/Ratings-counter.py:17), which has no missing parents
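"Parents of final stage: List()" tells us this job has no shuffle, so the whole plan fits in a single ResultStage; a wide transformation such as reduceByKey would introduce a shuffle boundary and split the plan into a parent ShuffleMapStage plus the ResultStage. A much-simplified sketch of how a DAG scheduler counts stages (hypothetical op names, not Spark's implementation):

```python
def count_stages(ops):
    """Each shuffle boundary starts a new stage; narrow
    transformations stay within the current stage."""
    NARROW = {"textFile", "map", "filter"}           # no shuffle
    WIDE = {"reduceByKey", "groupByKey", "repartition"}  # shuffle
    stages = 1
    for op in ops:
        if op in WIDE:
            stages += 1   # shuffle boundary: new stage begins
        elif op not in NARROW:
            raise ValueError(f"unknown op: {op}")
    return stages

# This job: textFile -> map -> countByValue result: a single stage
print(count_stages(["textFile", "map"]))                  # 1
print(count_stages(["textFile", "map", "reduceByKey"]))   # 2
```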

 

The TaskScheduler executes the task

20/05/04 18:02:24 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/05/04 18:02:24 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7917 bytes)
20/05/04 18:02:24 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

 

The task finishes and returns its result to the driver

20/05/04 18:02:24 INFO HadoopRDD: Input split: file:/Learning/lessons/spark_code_resource/ml-100k/u.data:0+1979173
20/05/04 18:02:25 INFO PythonRunner: Times: total = 906, boot = 515, init = 32, finish = 359
20/05/04 18:02:25 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1651 bytes result sent to driver
20/05/04 18:02:25 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1094 ms on localhost (executor driver) (1/1)
20/05/04 18:02:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
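The split notation `u.data:0+1979173` means this task reads bytes 0 through 1,979,173, i.e. the entire ~1.9 MB file. FileInputFormat produced a single split because the file is far smaller than a typical split size, which is also why the stage has exactly one task. A quick check (the 128 MB split size is an assumption for illustration; the effective value depends on configuration):

```python
import math

def num_splits(file_bytes: int, split_bytes: int) -> int:
    """Rough FileInputFormat behavior: one split per
    split-size chunk of the file, at least one."""
    return max(1, math.ceil(file_bytes / split_bytes))

FILE_BYTES = 1_979_173            # from the log's input split
SPLIT_BYTES = 128 * 1024 * 1024   # assumed 128 MB split size
print(num_splits(FILE_BYTES, SPLIT_BYTES))  # 1 -> one task
```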

 

Job completion

After the task finishes, control returns to the DAGScheduler

Output:

20/05/04 18:02:25 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 4922
20/05/04 18:02:25 INFO DAGScheduler: ResultStage 0 (countByValue at C:/Learning/lessons/spark_code_resource/Ratings-counter.py:17) finished in 1.204 s
20/05/04 18:02:25 INFO DAGScheduler: Job 0 finished: countByValue at C:/Learning/lessons/spark_code_resource/Ratings-counter.py:17, took 1.308090 s
 score 1. movie num: 6110
 score 2. movie num: 11370
 score 3. movie num: 27145
 score 4. movie num: 34174
 score 5. movie num: 21201
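As a sanity check, the five per-score counts sum to exactly 100,000, matching the 100,000 ratings in the MovieLens ml-100k dataset:

```python
# Per-score counts from the job's output above
counts = {1: 6110, 2: 11370, 3: 27145, 4: 34174, 5: 21201}
total = sum(counts.values())
print(total)  # 100000, the full ml-100k rating count
```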

 

Shutting down services and releasing memory

20/05/04 18:02:25 INFO SparkContext: Invoking stop() from shutdown hook
20/05/04 18:02:25 INFO SparkUI: Stopped Spark web UI at http://localhost:4040
20/05/04 18:02:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/05/04 18:02:25 INFO MemoryStore: MemoryStore cleared
20/05/04 18:02:25 INFO BlockManager: BlockManager stopped
20/05/04 18:02:25 INFO BlockManagerMaster: BlockManagerMaster stopped
20/05/04 18:02:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/05/04 18:02:25 INFO SparkContext: Successfully stopped SparkContext
20/05/04 18:02:25 INFO ShutdownHookManager: Shutdown hook called
20/05/04 18:02:25 INFO ShutdownHookManager: Deleting directory C:\Users\xxx\AppData\Local\Temp\spark-1345353d-b57d-4bd3-905b-79d0b4bf8f8e\pyspark-ad7b222c-410a-40f3-a2f6-f1153b55041f
20/05/04 18:02:25 INFO ShutdownHookManager: Deleting directory C:\Users\xxx\AppData\Local\Temp\spark-1345353d-b57d-4bd3-905b-79d0b4bf8f8e
20/05/04 18:02:25 INFO ShutdownHookManager: Deleting directory C:\Users\xxx\AppData\Local\Temp\spark-999989b5-d5d2-4929-82a6-474b464e8b4e

 
