Spark IMF传奇行动第18课:RDD持久化广播累加器总结

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Spark IMF传奇行动第18课:RDD持久化广播累加器总结相关的知识,希望对你有一定的参考价值。

昨晚听了王家林老师的Spark IMF传奇行动第18课:RDD持久化、广播、累加器,作业是unpersist试验,阅读累加器源码看内部工作机制:

 

scala> val rdd = sc.parallelize(1 to 1000)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> rdd.persist
res0: rdd.type = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> rdd.count
16/01/24 11:42:56 INFO DAGScheduler: Job 0 finished: count at <console>:24, took 1.451543 s
res1: Long = 1000

16/01/24 11:43:14 INFO DAGScheduler: Job 2 finished: count at <console>:24, took 0.094119 s
res3: Long = 1000

scala> rdd.unpersist()
16/01/24 11:43:43 INFO ParallelCollectionRDD: Removing RDD 0 from persistence list
16/01/24 11:43:43 INFO BlockManager: Removing RDD 0
res5: rdd.type = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> rdd.count
16/01/24 11:44:56 INFO DAGScheduler: Job 0 finished: count at <console>:24, took 1.475321 s
res1: Long = 1000

persisit后,count执行快了许多,但unpersist后,执行又变慢了。

 累加器Accumulator:全局唯一,对于Executor只能修改但不可读,只对Driver可读,只增不减

 val sum = sc.accumulator(0)
 val d1 = sc.parallelize(1 to 5)
 val result1 = d1.foreach(item => sum+= item)
 println(sum)

结果是15.

 

后续课程可以参照新浪微博 王家林_DT大数据梦工厂:http://weibo.com/ilovepains

王家林  中国Spark第一人,微信公共号DT_Spark

 

转发请写明出处。

 

以上是关于Spark IMF传奇行动第18课:RDD持久化广播累加器总结的主要内容,如果未能解决你的问题,请参考以下文章

Spark IMF传奇行动第19课:spark排序总结

Spark IMF传奇行动第17课Transformations实战总结

Spark IMF传奇行动第20课:top N总结

Spark IMF传奇行动第21课:从Spark架构中透视Job

[Spark传奇行动] 第34课:Stage划分和Task最佳位置算法源码彻底解密

(版本定制)第18课:Spark Streaming中空RDD处理及流处理程序优雅的停止