The difference between Spark's cache and persist
Posted by 周大任
Spark data persistence

In an interview yesterday I was asked about the difference between Spark's cache and persist, so today I studied it and am writing down some notes.

The first thing to understand is that RDDs are lazy. Below is a Stack Overflow answer that explains in detail how to think about an RDD, what difference adding cache makes, and exactly where that difference takes effect.
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called _actions_.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.

What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.

The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
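To make the behavior above concrete, here is a minimal runnable sketch in Scala; the object name, app name, master setting, and file path are placeholders of my own, not taken from the answer:

import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

    val textFile = sc.textFile("/user/emp.txt") // lazy: nothing is read yet
    textFile.cache()                            // also lazy: only marks the RDD for caching

    val first = textFile.count()  // action: reads the file, caches the partitions, counts
    val second = textFile.count() // served from the cache; the file is not re-read

    println(s"first=$first, second=$second")
    sc.stop()
  }
}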
cache() is simply persist() called with no arguments; for an RDD this selects the default storage level, MEMORY_ONLY.
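A spark-shell style sketch of that equivalence (the variable names and file path are illustrative; note that Spark allows an RDD's storage level to be assigned only once, so use one form per RDD):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("/user/emp.txt")
lines.cache() // shorthand for lines.persist(StorageLevel.MEMORY_ONLY)

// On a different RDD, persist() can express levels that cache() cannot:
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk if memory is insufficient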
The official Spark documentation on which storage level to choose:
http://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
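A short sketch of putting a non-default level to use and releasing it afterwards; process is a hypothetical helper of my own, not something from the post or the Spark docs:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills partitions that do not fit in memory to disk
// instead of recomputing them, a common middle ground per the guide above.
def process(lines: RDD[String]): Long = {
  lines.persist(StorageLevel.MEMORY_AND_DISK)
  val total = lines.count()                       // first action materializes and persists the data
  val nonEmpty = lines.filter(_.nonEmpty).count() // reuses the persisted partitions
  lines.unpersist()                               // release the cache once this RDD is done
  total + nonEmpty
}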