[Original] Uncle's Experience Sharing (39): cascading unpersist of Spark caches

Posted 2021-02-13 barneywill

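Problem: in Spark, suppose there are two DataFrames (or Datasets), DataFrameA depends on DataFrameB, and both have been cached. After DataFrameB is unpersisted, DataFrameA's cache is invalidated as well. The official explanation is as follows: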

When invalidating a cache, we invalid other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.

However, in other cases, like when user simply want to drop a cache to free up memory, we do not need to invalidate dependent caches since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: the non-cascading cache invalidation.

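The previous default is regular mode. To guarantee that cached data is up to date (not stale), unpersisting a cache in this mode cascades: every other cache that depends on it, directly or indirectly, is cleared too.
Spark 2.4 introduced a new mode, non-cascading mode, in which unpersisting a cache does not cascade. According to SPARK-24596, Dataset.unpersist() uses this mode, while dropping a table or modifying its contents still uses regular mode.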
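The difference between the two modes can be sketched with a toy model in plain Python. This is not Spark's CacheManager: the class `ToyCacheManager`, its methods, and the names "A", "B", "C" are all made up for illustration, and it only tracks which named plans are cached and what they read from.

```python
class ToyCacheManager:
    """Toy model only (hypothetical, not Spark code): cached plan names plus their inputs."""

    def __init__(self):
        self.parents = {}    # plan name -> set of plan names it reads from
        self.cached = set()  # plan names that currently have a cache entry

    def cache(self, name, depends_on=()):
        self.parents[name] = set(depends_on)
        self.cached.add(name)

    def _depends_on(self, name, target):
        # Walk the dependency graph upwards so that indirect dependencies count too.
        stack = list(self.parents.get(name, ()))
        seen = set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return False

    def uncache(self, name, cascade):
        removed = {name} & self.cached
        if cascade:
            # Regular mode: also drop every cache that depends on `name`.
            removed |= {c for c in self.cached if self._depends_on(c, name)}
        self.cached -= removed
        return removed


def build():
    manager = ToyCacheManager()
    manager.cache("B")
    manager.cache("A", depends_on=["B"])
    manager.cache("C", depends_on=["A"])  # depends on B only indirectly
    return manager


regular = build()
regular.uncache("B", cascade=True)
print(sorted(regular.cached))        # []

non_cascading = build()
non_cascading.uncache("B", cascade=False)
print(sorted(non_cascading.cached))  # ['A', 'C']
```

In the toy model, regular mode removes A and C along with B, while non-cascading mode removes only B.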

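The cache() operation on a DataFrame/Dataset uses the MEMORY_AND_DISK storage level by default, so cached data that does not fit in memory can go to disk. Unless you have manually specified a memory-only level (MEMORY_ONLY) and need to make sure there is enough memory, unpersisting an earlier cache does not seem necessary.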

References:
https://issues.apache.org/jira/browse/SPARK-21478
https://issues.apache.org/jira/browse/SPARK-24596
https://issues.apache.org/jira/browse/SPARK-21579