Will Spark cache the data twice if we cache a Dataset and then cache the same Dataset as a table?

Posted 2023-04-15


[Title]: Will Spark cache the data twice if we cache a DataSet and then cache the same DataSet as a table [Posted]: 2018-04-23 06:29:06 [Question]:

Dataset<Row> dataSet = sqlContext.sql("some query");
dataSet.registerTempTable("temp_table"); // deprecated since Spark 2.0 in favor of createOrReplaceTempView
dataSet.cache(); // cache 1
sqlContext.cacheTable("temp_table"); // cache 2

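So my question is: will Spark cache the Dataset only once, or will there be two copies of the same data, one as a Dataset (cache 1) and another as a table (cache 2)?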


[Answer 1]:

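No, or at least not in any recent version. The second cache request only produces a warning that the data is already cached: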

scala> val df = spark.range(1)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.cache
res0: df.type = [id: bigint]

scala> df.createOrReplaceTempView("df")

scala> spark.catalog.cacheTable("df")
2018-01-23 12:33:48 WARN  CacheManager:66 - Asked to cache already cached data.
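
To check this beyond the warning, a minimal sketch along the following lines could be used. It assumes a local Spark 2.1+ session; the application name `cache-check` and the value names `tableCached`, `level`, and `stillCached` are made up for illustration. If the Dataset and the temporary view really share a single cache entry, as the warning suggests, the view should report as cached after `df.cache`, and unpersisting the Dataset should leave the view uncached as well.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for experimentation only (assumed setup, not from the question).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("cache-check")
  .getOrCreate()

val df = spark.range(1)
df.cache()                            // cache 1: through the Dataset
df.createOrReplaceTempView("df")
spark.catalog.cacheTable("df")        // cache 2: through the table name

// Ask the catalog and the Dataset whether they consider the data cached.
val tableCached: Boolean = spark.catalog.isCached("df")
val level: StorageLevel = df.storageLevel
println(s"table cached: $tableCached, dataset storage level: $level")

// Drop the cache through the Dataset handle only, then ask the catalog again.
df.unpersist()
val stillCached: Boolean = spark.catalog.isCached("df")
println(s"table cached after df.unpersist(): $stillCached")
```

The Storage tab of the Spark UI is another place to look: after both calls it should list one cached entry for this plan rather than two.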

