Is spark persist() (then action) really persisting?

Posted 2023-04-15

Asked: 2019-04-23 21:17:14

Question:

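I have always understood that persist() or cache(), followed by an action that triggers the DAG, computes the result and keeps it in memory for later use. Many threads here recommend caching to improve performance for frequently used DataFrames.

I recently ran a test and was confused, because that does not seem to be the case.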

    temp_tab_name = "mytablename"
    x = spark.sql("select * from " + temp_tab_name + " limit 10")
    x = x.persist()
    x.count()  # action to trigger all the steps above
    x.show()  # x should be persisted in memory by now: DAG evaluated, no going back to "select..." on later references
    x.is_cached  # True
    spark.sql("drop table " + temp_tab_name)
    x.is_cached  # still True!
    x.show()  # error: table not found

So it looks to me as if x is never computed and persisted. The next reference to x still goes back and evaluates its DAG definition "select...". What am I missing here?

Comments:

Does my answer address what you were looking for?

Answer 1:

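cache and persist do not fully detach the computed result from its source. They only make a best effort to avoid recomputation. So, in general, deleting the source before you are done with the dataset is a bad idea.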
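
The "best effort" idea can be illustrated with a toy model in plain Python. This is only a sketch of the concept, not Spark's implementation: `LazyDataset`, `evict`, and the `source` dictionary are hypothetical names made up for the illustration. The dataset keeps its recipe (lineage) even after caching, and `is_cached` only records that caching was requested.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, cacheable dataset (NOT Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # lineage: how to rebuild the rows from the source
        self._blocks = None       # materialized rows, if any
        self.is_cached = False    # records only that caching was requested

    def persist(self):
        self.is_cached = True
        return self

    def collect(self):
        # Serve from the cache when the rows are there, otherwise recompute.
        if self.is_cached and self._blocks is not None:
            return self._blocks
        rows = self._compute()
        if self.is_cached:
            self._blocks = rows
        return rows

    def evict(self):
        """Simulate the engine dropping or invalidating the cached rows."""
        self._blocks = None


# Hypothetical "catalog" with one table.
source = {"mytablename": [1, 2, 3]}

x = LazyDataset(lambda: list(source["mytablename"])).persist()
first = x.collect()         # computed from the source, then stored
del source["mytablename"]   # the toy equivalent of "drop table"
second = x.collect()        # still served from the stored rows

x.evict()                   # cached rows are gone, the flag is not
try:
    x.collect()             # falls back to the lineage, which now fails
    failed = False
except KeyError:
    failed = True
```

In this model the flag stays `True` after eviction while the data itself can no longer be rebuilt, which is the kind of situation the answer warns about. Whether eviction, invalidation, or something else happened in the question's case is not established here.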

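What might be going wrong in your particular case (off the top of my head):

~~1) show does not need all records of the table, so it may trigger computation of only a few partitions. So most partitions have not been computed at that point.~~

2) Spark needs some auxiliary information from the table (for example, for partitioning).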

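Comments:

1. count has to touch all partitions; that action was there specifically to force evaluation. 2. Are you sure you mean to imply that the partitioning changed?

1. You are right. 2. My main point is that destroying the source is not safe even if you have cached the RDD. What exactly triggers this behavior is mostly a guess.

Answer 2: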

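The correct syntax is below. Here is some additional documentation on "uncaching" tables: https://spark.apache.org/docs/latest/sql-performance-tuning.html. You can confirm the examples below in the Spark UI, under the "Storage" tab, to see objects being "cached" and "uncached".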

    # df method
    df = spark.range(10)
    df.cache()  # cache
    # df.persist()  # acts the same as cache
    df.count()  # action to materialize the df object in RAM
    # df.foreach(lambda x: x)  # another action to materialize the df object in RAM
    df.unpersist()  # remove the df object from RAM

    # temp table method
    df.createOrReplaceTempView("df_sql")
    spark.catalog.cacheTable("df_sql")  # cache
    spark.sql("select * from df_sql").count()  # action to materialize the temp table in RAM
    spark.catalog.uncacheTable("df_sql")  # remove the temp table from RAM

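Comments:

Sorry, I forgot to include df.cache(). These are the correct ways to cache and uncache DataFrames. If this is not what you were asking, please update your question to be more specific about what you are looking for.

@Kenny Sorry, you did not answer the question. I am not asking how to cache or uncache. I am asking why cache/persist did not work as expected.

I ran your example but never got a "table not found" error. Even though I misunderstood your question, I do not think providing correct information warrants a downvote, but thank you for clarifying what you are looking for.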
