Spark DataFrame suddenly becomes very slow when I iteratively reuse old cached data

Question

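The problem occurs when I keep the cached results in a List and, in each iteration, compute a new DataFrame from all the data in that list. However, even when I use an empty DataFrame and get an empty result every time, the function suddenly becomes very slow after about 8 to 12 rounds.

Here is my code: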

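import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for toDF

// `Data` and `benchmark` are defined elsewhere and are not shown in this post.
def testLoop(lastDfList: List[DataFrame]): Unit = {
  // do some dummy transformation like union and cache the result
  val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache()

  // always get 0, of course
  println(resultDf.count)

  // benchmark action
  benchmark(resultDf.count)

  // recurse with the new result prepended to the history
  testLoop(resultDf :: lastDfList)
}

testLoop(Nil)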

Benchmark results: rounds 1 to 6: < 200 ms; round 7: 367 ms; round 8: 918 ms; round 9: 2476 ms; round 10: 7833 ms; round 11: 24231 ms.

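I don't think GC or block eviction is the problem, since I am already using an empty DataFrame, but I can't tell what the cause is. Am I misunderstanding what cache means, or something else?

Thanks!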
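
One way to reason about the timings above (an interpretation, not something the post confirms): cache stores the data but does not shorten the logical plan, and each round unions one fresh empty DataFrame with every earlier result, each of which already contains the results before it. The pure-Scala sketch below models only the number of leaf relations in the plan per round. It does not call Spark, and `leafCounts` is a hypothetical helper name, not a Spark API.

```scala
object PlanGrowthModel {
  // Leaves in the plan built at each round: one fresh empty relation plus
  // the leaves of every earlier result that gets unioned in.
  def leafCounts(rounds: Int): List[Long] =
    (0 until rounds)
      .foldLeft(List.empty[Long]) { (history, _) => (1L + history.sum) :: history }
      .reverse

  def main(args: Array[String]): Unit =
    leafCounts(12).zipWithIndex.foreach { case (leaves, i) =>
      println(s"round ${i + 1}: $leaves leaf relations")
    }
}
```

In this model the plan doubles every round, which has the same exponential shape as the measured times. The measured ratio between rounds is closer to 2.5x to 3x, so the model is at best part of the picture. On a real session, `resultDf.explain(true)` prints the plans and can be used to check how they actually grow.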

After reading ImDarrenG's solution, I changed my code to the following:

spark.sparkContext.setCheckpointDir("/tmp")

def testLoop(lastDfList: List[DataFrame]): Unit = {
  // do some dummy transformation like union and cache the result
  val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache()

  resultDf.checkpoint()

  // always get 0, of course
  println(resultDf.count)

  // benchmark action
  benchmark(resultDf.count)

  // recurse with the new result prepended to the history
  testLoop(resultDf :: lastDfList)
}

testLoop(Nil)

But it still becomes slow after a few iterations.
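
A possible reason the checkpoint version does not help: `Dataset.checkpoint()` returns a new DataFrame whose plan is truncated and leaves the DataFrame it was called on unchanged, so the code above discards the result. The sketch below assumes that this is the cause and keeps the returned DataFrame instead. It makes several substitutions that are not in the post: `Data` is a hypothetical stand-in for the type that is not shown, the timing is done inline because `benchmark` is not shown, and a bounded loop of 12 rounds replaces the endless recursion.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical stand-in for the post's `Data` type, which is not shown.
case class Data(id: Int)

object CheckpointLoop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("checkpoint-loop")
      .getOrCreate()
    import spark.implicits._

    spark.sparkContext.setCheckpointDir("/tmp")

    var history: List[DataFrame] = Nil
    for (round <- 1 to 12) {
      val unioned = history.foldLeft(Seq.empty[Data].toDF()) { (df, lastDf) => df.union(lastDf) }

      // checkpoint() is eager by default and returns a NEW DataFrame;
      // the truncated plan belongs to the return value, not to `unioned`.
      val resultDf = unioned.checkpoint()

      val start = System.nanoTime()
      val rows = resultDf.count()
      val elapsedMs = (System.nanoTime() - start) / 1000000
      println(s"round $round: count=$rows, ${elapsedMs} ms")

      // Store the checkpointed DataFrame so later rounds union short plans.
      history = resultDf :: history
    }

    spark.stop()
  }
}
```

With this change each stored DataFrame carries a short plan, so a round unions one leaf per earlier result rather than a copy of every earlier plan. Whether this removes the slowdown in a given environment is something to measure rather than assume.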

Answer 1

Another answer