Is a DataFrame created with the toPandas() method distributed across the Spark cluster?

Posted 2023-04-15

【Title】: Is a DataFrame created using the toPandas() method distributed across the Spark cluster? 【Posted】: 2015-08-05 16:57:26 【Question】:

I am reading a CSV file with:

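# sc is the SparkContext; this gives an RDD of text lines, one per CSV row
data = sc.textFile("filename")
# Sparksql is the asker's SQLContext; the original call had no arguments, so the comma split is an assumption
Df = Sparksql.createDataFrame(data.map(lambda line: line.split(",")))
# collects every row of Df into a single pandas DataFrame
Pdf = Df.toPandas()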

Is Pdf now distributed across the Spark cluster, or does it reside on the host machine?

【Comments】:

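It will reside on the local driver machine.

@hadooped Does Df() make the DataFrame distributed? If not, how do I make the DataFrame distributed?

My understanding from reading the documentation is that any Spark DataFrame is distributed across the cluster, but once you convert it to a pandas DataFrame it lives on whichever machine/node your code is executing on.

Possible duplicate of What is the Spark DataFrame method `toPandas` actually doing?

Possible duplicate of Requirements for converting Spark dataframe to Pandas/R dataframe

【Answer 1】: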

No. The resulting pandas DataFrame is not distributed; it lives entirely in the driver's memory.

As the note on toPandas in the PySpark source code of DataFrame says:

    .. note:: This method should only be used if the resulting Pandas's DataFrame is expected
        to be small, as all the data is loaded into the driver's memory.
