从 DataFrame 中按分区收集集合

Posted 2023-03-23

技术标签:

【中文标题】从 DataFrame 中按分区收集集合【英文标题】：Collect collections by partitions from DataFrame 【发布时间】：2022-01-16 15:00:40 【问题描述】：

我有按列分区的 DataFrame：

val dfDL = spark.read.option("delimiter", ",")
                     .option("header", true)
                     .csv(file.getPath.toUri.getPath)
                     .repartition(col("column_to"))

val structure =  "schema_from" ::
                 "table_from"  ::
                 "column_from" ::
                 "link_type"   ::
                 "schema_to"   ::
                 "table_to"    ::
                 "column_to"   :: Nil

如何按分区获取数组集合？也就是说，对于每个分区，我都需要一个集合。比如我需要这个方法：

def getArrays(df: DataFrame): Iterator[Array] =  //Or Iterator[List]
   ???

分区的所有值：

val allTargetCol = df.select(col("column_to")).distinct().collect().map(_.getString(0))

【问题讨论】：

【参考方案1】：

如果你知道分区值，你可以遍历每个分区值，调用过滤器然后收集。

伪代码

partitions = []

for partition_value in partition_values_list:
  partitions.append(df.filter(f.col('partiton_column') == partition_value).collect())

否则，您需要先制作不同分区值的列表/数组，然后重复上述步骤。

【讨论】：

是的，我知道分区值。 allTargetCol. 然后像上面给出的伪代码那样简单地迭代它（更新了答案）。这是个好主意?。谢谢！但我只需要使用来自 Scala 的不可变集合）我理解你的想法。我试试看 .collect() 之后应该有转换成不可变集合的选项。

以上是关于从 DataFrame 中按分区收集集合的主要内容，如果未能解决你的问题，请参考以下文章