Dataframe Column to list conserving order in Pyspark

Posted 2023-04-15

[Title]: Dataframe Column to list conserving order in Pyspark [Asked]: 2020-05-28 08:09:06 [Question]:

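I have a Spark dataframe with 2 columns, "id" and "timestamp". How can I convert the "id" column to a list that keeps the rows in timestamp order? When I try collect, the order is not preserved.

Thanks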

[Comments]:

Have you tried creating a Pandas dataframe, sorting it by timestamp, and then turning the id column into a list?

[Answer 1]:

You can't rely on collect_list for this, because it collects a group of elements in a non-deterministic order. See the doc:

/**
   * Aggregate function: returns a list of objects with duplicates.
   *
   * @note The function is non-deterministic because the order of collected results depends
   * on order of rows which may be non-deterministic after a shuffle.
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_list(e: Column): Column = withAggregateFunction { CollectList(e.expr) }

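In distributed computation, collecting elements does not guarantee any particular order, because the data is spread across nodes. To get one, you need to bring the data into a single partition on one executor and then perform the aggregation. This may cause a resource crunch on that executor. If you know your data is small, you can do this with a UDAF after coalescing the data to 1 partition.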

If you have a non-skewed column to repartition on, you can do this in an efficient and reliable way.

Here is a good example from cloudera of sorting values by timestamp.
