Custom sorting in pyspark dataframes

Posted 2023-04-15

[Title]: Custom sorting in pyspark dataframes [Posted]: 2020-03-05 00:25:42 [Question]:

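Is there a recommended way to implement a custom sort order for categorical data in pyspark? Ideally, I'm looking for the functionality that the pandas categorical data type provides.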

So, given a dataset with a Speed column, the possible values are ["Super Fast", "Fast", "Medium", "Slow"]. I want to implement a custom sort order that fits this context.

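If I use the default sort, the categories are ordered alphabetically. Pandas lets you change a column's data type to categorical, and part of that definition specifies the custom sort order: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html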
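For reference, the pandas behaviour described above can be sketched as follows. The sample rows are made up for illustration; only the category order comes from the question.

```python
import pandas as pd

# Category order taken from the question; the rows themselves are hypothetical.
speed_order = ["Super Fast", "Fast", "Medium", "Slow"]
pdf = pd.DataFrame({"Speed": ["Slow", "Fast", "Super Fast", "Medium", "Fast"]})

# An ordered Categorical sorts by position in `categories`, not alphabetically.
pdf["Speed"] = pd.Categorical(pdf["Speed"], categories=speed_order, ordered=True)

sorted_speeds = pdf.sort_values("Speed")["Speed"].tolist()
print(sorted_speeds)
```

This is the behaviour the question is trying to reproduce in pyspark.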

[Comments]:

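You won't get a generic solution like the one pandas offers. In pyspark you can sort numerically or alphabetically, so for your Speed column we can create a new column in which Super Fast is 1, Fast is 2, Medium is 3, and Slow is 4, and then sort on that column. If you can provide sample data with the Speed column, I'd be happy to give you the code.

[Answer 1]: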

You can use orderBy and define your custom ordering with when:

from pyspark.sql.functions import col, when

df.orderBy(when(col("Speed") == "Super Fast", 1)
           .when(col("Speed") == "Fast", 2)
           .when(col("Speed") == "Medium", 3)
           .when(col("Speed") == "Slow", 4)
           )

[Discussion]: