Select one record from multiple with a preference hierarchy from a Spark DataFrame or SQL

Posted 2023-04-17

【Title】: Select a record from multiple with preference hierarchy from spark dataframe or sql 【Posted】: 2020-03-12 15:05:17 【Question】:

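I have a product DataFrame in which the same product appears under several categories. I want to select only one record per product, based on a hierarchy, for example: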

Product ID.  Category.  Status
1.           Cat1.      status1
1.           Cat2.      status1
1.           Cat3.      status1
2.           Cat1.      status1
2.           Cat2.      status1
3.           Cat2.      status1

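If a record with Cat1 exists, select it; otherwise select Cat2. If Cat2 does not exist, select Cat3. Only one record should be selected out of the multiple records for each product.

【Comments】:

【Answer 1】:

Use row_number():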

select t.*
from (select t.*, row_number () over (partition by productid order by category) as seq
      from table t
     ) t
where seq = 1;

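If the category names are different (so that they do not sort in the preferred order), use a case expression in the order by: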

order by (case when category = 'category_x' then 1 
               when category = 'category_gg' then 2 
               else 3 
         end)

【Comments】:

Thanks for the quick reply, Yogesh. This helped a lot.

【Answer 2】:

This is the same answer as @Yogesh Sharma's, written with DataFrame functions.

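import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._  // enables the $"..." column syntax

// Number the rows of each product by category and keep the first one.
val w = Window.partitionBy("Product ID").orderBy("Category")
df.withColumn("row", row_number().over(w))
  .filter($"row" === 1)
  .orderBy("Product ID")
  .drop("row")
  .show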

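Alternatively, use groupBy with a self-join, for example:

import org.apache.spark.sql.functions.first

df.join(df.groupBy("Product ID").agg(first("Category").as("Category")), Seq("Product ID", "Category")).show

Both of these give the result: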

+----------+--------+-------+
|Product ID|Category| Status|
+----------+--------+-------+
|         1|    Cat1|status1|
|         2|    Cat1|status1|
|         3|    Cat2|status1|
+----------+--------+-------+
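One caveat on the groupBy variant: `first` returns whichever row Spark happens to see first in each group, and that order is not guaranteed after a shuffle. A minimal sketch of a deterministic version, assuming the alphabetical order of the category names matches the preference, replaces `first` with `min`. The sample data and the names `sample`, `best`, and `result` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min

// Local session for the sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("min-join-sketch").getOrCreate()
import spark.implicits._

// Illustrative data matching the question's example.
val sample = Seq(
  (1, "Cat1", "status1"), (1, "Cat2", "status1"), (1, "Cat3", "status1"),
  (2, "Cat1", "status1"), (2, "Cat2", "status1"),
  (3, "Cat2", "status1")
).toDF("Product ID", "Category", "Status")

// min picks the alphabetically smallest category per product,
// independent of the order in which rows arrive.
val best = sample.groupBy("Product ID").agg(min("Category").as("Category"))

// Join back to recover the remaining columns of the chosen row.
val result = sample.join(best, Seq("Product ID", "Category"))
result.orderBy("Product ID").show()
```

If several rows share the same product and category, the join keeps all of them, so the window-function version is the safer choice when exactly one row per product is required.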

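【Comments】:

This is exactly what I needed. Thank you very much.

【Answer 3】:

Considering that your categories are cat1., cat2., ... cat10., ... cat100., ...

you have to extract the number from the category and sort on that number, because a plain string sort would place cat10. before cat2.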

SELECT * FROM
    (
        SELECT
            T.*,
            ROW_NUMBER() OVER(
                PARTITION BY PRODUCTID
                ORDER BY TO_NUMBER(REGEXP_SUBSTR(CATEGORY, '[0-9]+'))
            ) AS RN
        FROM YOUR_TABLE T
    )
WHERE RN = 1;
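The query above uses `TO_NUMBER` and `REGEXP_SUBSTR` in their Oracle-style form, which may not run unchanged in Spark SQL. A rough DataFrame equivalent, assuming the category always contains a run of digits, could use `regexp_extract` with a cast. The sample rows and the names `products`, `categoryNumber`, `byNumber`, and `lowest` are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, regexp_extract, row_number}

// Local session for the sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("numeric-suffix-sketch").getOrCreate()
import spark.implicits._

// Illustrative data: as a string, "Cat10" sorts before "Cat2".
val products = Seq(
  (1, "Cat10", "status1"), (1, "Cat2", "status1"),
  (2, "Cat3", "status1")
).toDF("Product ID", "Category", "Status")

// Extract the first run of digits (group 0 is the whole match) and sort on it as an integer.
val categoryNumber = regexp_extract(col("Category"), "[0-9]+", 0).cast("int")
val byNumber = Window.partitionBy("Product ID").orderBy(categoryNumber)

val lowest = products
  .withColumn("rn", row_number().over(byNumber))
  .filter(col("rn") === 1)
  .drop("rn")
lowest.orderBy("Product ID").show()
```

With this data, product 1 keeps Cat2 rather than Cat10, which is the point of sorting on the numeric part.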

Cheers!!

【Comments】:

【Answer 4】:

I developed the following solution with the help of Yogesh's and Lamansa's answers:

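import org.apache.spark.sql.functions.{min, when}
import spark.implicits._  // enables the $"..." column syntax

// Map each category to a sortable preference key ("A" is the most preferred).
val df1 = df.withColumn("row_num",
  when($"category" === "Cat1", "A")
    .when($"category" === "Cat2", "B")
    .when($"category" === "Cat3", "C"))

// Keep, for each product, the row with the smallest preference key.
df1.join(df1.groupBy("product_id").agg(min("row_num").as("row_num")),
    Seq("product_id", "row_num"))
  .drop("row_num")
  .show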

Ordering by category alone does not guarantee that the result follows your preferred order. For example, Cat2 could be the first preference.

Output:
+----------+--------+-------+
|Product ID|Category| Status|
+----------+--------+-------+
|         1|    Cat1|status1|
|         2|    Cat1|status1|
|         3|    Cat2|status1|
+----------+--------+-------+
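The caveat about preference order can also be handled by keeping the order in an explicit list and ranking each row by its position in that list. The following is only a sketch under assumptions: the preference order (Cat2 first) is hypothetical, and the names `preference`, `rank`, and `picked` are not from the answers above.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number, when}

// Local session for the sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("preference-sketch").getOrCreate()
import spark.implicits._

// Illustrative data matching the question's example.
val df = Seq(
  (1, "Cat1", "status1"), (1, "Cat2", "status1"), (1, "Cat3", "status1"),
  (2, "Cat1", "status1"), (2, "Cat2", "status1"),
  (3, "Cat2", "status1")
).toDF("Product ID", "Category", "Status")

// Hypothetical preference order: Cat2 first, then Cat1, then Cat3.
val preference = Seq("Cat2", "Cat1", "Cat3")

// Rank = position in the preference list; categories not in the list sort last.
val rank: Column = preference.zipWithIndex.foldLeft(lit(preference.size)) {
  case (fallback, (category, idx)) =>
    when(col("Category") === category, idx).otherwise(fallback)
}

// Keep the best-ranked row of each product.
val w = Window.partitionBy("Product ID").orderBy(rank)
val picked = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
picked.orderBy("Product ID").show()
```

Because every product in the sample has a Cat2 row, this hypothetical order selects Cat2 for all three products. Changing the hierarchy only requires editing the `preference` list.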

【Comments】:
