How to do Sliding Window Rank in Spark using Scala?

Posted: 2019-05-14 02:51:12

[Question]:

I have a dataset:

+-----+-------------------+---------------------+------------------+
|query|similar_queries    |model_score          |count             |
+-----+-------------------+---------------------+------------------+
|shirt|funny shirt        |0.0034038130658784866|189.0             |
|shirt|shirt womens       |0.0019435265241921438|136.0             |
|shirt|watch              |0.001097496453284101 |212.0             |
|shirt|necklace           |6.694577024597908E-4 |151.0             |
|shirt|white shirt        |0.0037413097560623485|217.0             |
|shirt|shoes              |0.0022062579255572733|575.0             |
|shirt|crop top           |9.065831060804897E-4 |173.0             |
|shirt|polo shirts for men|0.007706416273211698 |349.0             |
|shirt|shorts             |0.002669621942466027 |200.0             |
|shirt|black shirt        |0.03264296242546658  |114.0             |
+-----+-------------------+---------------------+------------------+

I first rank the dataset by "count":

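import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

lazy val countWindowByFreq = Window.partitionBy(col(QUERY)).orderBy(col(COUNT).desc)
val ranked_data = data.withColumn("count_rank", row_number().over(countWindowByFreq))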

+-----+-------------------+---------------------+------------------+----------+
|query|similar_queries    |model_score          |count             |count_rank|
+-----+-------------------+---------------------+------------------+----------+
|shirt|shoes              |0.0022062579255572733|575.0             |1         |
|shirt|polo shirts for men|0.007706416273211698 |349.0             |2         |
|shirt|white shirt        |0.0037413097560623485|217.0             |3         |
|shirt|watch              |0.001097496453284101 |212.0             |4         |
|shirt|shorts             |0.002669621942466027 |200.0             |5         |
|shirt|funny shirt        |0.0034038130658784866|189.0             |6         |
|shirt|crop top           |9.065831060804897E-4 |173.0             |7         |
|shirt|necklace           |6.694577024597908E-4 |151.0             |8         |
|shirt|shirt womens       |0.0019435265241921438|136.0             |9         |
|shirt|black shirt        |0.03264296242546658  |114.0             |10        |
+-----+-------------------+---------------------+------------------+----------+

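I am now trying to re-rank the rows using a rolling window of 4 rows over row_number, ranking by model_score within each window. For example:

In the first window, row_number 1 to 4, the new rank (new column) would be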

1. polo shirts for men
2. white shirt
3. shoes
4. watch

In the second window, row_number 5 to 8, the new rank (new column) would be

5. funny shirt
6. shorts
7. crop top
8. necklace

In the third window, row_number 9 through the remaining rows, the new rank (new column) would be

9. black shirt
10. shirt womens

Can someone tell me how to implement this with Spark and Scala? Is there a predefined function I can use?

I tried:

lazy val MODEL_RANK = Window.partitionBy(col(QUERY)).orderBy(col(MODEL_SCORE).desc).rowsBetween(0, 3)

But this gives me:

sql.AnalysisException: Window Frame ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW;

I also tried .rowsBetween(-3, 0), but that gives an error as well:

org.apache.spark.sql.AnalysisException: Window Frame ROWS BETWEEN 3 PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW;

[Comments]:

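What is the expected output dataframe?

@ollik1 The expected output is 1. polo shirts for men 2. white shirt 3. shoes 4. watch 5. funny shirt 6. shorts 7. crop top 8. necklace 9. black shirt 10. shirt womens

[Answer 1]: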

Since you have already computed count_rank, the next step is to find a way to split the rows into groups of four. This can be done as follows:

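import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark

val ranked_data_grouped = ranked_data
  .withColumn("bucket", (($"count_rank" - 1) / 4).cast(IntegerType))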

ranked_data_grouped will look like this:

+-----+-------------------+---------------------+------------------+----------+-------+
|query|similar_queries    |model_score          |count             |count_rank|bucket |
+-----+-------------------+---------------------+------------------+----------+-------+
|shirt|shoes              |0.0022062579255572733|575.0             |1         |0      |
|shirt|polo shirts for men|0.007706416273211698 |349.0             |2         |0      |
|shirt|white shirt        |0.0037413097560623485|217.0             |3         |0      |
|shirt|watch              |0.001097496453284101 |212.0             |4         |0      |
|shirt|shorts             |0.002669621942466027 |200.0             |5         |1      |
|shirt|funny shirt        |0.0034038130658784866|189.0             |6         |1      |
|shirt|crop top           |9.065831060804897E-4 |173.0             |7         |1      |
|shirt|necklace           |6.694577024597908E-4 |151.0             |8         |1      |
|shirt|shirt womens       |0.0019435265241921438|136.0             |9         |2      |
|shirt|black shirt        |0.03264296242546658  |114.0             |10        |2      |
+-----+-------------------+---------------------+------------------+----------+-------+
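The bucket arithmetic can be checked without Spark. The following is a minimal plain-Scala sketch, not part of the original answer; blockSize and countRanks are names introduced here for illustration. In the Spark expression, dividing columns with / yields a double, so the cast to IntegerType does the truncation that integer division does here.

```scala
// Plain-Scala check of the bucket formula (no Spark required).
// blockSize is a hypothetical name; 4 is the window size from the question.
val blockSize = 4
val countRanks = (1 to 10).toList

// Integer division maps ranks 1-4 to bucket 0, 5-8 to bucket 1, 9-10 to bucket 2.
val buckets = countRanks.map(rank => (rank - 1) / blockSize)

println(buckets) // List(0, 0, 0, 0, 1, 1, 1, 1, 2, 2)
```

Subtracting 1 before dividing is what makes rank 4 fall into bucket 0 rather than bucket 1.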

Now all you have to do is partition by bucket and order by model_score:

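// Partition by query as well, so that buckets of different queries are ranked separately.
val output = ranked_data_grouped
  .withColumn("finalRank", row_number().over(Window.partitionBy($"query", $"bucket").orderBy($"model_score".desc)))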
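To see what ranking within each bucket produces, here is a rough plain-Scala emulation on the ten rows from the question. It uses ordinary collections rather than Spark, and Item, items, blockSize and rankedInBucket are illustrative names that do not appear in the original answer.

```scala
// Rows from the question's table: similar_queries, model_score, count_rank.
final case class Item(name: String, score: Double, countRank: Int)

val items = List(
  Item("shoes", 0.0022062579255572733, 1),
  Item("polo shirts for men", 0.007706416273211698, 2),
  Item("white shirt", 0.0037413097560623485, 3),
  Item("watch", 0.001097496453284101, 4),
  Item("shorts", 0.002669621942466027, 5),
  Item("funny shirt", 0.0034038130658784866, 6),
  Item("crop top", 9.065831060804897E-4, 7),
  Item("necklace", 6.694577024597908E-4, 8),
  Item("shirt womens", 0.0019435265241921438, 9),
  Item("black shirt", 0.03264296242546658, 10)
)

val blockSize = 4 // rows per bucket, as in the question

// Emulates row_number() over (partition by bucket order by model_score desc):
// group rows into buckets, sort each bucket by descending score, number from 1.
val rankedInBucket: List[(Int, Int, String)] = items
  .groupBy(item => (item.countRank - 1) / blockSize)
  .toList
  .sortBy { case (bucket, _) => bucket }
  .flatMap { case (bucket, members) =>
    members
      .sortBy(item => -item.score)
      .zipWithIndex
      .map { case (item, i) => (bucket, i + 1, item.name) }
  }

// Prints (bucket, rank within bucket, similar_queries) for every row.
rankedInBucket.foreach(println)
```

The order of the rows matches the ordering asked for in the question, but the rank restarts at 1 in every bucket instead of running from 1 to 10.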

[Comments]:

But this doesn't give a finalRank from 1 to n; it gives 1..4, 1..4 again, and so on. Is there a way to get a final rank of 1..n, i.e. 1..4 (bucket 0) immediately followed by 5..8 (ranks 1 to 4 of bucket 1)?

I got it:

val output = ranked_data_grouped
  .withColumn("finalRanksTemp", row_number().over(Window.partitionBy($"bucket").orderBy(col("model_score").desc)))
  .withColumn("finalRanks", row_number().over(Window.partitionBy($"query").orderBy(col("bucket"), col("finalRanksTemp"))))
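As a possible alternative to the second row_number pass in the comment above, the running rank can be derived arithmetically: every bucket before the current one is full, so the final rank is bucket * 4 plus the rank inside the bucket. The sketch below is self-contained and rebuilds the question's data; the local SparkSession, blockSize, the window names and the rank_in_bucket column are assumptions made here for illustration, not part of the original thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
import org.apache.spark.sql.types.IntegerType

// Assumption: a local session is enough for trying this out.
val spark = SparkSession.builder().master("local[1]").appName("block-rank-sketch").getOrCreate()
import spark.implicits._

val blockSize = 4 // rows per block, as in the question

val data = Seq(
  ("shirt", "funny shirt", 0.0034038130658784866, 189.0),
  ("shirt", "shirt womens", 0.0019435265241921438, 136.0),
  ("shirt", "watch", 0.001097496453284101, 212.0),
  ("shirt", "necklace", 6.694577024597908E-4, 151.0),
  ("shirt", "white shirt", 0.0037413097560623485, 217.0),
  ("shirt", "shoes", 0.0022062579255572733, 575.0),
  ("shirt", "crop top", 9.065831060804897E-4, 173.0),
  ("shirt", "polo shirts for men", 0.007706416273211698, 349.0),
  ("shirt", "shorts", 0.002669621942466027, 200.0),
  ("shirt", "black shirt", 0.03264296242546658, 114.0)
).toDF("query", "similar_queries", "model_score", "count")

// Neither window sets rowsBetween: row_number only accepts its default frame.
val byCount = Window.partitionBy(col("query")).orderBy(col("count").desc)
val byScoreInBucket = Window.partitionBy(col("query"), col("bucket")).orderBy(col("model_score").desc)

val result = data
  .withColumn("count_rank", row_number().over(byCount))
  .withColumn("bucket", ((col("count_rank") - 1) / blockSize).cast(IntegerType))
  .withColumn("rank_in_bucket", row_number().over(byScoreInBucket))
  // All earlier buckets hold exactly blockSize rows, so they contribute bucket * blockSize ranks.
  .withColumn("finalRank", col("bucket") * blockSize + col("rank_in_bucket"))
  .orderBy(col("query"), col("finalRank"))

result.select("similar_queries", "count_rank", "bucket", "finalRank").show(false)
```

This relies on only the last bucket of each query being allowed to be short, which holds because the buckets are cut from a gap-free row_number. Ties in count or model_score are broken arbitrarily by row_number, so a secondary sort column may be needed if deterministic output matters.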
