PySpark window function: mark the first row in each partition that meets a specific condition

Posted: 2021-07-03 10:47:48

【Question】: Given this dataframe:
+--------+------+----------+--------+
|app_id |order |entry_flag|operator|
+--------+------+----------+--------+
|AP-1 |1 |1 |S |
|AP-1 |2 |0 |A |
|AP-2 |3 |0 |S |
|AP-2 |4 |0 |A |
|AP-2 |5 |1 |S |
|AP-2 |6 |0 |S |
|AP-2 |7 |0 |A |
|AP-2 |8 |0 |A |
|AP-2 |9 |1 |A |
|AP-2 |10 |0 |S |
+--------+------+----------+--------+
I want to add a new boolean column flag_x. The logic is: partition/group by app_id and sort by order; whenever a row with entry_flag = 1 is encountered, keep moving forward and find the first subsequent row with entry_flag = 0 and operator = A, and mark that row with flag_x = 1; every other row gets flag_x = 0.
For the example above, the output should be:
+--------+------+----------+--------+------+
|app_id |order |entry_flag|operator|flag_x|
+--------+------+----------+--------+------+
|AP-1 |1 |1 |S |0 |
|AP-1 |2 |0 |A |1 |
|AP-2 |3 |0 |S |0 |
|AP-2 |4 |0 |A |0 |
|AP-2 |5 |1 |S |0 |
|AP-2 |6 |0 |S |0 |
|AP-2 |7 |0 |A |1 |
|AP-2 |8 |0 |A |0 |
|AP-2 |9 |1 |A |0 |
|AP-2 |10 |0 |S |0 |
+--------+------+----------+--------+------+
How can we achieve this with PySpark DataFrame operations?
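Before looking at the window-function answer, the required rule can be sketched in plain Python over a single partition (rows already sorted by order). This is only an illustration of the logic, not the PySpark solution, and `flag_first_match` is a hypothetical helper name:

```python
def flag_first_match(rows):
    """rows: list of (entry_flag, operator) tuples for one app_id, sorted by order.
    Returns a flag_x value per row: 1 on the first entry_flag == 0 / operator == 'A'
    row that follows an entry_flag == 1 row, else 0."""
    flags = []
    armed = False  # becomes True once an entry_flag == 1 row has been seen
    for entry_flag, operator in rows:
        if entry_flag == 1:
            armed = True       # start (or restart) looking for a match
            flags.append(0)
        elif armed and entry_flag == 0 and operator == 'A':
            flags.append(1)    # first qualifying row after the trigger
            armed = False      # only the first such row is flagged
        else:
            flags.append(0)
    return flags

# AP-2 partition from the question (orders 3..10):
print(flag_first_match([(0, 'S'), (0, 'A'), (1, 'S'), (0, 'S'),
                        (0, 'A'), (0, 'A'), (1, 'A'), (0, 'S')]))
# → [0, 0, 0, 0, 1, 0, 0, 0]
```

Note that order 9 (entry_flag = 1, operator = 'A') re-arms the search rather than matching, which is why order 10 stays 0 in the expected output.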
【Answer 1】: Your problem is not too hard to solve; I have left comments in the code:
from pyspark.sql import Row, Window
import pyspark.sql.functions as f
df = spark.createDataFrame([
    Row(app_id='AP-1', order=1, entry_flag=1, operator='S'),
    Row(app_id='AP-1', order=2, entry_flag=0, operator='A'),
    Row(app_id='AP-2', order=3, entry_flag=0, operator='S'),
    Row(app_id='AP-2', order=4, entry_flag=0, operator='A'),
    Row(app_id='AP-2', order=5, entry_flag=1, operator='S'),
    Row(app_id='AP-2', order=6, entry_flag=0, operator='S'),
    Row(app_id='AP-2', order=7, entry_flag=0, operator='A'),
    Row(app_id='AP-2', order=8, entry_flag=0, operator='A'),
    Row(app_id='AP-2', order=9, entry_flag=1, operator='A'),
    Row(app_id='AP-2', order=10, entry_flag=0, operator='S')
])
# Create a group id: a running sum of entry_flag within each app_id,
# so a new group starts at every entry_flag = 1 row
w_entry = Window.partitionBy('app_id').orderBy('order')
df = df.withColumn('group', f.sum('entry_flag').over(w_entry))
# Apply the boolean rule: a row can only match once at least one
# entry_flag = 1 row has been seen (group > 0)
df = df.withColumn('match', f.when(f.col('group') > f.lit(0),
                                   (f.col('entry_flag') == f.lit(0)) & (f.col('operator') == f.lit('A')))
                             .otherwise(f.lit(False)))
# +------+-----+----------+--------+-----+-----+
# |app_id|order|entry_flag|operator|group|match|
# +------+-----+----------+--------+-----+-----+
# |AP-1 |1 |1 |S |1 |false|
# |AP-1 |2 |0 |A |1 |true |
# |AP-2 |3 |0 |S |0 |false|
# |AP-2 |4 |0 |A |0 |false|
# |AP-2 |5 |1 |S |1 |false|
# |AP-2 |6 |0 |S |1 |false|
# |AP-2 |7 |0 |A |1 |true |
# |AP-2 |8 |0 |A |1 |true |
# |AP-2 |9 |1 |A |2 |false|
# |AP-2 |10 |0 |S |2 |false|
# +------+-----+----------+--------+-----+-----+
# If a group has two or more matches like the example below
# |AP-2 |7 |0 |A |1 |true |
# |AP-2 |8 |0 |A |1 |true |
# identify which is the first occurrence and set `flag_x` with 1 to it.
w_flag = Window.partitionBy('app_id', 'group', 'match')
df = df.withColumn('flag_x', (f.col('match') & (f.col('order') == f.min('order').over(w_flag))).cast('int'))
# Drop temporary columns
df = df.drop('group', 'match')
df.show(truncate=False)
# +------+-----+----------+--------+------+
# |app_id|order|entry_flag|operator|flag_x|
# +------+-----+----------+--------+------+
# |AP-1 |1 |1 |S |0 |
# |AP-1 |2 |0 |A |1 |
# |AP-2 |3 |0 |S |0 |
# |AP-2 |4 |0 |A |0 |
# |AP-2 |5 |1 |S |0 |
# |AP-2 |6 |0 |S |0 |
# |AP-2 |7 |0 |A |1 |
# |AP-2 |8 |0 |A |0 |
# |AP-2 |9 |1 |A |0 |
# |AP-2 |10 |0 |S |0 |
# +------+-----+----------+--------+------+
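As an aside, the `group` column in the answer above is just a running sum of `entry_flag` within each partition. The same grouping idea can be illustrated in plain Python with `itertools.accumulate` (purely illustrative, not part of the Spark job):

```python
from itertools import accumulate

# entry_flag values for the AP-2 partition, sorted by order (orders 3..10)
entry_flags = [0, 0, 1, 0, 0, 0, 1, 0]

# Running sum: each 1 starts a new group. Rows with group == 0 precede
# every entry_flag = 1 row, so they can never be flagged.
groups = list(accumulate(entry_flags))
# groups == [0, 0, 1, 1, 1, 1, 2, 2], matching the intermediate table above
```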
【Comments】:

Thanks Kafels, that is a very clean approach with a detailed explanation.