Pyspark Count Null Values Between Non-Null Values
Posted: 2020-12-28 13:00:46

Question: My input dataframe is:
Date Client Score
2020-10-26 1 NULL
2020-10-27 1 NULL
2020-10-28 1 3
2020-10-29 1 6
2020-10-30 1 NULL
2020-10-31 1 NULL
2020-11-01 1 NULL
2020-11-02 1 NULL
2020-11-03 1 NULL
2020-11-04 1 NULL
2020-11-05 1 NULL
2020-11-06 1 NULL
2020-11-07 1 NULL
2020-11-08 1 NULL
2020-11-09 1 35
2020-10-26 2 NULL
2020-10-27 2 NULL
2020-10-28 2 NULL
2020-10-29 2 28
2020-10-30 2 NULL
2020-10-31 2 NULL
2020-11-01 2 NULL
2020-11-02 2 NULL
2020-11-03 2 NULL
2020-11-04 2 NULL
2020-11-05 2 1
2020-11-06 2 NULL
2020-11-07 2 NULL
2020-11-08 2 NULL
2020-11-09 2 NULL
I want to compute, for each client, the count of null rows between two non-null values as a new column in PySpark. I tried rangeBetween and similar approaches but could not get it to work. The requested output is shown below:
Date Client Score Until_non_null_value
2020-10-26 1 NULL 2 -> First null score value; 2 rows away from the first non-null value (3).
2020-10-27 1 NULL NULL -> Not the first null value in the run, so the result column is null.
2020-10-28 1 3 NULL
2020-10-29 1 6 NULL
2020-10-30 1 NULL 10 -> First null value after a non-null value (6); 10 rows away from the next non-null value (35).
2020-10-31 1 NULL NULL
2020-11-01 1 NULL NULL
2020-11-02 1 NULL NULL
2020-11-03 1 NULL NULL
2020-11-04 1 NULL NULL
2020-11-05 1 NULL NULL
2020-11-06 1 NULL NULL
2020-11-07 1 NULL NULL
2020-11-08 1 NULL NULL
2020-11-09 1 35 NULL
2020-10-26 2 NULL 3
2020-10-27 2 NULL NULL
2020-10-28 2 NULL NULL
2020-10-29 2 28 NULL
2020-10-30 2 NULL 6
2020-10-31 2 NULL NULL
2020-11-01 2 NULL NULL
2020-11-02 2 NULL NULL
2020-11-03 2 NULL NULL
2020-11-04 2 NULL NULL
2020-11-05 2 1 NULL
2020-11-06 2 NULL NULL
2020-11-07 2 NULL NULL
2020-11-08 2 NULL NULL
2020-11-09 2 NULL NULL
Can you help me with this?
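To make the rule concrete, here is a minimal plain-Python sketch (not Spark) of the intended logic, assuming one client's scores sit in a date-ordered list; `gap_to_next_value` is a made-up helper name for illustration only:

```python
def gap_to_next_value(scores):
    """For the first null of each null run, return the number of rows
    until the next non-null value; every other row gets None."""
    n = len(scores)
    out = [None] * n
    for i, s in enumerate(scores):
        # a row qualifies only if it is null and starts a run of nulls
        if s is None and (i == 0 or scores[i - 1] is not None):
            j = i
            while j < n and scores[j] is None:
                j += 1
            if j < n:  # a trailing null run with no later value stays None
                out[i] = j - i
    return out

client_1 = [None, None, 3, 6] + [None] * 10 + [35]
print(gap_to_next_value(client_1))
# -> [2, None, None, None, 10, None, None, None, None, None, None, None, None, None, None]
```

Note that for client 2 the trailing nulls after 2020-11-05 never reach another non-null value, so they all stay null, matching the requested output.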
Answer 1: This takes quite a few window functions...
from pyspark.sql import functions as F, Window

w = Window.partitionBy('Client').orderBy('Date')

result = df.withColumn(
    'rn',
    F.row_number().over(w)
).withColumn(  # distance in rows to the next non-null Score
    'Until_non_null_value',
    F.first(
        F.when(
            F.col('Score').isNotNull(),
            F.col('rn')
        ),
        ignorenulls=True
    ).over(w.rowsBetween(1, Window.unboundedFollowing)) - F.col('rn')
).withColumn(  # only keep the relevant rows and hide others with null
    'Until_non_null_value',
    F.when(
        F.lag('Score').over(w).isNotNull() | (F.col('rn') == 1),
        F.col('Until_non_null_value')
    )
).withColumn(  # hide more rows with null
    'Until_non_null_value',
    F.when(
        F.lead('Until_non_null_value').over(w).isNull(),
        F.col('Until_non_null_value')
    )
)

result.show(99, truncate=False)
+----------+------+-----+---+--------------------+
|Date |Client|Score|rn |Until_non_null_value|
+----------+------+-----+---+--------------------+
|2020-10-26|1 |null |1 |2 |
|2020-10-27|1 |null |2 |null |
|2020-10-28|1 |3 |3 |null |
|2020-10-29|1 |6 |4 |null |
|2020-10-30|1 |null |5 |10 |
|2020-10-31|1 |null |6 |null |
|2020-11-01|1 |null |7 |null |
|2020-11-02|1 |null |8 |null |
|2020-11-03|1 |null |9 |null |
|2020-11-04|1 |null |10 |null |
|2020-11-05|1 |null |11 |null |
|2020-11-06|1 |null |12 |null |
|2020-11-07|1 |null |13 |null |
|2020-11-08|1 |null |14 |null |
|2020-11-09|1 |35 |15 |null |
|2020-10-26|2 |null |1 |3 |
|2020-10-27|2 |null |2 |null |
|2020-10-28|2 |null |3 |null |
|2020-10-29|2 |28 |4 |null |
|2020-10-30|2 |null |5 |6 |
|2020-10-31|2 |null |6 |null |
|2020-11-01|2 |null |7 |null |
|2020-11-02|2 |null |8 |null |
|2020-11-03|2 |null |9 |null |
|2020-11-04|2 |null |10 |null |
|2020-11-05|2 |1 |11 |null |
|2020-11-06|2 |null |12 |null |
|2020-11-07|2 |null |13 |null |
|2020-11-08|2 |null |14 |null |
|2020-11-09|2 |null |15 |null |
+----------+------+-----+---+--------------------+
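If you want to check the window arithmetic without spinning up a Spark session, the three withColumn steps can be emulated over one client's date-ordered score list in plain Python. This is a sketch of the same logic, and `emulate_steps` is a hypothetical name, not part of the answer above:

```python
def emulate_steps(scores):
    n = len(scores)
    rn = list(range(1, n + 1))                  # F.row_number().over(w)
    # step 1: first non-null rn strictly after each row, minus own rn;
    # a backward scan plays the role of rowsBetween(1, unboundedFollowing)
    nxt, seen = [None] * n, None
    for i in range(n - 1, -1, -1):
        nxt[i] = seen
        if scores[i] is not None:
            seen = rn[i]
    until = [nxt[i] - rn[i] if nxt[i] is not None else None for i in range(n)]
    # step 2: keep rows preceded by a non-null score, or the very first row
    until = [u if (i == 0 or scores[i - 1] is not None) else None
             for i, u in enumerate(until)]
    # step 3: keep a row only if the following row's value is already null,
    # which hides non-null-score rows that slipped through step 2
    until = [u if (i == n - 1 or until[i + 1] is None) else None
             for i, u in enumerate(until)]
    return until

client_1 = [None, None, 3, 6] + [None] * 10 + [35]
print(emulate_steps(client_1))
# -> [2, None, None, None, 10, None, None, None, None, None, None, None, None, None, None]
```

Step 3 matters because a non-null-score row whose predecessor is also non-null (e.g. 2020-10-29 for client 1) survives step 2 with a value; it is masked here because the row after it still carries a value.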