Pyspark Count Null Values Between Non-Null Values

Posted: 2020-12-28 13:00:46

【Question】:

My input dataframe is:

Date        Client  Score
2020-10-26  1       NULL
2020-10-27  1       NULL
2020-10-28  1       3 
2020-10-29  1       6
2020-10-30  1       NULL
2020-10-31  1       NULL
2020-11-01  1       NULL
2020-11-02  1       NULL
2020-11-03  1       NULL
2020-11-04  1       NULL
2020-11-05  1       NULL
2020-11-06  1       NULL
2020-11-07  1       NULL
2020-11-08  1       NULL
2020-11-09  1       35
2020-10-26  2       NULL
2020-10-27  2       NULL
2020-10-28  2       NULL
2020-10-29  2       28
2020-10-30  2       NULL
2020-10-31  2       NULL
2020-11-01  2       NULL
2020-11-02  2       NULL
2020-11-03  2       NULL
2020-11-04  2       NULL
2020-11-05  2       1
2020-11-06  2       NULL
2020-11-07  2       NULL
2020-11-08  2       NULL
2020-11-09  2       NULL

I want to count, for each client, the number of nulls between two non-null values, as a new column in pyspark. I tried rangeBetween etc. but could not get it to work. The requested output is shared below:

Date        Client  Score  Until_non_null_value
2020-10-26  1       NULL   2     -> First null row; 2 rows away from the first non-null value (3).
2020-10-27  1       NULL   NULL  -> Not the first null in its run, so the result column is null.
2020-10-28  1       3      NULL
2020-10-29  1       6      NULL
2020-10-30  1       NULL   10    -> First null after a non-null value (6); 10 rows away from the next non-null value (35).
2020-10-31  1       NULL   NULL
2020-11-01  1       NULL   NULL
2020-11-02  1       NULL   NULL
2020-11-03  1       NULL   NULL
2020-11-04  1       NULL   NULL
2020-11-05  1       NULL   NULL
2020-11-06  1       NULL   NULL
2020-11-07  1       NULL   NULL
2020-11-08  1       NULL   NULL
2020-11-09  1       35     NULL
2020-10-26  2       NULL   3
2020-10-27  2       NULL   NULL
2020-10-28  2       NULL   NULL
2020-10-29  2       28     NULL
2020-10-30  2       NULL   6
2020-10-31  2       NULL   NULL
2020-11-01  2       NULL   NULL
2020-11-02  2       NULL   NULL
2020-11-03  2       NULL   NULL
2020-11-04  2       NULL   NULL
2020-11-05  2       1      NULL
2020-11-06  2       NULL   NULL
2020-11-07  2       NULL   NULL
2020-11-08  2       NULL   NULL
2020-11-09  2       NULL   NULL

Can you help me with this?
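To pin the rule down before turning to Spark, here is a plain-Python sketch (the helper name is made up, and this is not PySpark) of the same per-client computation: the first null of each run that starts the series or follows a non-null Score gets the row distance to the next non-null Score; every other row, including trailing null runs with no non-null after them, stays null.

```python
def gaps_to_next_non_null(scores):
    """For each position, the distance (in rows) to the next non-null score,
    emitted only on the first null of a run that starts the series or follows
    a non-null value; all other positions (including trailing runs) stay None."""
    n = len(scores)
    out = [None] * n
    for i, s in enumerate(scores):
        if s is not None:
            continue
        # first null of a run: start of series, or previous value non-null
        if i == 0 or scores[i - 1] is not None:
            for j in range(i + 1, n):
                if scores[j] is not None:
                    out[i] = j - i
                    break
    return out

client1 = [None, None, 3, 6] + [None] * 10 + [35]
print(gaps_to_next_non_null(client1))
# [2, None, None, None, 10, None, None, None, None, None, None, None, None, None, None]
```

For client 2, the trailing null run after the 1 on 2020-11-05 has no non-null value after it, so it produces no count, matching the expected output above.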

【Comments】:

【Answer 1】:

Lots of window functions...

from pyspark.sql import functions as F, Window

w = Window.partitionBy('Client').orderBy('Date')

result = df.withColumn(
    'rn',
    F.row_number().over(w)
).withColumn(    # get the difference in row numbers
    'Until_non_null_value',
    F.first(
        F.when(
            F.col('Score').isNotNull(),
            F.col('rn')
        ),
        ignorenulls=True
    ).over(w.rowsBetween(1, Window.unboundedFollowing)) - F.col('rn')
).withColumn(    # only keep the relevant rows and hide others with null
    'Until_non_null_value',
    F.when(
        F.lag('Score').over(w).isNotNull() | (F.col('rn') == 1),
        F.col('Until_non_null_value')
    )
).withColumn(    # hide more rows with null
    'Until_non_null_value', 
    F.when(
        F.lead('Until_non_null_value').over(w).isNull(), 
        F.col('Until_non_null_value')
    )
)
result.show(99, truncate=False)
+----------+------+-----+---+--------------------+
|Date      |Client|Score|rn |Until_non_null_value|
+----------+------+-----+---+--------------------+
|2020-10-26|1     |null |1  |2                   |
|2020-10-27|1     |null |2  |null                |
|2020-10-28|1     |3    |3  |null                |
|2020-10-29|1     |6    |4  |null                |
|2020-10-30|1     |null |5  |10                  |
|2020-10-31|1     |null |6  |null                |
|2020-11-01|1     |null |7  |null                |
|2020-11-02|1     |null |8  |null                |
|2020-11-03|1     |null |9  |null                |
|2020-11-04|1     |null |10 |null                |
|2020-11-05|1     |null |11 |null                |
|2020-11-06|1     |null |12 |null                |
|2020-11-07|1     |null |13 |null                |
|2020-11-08|1     |null |14 |null                |
|2020-11-09|1     |35   |15 |null                |
|2020-10-26|2     |null |1  |3                   |
|2020-10-27|2     |null |2  |null                |
|2020-10-28|2     |null |3  |null                |
|2020-10-29|2     |28   |4  |null                |
|2020-10-30|2     |null |5  |6                   |
|2020-10-31|2     |null |6  |null                |
|2020-11-01|2     |null |7  |null                |
|2020-11-02|2     |null |8  |null                |
|2020-11-03|2     |null |9  |null                |
|2020-11-04|2     |null |10 |null                |
|2020-11-05|2     |1    |11 |null                |
|2020-11-06|2     |null |12 |null                |
|2020-11-07|2     |null |13 |null                |
|2020-11-08|2     |null |14 |null                |
|2020-11-09|2     |null |15 |null                |
+----------+------+-----+---+--------------------+
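The key step above is the first windowed column: `F.first(..., ignorenulls=True)` over `rowsBetween(1, Window.unboundedFollowing)` returns, for each row, the row number of the *next* non-null Score, so subtracting `rn` gives the distance; the two `F.when` passes then null out everything except the first null of each qualifying run. A plain-Python sketch of that first step (helper name made up for illustration):

```python
def next_non_null_rn(scores):
    """Mimic F.first(when(score non-null, rn), ignorenulls=True)
    .over(rowsBetween(1, unboundedFollowing)): for each row, the 1-based
    row number of the next non-null score strictly after it, or None."""
    n = len(scores)
    nxt = [None] * n
    last = None
    for i in range(n - 1, -1, -1):  # scan backwards, carrying the next non-null rn
        nxt[i] = last
        if scores[i] is not None:
            last = i + 1  # row_number() is 1-based

    return nxt

scores = [None, None, 3, 6, None]
print(next_non_null_rn(scores))  # [3, 3, 4, None, None]

# distance to the next non-null value, like the answer's subtraction of rn
print([nx - (i + 1) if nx is not None else None
       for i, nx in enumerate(next_non_null_rn(scores))])  # [2, 1, 1, None, None]
```

The masking passes then keep the distance only where `F.lag('Score')` is non-null or `rn == 1` (the first null of a run), and finally only where the following row's value is null, which hides the intermediate distances computed on the non-null rows themselves.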

【Comments】:
