Generate daily dates and forward-fill columns [duplicate]
【Title】: Generates dates daily & forward fill columns [duplicate]
【Posted】: 2020-12-17 15:56:05
【Question】: I have several datasets with sparse dates and values:
date | value
12/01/20 | 1
12/04/20 | 2
12/08/20 | 3
I would like to create a row for every date in between, forward-filling the last known value, for example:
date | value
12/01/20 | 1
12/02/20 | 1
12/03/20 | 1
12/04/20 | 2
12/05/20 | 2
12/06/20 | 2
12/07/20 | 2
12/08/20 | 3
Thanks!
【Answer 1】: The following code should be close to what you are looking for.
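import datetime

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

# Reuse the active session if one exists (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Sparse sample data: one dict per row
df_all = spark.createDataFrame([
    {"date": datetime.date(2020, 12, 1), "value": 1},
    {"date": datetime.date(2020, 12, 4), "value": 2},
    {"date": datetime.date(2020, 12, 8), "value": 3},
])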
df_all.show()
"""
+----------+-----+
|      date|value|
+----------+-----+
|2020-12-01|    1|
|2020-12-04|    2|
|2020-12-08|    3|
+----------+-----+
"""
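# No partitionBy here, so Spark moves all rows into a single partition for this window
window = Window.orderBy("date")
df_with_previous_date = df_all.withColumn("previous_date", F.lag("date", 1).over(window))
df_with_previous_value = df_with_previous_date.withColumn("previous_value", F.lag("value", 1).over(window))
# datediff(end, start) is end - start, so the result is negative here; adding 1
# stops the offsets one day short of the previous date. The first row has no
# previous date, so coalesce gives it 0.
df_with_days_between = df_with_previous_value.withColumn(
    "days_between",
    F.coalesce(
        F.datediff("previous_date", "date") + 1,
        F.lit(0)
    )
)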
df_with_days_between.show()
"""
+----------+-----+-------------+--------------+------------+
|      date|value|previous_date|previous_value|days_between|
+----------+-----+-------------+--------------+------------+
|2020-12-01|    1|         null|          null|           0|
|2020-12-04|    2|   2020-12-01|             1|          -2|
|2020-12-08|    3|   2020-12-04|             2|          -3|
+----------+-----+-------------+--------------+------------+
"""
# sequence(0, days_between) counts down from 0 to the (negative) offset, e.g. [0, -1, -2]
df_with_sequence = df_with_days_between.withColumn("day_offset_sequence", F.sequence(F.lit(0), "days_between"))
df_with_sequence.show()
"""
+----------+-----+-------------+--------------+------------+-------------------+
|      date|value|previous_date|previous_value|days_between|day_offset_sequence|
+----------+-----+-------------+--------------+------------+-------------------+
|2020-12-01|    1|         null|          null|           0|                [0]|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|
+----------+-----+-------------+--------------+------------+-------------------+
"""
# One output row per offset, then shift each row's date by its offset
df_exploded = df_with_sequence.withColumn("day_offset", F.explode("day_offset_sequence"))
df_range = df_exploded.withColumn("date_index", F.col("date") + F.col("day_offset"))
df_range.show()
"""
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
|      date|value|previous_date|previous_value|days_between|day_offset_sequence|day_offset|date_index|
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
|2020-12-01|    1|         null|          null|           0|                [0]|         0|2020-12-01|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|         0|2020-12-04|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|        -1|2020-12-03|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|        -2|2020-12-02|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|         0|2020-12-08|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -1|2020-12-07|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -2|2020-12-06|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -3|2020-12-05|
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
"""
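# Offset 0 is the original row and keeps its own value; the generated
# earlier days take the value carried forward from the previous row.
df_true_value = df_range.withColumn(
    "true_value",
    F.when(
        F.col("day_offset") == F.lit(0),
        F.col("value")
    ).otherwise(
        F.col("previous_value")
    )
)
df = df_true_value.select(
    F.col("date_index").alias("date"),
    F.col("true_value").alias("value")
).orderBy("date")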
df.show()
"""
+----------+-----+
|      date|value|
+----------+-----+
|2020-12-01|    1|
|2020-12-02|    1|
|2020-12-03|    1|
|2020-12-04|    2|
|2020-12-05|    2|
|2020-12-06|    2|
|2020-12-07|    2|
|2020-12-08|    3|
+----------+-----+
"""
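As a cross-check on the expected result, the same fill logic can be sketched in plain Python without Spark. This is only an illustration for small samples, not a replacement for the Spark code; `fill_daily` is a hypothetical helper name, and the sample rows are the ones from the question.

```python
import datetime


def fill_daily(rows):
    """Expand sparse (date, value) pairs to one row per day,
    carrying the last known value forward."""
    rows = sorted(rows)  # order by date
    one_day = datetime.timedelta(days=1)
    filled = []
    # Pair each row with the row that follows it (None for the last row)
    for (day, value), following in zip(rows, rows[1:] + [None]):
        # Fill up to, but not including, the next known date;
        # the last row only emits itself.
        end = following[0] if following is not None else day + one_day
        current = day
        while current < end:
            filled.append((current, value))
            current += one_day
    return filled


sparse = [
    (datetime.date(2020, 12, 1), 1),
    (datetime.date(2020, 12, 4), 2),
    (datetime.date(2020, 12, 8), 3),
]
for day, value in fill_daily(sparse):
    print(day, value)
```

Comparing this list against `df.collect()` on a small sample is one way to check that the window and offset arithmetic behave as intended.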