Generates dates daily & forward fill columns [duplicate]

【Title】: Generates dates daily & forward fill columns [duplicate] 【Posted】: 2020-12-17 15:56:05 【Question】:

I have several datasets with sparse dates and values:

  date   | value
12/01/20 |   1
12/04/20 |   2
12/08/20 |   3

and I would like to create a row for every date in between, forward-filling the last known value, e.g.:

  date   | value
12/01/20 |   1
12/02/20 |   1
12/03/20 |   1
12/04/20 |   2
12/05/20 |   2
12/06/20 |   2
12/07/20 |   2
12/08/20 |   3

Thanks!

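【Comments】:

【Answer 1】:

The following code should be close to what you are looking for. It looks up each row's previous date and value with lag, builds a sequence of day offsets reaching back toward that previous date, explodes the sequence into one row per day, and gives the generated days the previous row's value.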

from pyspark.sql import functions as F
from pyspark.sql import Window
import datetime

# Assumes an existing SparkSession named `spark` (as in the pyspark shell or a notebook).
df_all = spark.createDataFrame([
  {"date": datetime.date(2020, 12, 1), "value": 1},
  {"date": datetime.date(2020, 12, 4), "value": 2},
  {"date": datetime.date(2020, 12, 8), "value": 3}
])
df_all.show()
"""
+----------+-----+
|      date|value|
+----------+-----+
|2020-12-01|    1|
|2020-12-04|    2|
|2020-12-08|    3|
+----------+-----+
"""

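# No partitionBy here, so Spark moves all rows into a single partition for this window.
window = Window.orderBy("date")

# Attach the previous row's date and value to every row.
df_with_previous_date = df_all.withColumn("previous_date", F.lag("date", 1).over(window))
df_with_previous_value = df_with_previous_date.withColumn("previous_value", F.lag("value", 1).over(window))
# datediff(end, start) is end - start, so datediff(previous_date, date) is negative.
# Adding 1 makes the offsets stop one day after previous_date.
# The first row has no previous row (null), so coalesce gives it 0.
df_with_days_between = df_with_previous_value.withColumn(
  "days_between",
  F.coalesce(
    F.datediff("previous_date", "date") + 1,
    F.lit(0)
  )
)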

df_with_days_between.show()
"""
+----------+-----+-------------+--------------+------------+
|      date|value|previous_date|previous_value|days_between|
+----------+-----+-------------+--------------+------------+
|2020-12-01|    1|         null|          null|           0|
|2020-12-04|    2|   2020-12-01|             1|          -2|
|2020-12-08|    3|   2020-12-04|             2|          -3|
+----------+-----+-------------+--------------+------------+
"""


# sequence(0, days_between) counts down from 0, e.g. [0, -1, -2] for days_between = -2.
df_with_sequence = df_with_days_between.withColumn("day_offset_sequence", F.sequence(F.lit(0), "days_between"))
df_with_sequence.show()
"""
+----------+-----+-------------+--------------+------------+-------------------+
|      date|value|previous_date|previous_value|days_between|day_offset_sequence|
+----------+-----+-------------+--------------+------------+-------------------+
|2020-12-01|    1|         null|          null|           0|                [0]|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|
+----------+-----+-------------+--------------+------------+-------------------+
"""


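# One output row per offset; adding the (zero or negative) offset to the date walks back toward previous_date.
df_exploded = df_with_sequence.withColumn("day_offset", F.explode("day_offset_sequence"))
df_range = df_exploded.withColumn("date_index", F.col("date") + F.col("day_offset"))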
df_range.show()

"""
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
|      date|value|previous_date|previous_value|days_between|day_offset_sequence|day_offset|date_index|
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
|2020-12-01|    1|         null|          null|           0|                [0]|         0|2020-12-01|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|         0|2020-12-04|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|        -1|2020-12-03|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|        -2|2020-12-02|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|         0|2020-12-08|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -1|2020-12-07|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -2|2020-12-06|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -3|2020-12-05|
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
"""

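# Offset 0 is the original row and keeps its own value;
# every generated (back-filled) day takes the previous row's value.
df_true_value = df_range.withColumn(
  "true_value",
  F.when(
    F.col("day_offset") == F.lit(0),
    F.col("value")
  ).otherwise(
    F.col("previous_value")
  )
)
# Keep only the generated date and the resolved value, under the original column names.
df = df_true_value.select(
  F.col("date_index").alias("date"),
  F.col("true_value").alias("value")
).orderBy("date")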
df.show()
"""
+----------+-----+
|      date|value|
+----------+-----+
|2020-12-01|    1|
|2020-12-02|    1|
|2020-12-03|    1|
|2020-12-04|    2|
|2020-12-05|    2|
|2020-12-06|    2|
|2020-12-07|    2|
|2020-12-08|    3|
+----------+-----+
"""
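
To sanity-check the Spark result on a small sample, the same gap filling can be reproduced in plain Python. This is only a minimal sketch for verification, not part of the Spark approach above; the helper name `forward_fill_daily` and the `(date, value)` tuple layout are assumptions made for this illustration.

```python
import datetime

ONE_DAY = datetime.timedelta(days=1)

def forward_fill_daily(rows):
    """Hypothetical helper: expand sparse (date, value) tuples to one tuple per day.

    Each value is carried forward until the day before the next known date.
    """
    rows = sorted(rows, key=lambda r: r[0])  # order by date
    filled = []
    for i, (day, value) in enumerate(rows):
        # The last known row is emitted once; earlier rows run up to the next known date.
        stop = rows[i + 1][0] if i + 1 < len(rows) else day + ONE_DAY
        while day < stop:
            filled.append((day, value))
            day += ONE_DAY
    return filled

sample = [
    (datetime.date(2020, 12, 1), 1),
    (datetime.date(2020, 12, 4), 2),
    (datetime.date(2020, 12, 8), 3),
]
expected = forward_fill_daily(sample)
for day, value in expected:
    print(day, value)
```

For the three sample rows this yields eight tuples, from 2020-12-01 through 2020-12-08. If the data is small enough to collect, the list can be compared against `[(r["date"], r["value"]) for r in df.collect()]`.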

【Comments】:
