Forward Filling Multiple Columns Reusable Function Code


【Title】: Forward Filling Multiple Columns Reusable Function Code 【Posted】: 2019-05-10 16:08:51 【Question】:

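I am trying to turn a forward-fill imputation process, based on an earlier Stack Overflow post, into a reusable function (def(...)) so that I can apply it to multiple columns instead of keeping a separate code snippet for each column. Creating reusable functions with parameters has always been a challenge for me.

Thanks!

Post => Forward fill missing values in Spark/Python

Code sample snippet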

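# imports needed by the snippets below
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F

# sample data (assumes an active SparkSession named `spark`)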
df = spark.createDataFrame([('2019-05-10 7:30:05', '10', '0.5'),\
                            ('2019-05-10 7:30:10', 'UNKNOWN', '0.24'),\
                            ('2019-05-10 7:30:15', '6', 'UNKNOWN'),\
                            ('2019-05-10 7:30:20', '7', 'UNKNOWN'),\
                            ('2019-05-10 7:30:25', '10', '1.1'),\
                            ('2019-05-10 7:30:30', 'UNKNOWN', '1.1'),\
                            ('2019-05-10 7:30:35', 'UNKNOWN', 'UNKNOWN'),\
                            ('2019-05-10 7:30:49', '50', 'UNKNOWN')], ["date", "v1", "v2"])

df = df.withColumn("date", F.col("date").cast("timestamp"))

# schema
root
 |-- date: timestamp (nullable = true)
 |-- v1: string (nullable = true)
 |-- v2: string (nullable = true)

# imputer process / all columns that need filling are strings
def stringReplaceFunc(x, y):
    '''
    this function replaces column values:
    ex: replace 'UNKNOWN' reading with nulls for forward filling function
    : x => source col
    : y => replace value
    '''
    return F.when(x != y, x).otherwise(F.lit(None)) # replace with NULL

# this windows function triggers forward filling for null values created from StringReplaceFunc
window = Window\
.partitionBy(F.month("date"))\
.orderBy('date')\
.rowsBetween(-sys.maxsize, 0)

# here is where I am trying to make a function so I don't have to code each col that needs filled individually
df = df\
.withColumn("v1", stringReplaceFunc(F.col("v1"), "UNKNOWN"))

fill_v1 = F.last(df['v1'], ignorenulls=True).over(window)
df = df.withColumn('v1',  fill_v1)

df = df\
.withColumn("v2", stringReplaceFunc(F.col("v2"), "UNKNOWN"))

fill_v2 = F.last(df['v2'], ignorenulls=True).over(window)
df = df.withColumn('v2',  fill_v2)

# imputing results of the output needed
df.show()

+-------------------+---+----+
|               date| v1|  v2|
+-------------------+---+----+
|2019-05-10 07:30:05| 10| 0.5|
|2019-05-10 07:30:10| 10|0.24|
|2019-05-10 07:30:15|  6|0.24|
|2019-05-10 07:30:20|  7|0.24|
|2019-05-10 07:30:25| 10| 1.1|
|2019-05-10 07:30:30| 10| 1.1|
|2019-05-10 07:30:35| 10| 1.1|
|2019-05-10 07:30:49| 50| 1.1|
+-------------------+---+----+
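For readers who want to check the expected output by hand, the fill rule can be sketched in plain Python without Spark. This is only an illustration of the semantics (carry the last non-missing value forward within an ordered sequence); `forward_fill_values` is a hypothetical helper name, not part of PySpark or of the original post.

```python
def forward_fill_values(values, missing="UNKNOWN"):
    """Replace each missing entry with the most recent non-missing one.

    Hypothetical helper for illustration only: it mimics what
    F.last(..., ignorenulls=True) does over a window that runs from
    the first row up to the current row.
    """
    filled = []
    last_seen = None  # stays None until the first real value appears
    for value in values:
        if value is not None and value != missing:
            last_seen = value
        filled.append(last_seen)
    return filled


# The v1 and v2 columns from the sample data, already ordered by date
v1 = ['10', 'UNKNOWN', '6', '7', '10', 'UNKNOWN', 'UNKNOWN', '50']
v2 = ['0.5', '0.24', 'UNKNOWN', 'UNKNOWN', '1.1', '1.1', 'UNKNOWN', 'UNKNOWN']

print(forward_fill_values(v1))
print(forward_fill_values(v2))
```

The two printed lists correspond to the v1 and v2 columns of the table above. A leading missing value would stay `None`, since there is nothing earlier to carry forward.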


【Answer 1】:

I'm not 100% sure I understood the question correctly, but here is one way to wrap the code you mentioned in a Python function:

def forward_fill(df, col_name):
    df = df.withColumn(col_name, stringReplaceFunc(F.col(col_name), "UNKNOWN"))

    last_func = F.last(df[col_name], ignorenulls=True).over(window)
    df = df.withColumn(col_name,  last_func)
    return df

Then you can call it like this: df = forward_fill(df, 'v1')
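Since the question is about several columns, one possible way to reuse a per-column helper such as `forward_fill` is to fold it over a list of column names. The sketch below assumes only that the helper takes `(df, col_name)` and returns the updated DataFrame; `apply_to_columns` is a hypothetical name introduced here for illustration.

```python
from functools import reduce


def apply_to_columns(df, col_names, transform):
    """Apply transform(df, col_name) once per column, threading df through.

    Hypothetical helper: `transform` is assumed to return the updated
    DataFrame, as forward_fill(df, col_name) does in the answer above.
    """
    return reduce(lambda acc, name: transform(acc, name), col_names, df)


# Usage with the helper from this answer (needs a Spark DataFrame `df`):
# df = apply_to_columns(df, ["v1", "v2"], forward_fill)
```

A plain `for` loop that reassigns `df` on each pass does the same thing; `reduce` just makes the threading of `df` through each call explicit.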

【Comments】:

Thanks! I was able to find a solution using a for loop with a nested function.

Perfect then :)

【Answer 2】:

Here is a working solution

def stringReplaceFunc(x, y):
    return F.when(x != y, x).otherwise(F.lit(None)) # replace with NULL

def forwardFillImputer(df, cols=(), partitioner="date", value="UNKNOWN"):
  # the window does not depend on the column being filled, so build it once
  window = Window\
  .partitionBy(F.month(partitioner))\
  .orderBy(partitioner)\
  .rowsBetween(-sys.maxsize, 0)
  for i in cols:
    # turn the placeholder into NULL, then carry the last non-null value forward
    df = df\
    .withColumn(i, stringReplaceFunc(F.col(i), value))
    fill = F.last(df[i], ignorenulls=True).over(window)
    df = df.withColumn(i,  fill)
  return df
df = forwardFillImputer(df, cols=[i for i in df.columns])
