Efficient forward fill in BigQuery

Posted: 2021-07-29 23:28:53

Question:

I'm trying to forward fill a table in BigQuery, but the query runs out of resources during execution. The table is 2 GB. It looks like this:

with t as (
    select timestamp '2021-05-01 00:00:01' as time, 10 as number union all
    select timestamp '2021-05-01 05:00:01' as time, NULL as number union all
    select timestamp '2021-05-01 23:00:01' as time, 20 as number union all
    select timestamp '2021-05-02 00:00:01' as time, NULL as number union all
    select timestamp '2021-05-02 01:00:01' as time, NULL as number union all 
    select timestamp '2021-05-02 05:00:01' as time, 12 as number
)
time number
2021-05-01 00:00:01 10
2021-05-01 05:00:01 NULL
2021-05-01 23:00:01 20
2021-05-02 00:00:01 NULL
2021-05-02 01:00:01 NULL
2021-05-02 05:00:01 12

The desired output is:

time number
2021-05-01 00:00:01 10
2021-05-01 05:00:01 10
2021-05-01 23:00:01 20
2021-05-02 00:00:01 20
2021-05-02 01:00:01 20
2021-05-02 05:00:01 12

My current solution is:

SELECT time,
LAST_VALUE(number IGNORE NULLS) OVER(ORDER BY time) AS number
FROM t

It throws:

Resources exceeded during query execution: The query could not be executed in the allotted memory.

The problem is the ORDER BY inside the OVER clause. I tried running the query partitioned by day, and it executed successfully:

SELECT time,
LAST_VALUE(number IGNORE NULLS) OVER(PARTITION BY DATETIME_TRUNC(time, day) ORDER BY time) AS number
FROM t
time number
2021-05-01 00:00:01 10
2021-05-01 05:00:01 10
2021-05-01 23:00:01 20
2021-05-02 00:00:01 NULL
2021-05-02 01:00:01 NULL
2021-05-02 05:00:01 12

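The catch is that the result still contains NULLs, although about 500 times fewer than the original table. I'm not sure whether the problem can be solved from this starting point. Is there an efficient way to do this?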

Answer 1:

Try one of the following:

SELECT time, 
NTH_VALUE(number, 1 IGNORE NULLS) OVER(ORDER BY time DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) AS number
FROM t

SELECT time, 
  FIRST_VALUE(number IGNORE NULLS) OVER(ORDER BY time DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) AS number
FROM t    

I don't have a good sample of real data to test with, so this is just a guess.
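
Another option, building on the per-day partitioned query from the question, is a two-stage fill: forward fill within each day, then carry the last non-null value of each earlier day into the rows that are still NULL. The query below is only a sketch under some assumptions: the CTE names (per_day, day_last, day_carry) and the columns day_start, last_number and carried are made up for illustration, days are taken in UTC, and it has not been run against the real 2 GB table. The unpartitioned window in day_carry sees only one row per day, so it should be far smaller than the original window, but that alone does not guarantee it fits in memory.

```sql
WITH t AS (
    SELECT TIMESTAMP '2021-05-01 00:00:01' AS time, 10 AS number UNION ALL
    SELECT TIMESTAMP '2021-05-01 05:00:01', NULL UNION ALL
    SELECT TIMESTAMP '2021-05-01 23:00:01', 20 UNION ALL
    SELECT TIMESTAMP '2021-05-02 00:00:01', NULL UNION ALL
    SELECT TIMESTAMP '2021-05-02 01:00:01', NULL UNION ALL
    SELECT TIMESTAMP '2021-05-02 05:00:01', 12
),
-- Stage 1: forward fill inside each day (the query that already runs).
per_day AS (
    SELECT
        time,
        TIMESTAMP_TRUNC(time, DAY) AS day_start,
        LAST_VALUE(number IGNORE NULLS) OVER (
            PARTITION BY TIMESTAMP_TRUNC(time, DAY) ORDER BY time
        ) AS number
    FROM t
),
-- One row per day: the last non-null value of that day (NULL if the day has none).
day_last AS (
    SELECT
        day_start,
        ARRAY_AGG(number IGNORE NULLS ORDER BY time DESC LIMIT 1)[SAFE_OFFSET(0)] AS last_number
    FROM per_day
    GROUP BY day_start
),
-- Stage 2: for each day, the most recent non-null value from any earlier day.
day_carry AS (
    SELECT
        day_start,
        LAST_VALUE(last_number IGNORE NULLS) OVER (
            ORDER BY day_start
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        ) AS carried
    FROM day_last
)
-- Rows still NULL after stage 1 sit at the start of a day, so they take the carried value.
SELECT
    p.time,
    COALESCE(p.number, c.carried) AS number
FROM per_day AS p
JOIN day_carry AS c USING (day_start)
ORDER BY p.time
```

On the six sample rows this yields 10, 10, 20, 20, 20, 12, which matches the desired output in the question. If a single day is itself too large for one window, the same idea should work with a finer partition such as hour.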
