Identify Missing Timestamp Over BigQuery
[Posted]: 2021-03-10 10:48:06
[Question]: I need to find missing timestamps. The input data looks like this:
Row id date
1 x 2021-01-01 10:00:00 UTC
2 x 2021-01-01 10:03:00 UTC
3 x 2021-01-01 10:05:00 UTC
4 x 2021-01-01 10:08:00 UTC
5 y 2021-01-06 10:05:00 UTC
6 y 2021-01-06 10:07:00 UTC
7 y 2021-01-06 10:10:00 UTC
The output I want lists the timestamps that are missing between each pair of consecutive timestamps for the same id:
1 x 2021-01-01 10:01:00 UTC
2 x 2021-01-01 10:02:00 UTC
3 x 2021-01-01 10:04:00 UTC
4 x 2021-01-01 10:06:00 UTC
5 x 2021-01-01 10:07:00 UTC
6 y 2021-01-06 10:06:00 UTC
7 y 2021-01-06 10:08:00 UTC
8 y 2021-01-06 10:09:00 UTC
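[Answer 1]: Consider the solution below. It uses fewer joins and, most importantly, does not generate a huge array covering every minute between the very first and very last timestamps. Instead it generates small arrays for the missing minutes only. Arrays consume memory and affect query performance.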
select id, missing_date
from (
  select *,
    lag(date) over(partition by id order by date) prev_date
  from `project.dataset.table` t
),
unnest(generate_timestamp_array(
  timestamp_add(prev_date, interval 1 minute),
  timestamp_sub(date, interval 1 minute),
  interval 1 minute
)) missing_date
where timestamp_diff(date, prev_date, minute) > 1
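Applied to the sample data in your question, this returns the eight missing timestamps listed in the desired output above.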
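To check the gap logic without a BigQuery project, the same idea can be sketched in plain Python: sort each id's rows, remember the previous timestamp (the role `lag(date)` plays in the query), and emit only the minutes that lie strictly between two neighbours. This is an illustrative sketch, not BigQuery code; the helper names `ts` and `missing_between_neighbours` are hypothetical, and the rows are the sample data from the question.

```python
from datetime import datetime, timedelta, timezone
from itertools import groupby

STEP = timedelta(minutes=1)


def ts(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as a UTC timestamp (hypothetical helper)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


# Sample rows from the question: (id, date).
ROWS = [
    ("x", ts("2021-01-01 10:00:00")),
    ("x", ts("2021-01-01 10:03:00")),
    ("x", ts("2021-01-01 10:05:00")),
    ("x", ts("2021-01-01 10:08:00")),
    ("y", ts("2021-01-06 10:05:00")),
    ("y", ts("2021-01-06 10:07:00")),
    ("y", ts("2021-01-06 10:10:00")),
]


def missing_between_neighbours(rows, step=STEP):
    """Yield (id, missing_timestamp) for each gap between consecutive rows of one id."""
    ordered = sorted(rows)  # by id, then by timestamp
    for key, group in groupby(ordered, key=lambda row: row[0]):
        prev = None  # like lag(date): no previous row at the start of a partition
        for _, current in group:
            if prev is not None:
                # Same bounds as the query: from prev + 1 step up to current - 1 step.
                candidate = prev + step
                while candidate <= current - step:
                    yield key, candidate
                    candidate += step
            prev = current


result = list(missing_between_neighbours(ROWS))
for key, stamp in result:
    print(key, stamp.strftime("%Y-%m-%d %H:%M:%S UTC"))
```

Only the minutes inside each gap are ever materialised, which is the property the answer relies on to keep the generated arrays small.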
[Answer 2]: Try GENERATE_TIMESTAMP_ARRAY:
with mytable as (
  select 'x' as id, timestamp '2021-01-01 10:00:00 UTC' as date union all
  select 'x', timestamp '2021-01-01 10:03:00 UTC' union all
  select 'x', timestamp '2021-01-01 10:05:00 UTC' union all
  select 'x', timestamp '2021-01-01 10:08:00 UTC' union all
  select 'y', timestamp '2021-01-06 10:05:00 UTC' union all
  select 'y', timestamp '2021-01-06 10:07:00 UTC' union all
  select 'y', timestamp '2021-01-06 10:10:00 UTC'
)
select id, missing.date
from mytable full join (
  select *
  from (
    select id, GENERATE_TIMESTAMP_ARRAY(min(date), max(date), interval 1 minute) as date_array
    from mytable
    group by id
  ), unnest(date_array) as date
) as missing using (id, date)
where mytable.date is null
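The query above builds every minute between the first and last timestamp of each id and keeps the ones with no matching row. A rough Python sketch of that approach follows; it also counts how many timestamps get generated, which makes the cost difference from the neighbour-based approach visible. The names `ts` and `missing_from_full_range` are hypothetical helpers for illustration, and the rows are the sample data from the question.

```python
from datetime import datetime, timedelta, timezone

STEP = timedelta(minutes=1)


def ts(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as a UTC timestamp (hypothetical helper)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


# Sample rows from the question: (id, date).
ROWS = [
    ("x", ts("2021-01-01 10:00:00")),
    ("x", ts("2021-01-01 10:03:00")),
    ("x", ts("2021-01-01 10:05:00")),
    ("x", ts("2021-01-01 10:08:00")),
    ("y", ts("2021-01-06 10:05:00")),
    ("y", ts("2021-01-06 10:07:00")),
    ("y", ts("2021-01-06 10:10:00")),
]


def missing_from_full_range(rows, step=STEP):
    """Return (missing, generated): scan min..max per id and keep absent timestamps."""
    by_id = {}
    for key, stamp in rows:
        by_id.setdefault(key, set()).add(stamp)

    missing = []
    generated = 0  # how many timestamps the full range contains
    for key in sorted(by_id):
        present = by_id[key]
        candidate, last = min(present), max(present)
        while candidate <= last:
            generated += 1
            if candidate not in present:  # the "where mytable.date is null" filter
                missing.append((key, candidate))
            candidate += step
    return missing, generated


missing, generated = missing_from_full_range(ROWS)
print(len(missing), "missing out of", generated, "generated timestamps")
```

On the sample data this scans 15 timestamps to find the 8 missing ones. With long, mostly complete series the full range grows with the total time span, while the neighbour-based approach grows only with the size of the gaps.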