BigQuery - combine fragmented events
【Title】: BigQuery - combine fragmented events 【Posted】: 2020-02-26 10:24:50 【Question】: Here is some sample data:
create table activity as
select "2020-02-25T09:06:12" as datetime_start, "2020-02-25T09:07:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:16:08" as datetime_start, "2020-02-25T09:17:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:17:31" as datetime_start, "2020-02-25T09:27:31" as datetime_end, 1 as flag union all
select "2020-02-25T09:27:31" as datetime_start, "2020-02-25T09:32:41" as datetime_end, 1 as flag union all
select "2020-02-25T09:35:57" as datetime_start, "2020-02-25T09:37:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:49:23" as datetime_start, "2020-02-25T09:51:16" as datetime_end, 0 as flag union all
select "2020-02-25T09:51:16" as datetime_start, "2020-02-25T10:03:46" as datetime_end, 1 as flag union all
select "2020-02-25T10:03:46" as datetime_start, "2020-02-25T10:05:57" as datetime_end, 1 as flag union all
select "2020-02-25T10:05:57" as datetime_start, "2020-02-25T10:07:31" as datetime_end, 1 as flag union all
select "2020-02-25T10:07:31" as datetime_start, "2020-02-25T10:10:22" as datetime_end, 1 as flag union all
select "2020-02-25T10:10:22" as datetime_start, "2020-02-25T10:12:55" as datetime_end, 1 as flag union all
select "2020-02-25T10:12:55" as datetime_start, "2020-02-25T10:20:17" as datetime_end, 1 as flag union all
select "2020-02-25T10:20:17" as datetime_start, "2020-02-25T10:27:40" as datetime_end, 1 as flag union all
select "2020-02-25T10:27:40" as datetime_start, "2020-02-25T10:39:51" as datetime_end, 1 as flag;
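I am looking for a query that computes activity blocks based on the flag column. When the flag is set to 1, that row and the rows that follow it, up to the point where the flag changes back to 0, need to be merged into a single activity block.
The example above yields 6 activity blocks: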
2020-02-25T09:06:12 - 2020-02-25T09:07:31
2020-02-25T09:16:08 - 2020-02-25T09:17:31
2020-02-25T09:17:31 - 2020-02-25T09:32:41
2020-02-25T09:35:57 - 2020-02-25T09:37:31
2020-02-25T09:49:23 - 2020-02-25T09:51:16
2020-02-25T09:51:16 - 2020-02-25T10:39:51
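【Question comments】:
【Answer 1】: This answers the original version of the question.
GMB's answer may work, but it seems tailored to this case, because it hard-codes the flag values. I prefer a more general approach: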
with activity as (
select "2020-02-25T09:06:12" as datetime_start, "2020-02-25T09:07:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:16:08" as datetime_start, "2020-02-25T09:17:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:17:31" as datetime_start, "2020-02-25T09:27:31" as datetime_end, 1 as flag union all
select "2020-02-25T09:27:31" as datetime_start, "2020-02-25T09:32:41" as datetime_end, 1 as flag union all
select "2020-02-25T09:35:57" as datetime_start, "2020-02-25T09:37:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:49:23" as datetime_start, "2020-02-25T09:51:16" as datetime_end, 0 as flag union all
select "2020-02-25T09:51:16" as datetime_start, "2020-02-25T10:03:46" as datetime_end, 1 as flag union all
select "2020-02-25T10:03:46" as datetime_start, "2020-02-25T10:05:57" as datetime_end, 1 as flag union all
select "2020-02-25T10:05:57" as datetime_start, "2020-02-25T10:07:31" as datetime_end, 1 as flag union all
select "2020-02-25T10:07:31" as datetime_start, "2020-02-25T10:10:22" as datetime_end, 1 as flag union all
select "2020-02-25T10:10:22" as datetime_start, "2020-02-25T10:12:55" as datetime_end, 1 as flag union all
select "2020-02-25T10:12:55" as datetime_start, "2020-02-25T10:20:17" as datetime_end, 1 as flag union all
select "2020-02-25T10:20:17" as datetime_start, "2020-02-25T10:27:40" as datetime_end, 1 as flag union all
select "2020-02-25T10:27:40" as datetime_start, "2020-02-25T10:39:51" as datetime_end, 1 as flag
)
select min(datetime_start) as datetime_start,
       max(datetime_end) as datetime_end,
       flag
from (select a.*,
             countif(datetime_start <> prev_datetime_end or
                     prev_flag <> flag
                    ) over (order by datetime_start) as grp
      from (select a.*,
                   lag(flag) over (order by datetime_start) as prev_flag,
                   lag(datetime_end) over (order by datetime_start) as prev_datetime_end
            from activity a
           ) a
     ) t
group by flag, grp
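The grouping logic of this query can be traced outside BigQuery. The following is a minimal Python sketch, not part of the original answer: it emulates lag(), the cumulative countif() and the final group by over the sample rows. The helper name `merge_blocks` and the tuple layout are assumptions made for illustration, and only the time part of each timestamp is kept because every row falls on 2020-02-25.

```python
from itertools import groupby

# Sample rows from the question: (datetime_start, datetime_end, flag).
ROWS = [
    ("09:06:12", "09:07:31", 0),
    ("09:16:08", "09:17:31", 0),
    ("09:17:31", "09:27:31", 1),
    ("09:27:31", "09:32:41", 1),
    ("09:35:57", "09:37:31", 0),
    ("09:49:23", "09:51:16", 0),
    ("09:51:16", "10:03:46", 1),
    ("10:03:46", "10:05:57", 1),
    ("10:05:57", "10:07:31", 1),
    ("10:07:31", "10:10:22", 1),
    ("10:10:22", "10:12:55", 1),
    ("10:12:55", "10:20:17", 1),
    ("10:20:17", "10:27:40", 1),
    ("10:27:40", "10:39:51", 1),
]


def merge_blocks(rows):
    """Emulate lag() + cumulative countif() + group by (hypothetical helper)."""
    labelled = []
    grp = 0
    prev = None
    for start, end, flag in sorted(rows):  # order by datetime_start
        # The counter goes up when the row does not start where the previous
        # row ended, or when the flag differs from the previous flag.
        # On the first row the SQL comparison is NULL, so nothing is counted.
        if prev is not None and (start != prev[1] or flag != prev[2]):
            grp += 1
        labelled.append((grp, start, end, flag))
        prev = (start, end, flag)

    blocks = []
    # grp never decreases, so consecutive grouping matches "group by flag, grp".
    for _, members in groupby(labelled, key=lambda r: r[0]):
        members = list(members)
        blocks.append((
            min(m[1] for m in members),  # min(datetime_start)
            max(m[2] for m in members),  # max(datetime_end)
            members[0][3],               # flag
        ))
    return blocks


if __name__ == "__main__":
    for block in merge_blocks(ROWS):
        print(block)
```

Because the counter also increases on a time gap, two adjacent flag = 0 rows that do not touch in time end up in separate blocks, which is what the expected output in the question shows.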
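【Comments】:
@ronencozen . . . No. I do not revisit questions whose edits invalidate existing answers. The better approach is to ask a new question. I see. Thank you for a great answer to my original question.
【Answer 2】: This is a variant of the gaps-and-islands problem. Here is an approach that uses lag() and a window sum to define groups of consecutive 1s: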
select
    min(datetime_start) datetime_start,
    max(datetime_end) datetime_end,
    flag
from (
    select
        t.*,
        sum(case when flag = 1 and lag_flag = 1 then 0 else 1 end)
            over(order by datetime_start) grp
    from (
        select
            t.*,
            lag(flag) over(order by datetime_start) lag_flag
        from mytable t
    ) t
) t
group by flag, grp
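To make the window sum concrete, here is a minimal Python sketch (not from the original answer; `group_ids` is a hypothetical helper) of the running sum over `case when flag = 1 and lag_flag = 1 then 0 else 1 end`, applied to the flag column of the sample data only.

```python
from itertools import accumulate

# Flag column of the 14 sample rows, in datetime_start order.
FLAGS = [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]


def group_ids(flags):
    """Running sum that stays flat only while a run of 1s continues."""
    # lag(flag): the previous row's flag, with None standing in for NULL.
    lagged = [None] + flags[:-1]
    # Step is 0 when this row and the previous one are both flagged 1,
    # otherwise 1, so every other row opens a new group.
    steps = [0 if (f == 1 and p == 1) else 1 for f, p in zip(flags, lagged)]
    return list(accumulate(steps))


if __name__ == "__main__":
    print(group_ids(FLAGS))
```

This rule looks only at the flag: consecutive flag = 1 rows are merged even when there is a time gap between them, and flag = 0 rows are never merged. The first answer differs in that it also compares datetime_start with the previous row's datetime_end.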
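【Comments】:
Thanks for the quick reply. I plugged your solution into my query, but it does not output what I am looking for. Take activity block number 3, which consists of rows 3 and 4, as an example. Instead of combining them into a single row, I get the original rows back, where datetime_end equals the datetime_end of the previous row. @ronencozen: here is a db fiddle. It seems to produce the correct result for your sample data (this is a MySQL fiddle because, as far as I know, there is no BQ fiddle, but the logic is the same and this is standard SQL). Here is a link to the BigQuery sandbox: cloud.google.com/bigquery/docs/sandbox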