BigQuery - combine fragmented events

【Posted】: 2020-02-26 10:24:50

【Question】:

Here is some sample data:

create table activity as
select "2020-02-25T09:06:12" as datetime_start,  "2020-02-25T09:07:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:16:08" as datetime_start,  "2020-02-25T09:17:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:17:31" as datetime_start,  "2020-02-25T09:27:31" as datetime_end, 1 as flag union all
select "2020-02-25T09:27:31" as datetime_start,  "2020-02-25T09:32:41" as datetime_end, 1 as flag union all
select "2020-02-25T09:35:57" as datetime_start,  "2020-02-25T09:37:31" as datetime_end, 0 as flag union all
select "2020-02-25T09:49:23" as datetime_start,  "2020-02-25T09:51:16" as datetime_end, 0 as flag union all
select "2020-02-25T09:51:16" as datetime_start,  "2020-02-25T10:03:46" as datetime_end, 1 as flag union all
select "2020-02-25T10:03:46" as datetime_start,  "2020-02-25T10:05:57" as datetime_end, 1 as flag union all
select "2020-02-25T10:05:57" as datetime_start,  "2020-02-25T10:07:31" as datetime_end, 1 as flag union all
select "2020-02-25T10:07:31" as datetime_start,  "2020-02-25T10:10:22" as datetime_end, 1 as flag union all
select "2020-02-25T10:10:22" as datetime_start,  "2020-02-25T10:12:55" as datetime_end, 1 as flag union all
select "2020-02-25T10:12:55" as datetime_start,  "2020-02-25T10:20:17" as datetime_end, 1 as flag union all
select "2020-02-25T10:20:17" as datetime_start,  "2020-02-25T10:27:40" as datetime_end, 1 as flag union all
select "2020-02-25T10:27:40" as datetime_start,  "2020-02-25T10:39:51" as datetime_end, 1 as flag;

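I am looking for a query that computes activity blocks based on the flag column. If the flag is set to 1, the rows that follow, up to the point where the flag changes back to 0, need to be combined into a single activity block.

The example above yields 6 activity blocks: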

    2020-02-25T09:06:12 - 2020-02-25T09:07:31
    2020-02-25T09:16:08 - 2020-02-25T09:17:31
    2020-02-25T09:17:31 - 2020-02-25T09:32:41
    2020-02-25T09:35:57 - 2020-02-25T09:37:31
    2020-02-25T09:49:23 - 2020-02-25T09:51:16
    2020-02-25T09:51:16 - 2020-02-25T10:39:51

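【Comments】:

【Answer 1】:

This answers the original version of the question.

GMB's answer may work, but it seems bespoke because it hard-codes the values of the flag. I prefer a more general approach, which starts a new group whenever a row does not begin where the previous row ended or its flag differs from the previous row's flag: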

with activity as (
    select "2020-02-25T09:06:12" as datetime_start,  "2020-02-25T09:07:31" as datetime_end, 0 as flag union all 
    select "2020-02-25T09:16:08" as datetime_start,  "2020-02-25T09:17:31" as datetime_end, 0 as flag union all 
    select "2020-02-25T09:17:31" as datetime_start,  "2020-02-25T09:27:31" as datetime_end, 1 as flag union all 
    select "2020-02-25T09:27:31" as datetime_start,  "2020-02-25T09:32:41" as datetime_end, 1 as flag union all 
    select "2020-02-25T09:35:57" as datetime_start,  "2020-02-25T09:37:31" as datetime_end, 0 as flag union all 
    select "2020-02-25T09:49:23" as datetime_start,  "2020-02-25T09:51:16" as datetime_end, 0 as flag union all 
    select "2020-02-25T09:51:16" as datetime_start,  "2020-02-25T10:03:46" as datetime_end, 1 as flag union all 
    select "2020-02-25T10:03:46" as datetime_start,  "2020-02-25T10:05:57" as datetime_end, 1 as flag union all 
    select "2020-02-25T10:05:57" as datetime_start,  "2020-02-25T10:07:31" as datetime_end, 1 as flag union all 
    select "2020-02-25T10:07:31" as datetime_start,  "2020-02-25T10:10:22" as datetime_end, 1 as flag union all 
    select "2020-02-25T10:10:22" as datetime_start,  "2020-02-25T10:12:55" as datetime_end, 1 as flag union all 
    select "2020-02-25T10:12:55" as datetime_start,  "2020-02-25T10:20:17" as datetime_end, 1 as flag union all 
    select "2020-02-25T10:20:17" as datetime_start,  "2020-02-25T10:27:40" as datetime_end, 1 as flag union all 
    select "2020-02-25T10:27:40" as datetime_start,  "2020-02-25T10:39:51" as datetime_end, 1 as flag
    )
select min(datetime_start) as datetime_start,
       max(datetime_end) as datetime_end,
       flag
from (select a.*,
             countif( datetime_start <> prev_datetime_end OR
                      prev_flag <> flag
                    ) over (order by datetime_start) as grp
       from (select a.*,
                    lag(flag) over (order by datetime_start) as prev_flag,
                    lag(datetime_end) over (order by datetime_start) as prev_datetime_end
             from activity a
            ) a
) t
group by flag, grp
order by 1  -- list the blocks chronologically by their start
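
To check the grouping rule without a BigQuery project, here is a minimal plain-Python sketch of the same idea. It is not BigQuery code and not the answerer's implementation; the names `ROWS` and `merge_blocks` are made up for illustration. It assumes the timestamps are equal-length ISO-8601 strings, so plain string comparison orders them chronologically.

```python
# Sample rows from the question: (datetime_start, datetime_end, flag).
ROWS = [
    ("2020-02-25T09:06:12", "2020-02-25T09:07:31", 0),
    ("2020-02-25T09:16:08", "2020-02-25T09:17:31", 0),
    ("2020-02-25T09:17:31", "2020-02-25T09:27:31", 1),
    ("2020-02-25T09:27:31", "2020-02-25T09:32:41", 1),
    ("2020-02-25T09:35:57", "2020-02-25T09:37:31", 0),
    ("2020-02-25T09:49:23", "2020-02-25T09:51:16", 0),
    ("2020-02-25T09:51:16", "2020-02-25T10:03:46", 1),
    ("2020-02-25T10:03:46", "2020-02-25T10:05:57", 1),
    ("2020-02-25T10:05:57", "2020-02-25T10:07:31", 1),
    ("2020-02-25T10:07:31", "2020-02-25T10:10:22", 1),
    ("2020-02-25T10:10:22", "2020-02-25T10:12:55", 1),
    ("2020-02-25T10:12:55", "2020-02-25T10:20:17", 1),
    ("2020-02-25T10:20:17", "2020-02-25T10:27:40", 1),
    ("2020-02-25T10:27:40", "2020-02-25T10:39:51", 1),
]


def merge_blocks(rows):
    """Merge rows that are contiguous in time and carry the same flag.

    A new block starts when a row's start differs from the previous row's
    end, or when its flag differs from the previous row's flag. This is the
    same condition the query counts with countif(...) over (...).
    """
    blocks = []
    prev_end, prev_flag = None, None
    for start, end, flag in sorted(rows):  # like: order by datetime_start
        if blocks and start == prev_end and flag == prev_flag:
            # Same block: keep its start, push its end forward.
            first_start, last_end, _ = blocks[-1]
            blocks[-1] = (first_start, max(last_end, end), flag)
        else:
            blocks.append((start, end, flag))
        prev_end, prev_flag = end, flag
    return blocks


blocks = merge_blocks(ROWS)
for start, end, flag in blocks:
    print(start, "-", end, "flag =", flag)
```

On the sample data this produces the six blocks listed in the question, with the third block running from 09:17:31 to 09:32:41.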

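【Comments】:

@ronencozen . . . No. If a revised question invalidates the existing answers, I don't look at it. The better approach is to ask a new question.

I see. Thanks for a great answer to my original question.

【Answer 2】: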

This is a variant of the gaps-and-islands problem. Here is an approach that uses lag() and a window sum to define groups of consecutive 1s:

select
    min(datetime_start) datetime_start,
    max(datetime_end) datetime_end,
    flag
from (
    select
        t.*,
        sum(case when flag = 1 and lag_flag = 1 then 0 else 1 end) 
            over(order by datetime_start) grp
    from (
        select 
            t.*,
            lag(flag) over(order by datetime_start) lag_flag
        from mytable t
    ) t
) t
group by flag, grp
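
To see what the running sum assigns to each row, here is a small Python sketch of just the `grp` computation. It is an illustration, not BigQuery code; `FLAGS` and `group_ids` are hypothetical names, and `FLAGS` is simply the flag column of the sample data in datetime_start order.

```python
# Flags of the 14 sample rows, in datetime_start order.
FLAGS = [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]


def group_ids(flags):
    """Running sum that stays flat only while a 1 follows a 1."""
    ids = []
    grp = 0
    lag_flag = None  # lag() yields NULL for the first row
    for flag in flags:
        # Mirrors: sum(case when flag = 1 and lag_flag = 1 then 0 else 1 end)
        grp += 0 if (flag == 1 and lag_flag == 1) else 1
        ids.append(grp)
        lag_flag = flag
    return ids


ids = group_ids(FLAGS)
print(ids)                 # [1, 2, 3, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6]
print(len(set(ids)), "groups")  # 6 groups
```

Every flag = 0 row gets its own group, and each run of 1s shares one, which gives the six blocks expected for the sample. Note that this rule looks only at the flag: two flag = 1 rows separated by a gap in time would still land in the same group, whereas the first answer's comparison of datetime_start with the previous datetime_end would split them.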

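【Comments】:

Thanks for the quick reply. I plugged your solution into my query, but it does not output what I am looking for. Take activity block number 3, which is made up of rows 3 and 4, as an example. Instead of being combined into a single row, the original rows come back, where datetime_end equals the datetime_end of the previous row.

@ronencozen: here is a db fiddle. It seems to produce the correct results for your sample data (this is a MySQL fiddle, since as far as I know there is no BQ fiddle, but the logic is the same and this is standard SQL).

Here is a link to the BigQuery sandbox: cloud.google.com/bigquery/docs/sandbox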
