BigQuery SQL: Rolling count distinct bounded between two conditions

【Posted】: 2019-10-12 07:48:14

【Question】:

I am trying to find a rolling distinct count of ip_var between two events (stored in two different columns) in BigQuery SQL.

For example, I have a table:

id  TIME_STAMP  event_1 event_2 ip_var
A   1               0   0         1
A   2               1   0         1
A   2               0   0         2
A   3               0   0         2
A   4               0   0         3
A   5               0   1         4
A   6               0   0         1
A   7               0   0         1
B   1               0   0         2
B   2               0   0         2
B   2               1   0         3
B   3               0   0         3
B   4               0   0         3
B   4               0   1         4
B   6               0   0         5
B   7               0   0         6

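For each id, I need the distinct count of ip_var from the point where event_1 occurs until event_2 occurs. event_2 is always guaranteed to occur after event_1.

I tried to solve this with a rolling count, but without much success.

The final output should look like: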

id  bounded_count
A   2
B   1

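【Comments】:

Not sure about your logic - I clearly count 4 distinct ip_var values for A and 2 for B. So do you exclude the start and the end of the island from the count? You also have duplicate timestamps for both A and B; how do you know which row comes first in that case? Finally, in your real table, is there only one event_1/event_2 pair per id (as in your example), or several?

@MikhailBerlyant Thanks, all good questions. On the duplicate timestamps: there is no way to tell which row should come first, so a compromise is acceptable here. On the last question: there can only be one pair per id, as in the example. (But the multi-pair case is very interesting and might deserve a question of its own. I can try it on a dummy dataset.)

@MikhailBerlyant On the first question about excluding them: yes, I am excluding them.

Got it - adding my answer to the pile :o)

【Answer 1】:

Below is for BigQuery Standard SQL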

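#standardSQL
SELECT id, COUNT(DISTINCT ip_var) bounded_count
FROM (
  SELECT *,
    -- grp: number of event_1 rows strictly before the current row (island number)
    COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) grp,
    -- qualify: TRUE while an event_1 has been seen but its event_2 has not;
    -- the event_1 and event_2 rows themselves evaluate to FALSE.
    -- Backtick-quoted because QUALIFY is a reserved keyword in BigQuery.
    COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) != COUNTIF(event_2 = 1) OVER(win) AS `qualify`
  FROM `project.dataset.table`
  -- rows that share a time_stamp have no defined relative order
  WINDOW win AS (PARTITION BY id ORDER BY time_stamp)
)
WHERE `qualify`
GROUP BY id, grp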

Applied to the sample data from your question, the result is

Row id  bounded_count 
1   A   2
2   B   1   
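
To see why the filter keeps only the rows between the two events, it can help to look at the two running counts side by side. The query below is a minimal sketch, not part of the original answer: the CTE name `sample` and the column aliases `starts_before` and `ends_so_far` are made up for illustration, and the data is a reduced copy of id A without the duplicate timestamp, so the row order is unambiguous.

```sql
#standardSQL
-- Hypothetical, reduced sample (id A only, no duplicate timestamps)
WITH sample AS (
  SELECT 'A' AS id, 1 AS time_stamp, 0 AS event_1, 0 AS event_2, 1 AS ip_var UNION ALL
  SELECT 'A', 2, 1, 0, 1 UNION ALL
  SELECT 'A', 3, 0, 0, 2 UNION ALL
  SELECT 'A', 4, 0, 0, 3 UNION ALL
  SELECT 'A', 5, 0, 1, 4 UNION ALL
  SELECT 'A', 6, 0, 0, 1
)
SELECT
  id, time_stamp, event_1, event_2, ip_var,
  -- event_1 rows seen strictly before the current row
  COUNTIF(event_1 = 1) OVER (win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS starts_before,
  -- event_2 rows seen up to and including the current row
  COUNTIF(event_2 = 1) OVER (win) AS ends_so_far
FROM sample
WINDOW win AS (PARTITION BY id ORDER BY time_stamp)
ORDER BY id, time_stamp
```

For time_stamp 3 and 4, starts_before should already be 1 while ends_so_far is still 0, so the two counts differ and the rows are kept. On the event_2 row (time_stamp 5) both counts are 1, and on the event_1 row (time_stamp 2) both are 0, which is why the boundary rows drop out.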

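Note: the above solution also works if you have multiple qualifying pairs, as in the example below (same code, I just added more rows to the sample data)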

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'A' id, 1 time_stamp, 0 event_1, 0 event_2, 1 ip_var UNION ALL
  SELECT 'A', 2, 1, 0, 1 UNION ALL
  SELECT 'A', 2, 0, 0, 2 UNION ALL
  SELECT 'A', 3, 0, 0, 2 UNION ALL
  SELECT 'A', 4, 0, 0, 3 UNION ALL
  SELECT 'A', 5, 0, 1, 4 UNION ALL
  SELECT 'A', 6, 0, 0, 1 UNION ALL
  SELECT 'A', 7, 0, 0, 1 UNION ALL

  SELECT 'A', 12, 1, 0, 1 UNION ALL
  SELECT 'A', 13, 0, 0, 2 UNION ALL
  SELECT 'A', 14, 0, 0, 3 UNION ALL
  SELECT 'A', 15, 0, 0, 4 UNION ALL
  SELECT 'A', 16, 0, 0, 5 UNION ALL
  SELECT 'A', 17, 0, 1, 1 UNION ALL
  SELECT 'A', 18, 0, 0, 1 UNION ALL

  SELECT 'B', 1, 0, 0, 2 UNION ALL
  SELECT 'B', 2, 0, 0, 2 UNION ALL
  SELECT 'B', 2, 1, 0, 3 UNION ALL
  SELECT 'B', 3, 0, 0, 3 UNION ALL
  SELECT 'B', 4, 0, 0, 3 UNION ALL
  SELECT 'B', 5, 0, 1, 4 UNION ALL
  SELECT 'B', 6, 0, 0, 5 UNION ALL
  SELECT 'B', 7, 0, 0, 6 
)
SELECT id, COUNT(DISTINCT ip_var) bounded_count, grp
FROM (
  SELECT *,
    COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) grp,
    -- backtick-quoted because QUALIFY is a reserved keyword in BigQuery
    COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) != COUNTIF(event_2 = 1) OVER(win) AS `qualify`
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY id ORDER BY time_stamp)
)
WHERE `qualify`
GROUP BY id, grp

Result

Row id  bounded_count   grp  
1   A   2               1    
2   A   4               2    
3   B   1               1    


【Answer 2】:

Hmm... You can use window functions to compute the timestamp of each event. The rest is just filtering and aggregation:

WITH t as (
      SELECT "A" as id, 1 as time_stamp, 0 as event_1, 0 as event_2, 1 as ip_var UNION ALL
      SELECT "A", 2,  1, 0, 1 UNION ALL
      SELECT "A", 2,  0, 0, 2 UNION ALL
      SELECT "A", 3,  0, 0, 2 UNION ALL
      SELECT "A", 4,  0, 0, 3 UNION ALL
      SELECT "A", 5,  0, 1, 4 UNION ALL
      SELECT "A", 6,  0, 0, 1 UNION ALL
      SELECT "A", 7,  0, 0, 1 UNION ALL
      SELECT "B", 1,  0, 0, 2 UNION ALL
      SELECT "B", 2,  0, 0, 2 UNION ALL
      SELECT "B", 2,  1, 0, 3 UNION ALL
      SELECT "B", 3,  0, 0, 3 UNION ALL
      SELECT "B", 4,  0, 0, 3 UNION ALL
      SELECT "B", 4,  0, 1, 4 UNION ALL
      SELECT "B", 6,  0, 0, 5 UNION ALL
      SELECT "B", 7,  0, 0, 6
     )
select id, count(distinct ip_var) as bounded_count
from (select t.*,
             -- per-id bounds; min/max assume a single event_1/event_2 pair per id
             min(case when event_1 = 1 then time_stamp end) over (partition by id) as timestamp_1,
             max(case when event_2 = 1 then time_stamp end) over (partition by id) as timestamp_2
      from t
     ) t
-- strict comparisons drop every row at the boundary timestamps, including the event rows
where time_stamp > timestamp_1 and time_stamp < timestamp_2
group by id
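
One side effect of filtering in the WHERE clause is that an id with no rows strictly between its two events disappears from the result. If such ids should be reported with a count of 0 instead, one possible variation (a sketch, not the answerer's query) moves the condition inside the aggregate. The id 'C' and its values are hypothetical rows added only to show that case.

```sql
#standardSQL
-- Hypothetical sample: id 'C' has event_1 and event_2 on adjacent timestamps
WITH t AS (
  SELECT 'A' AS id, 2 AS time_stamp, 1 AS event_1, 0 AS event_2, 1 AS ip_var UNION ALL
  SELECT 'A', 3, 0, 0, 2 UNION ALL
  SELECT 'A', 4, 0, 0, 3 UNION ALL
  SELECT 'A', 5, 0, 1, 4 UNION ALL
  SELECT 'C', 1, 1, 0, 7 UNION ALL
  SELECT 'C', 2, 0, 1, 8
)
SELECT
  id,
  -- rows outside the open interval contribute NULL, which COUNT(DISTINCT ...) ignores
  COUNT(DISTINCT IF(time_stamp > timestamp_1 AND time_stamp < timestamp_2, ip_var, NULL)) AS bounded_count
FROM (
  SELECT t.*,
    MIN(IF(event_1 = 1, time_stamp, NULL)) OVER (PARTITION BY id) AS timestamp_1,
    MAX(IF(event_2 = 1, time_stamp, NULL)) OVER (PARTITION BY id) AS timestamp_2
  FROM t
)
GROUP BY id
```

With this sample, A should come back with 2 and C with 0, whereas the WHERE-based form would return no row for C at all.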


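【Answer 3】:

One approach is:

    1. Find the start_time and end_time for each id
    2. For each id, filter out the events that fall outside the counting window
    3. Count the distinct ip_var values

To print the intermediate steps, I used temp tables to demonstrate the idea. You should turn the second temp table, id_start_end, into a WITH clause for efficiency.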

CREATE TEMP TABLE t as 
SELECT "A" id, 1 time_stamp, 0 event_1, 0 event_2, 1 ip_var UNION ALL
SELECT "A", 2,  1, 0, 1 UNION ALL
SELECT "A", 2,  0, 0, 2 UNION ALL
SELECT "A", 3,  0, 0, 2 UNION ALL
SELECT "A", 4,  0, 0, 3 UNION ALL
SELECT "A", 5,  0, 1, 4 UNION ALL
SELECT "A", 6,  0, 0, 1 UNION ALL
SELECT "A", 7,  0, 0, 1 UNION ALL
SELECT "B", 1,  0, 0, 2 UNION ALL
SELECT "B", 2,  0, 0, 2 UNION ALL
SELECT "B", 2,  1, 0, 3 UNION ALL
SELECT "B", 3,  0, 0, 3 UNION ALL
SELECT "B", 4,  0, 0, 3 UNION ALL
SELECT "B", 4,  0, 1, 4 UNION ALL
SELECT "B", 6,  0, 0, 5 UNION ALL
SELECT "B", 7,  0, 0, 6;

-- One row per id with the timestamps of its event_1 and event_2 rows
CREATE TEMP TABLE id_start_end AS
SELECT ids.id, t_start.time_stamp as start_time, t_end.time_stamp as end_time FROM
(SELECT DISTINCT id FROM t) ids 
JOIN t AS t_start ON ids.id = t_start.id AND t_start.event_1 = 1
JOIN t AS t_end ON ids.id = t_end.id AND t_end.event_2 = 1;

-- Inspect the intermediate table
SELECT * FROM id_start_end;

-- Count distinct ip_var strictly between start_time and end_time
SELECT t.id, COUNT(DISTINCT ip_var)
FROM t JOIN id_start_end 
  ON t.id = id_start_end.id
    AND t.time_stamp < id_start_end.end_time
    AND t.time_stamp > id_start_end.start_time
GROUP BY t.id

Output of table id_start_end:

+----+------------+----------+
| id | start_time | end_time |
+----+------------+----------+
| A  |          2 |        5 |
| B  |          2 |        4 |
+----+------------+----------+

Final output:

+----+-----+
| id | f0_ |
+----+-----+
| B  |   1 |
| A  |   2 |
+----+-----+
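
The WITH-clause form recommended above could look roughly like the sketch below. It keeps the same joins and the same sample rows as the temp-table version; the only additions are the `bounded_count` alias and the final ORDER BY, which are not in the original and are there just to make the output easier to read.

```sql
#standardSQL
WITH t AS (
  SELECT "A" AS id, 1 AS time_stamp, 0 AS event_1, 0 AS event_2, 1 AS ip_var UNION ALL
  SELECT "A", 2, 1, 0, 1 UNION ALL
  SELECT "A", 2, 0, 0, 2 UNION ALL
  SELECT "A", 3, 0, 0, 2 UNION ALL
  SELECT "A", 4, 0, 0, 3 UNION ALL
  SELECT "A", 5, 0, 1, 4 UNION ALL
  SELECT "A", 6, 0, 0, 1 UNION ALL
  SELECT "A", 7, 0, 0, 1 UNION ALL
  SELECT "B", 1, 0, 0, 2 UNION ALL
  SELECT "B", 2, 0, 0, 2 UNION ALL
  SELECT "B", 2, 1, 0, 3 UNION ALL
  SELECT "B", 3, 0, 0, 3 UNION ALL
  SELECT "B", 4, 0, 0, 3 UNION ALL
  SELECT "B", 4, 0, 1, 4 UNION ALL
  SELECT "B", 6, 0, 0, 5 UNION ALL
  SELECT "B", 7, 0, 0, 6
),
-- Same logic as the id_start_end temp table, expressed as a CTE
id_start_end AS (
  SELECT ids.id, t_start.time_stamp AS start_time, t_end.time_stamp AS end_time
  FROM (SELECT DISTINCT id FROM t) ids
  JOIN t AS t_start ON ids.id = t_start.id AND t_start.event_1 = 1
  JOIN t AS t_end ON ids.id = t_end.id AND t_end.event_2 = 1
)
SELECT t.id, COUNT(DISTINCT t.ip_var) AS bounded_count
FROM t
JOIN id_start_end
  ON t.id = id_start_end.id
    AND t.time_stamp > id_start_end.start_time
    AND t.time_stamp < id_start_end.end_time
GROUP BY t.id
ORDER BY t.id
```

On the sample data this should give the same counts as the temp-table version (A: 2, B: 1). Like that version, it assumes exactly one event_1/event_2 pair per id; with several pairs the joins would produce every start/end combination.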
