BigQuery SQL: Rolling count distinct bounded between two conditions
Posted: 2019-10-12 07:48:14
Question:
I am trying to find the rolling count distinct of ip_var between two events (stored in two different columns) in BigQuery SQL.
For example, I have a table:
id TIME_STAMP event_1 event_2 ip_var
A 1 0 0 1
A 2 1 0 1
A 2 0 0 2
A 3 0 0 2
A 4 0 0 3
A 5 0 1 4
A 6 0 0 1
A 7 0 0 1
B 1 0 0 2
B 2 0 0 2
B 2 1 0 3
B 3 0 0 3
B 4 0 0 3
B 4 0 1 4
B 6 0 0 5
B 7 0 0 6
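For each id, I need the count distinct of ip_var from the point where event_1 occurs until event_2 occurs. event_2 is always guaranteed to occur after event_1.
I tried to solve this with a rolling count, but without much success.
The final output should look like: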
id bounded_count
A 2
B 1
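To pin down the expected output, here is a minimal Python sketch (not BigQuery code) that scans the sample rows in the order listed and collects the ip_var values strictly between the event_1 row and the event_2 row of each id. The function name bounded_counts and the tuple layout are illustrative assumptions, and the sketch assumes a single event pair per id with both event rows excluded from the count.

```python
# Sample rows from the question: (id, time_stamp, event_1, event_2, ip_var)
ROWS = [
    ("A", 1, 0, 0, 1), ("A", 2, 1, 0, 1), ("A", 2, 0, 0, 2), ("A", 3, 0, 0, 2),
    ("A", 4, 0, 0, 3), ("A", 5, 0, 1, 4), ("A", 6, 0, 0, 1), ("A", 7, 0, 0, 1),
    ("B", 1, 0, 0, 2), ("B", 2, 0, 0, 2), ("B", 2, 1, 0, 3), ("B", 3, 0, 0, 3),
    ("B", 4, 0, 0, 3), ("B", 4, 0, 1, 4), ("B", 6, 0, 0, 5), ("B", 7, 0, 0, 6),
]


def bounded_counts(rows):
    """Count distinct ip_var strictly between the event_1 and event_2 rows of each id."""
    inside = {}  # id -> True while we are between the two events
    seen = {}    # id -> set of ip_var values collected so far
    for row_id, _ts, event_1, event_2, ip_var in rows:
        if event_1 == 1:
            inside[row_id] = True       # start counting after this row
            seen.setdefault(row_id, set())
        elif event_2 == 1:
            inside[row_id] = False      # stop; the closing row is excluded
        elif inside.get(row_id):
            seen[row_id].add(ip_var)
    return {row_id: len(ips) for row_id, ips in seen.items()}


print(bounded_counts(ROWS))  # {'A': 2, 'B': 1}
```

Because the scan relies on the listed row order, rows that share a time_stamp are handled in whatever order the input provides.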
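Comments:
Not sure about your logic: I clearly count 4 distinct ip_var for A and 2 distinct ip_var for B. So are you excluding the start and end of the island from your logic? You also have duplicate timestamps for both A and B; how do you know which row comes first in that case? Finally, in your real table, does each id have only one pair of event_1 and event_2 (as in your example) or several?
@MikhailBerlyant Thanks. All good questions. On the duplicate timestamps: there is no way to determine which row should go first, so a compromise is acceptable here. On the last question: there can be only one pair per id, as in the example. (But the multi-pair case is very interesting and could be a question of its own. I can try it on a dummy dataset.)
@MikhailBerlyant On the first question about excluding them: yes, I am excluding them.
Got it. Added my answer to the pile :o)
Answer 1:
Below is for BigQuery Standard SQL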
#standardSQL
SELECT id, COUNT(DISTINCT ip_var) AS bounded_count
FROM (
  SELECT *,
    -- grp: number of event_1 rows strictly before the current row
    COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS grp,
    -- is_between: TRUE while an event_1 has been seen but its event_2 has not
    COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) != COUNTIF(event_2 = 1) OVER(win) AS is_between
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY id ORDER BY time_stamp)
)
WHERE is_between
GROUP BY id, grp
When applied to the sample data from your question, the result is
Row id bounded_count
1 A 2
2 B 1
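One way to read the query above: grp counts the event_1 rows strictly before the current row, while COUNTIF(event_2 = 1) OVER(win) falls back to the default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so it also counts rows tied on time_stamp. A row is kept while the two counters differ. The Python sketch below mimics that logic on the question's sample data; it is an illustration rather than BigQuery code, and the function name windowed_counts is made up for this example. It assumes tied rows stay in their listed order, which BigQuery does not guarantee.

```python
from collections import defaultdict

# Sample rows from the question: (id, time_stamp, event_1, event_2, ip_var)
ROWS = [
    ("A", 1, 0, 0, 1), ("A", 2, 1, 0, 1), ("A", 2, 0, 0, 2), ("A", 3, 0, 0, 2),
    ("A", 4, 0, 0, 3), ("A", 5, 0, 1, 4), ("A", 6, 0, 0, 1), ("A", 7, 0, 0, 1),
    ("B", 1, 0, 0, 2), ("B", 2, 0, 0, 2), ("B", 2, 1, 0, 3), ("B", 3, 0, 0, 3),
    ("B", 4, 0, 0, 3), ("B", 4, 0, 1, 4), ("B", 6, 0, 0, 5), ("B", 7, 0, 0, 6),
]


def windowed_counts(rows):
    """Mimic the two windowed counters and the final GROUP BY id, grp."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row[0]].append(row)

    result = {}
    for row_id, part in by_id.items():
        # Stable sort: rows sharing a time_stamp keep their listed order (assumption).
        part = sorted(part, key=lambda r: r[1])
        groups = defaultdict(set)
        starts_before = 0  # event_1 rows strictly before the current row (ROWS frame)
        for _id, ts, event_1, _event_2, ip_var in part:
            grp = starts_before
            # Default RANGE frame: every row with time_stamp <= ts counts, ties included.
            ends_so_far = sum(1 for r in part if r[1] <= ts and r[3] == 1)
            if grp != ends_so_far:  # the filter flag from the query
                groups[grp].add(ip_var)
            starts_before += event_1
        for grp, ips in groups.items():
            result[(row_id, grp)] = len(ips)
    return result


print(windowed_counts(ROWS))  # {('A', 1): 2, ('B', 1): 1}
```

For id B, both rows at time_stamp 4 are dropped because the RANGE frame already sees the event_2 row that shares their timestamp.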
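Note: the above solution also works if you have multiple qualifying pairs, as in the example below (same code; I just added more rows to the sample data)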
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'A' id, 1 time_stamp, 0 event_1, 0 event_2, 1 ip_var UNION ALL
SELECT 'A', 2, 1, 0, 1 UNION ALL
SELECT 'A', 2, 0, 0, 2 UNION ALL
SELECT 'A', 3, 0, 0, 2 UNION ALL
SELECT 'A', 4, 0, 0, 3 UNION ALL
SELECT 'A', 5, 0, 1, 4 UNION ALL
SELECT 'A', 6, 0, 0, 1 UNION ALL
SELECT 'A', 7, 0, 0, 1 UNION ALL
SELECT 'A', 12, 1, 0, 1 UNION ALL
SELECT 'A', 13, 0, 0, 2 UNION ALL
SELECT 'A', 14, 0, 0, 3 UNION ALL
SELECT 'A', 15, 0, 0, 4 UNION ALL
SELECT 'A', 16, 0, 0, 5 UNION ALL
SELECT 'A', 17, 0, 1, 1 UNION ALL
SELECT 'A', 18, 0, 0, 1 UNION ALL
SELECT 'B', 1, 0, 0, 2 UNION ALL
SELECT 'B', 2, 0, 0, 2 UNION ALL
SELECT 'B', 2, 1, 0, 3 UNION ALL
SELECT 'B', 3, 0, 0, 3 UNION ALL
SELECT 'B', 4, 0, 0, 3 UNION ALL
SELECT 'B', 5, 0, 1, 4 UNION ALL
SELECT 'B', 6, 0, 0, 5 UNION ALL
SELECT 'B', 7, 0, 0, 6
)
SELECT id, COUNT(DISTINCT ip_var) AS bounded_count, grp
FROM (
  SELECT *,
    COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS grp,
    COUNTIF(event_1 = 1) OVER(win ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) != COUNTIF(event_2 = 1) OVER(win) AS is_between
  FROM `project.dataset.table`
  WINDOW win AS (PARTITION BY id ORDER BY time_stamp)
)
WHERE is_between
GROUP BY id, grp
Result
Row id bounded_count grp
1 A 2 1
2 A 4 2
3 B 1 1
Answer 2:
Hmmm . . . You can use window functions to get the timestamp of each event. The rest is just filtering and aggregation:
WITH t as (
SELECT "A" as id, 1 as time_stamp, 0 as event_1, 0 as event_2, 1 as ip_var UNION ALL
SELECT "A", 2, 1, 0, 1 UNION ALL
SELECT "A", 2, 0, 0, 2 UNION ALL
SELECT "A", 3, 0, 0, 2 UNION ALL
SELECT "A", 4, 0, 0, 3 UNION ALL
SELECT "A", 5, 0, 1, 4 UNION ALL
SELECT "A", 6, 0, 0, 1 UNION ALL
SELECT "A", 7, 0, 0, 1 UNION ALL
SELECT "B", 1, 0, 0, 2 UNION ALL
SELECT "B", 2, 0, 0, 2 UNION ALL
SELECT "B", 2, 1, 0, 3 UNION ALL
SELECT "B", 3, 0, 0, 3 UNION ALL
SELECT "B", 4, 0, 0, 3 UNION ALL
SELECT "B", 4, 0, 1, 4 UNION ALL
SELECT "B", 6, 0, 0, 5 UNION ALL
SELECT "B", 7, 0, 0, 6
)
select id, count(distinct ip_var) as bounded_count
from (select t.*,
             min(case when event_1 = 1 then time_stamp end) over (partition by id) as timestamp_1,
             max(case when event_2 = 1 then time_stamp end) over (partition by id) as timestamp_2
      from t
     ) t
where time_stamp > timestamp_1 and time_stamp < timestamp_2
group by id
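The filtering step of this answer can be sketched in plain Python as a cross-check. This is an illustration only, not BigQuery code; the function name counts_between_timestamps is hypothetical. It keeps the rows whose time_stamp lies strictly between the earliest event_1 timestamp and the latest event_2 timestamp of the same id.

```python
# Sample rows from the question: (id, time_stamp, event_1, event_2, ip_var)
ROWS = [
    ("A", 1, 0, 0, 1), ("A", 2, 1, 0, 1), ("A", 2, 0, 0, 2), ("A", 3, 0, 0, 2),
    ("A", 4, 0, 0, 3), ("A", 5, 0, 1, 4), ("A", 6, 0, 0, 1), ("A", 7, 0, 0, 1),
    ("B", 1, 0, 0, 2), ("B", 2, 0, 0, 2), ("B", 2, 1, 0, 3), ("B", 3, 0, 0, 3),
    ("B", 4, 0, 0, 3), ("B", 4, 0, 1, 4), ("B", 6, 0, 0, 5), ("B", 7, 0, 0, 6),
]


def counts_between_timestamps(rows):
    """Count distinct ip_var with min(event_1 ts) < time_stamp < max(event_2 ts) per id."""
    first_start = {}  # id -> smallest time_stamp where event_1 = 1
    last_end = {}     # id -> largest time_stamp where event_2 = 1
    for row_id, ts, event_1, event_2, _ip in rows:
        if event_1 == 1:
            first_start[row_id] = min(ts, first_start.get(row_id, ts))
        if event_2 == 1:
            last_end[row_id] = max(ts, last_end.get(row_id, ts))

    seen = {}
    for row_id, ts, _e1, _e2, ip_var in rows:
        if row_id in first_start and row_id in last_end:
            # Strict comparisons drop every row tied with either event's time_stamp.
            if first_start[row_id] < ts < last_end[row_id]:
                seen.setdefault(row_id, set()).add(ip_var)
    return {row_id: len(ips) for row_id, ips in seen.items()}


print(counts_between_timestamps(ROWS))  # {'A': 2, 'B': 1}
```

Since only one start and one end are kept per id, this approach fits the single-pair case stated in the comments; with several pairs it would treat everything between the first event_1 and the last event_2 as one window.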
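Answer 3:
One approach is:
- Find the start_time and end_time for each ID
- For each ID, filter out the events that fall outside the counting window
- Count the distinct ip_var values
To print the intermediate steps, I used temporary tables to demonstrate the idea. For efficiency, you should turn the second temporary table, id_start_end, into a WITH clause.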
CREATE TEMP TABLE t as
SELECT "A" id, 1 time_stamp, 0 event_1, 0 event_2, 1 ip_var UNION ALL
SELECT "A", 2, 1, 0, 1 UNION ALL
SELECT "A", 2, 0, 0, 2 UNION ALL
SELECT "A", 3, 0, 0, 2 UNION ALL
SELECT "A", 4, 0, 0, 3 UNION ALL
SELECT "A", 5, 0, 1, 4 UNION ALL
SELECT "A", 6, 0, 0, 1 UNION ALL
SELECT "A", 7, 0, 0, 1 UNION ALL
SELECT "B", 1, 0, 0, 2 UNION ALL
SELECT "B", 2, 0, 0, 2 UNION ALL
SELECT "B", 2, 1, 0, 3 UNION ALL
SELECT "B", 3, 0, 0, 3 UNION ALL
SELECT "B", 4, 0, 0, 3 UNION ALL
SELECT "B", 4, 0, 1, 4 UNION ALL
SELECT "B", 6, 0, 0, 5 UNION ALL
SELECT "B", 7, 0, 0, 6;
CREATE TEMP TABLE id_start_end AS
SELECT ids.id, t_start.time_stamp as start_time, t_end.time_stamp as end_time FROM
  (SELECT DISTINCT id FROM t) ids
  JOIN t AS t_start ON ids.id = t_start.id AND t_start.event_1 = 1
  JOIN t AS t_end ON ids.id = t_end.id AND t_end.event_2 = 1;
SELECT * FROM id_start_end;
SELECT t.id, COUNT(DISTINCT ip_var)
FROM t JOIN id_start_end
  ON t.id = id_start_end.id
  AND t.time_stamp < id_start_end.end_time
  AND t.time_stamp > id_start_end.start_time
GROUP BY t.id
Output of table id_start_end:
+----+------------+----------+
| id | start_time | end_time |
+----+------------+----------+
| A | 2 | 5 |
| B | 2 | 4 |
+----+------------+----------+
Final output:
+----+-----+
| id | f0_ |
+----+-----+
| B | 1 |
| A | 2 |
+----+-----+