How to sum multiple counts across different tables with different filters
【Original title】: how to sum of multiple count on different table and different filter 【Posted】: 2020-02-03 04:51:48
【Question】: I want to get box_id, date, hour, and the sum of several count(id) values taken from different tables. Each table is filtered on a different status, but all of them share the same box_id.
Example
table 1 (filter by status = finished)
id  box_id  date                         status
1   20      2019-01-01 01:00:00.000 UTC  finished
2   21      2019-01-01 02:00:00.000 UTC  finished
3   21      2019-01-01 01:00:00.000 UTC  unfinished
table 2 (filter by status = start)
id  box_id  date                         status
1   21      2019-01-01 01:00:00.000 UTC  start
2   22      2019-01-01 02:00:00.000 UTC  end
3   23      2019-01-01 01:00:00.000 UTC  start
4   24      2019-01-01 01:00:00.000 UTC  start
table 3 (filter by status = close)
id  box_id  date                         status
1   21      2019-01-01 03:00:00.000 UTC  close
2   22      2019-01-01 02:00:00.000 UTC  end
3   24      2019-01-01 01:00:00.000 UTC  close
Result that I want:
box_id  date        hour  count
20      2019-01-01  1     1
21      2019-01-01  1     1
21      2019-01-01  2     1
21      2019-01-01  3     1
23      2019-01-01  1     1
24      2019-01-01  1     2
This is the query that works for table 1. How can I get the combined result for all three tables in a single result set?
select box_id,
date(date_update),
EXTRACT(hour FROM date_update) as hourly,
count(id)
from table1
where status = "finished"
group by box_id, date(date_update), EXTRACT(hour FROM date_update)
Hour format: 0 - 23
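To make the grouping key concrete, here is a minimal Python sketch of what DATE(...) and EXTRACT(HOUR FROM ...) yield for one of the sample timestamps. The helper name `split_timestamp` is hypothetical, and the format string assumes every value ends in a literal " UTC", as in the tables above.

```python
from datetime import datetime


def split_timestamp(stamp):
    """Return (date, hour) for a string such as '2019-01-01 01:00:00.000 UTC'."""
    # Assumption: the value always carries milliseconds and a literal ' UTC' suffix.
    moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f UTC")
    # The hour is an integer in the range 0 - 23, matching the format asked for.
    return moment.date().isoformat(), moment.hour


if __name__ == "__main__":
    print(split_timestamp("2019-01-01 01:00:00.000 UTC"))
```

Rows from any of the three tables that map to the same (box_id, date, hour) triple belong to the same output row.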
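【Comments】:
Replace these table layouts with CREATE TABLE statements. Show the source data that must produce this result (as INSERT INTO statements; the counts will of course be smaller because of the limited number of records). PS. Your query should already give the desired result... what is the problem?
I have edited my question. Each table has its own unique id, and I want the sum of the count(id) values across the 3 tables, based on the filter applied to each table as shown above.
box_id is present in every table, and that field exists in all three tables, correct? Also, you want to count per box_id in each table and then add those counts together as the final result for each box_id, correct?
Yes, exactly! Added together, with a different status filter in each table.
@NFA, are the box_id values the same in every table?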
【Answer 1】:
Assuming your date field is of the TIMESTAMP data type, below is for BigQuery Standard SQL
#standardSQL
SELECT box_id, date, hour, COUNT(1) cnt
FROM (
SELECT box_id, DATE(date) date, EXTRACT(HOUR FROM date) hour
FROM `project.dataset.table1` WHERE status = 'finished' UNION ALL
SELECT box_id, DATE(date) date, EXTRACT(HOUR FROM date) hour
FROM `project.dataset.table2` WHERE status = 'start' UNION ALL
SELECT box_id, DATE(date) date, EXTRACT(HOUR FROM date) hour
FROM `project.dataset.table3` WHERE status = 'close'
)
GROUP BY box_id, date, hour
You can test and play with the above using the sample/dummy data from your question, as in the example below
#standardSQL
WITH `project.dataset.table1` AS (
SELECT 1 id, 20 box_id, TIMESTAMP '2019-01-01 01:00:00.000 UTC' date, 'finished' status UNION ALL
SELECT 2, 21, '2019-01-01 02:00:00.000 UTC', 'finished' UNION ALL
SELECT 3, 21, '2019-01-01 01:00:00.000 UTC', 'unfinished'
), `project.dataset.table2` AS (
SELECT 1 id, 21 box_id, TIMESTAMP '2019-01-01 01:00:00.000 UTC' date, 'start' status UNION ALL
SELECT 2, 22, '2019-01-01 02:00:00.000 UTC', 'end' UNION ALL
SELECT 3, 23, '2019-01-01 01:00:00.000 UTC', 'start' UNION ALL
SELECT 4, 24, '2019-01-01 01:00:00.000 UTC', 'start'
), `project.dataset.table3` AS (
SELECT 1 id, 21 box_id, TIMESTAMP '2019-01-01 03:00:00.000 UTC' date, 'close' status UNION ALL
SELECT 2, 22, '2019-01-01 02:00:00.000 UTC', 'end' UNION ALL
SELECT 3, 24, '2019-01-01 01:00:00.000 UTC', 'close'
)
SELECT box_id, date, hour, COUNT(1) cnt
FROM (
SELECT box_id, DATE(date) date, EXTRACT(HOUR FROM date) hour
FROM `project.dataset.table1` WHERE status = 'finished' UNION ALL
SELECT box_id, DATE(date) date, EXTRACT(HOUR FROM date) hour
FROM `project.dataset.table2` WHERE status = 'start' UNION ALL
SELECT box_id, DATE(date) date, EXTRACT(HOUR FROM date) hour
FROM `project.dataset.table3` WHERE status = 'close'
)
GROUP BY box_id, date, hour
-- ORDER BY box_id, date, hour
with result
Row box_id date hour cnt
1 20 2019-01-01 1 1
2 21 2019-01-01 1 1
3 21 2019-01-01 2 1
4 21 2019-01-01 3 1
5 23 2019-01-01 1 1
6 24 2019-01-01 1 2
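The same check can be reproduced without a BigQuery project. The sketch below is an approximation, not the answer's code: it runs the UNION ALL query against an in-memory SQLite database through Python's `sqlite3` module. Two things are assumptions made for SQLite: timestamps are stored as 'YYYY-MM-DD HH:MM:SS' text, and the hour comes from `strftime('%H', ...)` because SQLite has no EXTRACT. The output columns are named `day` and `hour` to avoid clashing with the `date` column.

```python
import sqlite3

# Sample rows from the question: (id, box_id, date, status).
# Assumption: timestamps are plain 'YYYY-MM-DD HH:MM:SS' text so SQLite can parse them.
SAMPLE = {
    "table1": [(1, 20, "2019-01-01 01:00:00", "finished"),
               (2, 21, "2019-01-01 02:00:00", "finished"),
               (3, 21, "2019-01-01 01:00:00", "unfinished")],
    "table2": [(1, 21, "2019-01-01 01:00:00", "start"),
               (2, 22, "2019-01-01 02:00:00", "end"),
               (3, 23, "2019-01-01 01:00:00", "start"),
               (4, 24, "2019-01-01 01:00:00", "start")],
    "table3": [(1, 21, "2019-01-01 03:00:00", "close"),
               (2, 22, "2019-01-01 02:00:00", "end"),
               (3, 24, "2019-01-01 01:00:00", "close")],
}

# Each branch keeps only the rows that pass its own status filter;
# the outer query then counts the surviving rows per (box_id, day, hour).
QUERY = """
SELECT box_id, day, hour, COUNT(1) AS cnt
FROM (
  SELECT box_id, date(date) AS day, CAST(strftime('%H', date) AS INTEGER) AS hour
  FROM table1 WHERE status = 'finished' UNION ALL
  SELECT box_id, date(date) AS day, CAST(strftime('%H', date) AS INTEGER) AS hour
  FROM table2 WHERE status = 'start' UNION ALL
  SELECT box_id, date(date) AS day, CAST(strftime('%H', date) AS INTEGER) AS hour
  FROM table3 WHERE status = 'close'
)
GROUP BY box_id, day, hour
ORDER BY box_id, day, hour
"""


def load_sample(conn):
    """Create the three tables and insert the question's sample rows."""
    for name, rows in SAMPLE.items():
        conn.execute(
            f"CREATE TABLE {name} (id INTEGER, box_id INTEGER, date TEXT, status TEXT)"
        )
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?, ?)", rows)


def counts_per_box_hour():
    """Run the UNION ALL query against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        load_sample(conn)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in counts_per_box_hour():
        print(row)
```

On the sample data this returns the same six groups as the BigQuery result above, with box 24 at hour 1 counted twice (once from table 2, once from table 3).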
Below are slightly refactored versions of the same query (obviously with the same output)
#standardSQL
SELECT box_id, DATE(date) date, EXTRACT(HOUR FROM date) hour,
COUNTIF(
(t = 1 AND status = 'finished') OR
(t = 2 AND status = 'start') OR
(t = 3 AND status = 'close')
) cnt
FROM (
SELECT 1 t, * FROM `project.dataset.table1` UNION ALL
SELECT 2, * FROM `project.dataset.table2` UNION ALL
SELECT 3, * FROM `project.dataset.table3`
)
GROUP BY box_id, date, hour
HAVING cnt > 0
or
#standardSQL
SELECT box_id, DATE(date) date, EXTRACT(HOUR FROM date) hour, COUNT(1) cnt
FROM (
SELECT * FROM `project.dataset.table1` WHERE status = 'finished' UNION ALL
SELECT * FROM `project.dataset.table2` WHERE status = 'start' UNION ALL
SELECT * FROM `project.dataset.table3` WHERE status = 'close'
)
GROUP BY box_id, date, hour
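The COUNTIF variant differs from the filtered variant in one respect: every row creates its group, so groups where nothing matched end up with a count of 0 and have to be dropped by HAVING. The following Python sketch mimics that logic on the question's rows. The names `WANTED`, `TAGGED_ROWS` and `countif_per_box_hour` are hypothetical, and timestamps are simplified to 'YYYY-MM-DD HH:MM:SS'.

```python
from collections import Counter
from datetime import datetime

# Status that counts for each source table (the tag `t` in the COUNTIF query).
WANTED = {1: "finished", 2: "start", 3: "close"}

# (t, box_id, date, status): the question's rows tagged with their table number.
TAGGED_ROWS = [
    (1, 20, "2019-01-01 01:00:00", "finished"),
    (1, 21, "2019-01-01 02:00:00", "finished"),
    (1, 21, "2019-01-01 01:00:00", "unfinished"),
    (2, 21, "2019-01-01 01:00:00", "start"),
    (2, 22, "2019-01-01 02:00:00", "end"),
    (2, 23, "2019-01-01 01:00:00", "start"),
    (2, 24, "2019-01-01 01:00:00", "start"),
    (3, 21, "2019-01-01 03:00:00", "close"),
    (3, 22, "2019-01-01 02:00:00", "end"),
    (3, 24, "2019-01-01 01:00:00", "close"),
]


def countif_per_box_hour(rows):
    """Count matching rows per (box_id, date, hour), dropping empty groups."""
    counts = Counter()
    for t, box_id, stamp, status in rows:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        key = (box_id, moment.date().isoformat(), moment.hour)
        # COUNTIF: the group is created for every row, but only matches add 1.
        counts[key] += 1 if WANTED[t] == status else 0
    # HAVING cnt > 0: remove groups in which no row matched (box 22 here).
    return {key: n for key, n in sorted(counts.items()) if n > 0}


if __name__ == "__main__":
    for key, n in countif_per_box_hour(TAGGED_ROWS).items():
        print(key, n)
```

Without the final filter, box 22 at hour 2 would appear with a count of 0, which is why the COUNTIF query needs HAVING cnt > 0 while the pre-filtered variants do not.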
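【Comments】:
I am trying the first query, awesome! It works for me, thanks a ton, master!
【Answer 2】:
As discussed in the comments, since you want to add up fields from multiple tables, I suggest using a JOIN. There are several JOIN types. If all tables contain the same number of box_id values and the same values, you can use INNER JOIN. If that is not the case and you still want to see the count for every box_id, even when a box_id does not appear in all three tables, I suggest using FULL JOIN.
Below is a simplified example in which I used FULL JOIN together with other built-in BigQuery functions.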
SELECT DISTINCT
coalesce(t1.box_id, t2.box_id, t3.box_id) AS id,
(ifnull(t1.count,0)+ifnull(t2.count,0)+ifnull(t3.count,0)) AS count
FROM (
SELECT
box_id,
count(box_id) AS count
FROM
`source_table1`
WHERE status = 'finished'
GROUP BY
box_id) t1
FULL JOIN (
SELECT
box_id,
count (box_id) AS count
FROM
`source_table2`
WHERE status = 'finished'
GROUP BY
box_id ) t2
ON
t1.box_id=t2.box_id
FULL JOIN (
SELECT
box_id,
count (box_id) AS count
FROM
`source_table3`
WHERE status = 'finished'
GROUP BY
box_id) AS t3
ON
-- compare against t1 or t2 so a box_id missing from table 1 still matches
coalesce(t1.box_id, t2.box_id)=t3.box_id
WHERE
t1.box_id IS NOT NULL
OR t2.box_id IS NOT NULL
OR t3.box_id IS NOT NULL
ORDER BY
id
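Note that I select box_id with COALESCE: if the value is missing from table 1, it falls back to table 2, and so on. I then use IFNULL to add up the counts found in each table, which makes sure a count is treated as zero when a box_id is absent from a table. Finally, I used WHERE clauses so that the counts respect the conditions you imposed.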
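The combination of FULL JOIN, COALESCE and IFNULL amounts to adding per-table counts keyed by box_id, where a missing key contributes zero. A minimal Python sketch of that idea follows. It uses the status filters from the question (finished, start, close) rather than the single filter in the simplified query, and the names `count_by_box` and `total_by_box` are hypothetical.

```python
from collections import Counter

# (box_id, status) pairs taken from the question's three tables.
TABLE1 = [(20, "finished"), (21, "finished"), (21, "unfinished")]
TABLE2 = [(21, "start"), (22, "end"), (23, "start"), (24, "start")]
TABLE3 = [(21, "close"), (22, "end"), (24, "close")]


def count_by_box(rows, wanted_status):
    """Per-table subquery: count rows per box_id after the status filter."""
    return Counter(box_id for box_id, status in rows if status == wanted_status)


def total_by_box(*per_table_counts):
    """FULL JOIN + IFNULL: a box_id absent from a table simply adds nothing."""
    total = Counter()
    for counts in per_table_counts:
        total.update(counts)
    return dict(sorted(total.items()))


if __name__ == "__main__":
    totals = total_by_box(
        count_by_box(TABLE1, "finished"),
        count_by_box(TABLE2, "start"),
        count_by_box(TABLE3, "close"),
    )
    print(totals)
```

Box 20 appears only in table 1 and box 23 only in table 2, yet both survive in the totals, which is the behavior FULL JOIN is chosen for. Grouping by date and hour as well would mean using a (box_id, date, hour) tuple as the key instead of box_id alone.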
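I used the sample data below to reproduce your case:
Table 1: [screenshot not included]
Table 2 and Table 3: [screenshot not included]
And the output is: [screenshot not included]
Hope this helps.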
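【Comments】:
It works, but I do not get the date (I want to group by box_id and date). How can I do that?
Can you tell me how you want to select it? Since you are counting a field, I need to know how that field behaves. Is it the same for every box_id, or do you have a rule for choosing the values?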