Google-Bigquery: consolidate aggregate
【Posted】: 2014-08-22 21:18:11
【Question】: I'm trying to create a query that performs some fairly complex operations, but I haven't been able to find anything that points me in the right direction. Maybe you can help!
Here is the source data:
7457, "05:06:26 UTC", 15
7457, "05:06:26 UTC", 15
7457, "05:06:26 UTC", 15
7457, "05:06:26 UTC", 15
2341, "05:12:34 UTC", 10
2341, "05:12:34 UTC", 10
2341, "05:12:34 UTC", 10
2341, "05:12:34 UTC", 10
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:39 UTC", 15
5678, "05:12:39 UTC", 15
1111, "06:00:00 UTC", 10
2222, "07:00:00 UTC", 15
3333, "08:00:00 UTC", 10
I have a query that finds duplicate stats:
SELECT userID, timestamp, statType, COUNT(*) - 1 AS DuplCount
FROM [dataset1.table1]
GROUP BY userID, timestamp, statType
HAVING DuplCount > 0;
(Note that stats count as duplicates only if they share the same userID and timestamp.)
This produces a table that looks like:
userID timestamp statType DuplCount
7457 05:06:26 UTC 15 3
2341 05:12:34 UTC 10 3
5678 05:12:34 UTC 15 7
5678 05:12:39 UTC 15 1
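I'd like to consolidate this data further so that it can be inserted as a single row into another table: the sum of the duplicate counts for each statType. I'd like it to look like: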
table stat10DuplCount stat15DuplCount
dataset1.table1 3 11
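I'm not sure how to proceed. Can this all be done in a single query (preferred), or do I need to run multiple queries or do some post-query data processing?
【Answer 1】: Use a subquery: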
SELECT "dataset1.table1" table, COUNT(IF(statType=10,1,null)) stat10DuplCount, COUNT(IF(statType=15,1,null)) stat15DuplCount
FROM (
SELECT userID, timestamp, statType, COUNT(*) - 1 AS DuplCount
FROM [dataset1.table1]
GROUP BY userID, timestamp, statType
HAVING DuplCount > 0
)
(It's always easier to answer and test if you provide a working query over a public dataset, or post a sample of your data.)
Working example:
SELECT "dataset1.table1" tablename,
COUNT(IF(statType=10,1,null)) stat10DuplCount,
COUNT(IF(statType=15,1,null)) stat15DuplCount
FROM (SELECT 15 statType),(SELECT 10 statType),(SELECT 15 statType),(SELECT 15 statType)
tablename stat10DuplCount stat15DuplCount
dataset1.table1 1 3
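【Comments】:
Thanks! This definitely put me on the right track!
【Answer 2】: I figured out how to do what I wanted. The only difference between this query and Felipe's is that it takes the sum of the duplicates instead of counting each group of duplicates once.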
SELECT "dataset1.table1" table, SUM(IF(statType=10,DuplCount,null)) stat10DuplCount, SUM(IF(statType=15,DuplCount,null)) stat15DuplCount
FROM (
SELECT userID, timestamp, statType, COUNT(*) - 1 AS DuplCount
FROM [dataset1.table1] AS statsTable
GROUP BY userID, timestamp, statType
HAVING DuplCount > 0
);
Result:
table stat10DuplCount stat15DuplCount
dataset1.table1 3 11
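For anyone who wants to check the arithmetic without a BigQuery project, here is a minimal local sketch that loads the sample rows from the question into an in-memory SQLite database and runs both aggregations side by side. SQLite is a stand-in chosen for illustration, not BigQuery: it has no `IF()`, so `CASE WHEN` is used instead, and the table name `table1` and the `stat10Groups`/`stat15Groups` column names are assumptions made for this example.

```python
import sqlite3

# Sample rows from the question: (userID, timestamp, statType).
rows = (
    [(7457, "05:06:26 UTC", 15)] * 4
    + [(2341, "05:12:34 UTC", 10)] * 4
    + [(5678, "05:12:34 UTC", 15)] * 8
    + [(5678, "05:12:39 UTC", 15)] * 2
    + [
        (1111, "06:00:00 UTC", 10),
        (2222, "07:00:00 UTC", 15),
        (3333, "08:00:00 UTC", 10),
    ]
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (userID INTEGER, timestamp TEXT, statType INTEGER)"
)
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)

# Inner query: one row per duplicated (userID, timestamp, statType) group.
# Outer query: COUNT(CASE ...) counts the groups (as in Answer 1),
# while SUM(CASE ...) adds up DuplCount per statType (as in Answer 2).
query = """
SELECT
  COUNT(CASE WHEN statType = 10 THEN 1 END) AS stat10Groups,
  COUNT(CASE WHEN statType = 15 THEN 1 END) AS stat15Groups,
  SUM(CASE WHEN statType = 10 THEN DuplCount END) AS stat10DuplCount,
  SUM(CASE WHEN statType = 15 THEN DuplCount END) AS stat15DuplCount
FROM (
  SELECT userID, timestamp, statType, COUNT(*) - 1 AS DuplCount
  FROM table1
  GROUP BY userID, timestamp, statType
  HAVING COUNT(*) > 1
)
"""
result = conn.execute(query).fetchone()
print(result)  # (1, 3, 3, 11)
```

The first two values are what the COUNT-based query yields on this data (one duplicated group for statType 10, three for statType 15); the last two are the sums the question asked for.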