BigQuery: how to sample running totals over time
【Posted】: 2017-03-22 12:50:32
【Question】: I have a BigQuery table that records when items are purchased from a store. It contains an ItemID and a timestamp. I'm interested in the running total of purchases for each item. I have this query, which generates the running totals:
SELECT
  ItemID,
  timestamp,
  COUNT(*) OVER (PARTITION BY ItemID ORDER BY timestamp ASC, ItemID) AS runningtotal
FROM (
  SELECT * FROM [mydb.purchases]
)
ORDER BY timestamp
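This table has hundreds of thousands of rows. What I'd like to do now is take a time period (say, one week) and get 100 samples of the running total for each ItemID within that week, so that I can plot a chart without too many data points. I don't know how to do this. By filtering with something like "WHERE rownumber % (rowcount / 100) = 0" I can get 100 samples, but how do I do that for every ItemID in the table? Do I need to run a separate subquery per ItemID and then union the results? Thanks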
【Answer 1】: Using standard SQL, you can first collect a sample of 100 timestamps per item with the LIMIT clause inside the ARRAY_AGG function:
#standardSQL
SELECT
  ItemID,
  timestamp,
  COUNT(*) OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS running_total
FROM (
  -- Up to 100 timestamps per item; GROUP BY is required alongside ARRAY_AGG
  SELECT ItemID, ARRAY_AGG(timestamp LIMIT 100) AS timestamps
  FROM `mydb.purchases`
  GROUP BY ItemID
) t, UNNEST(t.timestamps) AS timestamp
ORDER BY timestamp
If this doesn't sample randomly enough, you can order the timestamps with RAND() inside ARRAY_AGG:
#standardSQL
SELECT
  ItemID,
  timestamp,
  COUNT(*) OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS running_total
FROM (
  -- Shuffle each item's timestamps, then keep up to 100 of them
  SELECT ItemID, ARRAY_AGG(timestamp ORDER BY RAND() LIMIT 100) AS timestamps
  FROM `mydb.purchases`
  GROUP BY ItemID
) t, UNNEST(t.timestamps) AS timestamp
ORDER BY timestamp
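One thing worth checking with this approach: the window function runs over the rows that survive the sampling, so running_total counts sampled rows rather than all purchases of the item. The following is a minimal Python sketch of the opposite order (compute the running total over every purchase first, then sample at random), which may be closer to what the plot needs. The function name, the (item_id, timestamp) tuple layout, and the seed parameter are assumptions made for illustration; they are not part of the answer or of BigQuery.

```python
import random
from collections import defaultdict


def sample_running_totals(rows, n=100, seed=None):
    """Hypothetical helper: rows is an iterable of (item_id, timestamp).

    Returns {item_id: [(timestamp, running_total), ...]} with at most
    n randomly chosen points per item, sorted by timestamp.
    """
    rng = random.Random(seed)

    # Group timestamps per item (the PARTITION BY ItemID step).
    by_item = defaultdict(list)
    for item_id, ts in rows:
        by_item[item_id].append(ts)

    result = {}
    for item_id, stamps in by_item.items():
        stamps.sort()
        # Running total over ALL purchases, computed before sampling.
        # Equal timestamps get distinct counts here (row position).
        totals = [(ts, i) for i, ts in enumerate(stamps, start=1)]
        # Random sample without replacement, capped at the group size.
        picked = rng.sample(totals, min(n, len(totals)))
        result[item_id] = sorted(picked)
    return result
```

Called as sample_running_totals(rows, n=100), this returns up to 100 (timestamp, running_total) points per item, and each running_total still reflects every purchase up to that timestamp.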
【Answer 2】: The query below samples exactly as you described. I left out selecting a week's worth of data, since that part is trivial.
#standardSQL
SELECT
ItemID,
timestamp,
runningtotal
FROM (
SELECT
ItemID,
timestamp,
COUNT(1) OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS runningtotal,
ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY timestamp ASC) AS rownumber,
COUNT(1) OVER(PARTITION BY ItemID) AS rowcount
FROM `mydb.purchases`
)
-- GREATEST(..., 1) avoids a zero divisor for items with fewer than 50 rows
WHERE MOD(rownumber, GREATEST(CAST(rowcount/100 AS INT64), 1)) = 0
-- ORDER BY ItemID, timestamp
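To check what this kind of query returns on a small dataset, the same logic can be sketched in plain Python, including the one-week filter that was left out. This is only an illustration under stated assumptions: the function name and the (item_id, timestamp) tuple layout are hypothetical, the running total restarts at the beginning of the chosen week, and the step is rounded up (rather than rounded to nearest, as CAST does) and clamped to at least 1 so that each item yields at most n points and small groups never produce a zero divisor.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def every_kth_running_total(rows, week_start, n=100):
    """Hypothetical helper: rows is an iterable of (item_id, datetime).

    Keeps purchases in [week_start, week_start + 7 days) and returns
    {item_id: [(timestamp, running_total), ...]} with at most n points.
    """
    week_end = week_start + timedelta(days=7)

    # Filter to the week and group per item (PARTITION BY ItemID).
    by_item = defaultdict(list)
    for item_id, ts in rows:
        if week_start <= ts < week_end:
            by_item[item_id].append(ts)

    sampled = {}
    for item_id, stamps in by_item.items():
        stamps.sort()
        # Step between kept rows; at least 1, so small groups keep every row.
        step = max(1, math.ceil(len(stamps) / n))
        # Row number doubles as the running total within the week.
        sampled[item_id] = [
            (ts, rownumber)
            for rownumber, ts in enumerate(stamps, start=1)
            if rownumber % step == 0
        ]
    return sampled
```

With 1000 purchases of one item in the week, the step is 10 and every tenth row is kept; an item with only 7 purchases keeps all 7 rows.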