Get all-time unique values in BigQuery
【Title】: Get all-time unique values in BigQuery 【Posted】: 2021-09-09 10:26:48 【Question】: I have a table like this:
ID | Day | Value |
---|---|---|
1 | 2021-09-01 | a |
2 | 2021-09-01 | b |
3 | 2021-09-01 | c |
4 | 2021-09-02 | d |
5 | 2021-09-02 | a |
6 | 2021-09-02 | a |
7 | 2021-09-02 | e |
8 | 2021-09-03 | c |
9 | 2021-09-03 | f |
10 | 2021-09-03 | a |
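I want to count how many distinct values I have per day and across all time, but the all-time unique count should only take earlier dates into account (the business logic behind it: I want to count a user only when the user is new). So I would like to see this output: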
Day | Daily Unique Counts | All Time Unique Counts |
---|---|---|
2021-09-01 | 3 | 3 |
2021-09-02 | 3 | 2 |
2021-09-03 | 3 | 1 |
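Note: for 2021-09-02 the daily unique count includes "d", "a" and "e", but the all-time unique count does not include "a" at all, because it was already counted on a previous day.
Right now I can calculate the daily unique counts correctly, but I don't know how to calculate the all-time unique counts column.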
SELECT day, COUNT(DISTINCT id) AS daily_unique_counts
FROM table
GROUP BY 1
ORDER BY 1 DESC
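I hope it is clear what I would like to see. Please help me solve this, because it is driving me crazy :)
【Question discussion】:
【Answer 1】: Consider the approach below. It uses HyperLogLog++ functions, which estimate cardinality from sketches: a sketch is built day by day and then used for the final math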
select day, Daily_Unique_Count,
  ( select hll_count.merge(sketch) - hll_count.merge(if(offset = 0, null, sketch))
    from unnest(array_reverse(prev_Sketches)) sketch with offset
  ) All_Time_Unique_Counts
from (
  select day, Daily_Unique_Count,
    array_agg(Daily_Sketch) over(order by day) prev_Sketches
  from (
    select day, count(distinct value) as Daily_Unique_Count,
      hll_count.init(value) as Daily_Sketch
    from table
    group by day
  )
)
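If applied to the sample data in your question, the output is
[output image not preserved]
How it works (from the inside out):
- First, compute the daily distinct count and the daily sketch
- Then, for each day, aggregate the sketches of that day and of all previous days into an array
- Finally (in the outermost select), for each day, compute the cardinality over the current day plus all previous days and subtract the cardinality over the previous days only. Kaboom! :o)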
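The three steps above can be mirrored outside BigQuery with exact sets standing in for the HyperLogLog++ sketches. This is only an illustrative sketch, not the answer's query: the function name `new_values_per_day` and the `rows` list are made up for the example, and unlike `hll_count` the sets give exact rather than estimated counts.

```python
from collections import defaultdict

# Sample rows from the question: (id, day, value)
rows = [
    (1, "2021-09-01", "a"), (2, "2021-09-01", "b"), (3, "2021-09-01", "c"),
    (4, "2021-09-02", "d"), (5, "2021-09-02", "a"), (6, "2021-09-02", "a"),
    (7, "2021-09-02", "e"), (8, "2021-09-03", "c"), (9, "2021-09-03", "f"),
    (10, "2021-09-03", "a"),
]


def new_values_per_day(rows):
    """Return [(day, daily_unique, all_time_unique)] ordered by day."""
    # Step 1: one exact "sketch" (a set of values) per day.
    daily = defaultdict(set)
    for _id, day, value in rows:
        daily[day].add(value)

    result = []
    seen_before = set()  # Step 2: merged sketches of all previous days.
    for day in sorted(daily):
        seen_through_today = seen_before | daily[day]
        # Step 3: cardinality through today minus cardinality before today.
        all_time_unique = len(seen_through_today) - len(seen_before)
        result.append((day, len(daily[day]), all_time_unique))
        seen_before = seen_through_today
    return result


for row in new_values_per_day(rows):
    print(row)
```

On the question's sample data this yields 3/3, 3/2 and 3/1 for the three days, which matches the output the question asks for.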
【Discussion】:
【Answer 2】: I think this is what you need
WITH SAMPLE AS
(
  SELECT 1 AS ID, '2021-09-01' AS DAY, 'A' AS VALUE UNION ALL
  SELECT 2 AS ID, '2021-09-01' AS DAY, 'B' AS VALUE UNION ALL
  SELECT 3 AS ID, '2021-09-01' AS DAY, 'C' AS VALUE UNION ALL
  SELECT 4 AS ID, '2021-09-02' AS DAY, 'D' AS VALUE UNION ALL
  SELECT 5 AS ID, '2021-09-02' AS DAY, 'A' AS VALUE UNION ALL
  SELECT 6 AS ID, '2021-09-02' AS DAY, 'A' AS VALUE UNION ALL
  SELECT 7 AS ID, '2021-09-02' AS DAY, 'E' AS VALUE UNION ALL
  SELECT 8 AS ID, '2021-09-03' AS DAY, 'C' AS VALUE UNION ALL
  SELECT 9 AS ID, '2021-09-03' AS DAY, 'F' AS VALUE UNION ALL
  SELECT 10 AS ID, '2021-09-03' AS DAY, 'A' AS VALUE UNION ALL
  SELECT 11 AS ID, '2021-09-04' AS DAY, 'A' AS VALUE UNION ALL
  SELECT 12 AS ID, '2021-09-04' AS DAY, 'K' AS VALUE UNION ALL
  SELECT 13 AS ID, '2021-09-04' AS DAY, 'D' AS VALUE UNION ALL
  SELECT 14 AS ID, '2021-09-05' AS DAY, 'A' AS VALUE
),
DISTINCT_COUNT AS
(
  SELECT DAY,
    COUNT(DISTINCT VALUE) AS DAILY_UNIQUE_COUNTS
  FROM SAMPLE
  GROUP BY DAY
),
CTE3_SUBTRACT_COUNT AS
(
  SELECT A.DAY AS A_DAY,
    COUNT(DISTINCT A.VALUE) AS SUBTRACT_ME
  FROM SAMPLE A
  JOIN SAMPLE B
    ON A.DAY > B.DAY AND A.VALUE = B.VALUE
  GROUP BY A.DAY
)
SELECT A.DAY,
  MAX(DAILY_UNIQUE_COUNTS) AS DAILY_UNIQUE_COUNTS,
  MIN(DAILY_UNIQUE_COUNTS - IFNULL(SUBTRACT_ME, 0)) AS ALL_TIME_COUNT
FROM SAMPLE A
LEFT JOIN CTE3_SUBTRACT_COUNT B
  ON A.DAY = B.A_DAY
LEFT JOIN DISTINCT_COUNT C
  ON C.DAY = A.DAY
GROUP BY A.DAY
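To make the arithmetic in this query easier to follow, here is a small Python sketch of the same idea: for each day, count the distinct values, then subtract the distinct values that also appear on an earlier day (the role played by the self-join in `CTE3_SUBTRACT_COUNT`). The names `sample` and `daily_and_all_time_counts` are invented for the illustration, and the data is the extended sample from the `SAMPLE` CTE.

```python
# (day, value) pairs taken from the SAMPLE CTE above; IDs are omitted
# because the counts only depend on day and value.
sample = [
    ("2021-09-01", "A"), ("2021-09-01", "B"), ("2021-09-01", "C"),
    ("2021-09-02", "D"), ("2021-09-02", "A"), ("2021-09-02", "A"),
    ("2021-09-02", "E"), ("2021-09-03", "C"), ("2021-09-03", "F"),
    ("2021-09-03", "A"), ("2021-09-04", "A"), ("2021-09-04", "K"),
    ("2021-09-04", "D"), ("2021-09-05", "A"),
]


def daily_and_all_time_counts(rows):
    """Return [(day, daily_unique, all_time_unique)] ordered by day."""
    days = sorted({day for day, _value in rows})
    out = []
    for day in days:
        # Distinct values seen on this day (DISTINCT_COUNT in the SQL).
        today = {value for d, value in rows if d == day}
        # Values of this day that also occur on an earlier day
        # (the A.DAY > B.DAY AND A.VALUE = B.VALUE self-join).
        # ISO date strings compare correctly as plain strings.
        repeated = {value for d, value in rows if d < day and value in today}
        out.append((day, len(today), len(today) - len(repeated)))
    return out


for row in daily_and_all_time_counts(sample):
    print(row)
```

The subtraction is safe because `repeated` is always a subset of `today`, so the result can drop to 0 (as on 2021-09-05, where only "A" appears) but never below it.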
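【Discussion】:
@athew Could you mark this as the answer? Thanks!
【Answer 3】: It would be better to build this with window (analytic) functions, since they let you define a date window for this particular case. The documentation is here and here, and this is a good article explaining how they work.
Your query would look something like: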
SELECT day, count(distinct id) OVER part AS daily_unique_counts
FROM table
WINDOW part AS (PARTITION BY day ORDER BY day DESC
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
Update:
I just saw that it will not run with the DISTINCT specification. There is a workaround, though, as seen in this other post.
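One common way around a DISTINCT restriction of this kind is to avoid counting distinct values inside the window at all: find the first day on which each value appears, then count how many values have each day as their first day. Whether the linked post does exactly this is an assumption on my part. The Python sketch below only illustrates the logic; `first_seen_counts` and `rows` are hypothetical names, and the data is the sample from the question.

```python
from collections import Counter

# Sample rows from the question: (id, day, value)
rows = [
    (1, "2021-09-01", "a"), (2, "2021-09-01", "b"), (3, "2021-09-01", "c"),
    (4, "2021-09-02", "d"), (5, "2021-09-02", "a"), (6, "2021-09-02", "a"),
    (7, "2021-09-02", "e"), (8, "2021-09-03", "c"), (9, "2021-09-03", "f"),
    (10, "2021-09-03", "a"),
]


def first_seen_counts(rows):
    """Return [(day, all_time_unique)]: values whose first day is `day`."""
    # Earliest day per value (the equivalent of MIN(day) grouped by value).
    first_day = {}
    for _id, day, value in rows:
        if value not in first_day or day < first_day[value]:
            first_day[value] = day

    # Count how many values start on each day.
    new_per_day = Counter(first_day.values())

    # Report every day that has rows, including days with no new values;
    # Counter returns 0 for days that are missing from it.
    days = sorted({day for _id, day, _value in rows})
    return [(day, new_per_day[day]) for day in days]


print(first_seen_counts(rows))
```

Because each value is attributed to exactly one day, no DISTINCT is needed in the per-day count.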
【Discussion】:
It looks interesting, but I get this error: "Window ORDER BY is not allowed if DISTINCT is specified at [5:32]"
@athew I just updated the answer to take this error into account ;)