Get all-time unique values in BigQuery

Posted: 2021-09-09 10:26:48

【Question】:

I have a table like this:

ID Day Value
1 2021-09-01 a
2 2021-09-01 b
3 2021-09-01 c
4 2021-09-02 d
5 2021-09-02 a
6 2021-09-02 a
7 2021-09-02 e
8 2021-09-03 c
9 2021-09-03 f
10 2021-09-03 a

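I want to count how many distinct values I have per day and across all time, but the all-time uniqueness should only take earlier dates into account (the business logic behind it is that I want to count a user only if the user is new). So I would like to see this output: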

Day         Daily Unique Counts  All Time Unique Counts
2021-09-01  3                    3
2021-09-02  3                    2
2021-09-03  3                    1

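Note: for 2021-09-02, Daily Unique Counts counts "d", "a" and "e", but All Time Unique Counts does not count "a" at all, because it was already counted on an earlier day.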
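The rule in the note amounts to counting each value only on the first day it appears. As a rough illustration of that rule (not BigQuery code), here is a minimal Python sketch over the sample rows; the function name `unique_counts` and the variable names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Sample rows from the question: (id, day, value).
rows = [
    (1, "2021-09-01", "a"), (2, "2021-09-01", "b"), (3, "2021-09-01", "c"),
    (4, "2021-09-02", "d"), (5, "2021-09-02", "a"), (6, "2021-09-02", "a"),
    (7, "2021-09-02", "e"), (8, "2021-09-03", "c"), (9, "2021-09-03", "f"),
    (10, "2021-09-03", "a"),
]


def unique_counts(rows):
    """Return {day: (daily_unique, all_time_unique)} for (id, day, value) rows."""
    daily = defaultdict(set)  # day -> distinct values seen on that day
    first_seen = {}           # value -> earliest day on which it appears
    for _id, day, value in rows:
        daily[day].add(value)
        # ISO-formatted dates compare correctly as plain strings.
        if value not in first_seen or day < first_seen[value]:
            first_seen[value] = day

    result = {}
    for day in sorted(daily):
        # A value is "new" on a day only if that day is its first appearance.
        new_values = [v for v in daily[day] if first_seen[v] == day]
        result[day] = (len(daily[day]), len(new_values))
    return result


counts = unique_counts(rows)
for day, (daily_unique, all_time_unique) in counts.items():
    print(day, daily_unique, all_time_unique)
```

On the sample data this reproduces the requested table: (3, 3), (3, 2) and (3, 1) for the three days.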

Right now I can calculate Daily Unique Counts correctly, but I don't know how to calculate the All Time Unique Counts column.

SELECT day, COUNT(DISTINCT id) AS daily_unique_counts,

FROM table
GROUP BY 1
ORDER BY 1 DESC

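I hope it's clear what I want to see. Please help me solve this, because it's driving me crazy :)

【Comments】:

【Answer 1】:

Consider the approach below. It uses BigQuery's HyperLogLog++ functions: each day's values are summarized in a sketch, and the cardinalities estimated from merged sketches are used for the final arithmetic. Because HyperLogLog++ is an approximate algorithm, the counts are estimates on large data.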

select day, Daily_Unique_Count, 
  ( select hll_count.merge(sketch) - hll_count.merge(if(offset = 0, null, sketch))  
    from unnest(array_reverse(prev_Sketches)) sketch with offset
  ) All_Time_Unique_Counts
from (
  select day, Daily_Unique_Count, 
    array_agg(Daily_Sketch) over(order by day) prev_Sketches
  from (
    select day, count(distinct value) as Daily_Unique_Count,
      hll_count.init(value) as Daily_Sketch
    from table
    group by day
  )
)     

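If applied to the sample data in your question, the output matches the table requested there.

How it works (from the inside out):

    First, the daily distinct count and a daily sketch are computed.
    Then, for each day, the sketches of that day and of all previous days are collected into an array.
    Finally (in the outermost select), for each day, you compute the cardinality of all days up to and including the current one and subtract the cardinality of all previous days. Kaboom! :o)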
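The three steps above can be mimicked outside BigQuery to check the arithmetic. The following is only a sketch under stated assumptions: plain Python sets stand in for HLL sketches and set union stands in for `hll_count.merge`, so the result is exact rather than estimated. The function name `all_time_new_counts` is hypothetical.

```python
# Sample rows from the question: (id, day, value).
rows = [
    (1, "2021-09-01", "a"), (2, "2021-09-01", "b"), (3, "2021-09-01", "c"),
    (4, "2021-09-02", "d"), (5, "2021-09-02", "a"), (6, "2021-09-02", "a"),
    (7, "2021-09-02", "e"), (8, "2021-09-03", "c"), (9, "2021-09-03", "f"),
    (10, "2021-09-03", "a"),
]


def all_time_new_counts(rows):
    """Return {day: (daily_unique, all_time_unique)} via cumulative unions."""
    # Step 1: one exact "sketch" (a set of values) per day.
    daily = {}
    for _id, day, value in rows:
        daily.setdefault(day, set()).add(value)

    result = {}
    seen_before = set()  # union of the sketches of all previous days
    for day in sorted(daily):
        # Step 2: merge the current day's sketch with all previous ones.
        seen_through_today = seen_before | daily[day]
        # Step 3: cardinality through today minus cardinality before today.
        new_count = len(seen_through_today) - len(seen_before)
        result[day] = (len(daily[day]), new_count)
        seen_before = seen_through_today
    return result


merged_counts = all_time_new_counts(rows)
for day, (daily_unique, all_time_unique) in merged_counts.items():
    print(day, daily_unique, all_time_unique)
```

The subtraction works because the union through the current day always contains the union of the previous days, so the difference in sizes is exactly the number of values seen for the first time on that day.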

【Comments】:

【Answer 2】:

I think this is what you need:

WITH SAMPLE AS
(
SELECT 1 AS ID, '2021-09-01' AS DAY,'A' AS VALUE UNION ALL
SELECT 2 AS ID, '2021-09-01' AS DAY,'B' AS VALUE UNION ALL
SELECT 3 AS ID, '2021-09-01' AS DAY,'C' AS VALUE UNION ALL
SELECT 4 AS ID, '2021-09-02' AS DAY,'D' AS VALUE UNION ALL
SELECT 5 AS ID, '2021-09-02' AS DAY,'A' AS VALUE UNION ALL
SELECT 6 AS ID, '2021-09-02' AS DAY,'A' AS VALUE UNION ALL
SELECT 7 AS ID, '2021-09-02' AS DAY,'E' AS VALUE UNION ALL
SELECT 8 AS ID, '2021-09-03' AS DAY,'C' AS VALUE UNION ALL
SELECT 9 AS ID, '2021-09-03' AS DAY,'F' AS VALUE UNION ALL
SELECT 10 AS ID,    '2021-09-03' AS DAY,'A' AS VALUE UNION ALL

SELECT 11 AS ID,    '2021-09-04' AS DAY,'A' AS VALUE UNION ALL
SELECT 12 AS ID,    '2021-09-04' AS DAY,'K' AS VALUE UNION ALL
SELECT 13 AS ID,    '2021-09-04' AS DAY,'D' AS VALUE UNION ALL
SELECT 14 AS ID,    '2021-09-05' AS DAY,'A' AS VALUE 
),
DISTINCT_COUNT AS
(
SELECT  DAY, 
        COUNT (DISTINCT VALUE) AS DAILY_UNIQUE_COUNTS
FROM SAMPLE
GROUP BY DAY
)
,CTE3_SUBTRACT_COUNT
AS
(
SELECT  A.DAY AS A_DAY,
        COUNT(DISTINCT A.VALUE) AS SUBTRACT_ME
FROM SAMPLE A
JOIN SAMPLE B 
ON A.DAY > B.DAY AND A.VALUE = B.VALUE
GROUP BY A.DAY
)
SELECT  A.DAY,
        MAX(DAILY_UNIQUE_COUNTS) AS DAILY_UNIQUE_COUNTS,
        MIN(DAILY_UNIQUE_COUNTS - IFNULL(SUBTRACT_ME,0)) AS ALL_TIME_COUNT
FROM SAMPLE A
LEFT JOIN CTE3_SUBTRACT_COUNT B
ON A.DAY = B.A_DAY
LEFT JOIN DISTINCT_COUNT C ON
C.DAY = A.DAY
GROUP BY A.DAY

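【Comments】:

@athew Could you mark this as the answer? Thanks!

【Answer 3】:

It would be better to build this with window (analytic) functions, since they let you define a date window for this particular case. The documentation is here and here, and this is a good article explaining how it works.

Your query would look something like: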

SELECT day, count(distinct id) OVER part AS daily_unique_counts
FROM table
WINDOW part AS (PARTITION BY day ORDER BY day DESC
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)

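Update: I just saw that it won't run with the DISTINCT specification. There is a workaround, though, as seen in this other post.

【Comments】:

It looks interesting, but I get this error: "Window ORDER BY is not allowed if DISTINCT is specified at [5:32]"

@athew I just updated the answer to take this error into account ;)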
