BigQuery: Selecting the smallest difference among fields in a repeated record

[Title]: BigQuery: Selecting the smallest difference among fields in a repeated record [Posted]: 2016-02-23 03:00:09 [Question]:

Consider this table schema on BigQuery:

Table User

user_id: STRING (REQUIRED)
user_name: STRING (REQUIRED)
actions: RECORD (REPEATED) 
    
        action_id: STRING (REQUIRED)
        action_type: INTEGER (REQUIRED)
        action_date: TIMESTAMP (REQUIRED)
    

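I want to find all users (user_id and user_name) who created an action of a certain type more than once, where the shortest time between those actions is less than X days.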

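The number of actions stored per user is not fixed (it can be 1, 2, or n). The actions are not sorted by any criterion (but I think that can be handled with ORDER BY).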

For example, with these users:


{
    user_id: "u1",
    user_name: "User 1",
    actions: [
        { action_id: "a1", action_type: 1, action_date: "2016-02-22" },
        { action_id: "a2", action_type: 1, action_date: "2016-01-22" },
        { action_id: "a3", action_type: 1, action_date: "2015-12-22" }
    ]
},
{
    user_id: "u2",
    user_name: "User 2",
    actions: [
        { action_id: "a4", action_type: 1, action_date: "2016-02-22" },
        { action_id: "a5", action_type: 2, action_date: "2016-01-22" },
        { action_id: "a6", action_type: 1, action_date: "2015-12-22" }
    ]
},
{
    user_id: "u3",
    user_name: "User 3",
    actions: [
        { action_id: "a7", action_type: 1, action_date: "2016-02-22" }
    ]
},
{
    user_id: "u4",
    user_name: "User 4",
    actions: [
        { action_id: "a8", action_type: 1, action_date: "2016-02-22" },
        { action_id: "a9", action_type: 1, action_date: "2015-02-22" },
        { action_id: "a10", action_type: 1, action_date: "2015-01-22" }
    ]
}

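The query "select the users who performed an action of type 1 more than once, where the minimum time between those actions is less than 45 days" should return User 1 and User 4.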
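To make the expected result concrete, here is a minimal Python sketch (not BigQuery code) that applies the rule to the sample users from the question. The names `USERS`, `min_gap_days` and `users_within` are made up for illustration, and the gap is computed by brute force over every pair of dates rather than the way a query engine would do it.

```python
from datetime import date
from itertools import combinations

# Sample data copied from the question:
# user_id -> (user_name, [(action_type, action_date), ...])
USERS = {
    "u1": ("User 1", [(1, date(2016, 2, 22)), (1, date(2016, 1, 22)), (1, date(2015, 12, 22))]),
    "u2": ("User 2", [(1, date(2016, 2, 22)), (2, date(2016, 1, 22)), (1, date(2015, 12, 22))]),
    "u3": ("User 3", [(1, date(2016, 2, 22))]),
    "u4": ("User 4", [(1, date(2016, 2, 22)), (1, date(2015, 2, 22)), (1, date(2015, 1, 22))]),
}


def min_gap_days(actions, action_type):
    """Smallest gap in days between any two actions of the given type.

    Returns None when the user has fewer than two such actions.
    """
    dates = [d for t, d in actions if t == action_type]
    gaps = [abs((a - b).days) for a, b in combinations(dates, 2)]
    return min(gaps) if gaps else None


def users_within(users, action_type, max_days):
    """Names of users whose smallest gap for action_type is below max_days."""
    result = []
    for _user_id, (name, actions) in users.items():
        gap = min_gap_days(actions, action_type)
        if gap is not None and gap < max_days:
            result.append(name)
    return result


print(users_within(USERS, action_type=1, max_days=45))  # ['User 1', 'User 4']
```

User 2 drops out because its two type-1 actions are 62 days apart, and User 3 drops out because it has only one action.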

Any ideas on how to do this on BigQuery?

[Comments]:

@MikhailBerlyant I haven't marked it as accepted yet because I haven't had time to test it, so please bear with me. [Answer 1]:

Try the query below. I wrote it on the go, so it is untested, but I think it should work and do what you need.

SELECT 
  user_id, 
  user_name, 
  action_type, 
  MIN(DATEDIFF(action_date_next, action_date)) AS min_distance
FROM (
  SELECT 
    user_id, 
    user_name, 
    action_type, 
    action_date, 
    LAG(action_date) 
        OVER(PARTITION BY user_id, action_type 
        ORDER BY action_date DESC) AS action_date_next
  FROM (
    SELECT 
      user_id, 
      user_name, 
      actions.action_type AS action_type, 
      actions.action_date AS action_date 
    FROM table_users 
  )
)
WHERE action_date_next IS NOT NULL
GROUP BY user_id, user_name, action_type
HAVING action_type = 1 AND min_distance < 45

The version below is more compact; give it a try as well.

SELECT 
  user_id, 
  user_name, 
  action_type, 
  MIN(DATEDIFF(action_date_next, action_date)) AS min_distance
FROM (
  SELECT 
    user_id, 
    user_name, 
    actions.action_type AS action_type, 
    actions.action_date AS action_date, 
    LAG(actions.action_date) 
        OVER(PARTITION BY user_id, actions.action_type 
        ORDER BY actions.action_date DESC) AS action_date_next
  FROM table_users
)
WHERE action_date_next IS NOT NULL
GROUP BY user_id, user_name, action_type
HAVING action_type = 1 AND min_distance < 45
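For readers unfamiliar with LAG, the idea behind both queries can be sketched in plain Python: sort one user's dates in descending order, pair each row with the row before it, drop the first row (which has no predecessor), and take the smallest difference. This is only an illustration of the logic, not BigQuery code; the `lag` helper is a hypothetical stand-in for the window function, and the dates are User 4's type-1 actions from the question, shuffled on purpose.

```python
from datetime import date


def lag(values):
    """Previous row's value for each row; the first row has none (NULL in SQL)."""
    return [None] + values[:-1]


# User 4's type-1 action dates from the question, deliberately out of order.
dates = [date(2015, 2, 22), date(2016, 2, 22), date(2015, 1, 22)]

# ORDER BY action_date DESC
ordered = sorted(dates, reverse=True)

# Each row paired with the later action: (action_date, action_date_next)
pairs = list(zip(ordered, lag(ordered)))

# WHERE action_date_next IS NOT NULL, then the difference in days
gaps = [(nxt - cur).days for cur, nxt in pairs if nxt is not None]

# MIN(...) AS min_distance
min_distance = min(gaps)

print(gaps, min_distance)  # [365, 31] 31
```

Only adjacent rows need to be compared after sorting, because any non-adjacent pair is at least as far apart as the adjacent pairs between them.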

[Discussion]:

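There is a bug with the LAG function and TIMESTAMP on BigQuery; see this question. Apart from that, the answer seems to solve the problem, and I will accept it once I can test it properly.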
