BigQuery: user metadata described by start date and end date - Create permutations across multiple tables

【Posted】: 2021-05-27 17:01:51 【Question】:

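I have some BigQuery tables that contain user-related metadata, together with the time interval during which each value is valid.

For example:

    This table tracks how a user's color changes over time:

    This table tracks how a user's fruit changes over time:

First, note that the time intervals in the 2 tables are not aligned: they can overlap partially, overlap completely, or not overlap at all, depending on the nature of each piece of metadata.

The goal is to merge the metadata columns of the 2 tables into a single table with the same date_start/date_end structure, one that accounts for all the values and for the date of every change.

So far, this is how I do it: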


WITH
input_1 AS (
    SELECT DATE('2021-05-01') AS date_start, DATE('2021-05-05') AS date_end, "user-1" AS user, "white" AS color
    UNION ALL
    SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-1" AS user, "blue" AS color
    UNION ALL
    SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-2" AS user, "red" AS color
),

input_2 AS (
    SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-07') AS date_end, "user-1" AS user, "apple" AS fruit
    UNION ALL
    SELECT DATE('2021-05-08') AS date_start, DATE('2021-05-11') AS date_end, "user-1" AS user, "cherry" AS fruit
    UNION ALL
    SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-11') AS date_end, "user-2" AS user, "banana" AS fruit
),

-------------------

input_1_day_by_day AS (
  SELECT day, input_1.* EXCEPT(date_start, date_end)
  FROM input_1  
    CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
),

input_2_day_by_day AS (
  SELECT day, input_2.* EXCEPT(date_start, date_end)
  FROM input_2
    CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
)

--------------------

SELECT
  -- User
  COALESCE(i1.user,i2.user) AS user,

  -- Date interval
  MIN(COALESCE(i1.day, i2.day)) AS date_start,
  MAX(COALESCE(i1.day, i2.day)) AS date_end,
  
  -- Data
  i1.color,
  i2.fruit,

FROM input_1_day_by_day AS i1
  FULL JOIN input_2_day_by_day AS i2 ON i1.user = i2.user AND i1.day = i2.day
  
GROUP BY 1,4,5

ORDER BY 1,2

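Basically:

    First, I explode the date intervals to create one row per day (for each table).
    Then I create a new table by matching on day (and user), so that the data is joined together.
    Finally, I group all the metadata values together and track the MIN and MAX date, which recreates the intervals across all the distinct metadata coming from the different tables.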
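As a minimal sketch of the first step, the query below shows what the date expansion produces for a single interval. The literal dates are illustrative; `GENERATE_DATE_ARRAY` includes both endpoints and steps by one day by default.

```sql
-- Expand one interval into one row per day (both endpoints included).
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2021-05-01', DATE '2021-05-03')) AS day
ORDER BY day
-- returns three rows: 2021-05-01, 2021-05-02, 2021-05-03
```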

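The result is this:

Now, while this solution seems to work, it is practical only because the example is this small: 2 tables, one column per table.

My real scenario consists of many more tables, each with several relevant columns. Say 10 tables with about 20 columns each.

My problem is:

To create the MIN/MAX date intervals, I currently group by every metadata field I have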

GROUP BY 1,4,5

With the number of tables and columns I mentioned, the resulting grouping would look like this

GROUP BY 1,4,5,6,7,8,9,......,30,31,32,......40,41,42,...

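Is there a smarter way to achieve this result without grouping by every single column?

Something like a GROUP BY ALL EXCEPT 2,3 notation, or some other kind of grouping that could help in this situation?

Here is another set of inputs, with some extra columns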

input_1 AS (
    SELECT DATE('2021-05-01') AS date_start, DATE('2021-05-05') AS date_end, "user-1" AS user, "white" AS color, "new-york" as city, "america" as country
    UNION ALL
    SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-1" AS user, "blue" AS color, "paris" as city, "france" as country
),

input_2 AS (
    SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-07') AS date_end, "user-1" AS user, "apple" AS fruit, "dog" as animal, "daisy" AS flower, "iron" AS metal
    UNION ALL
    SELECT DATE('2021-05-08') AS date_start, DATE('2021-05-09') AS date_end, "user-1" AS user, "cherry" AS fruit, "dog" as animal, "rose" as flower, "steel" as metal
    UNION ALL
    SELECT DATE('2021-05-10') AS date_start, DATE('2021-05-11') AS date_end, "user-1" AS user, "cherry" AS fruit, "dog" as animal, "rose" as flower, "iron" as metal
),

【Comments】:

【Answer 1】:

Is there a smarter way to achieve this result without grouping by every single column?

Consider the following generic solution

select user, 
  min(day) date_start, 
  max(day) date_end, 
  any_value(t).* except(user, day, grp)
from (
  select * except(flag),
    countif(ifnull(flag, true)) over(partition by user order by day) grp
  from (
    select * except(mask1, mask2),
      lag(ifnull(mask1, 'null') || ifnull(mask2, 'null')) over(partition by user order by day) != ifnull(mask1, 'null') || ifnull(mask2, 'null') as flag
    from (
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask1
      from input_1 t, 
      unnest(generate_date_array(date_start, date_end)) day
    ) 
    full outer join (
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask2
      from input_2 t, 
      unnest(generate_date_array(date_start, date_end)) day
    ) 
    using(user, day)
  )
) t
group by t.user, t.grp
# order by t.user, date_start            
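
The core of the query above is a change-flag plus running-count pattern (often called gaps and islands). The following is a reduced sketch of that pattern on one hypothetical day-by-day table; the `daily` rows and the `metadata_json` output column are illustrative assumptions, not part of the original answer.

```sql
WITH daily AS (
  -- Hypothetical rows, already expanded to one row per user and day.
  SELECT 'user-1' AS user, DATE '2021-05-01' AS day, 'white' AS color, 'paris' AS city UNION ALL
  SELECT 'user-1', DATE '2021-05-02', 'white', 'paris' UNION ALL
  SELECT 'user-1', DATE '2021-05-03', 'blue',  'paris' UNION ALL
  SELECT 'user-1', DATE '2021-05-04', 'white', 'paris'
),
masked AS (
  -- One JSON string per row covering every metadata column,
  -- so the later steps never have to list those columns.
  SELECT *,
    TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(user, day) FROM UNNEST([t]))) AS mask
  FROM daily AS t
),
flagged AS (
  -- flag is NULL on a user's first day and TRUE whenever the mask changes.
  SELECT *,
    LAG(mask) OVER (PARTITION BY user ORDER BY day) != mask AS flag
  FROM masked
),
grouped AS (
  -- Running count of change points: each run of identical values gets its own grp.
  SELECT * EXCEPT(flag),
    COUNTIF(IFNULL(flag, TRUE)) OVER (PARTITION BY user ORDER BY day) AS grp
  FROM flagged
)
SELECT
  user,
  MIN(day) AS date_start,
  MAX(day) AS date_end,
  ANY_VALUE(mask) AS metadata_json
FROM grouped
GROUP BY user, grp
ORDER BY user, date_start
-- three intervals: 05-01..05-02 (white), 05-03..05-03 (blue), 05-04..05-04 (white)
```

Because each run of identical values gets its own `grp`, the two separate `white` periods stay separate, whereas grouping directly on the metadata columns would collapse them into a single 2021-05-01 to 2021-05-04 interval that overlaps the `blue` day.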

Applied to the sample data in your question, the output is

【Comments】:
