BigQuery: user metadata described by start date and end date - Create permutations across multiple tables
【Posted】: 2021-05-27 17:01:51
【Question】: I have several BigQuery tables, each containing some user-related metadata together with the time interval during which that value is valid.
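For example:
- This table tracks how a user's color changes over time:
- This table tracks how a user's fruit changes over time:
First, note that the time intervals in the 2 tables do not line up: they can overlap partially, overlap completely, or not overlap at all, depending on the nature of each piece of metadata.
The goal is to merge the metadata columns of the 2 tables into a single table with the same date start/end structure, taking into account all the values and the date of every change.
So far, this is what I have: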
WITH input_1 AS (
SELECT DATE('2021-05-01') AS date_start, DATE('2021-05-05') AS date_end, "user-1" AS user, "white" AS color
UNION ALL
SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-1" AS user, "blue" AS color
UNION ALL
SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-2" AS user, "red" AS color
),
input_2 AS (
SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-07') AS date_end, "user-1" AS user, "apple" AS fruit
UNION ALL
SELECT DATE('2021-05-08') AS date_start, DATE('2021-05-11') AS date_end, "user-1" AS user, "cherry" AS fruit
UNION ALL
SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-11') AS date_end, "user-2" AS user, "banana" AS fruit
),
-------------------
input_1_day_by_day AS (
SELECT day, input_1.* EXCEPT(date_start, date_end)
FROM input_1
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
),
input_2_day_by_day AS (
SELECT day, input_2.* EXCEPT(date_start, date_end)
FROM input_2
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(date_start, date_end)) AS day
)
--------------------
SELECT
-- User
COALESCE(i1.user,i2.user) AS user,
-- Date interval
MIN(COALESCE(i1.day, i2.day)) AS date_start,
MAX(COALESCE(i1.day, i2.day)) AS date_end,
-- Data
i1.color,
i2.fruit,
FROM input_1_day_by_day AS i1
FULL JOIN input_2_day_by_day AS i2 ON i1.user = i2.user AND i1.day = i2.day
GROUP BY 1,4,5
ORDER BY 1,2
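Basically:
- First, I explode the date intervals so that each table has one row per day.
- Then I join the two day-by-day tables on user and day, so that the values from both tables sit side by side.
- Finally, I group by all the metadata values and take the MIN and MAX day, which recreates the intervals for every distinct combination of values coming from the different tables.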
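The three steps above can be sketched outside BigQuery as well. The following is a minimal pure-Python illustration, not a replacement for the SQL: the helper name `explode` is hypothetical, and the rows are the sample data from the query above.

```python
from datetime import date, timedelta


def explode(rows):
    """Map (user, day) -> value for every day of each inclusive interval."""
    daily = {}
    for date_start, date_end, user, value in rows:
        day = date_start
        while day <= date_end:
            daily[(user, day)] = value
            day += timedelta(days=1)
    return daily


# Sample rows copied from the question: (date_start, date_end, user, value).
input_1 = [
    (date(2021, 5, 1), date(2021, 5, 5), "user-1", "white"),
    (date(2021, 5, 6), date(2021, 5, 10), "user-1", "blue"),
    (date(2021, 5, 6), date(2021, 5, 10), "user-2", "red"),
]
input_2 = [
    (date(2021, 5, 3), date(2021, 5, 7), "user-1", "apple"),
    (date(2021, 5, 8), date(2021, 5, 11), "user-1", "cherry"),
    (date(2021, 5, 3), date(2021, 5, 11), "user-2", "banana"),
]

# Step 1: one entry per user and day for each table.
colors = explode(input_1)
fruits = explode(input_2)

# Steps 2 and 3: full join on (user, day), then MIN/MAX day
# per (user, color, fruit) combination.
groups = {}
for user, day in sorted(set(colors) | set(fruits)):
    key = (user, colors.get((user, day)), fruits.get((user, day)))
    first, last = groups.get(key, (day, day))
    groups[key] = (min(first, day), max(last, day))

merged = sorted(
    ((user, first, last, color, fruit)
     for (user, color, fruit), (first, last) in groups.items()),
    key=lambda row: row[:3],
)
for row in merged:
    print(row)
```

One thing this sketch makes visible: because the grouping key is only the values, two separate runs of the same combination collapse into one row. On the sample data, user-2 has no color and `banana` on 2021-05-03 to 2021-05-05 and again on 2021-05-11, and the sketch reports a single 2021-05-03 to 2021-05-11 row that overlaps the `red`/`banana` row. The same should be expected of a SQL `GROUP BY` on the value columns, since NULLs fall into one group there too.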
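Now, while this solution seems to work, it is only practical because the example is this small: 2 tables with one metadata column each.
My real scenario involves many more tables, each with several relevant columns. Say 10 tables with about 20 columns each.
My question is:
To build the MIN/MAX date intervals, I currently group by every metadata field I have: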
GROUP BY 1,4,5
With the number of tables and columns I mentioned, the resulting grouping would look like this:
GROUP BY 1,4,5,6,7,8,9,......,30,31,32,......40,41,42,...
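Is there a smarter way to get this result without grouping by every single column?
Something like a GROUP BY ALL EXCEPT 2,3 notation,
or some other kind of grouping that would help in this situation?
Here is another set of inputs, with some extra columns: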
input_1 AS (
SELECT DATE('2021-05-01') AS date_start, DATE('2021-05-05') AS date_end, "user-1" AS user, "white" AS color, "new-york" as city, "america" as country
UNION ALL
SELECT DATE('2021-05-06') AS date_start, DATE('2021-05-10') AS date_end, "user-1" AS user, "blue" AS color, "paris" as city, "france" as country
),
input_2 AS (
SELECT DATE('2021-05-03') AS date_start, DATE('2021-05-07') AS date_end, "user-1" AS user, "apple" AS fruit, "dog" as animal, "daisy" AS flower, "iron" AS metal
UNION ALL
SELECT DATE('2021-05-08') AS date_start, DATE('2021-05-09') AS date_end, "user-1" AS user, "cherry" AS fruit, "dog" as animal, "rose" as flower, "steel" as metal
UNION ALL
SELECT DATE('2021-05-10') AS date_start, DATE('2021-05-11') AS date_end, "user-1" AS user, "cherry" AS fruit, "dog" as animal, "rose" as flower, "iron" as metal
),
【Answer 1】: "Is there a smarter way to get this result without grouping by every single column?"
Consider the generic solution below:
select user,
  min(day) date_start,
  max(day) date_end,
  any_value(t).* except(user, day, grp)
from (
  -- grp: running count of change flags, so each run of identical values gets its own id
  select * except(flag),
    countif(ifnull(flag, true)) over(partition by user order by day) grp
  from (
    -- flag: true when the combined values differ from the previous day's (null on a user's first day)
    select * except(mask1, mask2),
      lag(ifnull(mask1, 'null') || ifnull(mask2, 'null')) over(partition by user order by day) != ifnull(mask1, 'null') || ifnull(mask2, 'null') as flag
    from (
      -- one row per user and day for input_1; mask1 serializes all of its metadata columns
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask1
      from input_1 t,
        unnest(generate_date_array(date_start, date_end)) day
    )
    full outer join (
      -- same for input_2, serialized into mask2
      select user, day, t.* except(user, date_start, date_end),
        to_json_string((select as struct * except(user, date_start, date_end) from unnest([t]))) mask2
      from input_2 t,
        unnest(generate_date_array(date_start, date_end)) day
    )
    using(user, day)
  )
) t
group by t.user, t.grp
# order by t.user, date_start
If applied to the sample data in your question, it outputs the merged intervals.
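The grouping idea in the query (serialize all metadata columns into one mask, start a new group whenever the mask changes from the previous day, then take the MIN and MAX day per group) can be illustrated with a minimal pure-Python sketch. It is an assumption-laden illustration rather than a port of the SQL: `explode` and `merge_intervals` are hypothetical names, `json.dumps` stands in for `to_json_string`, and the rows are the first sample inputs from the question.

```python
import json
from datetime import date, timedelta

KEYS = ("user", "date_start", "date_end")


def explode(rows):
    """Map (user, day) -> {column: value} for every day of each inclusive interval."""
    daily = {}
    for row in rows:
        meta = {k: v for k, v in row.items() if k not in KEYS}
        day = row["date_start"]
        while day <= row["date_end"]:
            daily[(row["user"], day)] = meta
            day += timedelta(days=1)
    return daily


def merge_intervals(*tables):
    """Merge any number of non-empty interval tables without naming their columns."""
    daily = [explode(t) for t in tables]
    columns = [k for t in tables for k in t[0] if k not in KEYS]
    result = []
    prev_user, prev_mask = None, None
    for user, day in sorted(set().union(*daily)):
        # Full outer join: columns of a table with no row that day stay None.
        record = dict.fromkeys(columns)
        for d in daily:
            record.update(d.get((user, day), {}))
        # One string compared instead of N columns (the "mask").
        mask = json.dumps(record, sort_keys=True, default=str)
        if user != prev_user or mask != prev_mask:
            # Values changed (or a new user starts): open a new run.
            result.append({"user": user, "date_start": day, "date_end": day, **record})
        else:
            # Same values as the previous row for this user: extend the run.
            result[-1]["date_end"] = day
        prev_user, prev_mask = user, mask
    return result


input_1 = [
    {"date_start": date(2021, 5, 1), "date_end": date(2021, 5, 5), "user": "user-1", "color": "white"},
    {"date_start": date(2021, 5, 6), "date_end": date(2021, 5, 10), "user": "user-1", "color": "blue"},
    {"date_start": date(2021, 5, 6), "date_end": date(2021, 5, 10), "user": "user-2", "color": "red"},
]
input_2 = [
    {"date_start": date(2021, 5, 3), "date_end": date(2021, 5, 7), "user": "user-1", "fruit": "apple"},
    {"date_start": date(2021, 5, 8), "date_end": date(2021, 5, 11), "user": "user-1", "fruit": "cherry"},
    {"date_start": date(2021, 5, 3), "date_end": date(2021, 5, 11), "user": "user-2", "fruit": "banana"},
]

merged = merge_intervals(input_1, input_2)
for row in merged:
    print(row)
```

Because runs are keyed by position rather than by value, a combination that reappears later starts a new row: in this sketch user-2 gets separate rows for 2021-05-03 to 2021-05-05 and for 2021-05-11, both with no color and `banana`. Adding a table or a column only changes the inputs, not the grouping code.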