BigQuery - Get distinct values up to a given point in time
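【Title】: BigQuery - Get distinct values up to a given point in time
【Posted】: 2019-04-14 12:12:53
【Question】: I have a table of user actions (e.g. viewing a page, clicking a button). Each row contains a user_id, a date (created_on), and the name of the action. I want to write a query that, for each date, produces a nested field containing the distinct actions taken up to and including that date. For example, a table named user_actions: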
-------------------------------------
| user_id | date | action |
-------------------------------------
| 1 | 2018-04-01 | click |
| 2 | 2018-04-01 | view |
| 1 | 2018-04-02 | view |
| 2 | 2018-04-02 | view |
| 2 | 2018-04-03 | buy |
-------------------------------------
would result in
-------------------------------------
| user_id | date | actions |
-------------------------------------
| 1 | 2018-04-01 | click |
| 2 | 2018-04-01 | view |
| 1 | 2018-04-02 | click |
| | | view |
| 2 | 2018-04-02 | view |
| 2 | 2018-04-03 | view |
| | | buy |
-------------------------------------
In the second table, actions is a nested repeated field. I know that for a single point in time I can use something like the following:
SELECT
  user_id,
  date,
  ARRAY_AGG(DISTINCT action) AS actions
FROM
  user_actions
GROUP BY
  1, 2
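However, I'm not sure how to extend this so that the same calculation is done for every date in the original table, looking only at rows up to and including that row's date field.
Any help would be greatly appreciated. Thanks!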
【Answer 1】: Create a nested field containing the distinct actions taken up to and including that date
The following is for BigQuery Standard SQL:
#standardSQL
SELECT user_id, date,
  ARRAY(
    SELECT DISTINCT action FROM UNNEST(actions) action
  ) actions
FROM (
  SELECT user_id, date, ARRAY_AGG(action) OVER(win) actions
  FROM `project.dataset.table`
  WINDOW win AS (
    PARTITION BY user_id ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)
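To make the two stages of the query above concrete, here is a minimal plain-Python model of what it computes: a running collection of actions per user (the window frame), followed by de-duplication (the DISTINCT subquery). This is only a sketch for reasoning about the logic, not BigQuery code; the function name `running_distinct_actions` is made up for illustration, and the model keeps first-seen order, whereas BigQuery does not guarantee element order inside the array.

```python
from collections import defaultdict

# Sample rows from the question: (user_id, date, action)
rows = [
    (1, "2018-04-01", "click"),
    (2, "2018-04-01", "view"),
    (1, "2018-04-02", "view"),
    (2, "2018-04-02", "view"),
    (2, "2018-04-03", "buy"),
]

def running_distinct_actions(rows):
    """Per user, ordered by date: collect every action seen so far,
    then drop duplicates (hypothetical helper, models the query)."""
    seen_so_far = defaultdict(list)  # user_id -> actions in the frame so far
    result = []
    # PARTITION BY user_id ORDER BY date
    for user_id, date, action in sorted(rows, key=lambda r: (r[0], r[1])):
        # ARRAY_AGG(action) OVER (... UNBOUNDED PRECEDING AND CURRENT ROW)
        seen_so_far[user_id].append(action)
        # SELECT DISTINCT action FROM UNNEST(actions)
        distinct = list(dict.fromkeys(seen_so_far[user_id]))
        result.append((user_id, date, distinct))
    return result

for row in running_distinct_actions(rows):
    print(row)
```

Each output tuple corresponds to one row of the query result, with the list standing in for the repeated actions field.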
You can test this with the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 user_id, '2018-04-01' date, 'click' action UNION ALL
  SELECT 2, '2018-04-01', 'view' UNION ALL
  SELECT 1, '2018-04-02', 'view' UNION ALL
  SELECT 2, '2018-04-02', 'view' UNION ALL
  SELECT 2, '2018-04-03', 'buy'
)
SELECT user_id, date,
  ARRAY(
    SELECT DISTINCT action FROM UNNEST(actions) action
  ) actions
FROM (
  SELECT user_id, date, ARRAY_AGG(action) OVER(win) actions
  FROM `project.dataset.table`
  WINDOW win AS (
    PARTITION BY user_id ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)
-- ORDER BY date, user_id
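Result (expected for the sample data above; row order and the order of elements inside actions are not guaranteed unless you add ORDER BY):
-------------------------------------
| user_id | date | actions |
-------------------------------------
| 1 | 2018-04-01 | click |
| 2 | 2018-04-01 | view |
| 1 | 2018-04-02 | click |
| | | view |
| 2 | 2018-04-02 | view |
| 2 | 2018-04-03 | view |
| | | buy |
-------------------------------------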
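Update
The version below supports the more general case where the same user takes several actions on the same day (I realized my original answer does not cover this case):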
#standardSQL
SELECT user_id, date,
  ARRAY(
    SELECT DISTINCT action FROM UNNEST(SPLIT(actions)) action
  ) actions
FROM (
  SELECT user_id, date, STRING_AGG(actions) OVER(win) actions
  FROM (
    SELECT user_id, date, STRING_AGG(DISTINCT action) actions
    FROM `project.dataset.table`
    GROUP BY user_id, date
  )
  WINDOW win AS (
    PARTITION BY user_id ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)
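The reason the extra GROUP BY step matters can be sketched in plain Python. With a ROWS frame, every input row produces its own output row and sees only the rows up to itself, so two actions on the same day yield two output rows for that day, one of which misses an action (and which one is arbitrary, since rows with equal dates have no defined order). Collapsing each (user_id, date) into a single row first avoids this. The function names and the three-row sample below are illustrative assumptions, not part of the answer.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical sample: user 1 takes two actions on 2018-04-02.
rows = [
    (1, "2018-04-01", "click"),
    (1, "2018-04-02", "view"),
    (1, "2018-04-02", "play"),
]

def sort_key(row):
    return (row[0], row[1])  # user_id, date

def per_row_frame(rows):
    """Model of the original query: one output row per input row,
    each seeing only the rows up to itself in sort order."""
    seen = defaultdict(list)
    out = []
    for user_id, date, action in sorted(rows, key=sort_key):
        seen[user_id].append(action)
        out.append((user_id, date, list(dict.fromkeys(seen[user_id]))))
    return out

def per_day_frame(rows):
    """Model of the updated query: collapse each (user_id, date) into
    one row first, then accumulate across days."""
    seen = defaultdict(list)
    out = []
    for (user_id, date), group in groupby(sorted(rows, key=sort_key), key=sort_key):
        seen[user_id].extend(action for _, _, action in group)
        out.append((user_id, date, list(dict.fromkeys(seen[user_id]))))
    return out

print(per_row_frame(rows))  # two rows for 2018-04-02, the first one incomplete
print(per_day_frame(rows))  # a single, complete row for 2018-04-02
```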
You can test it with the sample data below (note the extra row with action = 'play'):
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 user_id, DATE '2018-04-01' date, 'click' action UNION ALL
  SELECT 2, '2018-04-01', 'view' UNION ALL
  SELECT 1, '2018-04-02', 'view' UNION ALL
  SELECT 1, '2018-04-02', 'play' UNION ALL
  SELECT 2, '2018-04-02', 'view' UNION ALL
  SELECT 2, '2018-04-03', 'buy'
)
SELECT user_id, date,
  ARRAY(
    SELECT DISTINCT action FROM UNNEST(SPLIT(actions)) action
  ) actions
FROM (
  SELECT user_id, date, STRING_AGG(actions) OVER(win) actions
  FROM (
    SELECT user_id, date, STRING_AGG(DISTINCT action) actions
    FROM `project.dataset.table`
    GROUP BY user_id, date
  )
  WINDOW win AS (
    PARTITION BY user_id ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)
-- ORDER BY date, user_id
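Result (expected for the sample data above; row order and the order of elements inside actions are not guaranteed unless you add ORDER BY):
-------------------------------------
| user_id | date | actions |
-------------------------------------
| 1 | 2018-04-01 | click |
| 2 | 2018-04-01 | view |
| 1 | 2018-04-02 | click |
| | | view |
| | | play |
| 2 | 2018-04-02 | view |
| 2 | 2018-04-03 | view |
| | | buy |
-------------------------------------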
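One caveat worth keeping in mind with the STRING_AGG / SPLIT round trip: both functions default to a comma delimiter, so an action name that itself contains a comma would be split into separate values. The plain-Python sketch below models that behaviour with join and split; the action name 'add,remove' and the helper `round_trip` are hypothetical, chosen only to illustrate the effect. In BigQuery, both STRING_AGG and SPLIT accept an explicit delimiter argument, so picking a delimiter that cannot occur in the data is one possible workaround.

```python
def round_trip(actions, delimiter=","):
    """Join values into one string and split them back, modelling
    STRING_AGG(action, delimiter) followed by SPLIT(actions, delimiter)."""
    joined = delimiter.join(actions)
    return joined.split(delimiter)

# Ordinary action names survive the round trip.
print(round_trip(["view", "buy"]))

# A (hypothetical) action name containing a comma is broken apart.
print(round_trip(["add,remove", "buy"]))

# With a delimiter that cannot appear in the data, it survives intact.
print(round_trip(["add,remove", "buy"], delimiter="\x1f"))
```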
【Comments】:
Wow! Thanks for the great answer! The window function part was what I had not figured out.