BigQuery - Get most recent data for each individual user

【Title】: BigQuery - Get most recent data for each individual user 【Posted】: 2021-04-14 07:57:31 【Question】:

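I wonder if anyone here can help with a BigQuery query I am working on.

It needs to pull the most recent gplus/Currents activity for every user in the domain. I tried the following query, but it pulls all activity for every user: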

SELECT
  TIMESTAMP_MICROS(time_usec) AS date,
  email,
  event_type,
  event_name
FROM
  `bqadminreporting.adminlogtracking.activity`
WHERE
  record_type LIKE 'gplus'
ORDER BY
  email ASC;

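I have tried DISTINCT, but I still get multiple entries for the same user. Ideally I also need to look back over 90 days (that is, between today and 90 days ago, get the most recent activity for each user, if that makes sense), which leads me on to another question.

Edit: sample data and expected output.

Fields: there are 500+ fields; I have listed only the relevant ones.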

+--------------------------------+---------+----------+
|           Field name           |  Type   |   Mode   |
+--------------------------------+---------+----------+
| time_usec                      | INTEGER | NULLABLE |
| email                          | STRING  | NULLABLE |
| event_type                     | STRING  | NULLABLE |
| event_name                     | STRING  | NULLABLE |
| record_type                    | STRING  | NULLABLE |
| gplus                          | RECORD  | NULLABLE |
| gplus.log_event_resource_name  | STRING  | NULLABLE |
| gplus.attachment_type          | STRING  | NULLABLE |
| gplus.plusone_context          | STRING  | NULLABLE |
| gplus.post_permalink           | STRING  | NULLABLE |
| gplus.post_resource_name       | STRING  | NULLABLE |
| gplus.comment_resource_name    | STRING  | NULLABLE |
| gplus.post_visibility          | STRING  | NULLABLE |
| gplus.user_type                | STRING  | NULLABLE |
| gplus.post_author_name         | STRING  | NULLABLE |
+--------------------------------+---------+----------+

Output of my query: this is what I get when I run the query above.

+-----+--------------------------------+------------------+----------------+----------------+
| Row |              date              |      email       |   event_type   |   event_name   |
+-----+--------------------------------+------------------+----------------+----------------+
|   1 | 2020-01-30 07:10:19.088 UTC    | user1@domain.com | post_change    | create_post    |
|   2 | 2020-03-03 08:47:25.086485 UTC | user1@domain.com | coment_change  | create_comment |
|   3 | 2020-03-23 09:10:09.522 UTC    | user1@domain.com | post_change    | create_post    |
|   4 | 2020-03-23 09:49:00.337 UTC    | user1@domain.com | plusone_change | remove_plusone |
|   5 | 2020-03-23 09:48:10.461 UTC    | user1@domain.com | plusone_change | add_plusone    |
|   6 | 2020-01-30 10:04:29.757005 UTC | user1@domain.com | coment_change  | create_comment |
|   7 | 2020-03-28 08:52:50.711359 UTC | user2@domain.com | coment_change  | create_comment |
|   8 | 2020-11-08 10:08:09.161325 UTC | user2@domain.com | coment_change  | create_comment |
|   9 | 2020-04-21 15:28:10.022683 UTC | user3@domain.com | coment_change  | create_comment |
|  10 | 2020-03-28 09:37:28.738863 UTC | user4@domain.com | coment_change  | create_comment |
+-----+--------------------------------+------------------+----------------+----------------+

Desired result: only one row per user, showing only the most recent event.

+-----+--------------------------------+------------------+----------------+----------------+
| Row |              date              |      email       |   event_type   |   event_name   |
+-----+--------------------------------+------------------+----------------+----------------+
|   1 | 2020-03-23 09:49:00.337 UTC    | user1@domain.com | plusone_change | remove_plusone |
|   2 | 2020-11-08 10:08:09.161325 UTC | user2@domain.com | coment_change  | create_comment |
|   3 | 2020-04-21 15:28:10.022683 UTC | user3@domain.com | coment_change  | create_comment |
|   4 | 2020-03-28 09:37:28.738863 UTC | user4@domain.com | coment_change  | create_comment |
+-----+--------------------------------+------------------+----------------+----------------+

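【Comments】:

Could you show sample input data and the expected output?

I have updated my question to show the field types, my current output, and my desired output.

【Answer 1】:

If you want all columns from the most recent row, you can use this BigQuery syntax: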

select array_agg(t order by date desc limit 1)[ordinal(1)].*
from mytable t
group by t.email;

If you want specific columns, Sergey's solution may be simpler.
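
As a rough sketch of how the pattern above might look against the table from the question: this assumes the ordering column is `time_usec` (the question's table has no `date` column) and reuses the question's `record_type` filter. It has not been run against the real dataset, and with 500+ fields the `.*` expansion returns every column.

```sql
-- Sketch only: Answer 1's array_agg pattern applied to the asker's table.
-- Assumption: time_usec is the column that defines "most recent".
SELECT
  ARRAY_AGG(t ORDER BY t.time_usec DESC LIMIT 1)[ORDINAL(1)].*
FROM `bqadminreporting.adminlogtracking.activity` AS t
WHERE t.record_type = 'gplus'
GROUP BY t.email;
```

`ORDINAL(1)` is the 1-based way to take the first array element; it is equivalent to `OFFSET(0)`.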

【Comments】:

【Answer 2】:

Use array_agg:

select 
  email,
  array_agg(STRUCT(TIMESTAMP_MICROS(time_usec) as date, event_type, event_name) ORDER BY time_usec desc LIMIT 1)[OFFSET(0)].*
from `bqadminreporting.adminlogtracking.activity`
where
  record_type LIKE 'gplus'
  and time_usec > unix_micros(timestamp_sub(current_timestamp(), interval 90 day))
group by email
order by email
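
Because `time_usec` is stored as an INTEGER count of microseconds, the 90-day cutoff in the WHERE clause has to be converted to the same unit. A small sketch for inspecting that cutoff on its own (the column aliases `cutoff_ts` and `cutoff_usec` are illustrative, not from the original post):

```sql
-- Sketch: show the 90-day cutoff both as a TIMESTAMP and as microseconds
-- since the Unix epoch, the unit that time_usec is compared against.
SELECT
  TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY) AS cutoff_ts,
  UNIX_MICROS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)) AS cutoff_usec;
```

Comparing the raw integer column with a precomputed integer, rather than wrapping `time_usec` in `TIMESTAMP_MICROS` inside the filter, keeps the predicate on the stored value.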

Test example:

with mytable as (
  select timestamp '2020-01-30 07:10:19.088 UTC' as date, 'user1@domain.com' as email, 'post_change' as event_type, 'create_post' as event_name union all
  select timestamp '2020-03-03 08:47:25.086485 UTC', 'user1@domain.com', 'coment_change', 'create_comment' union all
  select timestamp '2020-03-23 09:10:09.522 UTC', 'user1@domain.com', 'post_change', 'create_post' union all
  select timestamp '2020-03-23 09:49:00.337 UTC', 'user1@domain.com', 'plusone_change', 'remove_plusone' union all
  select timestamp '2020-03-23 09:48:10.461 UTC', 'user1@domain.com', 'plusone_change', 'add_plusone' union all
  select timestamp '2020-01-30 10:04:29.757005 UTC', 'user1@domain.com', 'coment_change', 'create_comment' union all
  select timestamp '2020-03-28 08:52:50.711359 UTC', 'user2@domain.com', 'coment_change', 'create_comment' union all
  select timestamp '2020-11-08 10:08:09.161325 UTC', 'user2@domain.com', 'coment_change', 'create_comment' union all
  select timestamp '2020-04-21 15:28:10.022683 UTC', 'user3@domain.com', 'coment_change', 'create_comment' union all
  select timestamp '2020-03-28 09:37:28.738863 UTC', 'user4@domain.com', 'coment_change', 'create_comment'
)
select 
  email,
  array_agg(STRUCT(date, event_type, event_name) ORDER BY date desc LIMIT 1)[OFFSET(0)].*
from mytable
group by email
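
For comparison, a common alternative that none of the answers here use is the `ROW_NUMBER()` window function: number each user's rows from newest to oldest and keep row 1. The sketch below is self-contained and uses three rows taken from the question's sample output; the names `sample` and `rn` are illustrative only.

```sql
-- Sketch: rank each user's rows by recency and keep only the newest one.
-- `sample` and `rn` are hypothetical names introduced for this example.
WITH sample AS (
  SELECT TIMESTAMP '2020-03-23 09:49:00.337 UTC' AS date, 'user1@domain.com' AS email, 'plusone_change' AS event_type, 'remove_plusone' AS event_name UNION ALL
  SELECT TIMESTAMP '2020-03-23 09:48:10.461 UTC', 'user1@domain.com', 'plusone_change', 'add_plusone' UNION ALL
  SELECT TIMESTAMP '2020-11-08 10:08:09.161325 UTC', 'user2@domain.com', 'coment_change', 'create_comment'
)
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    -- rn = 1 marks the most recent row within each email partition
    ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) AS rn
  FROM sample
)
WHERE rn = 1
ORDER BY email;
```

Like the array_agg approach, this keeps exactly one row per user even when two events share the same timestamp, although which of the tied rows survives is not deterministic.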

【Comments】:

Ah! Amazing. If I only wanted to look back over 90 days, do you know how I would do that, Sergey?

Oh, I forgot. I have added time_usec > unix_micros(timestamp_sub(current_timestamp(), interval 90 day)).

【Answer 3】:

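Another way to solve the problem is to compute each user's latest timestamp and join it back to the table (date1 is the timestamp column). Unlike the array_agg answers, this returns more than one row for a user whose latest timestamp is shared by several events.

select t.*
from mytable t
join (
  -- latest timestamp per user
  select email, max(date1) as max_dt
  from mytable
  group by email
) m
  on t.email = m.email and t.date1 = m.max_dt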

【Comments】:
