BigQuery: Aggregate to distinct repeated fields

Posted: 2020-02-04 10:10:53

【Question】:

How do I aggregate into distinct repeated fields?

Imagine this data:

WITH data as (
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all 
 select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id,  'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
)

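I want the room ID and two sets of repeated fields: students and teachers. But when I run the query below, I get 4 entries in each array, and every attempt to insert DISTINCT returns an error.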

SELECT room_id, 
        struct(array_agg(name_student) as name, array_agg(age_student) as age) as students,
        struct(array_agg(name_teacher) as name, array_agg(id_teacher) as id) as teachers,

from data
group by 1

How can I get distinct arrays for students and for teachers?

The output should look like this

Thanks!

【Comments】:

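When you say "two sets of repeated fields", do you mean that you need two rows in your output? So it would have a repeated student name?

Repeated fields as defined in the BQ table structure.

OK, I understand. You can add another field to the group by aggregation. It is also possible to add another level of nesting with struct, but I don't understand how you want the output to look. Could you elaborate in your question?

I see you updated your question, and it is now clearer what the output should look like. However, your output leaves out the following row: '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher. Was that on purpose? Also, when student.name = mick, do you want it treated as a new record or nested inside room_id = 5a?

【Answer 1】:

This answer is a bit verbose, but it should do what you need. I prefer ARRAY_AGG(STRUCT()) over STRUCT(ARRAY_AGG(), ARRAY_AGG()) because it keeps the relationship between "George is 13" and "Jane is 14" (imagine adding a 14-year-old George to your list: how would you know which is which?).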

WITH data as (
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all 
 select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id,  'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
),
students_distinct as (
  select distinct room_id, name_student as name, age_student as age from data
),
students_agg as (
  select room_id,array_agg(struct(name,age)) as student from students_distinct group by 1
),
teachers_distinct as (
  select distinct room_id, name_teacher as name, id_teacher as id from data
),
teachers_agg as (
  select room_id,array_agg(struct(name,id)) as teacher from teachers_distinct group by 1
)
select room_id, s.student, t.teacher
from students_agg s
inner join teachers_agg t using(room_id)
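
For comparison, here is a more compact sketch of the same idea: aggregate the structs once per room, then deduplicate each array with SELECT DISTINCT AS STRUCT over UNNEST. This is an alternative sketch, not the answer author's query. The names grouped, students_raw and teachers_raw are made up for illustration, and the order of elements inside the arrays is not guaranteed.

```sql
WITH data AS (
  SELECT '5a' AS room_id, 'george' AS name_student, 13 AS age_student, 'Mr. Smith' AS name_teacher, 43 AS id_teacher
  UNION ALL
  SELECT '5a', 'george', 13, 'Mr. Climp', 38
  UNION ALL
  SELECT '5a', 'jane', 14, 'Mr. Smith', 43
  UNION ALL
  SELECT '5a', 'jane', 14, 'Mr. Climp', 38
),
grouped AS (
  -- One row per room; the arrays still contain duplicates at this point.
  SELECT
    room_id,
    ARRAY_AGG(STRUCT(name_student AS name, age_student AS age)) AS students_raw,
    ARRAY_AGG(STRUCT(name_teacher AS name, id_teacher AS id)) AS teachers_raw
  FROM data
  GROUP BY room_id
)
SELECT
  room_id,
  -- DISTINCT is applied to whole (name, age) pairs, so each pairing survives.
  ARRAY(SELECT DISTINCT AS STRUCT name, age FROM UNNEST(students_raw)) AS student,
  -- Same for (name, id) pairs of teachers.
  ARRAY(SELECT DISTINCT AS STRUCT name, id FROM UNNEST(teachers_raw)) AS teacher
FROM grouped
```

Because the deduplication happens on the struct as a whole, a second George with a different age would remain a separate element rather than being merged away.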

【Comments】:

【Answer 2】:

I ran your query with DISTINCT added to every array_agg call, and it works fine.

WITH data as (
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all 
 select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all 
 select '5a' as room_id,  'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
)
SELECT room_id, 
        struct(array_agg(distinct name_student) as name, array_agg(distinct  age_student) as age) as students,
        struct(array_agg(distinct name_teacher) as name, array_agg(distinct  id_teacher) as id) as teachers
from data
group by 1

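However, if you are trying to get a list of students with their ages and a list of teachers with their IDs, I'm not sure this will work correctly on a real dataset. For example, adding select '5a' as room_id, 'george' as name_student, 20 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher to the data causes a problem: the names and the ages are deduplicated independently, so the tuple (george, 20) is lost.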
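
The mismatch can be made visible with a reduced sketch that compares the lengths of the two DISTINCT arrays. The teacher columns are omitted here for brevity, and the column names n_names and n_ages are made up for illustration.

```sql
WITH data AS (
  SELECT '5a' AS room_id, 'george' AS name_student, 13 AS age_student
  UNION ALL
  SELECT '5a', 'jane', 14
  UNION ALL
  SELECT '5a', 'george', 20  -- a second, older George
)
SELECT
  room_id,
  -- Two distinct names: george, jane
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT name_student)) AS n_names,
  -- Three distinct ages: 13, 14, 20
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT age_student)) AS n_ages
FROM data
GROUP BY room_id
```

With two names and three ages, the arrays can no longer be matched up position by position, which is why aggregating (name, age) as a single struct is the safer approach.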

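【Comments】:

That is exactly the problem. George, 20 gets split apart and lost.

It wasn't clear from the question that you wanted to do that. I guessed, because it seemed the most reasonable thing. @rtenha's answer is perfect. But you would need a student_id, because "name, age" is not very safe as a unique ID. Age also changes from one year to the next. What happens next year? You will have the same student with a different "key". If you need to run analyses across years, this will make the results confusing, especially if the age is updated on the actual birthday. In that case, even queries spanning months or weeks will return confusing results.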
