BigQuery: Aggregate to distinct repeated fields
【Title】: BigQuery: Aggregate to distinct repeated fields 【Posted】: 2020-02-04 10:10:53 【Question】: How do I aggregate into distinct repeated fields?
Imagine this data:
WITH data as (
select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all
select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all
select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all
select '5a' as room_id, 'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
)
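I want the room ID and two sets of repeated fields: students and teachers. But when I run the query below, I get 4 entries in each array (one per input row), and every attempt to insert DISTINCT returns an error.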
SELECT room_id,
struct(array_agg(name_student) as name, array_agg(age_student) as age) as students,
struct(array_agg(name_teacher) as name, array_agg(id_teacher) as id) as teachers,
from data
group by 1
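How can I get distinct arrays for students and for teachers?
The output should look like this: [expected-output image not preserved]
Thanks!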
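【Comments】:
When you say "two sets of repeated fields", do you mean you need two rows in your output, so that the student name is repeated?
Repeated fields as defined in a BQ table schema.
OK, I see. You can add another field to the GROUP BY aggregation. You could also use STRUCT for another level of nesting, but I don't understand what you want the output to look like. Can you elaborate in your question?
I see you updated your question, and it is now clearer what the output should look like. However, your output omits the following row: '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher. Was that intentional? Also, when student.name = mick, do you want to treat it as new data or nest it inside room_id = 5a?
【Answer 1】: This answer is a bit verbose, but it should do what you need. I prefer ARRAY_AGG(STRUCT()) over STRUCT(ARRAY_AGG(), ARRAY_AGG()) because it preserves the relationships "George is 13" and "Jane is 14" (imagine adding a 14-year-old George to your list: how would you know which is which?).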
WITH data as (
select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all
select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all
select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all
select '5a' as room_id, 'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
),
students_distinct as (
select distinct room_id, name_student as name, age_student as age from data
),
students_agg as (
select room_id,array_agg(struct(name,age)) as student from students_distinct group by 1
),
teachers_distinct as (
select distinct room_id, name_teacher as name, id_teacher as id from data
),
teachers_agg as (
select room_id,array_agg(struct(name,id)) as teacher from teachers_distinct group by 1
)
select room_id, s.student, t.teacher
from students_agg s
inner join teachers_agg t using(room_id)
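To illustrate why an array of structs preserves the name/age pairing, here is a minimal sketch that flattens such an array back into rows with UNNEST. The rooms CTE and its literal values are a hand-written stand-in for the result of the query above, not actual output.

```sql
-- Hypothetical stand-in for the result of the query above:
-- one row per room, with students stored as an ARRAY of STRUCTs.
WITH rooms AS (
  SELECT
    '5a' AS room_id,
    [STRUCT('george' AS name, 13 AS age),
     STRUCT('jane' AS name, 14 AS age)] AS student
)
-- UNNEST yields one row per array element,
-- so each name stays attached to its own age.
SELECT room_id, s.name, s.age
FROM rooms, UNNEST(student) AS s
```

With two parallel arrays (one of names, one of ages) there is no such guarantee, because each array is built and deduplicated on its own.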
【Comments】:
【Answer 2】: I ran your query with DISTINCT added to every ARRAY_AGG call, and it works fine.
WITH data as (
select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Smith' as name_teacher, 43 as id_teacher
union all
select '5a' as room_id, 'george' as name_student, 13 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
union all
select '5a' as room_id, 'jane' as name_student, 14 as age_student , 'Mr. Smith' as name_teacher, 43 as id_teacher
union all
select '5a' as room_id, 'jane' as name_student, 14 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
)
SELECT room_id,
struct(array_agg(distinct name_student) as name, array_agg(distinct age_student) as age) as students,
struct(array_agg(distinct name_teacher) as name, array_agg(distinct id_teacher) as id) as teachers
from data
group by 1
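However, if you are trying to get a list of students with their ages and a list of teachers with their IDs, I am not sure this will work correctly on a real dataset. For example, adding
select '5a' as room_id, 'george' as name_student, 20 as age_student, 'Mr. Climp' as name_teacher, 38 as id_teacher
to the data table causes a problem: names and ages are deduplicated independently, so the tuple (george, 20) is lost.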
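The loss can be reproduced with a reduced version of the data. This is only a sketch: the teacher columns are dropped for brevity, and the extra (george, 20) row is the one suggested above.

```sql
WITH data AS (
  SELECT '5a' AS room_id, 'george' AS name_student, 13 AS age_student
  UNION ALL
  SELECT '5a', 'jane', 14
  UNION ALL
  SELECT '5a', 'george', 20  -- second student who is also named george
)
SELECT
  room_id,
  -- Each array is deduplicated independently of the other.
  ARRAY_AGG(DISTINCT name_student) AS name,  -- 2 elements: george appears once
  ARRAY_AGG(DISTINCT age_student) AS age     -- 3 elements: 13, 14, 20
FROM data
GROUP BY room_id
```

The two arrays end up with different lengths, so an age can no longer be matched to a name by position.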
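【Comments】:
That is exactly the problem. (George, 20) gets split apart and lost.
It was not clear from the question that you wanted to do this. I guessed, because it seemed the most reasonable thing.
@rtenha's answer is perfect. But you need a student_id, because "name, age" is not very safe as a unique ID. Age also changes from one year to the next. What happens next year? You will have the same student under a different "key". If you need to run analyses across years, this will make the results confusing. In particular, if the age is updated on the actual birthday, even queries spanning months or weeks will return confusing results.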
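As a rough sketch of the last comment's suggestion, the deduplication step could key on a stable identifier rather than on (name, age). The student_id column and the sample values below are hypothetical, since the original data has no such column, and taking MAX(age_student) as the current age is an assumption.

```sql
-- student_id is a hypothetical column; sample values are made up for illustration.
WITH data AS (
  SELECT '5a' AS room_id, 1 AS student_id, 'george' AS name_student, 13 AS age_student
  UNION ALL
  SELECT '5a', 1, 'george', 14  -- same student, recorded after a birthday
  UNION ALL
  SELECT '5a', 2, 'jane', 14
),
students_distinct AS (
  -- One row per (room, student_id), no matter how the age changed.
  SELECT
    room_id,
    student_id,
    ANY_VALUE(name_student) AS name,
    MAX(age_student) AS age  -- assumption: the highest age seen is the current one
  FROM data
  GROUP BY room_id, student_id
)
SELECT
  room_id,
  ARRAY_AGG(STRUCT(student_id AS id, name, age)) AS student
FROM students_distinct
GROUP BY room_id
```

Keyed this way, the birthday update collapses into a single george entry instead of producing two students.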