Removing duplicates and selecting distinct values with struct array in BigQuery
【Title】: Removing duplicates and selecting distinct values with struct array in Bigquery
【Posted】: 2020-01-31 15:48:03
【Question】: I am just starting to use BigQuery. I have a table that looks like this, which can be generated with:
WITH T AS (
SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(2, "dot"), (2, "dot"), (1, "string")] AS arr, DATE(2020,01,31) AS date UNION ALL
SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(2, "dot"), (2, "dot"), (1, "string")] AS arr, DATE(2020,01,31) AS date UNION ALL
SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(20, "dot"), (20, "dot"), (1, "string")] AS arr, DATE(2020,01,30) AS date UNION ALL
SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (212, "plane"), (156, "cube")] AS arr, DATE(2020,01,31) AS date UNION ALL
SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (212, "plane"), (156, "cube")] AS arr, DATE(2020,01,31) AS date UNION ALL
SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (21, "plane"), (156, "cube")] AS arr, DATE(2020,01,30) AS date UNION ALL
SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (2, "plane"), (156, "cube")] AS arr, DATE(2020,01,30) AS date UNION ALL
SELECT 1 AS id, 'blue' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(4, "cube"), (4, "cube"), (4, "cube")], DATE(2020, 01, 31) AS date UNION ALL
SELECT 2 AS id, 'orange' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(5, "string")], DATE(2020,01,31) AS date UNION ALL
SELECT 2 AS id, 'orange' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(5, "string")], DATE(2020,01,30) AS date
)
SELECT *
FROM T;
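I want to select each distinct date and, for each date, take the maximum count per shape, per id and per colour. For example, for 2020-01-31 and id 0 / red, that would be 2 dot and 1 string; for 2020-01-30 and id 0 / black, it would be 296 dot, 21 plane and 156 cube. Duplicates can occur across rows, across dates, and inside the struct array itself.
More precisely, I would like the query result to look like this, which can be generated with:
WITH T AS (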
SELECT DATE(2020,01,31) AS date, ARRAY<STRUCT<count INT64, shape STRING, id INT64, colour STRING>>[(2, "dot", 0, "red"), (1, "string", 0, "red"), (296, "dot", 0, "black"), (212, "plane", 0, "black"), (156, "cube", 0, "black"), (4, "cube", 1, "blue"), (5, "string", 2, "orange")] AS res UNION ALL
SELECT DATE(2020,01,30) AS date, ARRAY<STRUCT<count INT64, shape STRING, id INT64, colour STRING>>[(20, "dot", 0, "red"), (1, "string", 0, "red"), (296, "dot", 0, "black"), (21, "plane", 0, "black"), (156, "cube", 0, "black"), (5, "string", 2, "orange")] AS res
)
SELECT *
FROM T;
I am struggling with two things: removing the duplicates, and attaching the id and colour to each row of the array. For example, the query
SELECT date, ARRAY_CONCAT_AGG(ARRAY((SELECT AS STRUCT MAX(count), shape FROM UNNEST(arr) GROUP BY shape)))
FROM T
GROUP BY date
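still returns duplicates. After that, I would also need to assign the id and colour to each nested row. Any suggestions would be much appreciated.
Thanks!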
【Answer 1】:
-- T is the sample table from the question; its definition is elided here
with T AS (),
unnested_and_unique as (
select distinct id, colour, count, shape, date
from T left join unnest(arr) x
)
select date,array_agg(struct(count,shape,id,colour)) as res
from unnested_and_unique
group by 1
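Note that select distinct only removes rows that are identical in every column. In the sample data, id 0 / black on 2020-01-30 has plane with count 21 in one row and count 2 in another, so both would survive the distinct. The following is a minimal sketch of a variant that keeps only the maximum count, assuming the goal is one entry per (date, id, colour, shape) as described in the question. The CTE name max_per_shape, the column name max_count and the reduced two-row sample are illustrative choices, not part of the original answer.

```sql
WITH T AS (
  -- Reduced sample: only the two 'black' rows for 2020-01-30 from the question
  SELECT 0 AS id, 'black' AS colour,
         ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (21, "plane"), (156, "cube")] AS arr,
         DATE(2020, 1, 30) AS date UNION ALL
  SELECT 0, 'black',
         ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (2, "plane"), (156, "cube")],
         DATE(2020, 1, 30)
),
max_per_shape AS (
  -- Flatten the array, then keep the largest count per (date, id, colour, shape)
  SELECT date, id, colour, x.shape AS shape, MAX(x.count) AS max_count
  FROM T
  LEFT JOIN UNNEST(arr) AS x
  GROUP BY date, id, colour, x.shape
)
-- Rebuild one array of structs per date
SELECT date, ARRAY_AGG(STRUCT(max_count AS count, shape, id, colour)) AS res
FROM max_per_shape
GROUP BY date;
```

On this reduced sample the result should contain a single plane entry with count 21. The order of elements inside res is not guaranteed unless an ORDER BY is added inside ARRAY_AGG.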
【Comments】:
Oh, silly me, I was cross joining! Thanks for your answer.