Removing duplicates and selecting distinct values with struct array in BigQuery

【Title】Removing duplicates and selecting distinct values with struct array in BigQuery 【Posted】2020-01-31 15:48:03 【Question】:

I'm getting started with BigQuery. I have a table that looks like this, which can be generated with

WITH T AS (
  SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(2, "dot"), (2, "dot"), (1, "string")] AS arr, DATE(2020,01,31) AS date UNION ALL
  SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(2, "dot"), (2, "dot"), (1, "string")] AS arr, DATE(2020,01,31) AS date UNION ALL
  SELECT 0 AS id, 'red' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(20, "dot"), (20, "dot"), (1, "string")] AS arr, DATE(2020,01,30) AS date UNION ALL
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (212, "plane"), (156, "cube")] AS arr, DATE(2020,01,31) AS date UNION ALL
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (212, "plane"), (156, "cube")] AS arr, DATE(2020,01,31) AS date UNION ALL
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (21, "plane"), (156, "cube")] AS arr, DATE(2020,01,30) AS date UNION ALL
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (2, "plane"), (156, "cube")] AS arr, DATE(2020,01,30) AS date UNION ALL
  SELECT 1 AS id, 'blue' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(4, "cube"), (4, "cube"), (4, "cube")], DATE(2020, 01, 31) AS date UNION ALL
  SELECT 2 AS id, 'orange' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(5, "string")], DATE(2020,01,31) AS date UNION ALL
  SELECT 2 AS id, 'orange' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(5, "string")], DATE(2020,01,30) AS date
)
SELECT *
FROM T;

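I want to select each distinct date and, for each date, take the maximum count per shape, per id, and per colour. For example, for 2020-01-31 and id 0 red, that would be 2 dot and 1 string; for 2020-01-30 and id 0 black, it would be 296 dot, 21 plane, and 156 cube. Duplicates may exist across rows, across dates, and within the struct arrays.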

More precisely, I'd like the query result to look like this, which can be generated by

WITH T AS (
  SELECT DATE(2020,01,31) AS date, ARRAY<STRUCT<count INT64, shape STRING, id INT64, colour STRING>>[(2, "dot", 0, "red"), (1, "string", 0, "red"), (296, "dot", 0, "black"), (212, "plane", 0, "black"), (156, "cube", 0, "black"), (4, "cube", 1, "blue"), (5, "string", 2, "orange")] AS res UNION ALL
  SELECT DATE(2020,01,30) AS date, ARRAY<STRUCT<count INT64, shape STRING, id INT64, colour STRING>>[(20, "dot", 0, "red"), (1, "string", 0, "red"), (296, "dot", 0, "black"), (21, "plane", 0, "black"), (156, "cube", 0, "black"), (5, "string", 2, "orange")] AS res
)
SELECT *
FROM T;

I'm struggling with two things: removing duplicates, and selecting id and shape for each row of the array. For example, the query

SELECT date, ARRAY_CONCAT_AGG(ARRAY((SELECT AS STRUCT MAX(count), shape FROM UNNEST(arr) GROUP BY shape)))
FROM T
GROUP BY date

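returns duplicates. After that I would still need to assign the id and colour to each nested row. Any suggestions would be much appreciated.

Thanks!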

【Answer 1】:
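-- T is the table defined in the question; its body is omitted here
with T AS (),
unnested_and_unique as (
  -- flatten each array element into its own row and drop exact duplicates
  select distinct id, colour, count, shape, date
  from T left join unnest(arr) x
)
-- rebuild one array of structs per date
select date, array_agg(struct(count, shape, id, colour)) as res
from unnested_and_unique
group by 1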
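
One caveat: SELECT DISTINCT only removes rows that are identical in every column. When the same date, id, colour and shape appear with different counts, as with (21, "plane") and (2, "plane") for black on 2020-01-30 in the sample data, both rows survive. The sketch below is one possible variant that keeps the maximum count per combination, which is what the question asks for. It is not the answerer's query: the CTE name max_per_shape is made up for illustration, and T is cut down to three of the question's rows so the example stays self-contained.

```sql
WITH T AS (
  -- Subset of the question's sample data (assumption: enough to show the effect)
  SELECT 0 AS id, 'black' AS colour, ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (21, "plane"), (156, "cube")] AS arr, DATE(2020,01,30) AS date UNION ALL
  SELECT 0, 'black', ARRAY<STRUCT<count INT64, shape STRING>>[(296, "dot"), (2, "plane"), (156, "cube")], DATE(2020,01,30) UNION ALL
  SELECT 2, 'orange', ARRAY<STRUCT<count INT64, shape STRING>>[(5, "string")], DATE(2020,01,30)
),
max_per_shape AS (  -- hypothetical name
  -- One row per (date, id, colour, shape), keeping the largest count seen
  SELECT date, id, colour, x.shape AS shape, MAX(x.count) AS count
  FROM T
  LEFT JOIN UNNEST(arr) AS x
  GROUP BY date, id, colour, x.shape
)
-- Rebuild one array of structs per date
SELECT date, ARRAY_AGG(STRUCT(count, shape, id, colour)) AS res
FROM max_per_shape
GROUP BY date;
```

Grouping already collapses exact duplicates, so no separate DISTINCT step is needed here. The order of elements inside res is not guaranteed unless an ORDER BY is added inside ARRAY_AGG.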

【Comments】:

Oh, silly me, I was using a cross join! Thanks for your answer.
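
The comment does not say where the cross join was used, so the following is only a general illustration of how the two join types differ when unnesting in BigQuery: CROSS JOIN UNNEST drops parent rows whose array is empty or NULL, while LEFT JOIN UNNEST keeps them and fills the element columns with NULL. The table below, including the row with an empty array, is hypothetical and does not come from the question's data.

```sql
WITH T AS (
  SELECT 1 AS id, ARRAY<STRUCT<count INT64, shape STRING>>[(4, "cube")] AS arr UNION ALL
  -- Hypothetical row with an empty array, added only to show the difference
  SELECT 2, ARRAY<STRUCT<count INT64, shape STRING>>[]
)
-- CROSS JOIN: id 2 has no array elements, so it produces no output row
SELECT 'cross' AS join_type, id, x.shape AS shape
FROM T CROSS JOIN UNNEST(arr) AS x
UNION ALL
-- LEFT JOIN: id 2 is kept, with shape set to NULL
SELECT 'left', id, x.shape
FROM T LEFT JOIN UNNEST(arr) AS x;
```

For the sample data in the question every array has at least one element, so both join types would flatten it to the same rows.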
