How to dedup array_agg in BigQuery

Posted: 2022-01-12 17:21:30

Question:

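I created a new table that contains a repeated record column (an array of structs). I am trying to find the most efficient way to remove the duplicates from that repeated record, because this will run on a table with millions of rows. Also, if you nest multiple CTEs, does your data structure matter, and is the processing done in memory or is it written to temp tables when there is a lot of data?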

create or replace table t1.cte4 as
WITH t1 AS (
  SELECT 1 as id,'eren' AS last_name UNION ALL
  SELECT 1 as id,'yilmaz' AS last_name UNION ALL
  SELECT 1 as id,'kaya' AS last_name UNION ALL
  SELECT 1 as id,'kaya' AS last_name UNION ALL
  SELECT 2 as id,'smith' AS last_name UNION ALL
  SELECT 2 as id,'jones' AS last_name UNION ALL
  SELECT 2 as id,'jones' AS last_name UNION ALL
  SELECT 2 as id,'jones' AS last_name UNION ALL
  SELECT 2 as id,'brown' AS last_name
)
SELECT id,ARRAY_AGG(STRUCT(last_name)) AS last_name_rec
FROM t1
GROUP BY id;

I can dedup as follows.

QUERY 1: How can I dedup concat_struct?
select id, 
STRING_AGG( distinct ln.last_name ,'~') as concat_string,
ARRAY_AGG(STRUCT( ln.last_name )) as concat_struct
from `t1.cte4`, unnest(last_name_rec) ln
group by id;

QUERY 2: Is there a better way than this to dedup?
select distinct id, 
TO_JSON_STRING(ARRAY_AGG(ln.last_name) OVER (PARTITION BY id)) json_string
from `t1.cte4`, unnest(last_name_rec) ln
group by id,
ln.last_name;

How can I do this against the table instead of a CTE? This does not dedup:

select id,  ARRAY_AGG(STRUCT( ln.last_name )) as concat_struct 
from t1.cte4, 
unnest(last_name_rec) ln group by id; 

And I can't do this:

select id,  ARRAY_AGG(distinct STRUCT( ln.last_name )) as concat_struct from t1.cte4, 
unnest(last_name_rec) ln group by id;

Comments:

Do you want to keep the distinct combinations of id, last_name, or do you want only one last_name per id?

The last names per id.

Answer 1:

Update: unnest the struct before deduplicating, then assemble it back:

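select id, ARRAY_AGG(STRUCT(last_name)) as concat_struct
from (
  -- flatten the array and keep one row per (id, last_name)
  select id, ln.last_name
  from `t1.cte4`, unnest(last_name_rec) ln
  group by id, ln.last_name
) d
-- rebuild one array of structs per id
group by id;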
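As a possible alternative, the dedup can be done inside each row with an ARRAY subquery, which avoids flattening the whole table and grouping it again. This is only a sketch: it assumes `t1.cte4` already holds one row per id (as it does when built with the GROUP BY id from the question), and it makes no claim about being faster on your data. The order of the elements in the resulting array is not guaranteed.

```sql
-- Sketch: dedup the array of structs row by row, without an outer GROUP BY.
-- Assumes t1.cte4 has exactly one row per id.
SELECT
  id,
  ARRAY(
    -- wrap each distinct name back into a STRUCT<last_name STRING>
    SELECT AS STRUCT last_name
    FROM (
      -- DISTINCT is applied to the plain STRING field, not to the STRUCT
      SELECT DISTINCT ln.last_name
      FROM UNNEST(last_name_rec) AS ln
    )
  ) AS concat_struct
FROM `t1.cte4`;
```

The inner SELECT DISTINCT works on a scalar column, which sidesteps the restriction that DISTINCT inside an aggregate cannot take a STRUCT argument.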

(The original answer, which relied on an unnecessary change to the table definition, follows.)

Just use array_agg(distinct ...):

WITH t1 AS (
  SELECT 1 as id,'eren' AS last_name UNION ALL
  SELECT 1 as id,'yilmaz' AS last_name UNION ALL
  SELECT 1 as id,'kaya' AS last_name UNION ALL
  SELECT 1 as id,'kaya' AS last_name UNION ALL
  SELECT 2 as id,'smith' AS last_name UNION ALL
  SELECT 2 as id,'jones' AS last_name UNION ALL
  SELECT 2 as id,'jones' AS last_name UNION ALL
  SELECT 2 as id,'jones' AS last_name UNION ALL
  SELECT 2 as id,'brown' AS last_name
)
SELECT id,ARRAY_AGG(distinct last_name) AS last_name_rec
FROM t1
GROUP BY id;
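Note that ARRAY_AGG(DISTINCT last_name) returns an ARRAY<STRING>, not the array of structs used in the question. If the struct has to stay in the table definition and you control the statement that builds the table, one option is to remove the duplicates before aggregating. The sketch below assumes that; `src` is a hypothetical stand-in for the real source rows.

```sql
-- Sketch: dedup before aggregating, so the stored array of structs is already unique.
-- `src` is a hypothetical name with a shortened copy of the sample data.
CREATE OR REPLACE TABLE t1.cte4 AS
WITH src AS (
  SELECT 1 AS id, 'kaya' AS last_name UNION ALL
  SELECT 1 AS id, 'kaya' AS last_name UNION ALL
  SELECT 1 AS id, 'eren' AS last_name UNION ALL
  SELECT 2 AS id, 'jones' AS last_name UNION ALL
  SELECT 2 AS id, 'jones' AS last_name
)
SELECT id, ARRAY_AGG(STRUCT(last_name)) AS last_name_rec
FROM (
  -- one row per (id, last_name) pair
  SELECT DISTINCT id, last_name FROM src
)
GROUP BY id;
```

With this shape the later queries against `t1.cte4` no longer need to dedup `last_name_rec` at read time.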

Comments:

How can I do this against the table instead of a CTE? This does not dedup:
select id, ARRAY_AGG(STRUCT( ln.last_name )) as concat_struct from t1.cte4, unnest(last_name_rec) ln group by id;
And I can't do this:
select id, ARRAY_AGG(distinct STRUCT( ln.last_name )) as concat_struct from t1.cte4, unnest(last_name_rec) ln group by id;

@DenisTheMenace I see that you have to keep the struct in the table definition. I updated the answer.

Thanks, I appreciate the help.
