listagg: remove adjacent duplicates

Posted 2023-02-24

【Title】listagg: remove adjacent duplicates 【Published】2018-11-28 04:35:20 【Question】:

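I have timestamped data and want to build a list from one column in which adjacent duplicates (but not all duplicates) are collapsed into a single entry.

For example, given the following data: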

'2001-01-01 00:00:01' 'a'
'2001-01-01 00:00:02' 'a'
'2001-01-01 00:00:03' 'b'
'2001-01-01 00:00:04' 'b'
'2001-01-01 00:00:05' 'b'
'2001-01-01 00:00:06' 'a'
'2001-01-01 00:00:07' 'a'
'2001-01-01 00:00:08' 'c'
'2001-01-01 00:00:09' 'a'

I would like the result to be 'a','b','a','c','a'.

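I am using Snowflake, which has listagg(distinct foo), listagg(distinct foo) within group(order by bar), and even listagg(distinct foo) within group(order by bar) over(partition by baz), but I have not found a way to get what I need (and Google was no help either). I would really like to avoid a join.

If you know a solution in another dialect that has listagg or group_concat, please post it and I will try to translate it to Snowflake. Many thanks.

Things that did not work: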

I tried trim(regexp_replace('~' || listagg(foo, '~') || '~', '~([^~]+~)\\1', '~\\1'), '~'), but Snowflake does not allow \1 in the match pattern. I get the error Invalid regular expression: '~([^~]+~)\1', invalid escape sequence: \1. I also tried

listagg(iff(lag(foo) ignore nulls over(partition by baz order by bar)=foo, null, foo), ',') within group(order by bar) over(partition by baz)

but got the error Window function [LAG(...)] may not be nested inside another window function.

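【Answer 1】:

Unfortunately, I don't think Snowflake supports backreferences in regular expression patterns.

Possible solutions:

Use LAG in a subquery to drop the adjacent duplicates from the input before aggregating, for example: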

-- my_table is a placeholder name for the source table
with sub as (
  select foo, bar, lag(foo) over (order by bar) as foolag from my_table
)
select listagg(foo, ',') within group (order by bar) from sub
where foolag is null or foolag <> foo;

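Use LISTAGG, but write a JavaScript UDF that splits the LISTAGG result and removes the adjacent duplicates from it

Write a JavaScript UDTF (table function) that performs the LISTAGG and removes the adjacent duplicates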
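
As a rough illustration of the second option, the core logic of such a UDF might look like the sketch below. It is plain JavaScript, not a complete Snowflake function definition. The name dedupeAdjacent and the default ',' delimiter are assumptions made for this example, and the input is assumed to be the string produced by listagg(foo, ',') within group (order by bar).

```javascript
// Hypothetical helper (not part of Snowflake): collapse runs of equal
// adjacent items in a delimited string such as a LISTAGG result.
function dedupeAdjacent(list, delimiter = ',') {
  // Pass NULL / empty input through unchanged.
  if (list === null || list === undefined || list === '') {
    return list;
  }
  const items = list.split(delimiter);
  // Keep an item only when it differs from the item immediately before it.
  const kept = items.filter((item, i) => i === 0 || item !== items[i - 1]);
  return kept.join(delimiter);
}

console.log(dedupeAdjacent('a,a,b,b,b,a,a,c,a')); // prints "a,b,a,c,a"
```

To use this in Snowflake, the function body would need to be wrapped in a CREATE FUNCTION ... LANGUAGE JAVASCRIPT statement, where argument names are, as far as I know, referenced in uppercase inside the JavaScript body. The sketch also assumes the delimiter never occurs inside a value. If it can, pick a delimiter that cannot appear in the data.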
