Merge sorted data with SQL
Posted: 2020-05-18 17:32:02

I have data like this:
+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
| 1| a| 2012| 2013|
| 1| b| 2014| 2015|
| 1| a| 2016| 2017|
| 1| a| 2018| 2019|
+---+----+----------+--------+
I want to merge periods that have the same hash value. As a result, I want data like this:
+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
| 1| a| 2012| 2013|
| 1| b| 2014| 2015|
| 1| a| 2016| 2019|
+---+----+----------+--------+
(the last two rows are merged into a single period)
I tried a query like this:
%sql
select distinct
    id,
    hash,
    min(start_date) over (partition by hash) as start_date,
    max(end_date) over (partition by hash) as end_date
from (
    select 1 as id, 'a' as hash, 2012 as start_date, 2013 as end_date
    union
    select 1 as id, 'b' as hash, 2014 as start_date, 2015 as end_date
    union
    select 1 as id, 'a' as hash, 2016 as start_date, 2017 as end_date
    union
    select 1 as id, 'a' as hash, 2018 as start_date, 2019 as end_date
) t
The result is:
+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
| 1| a| 2012| 2019|
| 1| b| 2014| 2015|
+---+----+----------+--------+
This is wrong: the window functions partition only by hash, so all three 'a' rows fall into a single partition and min/max span 2012-2019, while 2012-2013 and 2016-2019 should stay separate.
How can I get the correct result with Spark SQL?
【Answer 1】: This is a gaps-and-islands problem. The simplest approach is the difference of row numbers. Assuming you have no gaps, this will work:
select id, hash, min(start_date) as start_date, max(end_date) as end_date
from (
    select t.*,
           row_number() over (partition by id, hash order by start_date) as seqnum_h,
           row_number() over (partition by id order by start_date) as seqnum
    from t
) t
group by id, hash, (seqnum - seqnum_h)
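
To see why the difference of row numbers isolates each island, here is a minimal, self-contained sketch that traces the intermediate values on the question's sample data (the sample CTE is an assumption standing in for the real table):

%sql
-- Trace the two row numbers on the question's sample rows.
-- The CTE name `sample` is hypothetical; substitute your own table.
with sample as (
    select 1 as id, 'a' as hash, 2012 as start_date, 2013 as end_date union all
    select 1, 'b', 2014, 2015 union all
    select 1, 'a', 2016, 2017 union all
    select 1, 'a', 2018, 2019
)
select t.*,
       row_number() over (partition by id, hash order by start_date) as seqnum_h,
       row_number() over (partition by id order by start_date) as seqnum,
       row_number() over (partition by id order by start_date)
           - row_number() over (partition by id, hash order by start_date) as diff
from sample t

Within a run of consecutive rows sharing the same hash, both row numbers advance in step, so their difference stays constant; when another hash interrupts the run, seqnum keeps advancing while seqnum_h does not, and the difference changes. On the sample data, diff comes out as 0, 1, 1, 1, so grouping by (id, hash, diff) yields (1, a, 0) = 2012-2013, (1, b, 1) = 2014-2015, and (1, a, 1) = the last two rows, merged into 2016-2019.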
【Answer 2】: This is a gaps-and-islands problem. Here is one way to solve it using lag() and a window sum() to define the groups. The upside of this approach is that it allows concurrent series of periods for different ids.

Consider:
select id, hash, min(start_date) as start_date, max(end_date) as end_date
from (
    select t.*,
           sum(case when start_date = lag_end_date + 1 then 0 else 1 end)
               over (partition by id, hash order by end_date) as grp
    from (
        select t.*,
               lag(end_date) over (partition by id, hash order by end_date) as lag_end_date
        from mytable t
    ) t
) t
group by id, hash, grp
order by id, min(start_date)
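
As a quick sanity check, here is a runnable version of the same query wired to the sample data inline (again, the sample CTE is an assumption; replace it with your table):

%sql
-- End-to-end sketch of the lag()/sum() approach on the sample rows.
-- The CTE name `sample` is hypothetical; substitute your own table.
with sample as (
    select 1 as id, 'a' as hash, 2012 as start_date, 2013 as end_date union all
    select 1, 'b', 2014, 2015 union all
    select 1, 'a', 2016, 2017 union all
    select 1, 'a', 2018, 2019
)
select id, hash, min(start_date) as start_date, max(end_date) as end_date
from (
    select t.*,
           sum(case when start_date = lag_end_date + 1 then 0 else 1 end)
               over (partition by id, hash order by end_date) as grp
    from (
        select t.*,
               lag(end_date) over (partition by id, hash order by end_date) as lag_end_date
        from sample t
    ) t
) t
group by id, hash, grp
order by id, min(start_date)

For (id = 1, hash = 'a') the lag_end_date values are null, 2013, 2017; only 2018 equals 2017 + 1, so the case flags are 1, 1, 0 and the running sum assigns grp = 1 to 2012-2013 and grp = 2 to both 2016-2017 and 2018-2019, which collapse into 2016-2019. Note that start_date = lag_end_date + 1 treats periods as adjacent only when one starts the year right after the other ends; if the real columns are dates rather than years, that +1 comparison would need to become date arithmetic.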