Merge sorted data with SQL

【Title】Merge sorted data with sql 【Posted】2020-05-18 17:32:02 【Question】:

I have data like this:

+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
|  1|   a|      2012|    2013|
|  1|   b|      2014|    2015|
|  1|   a|      2016|    2017|
|  1|   a|      2018|    2019|
+---+----+----------+--------+

I want to merge adjacent periods that have the same hash value. The result I want looks like this:

+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
|  1|   a|      2012|    2013|
|  1|   b|      2014|    2015|
|  1|   a|      2016|    2019|
+---+----+----------+--------+

(The last two rows are merged into a single period.)

I tried a query like this:

%sql
select distinct 
 id, 
 hash,  
 min(start_date)  over(partition by hash) as start_date,  
 max(end_date) over(partition by hash) as  end_date 
from (
 select 1 as id, 'a' as hash, 2012 as start_date, 2013 as end_date
  union 
 select 1 as id, 'b' as hash, 2014 as start_date, 2015 as end_date
  union 
 select 1 as id, 'a' as hash, 2016 as start_date, 2017 as end_date
  union 
 select 1 as id, 'a' as hash, 2018 as start_date, 2019 as end_date
) t

The result is:

+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
|  1|   a|      2012|    2019|
|  1|   b|      2014|    2015|
+---+----+----------+--------+

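This is wrong, because 2012-2013 and 2016-2019 should stay separate: partitioning by hash alone puts every 'a' row into one window, regardless of the 'b' period in between.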

How can I get the correct result with Spark SQL?

【Comments】:

【Answer 1】:

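This is a gaps-and-islands problem. The simplest approach is the difference of row numbers. Assuming you have no gaps between periods, this will work: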

select id, hash, min(start_date) as start_date, max(end_date) as end_date
from (select t.*,
             row_number() over (partition by id, hash order by start_date) as seqnum_h,
             row_number() over (partition by id order by start_date) as seqnum
      from t
     ) t
group by id, hash, (seqnum - seqnum_h)
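
To see why the difference of the two row numbers identifies each island, here is a minimal Python sketch that mimics the two `row_number()` calls on the sample rows from the question. It is only an illustration of the idea, not Spark code; the function name `merge_by_row_number_diff` and the tuple layout `(id, hash, start_date, end_date)` are assumptions made for this example.

```python
from collections import defaultdict

def merge_by_row_number_diff(rows):
    """Merge periods using the row-number-difference trick.

    rows: iterable of (id, hash, start_date, end_date) tuples (assumed layout).
    Like the SQL version, this assumes there are no gaps between periods.
    """
    seq = defaultdict(int)    # row_number() over (partition by id order by start_date)
    seq_h = defaultdict(int)  # row_number() over (partition by id, hash order by start_date)
    islands = {}              # (id, hash, seqnum - seqnum_h) -> [min start, max end]

    # Walking rows in (id, start_date) order yields both row numbers in one pass.
    for id_, hash_, start, end in sorted(rows, key=lambda r: (r[0], r[2])):
        seq[id_] += 1
        seq_h[(id_, hash_)] += 1
        key = (id_, hash_, seq[id_] - seq_h[(id_, hash_)])
        if key in islands:
            islands[key][0] = min(islands[key][0], start)
            islands[key][1] = max(islands[key][1], end)
        else:
            islands[key] = [start, end]

    merged = [(k[0], k[1], s, e) for k, (s, e) in islands.items()]
    return sorted(merged, key=lambda r: (r[0], r[2]))

rows = [(1, "a", 2012, 2013), (1, "b", 2014, 2015),
        (1, "a", 2016, 2017), (1, "a", 2018, 2019)]
result = merge_by_row_number_diff(rows)
print(result)
```

On the sample data the differences are 0 for the first 'a' row and 1 for the other three rows, so grouping by id, hash, and the difference keeps the first 'a' period apart from the last two.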

【Comments】:

【Answer 2】:

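This is a gaps-and-islands problem. Here is an approach that uses lag() and a window sum to define the groups. The advantage of this approach is that it allows concurrent series of periods across different ids.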

Consider:

select id, hash, min(start_date) start_date, max(end_date) end_date
from (
    select
        t.*,
        sum(case when start_date = lag_end_date + 1 then 0 else 1 end)
            over(partition by id, hash order by end_date) grp
    from (
        select 
            t.*, 
            lag(end_date) over(partition by id, hash order by end_date) lag_end_date
        from mytable t
    ) t
) t
group by id, hash, grp
order by id, min(start_date)
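
The same logic can be traced step by step in a small Python sketch, assuming the sample rows from the question. It is a hedged illustration rather than Spark code: the function name `merge_by_lag_and_running_sum` and the tuple layout are made up for this example. A row opens a new group unless its start_date equals the previous end_date + 1 within the same (id, hash) partition, which is what the `case` expression and the running `sum` compute.

```python
from collections import defaultdict

def merge_by_lag_and_running_sum(rows):
    """Merge periods with a lag() comparison and a running sum of flags.

    rows: iterable of (id, hash, start_date, end_date) tuples (assumed layout).
    """
    prev_end = {}           # lag(end_date) per (id, hash) partition
    grp = defaultdict(int)  # running sum of "new island" flags per partition
    islands = {}            # (id, hash, grp) -> [min start, max end]

    # Order by end_date inside each (id, hash) partition, as in the query.
    for id_, hash_, start, end in sorted(rows, key=lambda r: (r[0], r[1], r[3])):
        part = (id_, hash_)
        # No previous row (SQL NULL) or a non-adjacent start opens a new group.
        if part not in prev_end or start != prev_end[part] + 1:
            grp[part] += 1
        prev_end[part] = end
        key = (id_, hash_, grp[part])
        if key in islands:
            islands[key][0] = min(islands[key][0], start)
            islands[key][1] = max(islands[key][1], end)
        else:
            islands[key] = [start, end]

    merged = [(k[0], k[1], s, e) for k, (s, e) in islands.items()]
    return sorted(merged, key=lambda r: (r[0], r[2]))

rows = [(1, "a", 2012, 2013), (1, "b", 2014, 2015),
        (1, "a", 2016, 2017), (1, "a", 2018, 2019)]
result = merge_by_lag_and_running_sum(rows)
print(result)
```

Because the comparison uses the actual dates, two periods with the same hash are merged only when one starts right after the other ends; a real gap between them keeps them apart even if no other hash appears in between.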

【Comments】:
