Grouping subsets within ordered data in SQL

【Posted】: 2021-01-05 18:29:25 【Question】:

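I have a dataset from a manufacturing operation. In some parts of the process there may be steps that can be done in parallel, meaning they can be completed in any order and may even overlap. For instance, in the example below, steps 2, 3 and 4 of order 1001 can be performed in any order. Type = C denotes a parallel operation.

Since the historical data may show parallel steps completed in any order, I would like to treat each block of C steps as a single row, using the minimum start time and the maximum end time within that group, as shown in the desired table.

How can this be achieved in SQL? HANA SQL specifically, but any relevant example would help.

Current: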

+-----------+------+------+---------------------+---------------------+
| order_nbr | step | type |        start        |         end         |
+-----------+------+------+---------------------+---------------------+
|      1001 |    1 | P    | 2021-01-01 00:00:00 | 2021-01-01 09:00:00 |
|      1001 |    2 | C    | 2021-01-04 03:00:00 | 2021-01-04 06:00:00 |
|      1001 |    3 | C    | 2021-01-03 07:00:00 | 2021-01-03 08:00:00 |
|      1001 |    4 | C    | 2021-01-05 10:00:00 | 2021-01-05 15:00:00 |
|      1001 |    5 | Z    | 2021-01-06 00:00:00 | 2021-01-06 06:00:00 |
|      1001 |    6 | Z    | 2021-01-06 16:00:00 | 2021-01-06 20:00:00 |
|      1001 |    7 | C    | 2021-01-07 08:00:00 | 2021-01-07 09:00:00 |
|      1001 |    8 | C    | 2021-01-07 10:00:00 | 2021-01-07 12:00:00 |
|      1002 |    1 | P    | 2021-01-04 08:00:00 | 2021-01-04 16:00:00 |
+-----------+------+------+---------------------+---------------------+

Desired:

+-----------+---------+------+---------------------+---------------------+
| order_nbr |  step   | type |        start        |         end         |
+-----------+---------+------+---------------------+---------------------+
|      1001 | 1       | P    | 2021-01-01 00:00:00 | 2021-01-01 09:00:00 |
|      1001 | 2, 3, 4 | C    | 2021-01-03 07:00:00 | 2021-01-05 15:00:00 |
|      1001 | 5       | Z    | 2021-01-06 00:00:00 | 2021-01-06 06:00:00 |
|      1001 | 6       | Z    | 2021-01-06 16:00:00 | 2021-01-06 20:00:00 |
|      1001 | 7, 8    | C    | 2021-01-07 08:00:00 | 2021-01-07 12:00:00 |
|      1002 | 1       | P    | 2021-01-04 08:00:00 | 2021-01-04 16:00:00 |
+-----------+---------+------+---------------------+---------------------+

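【Comments】:

Grouped concatenation? Like STRING_AGG in SQL Server.

For the concatenation, yes, but I'm not sure how to selectively group the blocks of type C rows.

GROUP BY order_nbr, type, CASE WHEN type <> 'C' THEN step ELSE NULL END, and then you select order_nbr, type, step = STRING_AGG(step), start = MIN(start), end = MIN(end)

Within a single order, e.g. 1001, there are two C blocks that must be grouped independently. If I'm running that snippet correctly, it returns one row for all the Cs within order 1001 in the final result.

Sorry, missed that. You will have to use some kind of ROW_NUMBER scheme to group the items. See Itzik Ben-Gan and here also.

【Answer 1】: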

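This is a gaps-and-islands problem, as was commented earlier, so you can consult the linked articles for a deeper look at it. But after finding the islands you need to group the data conditionally (you only need to collapse the type = 'C' items).

The code is below: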

with s as (
  select '1001' as order_nbr, '1' as step, 'P' as ex_type, timestamp '2021-01-01 00:00:00' as start_ts, timestamp '2021-01-01 09:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '2' as step, 'C' as ex_type, timestamp '2021-01-04 03:00:00' as start_ts, timestamp '2021-01-04 06:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '3' as step, 'C' as ex_type, timestamp '2021-01-03 07:00:00' as start_ts, timestamp '2021-01-03 08:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '4' as step, 'C' as ex_type, timestamp '2021-01-05 10:00:00' as start_ts, timestamp '2021-01-05 15:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '5' as step, 'Z' as ex_type, timestamp '2021-01-06 00:00:00' as start_ts, timestamp '2021-01-06 06:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '6' as step, 'Z' as ex_type, timestamp '2021-01-06 16:00:00' as start_ts, timestamp '2021-01-06 20:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '7' as step, 'C' as ex_type, timestamp '2021-01-07 08:00:00' as start_ts, timestamp '2021-01-07 09:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '8' as step, 'C' as ex_type, timestamp '2021-01-07 10:00:00' as start_ts, timestamp '2021-01-07 12:00:00' as end_ts from dummy union all
  select '1002' as order_nbr, '1' as step, 'P' as ex_type, timestamp '2021-01-04 08:00:00' as start_ts, timestamp '2021-01-04 16:00:00' as end_ts from dummy
)
, num as (
  select
    s.*
    /*Find consecutive rows on ex_type field*/
    , row_number() over(partition by order_nbr order by start_ts asc) as r1
    , row_number() over(partition by order_nbr, ex_type order by start_ts asc) as r2
  from s
)
select
  order_nbr
  , ex_type
  , min(start_ts) as start_ts
  , max(end_ts) as end_ts
  , string_agg(step, ',' order by start_ts asc) as steps
from num
group by
  order_nbr
  , ex_type
  , case
      /*For C use group number, for others - use original row number not to collapse them*/
      when ex_type = 'C'
      then r1 - r2
      else r1
  end
order by 
  order_nbr
  , start_ts asc

Here is a db<>fiddle on PostgreSQL, a platform whose syntax for the features involved is the same as HANA's.
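To see why grouping on r1 - r2 isolates each C block, it can help to print the intermediate values. The following is a minimal sketch and not part of the original answer: it assumes Python's built-in sqlite3 module with SQLite 3.25 or later (needed for window functions) standing in for HANA, stores the timestamps as ISO text, and uses a hypothetical helper name, island_keys.

```python
import sqlite3

# Sample rows from the question: (order_nbr, step, ex_type, start_ts, end_ts).
ROWS = [
    ("1001", "1", "P", "2021-01-01 00:00:00", "2021-01-01 09:00:00"),
    ("1001", "2", "C", "2021-01-04 03:00:00", "2021-01-04 06:00:00"),
    ("1001", "3", "C", "2021-01-03 07:00:00", "2021-01-03 08:00:00"),
    ("1001", "4", "C", "2021-01-05 10:00:00", "2021-01-05 15:00:00"),
    ("1001", "5", "Z", "2021-01-06 00:00:00", "2021-01-06 06:00:00"),
    ("1001", "6", "Z", "2021-01-06 16:00:00", "2021-01-06 20:00:00"),
    ("1001", "7", "C", "2021-01-07 08:00:00", "2021-01-07 09:00:00"),
    ("1001", "8", "C", "2021-01-07 10:00:00", "2021-01-07 12:00:00"),
    ("1002", "1", "P", "2021-01-04 08:00:00", "2021-01-04 16:00:00"),
]


def island_keys(rows):
    """Return (order_nbr, step, ex_type, r1, r2, r1 - r2), ordered by order and start time."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table s (order_nbr text, step text, ex_type text, start_ts text, end_ts text)"
    )
    con.executemany("insert into s values (?, ?, ?, ?, ?)", rows)
    result = con.execute(
        """
        select order_nbr, step, ex_type,
               -- position of the row within its order
               row_number() over (partition by order_nbr order by start_ts) as r1,
               -- position of the row among rows of the same type within its order
               row_number() over (partition by order_nbr, ex_type order by start_ts) as r2
        from s
        order by order_nbr, start_ts
        """
    ).fetchall()
    con.close()
    # r1 - r2 is constant across a run of consecutive rows of the same type.
    return [(o, st, t, r1, r2, r1 - r2) for (o, st, t, r1, r2) in result]


if __name__ == "__main__":
    for row in island_keys(ROWS):
        print(row)
```

For order 1001 the C rows come out with differences 1, 1, 1, 3, 3, so the two blocks get distinct keys. The two Z rows share the difference 4, which is why the CASE expression in the query above falls back to r1 for non-C types to keep them as separate rows.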

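【Comments】:

Great, thanks! Here is the ugly caveman approach I tried yesterday, just adapted to your exact dataset for comparison. I wanted to create an id field that changes with every row except where the C type repeats. I prefer your solution, which looks much cleaner.

【Answer 2】: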

This was my approach before using the answer provided by astentx. It creates an id for each non-C row and for each group of type C rows.

with s as (
  select '1001' as order_nbr, '1' as step, 'P' as ex_type, timestamp '2021-01-01 00:00:00' as start_ts, timestamp '2021-01-01 09:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '2' as step, 'C' as ex_type, timestamp '2021-01-04 03:00:00' as start_ts, timestamp '2021-01-04 06:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '3' as step, 'C' as ex_type, timestamp '2021-01-03 07:00:00' as start_ts, timestamp '2021-01-03 08:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '4' as step, 'C' as ex_type, timestamp '2021-01-05 10:00:00' as start_ts, timestamp '2021-01-05 15:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '5' as step, 'Z' as ex_type, timestamp '2021-01-06 00:00:00' as start_ts, timestamp '2021-01-06 06:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '6' as step, 'Z' as ex_type, timestamp '2021-01-06 16:00:00' as start_ts, timestamp '2021-01-06 20:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '7' as step, 'C' as ex_type, timestamp '2021-01-07 08:00:00' as start_ts, timestamp '2021-01-07 09:00:00' as end_ts from dummy union all
  select '1001' as order_nbr, '8' as step, 'C' as ex_type, timestamp '2021-01-07 10:00:00' as start_ts, timestamp '2021-01-07 12:00:00' as end_ts from dummy union all
  select '1002' as order_nbr, '1' as step, 'P' as ex_type, timestamp '2021-01-04 08:00:00' as start_ts, timestamp '2021-01-04 16:00:00' as end_ts from dummy
)


select
    b.order_nbr,
    b.ex_type,
    min(b.start_ts) as start_ts,
    max(b.end_ts) as end_ts,
    string_agg(b.step, ',') as steps
from
    (select
        a.order_nbr,
        a.step,
        a.ex_type,
        a.start_ts,
        a.end_ts,
        /*Running sum of inc: the id goes up by one at every row that starts a new block*/
        sum(a.inc) over (order by a.order_nbr asc, a.start_ts asc) as id
    from
        (select
            s.order_nbr,
            s.step,
            s.ex_type,
            s.start_ts,
            s.end_ts,
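            /*inc = 0 only when a C row directly follows another C row of the same order, so it stays in the current block*/
            case
                when s.ex_type = 'C' and s.ex_type = lag(s.ex_type) over (partition by s.order_nbr order by s.start_ts)
                    then 0
                    else 1
                end as inc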
            from
                s
            order by
                s.order_nbr asc,
                s.start_ts asc
        ) as a
    ) as b
group by
    b.order_nbr,
    b.ex_type,
    b.id
order by
    b.order_nbr asc,
    min(b.start_ts) asc
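The lag-and-running-sum idea can be inspected by printing the inc flag and the resulting id for each row. This is a minimal sketch rather than the poster's exact setup: it assumes Python's built-in sqlite3 module with SQLite 3.25 or later (for window functions) in place of HANA, ISO text timestamps, and a hypothetical helper name, block_ids.

```python
import sqlite3

# Sample rows from the question: (order_nbr, step, ex_type, start_ts, end_ts).
ROWS = [
    ("1001", "1", "P", "2021-01-01 00:00:00", "2021-01-01 09:00:00"),
    ("1001", "2", "C", "2021-01-04 03:00:00", "2021-01-04 06:00:00"),
    ("1001", "3", "C", "2021-01-03 07:00:00", "2021-01-03 08:00:00"),
    ("1001", "4", "C", "2021-01-05 10:00:00", "2021-01-05 15:00:00"),
    ("1001", "5", "Z", "2021-01-06 00:00:00", "2021-01-06 06:00:00"),
    ("1001", "6", "Z", "2021-01-06 16:00:00", "2021-01-06 20:00:00"),
    ("1001", "7", "C", "2021-01-07 08:00:00", "2021-01-07 09:00:00"),
    ("1001", "8", "C", "2021-01-07 10:00:00", "2021-01-07 12:00:00"),
    ("1002", "1", "P", "2021-01-04 08:00:00", "2021-01-04 16:00:00"),
]


def block_ids(rows):
    """Return (order_nbr, step, ex_type, inc, id), ordered by order and start time."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table s (order_nbr text, step text, ex_type text, start_ts text, end_ts text)"
    )
    con.executemany("insert into s values (?, ?, ?, ?, ?)", rows)
    result = con.execute(
        """
        select order_nbr, step, ex_type, inc,
               -- running total of the flags: one id per output row
               sum(inc) over (order by order_nbr, start_ts) as id
        from (
            select s.*,
                   -- 0 keeps a C row in the block opened by the previous C row
                   case
                       when ex_type = 'C'
                        and ex_type = lag(ex_type) over (partition by order_nbr order by start_ts)
                       then 0
                       else 1
                   end as inc
            from s
        ) as a
        order by order_nbr, start_ts
        """
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in block_ids(ROWS):
        print(row)
```

Rows that share an id form one output row once grouped. A C row that is the first row of its order still gets inc = 1, because the comparison with the NULL returned by lag is not true and the CASE falls through to the else branch.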
