Group By based on consecutive flag in Redshift (gaps and islands problem)

Posted: 2021-11-16 00:42:22

Question:

I am trying to solve a gaps-and-islands problem and group consecutive records together. My data looks like this:

site_id     date_id    location_id    reservation_id    revenue
   5        20210101      125            792727           100
   5        20210101      126            792728           90
   5        20210101      228            792757           200
   5        20210102      217            792977           50
   5        20210102      218            792978           120
   5        20210102      219            792979           100

I want to group by consecutive location_id and consecutive reservation_id (both should be consecutive) within the same date and site_id, and compute the total revenue. So for the example above, the output should be:

site_id     date_id    location_id    reservation_id    revenue
   5        20210101      125            792727           190
   5        20210101      228            792757           200
   5        20210102      217            792977           270

location_id and reservation_id do not matter beyond this particular task, so a simple MAX() or MIN() on those two columns will do.
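For reference, the grouping being asked for can be sketched in plain Python (the function name and structure here are illustrative, not part of the question):

```python
from itertools import groupby

# Sample rows: (site_id, date_id, location_id, reservation_id, revenue)
rows = [
    (5, "20210101", 125, 792727, 100),
    (5, "20210101", 126, 792728, 90),
    (5, "20210101", 228, 792757, 200),
    (5, "20210102", 217, 792977, 50),
    (5, "20210102", 218, 792978, 120),
    (5, "20210102", 219, 792979, 100),
]

def group_islands(rows):
    """Assign a group id that increments whenever the current row does not
    continue the previous row's run: same site_id and date_id, and both
    location_id and reservation_id exactly one higher than before."""
    tagged = []
    prev = None
    gid = -1
    for r in rows:
        site, date, loc, res, rev = r
        if (prev is None
                or prev[0] != site or prev[1] != date
                or loc != prev[2] + 1 or res != prev[3] + 1):
            gid += 1
        tagged.append((gid, r))
        prev = r
    # Aggregate each island: MIN of the ids, SUM of the revenue
    out = []
    for _, grp in groupby(tagged, key=lambda t: t[0]):
        g = [r for _, r in grp]
        out.append((g[0][0], g[0][1],
                    min(r[2] for r in g),
                    min(r[3] for r in g),
                    sum(r[4] for r in g)))
    return out
```

Running `group_islands(rows)` reproduces the three-row output shown above.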


Answer 1:

Try sessionisation.

Two nested queries. The first computes a counter that is 0 when a condition is false and 1 when it is true; in our case, the condition is that the previous reservation_id is not exactly one less than the current one.

The second query selects from the first and computes a running sum of that counter. This running sum serves as a session id.

Finally, group by site_id, date_id, and the session id just obtained.

WITH
indata(site_id,date_id,location_id,reservation_id,revenue) AS (
          SELECT 5,DATE '2021-01-01',125,792727,100
UNION ALL SELECT 5,DATE '2021-01-01',126,792728,90
UNION ALL SELECT 5,DATE '2021-01-01',228,792757,200
UNION ALL SELECT 5,DATE '2021-01-02',217,792977,50
UNION ALL SELECT 5,DATE '2021-01-02',218,792978,120
UNION ALL SELECT 5,DATE '2021-01-02',219,792979,100
)
,
with_counter AS (
  SELECT
    site_id
  , date_id
  , location_id
  , reservation_id
  , revenue
  , CASE
      WHEN reservation_id - LAG(reservation_id) OVER(
         PARTITION BY site_id ORDER BY date_id,reservation_id
      ) > 1
      THEN 1
      ELSE 0
    END AS counter
  FROM indata
)
,
with_session AS (
  SELECT
    site_id
  , date_id
  , location_id
  , reservation_id
  , revenue
  , SUM(counter) OVER(
      PARTITION BY site_id ORDER BY date_id,reservation_id
      ROWS UNBOUNDED PRECEDING  -- Redshift requires an explicit frame here
    ) AS session_id
  FROM with_counter
  -- test output ...
  -- out  site_id |  date_id   | location_id | reservation_id | revenue | session_id 
  -- out ---------+------------+-------------+----------------+---------+------------
  -- out        5 | 2021-01-01 |         125 |         792727 |     100 |          0
  -- out        5 | 2021-01-01 |         126 |         792728 |      90 |          0
  -- out        5 | 2021-01-01 |         228 |         792757 |     200 |          1
  -- out        5 | 2021-01-02 |         217 |         792977 |      50 |          2
  -- out        5 | 2021-01-02 |         218 |         792978 |     120 |          2
  -- out        5 | 2021-01-02 |         219 |         792979 |     100 |          2
)
SELECT
  site_id
, date_id
, MIN(location_id   ) AS location_id
, MIN(reservation_id) AS reservation_id
, SUM(revenue       ) AS revenue
FROM with_session
GROUP BY
  site_id
, date_id
, session_id
ORDER BY
  site_id
, date_id
;
-- out  site_id |  date_id   | location_id | reservation_id | revenue 
-- out ---------+------------+-------------+----------------+---------
-- out        5 | 2021-01-01 |         125 |         792727 |     190
-- out        5 | 2021-01-01 |         228 |         792757 |     200
-- out        5 | 2021-01-02 |         217 |         792977 |     270                                                                                                                  
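The counter-plus-running-sum step can be sanity-checked in a few lines of plain Python (a sketch of the logic, not Redshift itself):

```python
# Reservation ids in (date_id, reservation_id) order, for one site
res_ids = [792727, 792728, 792757, 792977, 792978, 792979]

# counter: 1 when the gap to the previous id exceeds 1, else 0
# (the first row has no predecessor, so its counter is 0)
counters = [0] + [1 if b - a > 1 else 0 for a, b in zip(res_ids, res_ids[1:])]

# session_id: running sum of the counters
session_ids = []
running = 0
for c in counters:
    running += c
    session_ids.append(running)
# session_ids is now [0, 0, 1, 2, 2, 2], matching the test output above
```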

Comments:

Very cool approach! Thanks.

Answer 2:

Try this:

with mytable as (
  select 5 as site_id, '20210101' as date_id, 125 as location_id, 792727 as reservation_id, 100 as revenue union all
  select 5, '20210101', 126, 792728, 90 union all
  select 5, '20210101', 228, 792757, 200 union all
  select 5, '20210102', 217, 792977, 50 union all
  select 5, '20210102', 218, 792978, 120 union all
  select 5, '20210102', 219, 792979, 100
)
select site_id, date_id, min(location_id) as location_id, min(reservation_id) as reservation_id, sum(revenue) as revenue
from ( 
  select *, count(nullif(is_new_group, false)) over (order by site_id, date_id, location_id rows unbounded preceding) as new_group_id
  from (
    select *, coalesce(lag(location_id) over(partition by site_id, date_id order by location_id) != location_id-1, true) as is_new_group
    from mytable
  ) a
) b
group by site_id, date_id, new_group_id
order by new_group_id
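The `count(nullif(is_new_group, false)) over (... rows unbounded preceding)` trick is a running count of `true` flags: NULLIF turns `false` into NULL, and COUNT skips NULLs. In Python terms (a sketch with the flags this data would produce):

```python
# is_new_group per row: true when the previous location_id within the same
# site_id/date_id partition is not exactly one less (or there is no previous row)
flags = [True, False, True, True, False, False]

# Equivalent of count(nullif(flag, false)) over rows unbounded preceding:
# each row gets the number of True flags seen so far, i.e. its group id.
new_group_ids = []
seen = 0
for f in flags:
    if f:
        seen += 1
    new_group_ids.append(seen)
# new_group_ids is now [1, 1, 2, 3, 3, 3]
```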

Comments:

Thanks. I'll need to break it into pieces to understand how it works, but great stuff.
