Group By based on consecutive flag in Redshift (Gaps and Islands problem)
[Posted]: 2021-11-16 00:42:22
[Question]: I am trying to solve a "gaps and islands" problem and group consecutive rows together. My data looks like this:
site_id  date_id   location_id  reservation_id  revenue
5        20210101  125          792727          100
5        20210101  126          792728          90
5        20210101  228          792757          200
5        20210102  217          792977          50
5        20210102  218          792978          120
5        20210102  219          792979          100
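Within the same date and site_id, I want to group rows whose location_id values are consecutive and whose reservation_id values are consecutive (both must be consecutive), and compute the total revenue per group. For the example above, the output should be: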
site_id  date_id   location_id  reservation_id  revenue
5        20210101  125          792727          190
5        20210101  228          792757          200
5        20210102  217          792977          270
location_id and reservation_id do not matter beyond this particular task, so a simple MAX() or MIN() on those two columns is fine.
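[Answer 1]: Try sessionization:
Use two nested queries. The first computes a counter that is 0 when a condition is false and 1 when it is true; in our case, the condition is that the previous reservation id is not exactly one less than the current one.
The second query selects from the first and takes a running sum of that counter. This yields a session id.
Finally, group by site id, date id, and the session id obtained.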
WITH
indata(site_id,date_id,location_id,reservation_id,revenue) AS (
            SELECT 5,DATE '2021-01-01',125,792727,100
  UNION ALL SELECT 5,DATE '2021-01-01',126,792728,90
  UNION ALL SELECT 5,DATE '2021-01-01',228,792757,200
  UNION ALL SELECT 5,DATE '2021-01-02',217,792977,50
  UNION ALL SELECT 5,DATE '2021-01-02',218,792978,120
  UNION ALL SELECT 5,DATE '2021-01-02',219,792979,100
)
,
with_counter AS (
  SELECT
    site_id
  , date_id
  , location_id
  , reservation_id
  , revenue
  , CASE
      WHEN reservation_id - LAG(reservation_id) OVER(
             PARTITION BY site_id ORDER BY date_id,reservation_id
           ) > 1
      THEN 1
      ELSE 0
    END AS counter
  FROM indata
)
,
with_session AS (
  SELECT
    site_id
  , date_id
  , location_id
  , reservation_id
  , revenue
    -- running sum of the counter; Redshift requires an explicit frame clause
    -- when an aggregate window function has an ORDER BY
  , SUM(counter) OVER(
      PARTITION BY site_id ORDER BY date_id,reservation_id
      ROWS UNBOUNDED PRECEDING
    ) AS session_id
  FROM with_counter
  -- test output ...
  -- out  site_id |  date_id   | location_id | reservation_id | revenue | session_id
  -- out ---------+------------+-------------+----------------+---------+------------
  -- out        5 | 2021-01-01 |         125 |         792727 |     100 |          0
  -- out        5 | 2021-01-01 |         126 |         792728 |      90 |          0
  -- out        5 | 2021-01-01 |         228 |         792757 |     200 |          1
  -- out        5 | 2021-01-02 |         217 |         792977 |      50 |          2
  -- out        5 | 2021-01-02 |         218 |         792978 |     120 |          2
  -- out        5 | 2021-01-02 |         219 |         792979 |     100 |          2
)
SELECT
  site_id
, date_id
, MIN(location_id   ) AS location_id
, MIN(reservation_id) AS reservation_id
, SUM(revenue       ) AS revenue
FROM with_session
GROUP BY
  site_id
, date_id
, session_id
ORDER BY
  site_id
, date_id
;
-- out  site_id |  date_id   | location_id | reservation_id | revenue
-- out ---------+------------+-------------+----------------+---------
-- out        5 | 2021-01-01 |         125 |         792727 |     190
-- out        5 | 2021-01-01 |         228 |         792757 |     200
-- out        5 | 2021-01-02 |         217 |         792977 |     270
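The query above starts a new session only when reservation_id jumps, while the question asks for both location_id and reservation_id to be consecutive. The following is a minimal sketch of one way to fold both checks into the counter. It is not part of the original answer. To keep it self-contained, the SQL runs against an in-memory SQLite database through Python's sqlite3 module (this assumes the bundled SQLite is 3.25 or newer, which is when window functions were added); it has not been run on Redshift. The helper name `group_islands`, the output aliases, and the extra row `HYPOTHETICAL_ROW` are illustrative assumptions, and date_id is stored as text here.

```python
import sqlite3

# Sample rows copied from the question.
QUESTION_ROWS = [
    (5, "2021-01-01", 125, 792727, 100),
    (5, "2021-01-01", 126, 792728, 90),
    (5, "2021-01-01", 228, 792757, 200),
    (5, "2021-01-02", 217, 792977, 50),
    (5, "2021-01-02", 218, 792978, 120),
    (5, "2021-01-02", 219, 792979, 100),
]

# Hypothetical row (not in the question): its reservation_id continues
# 792979, but its location_id does not continue 219.
HYPOTHETICAL_ROW = (5, "2021-01-02", 300, 792980, 40)

QUERY = """
WITH with_counter AS (
  SELECT
    site_id, date_id, location_id, reservation_id, revenue
    -- 0 only when BOTH ids continue the previous row; the first row of each
    -- partition has a NULL LAG, falls through to ELSE, and starts a session
  , CASE
      WHEN reservation_id - LAG(reservation_id) OVER (
             PARTITION BY site_id, date_id ORDER BY reservation_id) = 1
       AND location_id - LAG(location_id) OVER (
             PARTITION BY site_id, date_id ORDER BY reservation_id) = 1
      THEN 0
      ELSE 1
    END AS counter
  FROM indata
)
, with_session AS (
  SELECT
    site_id, date_id, location_id, reservation_id, revenue
    -- running sum of the counter gives a session id per (site_id, date_id)
  , SUM(counter) OVER (
      PARTITION BY site_id, date_id ORDER BY reservation_id
      ROWS UNBOUNDED PRECEDING
    ) AS session_id
  FROM with_counter
)
SELECT
  site_id
, date_id
, MIN(location_id)    AS first_location_id
, MIN(reservation_id) AS first_reservation_id
, SUM(revenue)        AS total_revenue
FROM with_session
GROUP BY site_id, date_id, session_id
ORDER BY site_id, date_id, first_reservation_id
"""


def group_islands(rows):
    """Load rows into an in-memory SQLite table named indata and run QUERY."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE indata (site_id INT, date_id TEXT, location_id INT,"
        " reservation_id INT, revenue INT)"
    )
    con.executemany("INSERT INTO indata VALUES (?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in group_islands(QUESTION_ROWS):
    print(row)
```

On the question's data this returns the same three groups as the desired output. Calling `group_islands(QUESTION_ROWS + [HYPOTHETICAL_ROW])` puts the extra row in a group of its own, because only its reservation_id is consecutive.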
[Comments]:
Very cool approach! Thanks
[Answer 2]: Try this:
with mytable as (
select 5 as site_id, '20210101' as date_id, 125 as location_id, 792727 as reservation_id, 100 as revenue union all
select 5, '20210101', 126, 792728, 90 union all
select 5, '20210101', 228, 792757, 200 union all
select 5, '20210102', 217, 792977, 50 union all
select 5, '20210102', 218, 792978, 120 union all
select 5, '20210102', 219, 792979, 100
)
select site_id, date_id,
       min(location_id) as location_id,
       min(reservation_id) as reservation_id,
       sum(revenue) as revenue
from (
    select *,
           count(nullif(is_new_group, false)) over (
               order by site_id, date_id, location_id rows unbounded preceding
           ) as new_group_id
    from (
        select *,
               coalesce(lag(location_id) over (partition by site_id, date_id order by location_id)
                        != location_id - 1, true) as is_new_group
        from mytable
    ) a
) b
group by site_id, date_id, new_group_id
order by new_group_id
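To see how the three nested levels of this query fit together, here is a rough Python sketch that mirrors each level as a separate function. It is only an illustration of the logic, not the answer's code, and the function names (`flag_new_groups`, `number_groups`, `aggregate`) are made up for this sketch. Like the query, it looks only at location_id when deciding whether a row continues the previous one.

```python
from itertools import groupby

# Sample rows from the question: (site_id, date_id, location_id, reservation_id, revenue)
QUESTION_ROWS = [
    (5, "20210101", 125, 792727, 100),
    (5, "20210101", 126, 792728, 90),
    (5, "20210101", 228, 792757, 200),
    (5, "20210102", 217, 792977, 50),
    (5, "20210102", 218, 792978, 120),
    (5, "20210102", 219, 792979, 100),
]


def flag_new_groups(rows):
    """Innermost query: is_new_group is True unless the previous row in the
    same (site_id, date_id) partition has location_id exactly one lower."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    flagged, prev = [], None
    for row in ordered:
        same_partition = prev is not None and prev[:2] == row[:2]
        continues = same_partition and prev[2] == row[2] - 1
        flagged.append((row, not continues))
        prev = row
    return flagged


def number_groups(flagged):
    """Middle query: running count of True flags gives new_group_id."""
    group_id, numbered = 0, []
    for row, is_new_group in flagged:
        group_id += is_new_group  # True counts as 1, False as 0
        numbered.append((row, group_id))
    return numbered


def aggregate(numbered):
    """Outer query: group by (site_id, date_id, new_group_id) and aggregate."""
    key = lambda item: (item[0][0], item[0][1], item[1])
    result = []
    for (site_id, date_id, _), items in groupby(numbered, key=key):
        members = [row for row, _ in items]
        result.append((
            site_id,
            date_id,
            min(m[2] for m in members),  # min(location_id)
            min(m[3] for m in members),  # min(reservation_id)
            sum(m[4] for m in members),  # sum(revenue)
        ))
    return result


flags = [is_new for _, is_new in flag_new_groups(QUESTION_ROWS)]
group_ids = [gid for _, gid in number_groups(flag_new_groups(QUESTION_ROWS))]
totals = aggregate(number_groups(flag_new_groups(QUESTION_ROWS)))
print(flags)
print(group_ids)
print(totals)
```

The flags mark where each island starts, the running count turns those marks into group numbers, and the final step collapses each group into one row.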
[Comments]:
Thanks. I need to break it down into parts to understand how it works, but great stuff.