Finding groups of rows with consecutive identical values that have a defined end (SQL Redshift)

Posted 2023-03-30

【Title】Finding groups of rows with consecutive identical values that have a defined end (SQL Redshift) 【Posted】2020-06-01 16:31:48 【Question】:

I have a table that records each user's subscription status on any given day. The data looks like this:

+------------+------------+--------------+
| account_id |    date    | current_plan |
+------------+------------+--------------+
| 1          | 2019-08-01 | free         |
| 1          | 2019-08-02 | free         |
| 1          | 2019-08-03 | yearly       |
| 1          | 2019-08-04 | yearly       |
| 1          | 2019-08-05 | yearly       |
| ...        |            |              |
| 1          | 2020-08-02 | yearly       |
| 1          | 2020-08-03 | free         |
| 2          | 2019-08-01 | monthly      |
| 2          | 2019-08-02 | monthly      |
| ...        |            |              |
| 2          | 2019-08-31 | monthly      |
| 2          | 2019-09-01 | free         |
| ...        |            |              |
| 2          | 2019-11-26 | free         |
| 2          | 2019-11-27 | monthly      |
| ...        |            |              |
| 2          | 2019-12-27 | monthly      |
| 2          | 2019-12-28 | free         |
| 3          | 2020-05-31 | monthly      |
| 3          | 2020-06-01 | monthly      |
| 4          | 2019-08-01 | yearly       |
| ...        |            |              |
| 4          | 2020-06-01 | yearly       |
+------------+------------+--------------+

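I want a table that lists the start date and end date of each subscription. It would look like the one below. Note that, importantly, account_ids 3 and 4 are not included in this table because they are still subscribed as of today (2020-06-01). I only want a summary of people who have exited a subscription.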

+------------+------------+------------+-------------------+
| account_id | start_date |  end_date  | subscription_type |
+------------+------------+------------+-------------------+
|          1 | 2019-08-03 | 2020-08-02 | yearly            |
|          2 | 2019-08-01 | 2019-08-31 | monthly           |
|          2 | 2019-11-27 | 2019-12-27 | monthly           |
+------------+------------+------------+-------------------+

So far I have the following query, which is very close but still returns users who have not exited their subscription:

select account_id, current_plan, min(date), max(date)
from (select d.*,
             row_number() over (partition by account_id order by date) as seqnum,
             row_number() over (partition by account_id, current_plan order by date) as seqnum_2
      from data d
     ) d
where current_plan not in ('free', 'trial')
group by account_id, current_plan, (seqnum - seqnum_2);

【Answer 1】:

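If you just want a very simple filter for users who have exited as of today, you only need to add:

having max(date) < current_date

to your query. Note that this will also include the earlier sequences, such as the first one for account_id = 2.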
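For reference, here is a sketch of the asker's query with that filter appended. The column aliases (`subscription_type`, `start_date`, `end_date`) are taken from the desired output table and are not in the original query; everything else is unchanged.

```sql
-- Sketch: the asker's query plus the suggested HAVING filter.
-- Aliases are added here only to match the desired output table.
select account_id,
       current_plan as subscription_type,
       min(date)    as start_date,
       max(date)    as end_date
from (select d.*,
             -- position of the row within the account
             row_number() over (partition by account_id order by date) as seqnum,
             -- position of the row within the account and plan
             row_number() over (partition by account_id, current_plan order by date) as seqnum_2
      from data d
     ) d
where current_plan not in ('free', 'trial')
-- seqnum - seqnum_2 is constant within each run of consecutive identical plans
group by account_id, current_plan, (seqnum - seqnum_2)
-- keep only runs whose last recorded day is before today
having max(date) < current_date;
```

This filter compares each run's last day with the current date, so a run that ended in the past is kept, while one whose last row is dated today or later is dropped.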

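However, if you want to be forward-looking (as with account_id = 1) and filter out only the last sequence, you need a better "gaps and islands" query that uses the lag function; you will find one if you look through more "gaps and islands" solutions. In general, lag(current_plan) over (partition by account_id order by date) gives you, for each day, the previous day's plan. With that you can identify the dates on which the plan changed, and then rank them within the same window to get the last one for each account.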
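One way the lag-based idea might be written is sketched below. It is an illustration, not the answerer's exact query. The CTE names (`flagged`, `islands`, `summary`, `ranked`) and the `island_id` / `last_island_id` columns are made up for this example, and it assumes that a subscription counts as "ended" whenever a later run of rows exists for the same account, regardless of what plan follows.

```sql
with flagged as (
    select account_id, date, current_plan,
           -- previous day's plan; NULL on the account's first row
           lag(current_plan) over (partition by account_id order by date) as prev_plan
    from data
),
islands as (
    select account_id, date, current_plan,
           -- running count of plan changes numbers each run within the account
           sum(case when prev_plan is null or prev_plan <> current_plan
                    then 1 else 0 end)
               over (partition by account_id order by date
                     rows between unbounded preceding and current row) as island_id
    from flagged
),
summary as (
    -- one row per run of consecutive identical plans
    select account_id, current_plan, island_id,
           min(date) as start_date,
           max(date) as end_date
    from islands
    group by account_id, current_plan, island_id
),
ranked as (
    select s.*,
           -- number of the account's most recent run
           max(island_id) over (partition by account_id) as last_island_id
    from summary s
)
select account_id, start_date, end_date, current_plan as subscription_type
from ranked
where current_plan not in ('free', 'trial')
  and island_id < last_island_id   -- drop the run that is still open
order by account_id, start_date;
```

Under that assumption, an account whose only run is a paid plan (like account_ids 3 and 4 in the sample data) returns no rows, while a paid run followed by any later run is reported with its own start and end dates.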
