Filter by partition in Snowflake

Posted 2023-03-31

【Title】: Filter by partition in Snowflake 【Posted】: 2021-06-02 16:59:17 【Question】:

I want to filter out every id that has a record whose start_time is earlier than the start_time of its 'created' status.

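For example, in the data below, id A has a 'failed' status before 'created' (ordered by start_time), so it needs to be filtered out. Id B, on the other hand, starts with 'created' and is followed by another acceptable status.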

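The expected result therefore contains only the rows for id B, but I am looking for a scalable solution that works for thousands of rows.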

WITH t1 AS (
SELECT 'A' AS id, 'failed' AS status, '2021-05-18 18:30:00'::timestamp AS start_time UNION ALL
SELECT 'A' AS id, 'created' AS status, '2021-05-24 11:30:00'::timestamp AS start_time UNION ALL
SELECT 'A' AS id, 'created' AS status, '2021-05-24 12:00:00'::timestamp AS start_time UNION ALL
SELECT 'B' AS id, 'created' AS status, '2021-05-19 18:30:00'::timestamp AS start_time UNION ALL
SELECT 'B' AS id, 'successful' AS status, '2021-05-20 11:30:00'::timestamp AS start_time
    )
SELECT *
FROM t1

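【Comments】:

"I want to filter id A and only get id B": that doesn't make much sense. Can you elaborate?

@Rajat I edited the question. Let me know if it is still unclear. The rule of thumb is that no id may have a status recorded before 'created'. Every id has a 'created' status, and it must be the first status recorded.

【Answer 1】: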

There are several ways to achieve this; here is one that uses first_value.

with t1 (id, status, start_time) as 
(select 'a', 'failed', '2021-05-18 18:30:00'::timestamp union all
 select 'a', 'created', '2021-05-24 11:30:00'::timestamp union all
 select 'a', 'created', '2021-05-24 12:00:00'::timestamp union all
 select 'b', 'created', '2021-05-19 18:30:00'::timestamp union all
 select 'b', 'successful', '2021-05-20 11:30:00'::timestamp)

select *
from t1
qualify first_value(status) over (partition by id order by start_time asc) = 'created'

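All this does is make sure that the first status for any given id is 'created'. Think of the qualify clause as the counterpart of the having clause for window functions. You can also break it out into a subquery if you find that more readable.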
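For readers who prefer to avoid qualify, the same filter can be sketched with an intermediate CTE and a plain where clause. This is only an illustration under the same sample data; the CTE name `ranked` and the column alias `first_status` are made up for this example and are not part of the original answer.

```sql
-- Sample data from the question (ids lowercased, as in the query above).
with t1 (id, status, start_time) as (
    select 'a', 'failed',     '2021-05-18 18:30:00'::timestamp union all
    select 'a', 'created',    '2021-05-24 11:30:00'::timestamp union all
    select 'a', 'created',    '2021-05-24 12:00:00'::timestamp union all
    select 'b', 'created',    '2021-05-19 18:30:00'::timestamp union all
    select 'b', 'successful', '2021-05-20 11:30:00'::timestamp
),
-- Tag every row with the earliest status recorded for its id.
-- "ranked" and "first_status" are hypothetical names chosen for this sketch.
ranked as (
    select
        t1.*,
        first_value(status) over (partition by id order by start_time asc) as first_status
    from t1
)
-- Keep only the ids whose earliest status is 'created'.
select id, status, start_time
from ranked
where first_status = 'created'
order by id, start_time;
```

With the sample data, this keeps the two rows for id 'b' and drops all rows for id 'a', because the earliest status of 'a' is 'failed'. If two rows of the same id can share an identical start_time, the row that first_value picks is not guaranteed, so a tie-breaking column in the order by may be needed.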

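Note: the solution above will also keep ids that only ever have the 'created' status. If you want to make sure every id has at least two distinct statuses, modify it to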

select *
from t1
qualify first_value(status) over (partition by id order by start_time asc) = 'created'
        and 
        count(distinct status) over (partition by id) > 1;
