Finding the first "x-th consecutive day"


【Posted】: 2021-07-20 13:31:14 【Question】:

Given the following data:

SELECT setseed(0.5);

WITH stuff AS (
    SELECT d::date, floor(random() * 5) AS v
    FROM generate_series('2021-01-01'::date, '2021-01-15'::date, '1 day'::interval) t(d)
)
SELECT d, v
FROM stuff
WHERE extract(isodow from d) BETWEEN 1 AND 5;

More specifically, the query returns:

     d      | v
------------+---
 2021-01-01 | 1  -- 1st consecutive day with a positive `v`
 2021-01-04 | 1  -- 2nd consecutive day with a positive `v`
 2021-01-05 | 0  -- 0th consecutive day with a positive `v`
 2021-01-06 | 0  -- 0th consecutive day with a positive `v`
 2021-01-07 | 0  -- 0th consecutive day with a positive `v`
 2021-01-08 | 1  -- 1st consecutive day with a positive `v`
 2021-01-11 | 0  -- 0th consecutive day with a positive `v`
 2021-01-12 | 4  -- 1st consecutive day with a positive `v`
 2021-01-13 | 3  -- 2nd consecutive day with a positive `v`
 2021-01-14 | 1  -- 3rd consecutive day with a positive `v` (this!)
 2021-01-15 | 3  -- 4th consecutive day with a positive `v`
(11 rows)

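I want to find the first "3rd consecutive day with a positive `v`". In the example above, 2021-01-12 through 2021-01-14 form a run of three consecutive days with a positive `v`, so the expected answer is 2021-01-14. "Consecutive" refers to consecutive rows: weekends are filtered out, so 2021-01-01 (Friday) and 2021-01-04 (Monday) count as consecutive. If no such day exists, the query should return NULL.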

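Currently I pull the data into Python with pandas and compute the answer with a counter, but I would like to switch to PostgreSQL for performance reasons. An obvious solution is a recursive CTE, but I would like to avoid that (as well as custom aggregate functions or PL/pgSQL procedures) if possible: the solution will be part of a larger query, so I have to keep it as simple as I can to stop the query complexity from exploding. I mean, a recursive CTE inside another recursive CTE inside a LATERAL would be ridiculous...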
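For reference, the counter-based approach described above can be sketched in plain Python. This is only a minimal sketch, not the asker's actual code: the function name first_xth_consecutive_day, the parameter x, and the list-of-tuples input are assumptions made for illustration, and pandas is left out to keep the example self-contained.

```python
from datetime import date
from typing import Iterable, Optional, Tuple


def first_xth_consecutive_day(rows: Iterable[Tuple[date, float]], x: int = 3) -> Optional[date]:
    """Return the first day that is the x-th consecutive row with v > 0, or None."""
    run = 0  # length of the current run of rows with a positive v
    for d, v in sorted(rows):  # process the rows in date order
        run = run + 1 if v > 0 else 0  # extend the run, or reset it
        if run == x:
            return d
    return None


# Rows copied from the table in the question (weekdays only).
sample = [
    (date(2021, 1, 1), 1), (date(2021, 1, 4), 1), (date(2021, 1, 5), 0),
    (date(2021, 1, 6), 0), (date(2021, 1, 7), 0), (date(2021, 1, 8), 1),
    (date(2021, 1, 11), 0), (date(2021, 1, 12), 4), (date(2021, 1, 13), 3),
    (date(2021, 1, 14), 1), (date(2021, 1, 15), 3),
]

print(first_xth_consecutive_day(sample))  # 2021-01-14
```

Any SQL solution has to reproduce this single pass: keep a running count, reset it on a non-positive `v`, and stop at the first row where the count reaches x.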

【Question comments】:

【Answer 1】:

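It turns out you can write a custom aggregate without knowing PL/pgSQL, so that is what I did. The transition function below hard-codes x = 3; in general, to find the first "x-th consecutive day", the comparison should be retval_consecutive[2] = x - 1.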

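-- State (retval_consecutive): [label of the answer, or NULL if not found yet; length of the current run]
-- Input (label_cond):         [label of the row; 1 if the row satisfies the condition, else 0]
CREATE OR REPLACE FUNCTION first_xth_consecutive_label_transfn(retval_consecutive int[2], label_cond int[2])
    RETURNS int[2]
    LANGUAGE sql
    IMMUTABLE AS
$$
SELECT CASE
           WHEN retval_consecutive[1] IS NOT NULL THEN
               retval_consecutive -- already found
           WHEN label_cond[2] = 1 THEN
               -- run continues: record the label if this row is the 3rd in a row (previous run length = 2)
               ARRAY [CASE WHEN retval_consecutive[2] = 2 THEN label_cond[1] END, retval_consecutive[2] + 1]
           ELSE
               ARRAY [NULL, 0] -- condition failed: reset the run
           END
$$;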

CREATE OR REPLACE FUNCTION first_xth_consecutive_label_final(ans_consecutive int[2])
    RETURNS int
    LANGUAGE sql
    IMMUTABLE AS
$$
SELECT ans_consecutive[1];
$$;


DROP AGGREGATE IF EXISTS first_xth_consecutive_label(int[2]);
CREATE AGGREGATE first_xth_consecutive_label(int[2]) (
    sfunc = first_xth_consecutive_label_transfn,
    stype = int[2],
    finalfunc = first_xth_consecutive_label_final,
    initcond = '{NULL, 0}' -- array literal; the braces are required
    );

Usage:

SELECT setseed(0.5);

WITH stuff AS (
    SELECT d::date, floor(random() * 5) AS v
    FROM generate_series('2021-01-01'::date, '2021-01-15'::date, '1 day'::interval) t(d)
)
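-- ORDER BY inside the aggregate call makes the rows arrive in date order, which the transition function relies on.
-- Caveat: to_timestamp(...)::date converts in the session time zone, so the epoch round-trip can shift the date in time zones behind UTC.
SELECT to_timestamp(first_xth_consecutive_label(ARRAY [extract(epoch FROM d)::int, CASE WHEN v > 0 THEN 1 ELSE 0 END] ORDER BY d))::date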
FROM stuff
WHERE extract(isodow from d) BETWEEN 1 AND 5;
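The state machine behind the transition function can also be modelled outside the database, which makes the generalisation to an arbitrary x easier to see. The following Python sketch mirrors the SQL names for readability, but it is only an illustration under assumptions: make_transfn and the parameter x are hypothetical (the SQL above hard-codes 2, that is, x - 1 for x = 3), and plain integers stand in for the epoch labels.

```python
from functools import reduce
from typing import Callable, Iterable, Optional, Tuple

State = Tuple[Optional[int], int]  # (label of the answer or None, current run length)
Row = Tuple[int, int]              # (label of the row, 1 if the condition holds else 0)


def make_transfn(x: int) -> Callable[[State, Row], State]:
    """Build a transition function that looks for the x-th consecutive row."""
    def transfn(state: State, label_cond: Row) -> State:
        answer, run = state
        label, cond = label_cond
        if answer is not None:
            return state  # already found: keep the state unchanged
        if cond == 1:
            # record the label only when the previous run length is x - 1
            return (label if run == x - 1 else None, run + 1)
        return (None, 0)  # condition failed: reset the run
    return transfn


def first_xth_consecutive_label(rows: Iterable[Row], x: int = 3) -> Optional[int]:
    final_state = reduce(make_transfn(x), rows, (None, 0))  # (None, 0) plays the role of initcond
    return final_state[0]  # the final function returns the first element


# Day-of-month numbers from the question's table used as labels, in date order.
rows = [(1, 1), (4, 1), (5, 0), (6, 0), (7, 0), (8, 1), (11, 0), (12, 1), (13, 1), (14, 1), (15, 1)]
print(first_xth_consecutive_label(rows))  # 14
```

As in the SQL aggregate, the result is only meaningful when the rows are fed in date order.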

【Comments】:

【Answer 2】:

I used window functions, though I am not sure how well this solution performs:

SELECT setseed(0.5);

WITH stuff AS (
    SELECT d::date, floor(random() * 5) AS v
    FROM generate_series('2021-01-01'::date, '2021-01-15'::date, '1 day'::interval) t(d)
), tmp as (
    SELECT d, v, 
    LAG(v) OVER (ORDER BY d) AS v2, 
    LAG(v, 2) OVER (ORDER BY d) AS v3
    FROM stuff
    WHERE extract(isodow from d) BETWEEN 1 AND 5
)
SELECT d
FROM tmp
WHERE v > 0 and v2 > 0 AND v3 > 0
LIMIT 1;

【Comments】:

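Your approach is basically fine, but without an ORDER BY in the outer query, it may return any row that satisfies the condition rather than the first one.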
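The ordering point can be illustrated with a small Python model of the LAG-based check. It is a sketch under assumptions (the function name first_match_with_lag is hypothetical and the input is deliberately shuffled): several rows satisfy the three-in-a-row condition, so "first" has to be made explicit.

```python
from datetime import date
from typing import List, Optional, Tuple


def first_match_with_lag(rows: List[Tuple[date, float]]) -> Optional[date]:
    ordered = sorted(rows)  # plays the role of ORDER BY d in the window definition
    matches = [
        ordered[i][0]
        for i in range(2, len(ordered))
        # v > 0 AND v2 > 0 AND v3 > 0, with v2 and v3 being the two previous rows
        if ordered[i][1] > 0 and ordered[i - 1][1] > 0 and ordered[i - 2][1] > 0
    ]
    # More than one row can match; taking the minimum makes "first" explicit,
    # which is the job an outer ORDER BY d would do before LIMIT 1.
    return min(matches) if matches else None


# Rows from the question's table, listed out of order on purpose.
shuffled = [
    (date(2021, 1, 15), 3), (date(2021, 1, 12), 4), (date(2021, 1, 1), 1),
    (date(2021, 1, 14), 1), (date(2021, 1, 5), 0), (date(2021, 1, 13), 3),
    (date(2021, 1, 4), 1), (date(2021, 1, 8), 1), (date(2021, 1, 6), 0),
    (date(2021, 1, 11), 0), (date(2021, 1, 7), 0),
]

print(first_match_with_lag(shuffled))  # 2021-01-14
```

Both 2021-01-14 and 2021-01-15 satisfy the condition here, which is why the choice among the matches must be deterministic.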
