Finding the first "x-th consecutive day"
【Title】Finding the first "x-th consecutive day" 【Posted】2021-07-20 13:31:14 【Question】Given the following data:
SELECT setseed(0.5);
WITH stuff AS (
SELECT d::date, floor(random() * 5) AS v
FROM generate_series('2021-01-01'::date, '2021-01-15'::date, '1 day'::interval) t(d)
)
SELECT d, v
FROM stuff
WHERE extract(isodow from d) BETWEEN 1 AND 5;
More specifically, the query returns:
d | v
------------+---
2021-01-01 | 1 -- 1st consecutive day with a positive `v`
2021-01-04 | 1 -- 2nd consecutive day with a positive `v`
2021-01-05 | 0 -- 0th consecutive day with a positive `v`
2021-01-06 | 0 -- 0th consecutive day with a positive `v`
2021-01-07 | 0 -- 0th consecutive day with a positive `v`
2021-01-08 | 1 -- 1st consecutive day with a positive `v`
2021-01-11 | 0 -- 0th consecutive day with a positive `v`
2021-01-12 | 4 -- 1st consecutive day with a positive `v`
2021-01-13 | 3 -- 2nd consecutive day with a positive `v`
2021-01-14 | 1 -- 3rd consecutive day with a positive `v` (this!)
2021-01-15 | 3 -- 4th consecutive day with a positive `v`
(11 rows)
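I want to find the first "3rd consecutive day with a positive `v`". In the example above, 2021-01-12 through 2021-01-14 all have a positive `v`, so the expected answer is 2021-01-14. If no such day exists, the query should return NULL.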
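Currently I fetch the data into Python with pandas and compute the answer with a counter, but for performance reasons I would like to move this into PostgreSQL. An obvious solution is a recursive CTE, but I would like to avoid that (and, ideally, custom aggregate functions or PL/pgSQL procedures as well), because the solution will be part of a larger query and I need to keep it as simple as possible to stop the query complexity from exploding. I mean, having a recursive CTE inside another recursive CTE inside a LATERAL is ridiculous...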
【Comments on the question】:
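【Answer 1】: It turns out you can write a custom aggregate without knowing PL/pgSQL, so that is what I did. The state retval_consecutive holds the label found so far in element 1 and the length of the current run in element 2. In general, a row is the first "x-th consecutive day" when it qualifies and retval_consecutive[2] = x - 1 before it is processed; the code below hard-codes x = 3, hence the comparison with 2.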
-- Transition function. State is {found_label, current_run_length};
-- each input row is {label, cond}, where cond = 1 marks a qualifying day.
CREATE OR REPLACE FUNCTION first_xth_consecutive_label_transfn(retval_consecutive int[2], label_cond int[2])
    RETURNS int[2]
    LANGUAGE sql
    IMMUTABLE AS
$$
SELECT CASE
           WHEN retval_consecutive[1] IS NOT NULL THEN
               retval_consecutive -- already found
           WHEN label_cond[2] = 1 THEN
               -- record the label when two qualifying days precede this one (x = 3)
               ARRAY [CASE WHEN retval_consecutive[2] = 2 THEN label_cond[1] END, retval_consecutive[2] + 1]
           ELSE
               ARRAY [NULL, 0] -- run broken: reset the counter
           END
$$;

-- Final function: return the recorded label, or NULL if none was found.
CREATE OR REPLACE FUNCTION first_xth_consecutive_label_final(ans_consecutive int[2])
    RETURNS int
    LANGUAGE sql
    IMMUTABLE AS
$$
SELECT ans_consecutive[1];
$$;

DROP AGGREGATE IF EXISTS first_xth_consecutive_label(int[2]);
CREATE AGGREGATE first_xth_consecutive_label(int[2]) (
    sfunc = first_xth_consecutive_label_transfn,
    stype = int[2],
    finalfunc = first_xth_consecutive_label_final,
    initcond = '{NULL, 0}' -- an array literal must be wrapped in braces
);
Usage:
SELECT setseed(0.5);
WITH stuff AS (
SELECT d::date, floor(random() * 5) AS v
FROM generate_series('2021-01-01'::date, '2021-01-15'::date, '1 day'::interval) t(d)
)
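SELECT (to_timestamp(first_xth_consecutive_label(
            ARRAY [extract(epoch FROM d)::int, CASE WHEN v > 0 THEN 1 ELSE 0 END]
            ORDER BY d -- the aggregate is order-sensitive, so feed it rows by date
        )) AT TIME ZONE 'UTC')::date -- decode in UTC so the date does not shift with the session time zone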
FROM stuff
WHERE extract(isodow from d) BETWEEN 1 AND 5;
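The rule stated above (the label is recorded when the run counter equals x - 1) can be turned into a parameter instead of the hard-coded 2. The following is only a sketch under a few assumptions: the names first_consecutive_transfn, first_consecutive_final and first_consecutive are hypothetical, x is passed as an extra aggregate argument, the date is encoded as a day count since 1970-01-01 rather than epoch seconds (so no time zone is involved), and the sample rows are copied from the question's table instead of being regenerated with setseed.

```sql
-- Hypothetical generalisation of the aggregate above; not the answerer's code.
-- State is {found_label, current_run_length}.
CREATE OR REPLACE FUNCTION first_consecutive_transfn(run_state int[], day_label int, is_hit boolean, x int)
    RETURNS int[]
    LANGUAGE sql
    IMMUTABLE AS
$$
SELECT CASE
           WHEN run_state[1] IS NOT NULL THEN
               run_state -- already found, keep the first hit
           WHEN is_hit THEN
               -- this row is the x-th of the run when x - 1 hits precede it
               ARRAY [CASE WHEN run_state[2] = x - 1 THEN day_label END, run_state[2] + 1]
           ELSE
               ARRAY [NULL, 0] -- run broken (or condition NULL): reset
           END
$$;

CREATE OR REPLACE FUNCTION first_consecutive_final(run_state int[])
    RETURNS int
    LANGUAGE sql
    IMMUTABLE AS
$$
SELECT run_state[1];
$$;

DROP AGGREGATE IF EXISTS first_consecutive(int, boolean, int);
CREATE AGGREGATE first_consecutive(int, boolean, int) (
    sfunc = first_consecutive_transfn,
    stype = int[],
    finalfunc = first_consecutive_final,
    initcond = '{NULL, 0}'
);

-- Sample rows taken from the question's output (weekdays only).
SELECT date '1970-01-01'
           + first_consecutive(d - date '1970-01-01', v > 0, 3 ORDER BY d) AS first_third_day
FROM (VALUES (date '2021-01-01', 1),
             ('2021-01-04', 1),
             ('2021-01-05', 0),
             ('2021-01-06', 0),
             ('2021-01-07', 0),
             ('2021-01-08', 1),
             ('2021-01-11', 0),
             ('2021-01-12', 4),
             ('2021-01-13', 3),
             ('2021-01-14', 1),
             ('2021-01-15', 3)) AS t(d, v);
```

For these rows the query should return 2021-01-14, and NULL when no run of length x exists, since adding NULL to a date yields NULL.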
【Comments】:
【Answer 2】: I used window functions, but I am not sure how well this solution performs:
SELECT setseed(0.5);
WITH stuff AS (
SELECT d::date, floor(random() * 5) AS v
FROM generate_series('2021-01-01'::date, '2021-01-15'::date, '1 day'::interval) t(d)
), tmp as (
SELECT d, v,
LAG(v) OVER (ORDER BY d) AS v2,
LAG(v, 2) OVER (ORDER BY d) AS v3
FROM stuff
WHERE extract(isodow from d) BETWEEN 1 AND 5
)
SELECT d
FROM tmp
WHERE v > 0 and v2 > 0 AND v3 > 0
LIMIT 1;
【Comments】:
Your approach is basically fine, but without an ORDER BY in the outer query it may return any row that satisfies the condition, not necessarily the first one.
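One way to address the comment is sketched below. It assumes the same LAG-based approach as the answer, but uses min(d) in the outer query instead of LIMIT 1: that pins the result to the earliest matching date and, unlike LIMIT 1, yields a single NULL row when nothing matches, which is what the question asks for. The sample rows are copied from the question's table rather than regenerated with setseed, and the column alias first_third_day is made up for illustration.

```sql
WITH stuff(d, v) AS (
    -- Sample rows taken from the question's output (weekdays only).
    VALUES (date '2021-01-01', 1),
           ('2021-01-04', 1),
           ('2021-01-05', 0),
           ('2021-01-06', 0),
           ('2021-01-07', 0),
           ('2021-01-08', 1),
           ('2021-01-11', 0),
           ('2021-01-12', 4),
           ('2021-01-13', 3),
           ('2021-01-14', 1),
           ('2021-01-15', 3)
), tmp AS (
    SELECT d, v,
           LAG(v)    OVER (ORDER BY d) AS v2, -- value on the previous listed day
           LAG(v, 2) OVER (ORDER BY d) AS v3  -- value two listed days back
    FROM stuff
)
SELECT min(d) AS first_third_day -- earliest match, or NULL if there is none
FROM tmp
WHERE v > 0 AND v2 > 0 AND v3 > 0;
```

For these rows both 2021-01-14 and 2021-01-15 satisfy the filter, so the query should return 2021-01-14. Keeping the answer's shape and appending ORDER BY d before LIMIT 1 also fixes the ordering, but it returns zero rows rather than NULL when no day qualifies.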