如何通过 Postgres 11 函数（存储过程）在某个时区返回 $TIMESTAMP 或之前的最新行？

Posted 2023-02-16

技术标签:

【中文标题】如何通过 Postgres 11 函数（存储过程）在某个时区返回 $TIMESTAMP 或之前的最新行？【英文标题】：How can I return the most recent rows at or before $TIMESTAMP at a certain time zone, via a Postgres 11 function (stored proc)? 【发布时间】：2020-09-13 22:47:27 【问题描述】：

我有一个像这样的 Postgres 11 表：

CREATE TABLE schema.foo_numbers (

 id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
 created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
 quantity INTEGER,
 category TEXT
);

它有一些数据，例如：

 id |       created_at                  | quantity | category 
----+------------------------+----------+----------
  1 | 2020-01-01 12:00:00+00 |        2    | a
  2 | 2020-01-02 17:00:00+00 |        1    | b
  3 | 2020-01-01 15:00:00+00 |        6    | a
  4 | 2020-01-04 09:00:00+00 |        1    | b
  5 | 2020-01-05 19:00:00+00 |        2    | a
  6 | 2020-01-06 23:00:00+00 |        8    | b
  7 | 2020-01-07 20:00:00+00 |        1    | a
  8 | 2020-01-08 04:00:00+00 |        2    | b
  9 | 2020-01-09 23:00:00+00 |        1    | a
 10 | 2020-01-10 19:00:00+00 |        1   | b
 11 | 2020-01-11 05:00:00+00 |        1    | a
 12 | 2020-01-12 21:00:00+00 |        1   | b
 13 | 2020-01-13 01:00:00+00 |        1   | a
 14 | 2020-01-14 18:00:00+00 |        1   | b

我有另一个表跟踪 foo 类别的某些属性：

create table schema.foo_category_properties (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY, 
category TEXT NOT NULL,  
some_bool BOOLEAN NOT NULL DEFAULT FALSE
);

这个表有这样的数据：

 id | category | some_bool 
----+----------+-----------
  1 | a        | f
  2 | b        | f

我需要创建一个 postgres 函数（通过 postgREST api 从应用程序逻辑调用），对于参数 $TIMESTAMP ，它将返回每个类别的最新记录 created_at

理想情况下，传入的参数将被视为 TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles' ，并且该函数返回最新记录，其时间戳显示在同一时区 - 但是，如果这是不可能的，并且所有时间戳都保留在 UTC [在应用程序逻辑中偏移]，前提是以一致的方式返回正确的数据。

服务器时间设置为 UTC：

psql => show time zone; 
 TimeZone 
----------
 UTC
(1 row)

我写的postgres函数是这样的：

CREATE OR REPLACE FUNCTION schema.foo_proc (end_date TEXT)   

  RETURNS TABLE (
      id INTEGER,
      category TEXT,
      quantity BIGINT,
      snapshot_count NUMERIC,
      latest_entry TIMESTAMP WITH TIME ZONE 
  )
  AS $$
  #variable_conflict use_column

BEGIN
RETURN QUERY 

    SELECT 
    alias1.id,
    alias1.category, 
    alias1.quantity,
    alias1.snapshot_count,
    alias2.latest_entry AS latest_entry

    FROM
      (
          SELECT 
          id,
          category,
          quantity,
          sum(quantity) OVER (partition by category ORDER BY created_at) AS snapshot_count 
          FROM 
          schema.foo_numbers 
      ) AS alias1
    INNER JOIN 
    (
       SELECT
        max(id) AS id,
        category, 
        max(created_at AT TIME ZONE 'America/Los_Angeles') AS latest_entry
        from 
        schema.foo_numbers 
        WHERE created_at AT TIME ZONE 'America/Los_Angeles' <= to_timestamp($1', 'YYYY-MM-DD HH24:MI:SS') :: TIMESTAMPTZ AT TIME ZONE 'America/Los_Angeles'
        group by category
        order by category 
    ) AS alias2
    ON 
    alias1.id = alias2.id
    INNER JOIN 
    schema.foo_category_properties fcp 
    ON 
    alias2.category = fcp.category
    WHERE fcp.some_bool IS FALSE
    ORDER BY 
    alias1.category 
    ;
END;
$$ LANGUAGE plpgsql;

这是foo_numbers 中的数据，时间戳转移到时区“America/Los_Angeles”

psql=> select id, created_at at time zone 'america/los_angeles', quantity, category  from schemai.foo_numbers order by created_at;

 id |      timezone               | quantity | category 
----+---------------------+----------+----------
  1 | 2020-01-01 04:00:00 |        2     | a
  3 | 2020-01-01 07:00:00 |        6     | a
  2 | 2020-01-02 09:00:00 |        1     | b
  4 | 2020-01-04 01:00:00 |        1     | b
  5 | 2020-01-05 11:00:00 |        2     | a
  6 | 2020-01-06 15:00:00 |        8     | b
  7 | 2020-01-07 12:00:00 |        1     | a
  8 | 2020-01-07 20:00:00 |        2     | b
  9 | 2020-01-09 15:00:00 |        1     | a
 10 | 2020-01-10 11:00:00 |        1    | b
 11 | 2020-01-10 21:00:00 |        1    | a
 12 | 2020-01-12 13:00:00 |        1    | b
 13 | 2020-01-12 17:00:00 |        1    | a
 14 | 2020-01-14 10:00:00 |        1    | b
(14 rows)

参数的预期输出： "end_date":"2020-01-07 19:00:00"

应该是

 id | category | quantity | snapshot_count |      latest_entry      
----+----------+----------+----------------+------------------------
  6 | b        |        8 |             10 | 2020-01-06 15:00:00
  7 | a        |        1 |             11 | 2020-01-07 12:00:00
(2 rows)

但是，相同参数的实际输出是：

 id | category | quantity | snapshot_count |      latest_entry      
----+----------+----------+----------------+------------------------
  5 | a        |        2 |             10 | 2020-01-05 19:00:00+00
  6 | b        |        8 |             10 | 2020-01-06 23:00:00+00
(2 rows)

在 UTC 时将参数转换为 timestamptz 会发生类似的意外结果。

在我尝试过的所有变体中，返回的行都没有正确匹配参数边界。

显然，我无法理解 PG 中如何处理时区——我已经详细阅读了官方文档和一些关于 SO 的相关问题，以及在 to_timestamp() 的 PG 论坛上功能进行了讨论，但经过大量的试验和错误后未能获得正确的结果。

非常感谢所有指导！

【问题讨论】：

【参考方案1】：

您可以使用distinct on 和正确的时区翻译：

select distinct on (n.category)
    n.id,
    n.created_at at time zone 'America/Los_Angeles' at time zone 'utc' created_at,
    n.quantity,
    n.category,
    sum(quantity) 
        over (partition by n.category order by n.created_at) as snapshot_count 
from foo_numbers n
inner join foo_category_properties cp on cp.category = n.category
where n.created_at <= '2020-01-07 19:00:00'::timestamp with time zone 
                          at time zone 'utc' at time zone 'America/Los_Angeles'
order by n.category, n.created_at desc

Demo on DB Fiddle：

编号 | created_at |数量 |类别 |快照计数 -: | :--------------------- | --------: | :------- | -------------: 7 | 2020-01-07 12:00:00+00 | 1 |一个 | 11 6 | 2020-01-06 15:00:00+00 | 8 |乙 | 10

【讨论】：

这非常好用，而且比我的解决方案优雅得多！我不太明白的一件事是：为什么每次处理时间戳的时区都需要声明 [converted] 两次，以及为什么调用各个时区的顺序是 reversed从第一次治疗到第二次治疗？

以上是关于如何通过 Postgres 11 函数（存储过程）在某个时区返回 $TIMESTAMP 或之前的最新行？的主要内容，如果未能解决你的问题，请参考以下文章