Optimizing MIN / MAX queries on time-series data

Posted 2023-02-24


【Title】: Optimizing MIN / MAX queries on time-series data 【Posted】: 2020-03-11 02:31:36 【Question】:

I have several large time-series tables that contain many NULLs (each table may have up to 300 columns), for example:

Time-series table

time                |   a     | b        | c       | d
--------------------+---------+----------+---------+---------
2016-05-15 00:08:22 |         |          |         |         
2016-05-15 13:50:56 |         |          | 26.8301 |
2016-05-15 01:41:58 |         |          |         |            
2016-05-15 00:01:37 |         |          |         |            
2016-05-15 01:45:18 |         |          |         |         
2016-05-15 13:45:32 |         |          | 26.9688 |
2016-05-15 00:01:48 |         |          |         |         
2016-05-15 13:47:56 |         |          |         | 27.1269
2016-05-15 00:01:22 |         |          |         |            
2016-05-15 13:35:36 | 26.7441 | 29.8398  |         | 26.9981
2016-05-15 00:08:53 |         |          |         |         
2016-05-15 00:08:30 |         |          |         |         
2016-05-15 13:14:59 |         |          |         |         
2016-05-15 13:33:36 | 27.4277 | 29.7695  |         |                            
2016-05-15 13:36:36 | 27.4688 | 29.6836  |         |            
2016-05-15 13:37:36 | 27.1016 | 29.8516  |         |

I want to optimize the queries that find the first and last value in each column, i.e.:

select MIN(time), MAX(time) from TS where a is not null

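(These queries can run for several minutes.)

I plan to create a metadata table that holds each column name together with its first and last timestamps:

Metadata table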

col_name | first_time          | last_time
---------+---------------------+--------------------
a        | 2016-05-15 13:33:36 | 2016-05-15 13:37:36
b        | 2016-05-15 13:33:36 | 2016-05-15 13:37:36
c        | 2016-05-15 13:45:32 | 2016-05-15 13:50:56
d        | 2016-05-15 13:35:36 | 2016-05-15 13:47:56

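This way no scan over NULLs is needed at query time; I only access the values at the first and last timestamps.

However, I want to avoid updating the metadata table by hand every time the time-series data is modified. Instead, I would like to create a generic trigger function that updates the first_time and last_time columns of the metadata table on every insert, update, or delete on the time-series table. The trigger function should compare the existing timestamps in the metadata table against the inserted/deleted rows.

Is it possible to create a generic trigger function that does not contain the exact column names of the time-series table?

Thanks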

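【Comments】:

Try instead placing indexes on (a asc, time asc) and (a asc, time desc) (and likewise for b, c and d). You can have a "metadata" view if you like. It is best to avoid creating redundancy.

Do you mean a "materialized" view that is refreshed periodically? I don't see any optimization in a standard view......

No. Just a "normal" view. The optimization is the indexes.

I didn't mention it at first, but I can have up to 300 columns in the TS table. I suspect that having 300 indexes will hurt insert performance more than a trigger function would....

I would also expect the indexes to be slower than a trigger, because the trigger only needs to update a row when a new minimum or maximum is inserted, whereas an index has to be updated for every inserted value. By the way, you may be tempted to write dynamic code in the trigger that loops over all columns, but in my experience it is better to write a script that generates a trigger with specific code for each column.

【Answer 1】:

You can unpivot with union all. I suggest using a view rather than a trigger. It is more flexible, easier to maintain, and does not slow down your DML statements: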

create view metadata_view as
select 'a' col_name, min(time) first_time, max(time) last_time from ts where a is not null
union all select 'b', min(time), max(time) from ts where b is not null
union all select 'c', min(time), max(time) from ts where c is not null
union all select 'd', min(time), max(time) from ts where d is not null

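For performance, you want the following indexes:

create index on ts (a, time);
create index on ts (b, time);
create index on ts (c, time);
create index on ts (d, time);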
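If the plain view is still too slow even with indexes, the materialized view raised in the discussion is one alternative. The following is only a sketch: the name metadata_mv is hypothetical, the refresh schedule is left to the reader, and the stored result is only as fresh as the last refresh.

```sql
-- Sketch only: store the result of the per-column MIN / MAX query.
-- metadata_mv is an illustrative name, not part of the original answer.
create materialized view metadata_mv as
select 'a' as col_name, min(time) as first_time, max(time) as last_time from ts where a is not null
union all select 'b', min(time), max(time) from ts where b is not null
union all select 'c', min(time), max(time) from ts where c is not null
union all select 'd', min(time), max(time) from ts where d is not null;

-- REFRESH ... CONCURRENTLY needs a unique index on the materialized view.
create unique index on metadata_mv (col_name);

-- Re-run the stored query without blocking readers of metadata_mv.
refresh materialized view concurrently metadata_mv;
```

Reads from metadata_mv are then cheap, but each refresh pays the full cost of the underlying queries, so this only moves the slow work out of the read path.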

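【Comments】:

If I understand the OP's intent, the a, b, c, d in the projected column list should be in single quotes. As written, the query should raise an error because they are not in the GROUP BY.

I am trying to understand the idea: queries like select 'b', min(time), max(time) from ts where b is not null can run for several minutes. As I understand it, putting a standard view on top of these queries gives no optimization when I am trying to find the first and last values quickly. So are you suggesting a materialized view?

@Miro: OK, you did not initially mention that these queries are slow. Before going for a materialized view, make sure you have the indexes I just added to the answer.

@Miro And: before worrying about performance, maybe reconsider your database design?

What if I have 300 columns? Are you sure insert performance with 300 indexes would still be reasonable? I am working with time-series data, so I am not sure how to reconsider the database design.....

【Answer 2】:

This would be better done with multiple columns: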

select min(time) filter (where a is not null) as a_min,
       max(time) filter (where a is not null) as a_max,
       min(time) filter (where b is not null) as b_min,
       max(time) filter (where b is not null) as b_max,
       min(time) filter (where c is not null) as c_min,
       max(time) filter (where c is not null) as c_max,
       min(time) filter (where d is not null) as d_min,
       max(time) filter (where d is not null) as d_max
from ts;

You can unpivot after this step:

select v.*
from (select min(time) filter (where a is not null) as a_min,
             max(time) filter (where a is not null) as a_max,
             min(time) filter (where b is not null) as b_min,
             max(time) filter (where b is not null) as b_max,
             min(time) filter (where c is not null) as c_min,
             max(time) filter (where c is not null) as c_max,
             min(time) filter (where d is not null) as d_min,
             max(time) filter (where d is not null) as d_max
      from ts
     ) x cross join lateral
     (values ('a', a_min, a_max),
             ('b', b_min, b_max),
             ('c', c_min, c_max),
             ('d', d_min, d_max)
     ) v(which, min_val, max_val);

I would not create triggers; I would go for indexes instead, which also work with GMB's approach.
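One way to keep the index approach affordable on sparse columns is a partial index per value column, as touched on in the comments. This is a sketch under assumptions: the index names are illustrative, and whether the planner actually uses these indexes should be confirmed with EXPLAIN on the real data.

```sql
-- Sketch only: each index covers just the rows where its column is populated,
-- so a mostly-NULL column yields a small index. Names are illustrative.
create index ts_time_a_idx on ts (time) where a is not null;
create index ts_time_b_idx on ts (time) where b is not null;
create index ts_time_c_idx on ts (time) where c is not null;
create index ts_time_d_idx on ts (time) where d is not null;

-- The WHERE clause matches the index predicate, so the planner should be
-- able to read the smallest and largest time from the two ends of the index.
explain
select min(time), max(time) from ts where a is not null;
```

Because an insert only touches the indexes whose column is non-NULL in the new row, the write cost grows with the number of populated columns per row rather than with the total column count.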

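【Comments】:

Could you even create partial indexes?

What do you mean by a partial index in this case? Thanks

@Miro . . . I am thinking of an index on the time column restricted to the rows where one of the four other columns is NOT NULL (four separate indexes, one per column).

【Answer 3】:

You can build dynamic queries inside a trigger function; see this example from how-to-implement-dynamic-sql-in-postgresql-10: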

CREATE OR REPLACE FUNCTION car_portal_app.get_account (predicate TEXT)
RETURNS SETOF car_portal_app.account AS
$$
BEGIN
RETURN QUERY EXECUTE 'SELECT * FROM car_portal_app.account WHERE ' || predicate;
END;
$$ LANGUAGE plpgsql;

The format function also helps with building the query string.
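As an illustration of format combined with the column list from information_schema, the sketch below generates the union all view from Answer 1 for every column instead of typing 300 branches by hand. It assumes the table is public.ts and that every column other than time is a value column; both are assumptions, not something the thread states.

```sql
-- Sketch only: build "select ... union all select ..." from the catalog.
-- Assumes table public.ts, with "time" as the only non-value column.
do $$
declare
    view_sql text;
begin
    select string_agg(
               format('select %L as col_name, min(time) as first_time, '
                      || 'max(time) as last_time from ts where %I is not null',
                      column_name, column_name),
               E'\nunion all '
               order by ordinal_position)
      into view_sql
      from information_schema.columns
     where table_schema = 'public'
       and table_name   = 'ts'
       and column_name <> 'time';

    -- %L quoted the column name as a literal, %I as an identifier.
    execute 'create or replace view metadata_view as ' || view_sql;
end;
$$;
```

Running the block again after adding columns to ts regenerates the view, which is close in spirit to the "generate the code with a script" advice given in the comments on the question.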

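You can implement a trigger that fires once per statement (rather than once per row). The PostgreSQL documentation has a good example: see "Example 43.7. Auditing with Transition Tables" in section 43.10, Trigger Functions.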

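This works very well for inserts. But when a column's min/max value is updated or deleted, you have to check all rows again to find the new min/max. If that takes several minutes, it should not be done in a trigger.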
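A minimal sketch of the insert-only case, combining a statement-level trigger, a transition table, and dynamic SQL, might look like the following. The names ts_metadata, ts_track_min_max and new_rows are hypothetical, time is assumed to be a timestamp column, and updates and deletes are deliberately not handled for the reason given above. Transition tables need PostgreSQL 10 or later, and the execute function spelling needs 11 or later (older versions use execute procedure).

```sql
-- Hypothetical metadata table: one row per value column.
create table if not exists ts_metadata (
    col_name   text primary key,
    first_time timestamp,
    last_time  timestamp
);

create or replace function ts_track_min_max() returns trigger as
$$
declare
    col text;
begin
    -- Loop over the value columns of whichever table fired the trigger,
    -- so no column names are hard-coded in the function.
    for col in
        select column_name::text
          from information_schema.columns
         where table_schema = TG_TABLE_SCHEMA
           and table_name   = TG_TABLE_NAME
           and column_name <> 'time'
    loop
        -- new_rows is the transition table: only the rows just inserted.
        -- HAVING skips columns that are NULL in every inserted row.
        execute format(
            'insert into ts_metadata as m (col_name, first_time, last_time)
             select %L, min(time), max(time)
               from new_rows
              where %I is not null
             having count(*) > 0
             on conflict (col_name) do update
                set first_time = least(m.first_time, excluded.first_time),
                    last_time  = greatest(m.last_time, excluded.last_time)',
            col, col);
    end loop;
    return null;  -- the result is ignored for AFTER statement-level triggers
end;
$$ language plpgsql;

create trigger ts_track_min_max_ins
    after insert on ts
    referencing new table as new_rows
    for each statement
    execute function ts_track_min_max();
```

With 300 columns this runs 300 small statements per insert statement, all against the transition table rather than ts itself, so batching inserts keeps the overhead per row low. Generating one static statement per column with a script, as suggested in the comments on the question, is a reasonable alternative to the loop.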

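【Comments】:

Sorry if it was not clear: the question is how to write a trigger function that updates the min/max timestamps in a separate table, based on the timestamps available in the data, when rows are inserted or deleted (rather than how to write a generic trigger function).

@Miro: Copied directly from your question: Any idea if it's possible to create a generic Trigger Function.. :D

Maybe I misunderstood your answer, or you misunderstood... I don't see any relation between the function you posted and the question asked. I am not sure why the trigger would run for several minutes, since it should only check the inserted data. However, "transition tables" are an interesting approach.....

@Miro My answer was only meant to point you in the right direction, not to provide an exact solution. With the format function you can build a dynamic query that can do anything. You can get the table's column names from information_schema. The "several minutes" part refers only to update/delete. Inserts are fine.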
