Can aggregate filter expressions not use indices?

Posted 2023-04-12

Asked: 2018-01-12 20:35:03

Question:

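One cool thing about filter expressions is that you can run several different filters and aggregations in a single query. The "where" part becomes part of each aggregate instead of a "where" clause over the whole query.

For example: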

SELECT count('id') FILTER (WHERE account_type=1) as regular,
       count('id') FILTER (WHERE account_type=2) as gold,
       count('id') FILTER (WHERE account_type=3) as platinum
FROM clients;

(From the Django documentation)

Either this is a bug in PostgreSQL 9.5, or I'm doing it wrong, or it's simply a limitation of PostgreSQL.

Consider these two queries:

select count(*)
from main_search
where created >= '2017-10-12T00:00:00.081739+00:00'::timestamptz
and created < '2017-10-13T00:00:00.081739+00:00'::timestamptz
and parent_id is null;

select
count('id') filter (
where created >= '2017-10-12T00:00:00.081739+00:00'::timestamptz
and created < '2017-10-13T00:00:00.081739+00:00'::timestamptz
and parent_id is null) as count
from main_search;

(The main_search table has a combined btree index on created and parent_id is null.)

Here is the output:

 count
-------
  9682
(1 row)

 count
-------
  9682
(1 row)

If I prefix each query with explain analyze, this is the output:

    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1174.04..1174.05 rows=1 width=0) (actual time=5.077..5.077 rows=1 loops=1)
   ->  Index Scan using main_search_created_parent_id_null_idx on main_search  (cost=0.43..1152.69 rows=8540 width=0) (actual time=0.026..4.384 rows=9682 loops=1)
         Index Cond: ((created >= '2017-10-11 20:00:00.081739-04'::timestamp with time zone) AND (created < '2017-10-12 20:00:00.081739-04'::timestamp with time zone))
 Planning time: 0.826 ms
 Execution time: 5.227 ms
(5 rows)

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=178054.93..178054.94 rows=1 width=12) (actual time=1589.006..1589.007 rows=1 loops=1)
   ->  Seq Scan on main_search  (cost=0.00..146459.39 rows=4212739 width=12) (actual time=0.051..882.099 rows=4212818 loops=1)
 Planning time: 0.051 ms
 Execution time: 1589.070 ms
(4 rows)

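Note that the SELECT statement with the filter expression always uses a sequential scan (Seq Scan) instead of an index scan.

I also tried this on another PostgreSQL 9.5 table in a different database. At first I thought the "Seq Scan" happened because the table had too few rows, but both tables are large enough that the index should kick in.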

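Comments:

If I had to guess, it's because the Postgres optimizer assumes that an aggregate query without a where clause has to read all rows. So there is no real optimization in using the index instead of the raw data (well, apart from limiting the total number of bytes read).

I wrote a related blog post: peterbe.com/plog/conditional-aggregation-in-django-2.0

The two queries don't do the same thing. The second one doesn't restrict the rows to look at (there is no where clause), so a seq scan is more efficient. If you only want to count a subset of the rows, use a where clause. By the way: count('id') makes no sense, at least to me. If anything, count(id) would make more sense.

Answer 1: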

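You have misunderstood the use case. FILTER only affects the aggregation over the ALREADY PRODUCED DATASET. It does not filter records.

Consider a modified example: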

SELECT count(*) FILTER (WHERE account_type=1) as regular,
       count(*) FILTER (WHERE account_type=2) as gold,
       count(*) FILTER (WHERE account_type=3) as platinum,
       count(*) 
FROM clients;

So what should the where clause be here?

WHERE
(WHERE account_type=3)
or
(WHERE account_type=2)
or
(WHERE account_type=1)
or 1=1 ???

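Now consider more complex combinations of FILTERs and unfiltered columns. That would be a nightmare for the optimizer.

When you think about FILTER, think of it as just a shortcut for a longer expression such as CASE: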

SELECT SUM(CASE WHEN account_type=1 THEN 1 ELSE 0 END) as regular,
       SUM(CASE WHEN account_type=2 THEN 1 ELSE 0 END) as gold,
       SUM(CASE WHEN account_type=3 THEN 1 ELSE 0 END) as platinum
FROM clients;
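
If the aim is to keep the FILTER aggregates and still give the planner a chance to use an index, one option is to repeat the union of the filter conditions in an ordinary WHERE clause. The following is only a sketch: it assumes, hypothetically, that clients has an index on account_type and that the unfiltered count(*) is not needed. Whether the planner actually picks the index still depends on its selectivity estimates.

```sql
-- Assumption: an index like this exists (the name is hypothetical).
-- CREATE INDEX clients_account_type_idx ON clients (account_type);

SELECT count(*) FILTER (WHERE account_type = 1) AS regular,
       count(*) FILTER (WHERE account_type = 2) AS gold,
       count(*) FILTER (WHERE account_type = 3) AS platinum
FROM clients
-- The WHERE clause restricts which rows are read at all;
-- each FILTER then only decides which of those rows its aggregate counts.
WHERE account_type IN (1, 2, 3);
```

For a single count, as in the question, the plain where version (the first of the two queries) is already the form that lets the index be used.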
