PostgreSQL multi-column GROUP BY not using index when selecting minimum

Posted 2023-04-15


【Title】: PostgreSQL multi-column group by not using index when selecting minimum 【Posted】: 2021-02-07 11:55:50 【Question】:

When selecting MIN on a column after a GROUP BY over multiple columns in PostgreSQL (11, 12, 13), no index created on the grouping columns is used: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=30e0f341940f4c1fa6013677643a0baf

CREATE TABLE tags (id serial, series int, index int, page int);
CREATE INDEX ON tags (page, series, index);

INSERT INTO tags (series, index, page)
SELECT
    ceil(random() * 10),
    ceil(random() * 100),
    ceil(random() * 1000)
FROM generate_series(1, 100000);

EXPLAIN ANALYZE
SELECT tags.page, tags.series, MIN(tags.index)
FROM tags GROUP BY tags.page, tags.series;

HashAggregate  (cost=2291.00..2391.00 rows=10000 width=12) (actual time=108.968..133.153 rows=9999 loops=1)
  Group Key: page, series
  Batches: 1  Memory Usage: 1425kB
  ->  Seq Scan on tags  (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.015..55.240 rows=100000 loops=1)
Planning Time: 0.257 ms
Execution Time: 133.771 ms

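In theory, the index should let the database seek forward in steps of (tags.page, tags.series) instead of performing a full scan. For the data set above, that would mean processing 10,000 rows rather than 100,000. This link describes the approach for the case without grouping columns.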

This answer (as well as this one) suggests using DISTINCT ON with an ORDER BY instead of GROUP BY, but that produces this query plan:

Unique  (cost=0.42..5680.42 rows=10000 width=12) (actual time=0.066..268.038 rows=9999 loops=1)
  ->  Index Only Scan using tags_page_series_index_idx on tags  (cost=0.42..5180.42 rows=100000 width=12) (actual time=0.064..227.219 rows=100000 loops=1)
        Heap Fetches: 100000
Planning Time: 0.426 ms
Execution Time: 268.712 ms

Although the index is now being used, it still appears to scan the full set of rows. With SET enable_seqscan=OFF, the GROUP BY query degrades to the same behaviour.
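The DISTINCT ON query behind the plan above is not shown in the question. A minimal sketch of what it presumably looks like, assuming the tags table and index defined earlier:

```sql
-- Assumed reconstruction of the DISTINCT ON variant; the exact query
-- text is not given in the question.
-- For each (page, series) group, keep the first row in sort order,
-- i.e. the row with the smallest index value.
EXPLAIN ANALYZE
SELECT DISTINCT ON (page, series) page, series, index
FROM tags
ORDER BY page, series, index;
```

The line Heap Fetches: 100000 in the plan means the index-only scan still had to visit the heap for every row. Running VACUUM on the table beforehand updates the visibility map and usually reduces that number, though the scan would still read every index entry.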

How can I encourage PostgreSQL to make use of the multi-column index?


【Answer 1】:

If you can pull the set of distinct (page, series) pairs from another table, you can hack it with a lateral join:

CREATE TABLE pageseries AS SELECT DISTINCT page,series FROM tags ORDER BY page,series;
EXPLAIN ANALYZE SELECT p.*, minindex FROM pageseries p CROSS JOIN LATERAL (SELECT index minindex FROM tags t WHERE t.page=p.page AND t.series=p.series ORDER BY page,series,index LIMIT 1) x;
 Nested Loop  (cost=0.42..8720.00 rows=10000 width=12) (actual time=0.039..56.013 rows=10000 loops=1)
   ->  Seq Scan on pageseries p  (cost=0.00..145.00 rows=10000 width=8) (actual time=0.012..1.872 rows=10000 loops=1)
   ->  Limit  (cost=0.42..0.84 rows=1 width=12) (actual time=0.005..0.005 rows=1 loops=10000)
         ->  Index Only Scan using tags_page_series_index_idx on tags t  (cost=0.42..4.62 rows=10 width=12) (actual time=0.004..0.004 rows=1 loops=10000)
               Index Cond: ((page = p.page) AND (series = p.series))
               Heap Fetches: 0
 Planning Time: 0.168 ms
 Execution Time: 57.077 ms

...but it is not necessarily faster:

EXPLAIN ANALYZE
SELECT tags.page, tags.series, MIN(tags.index)
FROM tags GROUP BY tags.page, tags.series;

 HashAggregate  (cost=2291.00..2391.00 rows=10000 width=12) (actual time=56.177..58.923 rows=10000 loops=1)
   Group Key: page, series
   Batches: 1  Memory Usage: 1425kB
   ->  Seq Scan on tags  (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.010..12.845 rows=100000 loops=1)
 Planning Time: 0.129 ms
 Execution Time: 59.644 ms

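It gets much faster when the nested loop runs only a few iterations, in other words, when the number of distinct (page, series) pairs is small. I'll try it with series alone, since it has only 10 distinct values. The plan below uses an index named tags_series_index_idx, which presumably comes from an index on (series, index) whose creation is not shown here: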

CREATE TABLE series AS SELECT DISTINCT series FROM tags;
EXPLAIN ANALYZE SELECT p.*, minindex FROM series p CROSS JOIN LATERAL (SELECT index minindex FROM tags t WHERE t.series=p.series ORDER BY series,index LIMIT 1) x;
 Nested Loop  (cost=0.29..886.18 rows=2550 width=8) (actual time=0.081..0.264 rows=10 loops=1)
   ->  Seq Scan on series p  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.007..0.010 rows=10 loops=1)
   ->  Limit  (cost=0.29..0.31 rows=1 width=8) (actual time=0.024..0.024 rows=1 loops=10)
         ->  Index Only Scan using tags_series_index_idx on tags t  (cost=0.29..211.29 rows=10000 width=8) (actual time=0.023..0.023 rows=1 loops=10)
               Index Cond: (series = p.series)
               Heap Fetches: 0
 Planning Time: 0.198 ms
 Execution Time: 0.292 ms

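In this case it is definitely worth it, since the query only touches 10 of the 100,000 rows. The other query touches 10,000 of the 100,000 rows, i.e. 10% of the table, which is above the threshold where an index really helps.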

Note that putting the lower-cardinality column first results in a smaller index:

CREATE INDEX ON tags (series, page, index);
select pg_relation_size( 'tags_page_series_index_idx' );
          4284416
select pg_relation_size( 'tags_series_page_index_idx' );
          3104768

...but it does not make the query any faster.

If this kind of thing really matters, it may be worth trying ClickHouse or DolphinDB.


【Answer 2】:

To support this kind of thing, PostgreSQL would need something like an index skip scan, and even that would only be efficient when there are few groups.
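A skip scan can be emulated by hand with a recursive CTE that jumps from one (page, series) group to the next. The following is a sketch, not something taken from this answer: it assumes the tags table and the (page, series, index) index from the question, and that page and series contain no NULLs. The CTE name skip is arbitrary.

```sql
WITH RECURSIVE skip AS (
    -- Anchor: the first index entry, i.e. the smallest (page, series)
    -- together with its minimum index value.
    (SELECT page, series, index
     FROM tags
     ORDER BY page, series, index
     LIMIT 1)
    UNION ALL
    -- Step: seek to the first index entry of the next (page, series) group.
    -- The recursion ends when the lateral subquery returns no row.
    SELECT n.page, n.series, n.index
    FROM skip
    CROSS JOIN LATERAL (
        SELECT page, series, index
        FROM tags
        WHERE (page, series) > (skip.page, skip.series)
        ORDER BY page, series, index
        LIMIT 1
    ) n
)
SELECT page, series, index AS min_index
FROM skip;
```

Each step is one index probe, so the total work grows with the number of groups rather than the number of rows. With the 10,000 groups in the question's data set there is no guarantee this beats the plain HashAggregate; it is most likely to pay off when groups are few and large.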

If the speed of that query is important, you could consider using a materialized view.
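A minimal sketch of the materialized-view approach, assuming the tags table from the question. The view name tags_min_index is made up for illustration.

```sql
-- Precompute the per-group minimum once.
CREATE MATERIALIZED VIEW tags_min_index AS
SELECT page, series, MIN(index) AS min_index
FROM tags
GROUP BY page, series;

-- A unique index speeds up lookups and is required for
-- REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON tags_min_index (page, series);

-- Reads now hit the small precomputed table.
SELECT page, series, min_index
FROM tags_min_index;

-- The view is not updated automatically; refresh it after tags changes.
REFRESH MATERIALIZED VIEW CONCURRENTLY tags_min_index;
```

The trade-off is staleness: between refreshes the view reflects the state of tags at the last refresh, so this fits workloads where slightly outdated minima are acceptable.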

