升级到版本 11 后 Postgres 查询产生高磁盘 IO

Posted 2023-04-14

技术标签:

【中文标题】升级到版本 11 后 Postgres 查询产生高磁盘 IO【英文标题】：Postgres query making high disk IO after upgrade to version 11 【发布时间】：2021-02-01 12:42:22 【问题描述】：

从 RDS 9.6 版升级到 RDS 11 版后，Postgres 查询开始谈论高读取 IOPS 和 CPU。数据集与升级前相同。不知道是什么问题。

以下是解释计划：

可能是因为索引损坏了？

Explain (analyze true, verbose true, costs true, buffers true, timing true )
select consumertr0_.ref_id as col_0_0_
from consumer_transactions consumertr0_
where (consumertr0_.remaining_amount is not null)
  and (consumertr0_.expiry_time is not null)
  and consumertr0_.expiry_time>'2020-12-15T00:00:00'
  and consumertr0_.expiry_time<now()
  and consumertr0_.remaining_amount>0
order by consumertr0_.expiry_time asc
limit 20000;

                  
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..80391.67 rows=20000 width=24) (actual time=191716.213..192077.229 rows=20000 loops=1)
   Output: ref_id, expiry_time
   Buffers: shared hit=9481343 read=1566736
   I/O Timings: read=97.486
   ->  Index Scan using consumer_transactions_expiry_time_remaining_amount on public.consumer_transactions consumertr0_  (cost=0.57..1109723.40 rows=276081 width=24) (actual time=191716.211..192075.241 rows=20000 loops=1)
         Output: ref_id, expiry_time
         Index Cond: ((consumertr0_.expiry_time > '2020-12-15 00:00:00'::timestamp without time zone) AND (consumertr0_.expiry_time < now()))
         Buffers: shared hit=9481343 read=1566736
         I/O Timings: read=97.486
 Planning Time: 1.525 ms
 Execution Time: 192078.720 ms
(11 rows)

索引定义：

"consumer_transactions_expiry_time_remaining_amount" btree
   (expiry_time, remaining_amount)
WHERE expiry_time IS NOT NULL
  AND remaining_amount IS NOT NULL
  AND remaining_amount > 0::numeric

分析细节：

        relname        |         last_analyze          |       last_autoanalyze      
 consumer_transactions | 2021-01-24 22:00:03.144379+00 |

之前处理相同数量的记录的速度非常快，IOPS 要求很低。虽然我没有以前版本 9.6 的解释计划。

解决方案： 我用不同的名称创建了相同的索引，它解决了这个问题。一旦确定为什么旧索引在升级后突然变得如此缓慢，我将删除旧索引。

用新索引解释计划：

explain (analyze true, verbose true, costs true, buffers true, timing true )  select consumertr0_.ref_id as col_0_0_ from consumer_transactions consumertr0_ where (consumertr0_.remaining_amount is not null) and (consumertr0_.expiry_time is not null) and consumertr0_.expiry_time>'2019-07-01T00:00:00' and consumertr0_.expiry_time<now() and consumertr0_.remaining_amount>0 order by consumertr0_.expiry_time asc limit 20000;

            
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..73592.06 rows=20000 width=24) (actual time=0.048..18.307 rows=20000 loops=1)
   Output: ref_id, expiry_time
   Buffers: shared hit=11140
   ->  Index Scan using consumer_transactions_expiry_time_remaining_amount2 on public.consumer_transactions consumertr0_  (cost=0.57..22273478.26 rows=6053275 width=24) (actual time=0.047..16.119 rows=20000 loops=1)
         Output: ref_id, expiry_time
         Index Cond: ((consumertr0_.expiry_time > '2019-07-01 00:00:00'::timestamp without time zone) AND (consumertr0_.expiry_time < now()))
         Buffers: shared hit=11140
 Planning Time: 1.160 ms
 Execution Time: 19.600 ms
(9 rows)

(END)

旧的解释计划也直接从实际时间 191716.211 开始，而新的解释计划从 0.047 开始。我不明白 191716.211 之前实际花费的时间在哪里。

仅供参考： 索引膨胀详情：

 current_database | schemaname |        tblname        |                       idxname                       |  real_size  | extra_size  |    extra_ratio    | fillfactor | bloat_size  |    bloat_ratio    
| is_na 
------------------+------------+-----------------------+-----------------------------------------------------+-------------+-------------+-------------------+------------+-------------+-------------------
+-------
| f
 proddb            | public     | consumer_transactions | consumer_transactions_expiry_time_remaining_amount  |  4748820480 |  3698360320 |  77.8795563145819 |         90 |  3583516672 |  75.4611947765185 | f
 proddb            | public     | consumer_transactions | consumer_transactions_expiry_time_remaining_amount2 |  1755013120 |   704552960 |  40.1451676896866 |         90 |   589709312 |  33.6014190024973 | f

【问题讨论】：

@Laurenz Albe：请在这里提供帮助。同时我会尝试再次创建相同的索引，以防索引损坏。我知道提取的记录相当高，但记录的数量相同。之前也获取过，我保留了相同的参数更改。看起来很奇怪，好像表或者索引都臃肿了。 REINDEX 或 VACUUM (FULL) 有帮助吗？您查询的是热备吗？您是否有任何长期运行的交易处于未处理状态？ @jjanes ：我在主（主主）服务器上运行它。没有长时间运行的事务处于打开状态。 【参考方案1】：

旧索引相当臃肿：扫描它必须查看 11048079 个 8kB 块（并从磁盘读取其中 1566736 个）才能找到匹配的行，而新索引只需要查看 11140 个块。

我不确定索引是如何进入这种状态的。

第二个索引列似乎没什么用。

这个查询的完美索引应该是：

CREATE INDEX ON public.consumer_transactions (expiry_time) INCLUDE (ref_id)
WHERE remaining_amount > 0;

如果您VACUUM 表，您将获得快速的仅索引扫描。

【讨论】：

旧索引显示 I/O Timings: read=97.486，即 97 ms，并不多。所以从磁盘中获取应该不是问题。共享命中块也是从磁盘读取的块的 6 倍（共享命中=9481343 读取=1566736），因此共享缓冲区也需要大约 100 毫秒...旧索引扫描的总时间 = 97 毫秒 + 100 毫秒 = 197 毫秒，但这不是反映在解释计划中。为什么旧的解释计划显示实际时间=191716.211..192075.241。还有为什么节点实际开始时间是 191716.211？ 191716.211之前的ms去哪儿了？还有什么让您认为索引或表臃肿？还有什么方法可以检查在总共 100 个索引扫描中... 70% 的索引扫描是从 shared_buffers 发生的，其余的索引扫描是从磁盘发生的....只是要知道索引不适合 RAM 和我们需要对这些做点什么。非常感谢@laurenz Albe 如果你必须趟过 1100 万个索引块才能找到 20000 行，那么这个索引就有点问题了。谢谢！请您帮助我了解旧索引计划的索引扫描节点的实际时间指标。索引扫描花了将近 200 秒才找到第一个结果。

以上是关于升级到版本 11 后 Postgres 查询产生高磁盘 IO的主要内容，如果未能解决你的问题，请参考以下文章