PostgreSQL-缓存利器

Posted 2021-08-31 PostgreSQLChina

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了PostgreSQL-缓存利器相关的知识，希望对你有一定的参考价值。

作者：徐田原

引言

当发起“select * from XXX”时，数据会加载到操作系统缓存然后才到shared buffer。PostgreSQL缓存读顺序share_buffers -> 操作系统缓存 -> 硬盘。同样当将脏页向磁盘刷写时，也是先到操作系统缓存，然后由操作系统调用fsync()将操作系统缓存中数据持久化到磁盘。这样PG实际上由两份数据，看起来有些浪费空间，但是操作系统缓存是一个简单的LRU而不是数据库优化的clock sweep algorithm。一旦在shared_buffers中命中，那么读就不会下沉到操作系统缓存。如果shared buffer和操作系统缓存有相同页，操作系统缓存中的页很快会被驱逐替换。

什么参数能影响操作系统的fsync将脏页刷回磁盘吗？

当然，通过postgresql.conf中参数bgwriter_flush_after，该参数整型，默认512KB。当后台写进程写了这么多数据时，会强制OS发起sync将cache中数据刷到底层存储。这样会限制内核页缓存中的脏数据数量，从而减小checkpoint时间或者后台大批量写回数据的时间。

不仅仅是bgwriter，即使checkpoint进程和用户进程也从shared buffer刷写脏页到OS cache。可以通过checkpoint_flush_after影响checkpoint进程的fsync，通过backend_flush_after影响后台进程的fsync。

缓存

pg_buffercache
pg_prewarm
pgfincore
pg_dropcache

1.pg_buffercache简介

pg_buffercache模块提供了一种实时检测共享缓冲区的方法。

这个模块提供了一个C函数：pg_buffercache_pages，它返回一个记录的集合和一个视图：pg_buffercache，它包装了这个函数来更方便的使用。

共享缓存中的每个缓冲区都有一行。未使用的缓冲区除了bufferid以外的所有列为null。共享系统目录被显示为属于数据库零。

因为缓存被所有的数据库共享，通常会有关系中的一些页面不属于当前的数据库。这意味着对许多行来说，在pg_class中可能不会有匹配的连接行，或者甚至可能会有错误连接。如果你试图连接pg_class，一个好的办法就是将连接限制为reldatabase等于当前数据库的OID或者为零的行。

当访问pg_buffercache视图时，内部缓冲区管理器会锁住足够长的时间来拷贝所有这个视图会展示的缓冲区状态数据。这确保了这个视图产生一个一致的结果集，同时不会不必要的长时间阻碍正常的缓冲区活动。虽然如此，但是如果这个视图被频繁读取的话，会对数据库性能产生一些影响。

1.1 pg_buffercache的部署

cd /home/postgres/postgresql-13.3/contrib/pg_buffercache
gmake && gmake install
psql -c 'create extension pg_buffercache;'

1.2 pg_buffercache的使用

pg_buffercache 主要是用来查看pgsql数据库 shared_pool 的使用情况

select name,setting,unit,current_setting(name) from pg_settings where 1 = 1 and name = 'shared_buffers';
  name  | setting | unit | current_setting 
----------------+---------+------+-----------------
 shared_buffers | 16384 | 8kB | 128MB
(1 row)

查看使用情况

select
 d.datname,
 c.relname,
 c.relkind,
 count(*) as buffers
from
 pg_class c
inner join pg_buffercache b on
 b.relfilenode = c.relfilenode
inner join pg_database d on
 (b.reldatabase = d.oid
  and d.datname = current_database())
where
 1 = 1
group by
 d.datname,
 c.relname,
 c.relkind
order by
 d.datname,
 4 desc ;
 datname |         relname         | relkind | buffers 
----------+-----------------------------------------------+---------+---------
 postgres | tmp_t0                   | r   | 16150
 postgres | pg_operator                 | r   |  14
 postgres | pg_depend_reference_index          | i   |  11
 postgres | pg_statistic                | r   |  10
 postgres | pg_depend                  | r   |   8
 postgres | pg_rewrite                 | r   |   7
 postgres | pg_toast_2619                | t   |   7
 postgres | pg_init_privs                | r   |   5
 postgres | pg_operator_oprname_l_r_n_index       | i   |   5
 postgres | pg_amop                   | r   |   5
 postgres | pg_depend_depender_index          | i   |   5
 postgres | pg_index                  | r   |   4
 postgres | pg_amop_fam_strat_index           | i   |   3
 postgres | pg_operator_oid_index            | i   |   3
 postgres | pg_amproc_fam_proc_index          | i   |   3
 postgres | pg_amop_opr_fam_index            | i   |   3
 postgres | pg_namespace_oid_index           | i   |   2
 postgres | pg_aggregate_fnoid_index          | i   |   2
 postgres | pg_cast                   | r   |   2
 postgres | pg_cast_source_target_index         | i   |   2
 postgres | pg_constraint_conrelid_contypid_conname_index | i   |   2
 postgres | pg_description_o_c_o_index         | i   |   2
 postgres | pg_index_indexrelid_index          | i   |   2
 postgres | pg_index_indrelid_index           | i   |   2
 postgres | pg_namespace                | r   |   2
 postgres | pg_namespace_nspname_index         | i   |   2
 postgres | pg_opclass                 | r   |   2
 postgres | pg_opclass_am_name_nsp_index        | i   |   2
 postgres | pg_opclass_oid_index            | i   |   2
 postgres | pg_rewrite_oid_index            | i   |   2
 postgres | pg_rewrite_rel_rulename_index        | i   |   2
 postgres | pg_statistic_relid_att_inh_index      | i   |   2
 postgres | pg_toast_2619_index             | i   |   2
 postgres | pg_extension                | r   |   1
 postgres | pg_am                    | r   |   1
 postgres | pg_aggregate                | r   |   1
 postgres | pg_amproc                  | r   |   1
 postgres | pg_description               | r   |   1
 postgres | pg_statistic_ext_relid_index        | i   |   1
 postgres | pg_init_privs_o_c_o_index          | i   |   1
 postgres | pg_extension_oid_index           | i   |   1
 postgres | pg_extension_name_index           | i   |   1
(42 rows)

2.pg_prewarm简介

pg_prewarm 它可以用于在系统重启时，手动加载经常访问的表到操作系统的cache或PG的shared buffer，从而减少检查系统重启对应用的影响。

2.1 pg_prewarm 函数

CREATE FUNCTION pg_prewarm(regclass,
   mode text default 'buffer',
   fork text default 'main',
   first_block int8 default null,
   last_block int8 default null)
RETURNS int8
AS 'MODULE_PATHNAME', 'pg_prewarm'
LANGUAGE C

备注: regclass 参数为数据库对像,通常情况为表名; modex 参数指加载模式,可选项有 ‘prefetch’, ‘read’,’buffer’, 默认为 ‘buffer’ 具体稍后介绍; fork 表示对像模式,可选项有 ‘main’, ‘fsm’, ‘vm’, 默认为 ‘main’, first_block 表示开始prewarm的数据块,last_block 表示最后 prewarm 的数据块.

pg_prewarm 加载模式

mode 参数指加载模式，可选项有 ‘prefetch’, ‘read’,’buffer’, 默认为 ‘buffer’.
prefetch: 异步地将数据预加载到操作系统缓存;
read: 最终结果和 prefetch 一样,但它是同步方式,支持所有平台.
buffer: 将数据预加载到数据库缓存

2.2 pg_prewarm的部署

psql -c 'create EXTENSION pg_prewarm;'

2.3 pg_prewarm的演示

先将PG的shared buffer设为128M，OS总的memory有8G。然后创建下面的大小近1G的表test：

postgres=# create table tmp_t0 ( id int8,name varchar(100),memo1 varchar(200),memo2 varchar(200));
CREATE TABLE
postgres=# insert into tmp_t0 select id,md5(id::varchar),md5(md5(id::varchar)),md5(md5(md5(id::varchar))) from generate_series(1,10000000) as id;
INSERT 0 10000000

在每次都清掉操作系统cache和PG的shared buffer的情况下，分别测试下面几种场景：

不进行pg_prewarm的情况：
postgres=# \\d tmp_t0
          Table 'public.tmp_t0'
 Column |    Type    | Collation | Nullable | Default 
--------+------------------------+-----------+----------+---------
 id  | bigint        |     |    | 
 name | character varying(100) |     |    | 
 memo1 | character varying(200) |     |    | 
 memo2 | character varying(200) |     |    | 

postgres=# SELECT pg_size_pretty(pg_total_relation_size('tmp_t0'));
 pg_size_pretty 
----------------
 1347 MB
(1 row)

postgres=# explain analyze select count(*) from tmp_t0;
                                QUERY PLAN                                 
-----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=1291.463..1292.254 rows=1 loops=1)
 -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=1291.353..1292.247 rows=3 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=1286.237..1286.238 rows=1 loops=3)
       -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=6.579..1137.312 rows=3333333 loops=3)
 Planning Time: 0.214 ms
 Execution Time: 1292.363 ms
(8 rows)
可以看到，近1G的表，全表扫描一遍，耗时1.2s
postgres=# select pg_prewarm('tmp_t0', 'read', 'main');
 pg_prewarm 
------------
  172414
(1 row)

postgres=# explain analyze select count(*) from tmp_t0;
                                QUERY PLAN                                 
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=437.033..437.697 rows=1 loops=1)
 -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=436.866..437.689 rows=3 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=434.013..434.014 rows=1 loops=3)
       -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=0.039..305.793 rows=3333333 loops=3)
 Planning Time: 0.083 ms
 Execution Time: 437.726 ms
(8 rows)
时间降至4秒多！这时反复执行全表扫描，时间稳定在0.4秒多。

尝试buffer模式：
postgres=# select pg_prewarm('tmp_t0', 'buffer', 'main');
 pg_prewarm 
------------
  172414
(1 row)

postgres=# explain analyze select count(*) from tmp_t0;
                                QUERY PLAN                                 
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=480.060..482.189 rows=1 loops=1)
 -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=479.862..482.180 rows=3 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=475.604..475.605 rows=1 loops=3)
       -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=0.035..336.157 rows=3333333 loops=3)
 Planning Time: 0.068 ms
 Execution Time: 482.220 ms
(8 rows)
比read模式时间略少，但相差不大。可见，如果操作系统的cache够大，数据取到OS cache还是shared buffer对执行时间影响不大（在不考虑其他应用影响PG的情况下）

尝试prefetch模式，即异步预取。这里，有意在pg_prewarm返回后，立即执行全表查询。这样在执行全表查询时，可能之前的预取还没完成，从而使全表查询和预取并发进行，缩短了总的响应时间：
postgres=# select pg_prewarm('tmp_t0', 'prefetch', 'main');
 pg_prewarm 
------------
  172414
(1 row)

postgres=# explain analyze select count(*) from tmp_t0;
                                QUERY PLAN                                 
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate (cost=225497.55..225497.56 rows=1 width=8) (actual time=434.313..436.782 rows=1 loops=1)
 -> Gather (cost=225497.33..225497.54 rows=2 width=8) (actual time=434.155..436.774 rows=3 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    -> Partial Aggregate (cost=224497.33..224497.34 rows=1 width=8) (actual time=431.066..431.067 rows=1 loops=3)
       -> Parallel Seq Scan on tmp_t0 (cost=0.00..214080.67 rows=4166667 width=0) (actual time=0.034..307.687 rows=3333333 loops=3)
 Planning Time: 0.061 ms
 Execution Time: 436.816 ms
(8 rows)
可以看到，总的完成时间是0.4秒多，使用pg_prewarm做预取大大缩短了总时间。因此在进行全表扫描前，做一次异步的prewarm，不失为一种优化全表查询的方法。
问题点：
执行1次select * from 不就可以将表的数据读入shared buffer和OS cache而实现预热了吗？岂不是比做这样一个插件更简单？实际上，对于较大的表（大小超过shared buff的1/4），进行全表扫描时，PG认为没必要为这种操作使用所有shared buffer，只会让其使用很

少的一部分buffer，一般只有几百K，所以，预热大表是不能用一个查询直接实现的
详情可以阅读：https://github.com/postgres/postgres/tree/17792bfc5b62f42a9dfbd2ac408e7e71c239330a/src/backend/storage/buffer

3.pgfincore简介

这是一种将数据库表，索引CACHE到OS层面的缓存里，只要内存足够大，可以将需要的数据都CACHE到OS 内存中，这极到的提高了应用的处理速度。

3.1 pgfincore函数

pgsysconf –查看操作系统CACHE情况
pgsysconf_pretty –查看操作系统CACHE情况
pgfincore –查看对象（表，索引）CACHE情况
pgfadvise_willneed –将数据库对象（表，索引）载入OS CACHE
pgfadvise_dontneed –将数据库对象（表，索引）刷出OS CACHE
pgfadvise_normal –将数据库对象（表，索引）修改为普通内存方式
pgfadvise_loader -将数据库对象（表，索引）自定义载入OS CACHE

函数返回列

relpath : the relation path
block_size : the size of one block disk?
block_disk : the total number of file system blocks of the relation
block_mem : the total number of file system blocks of the relation in buffer cache. (not the shared buffers from PostgreSQL but the OS cache)
group_mem : the number of groups of adjacent block_mem

3.2 pgfincore部署

make clean
make
su
make install

For PostgreSQL >= 9.1, log in your database and:

postgres=# CREATE EXTENSION pgfincore;

For other release, create the functions from the sql script (it should be in your contrib directory):

psql mydb -f pgfincore.sql

3.3 pgfincore的演示

pgfincore
查询表CACHE情况
提供对象在操作系统缓存中的信息的。

它分为三个函数，参数分别为（relname, fork, getdatabit），（relname, getdatabit），（relname），三个参数意思为对象名，进程名（这个地方默认是main），是否要显示databit（很长，注意显示），第一个函数需要全部输入，第二个函数默认fork为main，第三个函数默认fork为main，getdatabit为false。

它输出的是文件位置及名称（relpath），文件顺序（segment），OS page或block大小（os_page_size），对象占用系统缓存需要的页面个数（rel_os_pages），对象已经占用缓存页面个数（pages_mem），在缓存中连续的页面组的个数（group_mem），OS剩余的page数（os_pages_free），加载信息的位图（databit）。

postgres=# select * from pgfincore ('tmp_t0'); 
  relpath   | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty 
--------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
 base/13580/16426 |   0 |    4096 |   262144 |    0 |    0 |   1899395 |    |     0 |     0
 base/13580/16426.1 |   1 |    4096 |   82684 |    0 |    0 |   1899395 |    |     0 |     0
(2 rows)
postgres=# select * from pgfincore('tmp_t0', false);
  relpath   | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty 
--------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
 base/13580/16426 |   0 |    4096 |   262144 | 262144 |    1 |   1553850 |    |     0 |     0
 base/13580/16426.1 |   1 |    4096 |   82684 |  82684 |    1 |   1553850 |    |     0 |     0
(2 rows)

postgres=# select * from pgfincore ('tmp_t0', 'main', false);
  relpath   | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty 
--------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
 base/13580/16426 |   0 |    4096 |   262144 | 262144 |    1 |   1553819 |    |     0 |     0
 base/13580/16426.1 |   1 |    4096 |   82684 |  82684 |    1 |   1553819 |    |     0 |     0
(2 rows)

pgsysconf与pgsysconf_pretty

查询当前操作系统的块大小，剩余多少可用的CACHE（块）。
输出OS block的大小（os_page_size），OS中剩余的page数(os_pages_free)和OS拥有的page总数(os_total_pages)。
postgres=# select * from pgsysconf(); 
 os_page_size | os_pages_free | os_total_pages 
--------------+---------------+----------------
    4096 |   1899465 |   1997514
(1 row)
postgres=# select * from pgsysconf_pretty();
 os_page_size | os_pages_free | os_total_pages 
--------------+---------------+----------------
 4096 bytes | 6070 MB   | 7803 MB
(1 row)

pgfadvise_willneed

将数据库对象（表，索引）载入OS CACHE
输出文件名（relpath），OS block大小（os_page_size），对象占用系统page数（rel_os_pages）,OS剩余的page数（os_pages_free）
postgres=# select * from pgfadvise_willneed('tmp_t0'); 
  relpath   | os_page_size | rel_os_pages | os_pages_free 
--------------------+--------------+--------------+---------------
 base/13580/16426 |    4096 |   262144 |   1637008
 base/13580/16426.1 |    4096 |   82684 |   1554178
(2 rows)

postgres=# select * from pgfincore ('tmp_t0'); 
  relpath   | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty 
--------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
 base/13580/16426 |   0 |    4096 |   262144 | 262144 |    1 |   1553945 |    |     0 |     0
 base/13580/16426.1 |   1 |    4096 |   82684 |  82684 |    1 |   1553945 |    |     0 |     0
(2 rows)

pgfadvise_dontneed

将数据库对象（表，索引）刷出OS CACHE
对当前对象设置dontneed标记。dontneed标记的意思就是当操作系统需要释放内存时优先释放标记为dontneed的pages。
postgres=# select * from pgfadvise_dontneed('tmp_t0');
  relpath   | os_page_size | rel_os_pages | os_pages_free 
--------------------+--------------+--------------+---------------
 base/13580/16426 |    4096 |   262144 |   1816091
 base/13580/16426.1 |    4096 |   82684 |   1898771
(2 rows)

postgres=# select * from pgfincore ('tmp_t0'); 
  relpath   | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty 
--------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
 base/13580/16426 |   0 |    4096 |   262144 |    0 |    0 |   1899411 |    |     0 |     0
 base/13580/16426.1 |   1 |    4096 |   82684 |    0 |    0 |   1899411 |    |     0 |     0

pgfadvise_normal

使用普通内存方式,再pgfadv_dontneed方式刷出对象后，建议执行这个函数，将内存方式改为普通方式
这里的pgfadvise主要调用了Linux下的函数posix_fadvise，标记值也是posix_fadvise所需要的。 
postgres=# select * from pgfadvise_normal('tmp_t0');
  relpath   | os_page_size | rel_os_pages | os_pages_free 
--------------------+--------------+--------------+---------------
 base/13580/16426 |    4096 |   262144 |   1899337
 base/13580/16426.1 |    4096 |   82684 |   1899337
(2 rows)

postgres=# select * from pgfincore ('tmp_t0'); 
  relpath   | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty 
--------------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
 base/13580/16426 |   0 |    4096 |   以上是关于PostgreSQL-缓存利器的主要内容，如果未能解决你的问题，请参考以下文章