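Query Enhancement Extensions
PostgreSQL uses a two-level cache: its own shared buffers sit on top of the operating system page cache. When a read hits either cache, the data is returned without touching disk, which improves query performance.
Two common cache-management extensions for PostgreSQL are pgfincore and pg_prewarm. Both are simple to install and deploy.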
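- pg_prewarm: proactively loads the pages of a table or index into shared buffers or the OS page cache after the database starts, avoiding a burst of disk I/O on a cold start. It supports three modes: prefetch (asynchronous prefetch requests to the OS), read (synchronous reads into the OS page cache), and buffer (reads directly into shared buffers). The typical use is to "warm up" after a restart so that hot data gets back into cache quickly.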
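- pgfincore: focuses on observing and controlling cache state. It reports how much of an object (table or index) currently resides in the OS page cache, which helps a DBA decide what is already in memory and what needs preloading. It can also force pages to be loaded or dropped, so it behaves more like an operations tool centered on visibility into, and manipulation of, the Linux kernel page cache.
Summary: pg_prewarm emphasizes proactive warm-up so that future accesses are faster; pgfincore emphasizes inspecting cache state and fine-grained control. The former is a "warmer", the latter a "probe plus remote control".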
pgfincore
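pgfincore is a low-level extension built directly on the Linux kernel's mincore and posix_fadvise system calls. Its main purpose is to show whether a file's blocks are in the OS page cache and to change the cache state of those blocks. At the time of writing, the project documentation lists support up to PostgreSQL 16.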
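Features
1. Check how much of a table or index file is in the OS cache
2. Preload files into the Linux page cache
3. Drop data from the OS page cache to simulate a cold cache
Limitations
- pgfincore requires POSIX_FADVISE support
-- Use the following command to check that posix_fadvise is available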
man 2 posix_fadvise
- PostgreSQL >= 8.3
- Not usable on Windows
Usage example:
wget https://github.com/klando/pgfincore/archive/refs/tags/1.3.1.tar.gz
tar -zxvf 1.3.1.tar.gz
cd pgfincore-1.3.1/
make clean
make
make install
psql -c "CREATE EXTENSION pgfincore; "
Creating the extension installs the following 14 functions:
pgfadvise(regclass, text, integer, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_dontneed(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_loader(regclass, integer, boolean, boolean, bit varying, OUT relpath text, OUT os_page_size bigint, OUT os_pages_free bigint, OUT pages_loaded bigint, OUT pages_unloaded bigint )
pgfadvise_loader(regclass, text, integer, boolean, boolean, bit varying, OUT relpath text, OUT os_page_size bigint, OUT os_pages_free bigint, OUT pages_loaded bigint, OUT pages_unloaded bigint )
pgfadvise_normal(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_random(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_sequential(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_willneed(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfincore(regclass, boolean, OUT relpath text, OUT segment integer, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT pages_mem bigint, OUT group_mem bigint, OUT os_pages_free bigint, OUT databit bit varying, OUT pages_dirty bigint, OUT group_dirty bigint)
pgfincore(regclass, OUT relpath text, OUT segment integer, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT pages_mem bigint, OUT group_mem bigint, OUT os_pages_free bigint, OUT databit bit varying, OUT pages_dirty bigint, OUT group_dirty bigint)
pgfincore(regclass, text, boolean, OUT relpath text, OUT segment integer, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT pages_mem bigint, OUT group_mem bigint, OUT os_pages_free bigint, OUT databit bit varying, OUT pages_dirty bigint, OUT group_dirty bigint)
pgfincore_drawer(bit varying, OUT drawer cstring)
pgsysconf(OUT os_page_size bigint, OUT os_pages_free bigint, OUT os_total_pages bigint)
pgsysconf_pretty(OUT os_page_size text, OUT os_pages_free text, OUT os_total_pages text)
pgsysconf reports the OS page size, the number of free pages, and the total number of pages.
-- Values are in pages
pgfincore=# SELECT * FROM pgsysconf();
os_page_size | os_pages_free | os_total_pages
--------------+---------------+----------------
4096 | 1817160 | 2038679
(1 row)
-- os_page_size = 4096 → OS page size = 4 KB
-- os_pages_free = 1817160 → number of free pages
-- os_total_pages = 2038679 → total number of pages
pgfincore=# SELECT * FROM pgsysconf_pretty();
os_page_size | os_pages_free | os_total_pages
--------------+---------------+----------------
4096 bytes | 7098 MB | 7964 MB
(1 row)
/*
os_page_size
shown directly as 4096 bytes
os_pages_free
Calculation: 1817160 × 4096 / 1024^2
= 1817160 × 4 KB / 1 MB
= 1817160 × 4 / 1024 MB
≈ 7098 MB
os_total_pages
Calculation: 2038679 × 4096 / 1024^2
= 2038679 × 4 / 1024 MB
≈ 7964 MB
*/
pgfincore=# \! free -m
total used free shared buff/cache available
Mem: 7963 491 7098 21 373 7216
Swap: 2047 0 2047
/*
pgsysconf_pretty().os_total_pages = 7964 MB ≈ total=7963 in free -m
pgsysconf_pretty().os_pages_free = 7098 MB = free=7098 in free -m
*/
-- As this shows, the reported cache size does not ...
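The page-to-MB conversion used in the comparison above can be reproduced outside the database. This is a minimal sketch, assuming a 4096-byte OS page as reported by pgsysconf(); the helper name pages_to_mb is made up for illustration and is not part of pgfincore.

```python
def pages_to_mb(pages: int, page_size: int = 4096) -> float:
    """Convert a number of OS pages to MiB (the unit shown by `free -m`)."""
    return pages * page_size / 1024 ** 2


# Values copied from the pgsysconf() output above.
free_mb = pages_to_mb(1817160)
total_mb = pages_to_mb(2038679)

print(round(free_mb), round(total_mb))  # 7098 7964
```

The rounded results match the values printed by pgsysconf_pretty(), which suggests the pretty variant applies the same arithmetic before formatting.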
Check whether the data pages of a table or index are resident in the OS page cache.
-- Inspect the cache state of a table
postgres=# SELECT * FROM pgfincore('t_corrupt'::regclass);
-[ RECORD 1 ]-+-------------
relpath | base/5/66366
segment | 0
os_page_size | 4096
rel_os_pages | 49076
pages_mem | 49076
group_mem | 1
os_pages_free | 1773181
databit |
pages_dirty | 0
group_dirty | 0
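/*
relpath (text): path of the relation's file relative to the data directory (for example base/5/66366)
segment (integer): segment number. PostgreSQL splits a relation into multiple segment files, each normally at most 1 GB; base/5/66366.1 is reported as segment 1, and the number increases from there
os_page_size (bigint): size of an OS page in bytes
rel_os_pages (bigint): total number of OS pages in this segment of the relation
pages_mem (bigint): number of pages currently in the OS cache
group_mem (bigint): number of groups of adjacent cached pages (a single contiguous cached range counts as 1)
os_pages_free (bigint): number of free pages currently reported by the OS (for reference)
databit (bit varying): per-page bitmap showing whether each page is cached; 1 means the page is in cache, 0 means it is not
pages_dirty (bigint): number of dirty pages (cached but not yet written back to disk)
group_dirty (bigint): number of groups of adjacent dirty pages
Depending on the version, you can also pass arguments that control whether databit is returned and which file fork (main / fsm / vm) is inspected.
Pass true to return databit:
SELECT * FROM pgfincore('t_corrupt'::regclass,true);
*/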
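To make the relationship between databit, pages_mem and group_mem concrete, here is a small sketch that derives the two counters from a bitmap string. It assumes that group_mem counts runs of adjacent cached pages, which is how the outputs in this article read; the function name summarize_databit and the sample bitmap are made up for illustration.

```python
from itertools import groupby


def summarize_databit(databit: str) -> dict:
    """Derive page counters from a databit string such as '0011100110'."""
    if set(databit) - {"0", "1"}:
        raise ValueError("databit must contain only '0' and '1'")
    # Every '1' is one cached OS page.
    pages_mem = databit.count("1")
    # groupby() yields one entry per run of identical bits; count the runs of '1'.
    group_mem = sum(1 for bit, _ in groupby(databit) if bit == "1")
    return {
        "rel_os_pages": len(databit),
        "pages_mem": pages_mem,
        "group_mem": group_mem,
    }


print(summarize_databit("0011100110"))
# {'rel_os_pages': 10, 'pages_mem': 5, 'group_mem': 2}
```

A fully cached file such as the bitmap 11111111 gives pages_mem = 8 and group_mem = 1 under this model.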
-- How the output looks when a table exceeds 1 GB
CREATE UNLOGGED TABLE t_big (id serial, data text);
-- Insert enough data to push the file past 1 GB
-- Use a large text value to speed this up
INSERT INTO t_big (data)
SELECT repeat('x', 8192) FROM generate_series(1, 300000);
-- Check the table size
postgres=# SELECT pg_size_pretty(pg_relation_size('t_big'));
pg_size_pretty
----------------
3402 MB
(1 row)
postgres=# \! ls -lh $PGDATA/base/5/90956*
-rw------- 1 postgres postgres 1.0G Oct 5 21:32 /home/postgres/pg/data/base/5/90956
-rw------- 1 postgres postgres 1.0G Oct 5 21:35 /home/postgres/pg/data/base/5/90956.1
-rw------- 1 postgres postgres 1.0G Oct 5 21:38 /home/postgres/pg/data/base/5/90956.2
-rw------- 1 postgres postgres 556M Oct 5 21:40 /home/postgres/pg/data/base/5/90956.3
-rw------- 1 postgres postgres 936K Oct 5 21:40 /home/postgres/pg/data/base/5/90956_fsm
-rw------- 1 postgres postgres 0 Oct 5 21:28 /home/postgres/pg/data/base/5/90956_init
-rw------- 1 postgres postgres 8.0K Oct 5 21:29 /home/postgres/pg/data/base/5/90956_vm
postgres=# SELECT * FROM pgfincore('t_big'::regclass);
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 262144 | 262144 | 1 | 33960 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 262144 | 262144 | 1 | 33960 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 262144 | 262144 | 1 | 33960 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 95436 | 95436 | 1 | 33960 | | 0 | 0
(4 rows)
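The per-segment page counts in the output above follow from the segment size: a full segment of 1 GiB holds 1024^3 / 4096 = 262144 OS pages. The sketch below assumes the default 1 GiB segment size (it can be changed at build time) and a 4096-byte OS page; the function name is made up for illustration.

```python
SEGMENT_BYTES = 1024 ** 3  # assumed default PostgreSQL segment size (1 GiB)
OS_PAGE_BYTES = 4096       # assumed OS page size, as reported by pgsysconf()


def os_pages_per_segment(relation_bytes: int,
                         segment_bytes: int = SEGMENT_BYTES,
                         page_bytes: int = OS_PAGE_BYTES) -> list:
    """Return the number of OS pages in each segment file of a relation."""
    pages = []
    remaining = relation_bytes
    while remaining > 0:
        chunk = min(remaining, segment_bytes)
        pages.append(-(-chunk // page_bytes))  # ceiling division
        remaining -= chunk
    return pages


# Three full segments plus a partial one of 95436 pages, as in the output above.
size = 3 * SEGMENT_BYTES + 95436 * OS_PAGE_BYTES
print(os_pages_per_segment(size))  # [262144, 262144, 262144, 95436]
```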
Inspecting a specific fork (for example 'fsm' or 'vm')
postgres=# SELECT * FROM pgfincore('t_big'::regclass, 'fsm', true);
-[ RECORD 1 ]-+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
relpath | base/5/90956_fsm
segment | 0
os_page_size | 4096
rel_os_pages | 244
pages_mem | 242
group_mem | 1
os_pages_free | 798929
databit | 0011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
pages_dirty | 0
group_dirty | 0
postgres=# SELECT * FROM pgfincore('t_big'::regclass, 'vm', true);
-[ RECORD 1 ]-+----------------
relpath | base/5/90956_vm
segment | 0
os_page_size | 4096
rel_os_pages | 8
pages_mem | 8
group_mem | 1
os_pages_free | 798960
databit | 11111111
pages_dirty | 0
group_dirty | 0
Evicting a table from the cache
postgres=# SELECT * FROM pgfincore('t_big');
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 10346 | 10346 | 1 | 1790376 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 0 | 0 | 0 | 1790376 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 0 | 0 | 0 | 1790376 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 0 | 0 | 0 | 1790376 | | 0 | 0
(4 rows)
postgres=# SELECT * FROM pgfadvise_dontneed('t_big');
relpath | os_page_size | rel_os_pages | os_pages_free
----------------+--------------+--------------+---------------
base/5/90956 | 4096 | 10346 | 1800976
base/5/90956.1 | 4096 | 0 | 1800976
base/5/90956.2 | 4096 | 0 | 1800976
base/5/90956.3 | 4096 | 0 | 1800976
(4 rows)
postgres=# SELECT * FROM pgfincore('t_big');
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 10346 | 0 | 0 | 1801033 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 0 | 0 | 0 | 1801033 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 0 | 0 | 0 | 1801033 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 0 | 0 | 0 | 1801033 | | 0 | 0
(4 rows)
postgres=# \! free -h
total used free shared buff/cache available
Mem: 7.8G 510M 6.9G 17M 418M 7.0G
Swap: 2.0G 520K 2.0G
select os_pages_free * (os_page_size/1024::numeric(26,9)/1024::numeric(26,9)) from pgfincore('t_big'); -- the result should match the free memory reported by free, in MB.
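The numbers in the eviction transcript above can be cross-checked with simple arithmetic. This sketch only reuses values printed earlier (os_pages_free before and after pgfadvise_dontneed, and rel_os_pages of segment 0) and assumes a 4096-byte page; the helper name is made up for illustration.

```python
def pages_to_mb(pages: int, page_size: int = 4096) -> float:
    """Convert a number of OS pages to MiB."""
    return pages * page_size / (1024 * 1024)


free_before = 1790376  # os_pages_free before pgfadvise_dontneed
free_after = 1800976   # os_pages_free returned by pgfadvise_dontneed
evicted = 10346        # rel_os_pages of segment 0

freed = free_after - free_before
print(freed, round(pages_to_mb(freed), 1), round(pages_to_mb(evicted), 1))
# 10600 41.4 40.4

# os_pages_free from the last pgfincore call, expressed in GiB
print(round(pages_to_mb(1801033) / 1024, 2))  # 6.87
```

The free-page count grows by roughly the size of the evicted segment, and 6.87 GiB agrees with the 6.9G shown by free -h. The small difference is expected, since other activity on the host also changes the free-page count between calls.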
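Loading a table into the cache
While I was inserting data into t_big, I cancelled the insert before it finished, so the data in the extra segments .1, .2 and .3 consists entirely of dead tuples: it is not visible and cannot be cached (pgfincore reports 0 pages for those segments), and I did not run VACUUM. In the SQL that follows I therefore filter with WHERE databit IS NOT NULL.
-- Use pgfadvise_willneed to preload the pages of t_big into the OS cache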
postgres=# \! free -h
total used free shared buff/cache available
Mem: 7.8G 509M 6.9G 17M 422M 7.0G
Swap: 2.0G 520K 2.0G
postgres=# SELECT * FROM pgfincore('t_big');
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 10346 | 0 | 0 | 1800031 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 0 | 0 | 0 | 1800031 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 0 | 0 | 0 | 1800031 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 0 | 0 | 0 | 1800031 | | 0 | 0
(4 rows)
postgres=# SELECT * FROM pgfadvise_willneed('t_big');
relpath | os_page_size | rel_os_pages | os_pages_free
----------------+--------------+--------------+---------------
base/5/90956 | 4096 | 10346 | 1790049
base/5/90956.1 | 4096 | 0 | 1790049
base/5/90956.2 | 4096 | 0 | 1790049
base/5/90956.3 | 4096 | 0 | 1790049
(4 rows)
postgres=# SELECT * FROM pgfincore('t_big');
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 10346 | 10346 | 1 | 1790135 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 0 | 0 | 0 | 1790135 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 0 | 0 | 0 | 1790135 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 0 | 0 | 0 | 1790135 | | 0 | 0
(4 rows)
#### Fine-grained load/unload control with pgfadvise_loader
SELECT * FROM pgfincore('t_big');
SELECT * FROM pgfadvise_loader('t_big', 0, false, true, B'1010100');
SELECT * FROM pgfincore('t_big');
SELECT * FROM pgfadvise_loader('t_big', 0, true, false, B'1010100');
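regclass: the target relation
integer: the segment number
boolean: whether to load; if true, the 1 bits of the last argument are acted on
boolean: whether to unload; if true, the 0 bits of the last argument are acted on
bit varying: a bit vector (databit) that selects the target pages. 1 means the page should be cached and 0 means it should not; bits map in order to the pages of the segment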
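The interaction of the two boolean flags with the bit vector is easier to see in a small model. The sketch below is not the extension's implementation; it only encodes the rules described above (load acts on 1 bits, unload acts on 0 bits) on plain strings and assumes the mask has one bit per page. In practice kernel readahead may bring in more pages than requested. The function name apply_loader is made up for illustration.

```python
def apply_loader(cached: str, mask: str, load: bool, unload: bool) -> str:
    """Return the expected residency bitmap after a pgfadvise_loader-style call."""
    if len(cached) != len(mask):
        raise ValueError("cached and mask must have the same length")
    result = []
    for state, bit in zip(cached, mask):
        if load and bit == "1":
            result.append("1")    # page requested to be loaded
        elif unload and bit == "0":
            result.append("0")    # page requested to be dropped
        else:
            result.append(state)  # page left as it was
    return "".join(result)


# Unload only: pages under a 0 bit are dropped, the rest stay cached.
print(apply_loader("1111111", "1010100", load=False, unload=True))  # 1010100
# Load only: pages under a 1 bit are loaded, the rest stay uncached.
print(apply_loader("0000000", "1010100", load=True, unload=False))  # 1010100
```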
-- When the table exceeds 1 GB
-- Unload the pages that are currently loaded
SELECT pgfadvise_loader('t_big', seg.segment, false, true, ~seg.databit)
FROM pgfincore('t_big', true) as seg
WHERE databit IS NOT NULL;
-- Check the result of the unload
select * FROM pgfincore('t_big', true);
~ is PostgreSQL's bitwise NOT operator for bit / bit varying.
-- Load the pages that were unloaded
SELECT pgfadvise_loader('t_big', seg.segment, true, false, ~seg.databit)
FROM pgfincore('t_big', true) as seg
WHERE databit IS NOT NULL;
-- Check the result of the load
select * FROM pgfincore('t_big', true);
-- Load all data
SELECT pgfadvise_loader('t_big', seg.segment,true, false, repeat('1', seg.rel_os_pages::integer)::bit varying)
FROM pgfincore('t_big', true) as seg;
-- Alternatively, ask the OS to read the whole relation ahead of time
select pgfadvise_willneed('t_big');
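The two masks used in the queries above, the inverted databit and the all-ones vector, can be mimicked on plain strings to see which pages each one targets. This is a sketch with made-up helper names and a made-up sample bitmap, not code from pgfincore.

```python
def invert(databit: str) -> str:
    """String equivalent of PostgreSQL's ~ operator on bit varying."""
    return databit.translate(str.maketrans("01", "10"))


def all_ones(n_pages: int) -> str:
    """String equivalent of repeat('1', n)::bit varying."""
    return "1" * n_pages


databit = "0011100110"
mask = invert(databit)
print(mask)  # 1100011001

# With unload = true the 0 bits of the mask are acted on,
# which are exactly the pages that were cached in databit.
unload_targets = [i for i, b in enumerate(mask) if b == "0"]
cached_pages = [i for i, b in enumerate(databit) if b == "1"]
print(unload_targets == cached_pages)  # True

print(all_ones(8))  # 11111111
```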
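Warm-up with I/O access-pattern hints
SELECT * FROM pgfadvise_random('pg_class'::regclass);
Calling pgfadvise_sequential() or pgfadvise_random() ahead of time hints the OS to use sequential readahead or to treat access as random. Which one helps depends on the characteristics of the underlying storage, so choose accordingly.
-- Create an index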
create index idx_t_big on t_big(id);
-- Unload all loaded pages
SELECT pgfadvise_loader('t_big', seg.segment, false, true, ~seg.databit)
FROM pgfincore('t_big', true) as seg
WHERE databit IS NOT NULL;
-- With a full sequential scan of the table
postgres=# explain (analyze,buffers)
postgres-# select * from t_big;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on t_big (cost=0.00..8173.00 rows=300000 width=108) (actual time=0.341..63.657 rows=300000 loops=1)
Buffers: shared read=5173
Planning:
Buffers: shared hit=16 read=4 dirtied=1
Planning Time: 0.177 ms
Execution Time: 70.076 ms
(6 rows)
select pgfadvise_sequential('t_big');
postgres=# explain (analyze,buffers)
select * from t_big;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on t_big (cost=0.00..8173.00 rows=300000 width=108) (actual time=0.024..23.663 rows=300000 loops=1)
Buffers: shared read=5173
Planning Time: 0.091 ms
Execution Time: 29.922 ms
(4 rows)
-- Compare again with an index scan. In theory, caching the index should also help, and my own tests confirmed this, so I will not go into detail.
select pgfadvise_sequential('idx_t_big');
explain (analyze,buffers)
select * from t_big where id <103;
pgfadvise_random is used the same way as pgfadvise_sequential, so I will not repeat the steps. Calling pgfadvise_normal afterwards resets the access pattern to normal.
The pgfincore_drawer function
-- pgfincore_drawer
SELECT pgfincore_drawer(databit) AS drawer FROM pgfincore('t_big',true) where databit is not null ; -- renders databit as text; I could not find a practical use for it
The pgfadvise function
Like the functions above, pgfadvise calls posix_fadvise.
posix_fadvise is a system call defined by the POSIX standard:
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
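fd: file descriptor, obtained from open.
offset: start of the region the advice applies to, in bytes.
len: length of the region in bytes. If 0, the advice covers everything from offset to the end of the file.
advice: the kind of advice. Common values include:
POSIX_FADV_DONTNEED: the data will not be accessed in the near future, so its cache can be released.
POSIX_FADV_WILLNEED: the data will be accessed soon, so it can be read into cache ahead of time.
POSIX_FADV_SEQUENTIAL: sequential access; the readahead window may be enlarged.
POSIX_FADV_RANDOM: random access; readahead is disabled.
#define PGF_WILLNEED 10 // load data
#define PGF_DONTNEED 20 // unload data
#define PGF_NORMAL 30 // reset to the default readahead behaviour
#define PGF_SEQUENTIAL 40 // sequential readahead hint
#define PGF_RANDOM 50 // random access hint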

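As the constants show, pgfadvise is the general form of pgfadvise_normal, pgfadvise_sequential, pgfadvise_random, pgfadvise_dontneed and pgfadvise_willneed: the integer argument selects which advice to apply.
pgfadvise_normal clears any previously set readahead access pattern.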
SELECT *
FROM pgfadvise('t_big'::regclass, 'main', 10);
relpath | os_page_size | rel_os_pages | os_pages_free
--------------+--------------+--------------+---------------
base/5/90963 | 4096 | 10346 | 1788377
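To see the underlying system call without PostgreSQL, the same advice values can be issued from Python, whose os module exposes posix_fadvise on Linux. This is a sketch under stated assumptions: it runs on a Linux host, it operates on a throwaway temporary file rather than a relation file, and the PGF_ADVICE dictionary and advise_file helper are made up for illustration (only the numeric codes come from the #define lines above).

```python
import os
import tempfile

# Numeric codes as in the #define lines above, mapped to the POSIX constants.
PGF_ADVICE = {
    10: os.POSIX_FADV_WILLNEED,
    20: os.POSIX_FADV_DONTNEED,
    30: os.POSIX_FADV_NORMAL,
    40: os.POSIX_FADV_SEQUENTIAL,
    50: os.POSIX_FADV_RANDOM,
}


def advise_file(path: str, pgf_code: int) -> int:
    """Apply the advice selected by pgf_code to the whole file; return the advice used."""
    advice = PGF_ADVICE[pgf_code]
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 mean "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, advice)
    finally:
        os.close(fd)
    return advice


# Create a small scratch file (two 4096-byte pages) to apply the hints to.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 8192)
    path = tmp.name

try:
    # Drop from cache, read ahead, then set and reset the access pattern.
    results = [advise_file(path, code) for code in (20, 10, 40, 30)]
finally:
    os.remove(path)
```

The call returns nothing on success and raises OSError on failure, so reaching the end of the loop means the kernel accepted each hint. Whether pages were actually loaded or dropped still has to be checked separately, which is the role pgfincore() plays in this article.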