Adapting the pg_basebackup Tool for Operations
Posted PostgreSQLChina
Author: 杨向博
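I. Background
In day-to-day operations we have repeatedly hit the case where, after an HA switchover, the standby is fully rebuilt and the earlier pglog server logs are overwritten. The HA control plane uses pg_basebackup as its full-rebuild tool.
Because pg_basebackup requires the target datadir to be empty when rebuilding, every file under datadir has to be deleted before the rebuild.
Often, though, the old logs from before the rebuild are exactly what is needed to analyze a problem, which leaves us in an awkward spot.
We could of course add logic to the control plane that copies the old logs to a dedicated local disk path, uploads them to object storage, or syncs them into ES. But any of these adds storage cost.
So could pg_basebackup itself be modified? A rebuild could optionally preserve pglog, and the tedious sequence of deleting old files, synchronizing data, and starting the standby could be folded into a "one-click" operation.
"One-click" rebuild is nothing new; I first saw it years ago when working with GaussDB. Recently I took some time to study (and borrow freely from) the openGauss approach, and added a "one-click" rebuild feature to pg_basebackup, with a parameter that chooses whether pglog is preserved.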
II. Modification principle
Existing functionality must not be affected: the new behavior is switched on or off through command-line parameters.
III. Result
First, the help text for the new parameters:
[postgres@NickCentos:pg10.4:5404 ~]$pg_basebackup --help | grep -E '\--no-pglog|\--force-rebuild'
-L, --no-pglog keep the previous pglog when rebuilding the standby database
-f, --force-rebuild Forcibly rebuild the standby database, this operation should be cautious
because it will delete the old data directory and re-synchronize from the primary database
[postgres@NickCentos:pg10.4:5404 ~]$
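-L (long form --no-pglog): when enabled, a forced rebuild keeps the standby's existing pglog and does not copy the primary's pglog over.
-f (long form --force-rebuild): enables the "one-click" forced rebuild.
The two parameters are used in combination.
The final result: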
[postgres@NickCentos:pg10.4:5404 ~]$ pg_basebackup -F p -X stream -D /data/pg10-4debug/standby -h 127.0.0.1 -p 5404 -w --verbose -P --no-pglog --force-rebuild
pg_basebackup: pg_basebackup version is equal to the dataDir version, major version is 10
pg_basebackup: dataDir is running server: (PID: 7450)
pg_basebackup: /data/pg10-4debug/standby is standby dataDir
pg_basebackup: standby is force shutting down (PID: 7450)
pg_basebackup: server stopped
pg_basebackup: start delete standby dataDir
pg_basebackup: delete standby dataDir done
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 1/91000028 on timeline 1
pg_basebackup: starting background WAL receiver
797219/797219 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 1/910000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
waiting for server to start....2021-07-18 19:42:16.700 CST [10171] LOG: listening on IPv4 address '0.0.0.0', port 5444
2021-07-18 19:42:16.701 CST [10171] LOG: listening on IPv6 address '::', port 5444
2021-07-18 19:42:16.707 CST [10171] LOG: listening on Unix socket '/tmp/.s.PGSQL.5444'
2021-07-18 19:42:16.717 CST [10171] LOG: redirecting log output to logging collector process
2021-07-18 19:42:16.717 CST [10171] HINT: Future log output will appear in directory 'pg_log'.
done
server started
[postgres@NickCentos:pg10.4:5404 ~]
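A brief walkthrough of the steps above:
- Version check: verify that the major version of pg_basebackup matches the data version of dataDir (read from the PG_VERSION file; exit on mismatch);
- Check whether dataDir has a running server (check that postmaster.pid exists, read the pid from it, and probe whether that process is alive);
- If a server is running on dataDir, confirm whether it is a standby (for versions 9 to 11, check for recovery.conf; for versions 12 and 13, check for standby.signal; if the file exists, it is a standby);
- Stop the standby (if it is running; since this is a rebuild, it is stopped directly with SIGQUIT, i.e. immediate mode);
- Delete the data files under dataDir (pglog and postgresql.conf, postgresql.auto.conf, postmaster.opts are retained);
- Run the base backup. This is mostly the stock functionality with one small change: when the basebackup command is sent to the primary, a keepflag is sent along with it. The keepflag determines whether walsender sends pglog and the conf files in sendDir;
- Once the base backup completes, start the standby.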
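The version check and the standby-marker check from the list above can be sketched as follows. This is a minimal illustration, not the author's ValidatePgVersion or is_standbyDir (their bodies are not shown in the article); the helper names parse_major_version, read_major_version, and standby_marker_for are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the major version from the contents of a
 * PG_VERSION file ("10", "12", or "9.6" for pre-10 releases).
 * Returns -1 if the text does not start with a number.
 */
static long
parse_major_version(const char *text)
{
    char *end;
    long  major;

    assert(text != NULL);
    major = strtol(text, &end, 10);
    return (end == text) ? -1 : major;
}

/*
 * Hypothetical helper: read <datadir>/PG_VERSION and return the major
 * version, or -1 if the file is missing or unreadable.
 */
static long
read_major_version(const char *datadir)
{
    char  path[1024];
    char  buf[32];
    FILE *fp;
    long  major = -1;

    assert(datadir != NULL);
    snprintf(path, sizeof(path), "%s/PG_VERSION", datadir);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), fp) != NULL)
        major = parse_major_version(buf);
    fclose(fp);
    return major;
}

/*
 * Name of the file whose presence marks a standby: recovery.conf for
 * versions 9 through 11, standby.signal from 12 on. NULL for anything older.
 */
static const char *
standby_marker_for(long major)
{
    if (major >= 12)
        return "standby.signal";
    if (major >= 9)
        return "recovery.conf";
    return NULL;
}
```

A caller would compare read_major_version(dataDir) against the tool's own major version and exit on mismatch, then test for the existence of the marker file inside dataDir.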
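Directory details: the timestamps show that most entries were freshly copied by the rebuild, while the remaining files (pg_log and the conf files) date from well before it.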
[postgres@NickCentos:pg10.4:5404 ~]$ll /data/pg10-4debug/standby
total 136
-rw------- 1 postgres postgres 3 Jul 18 19:42 PG_VERSION
-rw------- 1 postgres postgres 208 Jul 18 19:42 backup_label.old
drwx------ 6 postgres postgres 4096 Jul 18 19:42 base
-rw------- 1 postgres postgres 32 Jul 18 19:42 current_logfiles
drwx------ 2 postgres postgres 4096 Jul 18 19:42 global
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_commit_ts
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_dynshmem
-rw------- 1 postgres postgres 4513 Jul 18 19:42 pg_hba.conf
-rw------- 1 postgres postgres 1636 Jul 18 19:42 pg_ident.conf
drwx------ 2 postgres postgres 4096 Jul 14 13:13 pg_log
drwx------ 4 postgres postgres 4096 Jul 18 19:52 pg_logical
drwx------ 4 postgres postgres 4096 Jul 18 19:42 pg_multixact
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_notify
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_replslot
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_serial
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_snapshots
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_stat
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_stat_tmp
drwx------ 2 postgres postgres 4096 Jul 18 19:47 pg_subtrans
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_tblspc
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_twophase
drwx------ 3 postgres postgres 4096 Jul 18 19:52 pg_wal
drwx------ 2 postgres postgres 4096 Jul 18 19:42 pg_xact
-rw------- 1 postgres postgres 122 Jul 13 15:48 postgresql.auto.conf
-rw------- 1 postgres postgres 22740 Jul 14 19:31 postgresql.conf
-rw------- 1 postgres postgres 87 Jul 18 19:42 postmaster.opts
-rw------- 1 postgres postgres 84 Jul 18 19:42 postmaster.pid
-rw-rw-r-- 1 postgres postgres 174 Jul 16 23:14 recovery.conf
[postgres@NickCentos:pg10.4:5404 ~]$
IV. Key code
1. Client side: pg_basebackup
New function declarations:
The bodies of these custom functions are fairly long, so only the declarations are shown here.
/* Modify by nickxyang at 2021-07-11 pm */
static long ValidatePgVersion(const char *path);
static pgpid_t get_pgpid(bool is_status_request);
static bool pgmaster_is_alive(pid_t pid);
static bool is_standbyDir(const char* dirname,char* filename);
static bool is_standby_running(void);
static void stop_standby(void);
static void delete_datadir(const char* dirname);
static pgpid_t start_postmaster(void);
static char *find_other_exec_or_die(const char *argv0, const char *target, const char *versionstr);
static WaitPMResult wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint);
static char **readfile(const char *path, int *numlines);
static void free_readfile(char **optlines);
static void start_standby(void);
/* End at 2021-07-11 pm */
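Since the bodies are omitted, here is a rough sketch of how the "is a server running on this dataDir" check could work: read the pid from the first line of postmaster.pid, then probe it with signal 0. The names read_postmaster_pid and process_is_alive are hypothetical stand-ins, not the author's get_pgpid or pgmaster_is_alive.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the PID stored on the first line of
 * <datadir>/postmaster.pid, or 0 if the file is missing or malformed.
 */
static long
read_postmaster_pid(const char *datadir)
{
    char  path[1024];
    FILE *fp;
    long  pid = 0;

    assert(datadir != NULL);
    snprintf(path, sizeof(path), "%s/postmaster.pid", datadir);
    fp = fopen(path, "r");
    if (fp == NULL)
        return 0;
    if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0)
        pid = 0;
    fclose(fp);
    return pid;
}

/*
 * Hypothetical helper: probe a PID with signal 0, which performs the
 * existence and permission checks without delivering any signal.
 * EPERM means the process exists but belongs to another user.
 */
static int
process_is_alive(long pid)
{
    if (pid <= 0)
        return 0;
    if (kill((pid_t) pid, 0) == 0)
        return 1;
    return errno == EPERM;
}
```

A stale postmaster.pid left behind by a crash is the reason the file's existence alone is not enough and the process has to be probed as well.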
Changes in static void BaseBackup(void):
/*
 * Modify by nickxyang at 2021-07-11
 * If the --no-pglog or --force-rebuild is enabled,
 * when we send the basebackup command to the primary,
 * a keepflag will be sent at the same time.
 * This flag can be 'keeppglog' (keep pglog), 'keeplogaf' (keep pglog and conf), 'keeppgcnf' (keep conf)
 */
if (keeppglog || forcerebuild)
{
    if (keeppglog && forcerebuild)
        strcpy(keepflag, ",keeplogaf");
    else if (keeppglog && !forcerebuild)
        strcpy(keepflag, ",keeppglog");
    else if (!keeppglog && forcerebuild)
        strcpy(keepflag, ",keeppgcnf");
    basebkp =
        psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s %s",
                 escaped_label,
                 showprogress ? "PROGRESS" : "",
                 includewal == FETCH_WAL ? "WAL" : "",
                 fastcheckpoint ? "FAST" : "",
                 includewal == NO_WAL ? "" : "NOWAIT",
                 maxrate_clause ? maxrate_clause : "",
                 format == 't' ? "TABLESPACE_MAP" : "",
                 keepflag);
}
else
    basebkp =
        psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
                 escaped_label,
                 showprogress ? "PROGRESS" : "",
                 includewal == FETCH_WAL ? "WAL" : "",
                 fastcheckpoint ? "FAST" : "",
                 includewal == NO_WAL ? "" : "NOWAIT",
                 maxrate_clause ? maxrate_clause : "",
                 format == 't' ? "TABLESPACE_MAP" : "");
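The mapping from the two switches to the keepflag suffix can be isolated into a small standalone function, which makes the three cases easy to test. This is only a sketch: choose_keepflag and format_base_backup are hypothetical names, and the command string is reduced to the label and the suffix rather than the full option list used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: pick the suffix appended to the BASE_BACKUP
 * command. The leading comma is the delimiter the server later splits on.
 */
static const char *
choose_keepflag(bool keeppglog, bool forcerebuild)
{
    if (keeppglog && forcerebuild)
        return ",keeplogaf";    /* keep pglog and conf files */
    if (keeppglog)
        return ",keeppglog";    /* keep pglog only */
    if (forcerebuild)
        return ",keeppgcnf";    /* keep conf files only */
    return "";                  /* stock behavior: nothing appended */
}

/*
 * Hypothetical helper: build a simplified command string (label plus
 * suffix only). Returns the snprintf result.
 */
static int
format_base_backup(char *dst, size_t dstlen, const char *label,
                   bool keeppglog, bool forcerebuild)
{
    assert(dst != NULL && label != NULL);
    return snprintf(dst, dstlen, "BASE_BACKUP LABEL '%s' %s",
                    label, choose_keepflag(keeppglog, forcerebuild));
}
```

Returning an empty string when neither switch is set keeps the command identical in content to the unmodified one, which matches the principle that existing behavior must not change.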
Changes in main:
/*
 * Modify by Nickxyang at 2021-07-11 pm
 * When forcerebuild is true, there is no need to check whether the standby datadir is empty,
 * because we will delete some of its files and directories
 */
if (format == 'p' || strcmp(basedir, "-") != 0)
{
    if (!forcerebuild)
        verify_dir_is_empty_or_create(basedir, &made_new_pgdata, &found_existing_pgdata);
}
/* connection in replication mode to server */
conn = GetConnection();
if (!conn)
{
    /* Error message already written in GetConnection() */
    exit(1);
}
/*
 * When keeppglog or forcerebuild is true, check whether standby is started.
 * Everything in this block runs only when one of the new switches is given,
 * so the stock code path is left untouched.
 */
if (keeppglog || forcerebuild)
{
    baseversion = ValidatePgVersion(basedir);
    running = is_standby_running();
    /* make standbyfile */
    if (baseversion >= 9 && baseversion < 12)
        strcpy(standbyfile, "recovery.conf");
    else if (baseversion >= 12)
        strcpy(standbyfile, "standby.signal");
    else
        exit(1);
    /* If the standby has been started, stop it */
    if (running && is_standbyDir(basedir, standbyfile))
    {
        stop_standby();
    }
}
/* Delete the old standby data directory
 * We just delete some data files, pglog can be retained by specifying the --no-pglog parameter,
 * and retain postgresql.conf, postgresql.auto.conf, recovery.conf (if it exists)
 */
if (forcerebuild)
    delete_datadir(basedir);
/* Create pg_wal symlink, if required */
if (strcmp(xlog_dir, "") != 0)
{
    char *linkloc;
    verify_dir_is_empty_or_create(xlog_dir, &made_new_xlogdir, &found_existing_xlogdir);
    /*
     * Form name of the place where the symlink must go. pg_xlog has been
     * renamed to pg_wal in post-10 clusters.
     */
    linkloc = psprintf("%s/%s", basedir,
                       PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
                       "pg_xlog" : "pg_wal");
#ifdef HAVE_SYMLINK
    if (symlink(xlog_dir, linkloc) != 0)
    {
        fprintf(stderr, _("%s: could not create symbolic link \"%s\": %s\n"),
                progname, linkloc, strerror(errno));
        disconnect_and_exit(1);
    }
#else
    fprintf(stderr, _("%s: symlinks are not supported on this platform\n"),
            progname);
    disconnect_and_exit(1);
#endif
    free(linkloc);
}
BaseBackup();
/*
 * If forcerebuild is true, and BaseBackup is done, start standby
 */
if (forcerebuild)
    start_standby();
/*
 * Modify End at 2021-07-11 pm
 */
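The body of delete_datadir is not shown, but its core decision, which top-level entries survive a forced rebuild, can be sketched as a predicate. This is an assumption-laden illustration: retain_on_rebuild is a hypothetical name, the retained file list follows the code comment in main (postgresql.conf, postgresql.auto.conf, recovery.conf), and the log directory name is passed in rather than read from configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed list of files kept across a rebuild, per the comment in main. */
static const char *const retained_files[] = {
    "postgresql.conf",
    "postgresql.auto.conf",
    "recovery.conf",
};

/*
 * Hypothetical predicate: should this top-level dataDir entry be left
 * in place when the old data directory is cleaned out?
 */
static bool
retain_on_rebuild(const char *name, bool keeppglog, const char *log_directory)
{
    size_t i;

    assert(name != NULL && log_directory != NULL);

    /* Never touch the directory self/parent links */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return true;

    /* The log directory survives only when --no-pglog was requested */
    if (keeppglog && strcmp(name, log_directory) == 0)
        return true;

    for (i = 0; i < sizeof(retained_files) / sizeof(retained_files[0]); i++)
    {
        if (strcmp(name, retained_files[i]) == 0)
            return true;
    }
    return false;
}
```

A cleanup loop would walk the entries of dataDir with readdir and remove every entry for which this predicate returns false.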
2. Server side: walsender
Changes in bool exec_replication_command(const char *cmd_string):
/*
 * Modify by Nickxyang at 2021-07-11
 * Cut out cmd_str and keepflag from cmd_string, cmd_str is the basebackup command
 */
const char delim[2] = ",";
char *cmd_str;
char *strtoken;
/*
 * strtok() modifies its argument, so work on a private copy. pstrdup()
 * sizes the copy to the input, so a long command cannot overflow it.
 */
cmd_str = pstrdup(cmd_string);
strtok(cmd_str, delim);
strtoken = strtok(NULL, delim);
if (strtoken != NULL)
{
    /* keepflag is declared elsewhere; its buffer must hold the longest flag */
    strcpy(keepflag, strtoken);
    ereport(LOG,
            (errmsg("keepflag: %s", keepflag)));
}
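The split step can also be written as a standalone function with an explicit bound on the flag buffer. The sketch below uses strchr instead of strtok (a deliberate substitution, since strtok keeps hidden state and skips leading delimiters); split_keepflag is a hypothetical name and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "COMMAND ... ,flag" at the first comma.
 * The command is truncated in place and the flag (without the comma)
 * is copied into flag[0..flaglen).
 * Returns 1 if a flag was found and copied, 0 if there was no comma,
 * and -1 if the flag would not fit in the buffer.
 */
static int
split_keepflag(char *cmd, char *flag, size_t flaglen)
{
    char *comma;

    assert(cmd != NULL && flag != NULL && flaglen > 0);
    flag[0] = '\0';

    comma = strchr(cmd, ',');
    if (comma == NULL)
        return 0;

    *comma = '\0';              /* cmd now ends where the flag began */
    if (strlen(comma + 1) >= flaglen)
        return -1;              /* refuse to overflow the caller's buffer */

    strcpy(flag, comma + 1);
    return 1;
}
```

One consequence of using a comma as the delimiter, in either version, is that a backup label containing a comma would be cut at that point.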
Changes in sendDir:
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
    continue;
/* Skip temporary files */
if (strncmp(de->d_name,
            PG_TEMP_FILE_PREFIX,
            strlen(PG_TEMP_FILE_PREFIX)) == 0)
    continue;
/* Skip the log directory when the standby keeps its own pglog */
if (strcmp(de->d_name, Log_directory) == 0 &&
    (strcmp(keepflag, "keeppglog") == 0 || strcmp(keepflag, "keeplogaf") == 0))
    continue;
/* Skip the conf files when the standby keeps its own configuration */
if ((strcmp(keepflag, "keeppgcnf") == 0 || strcmp(keepflag, "keeplogaf") == 0) &&
    (strcmp(de->d_name, "postgresql.conf") == 0 || strcmp(de->d_name, "postgresql.auto.conf") == 0))
    continue;
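The two keepflag conditions in sendDir can be read as one predicate over the entry name. The sketch below restates them in a testable form; skip_for_keepflag is a hypothetical name, and the log directory is passed as a parameter instead of using the server's Log_directory setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical predicate: given a directory entry name and the keepflag
 * received from the client, should the server leave this entry out of
 * the base backup?
 */
static bool
skip_for_keepflag(const char *name, const char *keepflag,
                  const char *log_directory)
{
    bool keep_log;
    bool keep_conf;

    assert(name != NULL && keepflag != NULL && log_directory != NULL);

    /* 'keeplogaf' implies both behaviors */
    keep_log = strcmp(keepflag, "keeppglog") == 0 ||
               strcmp(keepflag, "keeplogaf") == 0;
    keep_conf = strcmp(keepflag, "keeppgcnf") == 0 ||
                strcmp(keepflag, "keeplogaf") == 0;

    if (keep_log && strcmp(name, log_directory) == 0)
        return true;
    if (keep_conf &&
        (strcmp(name, "postgresql.conf") == 0 ||
         strcmp(name, "postgresql.auto.conf") == 0))
        return true;
    return false;
}
```

With an empty keepflag the predicate never fires, so a client that does not send the flag receives the same file set as before.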
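V. Summary
"One-click" rebuild is quite convenient for in-place rebuild scenarios, and the option to keep pglog means a rebuild no longer gets in the way of looking up historical logs from a given period afterwards.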