Notes on Troubleshooting an Unexpected Debezium Exit

Posted by Highgo PG Lab (瀚高PG实验室)

Author: Highgo PG Lab (瀚高PG实验室) - Xu Yunhe (徐云鹤)

Problem description

Today, in a test environment, I ran into the following problem: after an UPDATE was executed in the database, Debezium aborted with this error:

[FATAL] Replication stream was unexpectedly terminated: ERROR:  no known snapshots
CONTEXT:  slot "myslot", output plugin "debezium", in the change callback, associated LSN 19/B6A4A448

Troubleshooting process

First, check the database server log.

2021-08-09 11:31:45.370 CST [62063] postgres@mydb app=debezium LOG:  received replication command: IDENTIFY_SYSTEM
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium LOG:  received replication command: START_REPLICATION SLOT "myslot" LOGICAL 19/B22FE808 ("error_policy" 'exit',"table_list_path" '/home/postgres/app/table_list.conf')
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium LOG:  starting logical decoding for slot "myslot"
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium DETAIL:  streaming transactions committing after 19/B3000000, reading WAL from 19/B22FE808
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium LOG:  logical decoding found consistent point at 19/B22FE808
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium DETAIL:  There are no running transactions.
Mon Aug  9 11:31:46 CST 2021
2021-08-09 11:31:46.629 CST [62063] postgres@mydb app=debezium ERROR:  no known snapshots
2021-08-09 11:31:46.629 CST [62063] postgres@mydb app=debezium CONTEXT:  slot "myslot", output plugin "debezium", in the change callback, associated LSN 19/B6A4A448

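My first suspicion was the replication slot. I looked up the slot on the primary, dropped it manually, and let Debezium create a new one. That did not help: the same error came back, and restarting the database made no difference either.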
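The slot check and manual drop described above can be sketched roughly as follows. This is a minimal sketch, assuming the slot name "myslot" from the log and PostgreSQL 9.6 or later (for the confirmed_flush_lsn column); the original post does not show the exact statements that were run.

```sql
-- Inspect the slot used by Debezium ("myslot" is taken from the server log).
SELECT slot_name, plugin, slot_type, database, active,
       restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'myslot';

-- Drop the slot. This only succeeds while the slot is inactive,
-- so the connector has to be stopped first.
SELECT pg_drop_replication_slot('myslot');
```

Dropping the slot discards its decoding position, so the connector starts from a fresh slot the next time it connects. In this case the error came back anyway, which suggests the slot itself was not the cause.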
Going back to the log, the message "DETAIL: There are no running transactions." led me to the SnapBuildFindSnapshot function in snapbuild.c. The relevant source is shown below.

	/*
	 * a) No transaction were running, we can jump to consistent.
	 *
	 * This is not affected by races around xl_running_xacts, because we can
	 * miss transaction commits, but currently not transactions starting.
	 *
	 * NB: We might have already started to incrementally assemble a snapshot,
	 * so we need to be careful to deal with that.
	 */
	if (running->oldestRunningXid == running->nextXid)
	{
		if (builder->start_decoding_at == InvalidXLogRecPtr ||
			builder->start_decoding_at <= lsn)
			/* can decode everything after this */
			builder->start_decoding_at = lsn + 1;

		/* As no transactions were running xmin/xmax can be trivially set. */
		builder->xmin = running->nextXid;	/* < are finished */
		builder->xmax = running->nextXid;	/* >= are running */

		/* so we can safely use the faster comparisons */
		Assert(TransactionIdIsNormal(builder->xmin));
		Assert(TransactionIdIsNormal(builder->xmax));

		builder->state = SNAPBUILD_CONSISTENT;
		SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);

		ereport(LOG,
				(errmsg("logical decoding found consistent point at %X/%X",
						(uint32) (lsn >> 32), (uint32) lsn),
				 errdetail("There are no running transactions.")));

		return false;
	}

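The condition running->oldestRunningXid == running->nextXid evaluated to true, meaning the oldest running transaction ID was equal to the next transaction ID to be assigned, i.e. no transaction was in progress, so the server wrote the log message shown above.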
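A similar "nothing is running" state can be observed from SQL. This is only an analogy: the check in snapbuild.c works on the xl_running_xacts WAL record, not on these functions. The queries below are a hedged sketch for looking at the current transaction state of a server.

```sql
-- Returns xmin:xmax:xip_list. When xmin equals xmax and the list is empty,
-- no other transaction with an assigned ID is in progress.
SELECT txid_current_snapshot();

-- Sessions that currently hold a transaction ID or a snapshot (xmin).
-- The session running this query shows up with its own backend_xmin.
SELECT pid, state, backend_xid, backend_xmin, xact_start
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL
   OR backend_xmin IS NOT NULL;
```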
Searching the source further showed that the error itself is raised by the init_toast_snapshot function in tuptoaster.c.

/* ----------
 * init_toast_snapshot
 *
 *	Initialize an appropriate TOAST snapshot.  We must use an MVCC snapshot
 *	to initialize the TOAST snapshot; since we don't know which one to use,
 *	just use the oldest one.  This is safe: at worst, we will get a "snapshot
 *	too old" error that might have been avoided otherwise.
 */
static void
init_toast_snapshot(Snapshot toast_snapshot)
{
	Snapshot	snapshot = GetOldestSnapshot();

	if (snapshot == NULL)
		elog(ERROR, "no known snapshots");

	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
}

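This code initializes a suitable TOAST snapshot by fetching the oldest known snapshot. Because none was available, the snapshot variable was NULL and the error was raised.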
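Since the error comes from the TOAST code path, it helps to see which columns of the affected table can be stored out of line. The following is a minimal sketch using the system catalogs; the table name complain_info is the one from this case.

```sql
-- Storage strategy of every user column in complain_info.
-- attstorage: p = PLAIN, m = MAIN, e = EXTERNAL, x = EXTENDED
SELECT a.attname,
       format_type(a.atttypid, a.atttypmod) AS data_type,
       a.attstorage
FROM pg_attribute AS a
WHERE a.attrelid = 'complain_info'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;

-- The TOAST relation attached to the table ("-" means there is none).
SELECT c.relname, c.reltoastrelid::regclass AS toast_table
FROM pg_class AS c
WHERE c.oid = 'complain_info'::regclass;
```

Columns with storage x or e may have large values moved into the TOAST relation, and reading such values is what goes through init_toast_snapshot.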
For now, the workaround is to change the table definition so that the column is stored inline instead of out of line.

alter table complain_info alter linknum SET STORAGE plain;
vacuum full complain_info;
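One way to confirm that the change took effect is sketched below, assuming the same table and column names as in the statements above. SET STORAGE only changes the strategy for values written afterwards, which is why the table is rewritten with VACUUM FULL.

```sql
-- Expect attstorage = 'p' (PLAIN) for linknum after the ALTER TABLE.
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'complain_info'::regclass
  AND attname = 'linknum';

-- Size of the table's TOAST relation after the rewrite, if one exists.
SELECT pg_size_pretty(pg_relation_size(c.reltoastrelid::regclass)) AS toast_size
FROM pg_class AS c
WHERE c.oid = 'complain_info'::regclass
  AND c.reltoastrelid <> 0;
```

With PLAIN storage a value is neither compressed nor moved out of line, so a row that no longer fits in a page is rejected on insert or update. The workaround therefore suits columns whose values stay small.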

After verification, the problem no longer occurred.
With that, the issue is provisionally resolved.
