Notes on troubleshooting an unexpected Debezium exit
Posted by Highgo PG Lab (瀚高PG实验室)
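Author: Highgo PG Lab (瀚高PG实验室), Xu Yunhe (徐云鹤)
Problem description
Today, in a test environment, I ran into the following problem: after an UPDATE was executed on the database, Debezium aborted with this error: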
[FATAL] Replication stream was unexpectedly terminated: ERROR: no known snapshots
CONTEXT: slot "myslot", output plugin "debezium", in the change callback, associated LSN 19/B6A4A448
Troubleshooting
First, check the database server log.
2021-08-09 11:31:45.370 CST [62063] postgres@mydb app=debezium LOG: received replication command: IDENTIFY_SYSTEM
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium LOG: received replication command: START_REPLICATION SLOT "myslot" LOGICAL 19/B22FE808 ("error_policy" 'exit',"table_list_path" '/home/postgres/app/table_list.conf')
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium LOG: starting logical decoding for slot "myslot"
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium DETAIL: streaming transactions committing after 19/B3000000, reading WAL from 19/B22FE808
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium LOG: logical decoding found consistent point at 19/B22FE808
2021-08-09 11:31:45.373 CST [62063] postgres@mydb app=debezium DETAIL: There are no running transactions.
Mon Aug 9 11:31:46 CST 2021
2021-08-09 11:31:46.629 CST [62063] postgres@mydb app=debezium ERROR: no known snapshots
2021-08-09 11:31:46.629 CST [62063] postgres@mydb app=debezium CONTEXT: slot "myslot", output plugin "debezium", in the change callback, associated LSN 19/B6A4A448
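The two positions in the log above (decoding restarted reading WAL from 19/B22FE808, and the failing change is at 19/B6A4A448) are 64-bit WAL locations printed as two 32-bit hexadecimal halves. As a minimal sketch, the helpers `lsn_parse` and `lsn_distance` below are names made up for illustration, not PostgreSQL APIs; they only show how the textual form maps to a number and how far apart two positions are.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN written as HEX/HEX (for example "19/B6A4A448") into a
 * 64-bit value. Returns 1 on success and 0 if the text has another shape.
 * Hypothetical helper, not part of PostgreSQL.
 */
static int
lsn_parse(const char *text, uint64_t *out)
{
    unsigned int hi;
    unsigned int lo;
    char         trailing;

    assert(text != NULL && out != NULL);

    /* Exactly two hex fields must match; a third match means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return 0;

    *out = ((uint64_t) hi << 32) | lo;
    return 1;
}

/* Number of WAL bytes between two positions; 'to' must not precede 'from'. */
static uint64_t
lsn_distance(uint64_t from, uint64_t to)
{
    assert(to >= from);
    return to - from;
}
```

With the values from this log, the distance works out to 74,759,232 bytes, roughly 71 MiB of WAL between the restart point and the failing change. On a live server, the built-in SQL function pg_wal_lsn_diff answers the same question.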
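My first suspicion was the replication slot. I looked the slot up on the primary and dropped it manually so that Debezium would create a new one. That did not help: the error persisted, and restarting the database made no difference either.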
Going back to the log, the line "DETAIL: There are no running transactions." led me to the SnapBuildFindSnapshot function in snapbuild.c. The relevant source is as follows.
    /*
     * a) No transaction were running, we can jump to consistent.
     *
     * This is not affected by races around xl_running_xacts, because we can
     * miss transaction commits, but currently not transactions starting.
     *
     * NB: We might have already started to incrementally assemble a snapshot,
     * so we need to be careful to deal with that.
     */
    if (running->oldestRunningXid == running->nextXid)
    {
        if (builder->start_decoding_at == InvalidXLogRecPtr ||
            builder->start_decoding_at <= lsn)
            /* can decode everything after this */
            builder->start_decoding_at = lsn + 1;
        /* As no transactions were running xmin/xmax can be trivially set. */
        builder->xmin = running->nextXid; /* < are finished */
        builder->xmax = running->nextXid; /* >= are running */
        /* so we can safely use the faster comparisons */
        Assert(TransactionIdIsNormal(builder->xmin));
        Assert(TransactionIdIsNormal(builder->xmax));
        builder->state = SNAPBUILD_CONSISTENT;
        SnapBuildStartNextPhaseAt(builder, InvalidTransactionId);
        ereport(LOG,
                (errmsg("logical decoding found consistent point at %X/%X",
                        (uint32) (lsn >> 32), (uint32) lsn),
                 errdetail("There are no running transactions.")));
        return false;
    }
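Here the condition running->oldestRunningXid == running->nextXid evaluated to true: the oldest running transaction ID equals the next transaction ID to be assigned, which means no transaction was running at that point, so the log message above was emitted.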
Searching the source further showed that the error itself is raised by the init_toast_snapshot function in tuptoaster.c.
/* ----------
 * init_toast_snapshot
 *
 * Initialize an appropriate TOAST snapshot. We must use an MVCC snapshot
 * to initialize the TOAST snapshot; since we don't know which one to use,
 * just use the oldest one. This is safe: at worst, we will get a "snapshot
 * too old" error that might have been avoided otherwise.
 */
static void
init_toast_snapshot(Snapshot toast_snapshot)
{
    Snapshot snapshot = GetOldestSnapshot();

    if (snapshot == NULL)
        elog(ERROR, "no known snapshots");

    InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
}
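This code initializes a suitable TOAST snapshot by fetching the oldest known snapshot. Because no snapshot was found, the snapshot variable is NULL and the error is raised.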
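The failure path can be modelled in a few lines. This is a simplified sketch under stated assumptions, not PostgreSQL's implementation: the types `SnapshotSketch` and `BackendSnapshotsSketch` and both functions are hypothetical, and the model assumes the backend tracks at most one oldest registered and one oldest active snapshot, picking the one with the smaller LSN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a snapshot; only its LSN matters here. */
typedef struct SnapshotSketch
{
    uint64_t lsn;
} SnapshotSketch;

/*
 * Hypothetical per-backend bookkeeping: the oldest registered snapshot and
 * the oldest active snapshot. Either one may be absent (NULL).
 */
typedef struct BackendSnapshotsSketch
{
    const SnapshotSketch *oldest_registered;
    const SnapshotSketch *oldest_active;
} BackendSnapshotsSketch;

/* Return whichever snapshot has the smaller LSN, or NULL if there is none. */
static const SnapshotSketch *
oldest_snapshot(const BackendSnapshotsSketch *state)
{
    assert(state != NULL);

    if (state->oldest_registered == NULL)
        return state->oldest_active;
    if (state->oldest_active == NULL)
        return state->oldest_registered;

    return (state->oldest_active->lsn < state->oldest_registered->lsn)
        ? state->oldest_active
        : state->oldest_registered;
}

/*
 * Same shape as init_toast_snapshot: returns 0 and fills *toast_lsn on
 * success, or -1 where the real code would raise "no known snapshots".
 */
static int
init_toast_snapshot_sketch(const BackendSnapshotsSketch *state,
                           uint64_t *toast_lsn)
{
    const SnapshotSketch *snapshot = oldest_snapshot(state);

    assert(toast_lsn != NULL);

    if (snapshot == NULL)
        return -1;

    *toast_lsn = snapshot->lsn;
    return 0;
}
```

In this model the error case is simply a state where both pointers are NULL: nothing has registered or pushed a snapshot by the time a TOASTed value has to be fetched.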
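For now, the workaround is to change the table definition so that the column is stored inline instead of out of line:
alter table complain_info alter linknum SET STORAGE plain;
vacuum full complain_info;
After this change was verified, the problem no longer occurred.
With that, the issue is resolved for the time being.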