生产排查 | MySQL主从同步时报错1864之slave_pending_jobs_size_max设置不当导致的主从失效案例

Posted 2021-06-21 戏说数据那点事

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了生产排查 | MySQL主从同步时报错1864之slave_pending_jobs_size_max设置不当导致的主从失效案例相关的知识，希望对你有一定的参考价值。

生产排查 | mysql主从同步时报错1864之slave_pending_jobs_size_max设置生产排查

大家都知道，MySQL 的各参数很多，设置不恰当，会引起各种异常；

本文根据一则实际生产异常场景，即从库的同步异常报错Last_Errno: 1864，来介绍slave_pending_jobs_size_max参数的设置不当导致主从同步异常的案例。

1. 问题现象描述

生产环境上从库出现SQL进程停止的异常，错误信息如下：

mysql> show slave status \G; *************************** 1. row ***************************Slave_IO_State: Waiting for master to send eventMaster_Host: 192.168.0.11Master_User: replica_01Master_Port: 3306Connect_Retry: 60Master_Log_File: mysql-bin.004783Read_Master_Log_Pos: 3681586Relay_Log_File: DB2-relay-bin.000007Relay_Log_Pos: 671091671Relay_Master_Log_File: mysql-bin.004782Slave_IO_Running: YesSlave_SQL_Running: NoReplicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1864Last_Error: Cannot schedule event Update_rows, relay-log name ./DB2-relay-bin.000007, position 671094876 to Worker thread because its size 50450016 exceeds 16777216 of slave_pending_jobs_size_max.Skip_Counter: 0Exec_Master_Log_Pos: 671091458Relay_Log_Space: 1077425236Until_Condition: NoneUntil_Log_File: Until_Log_Pos: 0Master_SSL_Allowed: NoMaster_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULLMaster_SSL_Verify_Server_Cert: NoLast_IO_Errno: 0Last_IO_Error: Last_SQL_Errno: 1864Last_SQL_Error: Cannot schedule event Update_rows, relay-log name ./DB2-relay-bin.000007, position 671094876 to Worker thread because its size 50450016 exceeds 16777216 of slave_pending_jobs_size_max.Replicate_Ignore_Server_Ids: Master_Server_Id: 17860Master_UUID: 938bda2c-41ed-11e6-96c1-0050569c39f0Master_Info_File: mysql.slave_master_infoSQL_Delay: 0SQL_Remaining_Delay: NULLSlave_SQL_Running_State: Master_Retry_Count: 86400Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 210608 15:02:19Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: Executed_Gtid_Set: Auto_Position: 0Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: 1 row in set (0.00 sec)

ERROR: No query specified

根据报错信息，可以很明显看出从库的Slave_SQL_Running: No线程异常，原因为Pending的事件所占用的内存为50450016（约48MB）大于slave_pending_jobs_size_max参数的默认设置。

错误提示：

Last_Error: Cannot schedule event Update_rows, relay-log name ./DB2-relay-bin.000007, position 671094876 to Worker thread because its size 50450016 exceeds 16777216 of slave_pending_jobs_size_max.

--根据错误提示，确认下当前slave_pending_jobs_size_max参数的默认设置为16777216（16MB）

mysql> show variables like '%pending%';+-----------------------------+----------+| Variable_name | Value |+-----------------------------+----------+| slave_pending_jobs_size_max | 16777216 |+-----------------------------+----------+1 row in set (0.00 sec)

--原因分析

MySQL 5.7开启了多线程复制，与多线程复制相关的有一个参数slave_pending_jobs_size_max，错误日志中提示这个参数设置的小。

此参数在MySQL 5.6以后引入，默认单位是字节。如果没有开启多线程复制，则此参数无用。参数的最小值是1024字节，默认值是16MB，最大值是18446744073709551615。

注意，从库中的此参数的值需等于或大于主库的max_allowed_packet参数值，否则从库的工作队列可能会变满。

官方建议

On a multi-threaded replica (with slave_parallel_workers > 0), ensure that the slave_pending_jobs_size_max system variable is set to a value equal to or greater than the setting for the max_allowed_packet system variable on the source.

The replica actually accepts packets up to the limit set by its slave_max_allowed_packet setting, which defaults to the maximum setting of 1GB, to prevent a replication failure due to a large packet. However, the value of slave_pending_jobs_size_max controls the memory that is made available on the replica to hold incoming packets. The specified memory is shared among all the replica worker queues.

在多线程复制时，在队列中Pending的事件所占用的最大内存，默认为16M，如果内存充足（本案例中服务器内存为64GB），或者延迟较大时，可适当调大，最大可设置为1GB;

再看有啥建议：

The value of slave_pending_jobs_size_max is a soft limit, and if an unusually large event (consisting of one or multiple packets) exceeds this size, the transaction is held until all the replica workers have empty queues, and then processed. All subsequent transactions are held until the large transaction has been completed. So although unusual events larger than slave_pending_jobs_size_max can be processed, the delay to clear the queues of all the replica workers and the wait to queue subsequent transactions can cause lag on the replica and decreased concurrency of the replica workers. slave_pending_jobs_size_max should therefore be set high enough to accommodate most expected event sizes.

简翻译下：

slave_pending_jobs_size_max的值是一软限制，如果一个异常大的事件(包含一个或多个数据包)超过这个大小，事务将被保留，直到所有的replica workers都有空队列，然后再进行处理。所有后续事务都被保存，直到大事务完成。尽管可以处理大于slave_pending_jobs_size_max的异常事件，但清除所有replica worker队列的延迟和后续事务排队的等待可能会导致replica延迟，并降低replica workers的并发性。因此，应该将Slave_pending_jobs_size_max设置得足够高，以适应大多数预期的事件大小。

--从库查看两个并行度参数（并行度为64）

mysql> show variables like '%parallel%';+------------------------+---------------+| Variable_name | Value |+------------------------+---------------+| slave_parallel_type | LOGICAL_CLOCK || slave_parallel_workers | 64 |+------------------------+---------------+2 rows in set (0.00 sec)

查看主库的max_allowed_packet参数值为16MB

mysql> show variables like '%allowed_packet%';+--------------------------+------------+| Variable_name | Value |+--------------------------+------------+| max_allowed_packet | 16777216 || slave_max_allowed_packet | 1073741824 |+--------------------------+------------+2 rows in set (0.01 sec)

--问题解决

设置slave_pending_jobs_size_max的默认值16777216（16MB），此参数单位为MB，大于50450016（48MB）即可。

这里设置为100MB(104857600=100*1024*1024字节)

set global slave_pending_jobs_size_max=104857600;

为了是从库重启后

slave_pending_jobs_size_max=104857600依旧生效，在初始参数文件/etc/my.cnf中添加如下参数

[mysqld]global slave_pending_jobs_size_max=104857600

注意：需重启mysql进程才能生效

从库中修改global slave_pending_jobs_size_max为104857600

（100MB）的操作如下：

--执行步骤为

mysql> stop slave;Query OK, 0 rows affected (0.04 sec)

mysql> set global slave_pending_jobs_size_max=104857600;Query OK, 0 rows affected (0.00 sec)

mysql> start slave;Query OK, 0 rows affected (0.61 sec)

mysql> show slave status \G;

经查看，MySQL主从同步恢复正常。

【参考】

https://dev.mysql.com/doc/refman/8.0/en/replication-features-max-allowed-packet.html

文章至此。

近期热文

你可能也会对以下话题感兴趣。点击链接便可查看。

以上是关于生产排查 | MySQL主从同步时报错1864之slave_pending_jobs_size_max设置不当导致的主从失效案例的主要内容，如果未能解决你的问题，请参考以下文章