故障分析 | MySQL 主从不同步分析案例
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了故障分析 | MySQL 主从不同步分析案例相关的知识,希望对你有一定的参考价值。
背景
线上某核心业务采用了自建mysql 主从架构 ,因这个业务库主要是和第三方合作项目,无监控权限,DB权限需要临时申请才可操作
现象分析
- 项目组里反馈 业务数据最近两天无新数据产生。
- 推测如下:
- 此业务这两天确实无实际数据产生
- 主从同步异常
诊断
- 经查业务主库的binlog有变化,说明最近两天有数据
- 目前业务库拓扑结构如下:
排查与处理
本次无数据是mysql07-bi-prod-bj1和主库不同步导致。不同步告警如下:
mysql> show slave status\\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.23.70.101
Master_User: repl_user
Master_Port: 3306
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1396
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 89d024cb-8865-11e9-babc-fa83d2adf000:33387663 at master log mysql_bin.000407, end_log_pos 427666467. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
详细查看告警信息:
mysql> select * from performance_schema.replication_applier_status_by_worker limit 1\\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_ERROR_NUMBER: 1396
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 89d024cb-8865-11e9-babc-fa83d2adf000:33387663 at master log mysql_bin.000407, end_log_pos 427666467; Error Operation CREATE USER failed for app_sqoop_r@10.33.75.% on query. Default database: . Query: CREATE USER app_sqoop_r@10.33.75.% IDENTIFIED WITH mysql_native_password AS *463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA
1 row in set (0.00 sec)
根据告警提示,查看master log mysql_bin.000407, end_log_pos 427666467前执行了什么。
解析主库mysql_bin.000407如下:
mysqlbinlog --no-defaults -vv --base64-output=decode-rows \\
> /data/mysql/mysql57_3306/binlog/mysql_bin.000407 --stop-position=427666467
# at 427666247
#200224 11:59:51 server id 70101 end_log_pos 427666467 CRC32 0x3c9eaf42 Query thread_id=76726 exec_time=0 error_code=0 Xid = 57596658
SET TIMESTAMP=1582516791.092605/*!*/;
CREATE USER app_sqoop_r@10.33.75.% IDENTIFIED WITH mysql_native_password AS *463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA
/*!*/;
SET @@SESSION.GTID_NEXT= AUTOMATIC /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
DELIMITER /*!*/;
# at 427666467
GRANT SELECT ON `xyy_crm`.* TO app_bi_w@10.33.75.%
/*!*/;
# at 427666705
如上在create user时,从库2与主库不同步,检查从库2确实已经有app_sqoop_r@10.33.75.%用户,且权限与主库不同
查看主库的用户权限
mysql> show grants for app_sqoop_r@10.33.75.%;
+-----------------------------------------------------------+
| Grants for app_sqoop_r@10.33.75.% |
+-----------------------------------------------------------+
| GRANT USAGE ON *.* TO `app_sqoop_r`@`10.33.75.%` |
| GRANT SELECT ON `xyy_crm`.* TO `app_sqoop_r`@`10.33.75.%` |
mysql07-bi-prod-bj1从库2的用户权限
mysql> show grants for app_sqoop_r@10.33.75.%;
| GRANT SELECT ON *.* TO `app_sqoop_r`@`10.33.75.%` |
至此 已经定位具体的问题:用户是直接在从库2建了app_sqoop_r@10.33.75.%用户所导致的不同步,接下来只要跳过该事物即可
[root@localhost] 10:48:04 ((none)) [2]> stop slave;
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] 10:48:57 ((none)) [5]> set gtid_next=89d024cb-8865-11e9-babc-fa83d2adf000:33387663;
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] 10:49:11 ((none)) [6]> begin;commit;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] 10:49:15 ((none)) [7]> set gtid_next=AUTOMATIC;
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] 10:49:23 ((none)) [8]> start slave;
Query OK, 0 rows affected (0.04 sec)
再查看主从同步:
[root@localhost] 10:49:26 ((none)) [9]> show slave status\\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.23.70.101
Master_User: repl_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql_bin.000411
Read_Master_Log_Pos: 302997670
Relay_Log_File: relay.000340
Relay_Log_Pos: 448617544
Relay_Master_Log_File: mysql_bin.000407
Slave_IO_Running: Yes ### 此时查看 主从同步已正常
Slave_SQL_Running: Yes ### 此时查看 主从同步已正常
Seconds_Behind_Master: 166878
测试验证
针对以上问题做出以下测试与验证
先说结论如下:
- 验证1:若先在从库create,然后grant;主库create用户查看主从关系。(结论是主从异常)
- 验证2:若在从库grant方式创建并授权用户;主库grant,查看主从关系(结论是主从正常)
- 验证3:若从库grant方式创建并授权用户;主库create用户,查看主从关系(结论是主从异常)
验证1:
当从库已经创建了用户,主库创建用户是否会出现Last_Errno: 1396
构造场景1
从库执行:
root@localhost] 16:04:33 ((none)) [12]> CREATE USER app_sqoop_r@10.33.75.% IDENTIFIED by password *463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA;
Query OK, 0 rows affected, 1 warning (0.07 sec)
[root@localhost] 16:05:18 ((none)) [13]> grant select on *.* to app_sqoop_r@10.33.75.%;
Query OK, 0 rows affected (0.00 sec)
主库执行
mysql> CREATE USER app_sqoop_r@10.33.75.% IDENTIFIED by *463A80AB6D77E0D99CD;
Query OK, 0 rows affected (0.00 sec)
此时查看主从关系
[root@localhost] 16:05:34 ((none)) [14]> show slave status\\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: mha2-db-test-bj2
Master_User: repl_user
Master_Port: 3300
Connect_Retry: 60
Master_Log_File: mysql_bin.000005
Read_Master_Log_Pos: 192344
Relay_Log_File: relay.000003
Relay_Log_Pos: 192176
Relay_Master_Log_File: mysql_bin.000005
Slave_IO_Running: Yes
Slave_SQL_Running: No #### 主从关系是异常的
Last_Errno: 1396
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction fad94351-9975-11e9-9171-fa1b8143c800:5 at master log mysql_bin.000005, end_log_pos 192344. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
结论1:
主从同步异常: Slave_SQL_Running: No,Last_Errno: 1396
验证2
通过grant创建并授权用户
构造场景2
从库执行
[root@localhost] 16:21:22 ((none)) [36]> grant select on ost_test.* to app_test_r@10.% identified by 3A80AB6D77E0D99CD;
Query OK, 0 rows affected, 1 warning (0.00 sec)
主库执行
mysql> grant insert on `ost_test`.* TO app_test_r@10.% identified by 5A80AB6D77E0D99CD;
Query OK, 0 rows affected, 1 warning (0.00 sec)
此时查看主从关系
[root@localhost] 16:21:53 ((none)) [37]> show slave status\\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: mha2-db-test-bj2
Master_User: repl_user
Master_Port: 3300
Connect_Retry: 60
Master_Log_File: mysql_bin.000005
Read_Master_Log_Pos: 193391
Relay_Log_File: relay.000004
Relay_Log_Pos: 1501
Relay_Master_Log_File: mysql_bin.000005
Slave_IO_Running: Yes ### 主从同步正常
Slave_SQL_Running: Yes ### 主从同步正常
结论2
主从同步正常,不影响数据同步
验证3
在从库grant方式创建并授权用户;主库create用户
构造场景3
从库执行
[root@localhost] 16:30:45 ((none)) [41]> grant select on ost_test.* to app_test_r@10.% identified by 3A80AB6D77E0D99CD;
Query OK, 0 rows affected, 1 warning (0.00 sec)
主库执行
mysql> create user app_test_r@10.% identified by 3A80AB6D77E0D99;
Query OK, 0 rows affected (0.00 sec)
查看主从关系
[root@localhost] 16:30:59 ((none)) [42]> show slave status\\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: mha2-db-test-bj2
Master_User: repl_user
Master_Port: 3300
Connect_Retry: 60
Slave_IO_Running: Yes
Slave_SQL_Running: No ### 主从异常
Last_Errno: 1396
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction fad94351-9975-11e9-9171-fa1b8143c800:11 at master log mysql_bin.000005, end_log_pos 193808. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
结论3
主从同步异常: Slave_SQL_Running: No,Last_Errno: 1396
总结
- 针对生产环境的情况在不清楚主从关系情况下,
- 先用show slave status\\G检测是否是主库
- 用select * from information_schema.
PROCESSLIST
where state like %binlog%\\G检测从库IP端口等信息。
- 保证执行所有的语句一定要在主库执行!!!
以上是关于故障分析 | MySQL 主从不同步分析案例的主要内容,如果未能解决你的问题,请参考以下文章