故障分析 | MySQL 主从不同步分析案例

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了故障分析 | MySQL 主从不同步分析案例相关的知识,希望对你有一定的参考价值。

背景

线上某核心业务采用了自建mysql 主从架构 ,因这个业务库主要是和第三方合作项目,无监控权限,DB权限需要临时申请才可操作

现象分析

  1. 项目组里反馈 业务数据最近两天无新数据产生。
  2. 推测如下:
  1. 此业务这两天确实无实际数据产生
  2. 主从同步异常

诊断

  1. 经查业务主库的binlog有变化,说明最近两天有数据
  2. 目前业务库拓扑结构如下:

故障分析

排查与处理

本次无数据是mysql07-bi-prod-bj1和主库不同步导致。不同步告警如下:

mysql> show slave status\\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.23.70.101
                  Master_User: repl_user
                  Master_Port: 3306
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1396
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 89d024cb-8865-11e9-babc-fa83d2adf000:33387663 at master log mysql_bin.000407, end_log_pos 427666467. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

详细查看告警信息:    

mysql> select * from performance_schema.replication_applier_status_by_worker limit 1\\G
*************************** 1. row ***************************
                                           CHANNEL_NAME:
                                              WORKER_ID: 1
                                              THREAD_ID: NULL
                                          SERVICE_STATE: OFF
                                      LAST_ERROR_NUMBER: 1396
                                     LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 89d024cb-8865-11e9-babc-fa83d2adf000:33387663 at master log mysql_bin.000407, end_log_pos 427666467; Error Operation CREATE USER failed for app_sqoop_r@10.33.75.% on query. Default database: . Query: CREATE USER app_sqoop_r@10.33.75.% IDENTIFIED WITH mysql_native_password AS *463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA
1 row in set (0.00 sec)

根据告警提示,查看master log mysql_bin.000407, end_log_pos 427666467前执行了什么。

解析主库mysql_bin.000407如下:

mysqlbinlog --no-defaults -vv  --base64-output=decode-rows \\
>   /data/mysql/mysql57_3306/binlog/mysql_bin.000407 --stop-position=427666467
# at 427666247
#200224 11:59:51 server id 70101  end_log_pos 427666467 CRC32 0x3c9eaf42        Query   thread_id=76726 exec_time=0     error_code=0    Xid = 57596658
SET TIMESTAMP=1582516791.092605/*!*/;
CREATE USER app_sqoop_r@10.33.75.% IDENTIFIED WITH mysql_native_password AS *463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA
/*!*/;
SET @@SESSION.GTID_NEXT= AUTOMATIC /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
DELIMITER /*!*/;
# at 427666467
GRANT SELECT ON `xyy_crm`.* TO app_bi_w@10.33.75.%
/*!*/;
# at 427666705

如上在create user时,从库2与主库不同步,检查从库2确实已经有app_sqoop_r@10.33.75.%用户,且权限与主库不同

查看主库的用户权限

mysql> show grants for app_sqoop_r@10.33.75.%;
+-----------------------------------------------------------+
| Grants for app_sqoop_r@10.33.75.%                         |
+-----------------------------------------------------------+
| GRANT USAGE ON *.* TO `app_sqoop_r`@`10.33.75.%`          |
| GRANT SELECT ON `xyy_crm`.* TO `app_sqoop_r`@`10.33.75.%` |

mysql07-bi-prod-bj1从库2的用户权限
mysql> show grants for app_sqoop_r@10.33.75.%;
 
| GRANT SELECT ON *.* TO `app_sqoop_r`@`10.33.75.%` |

至此 已经定位具体的问题:用户是直接在从库2建了app_sqoop_r@10.33.75.%用户所导致的不同步,接下来只要跳过该事物即可

[root@localhost] 10:48:04 ((none)) [2]> stop slave;
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] 10:48:57 ((none)) [5]> set gtid_next=89d024cb-8865-11e9-babc-fa83d2adf000:33387663;
Query OK, 0 rows affected (0.00 sec)
 
= [root@localhost] 10:49:11 ((none)) [6]> begin;commit;
Query OK, 0 rows affected (0.00 sec)
 
Query OK, 0 rows affected (0.00 sec)
 
= [root@localhost] 10:49:15 ((none)) [7]> set gtid_next=AUTOMATIC;
Query OK, 0 rows affected (0.00 sec)
 
= [root@localhost] 10:49:23 ((none)) [8]> start slave;
Query OK, 0 rows affected (0.04 sec)

再查看主从同步:

[root@localhost] 10:49:26 ((none)) [9]> show slave status\\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.23.70.101
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql_bin.000411
          Read_Master_Log_Pos: 302997670
               Relay_Log_File: relay.000340
                Relay_Log_Pos: 448617544
        Relay_Master_Log_File: mysql_bin.000407
             Slave_IO_Running: Yes ### 此时查看 主从同步已正常
            Slave_SQL_Running: Yes ### 此时查看 主从同步已正常
         Seconds_Behind_Master: 166878

测试验证

针对以上问题做出以下测试与验证

先说结论如下:

  1. 验证1:若先在从库create,然后grant;主库create用户查看主从关系。(结论是主从异常)
  2. 验证2:若在从库grant方式创建并授权用户;主库grant,查看主从关系(结论是主从正常)
  3. 验证3:若从库grant方式创建并授权用户;主库create用户,查看主从关系(结论是主从异常)

验证1:

当从库已经创建了用户,主库创建用户是否会出现Last_Errno: 1396

构造场景1

从库执行:

root@localhost] 16:04:33 ((none)) [12]> CREATE USER app_sqoop_r@10.33.75.% IDENTIFIED by password *463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA;
Query OK, 0 rows affected, 1 warning (0.07 sec)
 
[root@localhost] 16:05:18 ((none)) [13]> grant select on *.* to app_sqoop_r@10.33.75.%;
Query OK, 0 rows affected (0.00 sec)

主库执行

mysql> CREATE USER app_sqoop_r@10.33.75.% IDENTIFIED by *463A80AB6D77E0D99CD;
Query OK, 0 rows affected (0.00 sec)

此时查看主从关系

[root@localhost] 16:05:34 ((none)) [14]> show slave status\\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: mha2-db-test-bj2
                  Master_User: repl_user
                  Master_Port: 3300
                Connect_Retry: 60
              Master_Log_File: mysql_bin.000005
          Read_Master_Log_Pos: 192344
               Relay_Log_File: relay.000003
                Relay_Log_Pos: 192176
        Relay_Master_Log_File: mysql_bin.000005
             Slave_IO_Running: Yes
            Slave_SQL_Running: No #### 主从关系是异常的
                           Last_Errno: 1396
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction fad94351-9975-11e9-9171-fa1b8143c800:5 at master log mysql_bin.000005, end_log_pos 192344. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
结论1:

主从同步异常: Slave_SQL_Running: No,Last_Errno: 1396

验证2

通过grant创建并授权用户

构造场景2

从库执行

[root@localhost] 16:21:22 ((none)) [36]> grant select on ost_test.* to app_test_r@10.% identified by  3A80AB6D77E0D99CD;
Query OK, 0 rows affected, 1 warning (0.00 sec)

主库执行

mysql> grant insert on `ost_test`.* TO app_test_r@10.% identified by  5A80AB6D77E0D99CD;
Query OK, 0 rows affected, 1 warning (0.00 sec)

此时查看主从关系

[root@localhost] 16:21:53 ((none)) [37]> show slave status\\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: mha2-db-test-bj2
                  Master_User: repl_user
                  Master_Port: 3300
                Connect_Retry: 60
              Master_Log_File: mysql_bin.000005
          Read_Master_Log_Pos: 193391
               Relay_Log_File: relay.000004
                Relay_Log_Pos: 1501
        Relay_Master_Log_File: mysql_bin.000005
             Slave_IO_Running: Yes ### 主从同步正常
            Slave_SQL_Running: Yes ### 主从同步正常
结论2

主从同步正常,不影响数据同步

验证3

在从库grant方式创建并授权用户;主库create用户

构造场景3

从库执行

[root@localhost] 16:30:45 ((none)) [41]> grant select on ost_test.* to app_test_r@10.% identified by  3A80AB6D77E0D99CD;
Query OK, 0 rows affected, 1 warning (0.00 sec)

主库执行

mysql> create user app_test_r@10.% identified by  3A80AB6D77E0D99;
Query OK, 0 rows affected (0.00 sec)

查看主从关系

[root@localhost] 16:30:59 ((none)) [42]> show slave status\\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: mha2-db-test-bj2
                  Master_User: repl_user
                  Master_Port: 3300
                Connect_Retry: 60
             Slave_IO_Running: Yes
            Slave_SQL_Running: No ### 主从异常
                   Last_Errno: 1396
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction fad94351-9975-11e9-9171-fa1b8143c800:11 at master log mysql_bin.000005, end_log_pos 193808. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
结论3

主从同步异常: Slave_SQL_Running: No,Last_Errno: 1396

总结

  1. 针对生产环境的情况在不清楚主从关系情况下,
  1. 先用show slave status\\G检测是否是主库
  2. 用select * from information_schema.​​PROCESSLIST​​ where state like %binlog%\\G检测从库IP端口等信息。
  1. 保证执行所有的语句一定要在主库执行!!!















以上是关于故障分析 | MySQL 主从不同步分析案例的主要内容,如果未能解决你的问题,请参考以下文章

Redis主从同步失败案例的步步深入

MySQL5.7主从同步案例

百度QA金BUG集锦之《Redis主从同步失败案例的步步深入》

MYSQL主从不同步延迟原理分析及解决方案

mysql主从复制故障分析

十七Mysql的主从--主从故障监控分析