线上Mysql重大事故快速应急解决办法

Posted 2021-01-21

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了线上Mysql重大事故快速应急解决办法相关的知识，希望对你有一定的参考价值。

由于好多公司节约成本都没有自己的DBA人员，大部分都是开发或者运维人员操作数据库，但数据库是重中之重，等公司达到一定规模之后，数据库一个不小心的事故，很有可能会让公司回到解放前。所以在公司小规模的时候就应该有一套自己的数据库体系以及完善的数据库架构，操作人员的重点在优化和提升性能，而不是再去修改数据库的整体架构。
下面我给大家总结了三种数据库事故和解决办法一边操作人员应急使用。请做类比参考，错误类型千变万化，解决思路一成不变。
错误：

1、InnoDB: Error: Table "mysql"."innodb_table_stats" not found.问题
2、Error ‘Cannot add or update a child row: a foreign key constraint fails故障
3、SQL_ERROR 1032解决办法

问题明确，下面开始逐条解决吧：
1、InnoDB: Error: Table "mysql"."innodb_table_stats" not found.问题

在MySQL 5.6.30后台日志报如下警告信息：

2016-05-27 12:25:27 7fabf86f7700 InnoDB: Error: Table "mysql"."innodb_table_stats" not found.
2016-05-27 12:25:27 7fabf86f7700 InnoDB: Error: Fetch of persistent statistics requested for table "hj_web"."wechat_res" but the required system tables mysql.innodb_table_stats and mysql.innodb_index_stats are not present or have unexpected structure. Using transient stats instead.

2016-05-27 14:03:52 28585 [Warning] InnoDB: Cannot open table mysql/slave_master_info from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.

2016-05-27 14:03:52 28585 [Warning] InnoDB: Cannot open table mysql/slave_relay_log_info from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.

2016-05-27 14:03:52 28585 [Warning] InnoDB: Cannot open table mysql/slave_worker_info from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.~~

这几张表确实是在mysql5.6中新入的

innodb_index_stats,
innodb_tables_stats,
slave_master_info,
slave_relay_log_info,
slave_worker_info

解决方法：
登录数据库，进入mysql库，执行如下SQL删除5张表
记住，一定要是drop table if exists

mysql> use mysql;
mysql> drop table if exists innodb_index_stats;
mysql> drop table if exists innodb_table_stats;
mysql> drop table if exists slave_master_info;
mysql> drop table if exists slave_relay_log_info;
mysql> drop table if exists slave_worker_info;

执行完后，可以用show tables查看一下，看表的数据是否已经比删除之前减少了，如果减少了，说明你成功了！

上一步操作完成后，停止数据库，并进入到数据库数据文件所在目录，删除上面5个表所对应的idb文件，如下所示：

#/etc/init.d/mysqld stop
#cd /data/mysql/data/mysql/
#ls -l .ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:17 innodb_index_stats.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:17 innodb_table_stats.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:14 slave_master_info.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:14 slave_relay_log_info.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:14 slave_worker_info.ibd
#/bin/rm -rf .ibd

重新启动数据库，进入到mysql库，重建上面被删除的表结构：

#/etc/init.d/mysqld start
mysql> use mysql;
mysql> source /data/mysql/share/mysql_system_tables.sql(这个mysql_system_tables.sql是从别的mysql数据库备份过来的，备份方法如下：mysqldump –u账号 –p密码 mysql > mysql_system_tables.sql)
mysql> show tables;
+---------------------------+
| Tables_in_mysql |
+---------------------------+
| columns_priv |
| db |
| event |
| func |
| general_log |
| help_category |
| help_keyword |
| help_relation |
| help_topic |
| innodb_index_stats |
| innodb_table_stats |
| ndb_binlog_index |
| plugin |
| proc |
| procs_priv |
| proxies_priv |
| servers |
| slave_master_info |
| slave_relay_log_info |
| slave_worker_info |
| slow_log |
| tables_priv |
| time_zone |
| time_zone_leap_second |
| time_zone_name |
| time_zone_transition |
| time_zone_transition_type |
| user |
+---------------------------+
28 rows in set (0.00 sec)

说明表都正常了，再次查看mysql报错日志，就会发现没有了关于这5个表的报错日志。

2、Error ‘Cannot add or update a child row: a foreign key constraint fails故障解决

一大早的，某从库突然报出故障：SQL线程中断！
查看从库状态：

mysql> show slave statusG
Slave_IO_State: Waiting for master to send event
Master_Log_File: mysql-bin.026023
Read_Master_Log_Pos: 230415889
Relay_Log_File: relay-bin.058946
Relay_Log_Pos: 54632056
Relay_Master_Log_File: mysql-bin.026002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1452
Last_Error: Error ‘Cannot add or update a child row: a foreign key constraint fails (zabbix.trigger_discovery, CONSTRAINT c_trigger_discovery_2 FOREIGN KEY (parent_triggerid) REFERENCES triggers (triggerid) ON DELETE CASCADE)‘ on query. Default database: ‘zabbix‘. Query: ‘insert into trigger_discovery (triggerdiscoveryid,triggerid,parent_triggerid,name) values (1677,26249,22532,‘Free inodes is less than 20% on volume {#FSNAME}‘), (1678,26250,22532,‘Free inodes is less than 20% on volume {#FSNAME}‘), (1679,26251,22532,‘Free inodes is less than 20% on volume {#FSNAME}‘)‘
Exec_Master_Log_Pos: 54631910

重点关注报错信息，定位问题，问题是：Cannot add or update a child row:a foreign key constraint fails ，涉及到的外键是：c_trigger_discovery_2

那这个外键的定义是什么呢？

报错信息中也有列出：

trigger_discovery, CONSTRAINTc_trigger_discovery_2FOREIGN KEY (parent_triggerid) REFERENCEStriggers(triggerid`) ON DELETE CASCADE

那明白了，是表trigger_discovery中的列parent_triggerid和表triggers中的列triggerid有外键关联，现在这里的数据插入出现了问题
那为什么会出现问题？

继续看报错，错误是从这里开始的：

insert into trigger_discovery (triggerdiscoveryid,triggerid,parent_triggerid,name) values (1677,26249,22532,‘Free inodes is less than 20% on volume {#FSNAME}‘)

上述外键对应的列parent_triggerid的值是22532，难道这个值在表triggers中有问题？
我们去表triggers中查看：

从库

mysql> select * from triggers where triggerid=22532;
Empty set (0.00 sec)

主库

mysql> select * from triggers where triggerid=22532;
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
| triggerid | expression | description | url | status | value | priority | lastchange | comments | error | templateid | type | value_flags | flags |
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
| 22532 | {23251}<20 | Free inodes is less than 20% on volume {#FSNAME} | | 0 | 0 | 2 | 0 | | | 13272 | 0 | 1 | 2 |
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
row in set (0.00 sec)

果然，从库中没有这个值对应的信息，但主库中是有的，原来是主从不一致导致的，从库中缺失这个值，主库中顺利插入了，但数据传到从库后，从库的外键约束限制了这一插入操作，所以SQL线程阻塞

问题找到了，那怎么解决？

首先，为了让从库尽快恢复运行，就先把这个错误跳过吧：

mysql>SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; #跳过一个事务
mysql>start slave;

接下来就是主从数据不一致的问题，可以使用pt-table-checksum来检查下不一致的数据，再进行同步，具体步骤如下：

在主库执行：

mysql>GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE,CREATE,DELETE,INSERT,UPDATE ON . TO ‘USER‘@‘MASTER_HOST‘ identified by ‘PASSWORD‘;

注：创建用户，这些权限都是必须的，否则会报错

shell> ./pt-table-checksum --host=‘master_host‘ --user=‘user‘ --password=‘password‘ --port=‘port‘ --databases=zabbix --ignore-tables=ignore_table --recursion-method=processlist

注：(1)因为涉及到的表太多，查看后发现很多表都有外键关联，错综复杂，而且因为是监控表，即使丢失一些也没什么关系，所以查出较大的且没有外键关联的表用ignore-tables选项排除，对其他表进行比对，如果表比较少的话直接指定--TABLES

(2)recursion-method如果不设的话，会报错：Diffs cannot be detected because no slaves were found. 其参数有四：processlist/hosts/dsn/no,用来决定查找slave的方式是show full processlist还是show slave hosts还是直接给出slave信息，具体用法在另一随笔pt-table-checksum介绍中详述

shell>./pt-table-sync --print --replicate=percona.checksums h=master_host,u=user,p=password,P=port h=slave_host,u=user,p=password,P=port --recursion-method=processlist >pt.log

注：最好使用--print，不要直接使用--execute，否则如果弄出问题，就更麻烦了，打印出直接执行的语句，去从库执行就好了

将pt.log传到从库，直接执行，然后再次在主库上进行一致性检查，如果还有不一致的数据，记得登录mysql去把checksums表清空，然后再次进行检查同步，直到没有不一致的数据。
当然，如果主从数据反复出现不一致的话，那就要先去检查造成不一致的原因了，釜底抽薪才是硬道理。

3、SQL_ERROR 1032解决办法

缘由：
　　在主主同步的测试环境，由于业务侧没有遵循同一时间只写一个点的原则，造成A库上删除了一条数据，B库上在同时更新这条数据。
由于异步和网络延时，B的更新event先到达A端执行，造成A端找不到这条记录，故SQL_THREAD报错1032，主从同步停止。

错误说明：
　　MySQL主从同步的1032错误，一般是指要更改的数据不存在，SQL_THREAD提取的日志无法应用故报错，造成同步失败
（Update、Delete、Insert一条已经delete的数据）。
　　1032的错误本身对数据一致性没什么影响，影响最大的是造成了同步失败、同步停止。
　　如果主主（主从）有同步失败，要第一时间查看并着手解决。因为不同步，会造成读取数据的不一致。应在第一时间恢复同步，
尽量减少对业务的影响。然后再具体分析不同步的原因，手动或者自动修复数据，并做pt-table-checksum数据一致性检查。
　　目前业务一般是做主主同步，主主同步由于是异步更新，存在更新冲突的问题，且很容易引起SQL ERROR 1032错误。这个应该在业务侧解决，
保证同一时间只更新数据库的一个点，类似单点写入。我们的解决办法是:写一个底层数据库调用库，可能涉及到更新冲突的操作，都调用这个库。
在配置文件里，配2个点的数据库A、B，保证一直都更新A库，如果A库不可用，就去更新B库。
　　另外，如果是对数据一致性要求较高的场景，比如涉及到钱，建议用PXC（强一致性、真正同步复制）。

解决办法：
　　MySQL5.6.30版本，binlog模式为ROW。

　　show slave statusG,可以看到如下报错：
Slave_SQL_Running: NO
Last_SQL_Errno: 1032
Last_SQL_Error: Worker 3 failed executing transaction ‘‘ at master log mysql-bin.000003, end_log_pos 440267874;
　　　　　　　　 Could not execute Delete_rows event on table db_test.tbuservcbgolog; Can‘t find record in ‘tbuservcbgolog‘, Error_code: 1032;
　　　　　　　　 handler error HA_ERR_KEY_NOT_FOUND; the event‘s master log mysql-bin.000003, end_log_pos 440267874

从上可以看出，是SQL_THREAD线程出错，错误号码1032。是在应用delete db_test.tbuservcbgolog 表中一行数据的事件时，由于这条数据
不存在而出错。此事件在主服务器Master binlog中的位置是 mysql-bin.000003, end_log_pos 440267874。（当然可以在从服务器Slave的Relay
log中查找，具体方法见最后）

方法1：跳过错误Event
先跳过这一条错误(event),让主从同步恢复正常。（或者N条event，一条一条跳过）

　　stop slave;
　　set global sql_slave_skip_counter=1;
　　start slave;

方法2：跳过所有1032错误
更改my.cnf文件，在Replication settings下添加：

　　slave-skip-errors　＝　1032
并重启数据库，然后start salve。
注意：因为要重启数据库，不推荐，除非错误事件太多。

方法3：还原被删除的数据
根据错误提示信息，用mysqlbinlog找到该条数据event SQL并逆向手动执行。如delete 改成insert。
本例中，此事件在主服务器Master binlog中的位置是 mysql-bin.000003, end_log_pos 440267874。
1）利用mysqlbinlog工具找出440267874的事件
/usr/local/mysql-5.6.30/bin/mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000003 |grep -A 20 ‘440267874‘
或者/usr/local/mysql-5.6.30/bin/mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000003 --stop-position=440267874 | tail -20
或者usr/local/mysql-5.6.30/bin/mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000003 > decode.log
（或者加上参数-d, --database=name 来进一步过滤）
#160923 20:01:27 server id 1223307 end_log_pos 440267874 CRC32 0x134b2cbc Delete_rows: table id 319 flags: STMT_END_F
###DELETE FROM db_99ducj.tbuservcbgolog
###WHERE
###@1=10561502 / INT meta=0 nullable=0 is_null=0 /
###@2=1683955 / INT meta=0 nullable=0 is_null=0 /
###@3=90003 / INT meta=0 nullable=0 is_null=0 /
###@4=0 / INT meta=0 nullable=0 is_null=0 /
###@5=‘2016-09-23 17:02:24‘ / DATETIME(0) meta=0 nullable=1 is_null=0 /
###@6=NULL / DATETIME(0) meta=0 nullable=1 is_null=1 /
#at 440267874

以上为检索出来的结果，事务语句为：delete from db_99ducj.tbuservcbgolog where @1=10561502 and @2=1683955 ...
其中@1 @2 @3...分别对应表tbuservcbgolog的列名，填补上即可。
我们可以逆向此SQL 将deleter 变成Insert,手动在从库上执行此Insert SQL,之后restart slave就好了。
注：通过Relay Log查找event SQL http://www.tuicool.com/articles/6RvUnqV

以上是关于线上Mysql重大事故快速应急解决办法的主要内容，如果未能解决你的问题，请参考以下文章

架构设计：线上服务故障应急机制讨论

记录一次Mysql主从不同步事故问题于事故解决办法

让 Bug 无处藏身，Java 线上问题排查思路常用工具