Case study - data inconsistency caused by an IP switch

Posted 2021-08-25 MySQL operations

The business team asked: why is this row missing in data center 10 when the other data centers have it?

mysql> select * from tbl_groupinfo where gid=xxxxxxx limit 10;
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| sid        | tm_timestamp | tm_lasttime | gid                 | group_name | default_flag | group_attr | group_owner | group_extension | is_del | app_id | mic_seat | invite_perm | invite_media_perm | pub_id_search | apply_verify | public_id | introduc | topic_id | __version           | __deleted |
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| xxxxxxxxxx |   1495773704 |  1495773704 | xxxxxxxxxxx | 处对象     |            0 |          5 |  3611732366 | vx:wtc2033      |      0 |     18 |        8 |           0 |                 0 |             1 |            0 |         0 |          |        0 | 6126694332813803019 |         0 |
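+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+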

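The likely conclusion is that synchronization into data center 10 is broken. The first step is to find out which data center the row was inserted in, and then check whether sync between data center 10 and that data center is healthy. After logging in on 8827 and fetching the row's version number __version, a conversion function shows that the row was inserted from data center 14: date 2017-05-26 05:03:03, data center 14, port 11.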

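This is comparable to the binlog in MySQL, which records the originating server-id for every SQL statement in order to prevent circular replication. myshard records the server-id in its binlog as well, and in addition every row carries a version number that encodes which data center and which port wrote it, and when.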
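
To make the loop-prevention idea concrete, here is a minimal sketch in Python. It is not myshard or MySQL code: the `Event` type, the `origin_id` field and `events_to_apply` are hypothetical names used only to illustrate the rule "skip any replicated event that originated on this node".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    origin_id: int   # hypothetical: id of the node that first executed the write
    statement: str


def events_to_apply(local_id, incoming):
    """Drop events that originated on this node, so a write never loops back."""
    return [e for e in incoming if e.origin_id != local_id]


# Node 10 receives a stream that contains one of its own writes echoed back.
stream = [
    Event(14, "INSERT ..."),
    Event(10, "UPDATE ..."),   # originated on node 10, must be skipped
    Event(14, "DELETE ..."),
]
applied = events_to_apply(10, stream)
print([e.statement for e in applied])
```

A per-row version number, as described above, goes one step further than this filter: it lets an operator trace an individual row back to where and when it was written, which is what the investigation relies on.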

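At this point we know that data written in data center 14 is not reaching data center 10, so the next step is to check the sync status on data center 14: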

[root@centos local]# echo stat | /scripts/nc_myshard 0 14505 |egrep "speed|behind|offset"
shard_local             Read_offset             48494420885     
shard_local             Read_speed              33373           
shard_local             Read_bytes_behind       0                    
sync_r12m0              Read_offset             48494420885     
sync_r12m0              Read_speed              33373           
sync_r12m0              Read_bytes_behind       0               
sync_r13m0              Read_offset             48494420885     
sync_r13m0              Read_speed              33373           
sync_r13m0              Read_bytes_behind       0               
sync_r1m0               Read_offset             48494420885     
sync_r1m0               Read_speed              33373           
sync_r1m0               Read_bytes_behind       0               
sync_r3m0               Read_offset             48494420885     
sync_r3m0               Read_speed              33373           
sync_r3m0               Read_bytes_behind       0               
shard_remote            Read_offset             52080697507     
shard_remote            Read_speed              27290           
shard_remote            Read_bytes_behind       0
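
The check made against this output is whether every peer that should be pulling data has a sync_* entry. A minimal sketch of that check in Python follows. The three-column layout is taken from the output above; the `expected` peer list, the function names and the idea of alerting on `Read_bytes_behind` are assumptions for illustration, not part of myshard.

```python
def parse_stat(text):
    """Return {channel: {metric: int}} from 'channel metric value' lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip blank or unexpected lines
        channel, metric, value = parts
        stats.setdefault(channel, {})[metric] = int(value)
    return stats


def missing_clients(stats, expected):
    """Peers that should be pulling from this node but have no entry at all."""
    return sorted(c for c in expected if c not in stats)


def lagging_clients(stats, max_bytes_behind):
    """Peers whose Read_bytes_behind exceeds the given threshold."""
    return sorted(
        c for c, m in stats.items()
        if m.get("Read_bytes_behind", 0) > max_bytes_behind
    )


# A few lines copied from the stat output above.
sample = """\
sync_r12m0              Read_offset             48494420885
sync_r12m0              Read_bytes_behind       0
sync_r13m0              Read_bytes_behind       0
sync_r1m0               Read_bytes_behind       0
sync_r3m0               Read_bytes_behind       0
"""
# Assumed list of peers that are supposed to pull from data center 14.
expected = ["sync_r1m0", "sync_r3m0", "sync_r10m0", "sync_r12m0", "sync_r13m0"]

stats = parse_stat(sample)
print(missing_clients(stats, expected))
print(lagging_clients(stats, 0))
```

Note that a peer which never connects produces no lag figure at all, so a monitor that only watches Read_bytes_behind would stay silent in this incident; the missing-entry check is the one that matters here.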

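No r10m0 client is pulling data, which confirms that sync is broken. The sync log in data center 10 shows it repeatedly trying to reconnect to data center 14: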

[root@localhost db_sync_HelloSrv_r10m0_d]# zcat db_sync_xxxxxxxx_r10m0_d.log.13.gz|grep xxx.xxx.xxx.144|more                                     
May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:06:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:07:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:07:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:07:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:08:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:08:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:09:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:09:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:09:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:10:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:10:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:10:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:11:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:11:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:12:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2

There are many such entries, all retrying the connection to data center 14. The earliest reconnect attempts are in db_sync_xxxxxxxx_r10m0_d.log.13.gz, a file timestamped May 14:

-rw-r--r--. 1 root adm 174K May 13 00:10 db_sync_xxxxxxxx_r10m0_d.log.14.gz 
-rw-r--r--. 1 root adm 300K May 14 00:10 db_sync_xxxxxxxx_r10m0_d.log.13.gz 
-rw-r--r--. 1 root adm 230K May 15 00:10 db_sync_xxxxxxxx_r10m0_d.log.12.gz 
-rw-r--r--. 1 root adm 234K May 16 00:10 db_sync_xxxxxxxx_r10m0_d.log.11.gz 
-rw-r--r--. 1 root adm 260K May 17 00:10 db_sync_xxxxxxxx_r10m0_d.log.10.gz 
-rw-r--r--. 1 root adm 261K May 18 00:10 db_sync_xxxxxxxx_r10m0_d.log.9.gz  
-rw-r--r--. 1 root adm 260K May 19 00:10 db_sync_xxxxxxxx_r10m0_d.log.8.gz  
-rw-r--r--. 1 root adm 258K May 20 00:10 db_sync_xxxxxxxx_r10m0_d.log.7.gz  
-rw-r--r--. 1 root adm 260K May 21 00:10 db_sync_xxxxxxxx_r10m0_d.log.6.gz  
-rw-r--r--. 1 root adm 268K May 22 00:10 db_sync_xxxxxxxx_r10m0_d.log.5.gz  
-rw-r--r--. 1 root adm 254K May 23 00:10 db_sync_xxxxxxxx_r10m0_d.log.4.gz  
-rw-r--r--. 1 root adm 259K May 24 00:10 db_sync_xxxxxxxx_r10m0_d.log.3.gz  
-rw-r--r--. 1 root adm 262K May 25 00:10 db_sync_xxxxxxxx_r10m0_d.log.2.gz  
-rw-r--r--. 1 root adm 262K May 26 00:10 db_sync_xxxxxxxx_r10m0_d.log.1.gz
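
Finding the oldest rotated file that contains the reconnect message can be scripted instead of grepping each archive by hand. The sketch below assumes the naming scheme shown in the listing (a `.log.N.gz` suffix where a larger N means an older file); `earliest_match` is a hypothetical helper, and the search string is whatever the operator would otherwise pass to grep.

```python
import gzip
import re
from pathlib import Path

ROTATION = re.compile(r"\.log\.(\d+)\.gz$")


def earliest_match(log_dir, needle):
    """Return (file name, first matching line) from the oldest rotated log
    that contains `needle`, or None if no rotated log contains it."""
    rotated = []
    for path in Path(log_dir).iterdir():
        m = ROTATION.search(path.name)
        if m:
            rotated.append((int(m.group(1)), path))
    # A higher rotation number means an older file, so scan in descending order.
    for _, path in sorted(rotated, key=lambda item: item[0], reverse=True):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if needle in line:
                    return path.name, line.rstrip("\n")
    return None
```

Called as `earliest_match(log_dir, "connecting to sync_r14m0")`, it returns the file name and the first matching line, which gives the approximate start of the outage. A connect line also appears on a normal first connection, so the result still needs a quick look by a human.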

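Repeated reconnects usually have one of two causes: data center 14 has not whitelisted data center 10, or the firewall is blocking it. The initial setup worked, so the whitelist must have been opened, which leaves the firewall as the likely culprit. On data center 14 I ran: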

iptables -n -L | grep <IP of data center 10>

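A rule existed for the China Telecom IP, but there was no firewall rule for the China Unicom IP. This is a dual-line data center, and I had deployed the environment on May 12. So two days after deployment, because of poor network quality, the Telecom path became unreachable and traffic switched to the Unicom path. The Unicom IP had never been authorized, so data center 10 could no longer connect to data center 14. The business was not using this database at the time. Yesterday, May 25, the business started deploying processes in data center 14, noticed that data was not syncing, and only then contacted the DBA. I immediately added the firewall rule and restarted the sync process to pull the data again, but data center 10 kept logging reconnect attempts.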
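
The gap found here, a rule for one carrier IP but not for the other, can be checked mechanically at deployment time. The sketch below takes the text printed by `iptables -n -L` and reports which of a peer's IPs have no ACCEPT rule. `ips_without_rule` is a hypothetical helper, the addresses are documentation-range placeholders rather than the real ones, and the parser deliberately ignores CIDR ranges and catch-all rules.

```python
def ips_without_rule(iptables_listing, required_ips):
    """Return the required IPs that never appear as the source of an ACCEPT rule.

    Rule rows printed by `iptables -n -L` have the columns:
    target prot opt source destination [match details...]
    Only exact source addresses are compared; CIDR ranges are not expanded.
    """
    allowed = set()
    for line in iptables_listing.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "ACCEPT":
            allowed.add(fields[3])
    return [ip for ip in required_ips if ip not in allowed]


# Illustrative listing: only the first (say, Telecom) address has a rule.
listing = """\
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     tcp  --  192.0.2.10           0.0.0.0/0            tcp dpt:12505
"""
peer_ips = ["192.0.2.10", "198.51.100.10"]  # placeholder Telecom and Unicom IPs
print(ips_without_rule(listing, peer_ips))
```

Comparing whole fields rather than using a substring match avoids treating 192.0.2.1 as covered by a rule for 192.0.2.10, a mistake that a bare grep can make.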

Data center 14, however, showed a different error:

May 26 15:41:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3159] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3161] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3163] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3234] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4411] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4416] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4560] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4656] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4657] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4730] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5476] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5478] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5508] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5511] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5554] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5557] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132

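It keeps reporting that offset 2587613132 does not exist and cannot be served. The connection had been down for too long, and data center 14 keeps its binlog for only 7 days: May 14 + 7 days = May 21, after which the requested offset was gone. The fix was to find an offset on data center 14 that still existed and had not been purged, and have data center 10 pull from there; then ask the business which tables had received writes in data center 14, export those tables from data center 14, and import them into data center 10.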
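
The retention arithmetic above can be written down as a small check. This is a sketch under stated assumptions: offsets are taken to grow monotonically with the oldest data purged first, `resume_deadline` and `can_resume` are hypothetical helpers, and the `oldest_retained` value is invented purely for the example.

```python
from datetime import date, timedelta


def resume_deadline(disconnected_on, retention_days):
    """Last date on which the log written at disconnect time is still retained."""
    return disconnected_on + timedelta(days=retention_days)


def can_resume(requested_offset, oldest_retained_offset):
    """A puller can resume only if its offset has not been purged yet."""
    return requested_offset >= oldest_retained_offset


deadline = resume_deadline(date(2017, 5, 14), 7)
print(deadline)  # 2017-05-21

# Offset requested by data center 10 in the error log above.
requested = 2587613132
# Hypothetical oldest offset still on disk in data center 14 (not from the logs).
oldest_retained = 3_000_000_000
print(can_resume(requested, oldest_retained))
```

An alert that fires when a peer has been disconnected for a sizeable fraction of the retention window would leave time to fix the link before a manual export and import becomes necessary.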

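One advantage of myshard is that missing data can be repaired by exporting and importing it, whereas with MySQL you would have to rely on Percona's repair tools. This was also a lesson for me: when a data center's network is unreliable, access must be opened for every one of its IPs. The business also needed its data backfilled. In the review meeting afterwards we agreed on several rules: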

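Monitoring of myshard must be thorough enough that replication lag is detected promptly.
When applying for database access, business staff must provide all IPs of a multi-line data center (Telecom IP, Unicom IP, internal IP, management-network IP).
myshard should use consistent hashing: for a given user, data is modified in the same data center where it was written.
While sync is lagging, do not switch business traffic between nodes.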
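
The consistent-hashing rule above can be sketched as follows: every user id maps to one "home" data center, so writes and later modifications for that user go to the same place. This is a generic hash ring in Python, not myshard's actual routing. `HashRing`, the `replicas` count and the node names are assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=64):
        # Each node is placed on the ring `replicas` times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, user_id):
        """Return the node that owns the first ring position after the key."""
        idx = bisect.bisect(self._keys, self._hash(str(user_id)))
        return self._ring[idx % len(self._ring)][1]


nodes = ["r1", "r3", "r10", "r12", "r13", "r14"]  # hypothetical node names
ring = HashRing(nodes)
home = ring.node_for(3611732366)
print(home, ring.node_for(3611732366) == home)
```

The reason to prefer a ring over `hash(user) % n` is that removing or adding a data center only moves the users who lived on the affected ring segments; everyone else keeps the same home.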
