A Redis Troubleshooting Story (Endless Resync After a Redis Cluster Node Restart)

Posted by 鸭子船长


Our centralized log monitoring platform had been live for a while, but it was mostly used as a convenient way to read logs during releases or incidents; nobody looked at it proactively. A few days ago we hooked the application error logs up to mail, Hi, and SMS alerts, and immediately received a pile of error reports, which finally got everyone's attention. One of them was this Redis error:

Push notify error: => redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool at redis.clients.util.Pool.getResource(Pool.java:53) at 
redis.clients.jedis.JedisPool.getResource(JedisPool.java:99) at 
com.baidu.global.mobile.server.dataaccess.cache.CacheClient.execute(CacheClient.java:285) at 
com.baidu.global.mobile.server.dataaccess.cache.CacheClient.getAsRaw(CacheClient.java:793) at 
com.baidu.global.mobile.server.dataaccess.cache.CacheClient.get(CacheClient.java:432) at 
com.baidu.global.mobile.server.guanxing.facade.MessageTaskFacade.checkPushValidationID(MessageTaskFacade.java:952) at 
com.baidu.global.mobile.server.guanxing.facade.MessageTaskFacade.pushNotify(MessageTaskFacade.java:983) at 
com.baidu.global.mobile.server.guanxing.facade.MessageTaskFacade.buildMessage(MessageTaskFacade.java:586) at 
com.baidu.global.mobile.server.guanxing.facade.MessageTaskFacade.pushRightNow(MessageTaskFacade.java:372) at 
com.baidu.global.mobile.server.guanxing.facade.MessageTaskFacade$AsynMessagePush.run(MessageTaskFacade.java:350) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at 
java.lang.Thread.run(Thread.java:745) Caused by: java.util.NoSuchElementException: Unable to validate object at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:502) at 
org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361) at 
redis.clients.util.Pool.getResource(Pool.java:51) ... 12 more

This looked serious: we could not get a Redis connection from the pool, and the error was thrown during validation:

java.util.NoSuchElementException: Unable to validate object at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:502) at 
org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361) at 

We use Jedis, and by default its validation sends a PING command to Redis and expects the string PONG in return. Was the connection pool really exhausted?
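The validation semantics can be sketched like this (a Python sketch for illustration, not Jedis's actual source): with `testOnBorrow` enabled, the pool only hands out a connection whose PING comes back as exactly `PONG`, so an error reply also fails validation and surfaces as `Unable to validate object`.

```python
def connection_is_valid(ping_reply: str) -> bool:
    """Sketch of the check Jedis performs when testOnBorrow is enabled:
    a pooled connection passes validation only if PING returns exactly
    "PONG". Any other reply -- including an error reply such as
    "MISCONF ..." -- fails validation, and the pool throws
    java.util.NoSuchElementException: Unable to validate object."""
    return ping_reply == "PONG"
```

This is why pool exhaustion is not the only way to hit "Could not get a resource from the pool": a server that answers PING with an error produces the same client-side exception.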

So I logged onto the server hosting Redis and ran PING by hand, and sure enough, it errored:

[work@id01-xxx-mob28.id01 redis-2.8.19]$ src/redis-cli -p 6379
127.0.0.1:6379> ping
(error) MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
127.0.0.1:6379> set hello world
(error) MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

Read commands were fine, but every write operation was rejected (with stop-writes-on-bgsave-error at its default of yes, Redis refuses writes after a failed background save).

A look at the Redis log showed it had been erroring nonstop:

[26378] 03 May 17:58:51.215 * 50000 changes in 60 seconds. Saving...
[26378] 03 May 17:58:51.254 * Background saving started by pid 2854
[2854] 03 May 17:58:58.949 * DB saved on disk
[2854] 03 May 17:58:58.967 * RDB: 4 MB of memory used by copy-on-write
[26378] 03 May 17:58:59.063 * Background saving terminated with success

    ...

[55231] 11 Aug 10:43:12.034 # Can't save in background: fork: Cannot allocate memory
[55231] 11 Aug 10:43:18.043 * 100 changes in 3600 seconds. Saving...
[55231] 11 Aug 10:43:18.044 # Can't save in background: fork: Cannot allocate memory
[55231] 11 Aug 10:43:24.052 * 100 changes in 3600 seconds. Saving...
[55231] 11 Aug 10:43:24.052 # Can't save in background: fork: Cannot allocate memory

So memory was insufficient, and Redis could not fork a child process to do RDB persistence. A look at memory usage:

[work@id01-xxx-mob28.id01 ~]$ free -g
             total       used       free     shared    buffers     cached
Mem:            62         61          1          0          0         25
-/+ buffers/cache:         35         27
Swap:            0          0          0    

There was actually still a fair amount of free memory: 27 GB counting buffers/cache. An INFO showed that the number of connected clients was small, but memory usage was indeed high, peaking at 27.55 GB, which exceeds the free memory.

# Clients
connected_clients:41
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:29281598848
used_memory_human:27.27G
used_memory_rss:30031474688
used_memory_peak:29581256416
used_memory_peak_human:27.55G
used_memory_lua:35840
mem_fragmentation_ratio:1.03
mem_allocator:jemalloc-3.6.0

Redis implements RDB persistence by forking a child process and having the child dump the in-memory dataset to the RDB file. In theory the child needs as much memory as the parent, 27.55 GB here, but thanks to copy-on-write, which Linux has supported for a very long time, far less physical memory is actually needed; the log bears this out: this Redis never used more than about 150 MB of COW memory. The official FAQ covers exactly this case: "Background saving is failing with a fork() error under Linux even if I've a lot of free RAM!"

Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.

Setting overcommit_memory to 1 says Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.

This blunt check can be relaxed simply by setting the overcommit_memory kernel parameter to 1:

sysctl vm.overcommit_memory=1
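The arithmetic behind the failure can be sketched like this (a deliberate simplification of the kernel's heuristic; the real mode-0 accounting is more involved):

```python
GiB = 1024 ** 3

def fork_would_succeed(dataset_bytes: int, free_bytes: int,
                       overcommit_memory: int) -> bool:
    """Simplified model of the fork-time memory check. With
    overcommit_memory=0 the kernel refuses the fork unless there is
    roughly enough free memory to duplicate every parent page, even
    though copy-on-write means almost none of it will be touched.
    With overcommit_memory=1 it optimistically allows the fork."""
    if overcommit_memory == 1:
        return True
    return free_bytes >= dataset_bytes
```

Plugging in our numbers, a ~27.27 GiB dataset against ~27 GiB of free memory fails under mode 0 but succeeds under mode 1, which matches what we observed.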

It took effect immediately:

[55231] 11 Aug 10:59:31.036 * Background saving started by pid 11820
[11820] 11 Aug 11:04:27.567 * DB saved on disk

To keep the setting across reboots, also add it to /etc/sysctl.conf:

vm.overcommit_memory=1

TIPS: When you see "could not get a resource from the pool" with a validation error, the first instinct is to blame an exhausted pool or too many connections. But given how validation is implemented, any reply other than the string PONG counts as a failure, so just log in and PING by hand and you will know what is going on :).
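A minimal version of the 5-minute PING monitor described below can be sketched like this (the host list and the alert hook are placeholders; it speaks raw RESP over a socket so error replies such as MISCONF come back verbatim instead of being swallowed):

```python
import socket

def parse_reply(raw: bytes) -> str:
    """RESP simple strings start with '+' and errors with '-';
    return the reply text without the type marker."""
    line = raw.split(b"\r\n", 1)[0].decode()
    return line[1:] if line[:1] in ("+", "-") else line

def redis_ping(host: str, port: int, timeout: float = 2.0) -> str:
    """Send a raw PING and return the reply, e.g. 'PONG' or
    'MISCONF Redis is configured to save RDB snapshots...'."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"*1\r\n$4\r\nPING\r\n")   # RESP encoding of PING
        return parse_reply(s.recv(4096))

def send_alert(message: str) -> None:
    print("ALERT:", message)  # placeholder: wire up SMS/mail here

def check_all(masters) -> None:
    """Run this, e.g. from cron every 5 minutes, over all masters."""
    for host, port in masters:
        try:
            reply = redis_ping(host, port)
        except OSError as exc:
            reply = f"connection failed: {exc}"
        if reply != "PONG":
            send_alert(f"{host}:{port} -> {reply}")
```

Any non-PONG answer, including a refused connection, triggers the alert, which is exactly the condition under which Jedis's validation starts failing.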

Postscript: A Redis Troubleshooting Story, Part Two

Then we suddenly received a similar error again:

get jedis connection has error => redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
    at redis.clients.util.Pool.getResource(Pool.java:53)
    at redis.clients.jedis.JedisPool.getResource(JedisPool.java:99)
    at com.baidu.global.mobile.hao123.dataaccess.utilscache.jedis.RedisUtils.doTask(RedisUtils.java:82)
    at com.baidu.global.mobile.hao123.nantianmen.utils.CacheHelper.getLatestEntriesByCategory(CacheHelper.java:123)
    at com.baidu.global.mobile.hao123.nantianmen.service.NewsServiceImpl.getLatestEntriesByCategory(NewsServiceImpl.java:591)
    at com.baidu.global.mobile.hao123.nantianmen.facade.NanTianMenFacadeImpl.getLatestEntriesByCategory(NanTianMenFacadeImpl.java:192)
    at com.baidu.global.mobile.hao123.nantianmen.web.MbrowserController.getLatestEntriesByCategory(MbrowserController.java:248)
    at com.baidu.global.mobile.hao123.nantianmen.web.MbrowserController.mbrowser(MbrowserController.java:138)
    at sun.reflect.GeneratedMethodAccessor143.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.springframework.web.method.support.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:215)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:132)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandleMethod(RequestMappingHandlerAdapter.java:745)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:685)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:80)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:919)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:851)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:953)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:844)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:618)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:829)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:301)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:721)
    at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:468)
    at org.apache.catalina.core.ApplicationDispatcher.doForward(ApplicationDispatcher.java:391)
    at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:318)
    at org.apache.catalina.core.StandardHostValve.custom(StandardHostValve.java:423)
    at org.apache.catalina.core.StandardHostValve.status(StandardHostValve.java:291)
    at org.apache.catalina.core.StandardHostValve.throwable(StandardHostValve.java:383)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:166)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:78)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:526)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1033)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:652)
    at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:222)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1566)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1523)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: Unable to validate object
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:497)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:360)
    at redis.clients.util.Pool.getResource(Pool.java:51)

The log looks essentially the same as last time: validation during borrowObject failed, meaning PING did not come back with a normal PONG! My first reaction was again to log in and PING manually, but this time nothing was wrong. INFO was completely normal too, and hitting the failing endpoint directly showed no problem either.

A search on the centralized log platform showed the errors began at 11:25 the morning before last, with roughly 20k occurrences by 4 p.m. yesterday.

Correlating this with what we had been doing: we migrated exactly this Redis at 11:30 that morning. The machine (c32-07) was due to be decommissioned, so 李贝 copied the Redis to a new machine, m32-06, and demoted c32-6 from master to slave. Right after that, these errors started pouring in.

After the previous validation incident I had actually added a monitor that PINGs every one of our online Redis masters every 5 minutes and alerts via SMS and mail if the reply is not PONG. This particular Redis had been missed, though: it serves news recommendation and had always been maintained by 伟哥, so it slipped through my list. I immediately did two things:

  1. Switched most of the interface-layer requests over to the m32-06 master, leaving just one box still reading from the c32-06 slave.
  2. Added c32-06 to the periodic PING monitoring script.

This was at around 5 p.m. on the 19th.

And today it paid off:

  1. After switching to the master, the error count dropped from 20k to 2k, which pretty much confirms the slave is the problem.
  2. The monitoring script caught the c32-06 slave replying to PING with an error: LOADING Redis is loading the dataset in memory.

In other words, the slave really was unavailable at certain points in time.

The error LOADING Redis is loading the dataset in memory means Redis was loading an RDB file into memory, which is a blocking operation during which it cannot serve requests; how long it blocks depends on the size of the RDB. In theory this should only happen:

  1. at startup;
  2. when a slave does a full sync. Since 2.8, in theory only the first sync is full; subsequent ones should all be partial resyncs.
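Whether a partial resync is possible is easy to model: the slave must still hold a cached master state (run id plus offset), and the bytes it missed must still sit in the master's replication backlog (repl-backlog-size, 1 MB by default). A sketch under those assumptions:

```python
def partial_resync_possible(master_offset: int, slave_offset: int,
                            backlog_bytes: int,
                            has_cached_master: bool = True) -> bool:
    """The slave asks for PSYNC <runid> <offset>; the master accepts a
    partial resync only if the run id matches (the slave still caches
    the master state) and the missed byte range is still inside its
    replication backlog. Otherwise it answers with a full resync."""
    if not has_cached_master:
        # log: "Partial resynchronization not possible (no cached master)"
        return False
    return master_offset - slave_offset <= backlog_bytes
```

In the log below, for example, the slave requests offset 1688184130 while the master replies "Full resync from master" at 1934263988: a gap of about 246 MB, far beyond a default-sized backlog, so every reconnect degenerates into a full resync.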

Could the slave have been doing full syncs the whole time? A look at the Redis log quickly made the problem obvious.

This is the redis log of the slave, c32-7:

17859:S 18 Aug 11:26:16.363 # Server started, Redis version 3.0.0
17859:S 18 Aug 11:26:16.363 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
17859:S 18 Aug 11:26:52.802 * DB loaded from disk: 36.438 seconds
17859:S 18 Aug 11:26:52.802 * The server is now ready to accept connections on port 6666
17859:S 18 Aug 11:26:53.364 * Connecting to MASTER 10.242.73.46:6666
17859:S 18 Aug 11:26:53.364 * MASTER <-> SLAVE sync started
17859:S 18 Aug 11:26:53.364 * Non blocking connect for SYNC fired the event.
17859:S 18 Aug 11:26:53.365 * Master replied to PING, replication can continue...
17859:S 18 Aug 11:26:53.365 * Partial resynchronization not possible (no cached master)
17859:S 18 Aug 11:26:53.365 * Full resync from master: 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:1
17859:S 18 Aug 11:27:38.686 * MASTER <-> SLAVE sync: receiving 2550593979 bytes from master
17859:S 18 Aug 11:27:59.262 * MASTER <-> SLAVE sync: Flushing old data
17859:S 18 Aug 11:28:05.517 * MASTER <-> SLAVE sync: Loading DB in memory
17859:S 18 Aug 11:28:27.393 * MASTER <-> SLAVE sync: Finished with success
17859:S 18 Aug 12:51:54.051 # Connection with master lost.
17859:S 18 Aug 12:51:54.052 * Caching the disconnected master state.
17859:S 18 Aug 12:51:54.657 * Connecting to MASTER 10.242.73.46:6666
17859:S 18 Aug 12:51:54.657 * MASTER <-> SLAVE sync started
17859:S 18 Aug 12:51:54.657 * Non blocking connect for SYNC fired the event.
17859:S 18 Aug 12:51:54.657 * Master replied to PING, replication can continue...
17859:S 18 Aug 12:51:54.658 * Trying a partial resynchronization (request 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:1688184130).
17859:S 18 Aug 12:51:54.658 * Full resync from master: 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:1934263988
17859:S 18 Aug 12:51:54.658 * Discarding previously cached master state.
17859:S 18 Aug 12:52:39.353 * MASTER <-> SLAVE sync: receiving 2550573652 bytes from master
17859:S 18 Aug 12:52:56.391 # I/O error trying to sync with MASTER: connection lost
17859:S 18 Aug 12:52:57.584 * Connecting to MASTER 10.242.73.46:6666
17859:S 18 Aug 12:52:57.584 * MASTER <-> SLAVE sync started
17859:S 18 Aug 12:52:57.584 * Non blocking connect for SYNC fired the event.
17859:S 18 Aug 12:52:57.584 * Master replied to PING, replication can continue...
17859:S 18 Aug 12:52:57.584 * Partial resynchronization not possible (no cached master)
17859:S 18 Aug 12:52:57.585 * Full resync from master: 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:2022862913
17859:S 18 Aug 12:53:42.309 * MASTER <-> SLAVE sync: receiving 2550573652 bytes from master
17859:S 18 Aug 12:54:02.551 * MASTER <-> SLAVE sync: Flushing old data
17859:S 18 Aug 12:54:09.241 * MASTER <-> SLAVE sync: Loading DB in memory
17859:S 18 Aug 12:54:31.375 * MASTER <-> SLAVE sync: Finished with success
17859:S 18 Aug 13:22:01.266 # Connection with master lost.
17859:S 18 Aug 13:22:01.267 * Caching the disconnected master state.
17859:S 18 Aug 13:22:02.340 * Connecting to MASTER 10.242.73.46:6666
17859:S 18 Aug 13:22:02.341 * MASTER <-> SLAVE sync started
17859:S 18 Aug 13:22:02.380 * Non blocking connect for SYNC fired the event.
17859:S 18 Aug 13:22:02.404 * Master replied to PING, replication can continue...
17859:S 18 Aug 13:22:02.404 * Trying a partial resynchronization (request 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:2548455709).
17859:S 18 Aug 13:22:02.404 * Full resync from master: 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:2964290063
17859:S 18 Aug 13:22:02.405 * Discarding previously cached master state.
17859:S 18 Aug 13:22:47.186 * MASTER <-> SLAVE sync: receiving 2550563247 bytes from master
17859:S 18 Aug 13:23:10.751 # I/O error trying to sync with MASTER: connection lost
17859:S 18 Aug 13:23:11.097 * Connecting to MASTER 10.242.73.46:6666
17859:S 18 Aug 13:23:11.097 * MASTER <-> SLAVE sync started
17859:S 18 Aug 13:23:11.097 * Non blocking connect for SYNC fired the event.
17859:S 18 Aug 13:23:11.097 * Master replied to PING, replication can continue...
17859:S 18 Aug 13:23:11.097 * Partial resynchronization not possible (no cached master)
17859:S 18 Aug 13:23:11.098 * Full resync from master: 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:3034293813
17859:S 18 Aug 13:23:55.925 * MASTER <-> SLAVE sync: receiving 2550563310 bytes from master
17859:S 18 Aug 13:24:16.412 * MASTER <-> SLAVE sync: Flushing old data
17859:S 18 Aug 13:24:24.694 * MASTER <-> SLAVE sync: Loading DB in memory
17859:S 18 Aug 13:24:47.453 * MASTER <-> SLAVE sync: Finished with success
17859:S 18 Aug 13:52:27.452 # Connection with master lost.
17859:S 18 Aug 13:52:27.452 * Caching the disconnected master state.
17859:S 18 Aug 13:52:28.512 * Connecting to MASTER 10.242.73.46:6666
17859:S 18 Aug 13:52:28.512 * MASTER <-> SLAVE sync started
17859:S 18 Aug 13:52:28.574 * Non blocking connect for SYNC fired the event.
17859:S 18 Aug 13:52:28.574 * Master replied to PING, replication can continue...
17859:S 18 Aug 13:52:28.574 * Trying a partial resynchronization (request 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:3560083735).
17859:S 18 Aug 13:52:28.575 * Full resync from master: 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:3994311298
17859:S 18 Aug 13:52:28.575 * Discarding previously cached master state.
17859:S 18 Aug 13:53:13.042 * MASTER <-> SLAVE sync: receiving 2550551123 bytes from master
17859:S 18 Aug 13:53:33.314 * MASTER <-> SLAVE sync: Flushing old data
17859:S 18 Aug 13:53:41.168 * MASTER <-> SLAVE sync: Loading DB in memory
17859:S 18 Aug 13:54:03.683 * MASTER <-> SLAVE sync: Finished with success
17859:S 18 Aug 13:54:03.688 # Connection with master lost.
17859:S 18 Aug 13:54:03.688 * Caching the disconnected master state.
17859:S 18 Aug 13:54:03.783 * Connecting to MASTER 10.242.73.46:6666
17859:S 18 Aug 13:54:03.783 * MASTER <-> SLAVE sync started
17859:S 18 Aug 13:54:03.783 * Non blocking connect for SYNC fired the event.
17859:S 18 Aug 13:54:03.783 * Master replied to PING, replication can continue...
17859:S 18 Aug 13:54:03.783 * Trying a partial resynchronization (request 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:3996662976).
17859:S 18 Aug 13:54:03.783 * Full resync from master: 91fa2b65639c2ce2caf88d279aba53e8c4ad3a22:4045725312
17859:S 18 Aug 13:54:03.783 * Discarding previously cached master state.
17859:S 18 Aug 13:54:48.517 * MASTER <-> SLAVE sync: receiving 2550550079 bytes from master
17859:S 18 Aug 13:55:08.913 * MASTER <-> SLAVE sync: Flushing old data
17859:S 18 Aug 13:55: