Pitfalls in Long-Connection Development: Troubleshooting a netty OOM in Practice
Posted by 51NB Tech
As our company keeps growing, a number of business requirements have pushed the R&D team to build its own long-connection (persistent connection) service. A long-connection server is generally expected to support tens of thousands of client connections on a single machine, so picking the right framework matters. After some research we chose to build on netty. Once the basic server-side features were finished, I wanted to verify performance with a load test, and that is where the following pitfall story begins...
1. How we hit the problem
During the load test everything was fine from 1,000 up to 5,000 client connections, with each client looping through a fixed number of requests. Only when the number of connections reached 15,000 (it later turned out the problem had nothing to do with the connection count, only with the number of requests) did the server, after running for a while, start throwing OOM exceptions continuously:
2018-06-22 14:24:37.875 ERROR 5381 --- [-] [ntLoopGroup-7-3] c.e.c.s.c.s.h.h.HttpProxyServerHandler : exception happens, clientId: 1aa0892c
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 251658247, max: 268435456)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:615)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:569)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:214)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:181)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:172)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
2. Troubleshooting
2.1 Initial hypothesis
The exception message tells us roughly that the direct (off-heap) memory is exhausted and nothing more can be allocated, i.e. a direct-memory OOM, but two things were puzzling.
Question 1: why is the maximum direct memory enforced by netty exactly 268435456 bytes?
Reading the PlatformDependent.java source shows that the limit is obtained through maxDirectMemory0(). The value returned by maxDirectMemory0() is parsed from the JVM option -XX:MaxDirectMemorySize and, when PlatformDependent is initialized, assigned to the DIRECT_MEMORY_LIMIT constant; the exception is thrown exactly when the current usage plus the requested capacity exceeds DIRECT_MEMORY_LIMIT. Checking my application's JVM startup options, -XX:MaxDirectMemorySize was indeed set to 256M (268435456 bytes).
private static long maxDirectMemory0() {
    long maxDirectMemory = 0;
    ClassLoader systemClassLoader = null;
    try {
        // Try to get from sun.misc.VM.maxDirectMemory() which should be most accurate.
        systemClassLoader = getSystemClassLoader();
        Class<?> vmClass = Class.forName("sun.misc.VM", true, systemClassLoader);
        Method m = vmClass.getDeclaredMethod("maxDirectMemory");
        maxDirectMemory = ((Number) m.invoke(null)).longValue();
    } catch (Throwable ignored) {
        // Ignore
    }

    if (maxDirectMemory > 0) {
        return maxDirectMemory;
    }

    try {
        // Now try to get the JVM option (-XX:MaxDirectMemorySize) and parse it.
        // Note that we are using reflection because android doesn't have these classes.
        Class<?> mgmtFactoryClass = Class.forName(
                "java.lang.management.ManagementFactory", true, systemClassLoader);
        Class<?> runtimeClass = Class.forName(
                "java.lang.management.RuntimeMXBean", true, systemClassLoader);

        Object runtime = mgmtFactoryClass.getDeclaredMethod("getRuntimeMXBean").invoke(null);

        @SuppressWarnings("unchecked")
        List<String> vmArgs = (List<String>) runtimeClass.getDeclaredMethod("getInputArguments").invoke(runtime);
        for (int i = vmArgs.size() - 1; i >= 0; i --) {
            Matcher m = MAX_DIRECT_MEMORY_SIZE_ARG_PATTERN.matcher(vmArgs.get(i));
            if (!m.matches()) {
                continue;
            }

            maxDirectMemory = Long.parseLong(m.group(1));
            switch (m.group(2).charAt(0)) {
                case 'k': case 'K':
                    maxDirectMemory *= 1024;
                    break;
                case 'm': case 'M':
                    maxDirectMemory *= 1024 * 1024;
                    break;
                case 'g': case 'G':
                    maxDirectMemory *= 1024 * 1024 * 1024;
                    break;
            }
            break;
        }
    } catch (Throwable ignored) {
        // Ignore
    }

    if (maxDirectMemory <= 0) {
        maxDirectMemory = Runtime.getRuntime().maxMemory();
        logger.debug("maxDirectMemory: {} bytes (maybe)", maxDirectMemory);
    } else {
        logger.debug("maxDirectMemory: {} bytes", maxDirectMemory);
    }

    return maxDirectMemory;
}
//...
private static void incrementMemoryCounter(int capacity) {
    if (DIRECT_MEMORY_COUNTER != null) {
        for (;;) {
            long usedMemory = DIRECT_MEMORY_COUNTER.get();
            long newUsedMemory = usedMemory + capacity;
            if (newUsedMemory > DIRECT_MEMORY_LIMIT) { // this is where the exception is thrown
                throw new OutOfDirectMemoryError("failed to allocate " + capacity
                        + " byte(s) of direct memory (used: " + usedMemory + ", max: " + DIRECT_MEMORY_LIMIT + ')');
            }
            if (DIRECT_MEMORY_COUNTER.compareAndSet(usedMemory, newUsedMemory)) {
                break;
            }
        }
    }
}
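As a quick sanity check, you can log the limit netty actually picked up at startup. Here is a minimal sketch using netty's public PlatformDependent helper (the class lives in an internal package, so treat this as a diagnostic aid rather than a stable API):

import io.netty.util.internal.PlatformDependent;

public final class DirectMemoryInfo {
    public static void main(String[] args) {
        // With -XX:MaxDirectMemorySize=256m this should print 268435456.
        System.out.println("netty direct memory limit: "
                + PlatformDependent.maxDirectMemory() + " bytes");
    }
}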
Question 2: what is the PoolArena in the stack trace?
After reading some documentation I learned that PoolArena is the implementation class behind netty's memory pool. To centralize allocation and deallocation and make them faster, netty pre-allocates a large chunk of memory as a pool and then splits it into small pieces that are handed out on demand.
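To make that concrete, here is a minimal sketch (not taken from the project code) of borrowing a pooled direct buffer and returning it; forgetting the release() call at the end is exactly what keeps the arena asking the JVM for new chunks:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import java.nio.charset.StandardCharsets;

public final class PooledBufferDemo {
    public static void main(String[] args) {
        // Borrow a 256-byte direct buffer from netty's default pooled allocator.
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256);
        try {
            buf.writeBytes("hello".getBytes(StandardCharsets.UTF_8));
            // ... use the buffer ...
        } finally {
            // Hand the memory back to the pool; without this the pool keeps growing.
            buf.release();
        }
    }
}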
With those two questions answered, my working hypothesis was that objects handed out by the memory pool were never being returned, so new memory kept being allocated until it hit the JVM-imposed limit.
2.2 Why not google it
A search engine can often help you pinpoint a problem quickly, because most of the time someone has already run into the same issue. So I googled a few keywords from the exception, and a Stack Overflow answer led me to netty's official documentation on reference counting. It explains that using reference counting incorrectly, i.e. never calling release(), causes a memory leak, and it describes how to track down the cause of such leaks. I then grepped the logs for the keyword LEAK and found an ERROR line (the stack traces above were scrolling by so fast that I had not noticed it at first).
2018-06-25 17:39:56.616 ERROR 18059 --- [-] [ntLoopGroup-5-1] io.netty.util.ResourceLeakDetector : LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetection.level=advanced' or call ResourceLeakDetector.setLevel() See http://netty.io/wiki/reference-counted-objects.html for more information.
According to the documentation, the detection levels mean the following: the default, SIMPLE, only tells you that a leak occurred, without reporting where the buffer was accessed, while ADVANCED samples 1% of buffers and reports where the leaked buffer was accessed (the level can also be set from code, as shown in the snippet after the list below).
There are currently 4 levels of leak detection:
DISABLED - disables leak detection completely. Not recommended.
SIMPLE - tells if there is a leak or not for 1% of buffers. Default.
ADVANCED - tells where the leaked buffer was accessed for 1% of buffers.
PARANOID - Same with ADVANCED except that it's for every single buffer. Useful for automated testing phase. You could fail the build if the build output contains 'LEAK:'.
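Where restarting with a -D flag is inconvenient (for example inside a test), the same level can also be set programmatically. A minimal sketch:

import io.netty.util.ResourceLeakDetector;

public final class LeakDetectionConfig {
    public static void enableAdvancedLeakDetection() {
        // Equivalent to -Dio.netty.leakDetection.level=advanced;
        // call this before any ByteBuf is allocated.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.ADVANCED);
    }
}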
2.3 Finding the "truth" in practice
Since the official docs suggest changing a JVM option to get more detailed diagnostics, why not try it right away? I added -Dio.netty.leakDetection.level=advanced to the JVM options and, based on the analysis in 2.1, lowered -XX:MaxDirectMemorySize to 25M so the problem would reproduce faster. After restarting the long-connection service and resuming the load test, the exception soon reappeared, this time with extra hints: the leak records are netty's clues about where the leak happened. There is also a WARNING saying that one leak record was discarded because the record count is limited to 4 by default; the limit can be raised via the io.netty.leakDetection.maxRecords system property.
2018-06-25 16:09:34.647 ERROR 16469 --- [-] [ntLoopGroup-5-4] io.netty.util.ResourceLeakDetector : LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
WARNING: 1 leak records were discarded because the leak record count is limited to 4. Use system property io.netty.leakDetection.maxRecords to increase the limit.
Recent access records: 4
#4:
io.netty.buffer.AdvancedLeakAwareByteBuf.readByte(AdvancedLeakAwareByteBuf.java:396)
com.enniu.cloud.services.connection.server.httpproxy.handlers.HttpProxyHeader.readLine(HttpProxyHeader.java:105)
// same as the full trace below; omitted for brevity
After adding the JVM option -Dio.netty.leakDetection.maxRecords=10 and retrying, the full report was as follows:
2018-06-25 17:52:46.806 ERROR 18384 --- [-] [ntLoopGroup-5-3] io.netty.util.ResourceLeakDetector : LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 5
#5:
io.netty.buffer.AdvancedLeakAwareByteBuf.readByte(AdvancedLeakAwareByteBuf.java:396)
com.enniu.cloud.services.connection.server.httpproxy.handlers.HttpProxyHeader.readLine(HttpProxyHeader.java:105)
com.enniu.cloud.services.connection.server.httpproxy.handlers.HttpProxyHeader.digest(HttpProxyHeader.java:46)
com.enniu.cloud.services.connection.server.httpproxy.handlers.HttpProxyServerHandler.channelRead(HttpProxyServerHandler.java:42)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:745)
#4:
Hint: 'HttpProxyServerHandler#0' will handle the message from this point.
io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:107)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:745)
#3:
Hint: 'LoggingHandler#0' will handle the message from this point.
io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:107)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:745)
#2:
Hint: 'DefaultChannelPipeline$HeadContext#0' will handle the message from this point.
io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:107)
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:745)
#1:
io.netty.buffer.AdvancedLeakAwareByteBuf.writeBytes(AdvancedLeakAwareByteBuf.java:630)
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:372)
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:745)
Created at:
io.netty.util.ResourceLeakDetector.track(ResourceLeakDetector.java:237)
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:331)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:181)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:172)
io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133)
io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80)
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122)
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:745)
The report above shows the leaked ByteBuf travelling from LoggingHandler to HttpProxyServerHandler, where the problem ultimately occurs. Looking at the HttpProxyServerHandler code, the ByteBuf that is read in is indeed never released.
@Override
public void channelRead(ChannelHandlerContext ctx, Object msg) {
    ByteBuf original = (ByteBuf) msg;
    if (isFirstRead) {
        HttpProxyHeader httpProxyClientHeader = new HttpProxyHeader();
        try {
            httpProxyClientHeader.digest(original);
        } catch (Exception e) {
            log.error("http proxy request header is illegal, {}", e.getMessage());
            ctx.close().addListener(CLOSE_ON_FAILURE);
            return;
        } // ############ `original` is never released here ############
        // irrelevant code omitted
    }
Following the correct ByteBuf usage described in netty's documentation, I added ReferenceCountUtil.release(original):
try {
    httpProxyClientHeader.digest(original);
} catch (Exception e) {
    log.error("http proxy request header is illegal, {}", e.getMessage());
    ctx.close().addListener(CLOSE_ON_FAILURE);
    return;
} finally {
    ReferenceCountUtil.release(original); // decrement the reference count so the buffer goes back to the pool
}
I then ran the load test again; the OOM no longer occurred and the bug was fixed.
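As a side note (this is not how the project was actually fixed), netty also provides SimpleChannelInboundHandler, which releases the incoming message automatically after channelRead0 returns. A handler that only consumes the buffer could be sketched roughly like this; the class name is hypothetical:

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;

// SimpleChannelInboundHandler releases msg for us once channelRead0 returns,
// so the explicit finally { release } block is no longer needed.
public class AutoReleasingProxyHandler extends SimpleChannelInboundHandler<ByteBuf> {
    @Override
    protected void channelRead0(ChannelHandlerContext ctx, ByteBuf msg) {
        // Parse the proxy header from msg here. Do NOT keep a reference to msg
        // beyond this method unless you call msg.retain() first.
    }
}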
2.4 Getting to the bottom of it
Is the job done once the problem is fixed? That would hardly satisfy us 51 engineers and our appetite for first principles, so I went on to investigate how reference counting relates to releasing the memory.
Debugging channelRead shows that the incoming Object msg is a PooledUnsafeDirectByteBuf, and that its memory is freed in its deallocate() method. So how does deallocate() relate to release()?
@Override
protected void deallocate() {
    ByteBuffer buffer = this.buffer;
    if (buffer == null) {
        return;
    }
    this.buffer = null;

    if (!doNotFree) {
        freeDirect(buffer);
    }
}
Looking further at the class hierarchy and the related code, AbstractReferenceCountedByteBuf calls deallocate() from release(), while deallocate() itself is left for subclasses to implement; in other words, a textbook template method pattern.
@Override
public boolean release(int decrement) {
    return release0(checkPositive(decrement, "decrement"));
}

private boolean release0(int decrement) {
    for (;;) {
        int refCnt = this.refCnt;
        if (refCnt < decrement) {
            throw new IllegalReferenceCountException(refCnt, -decrement);
        }

        if (refCntUpdater.compareAndSet(this, refCnt, refCnt - decrement)) {
            if (refCnt == decrement) {
                deallocate();
                return true;
            }
            return false;
        }
    }
}

/**
 * Called once {@link #refCnt()} is equals 0.
 */
protected abstract void deallocate();
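The reference-counting contract is easy to observe directly; a minimal sketch (assuming the default pooled allocator):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public final class RefCntDemo {
    public static void main(String[] args) {
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(64);
        System.out.println(buf.refCnt());  // 1: a fresh buffer starts with one reference
        buf.retain();
        System.out.println(buf.refCnt());  // 2
        buf.release();
        System.out.println(buf.refCnt());  // 1
        boolean freed = buf.release();     // refCnt reaches 0 -> deallocate() runs
        System.out.println(freed);         // true: the memory went back to the pool
    }
}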
This confirms the earlier hypothesis: buffers handed out by the memory pool were not being returned after use, so allocations kept piling up until they reached the JVM-imposed limit.
3. Summary
OOM problems are usually hard to diagnose, but fortunately netty gives us convenient ways to collect the important clues. The official advice is to catch memory leaks with unit tests instead of finding out in production: run the tests with leakDetection.level set to simple, and as soon as a leak is reported, switch the level to advanced and rerun the tests to locate and fix the cause. In other words, thorough testing before going live is well worth the effort.
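A rough shape for such a test, using netty's EmbeddedChannel against the handler from this article (the handler name is taken from the article; everything else is illustrative and assumes JUnit 4):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.embedded.EmbeddedChannel;
import io.netty.util.CharsetUtil;
import io.netty.util.ResourceLeakDetector;
import org.junit.Test;

public class HttpProxyServerHandlerLeakTest {

    @Test
    public void channelReadShouldReleaseTheIncomingBuffer() {
        // PARANOID checks every single buffer, so a missing release() shows up
        // as a "LEAK:" line in the test output, which the build can grep for.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

        EmbeddedChannel channel = new EmbeddedChannel(new HttpProxyServerHandler());
        ByteBuf request = PooledByteBufAllocator.DEFAULT.directBuffer();
        request.writeBytes("CONNECT example.com:443 HTTP/1.1\r\n\r\n".getBytes(CharsetUtil.UTF_8));

        channel.writeInbound(request);
        channel.finishAndReleaseAll();
    }
}

Note that the leak detector only reports a leak when the un-released buffer is garbage-collected, so the "LEAK:" line may show up later in the build output rather than inside this particular test method.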
Lin Jian, infrastructure and middleware engineer at 51 Credit Card (51信用卡).
Previously an R&D engineer at Sohu Video, where he mainly worked on SMS gateway and image storage projects. He joined 51 Credit Card in 2017 and currently works on the push and long-connection services. He has extensive experience building infrastructure services and enjoys digging into tricky problems and the low-level principles behind the technology.