JVM 问题排查
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了JVM 问题排查相关的知识,希望对你有一定的参考价值。
参考技术A也可以通过查看gc日志来观察问题
如果CPU使用率不高,但程序性能低下,则可考虑对线程进行分析,看看各个主要的线程都在做什么,是否有锁争用或IO阻塞问题
为了方便分析,最好给每个线程或者线程池命名
如果JVM发现有死锁存在,会在日志中出现Found one Java-level deadlock
在等待一个条件的发生,来把自己唤醒,或者调用了sleep方法
此时线程状态:
WAITING(parking): 一直等待那个条件发生
TIMED_WAITING(parking或sleeping): 定时等待,即使条件不发生,时间到了也可以自己唤醒
如果发现大量线程处于此状态,并且从线程的堆栈上查看到是正在执行网络读写,这可能是一个网络瓶颈问题或者第三方响应慢的问题
线程所需要的资源长时间等待却一直无法获取,被标识为阻塞状态,可以理解为等待资源超时的线程。线程堆栈中一般存在Waiting to Lock
每个 Monitor在某个时刻,只能被一个线程拥有,该线程就是Active Thread,而其它线程都是Waiting Thread,分别在两个队列 Entry Set和Wait Set里面等候。 在Entry Set中等待的线程状态是Waiting for monitor entry,而在Wait Set中等待的线程状态是in Object.wait()。 当被调用notify或notifyAll时,只有在Wait Set中的线程会被唤醒
Waiting for monitor entry:等待进入一个临界区 ,所以它在Entry Set队列中等待。此时线程状态一般都是Blocked,如果存在大量线程在此状态,可能是一个全局锁阻塞住了大量线程。随着时间流逝,waiting for monitor entry的线程越来越多,没有减少的趋势,可能意味着某些线程在临界区里呆的时间太长了,以至于越来越多新线程迟迟无法进入临界区
当线程获得了Monitor,如果发现线程继续运行的条件没有满足,它则调用对象(一般就是被 synchronized 的对象)的 wait() 方法,放弃了Monitor,进入Wait Set队列。此时线程状态大致为以下几种:TIMED_WAITING (on object monitor)和 WAITING (on object monitor)
有时候线程状态是Runnable,但却是在等待IO
性能分析之– JAVA Thread Dump 分析综述
三个实例演示 Java Thread Dump 日志分析
JVM故障问题排查心得「虚拟机线程问题」实战分析JVM线程导致CPU过载过高的问题排查方案
前提概要
(服务调优或问题排查执行方案)
本次主要分析一下相关的线上如果出现CPU负载过高的情况下,应该如何按照步骤去排查和分析问题,接下来就让我们一同迈入优化和解决问题的探索之路。
侦探问题根源
通过top指令发现,当前5511的线程cup和内存的使用率过高:
top
5511 root 20 0 16.841g 6.088g 5584 S 47.0 39.2 4011:41 java
9550 root 20 0 2516200 67892 2436 S 0.7 0.4 204:20.40 java
21271 root 20 0 11.579g 0.987g 5056 S 0.7 6.4 46:29.88 java
13 root 20 0 0 0 0 S 0.3 0.0 0:35.21 ksoftirqd/1
2128 root 20 0 9.833g 1.273g 6656 S 0.3 8.2 33:48.35 java
29464 root 20 0 11.578g 1.030g 5448 S 0.3 6.6 31:42.67 java
31721 root 20 0 157744 2264 1544 R 0.3 0.0 0:00.32 top
查询当前java进程详情,利用 jinfo pid 5511
Attaching to process ID 5511, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.151-b12
Java System Properties:
java.runtime.name = Java(TM) SE Runtime Environment
java.vm.version = 25.151-b12
sun.boot.library.path = /usr/local/java/jre/lib/amd64
java.vendor.url = http://java.oracle.com/
java.vm.vendor = Oracle Corporation
path.separator = :
file.encoding.pkg = sun.io
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
sun.os.patch.level = unknown
sun.java.launcher = SUN_STANDARD
user.country = US
user.dir = /mnt/app/bdcenter-base/bdcenter-service-base-1.0-SNAPSHOT
java.vm.specification.name = Java Virtual Machine Specification
PID = 5511
java.runtime.version = 1.8.0_151-b12
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
os.arch = amd64
java.endorsed.dirs = /usr/local/java/jre/lib/endorsed
line.separator = java.io.tmpdir = /tmp
java.vm.specification.vendor = Oracle Corporation
os.name = Linux
io.netty.noKeySetOptimization = true
sun.jnu.encoding = UTF-8
java.library.path = /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
spring.beaninfo.ignore = true
sun.nio.ch.bugLevel =
java.specification.name = Java Platform API Specification
java.class.version = 52.0
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 3.10.0-693.2.2.el7.x86_64
user.home = /root
user.timezone = Asia/Shanghai
catalina.useNaming = false
java.awt.printerjob = sun.print.PSPrinterJob
file.encoding = UTF-8
@appId = bdcenter-service-base
java.specification.version = 1.8
io.netty.recycler.maxCapacityPerThread = 0
catalina.home = /tmp/tomcat.7050606280722271893.9025
user.name = root
java.class.path = bdcenter-service-base-1.0-SNAPSHOT.jar
java.vm.specification.version = 1.8
sun.arch.data.model = 64
sun.java.command = bdcenter-service-base-1.0-SNAPSHOT.jar --spring.profiles.active=prod 5
java.home = /usr/local/java/jre
user.language = en
java.specification.vendor = Oracle Corporation
io.netty.noUnsafe = true
awt.toolkit = sun.awt.X11.XToolkit
java.vm.info = mixed mode
java.version = 1.8.0_151
java.ext.dirs = /usr/local/java/jre/lib/ext:/usr/java/packages/lib/ext
sun.boot.class.path = /usr/local/java/jre/lib/resources.jar:/usr/local/java/jre/lib/rt.jar:/usr/local/java/jre/lib/sunrsasign.jar:/usr/local/java/jre/lib/jsse.jar:/usr/local/java/jre/lib/jce.jar:/usr/local/java/jre/lib/charsets.jar:/usr/local/java/jre/lib/jfr.jar:/usr/local/java/jre/classes
java.awt.headless = true
java.vendor = Oracle Corporation
catalina.base = /tmp/tomcat.7050606280722271893.9025
file.separator = /
java.vendor.url.bug = http://bugreport.sun.com/bugreport/
sun.io.unicode.encoding = UnicodeLittle
sun.cpu.endian = little
sun.cpu.isalist =
VM Flags:
Non-default VM flags: -XX:+AggressiveOpts -XX:CICompilerCount=4 -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC -XX:InitialHeapSize=6442450944 -XX:MaxDirectMemorySize=1073741824 -XX:MaxHeapSize=6442450944 -XX:MaxNewSize=697892864 -XX:MaxTenuringThreshold=6 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=697892864 -XX:OldPLABSize=16 -XX:OldSize=5744558080 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:ThreadStackSize=256 -XX:+UseBiasedLocking -XX:+UseCMSCompactAtFullCollection -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseFastAccessorMethods -XX:+UseParNewGC
Command line: -Xms6G -Xmx6G -XX:MaxPermSize=256M -XX:MaxDirectMemorySize=1G -Xss256k -XX:+AggressiveOpts -XX:+UseBiasedLocking -XX:+UseFastAccessorMethods -XX:+DisableExplicitGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -XX:CMSInitiatingOccupancyFraction=75 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
Command line 可以看出JVM 用户输入参,对整个jvm的信息进行消息的描述
查看当前gc情况: jstat -gccause 5511
[root@bigdata-service-1:/root]# jstat -gccause 5511 1000
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT LGCC GCC
0.00 8.86 61.50 72.16 95.05 91.62 4577 326.410 8 8.855 335.265 Allocation Failure No GC
0.00 8.86 64.31 72.16 95.05 91.62 4577 326.410 8 8.855 335.265 Allocation Failure No GC
0.00 8.86 65.04 72.16 95.05 91.62 4577 326.410 8 8.855 335.265 Allocation Failure No GC
0.00 8.86 65.17 72.16 95.05 91.62 4577 326.410 8 8.855 335.265 Allocation Failure No GC
0.00 8.86 65.26 72.16 95.05 91.62 4577 326.410 8 8.855 335.265 Allocation Failure No GC
0.00 8.86 65.26 72.16 95.05 91.62 4577 326.410 8 8.855 335.265 Allocation Failure No GC
0.00 8.86 67.80 72.16 95.05 91.62 4577 326.410 8 8.855 335.265 Allocation Failure No GC
发现新生代频繁的进行gc操【YGC:gc 次数, YGCT: gc时间】,老年的几乎没有gc【FGC:full gc次数,FGCT :full gc时间】,可能原因是新生代的设计过小,导致新生代频繁gc
查看当前jvm 内存分配:
jmap -heap 5511
[root@bigdata-service-1:/root]# jmap -heap 5511
Attaching to process ID 5511, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.151-b12
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 6442450944 (6144.0MB)
NewSize = 697892864 (665.5625MB)
MaxNewSize = 697892864 (665.5625MB)
OldSize = 5744558080 (5478.4375MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 628162560 (599.0625MB)
used = 210724080 (200.96214294433594MB)
free = 417438480 (398.10035705566406MB)
33.54610628178795% used
Eden Space:
capacity = 558432256 (532.5625MB)
used = 203572032 (194.14141845703125MB)
free = 354860224 (338.42108154296875MB)
36.45420367694519% used
From Space:
capacity = 69730304 (66.5MB)
used = 7152048 (6.8207244873046875MB)
free = 62578256 (59.67927551269531MB)
10.256728552337876% used
To Space:
capacity = 69730304 (66.5MB)
used = 0 (0.0MB)
free = 69730304 (66.5MB)
0.0% used
concurrent mark-sweep generation:
capacity = 5744558080 (5478.4375MB)
used = 4145940376 (3953.876853942871MB)
free = 1598617704 (1524.560646057129MB)
72.17161561016022% used
32720 interned Strings occupying 4008568 bytes.
-
从JVM的分配可以看出,新生代分配:capacity = 628162560 (599.0625MB),Eden Space:capacity = 558432256 (532.5625MB),老年代分配:capacity = 5744558080 (5478.4375MB)。从数据分析出,新生代设计过小,导致新生代频繁gc。
-
其次,新生代设计过小,导致大对象无法分配,直接分配到老年代。导致老年代空间占用过大,从而占用整个系统内存空间。可能导致其他服务内存不足。
线上手动GC,查看GC后的JVM详情信息:
[root@bigdata-service-1:/root]# jmap -histo:live 5511
日志过大.....
[root@bigdata-service-1:/root]# jstat -gc 5511 1000
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
68096.0 68096.0 7588.1 0.0 545344.0 394801.0 5609920.0 4050654.1 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 8.855 336.437
68096.0 68096.0 0.0 0.0 545344.0 0.0 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 46918.8 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 46922.8 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 46937.1 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 46937.1 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 46937.1 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 64190.0 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 64190.0 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 64190.0 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 64190.0 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 64190.0 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 82152.3 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 82152.3 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 82152.3 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 82156.4 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 82156.4 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 101737.5 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 101737.5 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 103581.5 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 103581.5 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 103581.5 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 124812.4 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 124812.4 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
68096.0 68096.0 0.0 0.0 545344.0 124814.5 5609920.0 4043257.0 87952.0 83637.7 11392.0 10437.9 4592 327.582 9 12.997 340.578
手动强制Full gc 以后发现,内存并没有被释放,说明当前系统可能存在内存泄漏,导致对象不能被清理释放【OC:老年代空间,OU :老年代使用空间(KB)】。
存在内存泄漏,主要是当前的对象引用没有被释放导致,比如僵尸线程的存在。说明当前系统可能存在大量的僵尸线程。进而进行进一步的排查
查看当前系统线程情况:jstack。可以看到当前线程的详细情况
[root@bigdata-service-1:/root]# jstack 5511 | head -50
2018-07-20 11:59:27
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode):
"Keep-Alive-Timer" #19437 daemon prio=8 os_prio=0 tid=0x00007f8d749ce000 nid=0x4995 waiting on condition [0x00007f8bb7b2b000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at sun.net.www.http.KeepAliveCache.run(KeepAliveCache.java:172)
at java.lang.Thread.run(Thread.java:748)
"http-nio-9025-exec-201" #19378 daemon prio=5 os_prio=0 tid=0x00007f8c8e3d3000 nid=0x7ffc waiting on condition [0x00007f8b3a724000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000679f4cbd0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:103)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:31)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
"http-nio-9025-exec-197" #19374 daemon prio=5 os_prio=0 tid=0x00007f8c8e3cc000 nid=0x7ff7 waiting on condition [0x00007f8b3a828000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000679f4cbd0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:103)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:31)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
"http-nio-9025-exec-196" #19373 daemon prio=5 os_prio=0 tid=0x00007f8c8a06d000 nid=0x7ff3 waiting on condition [0x00007f8b3a8aa000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000679f4cbd0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:103)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:31)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
查看当前系统的线程总数:
jvm 开启的线程数:
[root@bigdata-service-1:/root]# jstack 5511 | wc -l
254132
当前服务器对不同IP的建立连接数:
[root@bigdata-service-1:/root]# netstat -nat|grep ESTABLISHED|awk 'print$5'|awk -F : 'print$1'|sort|uniq -c|sort -rn
10244 10.27.70.185
26 10.27.70.77
26 10.24.237.150
10 10.81.83.121
7 10.29.148.127
4 10.80.101.244
3 10.27.4.125
2 127.0.0.1
2 10.80.112.35
2 10.27.87.167
2 10.25.0.162
1 10.80.112.23
1 106.11.248.20
加载当前的堆信息:
**jmap -dump:live,format=b,file=dump.hprof 5511 **
通过 Eclipse IDE MAT 插件进行堆分析
Eclipse 导入堆信息
- 查看当前堆内存泄漏情况
堆的对象泄漏可以分析出,当前堆中存在大量的:”io.netty.util.internal.InternalThreadLocalMap“这个是netty的内部线程
- 查看当前堆中占用最大的对象
通过对堆的对象分析,可以发现,当前系统中大量存在连接池对象,从而定位到问题所在,连接池出现了泄漏。导致系统大量创建连接池,而没有释放。出现了内存泄漏
可以分析出,当前系统对外建立了大量的连接,而且连接没有释放,导致系统开启的线程数过多。通过分析连接最多的ip,定位到出现问题的节点
"elasticsearch[_client_][transport_client_boss][T#4]" #166 daemon prio=5 os_prio=0 tid=0x00007f8c9819e800 nid=0x20e3 runnable [0x00007f8c7403d000]
- locked <0x0000000674995e98> (a sun.nio.ch.Util$3)
- locked <0x0000000674995e80> (a java.util.Collections$UnmodifiableSet)
- locked <0x0000000674986f58> (a sun.nio.ch.EPollSelectorImpl)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
以上是关于JVM 问题排查的主要内容,如果未能解决你的问题,请参考以下文章