The golang pprof tool
Posted by jfcat
The pprof tool
What is pprof?
pprof is a tool for analyzing and visualizing performance data.
pprof reads a collection of profiling samples in profile.proto format and generates reports, both textual and graphical, to help visualize and analyze the data.
profile.proto is a protocol buffer that describes a set of callers and symbol information; a common use is to represent sampled call stacks from statistical profiling.
A profile can be read from a local file or over HTTP, and multiple profiles can be merged or compared.
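For example, two heap profiles saved from the same program can be diffed with pprof's -base flag (the file names here are hypothetical):
go tool pprof -base heap_before.pb.gz heap_after.pb.gz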
Using pprof
pprof lives in two places in Go:
- runtime/pprof: for writing profiles directly from code (a minimal sketch follows below)
- net/http/pprof: for exposing profiles over HTTP
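net/http/pprof is essentially an HTTP wrapper around runtime/pprof. As a point of reference, here is a minimal runtime/pprof sketch (the cpu.out file name is arbitrary) that writes a CPU profile for the lifetime of the program:

package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.out") // profile destination
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // sample the CPU until main returns
    if err := pprof.StartCPUProfile(f); err != nil {
        panic(err)
    }
    defer pprof.StopCPUProfile()

    // ... the workload being profiled goes here ...
}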
Using net/http/pprof
The following code implements a simple web service:
package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

// counter appends 100000 ints to an initially empty slice, forcing
// repeated reallocation and copying as the slice grows.
func counter() {
    slice := make([]int, 0)
    for i := 0; i < 100000; i++ {
        slice = append(slice, i)
    }
}

func httpGet(w http.ResponseWriter, r *http.Request) {
    counter()
}

func main() {
    http.HandleFunc("/get", httpGet)
    http.ListenAndServe("localhost:8000", nil)
}
Compile and run it, then start profiling:
$ go build netmain.go
$ ./netmain &
$ go tool pprof http://localhost:8000/debug/pprof/heap
Fetching profile over HTTP from http://localhost:8000/debug/pprof/heap
Saved profile in /Users/zhaoyong/pprof/pprof.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
Type: inuse_space
Time: Jul 26, 2021 at 7:14am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 902.59kB, 100% of 902.59kB total
Showing top 10 nodes out of 13
flat flat% sum% cum cum%
902.59kB 100% 100% 902.59kB 100% compress/flate.NewWriter
0 0% 100% 902.59kB 100% compress/gzip.(*Writer).Write
0 0% 100% 902.59kB 100% net/http.(*ServeMux).ServeHTTP
0 0% 100% 902.59kB 100% net/http.(*conn).serve
0 0% 100% 902.59kB 100% net/http.HandlerFunc.ServeHTTP
0 0% 100% 902.59kB 100% net/http.serverHandler.ServeHTTP
0 0% 100% 902.59kB 100% net/http/pprof.Index
0 0% 100% 902.59kB 100% net/http/pprof.handler.ServeHTTP
0 0% 100% 902.59kB 100% runtime/pprof.(*Profile).WriteTo
0 0% 100% 902.59kB 100% runtime/pprof.(*profileBuilder).build
(pprof)
This shows the program's heap allocations. Note that the top entry, compress/flate.NewWriter, is the pprof handler gzip-compressing the profile it serves: the /get endpoint has not been hit yet, so none of the application's own allocations appear.
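To make counter show up, drive some traffic first and fetch the heap profile again:
curl http://localhost:8000/get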
Go's built-in memory profiler lets us sample the memory usage of a live system. There are four sample indexes:
- inuse_objects: watch this when you suspect too many objects are resident in memory
- inuse_space: watch this when the application's RSS looks too large
- alloc_objects: watch this when a past burst of allocations has driven up CPU or memory usage
- alloc_space: watch this when memory usage has spiked historically
Each index can be selected with the matching flag; for example, to view alloc_space:
go tool pprof -alloc_space http://localhost:8000/debug/pprof/heap
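With recent versions of pprof the same switch is also available inside interactive mode, so the profile does not have to be re-fetched:
(pprof) sample_index=alloc_space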
The other fields mean:
top: one of pprof's commands; it shows the top 10 entries by default, and top 20 shows 20 rows.
flat/flat%: the data attributable to the function itself, and its share of the total. For example, compress/flate.NewWriter holds 902.59kB of memory itself, 100% of the total shown above.
cum/cum%: the cumulative data for the function plus everything it calls, and that share of the total. For example, compress/flate.NewWriter's cumulative share is also 100%, since it sits at the bottom of the only allocating call chain.
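A toy program makes the flat/cum distinction concrete (f and g are invented for illustration). In a CPU profile of this program, g's flat and cum times are nearly equal because it does all of its work itself, while f's flat covers only its own loop and f's cum additionally includes g's time:

package main

// g does ~90% of the total work itself, so flat ≈ cum for g.
func g() int {
    s := 0
    for i := 0; i < 90000000; i++ {
        s += i
    }
    return s
}

// f does ~10% of the work in its own loop (f's flat) and the rest
// through the call to g (included only in f's cum).
func f() int {
    s := 0
    for i := 0; i < 10000000; i++ {
        s += i
    }
    return s + g()
}

func main() {
    total := 0
    for i := 0; i < 20; i++ {
        total += f()
    }
    _ = total
}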
You can also open http://127.0.0.1:8000/debug/pprof/ in a browser for an overview and pick what to inspect:
/debug/pprof/
Types of profiles available:
Count Profile
44 allocs
0 block
0 cmdline
7 goroutine
44 heap
0 mutex
0 profile
13 threadcreate
0 trace
full goroutine stack dump
Profile Descriptions:
allocs: A sampling of all past memory allocations
block: Stack traces that led to blocking on synchronization primitives
cmdline: The command line invocation of the current program
goroutine: Stack traces of all current goroutines
heap: A sampling of memory allocations of live objects. You can specify the gc GET parameter to run GC before taking the heap sample.
mutex: Stack traces of holders of contended mutexes
profile: CPU profile. You can specify the duration in the seconds GET parameter. After you get the profile file, use the go tool pprof command to investigate the profile.
threadcreate: Stack traces that led to the creation of new OS threads
trace: A trace of execution of the current program. You can specify the duration in the seconds GET parameter. After you get the trace file, use the go tool trace command to investigate the trace.
From here you can pick exactly what you want to look at.
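Each entry on that page is an ordinary HTTP endpoint, so the same data can be pulled with curl; the gc and seconds parameters mentioned in the descriptions are passed as query strings:
curl -o heap.pb.gz "http://localhost:8000/debug/pprof/heap?gc=1"
curl "http://localhost:8000/debug/pprof/goroutine?debug=2"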
Selecting profile on the page downloads a CPU profile file, which can then be analyzed with go tool pprof; running web at the pprof prompt generates a call graph of the running program, which gives a much more intuitive picture.
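The download-then-analyze steps can also be collapsed into one command, where seconds controls how long the CPU is sampled:
go tool pprof http://localhost:8000/debug/pprof/profile?seconds=30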
Viewing graphical output requires the graphviz tool; on macOS it can be installed with:
brew install graphviz
Once it is installed, typing web at the pprof prompt generates an SVG file.
Simulating concurrent load
Here the wrk tool simulates concurrent requests:
wrk -c20 -t20 -d30s http://localhost:8000/get
- -c20: 20 concurrent connections
- -t20: 20 threads issuing requests
- -d30s: run the test for 30 seconds
Here are the baseline numbers, for comparison later:
Running 30s test @ http://localhost:8000/get
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.36ms 3.62ms 42.07ms 65.63%
Req/Sec 197.45 32.92 330.00 68.80%
118208 requests in 30.07s, 8.45MB read
Requests/sec: 3931.53
Transfer/sec: 287.95KB
Download the profile file from the browser and analyze it:
Type: cpu
Time: Jul 26, 2021 at 8:54am (CST)
Duration: 30.03s, Total samples = 34.10s (113.54%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 30.83s, 90.41% of 34.10s total
Dropped 245 nodes (cum <= 0.17s)
Showing top 20 nodes out of 119
flat flat% sum% cum cum%
10.93s 32.05% 32.05% 10.93s 32.05% runtime.memmove
4.19s 12.29% 44.34% 4.23s 12.40% syscall.syscall
2.61s 7.65% 51.99% 2.61s 7.65% runtime.usleep
1.89s 5.54% 57.54% 1.89s 5.54% runtime.memclrNoHeapPointers
1.65s 4.84% 62.38% 1.66s 4.87% runtime.kevent
1.53s 4.49% 66.86% 1.53s 4.49% runtime.pthread_kill
1.45s 4.25% 71.11% 17.68s 51.85% main.counter
1.37s 4.02% 75.13% 1.37s 4.02% runtime.asyncPreempt
1.23s 3.61% 78.74% 1.23s 3.61%
Let's dig into runtime.memmove, the largest entry. The web call graph attributes the memmove time mainly to the counter function: appending to a slice that starts at length 0 keeps allocating larger backing arrays and moving the existing elements into them, which produces the flood of memmove calls. That is where we'll start optimizing.
First, set the slice's initial allocation to 100000:
slice := make([]int, 100000)
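One detail worth flagging before reading the numbers: make([]int, 100000) sets the slice's length, not its capacity, so the loop's appends start after 100000 pre-filled zeros and grow the backing array to roughly twice that size. If the goal is purely to preallocate room for append, the usual idiom sets the capacity instead:
slice := make([]int, 0, 100000) // length 0, capacity 100000: append never reallocates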
After this change, the load test reports:
Running 30s test @ http://localhost:8000/get
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.72ms 4.97ms 50.64ms 78.91%
Req/Sec 161.15 33.73 280.00 70.83%
96590 requests in 30.10s, 6.91MB read
Requests/sec: 3209.47
Transfer/sec: 235.07KB
And the profile data:
Showing top 10 nodes out of 99
flat flat% sum% cum cum%
51.81s 43.81% 43.81% 51.81s 43.81% runtime.memmove
13.29s 11.24% 55.05% 13.29s 11.24% runtime.memclrNoHeapPointers
13.04s 11.03% 66.07% 13.19s 11.15% syscall.syscall
5.60s 4.74% 70.81% 5.60s 4.74% runtime.pthread_kill
4.98s 4.21% 75.02% 4.99s 4.22% runtime.asyncPreempt
4.68s 3.96% 78.98% 4.68s 3.96% runtime.pthread_cond_wait
4.59s 3.88% 82.86% 4.59s 3.88% runtime.usleep
3.76s 3.18% 86.04% 77.21s 65.29% main.counter
3.26s 2.76% 88.80% 3.27s 2.77% runtime.kevent
2.61s 2.21% 91.00% 2.61s 2.21% runtime.pthread_cond_signal
Overall throughput dropped, and the profile shows memmove grew sharply; this is likely related to GC pressure and memory allocation (and to the length-vs-capacity issue noted above). Shrinking the initial slice allocation to 1000 immediately brings throughput back up:
Running 30s test @ http://localhost:8000/get
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.14ms 3.46ms 32.02ms 64.85%
Req/Sec 205.56 33.36 373.00 63.98%
123046 requests in 30.07s, 8.80MB read
Requests/sec: 4092.56
Transfer/sec: 299.75KB
To check how much of this is due to GC, disable it manually:
// disable GC
debug.SetGCPercent(-1)
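A minimal sketch of where this call goes, assuming the same server as above (note the extra runtime/debug import):

import (
    "net/http"
    _ "net/http/pprof"
    "runtime/debug"
)

func main() {
    debug.SetGCPercent(-1) // disable GC before serving
    http.HandleFunc("/get", httpGet)
    http.ListenAndServe("localhost:8000", nil)
}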
With GC off, throughput climbs further still, although this is not a workable option for a real service, since memory is never reclaimed:
Running 30s test @ http://localhost:8000/get
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.89ms 3.77ms 62.83ms 80.45%
Req/Sec 223.38 42.17 370.00 68.62%
133731 requests in 30.07s, 9.57MB read
Requests/sec: 4447.47
Transfer/sec: 325.74KB
After repeated tests, an initial allocation of around 1000 works best. With the size set to 1000:
Running 30s test @ http://localhost:8000/get
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.15ms 3.48ms 37.80ms 65.03%
Req/Sec 205.09 32.57 340.00 65.03%
122758 requests in 30.07s, 8.78MB read
Requests/sec: 4083.08
Transfer/sec: 299.05KB
Code: https://github.com/zyong/go.programming/tree/main/pprof