pprof

pprof是性能分析的工具。

pprof可以用来做什么？

CPU Profiling：CPU 分析，按照一定的频率采集所监听的应用程序 CPU（含寄存器）的使用情况，可确定应用程序在主动消耗 CPU 周期时花费时间的位置。
Memory Profiling：内存分析，在应用程序进行堆分配时记录堆栈跟踪，用于监视当前和历史内存使用情况，以及检查内存泄漏。
Block Profiling：阻塞分析，记录 Goroutine 阻塞等待同步（包括定时器通道）的位置，默认不开启，需要调用 runtime.SetBlockProfileRate 进行设置。
Mutex Profiling：互斥锁分析，报告互斥锁的竞争情况，默认不开启，需要调用 runtime.SetMutexProfileFraction 进行设置。
Goroutine Profiling： Goroutine 分析，可以对当前应用程序正在运行的 Goroutine 进行堆栈跟踪和分析。这项功能在实际排查中会经常用到，因为很多问题出现时的表象就是 Goroutine 暴增，而这时候要做的事情之一就是查看应用程序中的 Goroutine 正在做什么事情，因为什么阻塞了，然后再进行下一步。

信息采集得到的概要文件(Profile)

用于分析程序性能的概要文件有三种，分别是：CPU 概要文件（CPU Profile）、内存概要文件（Mem Profile）和阻塞概要文件（Block Profile）。

这些概要文件中包含的都是：在某一段时间内，对 Go 程序的相关指标进行多次采样后得到的概要信息。

对于 CPU 概要文件来说，其中的每一段独立的概要信息都记录着，在进行某一次采样的那个时刻，CPU 上正在执行的 Go 代码。

而对于内存概要文件，其中的每一段概要信息都记载着，在某个采样时刻，正在执行的 Go 代码以及堆内存的使用情况，这里包含已分配和已释放的字节数量和对象数量。

至于阻塞概要文件，其中的每一段概要信息，都代表着 Go 程序中的一个 goroutine 阻塞事件。

信息采集之工具型应用

import "runtime/pprof"  // 工具型应用用这个包

CPU执行信息采集

开启CPU信息采集：

pprof.StartCPUProfile(w io.Writer)

停止CPU信息采集：

pprof.StopCPUProfile()

应用执行结束后，就会生成一个文件，保存了 CPU profiling 数据。得到采样数据之后，使用go tool pprof工具进行程序的CPU性能问题检查。

内存信息采集

记录程序的堆栈信息

pprof.WriteHeapProfile(w io.Writer)

应用执行结束后，就会生成另一个文件，下一步使用go tool pprof工具进行程序的内存问题排查。

go tool pprof默认是使用-inuse_space进行统计，还可以使用-inuse-objects查看分配对象的数量。

使用示例

package main

import (
	"flag"
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

// 一段有问题的代码
func logicCode() {
	var c chan int
	for {
		select {
		case v := <-c:
			fmt.Printf("recv from chan, value:%v\n", v)
		default:

		}
	}
}

func main() {
	var isCPUPprof bool
	var isMemPprof bool

	flag.BoolVar(&isCPUPprof, "cpu", false, "turn cpu pprof on")
	flag.BoolVar(&isMemPprof, "mem", false, "turn mem pprof on")
	flag.Parse()

	if isCPUPprof {
		file, err := os.Create("./cpu.pprof")
		if err != nil {
			fmt.Printf("create cpu pprof failed, err:%v\n", err)
			return
		}
		pprof.StartCPUProfile(file)
		defer pprof.StopCPUProfile()
	}
	for i := 0; i < 8; i++ {
		go logicCode()
	}
	time.Sleep(30 * time.Second)
	if isMemPprof {
		file, err := os.Create("./mem.pprof")
		if err != nil {
			fmt.Printf("create mem pprof failed, err:%v\n", err)
			return
		}
		pprof.WriteHeapProfile(file)
		file.Close()
	}
}

通过flag可以通过命令行控制是否开启CPU和Mem的性能分析，执行时加上-cpu命令行参数如下：

go run main.go -cpu

等待30秒后会在当前目录下生成一个cpu.pprof文件。

信息采集之服务型应用

如果应用程序是一直运行的，比如 web 应用，那么可以使用net/http/pprof库，它能够在提供 HTTP 服务的同时进行数据采集。

如果使用了默认的http.DefaultServeMux（通常是代码直接使用http.ListenAndServe(“0.0.0.0:8000”, nil)），只需要在web server端代码中按如下方式导入net/http/pprof

import _ "net/http/pprof"  // 服务型应用用这个包

如果使用自定义的 Mux，则需要手动注册一些路由规则：

r.HandleFunc("/debug/pprof/", pprof.Index)
r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
r.HandleFunc("/debug/pprof/profile", pprof.Profile)  // CPU
r.HandleFunc("/debug/pprof/heap", pprof.Index)  // 内存
r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
r.HandleFunc("/debug/pprof/trace", pprof.Trace)

如果使用的是gin框架，那么推荐使用github.com/gin-contrib/pprof，在代码中通过以下命令注册pprof相关路由。

pprof.Register(router)

不管哪种方式，都可以通过访问上述路由查看相关内容：

/debug/pprof/profile：访问这个链接会自动进行 CPU profiling，持续 30s，并生成一个文件供下载
/debug/pprof/heap： Memory Profiling 的路径，访问这个链接会得到一个内存 Profiling 结果的文件
/debug/pprof/block：block Profiling 的路径
/debug/pprof/goroutines：运行的 goroutines 列表，以及调用关系

对采集的数据进行分析

不管是工具型应用还是服务型应用，使用相应的pprof库获取到数据文件之后，下一步都要对这些数据进行分析检查。

go tool pprof [binary] [source]

其中：

binary 是应用的二进制文件，用来解析各种符号；
source 表示 profile 数据的来源，可以是本地的文件，也可以是 http 地址。

注意事项： 获取的 Profiling 数据是动态的，要想获得有效的数据，请保证应用处于较大的负载（比如正在生成中运行的服务，或者通过其他工具模拟访问压力）。否则如果应用处于空闲状态，得到的结果可能没有任何意义。

CPU Profiling 分析使用示例

以上述代码为案例，生成了cpu.pprof文件，然后在命令行执行：

go tool pprof cpu.pprof

执行上面的代码会进入交互界面如下：

runtime_pprof $ go tool pprof cpu.pprof
Type: cpu
Time: Jun 28, 2019 at 11:28am (CST)
Duration: 20.13s, Total samples = 1.91mins (568.60%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

接下来可以在交互界面输入top3来查看程序中占用CPU前3位的函数：

(pprof) top3
Showing nodes accounting for 100.37s, 87.68% of 114.47s total
Dropped 17 nodes (cum <= 0.57s)
Showing top 3 nodes out of 4
      flat  flat%   sum%        cum   cum%
    42.52s 37.15% 37.15%     91.73s 80.13%  runtime.selectnbrecv
    35.21s 30.76% 67.90%     39.49s 34.50%  runtime.chanrecv
    22.64s 19.78% 87.68%    114.37s 99.91%  main.logicCode

其中：

flat：仅当前函数的CPU耗时(不含它内部调用函数的执行时间)
flat%：当前函数CPU耗时百分比
sum%：仅当前函数的CPU耗时累计百分比(不含它内部调用函数的执行时间)
cum：当前函数加上该函数内部调用函数的CPU总耗时
cum%：当前函数加上该函数内部调用函数的CPU总耗时百分比
最后一列：函数名称

在列表中找出自己编写的函数，可以看出，main.logicCode对CPU的消耗过高，接下来使用list命令对该函数的实现代码CPU耗时进行下一步检查：

(pprof) list logicCode
Total: 1.82mins
ROUTINE ======================== main.logicCode in /Users/zhhnzw/workspace/mygithub/chat/t.go
    23.19s   1.82mins (flat, cum)   100% of Total
         .          .     11:// 一段有问题的代码
         .          .     12:func logicCode() {
         .          .     13:   var c chan int
         .          .     14:   for {
         .          .     15:           select {
    23.19s   1.82mins     16:           case v := <-c:
         .          .     17:                   fmt.Printf("recv from chan, value:%v\n", v)
         .          .     18:           default:
         .          .     19:
         .          .     20:           }
         .          .     21:   }

通过检查发现大部分CPU资源被第16行占用，select语句中的default没有内容会导致select代码块不会发生阻塞，从而case v:=<-c:被for无限制的循环执行，浪费了CPU资源。解决办法：可以在default分支里添加一行time.Sleep(time.Second)。

Goroutine、Mutex、Block 分析使用示例

package main

import (
	"net/http"
	_ "net/http/pprof"
	"runtime"
	"sync"
)

func init() {
	runtime.SetMutexProfileFraction(1)  // 默认不采集互斥锁，需要手动设置
	runtime.SetBlockProfileRate(1)  // 默认不采集阻塞信息，需要手动设置
}

func WriteMap() {
	var m sync.Mutex
	var datas = make(map[int]struct{})
	for i := 0; i < 999; i++ {
		go func(i int) {
			m.Lock()
			defer m.Unlock()
			datas[i] = struct{}{}
		}(i)
	}
}

func main() {
	go func() {
		for {
			WriteMap()
		}
	}()
	_ = http.ListenAndServe(":6060", nil)
}

Goroutine Profiling

$ go tool pprof http://localhost:6060/debug/pprof/goroutine
Fetching profile over HTTP from http://localhost:6060/debug/pprof/goroutine
Saved profile in /Users/zhhnzw/pprof/pprof.goroutine.006.pb.gz
Type: goroutine
Time: Apr 14, 2022 at 3:27pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

使用traces命令，会打印出对应的所有调用栈，以及指标信息，可以查看到整个调用链路有什么，分别在哪里使用了多少个 goroutine，并且能够通过分析查看到谁才是真正的调用方：

(pprof) traces
Type: goroutine
Time: Apr 14, 2022 at 3:27pm (CST)
-----------+-------------------------------------------------------
      1291   runtime.gopark
             runtime.goparkunlock (inline)
             runtime.semacquire1
             sync.runtime_SemacquireMutex
             sync.(*Mutex).lockSlow
             sync.(*Mutex).Lock (inline)
             main.WriteMap.func1
-----------+-------------------------------------------------------
         1   runtime.newobject
             main.WriteMap
             main.main.func1
-----------+-------------------------------------------------------
         1   runtime.blockevent
             runtime.semacquire1
             sync.runtime_SemacquireMutex
             sync.(*Mutex).lockSlow
             sync.(*Mutex).Lock (inline)
             main.WriteMap.func1
-----------+-------------------------------------------------------
         1   runtime.gopark
             runtime.netpollblock
             internal/poll.runtime_pollWait
             internal/poll.(*pollDesc).wait
             internal/poll.(*pollDesc).waitRead (inline)
             internal/poll.(*FD).Read
             net.(*netFD).Read
             net.(*conn).Read
             net/http.(*connReader).backgroundRead
-----------+-------------------------------------------------------
         1   runtime.gopark
             runtime.netpollblock
             internal/poll.runtime_pollWait
             internal/poll.(*pollDesc).wait
             internal/poll.(*pollDesc).waitRead (inline)
             internal/poll.(*FD).Accept
             net.(*netFD).accept
             net.(*TCPListener).accept
             net.(*TCPListener).Accept
             net/http.(*Server).Serve
             net/http.(*Server).ListenAndServe
             net/http.ListenAndServe (inline)
             main.main
             runtime.main
-----------+-------------------------------------------------------
         1   runtime.goready
             runtime.readyWithTime
             runtime.semrelease1
             sync.runtime_Semrelease
             sync.(*Mutex).unlockSlow
             sync.(*Mutex).Unlock
             main.WriteMap.func1
-----------+-------------------------------------------------------
         1   runtime.goyield (inline)
             runtime.semrelease1
             sync.runtime_Semrelease
             sync.(*Mutex).unlockSlow
             sync.(*Mutex).Unlock
             main.WriteMap.func1
-----------+-------------------------------------------------------
         1   runtime/pprof.runtime_goroutineProfileWithLabels
             runtime/pprof.writeRuntimeProfile
             runtime/pprof.writeGoroutine
             runtime/pprof.(*Profile).WriteTo
             net/http/pprof.handler.ServeHTTP
             net/http/pprof.Index
             net/http.HandlerFunc.ServeHTTP
             net/http.(*ServeMux).ServeHTTP
             net/http.serverHandler.ServeHTTP
             net/http.(*conn).serve
-----------+-------------------------------------------------------

在调用栈上来讲，其展示顺序是自下而上的，以第5栏为例，runtime.main方法调用了main.main方法，main.main方法又调用了 net/http.ListenAndServe方法，这里对应的也就是所使用的示例代码了，结合代码来排查会非常方便，也可以用web指令来可视化traces调用链，排查起来就更加方便了，可视化见下一个小节。

main.WriteMap.func1对应有1291个goroutine，不断增长的goroutine数量就是协程泄露/内存泄露问题。

Mutex Profiling

$go tool pprof http://localhost:6060/debug/pprof/mutex    
Fetching profile over HTTP from http://localhost:6060/debug/pprof/mutex
Saved profile in /Users/zhhnzw/pprof/pprof.contentions.delay.015.pb.gz
Type: delay
Time: Apr 14, 2022 at 3:27pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 85.65s, 100% of 85.65s total
      flat  flat%   sum%        cum   cum%
    85.65s   100%   100%     85.65s   100%  sync.(*Mutex).Unlock
         0     0%   100%     85.65s   100%  main.WriteMap.func1
(pprof) list main.WriteMap.func1
Total: 85.65s
ROUTINE ======================== main.WriteMap.func1 in /Users/zhhnzw/workspace/mygithub/go_demos/main.go
         0     85.65s (flat, cum)   100% of Total
         .          .     18:   for i := 0; i < 999; i++ {
         .          .     19:           go func(i int) {
         .          .     20:                   m.Lock()
         .          .     21:                   defer m.Unlock()
         .          .     22:                   datas[i] = struct{}{}
         .     85.65s     23:           }(i)
         .          .     24:   }
         .          .     25:}
         .          .     26:
         .          .     27:func main() {
         .          .     28:   go func() {

先用top命令，查看互斥量的排名，然后执行list+函数名命令可以查看该函数具体是哪一行代码导致了锁开销，如上是23行。

Block Profiling

$go tool pprof http://localhost:6060/debug/pprof/block    
Fetching profile over HTTP from http://localhost:6060/debug/pprof/block
Saved profile in /Users/zhhnzw/pprof/pprof.contentions.delay.016.pb.gz
Type: delay
Time: Apr 14, 2022 at 3:27pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 5.20hrs, 100% of 5.20hrs total
Dropped 4 nodes (cum <= 0.03hrs)
      flat  flat%   sum%        cum   cum%
   5.20hrs   100%   100%    5.20hrs   100%  sync.(*Mutex).Lock (inline)
         0     0%   100%    5.20hrs   100%  main.WriteMap.func1

如上可知，是main.WriteMap.func1内部的sync.(*Mutex).Lock产生的阻塞。

用可视化工具分析

Mac下安装：

brew install graphviz

web指令

在pprof命令行：

(pprof) web

pprof 可视化 web

关于图形的说明：每个框代表一个函数，理论上框越大表示占用的CPU资源越多。方框之间的线条代表函数之间的调用关系。线条上的数字表示函数调用耗费的时间。方框中的第一行数字表示仅当前函数消耗的CPU耗时(sum，不是sum%)，第二行数字表示当前函数加上该函数调用函数的占用CPU总耗时cum。

除了分析CPU性能数据，pprof也支持分析内存性能数据。比如，使用下面的命令分析http服务的heap性能数据，查看当前程序的内存占用以及热点内存对象使用的情况。

# 查看内存占用数据
go tool pprof -inuse_space http://127.0.0.1:8080/debug/pprof/heap
go tool pprof -inuse_objects http://127.0.0.1:8080/debug/pprof/heap
# 查看临时内存分配数据
go tool pprof -alloc_space http://127.0.0.1:8080/debug/pprof/heap
go tool pprof -alloc_objects http://127.0.0.1:8080/debug/pprof/heap

单独的站点

# 这条命令会先获取 CPU Profiling 采集文件，然后运行一个 pprof 分析用的web站点 
$ go tool pprof -http=:6001 http://localhost:6060/debug/pprof/profile

Graph

VIEW默认展示的是Graph分析界面：

Flame Graph(火焰图)

火焰图的调用顺序从上到下，每个方块代表一个函数，它下面一层表示这个函数会调用哪些函数，方块的大小代表了占用 CPU 使用的长短。火焰图的配色并没有特殊的意义，默认的红、黄配色是为了更像火焰而已。可以通过点击方块来 zoom in 检查它内部的内容深入分析，以下点击了WriteMap.func1区块，进入到了下一层级：

Top

跟交互式终端top指令给出的信息含义一样。

Peek

相较于 Top 视图，增加了所属的上下文信息的展示，也就是函数的输出调用者/被调用者。

Source

面向源代码的追踪和分析，可以看到主要开销在哪里。

pprof与benchmark基准测试结合使用

-cpuprofile：cpu profiling 保存的文件路径
-memprofile：memory profiling 保存的文件路径
-blockprofile：block profiling 保存的文件路径

使用示例：

go test -bench . -benchmem -cpuprofile=cpu.prof -memprofile=mem.prof -blockprofile=block.prof

生成采集信息文件后，就跟上文中的分析方法一致了。

pprof与压力测试结合使用

压测工具wrk

https://github.com/wg/wrk 或 https://github.com/adjust/go-wrk

使用步骤

访问：http://127.0.0.1:8080/debug/pprof/
启动压力测试：go-wrk -n 50000 http://127.0.0.1:8080/version
点击 web 页面的 profile 按钮，收集信息，30 s 后会自动下载 profile 文件
拿到 profile 文件后：go tool pprof -http=":8081" profile

优化案例

以优化基准测试的lengthOfNonRepeatingSubStr函数为例

执行go test -bench=Substr -v -cpuprofile=cpu.prof -benchmem在性能测试的同时取得 CPU profiling 数据文件。
执行go tool pprof -http=":8081" cpu.prof，访问http://localhost:8081

通过上图可以看出，大部分时间都花在了map相关操作

知道了性能瓶颈所在之后，优化源代码，用[]int代替map

func lengthOfNonRepeatingSubStr1(s string) int {
	lastOccurred := make([]int, 0xffff)
	start := 0
	maxLength := 0

	for i, ch := range []rune(s) {
		if lastI := lastOccurred[ch]; lastI > start {
			start = lastI
		}
		if i-start+1 > maxLength {
			maxLength = i - start + 1
		}
		lastOccurred[ch] = i + 1
	}

	return maxLength
}

再执行测试命令：

zhhnzw$ go test -bench=Substr -v -cpuprofile=cpu.prof -benchmem
=== RUN   TestSubstr
--- PASS: TestSubstr (0.00s)
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
BenchmarkSubstr
    nonrepeating_test.go:41: len(s) = 491520
    nonrepeating_test.go:41: len(s) = 491520
    nonrepeating_test.go:41: len(s) = 491520
BenchmarkSubstr-8            601           1968375 ns/op         1179663 B/op          2 allocs/op
PASS

可以看到，执行效率从5152279 ns/op进步到了1968375 ns/op，内存消耗从655590 B/op退步到了1179663 B/op，再执行go tool pprof -http=":8081" cpu.prof，访问http://localhost:8081

优化后

可以看到，现在CPU性能消耗只剩下了stringtoslicerune，涉及到的就是rune(s)操作，这里是需要一个utf-8的解码操作耗费了CPU计算时间，因为必须要支持中文，这个解码操作是省不了的，所以CPU的优化到此就完成了。

接下来再来看内存，观察到函数每次被调用每次都会创建一个slice，这一点可以优化，把lastOccurred变量提取到全局变量，在函数内每次都清掉它的值，这样这个slice变量就只用初始化一次了

var lastOccurred = make([]int, 0xffff)
func lengthOfNonRepeatingSubStr2(s string) int {

	for i,l:=0,len(lastOccurred);i<l;i++ {
		lastOccurred[i]=0
	}
	start := 0
	maxLength := 0

	chs := []rune(s)
  // 这也是个优化细节，用 i<len(chs) 会每次调用 len 函数，下面这样写，len 函数只会调用一次
	for i,l:=0,len(chs);i<l;i++ {  
		if lastI := lastOccurred[chs[i]]; lastI > start {
			start = lastI
		}
		if i-start+1 > maxLength {
			maxLength = i - start + 1
		}
		lastOccurred[chs[i]] = i + 1
	}
  
	return maxLength
}

再执行测试命令：

zhhnzw$ go test -bench=Substr -v -cpuprofile=cpu.prof -benchmem
=== RUN   TestSubstr
--- PASS: TestSubstr (0.00s)
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
BenchmarkSubstr
    nonrepeating_test.go:41: len(s) = 491520
    nonrepeating_test.go:41: len(s) = 491520
    nonrepeating_test.go:41: len(s) = 491520
BenchmarkSubstr-8            620           1927372 ns/op          655362 B/op          1 allocs/op
PASS

可以看到，内存消耗进步到了655362 B/op，内存分配次数从2次优化到了1次，至此CPU和Memory都优化完毕了。

Previous测试 Nextpprof进阶使用

Last updated 2 years ago

Was this helpful?