Tools/Profiler

Posted 2022-12-12 WhateverYoung

NVProfiler

Visual Profiler
nvprof

1.1. Focused Profiling
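Profiling works without any changes to the program, which suggests it relies on GPU hardware counters and similar mechanisms rather than on the application code. You can still mark where profiling should start and stop to get more focused results. Several typical scenarios suit this kind of region-limited profiling: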

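The code consists of initialization, data copy, kernel execution, data copy back, result validation and post-processing, and only the kernel is of interest.
The program runs in phases that do not depend on each other, each with a different kernel, so each phase can be analyzed separately.
The program runs many iterations and performance varies little between iterations, so profiling a small subset of iterations is enough.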
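APIs:
cudaProfilerStart()/cudaProfilerStop() (runtime API, declared in cuda_profiler_api.h)
cuProfilerStart()/cuProfilerStop() (driver API, declared in cudaProfiler.h)
nvprof --profile-from-start off disables profiling from the start of the program, so collection begins only at the start call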
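As a minimal sketch of the third scenario above (many similar iterations), the following program profiles only the first few iterations of a loop. The kernel `scale`, the iteration counts and the file name are assumptions made for illustration; only `cudaProfilerStart()`/`cudaProfilerStop()` and the header come from the notes above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>  // cudaProfilerStart(), cudaProfilerStop()

// Hypothetical kernel, used only to give the profiler something to record.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const int iterations = 1000;        // assumed total iteration count
    const int profiledIterations = 10;  // assumed size of the profiled window
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Initialization stays outside the profiled region.
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    for (int it = 0; it < iterations; ++it) {
        if (it == 0) cudaProfilerStart();
        scale<<<blocks, threads>>>(d_data, n, 1.0f);
        if (it == profiledIterations - 1) {
            cudaDeviceSynchronize();  // let the profiled kernels finish
            cudaProfilerStop();
        }
    }

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Assuming the file is saved as focus.cu, it could be built with nvcc focus.cu -o focus and run as nvprof --profile-from-start off ./focus, so that only the kernels launched between the start and stop calls are recorded.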

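event: nvprof --query-events lists the available events. An event is a hardware counter that keeps accumulating while a kernel runs.
metric: nvprof --query-metrics lists the available metrics. A metric is a characteristic of a kernel's execution computed from one or more event counters.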

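1.2. Marking Regions of CPU Activity
The Visual Profiler shows where each CPU thread calls CUDA functions and launches kernels. To see what the CPU threads are doing outside of those CUDA calls, instrument the application with the NVIDIA Tools Extension API (NVTX). nvprof supports NVTX as well.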
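A small sketch of NVTX instrumentation, assuming the legacy nvToolsExt.h header and the push/pop range calls `nvtxRangePushA()`/`nvtxRangePop()`. The function `prepareInput`, the kernel `addOne` and the range names are made up for this example.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <nvToolsExt.h>  // NVTX ranges; link with -lnvToolsExt

// Hypothetical kernel for illustration.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Pure CPU work: invisible on the timeline unless wrapped in an NVTX range.
static void prepareInput(std::vector<float> &host) {
    nvtxRangePushA("prepareInput");
    for (size_t i = 0; i < host.size(); ++i) host[i] = static_cast<float>(i);
    nvtxRangePop();
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n);
    prepareInput(host);

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // A second range that groups the copy and the kernel launch.
    nvtxRangePushA("upload + kernel");
    cudaMemcpy(d_data, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    addOne<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(d_data);
    return 0;
}
```

Ranges must be popped on the same thread that pushed them, which is why each push here is paired with a pop inside the same function.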
1.3. Naming CPU and CUDA Resources
You can use the NVIDIA Tools Extension API to assign custom names for your CPU and GPU resources. Your custom names will then be displayed in the Timeline View.
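The naming calls can be sketched as follows, assuming the NVTX naming functions `nvtxNameOsThreadA()`, `nvtxNameCudaDeviceA()` and `nvtxNameCudaStreamA()` from nvToolsExt.h and nvToolsExtCudaRt.h. The names themselves are arbitrary, and obtaining the thread ID through `SYS_gettid` is a Linux-specific assumption.

```cuda
#include <cstdint>
#include <cuda_runtime.h>
#include <nvToolsExt.h>        // nvtxNameOsThreadA; link with -lnvToolsExt
#include <nvToolsExtCudaRt.h>  // nvtxNameCudaDeviceA, nvtxNameCudaStreamA
#include <sys/syscall.h>
#include <unistd.h>

int main() {
    // Name the calling CPU thread (Linux thread ID, an assumption of this sketch).
    nvtxNameOsThreadA(static_cast<uint32_t>(syscall(SYS_gettid)), "main thread");

    // Name device 0 and a newly created stream.
    nvtxNameCudaDeviceA(0, "primary GPU");
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    nvtxNameCudaStreamA(stream, "compute stream");

    // Issue some work on the named stream so it has activity to display.
    const size_t bytes = 1 << 20;
    void *d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);
    cudaMemsetAsync(d_buf, 0, bytes, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```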
1.4. Flush Profile Data
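By default, profile data is collected into internal buffers and written to disk at low priority. To make sure no profile data is lost, call cuProfilerStop() before all threads exit to force a flush.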

https://docs.nvidia.com/cuda/profiler-users-guide/index.html#profiling-overview
https://docs.nvidia.com/cuda/cupti/r_main.html#r_main

Visual Profiler

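A graphical interface that displays the performance measurements of a program run. It is powerful and has many features; you have to download it and try it once to get a real feel for it. Details: TODO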

nvprof

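A command-line tool for analyzing the performance measurements of a program run. Output goes to stderr by default; use --log-file to redirect it.
There are a great many options (CUDA, CPU, print, I/O and so on), plus several execution modes and control modes that can be specified. Details: TODO. Either try each option in turn, or look them up when a need arises.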

Remote Profiling

You can profile your remote application directly from nsight or the Visual Profiler.
Or you can use nvprof to collect the profile data on the remote system and then use nvvp on the host system to view and analyze the data.
TODO: try running this once

NVIDIA Tools Extension

Provides an API that serves two purposes:

Tracing of CPU events and time ranges.
Naming of OS and CUDA resources.

TODO

MPI Profiling With nvprof

MPI programs can also be profiled with nvprof.

mpirun -np 2 nvprof --annotate-mpi openmpi ./my_mpi_app

TODO

MPS Profiling

You can collect profiling data for a CUDA application using Multi-Process Service (MPS) with nvprof and then view the timeline by importing the data in the Visual Profiler.
TODO

Dependency Analysis

I have not fully understood this part. Roughly, the tool analyzes how the different parts of a program depend on one another.
TODO

Metrics Reference

Performance metrics computed from the hardware event counters. Look them up as needed.

Warp State

Instruction issued - An instruction or a pair of independent instructions was issued from a warp.
Stalled - Warp can be stalled for one of the following reasons.
- Stalled for instruction fetch - The next instruction was not yet available. Caused by the instruction cache.
- Stalled for execution dependency - An input register is not ready yet because an earlier compute instruction (for example FP64) or a barrier has not finished. Try to increase instruction-level parallelism (ILP).
- Stalled for memory dependency - The next instruction is waiting for previous memory accesses to complete. An input register is not ready yet because an earlier memory instruction (for example LD) has not finished.
- Stalled for memory throttle - A large number of outstanding memory requests prevents forward progress. This is a bandwidth limit; both global and shared memory have limited bandwidth.
- Stalled for texture
- Stalled for sync - The warp is waiting for all threads to synchronize after a barrier instruction.
- Stalled for constant memory dependency - Caused by constant memory accesses.
- Stalled for pipe busy - The warp is stalled because the functional unit required to execute the next instruction is busy, for example when FP64 keeps the pipe busy.
- Stalled for not selected - Warp was ready but did not get a chance to issue as some other warp was selected for issue. Typical of a well-optimized program.
- Stalled for other - Warp is blocked for an uncommon reason like compiler or hardware reasons. Note: barrier > 18, stalls the pipeline.
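The advice to increase ILP for execution-dependency stalls can be illustrated with a sketch like the one below. Both kernels and all sizes are hypothetical: `sumChain` accumulates through one dependent chain of adds, while `sumIlp4` keeps four independent accumulators so consecutive adds do not wait on each other. Whether this actually lowers the stall percentage depends on the GPU and the compiler, so compare the two kernels in the profiler rather than assuming an improvement.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One accumulator: every add depends on the result of the previous add.
__global__ void sumChain(const float *in, float *out, int perThread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int k = 0; k < perThread; ++k) acc += in[tid + k * stride];
    out[tid] = acc;
}

// Four independent accumulators; perThread is assumed to be a multiple of 4.
// Note that the summation order differs, so rounding may differ in general.
__global__ void sumIlp4(const float *in, float *out, int perThread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (int k = 0; k < perThread; k += 4) {
        a0 += in[tid + (k + 0) * stride];
        a1 += in[tid + (k + 1) * stride];
        a2 += in[tid + (k + 2) * stride];
        a3 += in[tid + (k + 3) * stride];
    }
    out[tid] = (a0 + a1) + (a2 + a3);
}

int main() {
    const int blocks = 64, threads = 256, perThread = 64;  // assumed sizes
    const int numThreads = blocks * threads;
    const size_t n = static_cast<size_t>(numThreads) * perThread;

    std::vector<float> host(n, 1.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, numThreads * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    float chain = 0.0f, ilp = 0.0f;
    sumChain<<<blocks, threads>>>(d_in, d_out, perThread);
    cudaMemcpy(&chain, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    sumIlp4<<<blocks, threads>>>(d_in, d_out, perThread);
    cudaMemcpy(&ilp, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    printf("thread 0: chain=%f ilp4=%f\n", chain, ilp);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```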
