PyProf profiles and analyzes the GPU performance of PyTorch models. It aggregates the following information from Nsight Systems or NVProf for every GPU kernel.
- Kernel name e.g. turing_fp16_s884gemm_fp16_64x128_ldg8_f2f_tn.
- Kernel duration.
- Device id, stream id.
- Grid dimensions, block dimensions.
- Thread id.
In addition, it provides the following information for almost every GPU kernel.
- PyTorch module and op name e.g. torch.nn.functional,linear.
- Tensor shape and data type of all input arguments e.g. [32,3,224,224]fp16;[64,3,7,7]fp16.
- Total data movement (bytes) and floating point operations (flops).
- Tensor Core usage.
- Call stack e.g. ncf.py:352, ncf.py:277, apex/amp/_initialize.py:197, /home/ubuntu/dlperf/NCF/neumf.py:107.
- Direction i.e. forward or backward.
- Forward-backward correlation. The tool correlates the GPU kernels invoked during back propagation to the corresponding kernels during forward propagation.
With additional user annotation (advanced mode):
- Associate layer names e.g. BERT:Encoder_2:FFN:LayerNorm to a GPU kernel.
# clone
$ git clone https://github.com/adityaiitb/PyProf.git
# install (run from inside the cloned directory)
$ cd PyProf
$ pip3 install . --user
# verify
$ pip3 list | grep pyprof

There are four steps to the tool flow.
- Import library and annotate code.
Import and initialize the tool. Run the training or inference loop
within the PyTorch NVTX context manager as shown below. In addition,
you can call profiler.start() and profiler.stop() to select the
iteration or iterations for which you would like to capture data.
import torch
import torch.cuda.profiler as profiler
# Import and initialize PyProf
import pyprof
pyprof.init()
iters = 500
iter_to_capture = 100
# Define network, loss function, optimizer etc.
# PyTorch NVTX context manager
with torch.autograd.profiler.emit_nvtx():
    for iter in range(iters):
        if iter == iter_to_capture:
            profiler.start()
        output = net(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        if iter == iter_to_capture:
            profiler.stop()
- Profile using Nsight Systems or NVProf to obtain a SQLite3 database.
NVProf is being phased out, so Nsight Systems is recommended.
With Nsight Systems, generate a SQLite database as follows.
# -f true                   Overwrite existing files.
# -o net                    Create net.qdrep (used by the Nsight Systems viewer).
# -c cudaProfilerApi        Needed only with profiler.start() / profiler.stop().
# --stop-on-range-end true  Needed only with profiler.start() / profiler.stop().
# --export sqlite           Export net.sqlite (similar to NVProf).
$ nsys profile \
    -f true \
    -o net \
    -c cudaProfilerApi \
    --stop-on-range-end true \
    --export sqlite \
    python net.py

If net.py uses profiler.start() and profiler.stop(), the options
-c cudaProfilerApi --stop-on-range-end true are required; otherwise omit them.
If profiling is slow, add the nsys option -s none, which disables CPU sampling and significantly speeds up profiling.
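The choice of options above can be captured in a small launcher script. The following is a minimal sketch and is not part of PyProf: the function name `build_nsys_command` and its parameters are hypothetical, and it uses only the `nsys` flags mentioned in this section.

```python
import shlex


def build_nsys_command(script, output="net", uses_profiler_api=False,
                       disable_cpu_sampling=False):
    """Assemble an nsys command line (hypothetical helper, not a PyProf API)."""
    cmd = ["nsys", "profile", "-f", "true", "-o", output]
    if uses_profiler_api:
        # Needed only when the script calls profiler.start() / profiler.stop().
        cmd += ["-c", "cudaProfilerApi", "--stop-on-range-end", "true"]
    if disable_cpu_sampling:
        # Disables CPU sampling to speed up profiling.
        cmd += ["-s", "none"]
    cmd += ["--export", "sqlite", "python", script]
    return cmd


# Render the command as a single shell-quoted string for inspection.
command_line = shlex.join(build_nsys_command("net.py", uses_profiler_api=True))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `nsys` is installed.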
With NVProf, generate a SQL (NVVP) file. This file can also be opened with the NVIDIA Visual Profiler (NVVP).
If you used profiler.start() and profiler.stop(), run
# -f: overwrite existing file; -o net.sql: create net.sql
# --profile-from-start off: profiling is started and stopped inside net.py
$ nvprof \
    -f \
    -o net.sql \
    --profile-from-start off \
    python net.py

If you did not use profiler.start() and profiler.stop(), run
# -f: overwrite existing file; -o net.sql: create net.sql
$ nvprof \
    -f \
    -o net.sql \
    python net.py

If you get a message such as ERR_NVGPUCTRPERM The user running <tool_name/application_name> does not have permission to access NVIDIA GPU Performance Counters on the target device, follow the
steps here.
- Parse the SQL file.
Run the parser on the SQL file. The output is an ASCII file in which each line is a Python dictionary containing information about the kernel name, duration, parameters, etc. This file can also be used as input to other custom scripts. Nsight Systems names its database net.sqlite; if you profiled with NVProf, pass net.sql instead.
$ python -m pyprof.parse net.sqlite > net.dict
- Run the prof script.
Using the python dictionary created in step 3 as the input, PyProf can
produce a CSV output, a columnated output (similar to column -t for
terminal readability) and a space separated output (for post processing
by AWK for instance). It produces 20 columns of information for every
GPU kernel and you can select a subset of columns using the -c flag.
Note that a few columns might have the value na, implying either that the feature is a
work in progress or that the tool was unable to extract that information.
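The dictionary file from step 3 can also be read directly by a custom script. The sketch below assumes only what is stated above, namely that each line is a Python dictionary literal. The helper names and the keys `kernel` and `duration` in the sample are hypothetical placeholders; inspect a real file to find the actual key names.

```python
import ast


def read_kernel_records(lines):
    """Yield one dict per non-empty line (hypothetical helper, not a PyProf API)."""
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        # literal_eval parses Python literals without executing code.
        record = ast.literal_eval(text)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number} is not a dictionary")
        yield record


def total_by_key(records, group_key, value_key):
    """Sum value_key over records that share the same group_key."""
    totals = {}
    for record in records:
        group = record.get(group_key, "na")
        totals[group] = totals.get(group, 0) + record.get(value_key, 0)
    return totals


# Hypothetical sample lines; real files carry more keys with different names.
sample = [
    "{'kernel': 'gemm_a', 'duration': 1200}",
    "{'kernel': 'relu_b', 'duration': 300}",
    "{'kernel': 'gemm_a', 'duration': 800}",
]
totals = total_by_key(read_kernel_records(sample), "kernel", "duration")
```

For a real file, pass an open file handle instead of the sample list, for example `with open("net.dict") as f: records = list(read_kernel_records(f))`.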
| Column | Description | 
|---|---|
| idx | Index | 
| seq | PyTorch Sequence Id | 
| altseq | PyTorch Alternate Sequence Id | 
| tid | Thread Id | 
| layer | User annotated NVTX string (can be nested) | 
| trace | Function Call Trace | 
| dir | Direction | 
| sub | Sub Sequence Id | 
| mod | PyTorch Module | 
| op | Operation | 
| kernel | Kernel Name | 
| params | Parameters | 
| sil | Silicon Time (in ns) | 
| tc | Tensor Core Usage | 
| device | GPU Device Id | 
| stream | Stream Id | 
| grid | Grid Dimensions | 
| block | Block Dimensions | 
| flops | Floating point ops (FMA = 2 FLOPs) | 
| bytes | Number of bytes in and out of DRAM | 
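
To make the flops and bytes columns concrete, here is a rough estimate for a linear layer computed as a matrix multiply. This is a standard back-of-the-envelope calculation and not necessarily the formula PyProf uses. The function name is hypothetical, and the byte count assumes every operand is read or written exactly once.

```python
def linear_flops_and_bytes(batch, in_features, out_features,
                           bytes_per_element=2, has_bias=True):
    """Estimate FLOPs and bytes moved for y = x @ W^T + b (illustrative only)."""
    # Each output element needs in_features multiply-adds; one FMA = 2 FLOPs.
    flops = 2 * batch * in_features * out_features
    # Elements touched: input, weight, and output, each counted once.
    elements = (batch * in_features
                + in_features * out_features
                + batch * out_features)
    if has_bias:
        flops += batch * out_features  # one add per output element
        elements += out_features       # the bias vector is read once
    return flops, elements * bytes_per_element


# fp16 (2 bytes per element) example with small, easy-to-check sizes.
flops, moved_bytes = linear_flops_and_bytes(2, 3, 4, has_bias=False)
```

With `batch=2`, `in_features=3`, `out_features=4` and no bias, this gives 48 FLOPs and 52 bytes.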
Here are a few examples of how to use prof.
# Print usage and help. Lists all available output columns.
$ python -m pyprof.prof -h
# Columnated output of width 150 with default columns.
# The default columns are "idx,dir,sub,mod,op,kernel,params,sil".
$ python -m pyprof.prof -w 150 net.dict
# CSV output.
$ python -m pyprof.prof --csv net.dict
# Space separated output.
$ python -m pyprof.prof net.dict
# Columnated output of width 130 with columns index,direction,kernel name,parameters,silicon time.
$ python -m pyprof.prof -w 130 -c idx,dir,kernel,params,sil net.dict
# CSV output with columns index,direction,kernel name,parameters,silicon time.
$ python -m pyprof.prof --csv -c idx,dir,kernel,params,sil net.dict
# Space separated output with columns index,direction,kernel name,parameters,silicon time.
$ python -m pyprof.prof -c idx,dir,kernel,params,sil net.dict
# Input redirection.
$ python -m pyprof.prof < net.dict

With some additional annotations in the model code, you can get
even richer profiling information e.g. the name of the layer, say
BERT:Encoder_2:FFN:LayerNorm, associated with every GPU kernel. It
is also easy to enable profiling of modules and functions with custom
forward and backward methods. One can also extend the tool to
add bytes and flops calculations for such custom functions. See
here for instructions.
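
As an illustration of how a nested layer name such as BERT:Encoder_2:FFN:LayerNorm can be built up, the sketch below keeps a stack of names and joins them with colons. The `LayerNamer` class is hypothetical and is not PyProf's annotation API; the exact string format PyProf expects is described in the linked instructions, not here.

```python
from contextlib import contextmanager


class LayerNamer:
    """Build colon-joined nested layer names (hypothetical helper)."""

    def __init__(self, push=None, pop=None):
        self._stack = []
        # Optional callbacks invoked on entry and exit; no-ops by default.
        self._push = push or (lambda name: None)
        self._pop = pop or (lambda: None)
        self.emitted = []  # every full name produced, in order

    @contextmanager
    def layer(self, name):
        self._stack.append(name)
        full_name = ":".join(self._stack)
        self.emitted.append(full_name)
        self._push(full_name)
        try:
            yield full_name
        finally:
            self._pop()
            self._stack.pop()


namer = LayerNamer()
with namer.layer("BERT"):
    with namer.layer("Encoder_2"):
        with namer.layer("FFN"):
            with namer.layer("LayerNorm") as innermost:
                pass  # the layer's forward computation would run here
```

On a GPU machine, `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop` could be passed as the `push` and `pop` callbacks so that each name is emitted as an NVTX range. Whether PyProf picks up ranges in exactly this form is an assumption.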
If you use PyProf and would like to cite us, we suggest the following.
@misc{nvidia-pyprof,
  author = {Agrawal, Aditya and Kolodziej, Marek},
  title = {PyProf},
  year = {2019},
  publisher = {Nvidia Corporation},
  howpublished = {\url{https://github.com/adityaiitb/PyProf}}
}
Contributions are more than welcome. To contribute, make a pull request and follow the guidelines here.