Install the required packages:
sudo apt install libnuma-dev libboost-all-dev
Install CUDA and cuDNN.
...
Modify env.sh to point to the right libraries.
Build the profiling library (prof.so).
make
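Make sure your CUDA application links the CUDA runtime dynamically, because LD_PRELOAD can only intercept calls into shared libraries. Static linking is the default when you build your own CUDA code with nvcc; pass --cudart=shared to change it.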
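One way to check an existing binary is to look for libcudart in its shared-library dependencies. The helper below is a minimal sketch, not part of this repository: the function name links_cudart_dynamically and the binary name ./my_app are made up for illustration, and it assumes that the library you need to see is libcudart.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with this project).
# Succeeds if the executable given as $1 lists libcudart among its
# shared-library dependencies, i.e. the CUDA runtime is linked dynamically.
links_cudart_dynamically() {
    ldd "$1" 2>/dev/null | grep -q 'libcudart'
}

# Example use with a hypothetical binary ./my_app:
#
#   if links_cudart_dynamically ./my_app; then
#       echo "CUDA runtime is linked dynamically"
#   else
#       echo "libcudart not found; try rebuilding with nvcc --cudart=shared"
#   fi
```

If the check fails for a binary you built yourself, relinking with nvcc --cudart=shared is the usual fix.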
Each run appends its data to output.cprof, so you will usually want to remove that file first. ./env.sh sets up the LD_PRELOAD environment and then invokes your app.
rm -f output.cprof
./env.sh <your app>
Post-process the result with one of the cprof2<something>.py scripts:
cprof2<something>.py
Other info
env.sh sets LD_PRELOAD to load the profiling library and its dependencies.
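As a rough illustration of that mechanism, a wrapper of this kind could look like the sketch below. It is not the repository's env.sh, which may set additional paths. The function name run_with_profiler and the PROF_LIB variable are assumptions made for the example; prof.so is the library produced by make.

```shell
#!/bin/sh
# Hypothetical sketch of an env.sh-style wrapper; the real script may differ.
# PROF_LIB is an assumed variable name pointing at the profiling library.
run_with_profiler() {
    prof_lib="${PROF_LIB:-$PWD/prof.so}"
    # Set LD_PRELOAD only for the launched command, and keep any
    # libraries that were already listed in LD_PRELOAD.
    LD_PRELOAD="${prof_lib}${LD_PRELOAD:+:$LD_PRELOAD}" "$@"
}

# Example use with a hypothetical binary ./my_app:
#
#   rm -f output.cprof
#   run_with_profiler ./my_app --some-flag
```

Setting the variable on the command line, rather than exporting it, keeps the preload from leaking into the rest of the shell session.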