Short Introduction to Debugging/Profiling Tools
George Markomanolis
20 May 2015
Outline
! Profiling – Cray tools
  ! Perftools-lite
  ! Apprentice2
  ! Reveal
! Debugging
  ! LGDB
Performance Analysis
! Why performance analysis?
  ! Investigate the bottlenecks of an application
  ! Identify potential improvements
  ! Make better use of the hardware
! Profiling
  ! Sampling
    ! Lightweight
    ! Overhead depends on the sampling frequency
    ! Can lack resolution for small, short-running function calls
  ! Event tracing
    ! Detailed information
    ! Captures every event
    ! Can capture communication events
    ! Drawbacks: overhead and large amounts of data
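The trade-off between sampling and event tracing can be illustrated with a toy simulation. This is only a sketch, not how CrayPat works internally: the timeline, the function names `solve` and `tiny`, and the durations are all made up for illustration.

```python
# Hypothetical timeline: 1000 iterations of a long kernel followed by a short helper.
# Durations are in microseconds and are invented for this example.
def build_timeline():
    timeline = []
    for _ in range(1000):
        timeline.append(("solve", 950))  # long-running kernel
        timeline.append(("tiny", 50))    # short helper call
    return timeline


def trace_profile(timeline):
    """Event tracing: record every call, so times and call counts are exact."""
    total = {}
    calls = {}
    for name, duration in timeline:
        total[name] = total.get(name, 0) + duration
        calls[name] = calls.get(name, 0) + 1
    return total, calls


def sample_profile(timeline, period):
    """Sampling: note which function is running every `period` microseconds."""
    samples = {}
    next_sample = 0
    clock = 0
    for name, duration in timeline:
        end = clock + duration
        while next_sample < end:
            samples[name] = samples.get(name, 0) + 1
            next_sample += period
        clock = end
    return samples


if __name__ == "__main__":
    timeline = build_timeline()
    total, calls = trace_profile(timeline)
    print("tracing :", total, calls)                    # 2000 events recorded
    print("coarse  :", sample_profile(timeline, 1000))  # 1000 samples, "tiny" never seen
    print("fine    :", sample_profile(timeline, 10))    # 100000 samples, "tiny" visible
```

With the coarse period the short function is never observed, which is the loss of resolution mentioned above; the fine period recovers it at the cost of 100 times more samples, and tracing is exact but records every single call.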
CrayPat overview
! Assists the user with application performance analysis and optimization
! Provides concrete suggestions instead of just reporting
! Basic functionality works with all compilers on the system
! Requires no source code or Makefile modification (in most cases)
Three steps of CrayPat
! Instrumentation
  ! Use pat_build to apply instrumentation to program binaries
! Data collection
  ! Via execution
! Analysis: Sampling/Tracing
  ! Use tools pat_report, Cray Apprentice2, Reveal
! Automatic Performance Analysis (APA) combines the two approaches (sampling and tracing)
! Loop profiling is a special flavor of event tracing
CrayPat – lite I
! Provides automatic application performance statistics at the end of a job
! Usage for NPB/LU
  ! vim config/make.def
    ! MPIF77 = ftn
  ! module load perftools-lite
  ! make clean
  ! Compile LU benchmark, class C for 64 MPI processes
    ! make LU NPROCS=64 CLASS=C
  ! sbatch execute_lu.sh
! Two files, with extensions .rpt and .ap2, are created
CrayPat – lite II
Table 1: Profile by Function Group and Function (top 10 functions shown)
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | PE=HIDE
100.0% | 1715.7 | -- | -- |Total
|--------------------------------------------
| 85.8% | 1472.3 | -- | -- |USER
|| 39.0% | 668.7 | 56.3 | 7.9% |rhs_
|| 10.0% | 171.7 | 25.3 | 13.0% |buts_
|| 9.6% | 165.0 | 22.0 | 12.0% |jacld_
|| 9.6% | 163.9 | 21.1 | 11.6% |blts_
|| 9.4% | 161.9 | 23.1 | 12.7% |jacu_
|| 3.7% | 63.7 | 27.3 | 30.5% |ssor_
|| 3.2% | 54.5 | 31.5 | 37.2% |exchange_3_
||===========================================
| 14.2% | 243.1 | -- | -- |MPI
||-------------------------------------------
|| 6.9% | 119.1 | 118.9 | 50.8% |MPI_RECV
|| 4.1% | 69.8 | 64.2 | 48.7% |mpi_bcast
|| 1.5% | 26.5 | 102.5 | 80.7% |mpi_wait
|============================================
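The imbalance columns in Table 1 can be checked by hand. The sketch below assumes that Imb. Samp is the maximum per-PE sample count minus the average, and that Imb. Samp% is that difference relative to the maximum, scaled by N/(N-1) for N PEs. This is an assumption about the definition rather than something stated on the slide, but it reproduces the values shown for the 64-process run. The function name `imbalance` is illustrative, and the maximum is reconstructed as Samp + Imb. Samp.

```python
def imbalance(avg_samples, max_samples, num_pes):
    """Return (Imb. Samp, Imb. Samp%) under the assumed definition.

    Imb. Samp  = max - avg
    Imb. Samp% = 100 * (max - avg) / max * N / (N - 1)
    """
    imb_samp = max_samples - avg_samples
    imb_pct = 100.0 * imb_samp / max_samples * num_pes / (num_pes - 1)
    return imb_samp, imb_pct


if __name__ == "__main__":
    # (function, average samples, reconstructed maximum) taken from Table 1.
    rows = [
        ("rhs_", 668.7, 725.0),
        ("MPI_RECV", 119.1, 238.0),
        ("mpi_wait", 26.5, 129.0),
    ]
    for name, avg, peak in rows:
        imb_samp, imb_pct = imbalance(avg, peak, num_pes=64)
        print(f"{name:10s} Imb. Samp = {imb_samp:6.1f}  Imb. Samp% = {imb_pct:5.1f}%")
```

Read this way, a low value such as the 7.9% for rhs_ means the work is spread fairly evenly, while the 80.7% for mpi_wait means a few ranks wait much longer than the average.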
CrayPat – lite III
Table 2: Profile by Group, Function, and Line
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | Source
| | | | Line
| | | | PE=HIDE
100.0% | 1715.7 | -- | -- |Total
|--------------------------------------------------------------------
| 85.8% | 1472.3 | -- | -- |USER
||-------------------------------------------------------------------
|| 39.0% | 668.7 | -- | -- |rhs_
3| | | | | NPB3.3.1/NPB3.3-MPI/LU/rhs.f
||||-----------------------------------------------------------------
4||| 2.7% | 45.5 | 19.5 | 30.4% |line.43
4||| 2.2% | 37.0 | 16.0 | 30.6% |line.96
4||| 1.7% | 28.5 | 15.5 | 35.7% |line.228
4||| 1.8% | 31.6 | 11.4 | 26.9% |line.246
4||| 3.8% | 65.0 | 17.0 | 21.1% |line.336
File: rhs.f, lines 39 - 47
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
               do m = 1, 5
                  rsd(m,i,j,k) = - frct(m,i,j,k)
               end do
            end do
         end do
      end do
CrayPat – lite with sample profiling, big case
! Big case: NAS Parallel Benchmarks LU, class E, 2048 MPI processes
! Overhead: 0.58%
! Better MPI mapping topology detected
MPI Grid Detection:
There appears to be point-to-point MPI communication in a 32 X 64
grid pattern. The 14.7% of the total execution time spent in MPI
functions might be reduced with a rank order that maximizes
communication between ranks on the same node. The effect of several
rank orders is estimated below.
A file named MPICH_RANK_ORDER.Grid was generated along with this
CrayPat – lite with sample profiling, big case
Rank Order    On-Node      On-Node      MPICH_RANK_REORDER_METHOD
              Bytes/PE     Bytes/PE%
                           of Total
                           Bytes/PE
Custom        5.981e+12    84.20%       3
SMP           4.614e+12    64.96%       1
RoundRobin    2.342e+12    32.98%       0
Fold          7.209e+10     1.01%       2
! A file named MPICH_RANK_ORDER.Grid has been created
! Execution time improved by 2.1%
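Why a grid-aware rank order keeps more traffic on the node can be seen with a small counting exercise. The sketch below is a toy model, not the CrayPat estimate: it counts nearest-neighbour links rather than bytes, and it assumes 32 ranks per node, a row-major numbering of the 32 X 64 grid, and a hypothetical custom placement that gives each node a 4 x 8 tile of the grid. None of these assumptions come from the slides.

```python
ROWS, COLS = 32, 64   # grid shape reported in the MPI Grid Detection message
RANKS_PER_NODE = 32   # assumption: 32 MPI ranks per node


def smp_node(row, col):
    """SMP-style placement: consecutive ranks (row-major) fill node after node."""
    rank = row * COLS + col
    return rank // RANKS_PER_NODE


def tile_node(row, col, tile_rows=4, tile_cols=8):
    """Hypothetical custom placement: each node hosts a 4 x 8 tile of the grid."""
    tiles_per_row = COLS // tile_cols
    return (row // tile_rows) * tiles_per_row + (col // tile_cols)


def on_node_fraction(node_of):
    """Fraction of nearest-neighbour links whose two end points share a node."""
    on_node = 0
    total = 0
    for row in range(ROWS):
        for col in range(COLS):
            # Count each link once: look only at the right and lower neighbour.
            for d_row, d_col in ((0, 1), (1, 0)):
                n_row, n_col = row + d_row, col + d_col
                if n_row < ROWS and n_col < COLS:
                    total += 1
                    on_node += node_of(row, col) == node_of(n_row, n_col)
    return on_node / total


if __name__ == "__main__":
    print(f"SMP placement : {on_node_fraction(smp_node):.1%} of links on-node")
    print(f"Tile placement: {on_node_fraction(tile_node):.1%} of links on-node")
```

In this model the consecutive placement keeps only the links along a row on the node, while the tiled placement also keeps most of the links between rows, which is the effect the Custom rank order aims for.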
Apprentice2 - I
Apprentice2 - II
Apprentice2 - III
Reveal tool
! Compile your code with the Cray compiler to use the results with the Reveal tool
  ! MPIF77 = ftn -h profile_generate -h pl=npb_lu.pl -h noomp -h noacc
  ! module load perftools
  ! make LU NPROCS=64 CLASS=C
  ! pat_build -w lu.C.64
    ! New file is called lu.C.64+pat
  ! Execute the lu.C.64+pat executable
  ! pat_report -o lu_c_64.txt lu.C.64+XXX.xf
    ! A new file called lu_c_64.ap2 is created
  ! reveal /path/npb_lu.pl /path/lu_c_64.ap2
Reveal tool I
KAUST King Abdullah University of Science and Technology
Reveal tool II
Reveal tool III
Reveal tool IV
Debugging – LGDB
! LGDB is a line-mode parallel debugger for Cray systems
! Usage: module load cray-lgdb
! Binaries should be compiled with -g or -Gfast
! Offers many GDB features, plus extensions for handling parallel processes
ftn -g -o exec exec.f
salloc
module load cray-lgdb
lgdb
launch $pset{8} ./exec
break exec.f:3
continue
print $pset::myRank
pset[0]: 0
…
pset[7]: 7
! Other tools are available, such as TotalView and DDT
Conclusions
! Many tools can help you gain insight into your application
! Perftools-lite is straightforward for a new user
! Potential to port code from a serial or MPI version to OpenMP or hybrid (MPI + OpenMP), respectively
! Take advantage of the tools
Thank you!