ADVANCED PYTHON
Imad Kissami1
1 Mohammed VI Polytechnic University, Benguerir, Morocco
October 12, 2023
OUTLINE
• Why HPC?
• What’s a supercomputer?
• Data locality
• How to make Python Faster
OUTLINE
• The flood of Data
• Big data problem
• What’s HPC?
• Typical HPC workloads
• Data Analytics Process
THE FLOOD OF DATA
In 2021
• Internet user ∼ 1.9 GB per day
• Self driving car ∼ 4 TB per day
• Connected airplane ∼ 5 TB per day
• Smart factory ∼ 1 PB per day
• Cloud video providers ∼ 750 PB per day
THE FLOOD OF DATA
A self-driving car
• Radar ∼ 10 − 100 KB per second
• Sonar ∼ 10 − 100 KB per second
• GPS ∼ 50 KB per second
• Lidar ∼ 10 − 70 MB per second
• Cameras ∼ 20 − 40 MB per second
• 1 car requires ∼ 5 exaFLOP of compute per hour
BIG DATA PROBLEM
Too much data; not enough compute power, storage, or infrastructure
WHAT'S HPC?
Leveraging distributed compute resources to solve complex
problems
• Terabytes → Petabytes → Zettabytes of data
• Results in minutes to hours instead of days or weeks
TYPICAL HPC WORKLOADS
* Source:
https://www.xilinx.com/applications/data-center/high-performance-computing.html
DATA ANALYTICS PROCESS
Inspecting, cleaning, transforming, and modeling data → decision-making.
SUMMARY
• Larger datasets require distributed computing
• Several open source HPC frameworks available
OUTLINE
• A brief introduction on hardware
• Modern supercomputers
A BRIEF INTRODUCTION ON HARDWARE
Modern architecture (CPU)
A BRIEF INTRODUCTION ON HARDWARE
Moore's Law
• Number of transistors: from 37.5 million (2000) to 50 billion (2022)
• CPU speed: from 1.3 GHz to 3.4 GHz
A BRIEF INTRODUCTION ON HARDWARE
CPU vs RAM speeds
A BRIEF INTRODUCTION ON HARDWARE
Common Processors
Processor | Launched | Nb. of cores | Freq. (GHz)
Xeon Platinum 9282 (formerly Cascade Lake) | 2019-Q2 | 28 | 2.6-3.8
Xeon Platinum 8376H (formerly Cooper Lake) | 2019-Q2 | 28 | 2.6-4.3
i9-12900H (Mobile, 12th generation) | 2022-Q1 | 4-16 | 3.8-5.0
i9-12900KS (Desktop, formerly Alder Lake) | 2022-Q1 | 8-16 | 2.5-5.5
Table: Some Intel processors
Processor | L3 cache | Nb. of cores | Freq. (GHz)
AMD EPYC 7773X | 768 MB | 64 | 2.2-3.5
AMD EPYC 7763 | 256 MB | 64 | 2.45-3.5
AMD Ryzen 9 5950X (Desktop) | 72 MB | 16-32 | 3.4-4.9
AMD Ryzen 9 3900X (Desktop) | 70 MB | 12-24 | 3.4-4.6
Table: Some AMD processors
MODERN SUPERCOMPUTERS
What is a supercomputer?
• CDC 6600: 1964 - three million calculations per second
• Summit: 2018 - 36,000 processors - 200 quadrillion calculations per second
• Frontier: 2022 - 8 million processors - AMD EPYC with 64 cores at up to 2 GHz - a quintillion calculations per second
• Toubkal: 2021 - 69,000 processors
MODERN SUPERCOMPUTERS
What is a supercomputer?
Frontier (USA)
MODERN SUPERCOMPUTERS
What is a supercomputer?
[Figure: cores on a chip share memory within a node; nodes are connected by a network to form a cluster]
MODERN SUPERCOMPUTERS
Top 500
• Cray 2: gigascale milestone in 1985
• Intel ASCI Red System: terascale in 1997
• IBM Roadrunner System: petascale in 2008
• Frontier: exascale in 2022
MODERN SUPERCOMPUTERS
Top 500 family system share evolution: November 2009, November 2011, November 2015, November 2017, June 2022
1 https://www.top500.org/statistics/list/
SUMMARY
• Highlights
• New architectures are available
• Supercomputers achieve Exascale
• Consequences for developers
• Writing dedicated codes
OUTLINE
• Some definitions
• FLOPS
• Frequency
• Memory Bandwidth
• Memory Latency
• Computational Intensity
• Two level memory model
SOME DEFINITIONS
FLOPS
Floating point operations per second (FLOPS or flop/second).
SOME DEFINITIONS
Frequency
Speed at which a processor or other component operates (Hz)
SOME DEFINITIONS
Memory Bandwidth
Rate at which data can be transferred between the CPU and the memory (bytes/second).
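A rough way to see this number in practice is to time a large array copy and divide the bytes moved by the elapsed time. The sketch below (array size chosen arbitrarily) is an illustration, not a rigorous benchmark:

```python
import time
import numpy as np

# Rough effective-bandwidth estimate: time a large array copy and divide
# the bytes moved by the elapsed time. A copy reads and writes every
# element, hence the factor of 2. The array size (~80 MB) is arbitrary.
a = np.ones(10_000_000, dtype=np.float64)
t0 = time.perf_counter()
b = a.copy()
elapsed = time.perf_counter() - t0
print(f"effective bandwidth ~ {2 * a.nbytes / elapsed / 1e9:.1f} GB/s")
```

The reported figure mixes cache and RAM traffic, so it only approximates the RAM bandwidth defined above.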
SOME DEFINITIONS
Memory Latency
Time delay between a processor requesting data from memory and the moment that
the data is available for use (clock cycles or time units).
COMPUTATIONAL INTENSITY
Algorithms have two costs (measured in time or energy):
• Arithmetic (FLOPs)
• Communication: moving data between
- levels of a memory hierarchy (sequential case)
- processors over a network (parallel case)
Computational Intensity
The ratio of arithmetic cost (number of operations performed) to memory cost (number of words moved).
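For instance, an n-element dot product performs about 2n operations while moving about 2n words, giving an intensity of roughly 1. The helper below is purely illustrative:

```python
# Counting costs for an n-element dot product (illustrative helper):
# 2n flops (n multiplies + n adds) against 2n words moved (each x[i]
# and y[i] is read once), so the intensity is ~1.
def dot_product_intensity(n):
    flops = 2 * n
    words_moved = 2 * n
    return flops / words_moved

print(dot_product_intensity(1_000_000))  # -> 1.0
```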
TWO LEVEL MEMORY MODEL
Modern architecture (CPU)
Typical sizes
• RAM ∼ 4 GB − 128 GB (even higher on servers)
• L3 ∼ 4 MB − 50 MB
• L2 ∼ 256 KB − 8 MB
→ Holds data that is likely to be accessed by the CPU
• L1 ∼ 256 KB
Cache hit or miss → instructions and data
• Cache hit: the CPU finds the data in the L1/L2/L3 cache
• Cache miss: the data is not in L1/L2/L3 and must be retrieved from RAM
MATRIX MULTIPLICATION: THREE NESTED LOOPS
1 for i in range(0, n):
2     # read row i of A into fast memory
3     for j in range(0, n):
4         # read C[i,j] into fast memory
5         # read column j of B into fast memory
6         for k in range(0, n):
7             C[i,j] = C[i,j] + A[i,k]*B[k,j]
8         # write C[i,j] back to slow memory
arithmetic cost: n**3 * (ADD + MUL) = 2n**3 arithmetic operations
memory cost: n**3 reads (B) + n**2 reads (A) + n**2 (read + write) (C) = n**3 + 3n**2 words
computational intensity: 2n**3 / (n**3 + 3n**2) ≈ 2
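As a sanity check, the closed-form counts above can be evaluated for growing n; the intensity approaches 2:

```python
# Evaluate the operation and memory-traffic counts derived above:
# B columns contribute n^3 reads, A rows n^2 reads, C 2n^2 reads+writes.
# The resulting computational intensity tends to 2 as n grows.
def mxm_intensity(n):
    flops = 2 * n**3
    words = n**3 + 3 * n**2
    return flops / words

for n in (10, 100, 1000):
    print(n, mxm_intensity(n))
```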
SUMMARY
• Running time of an algorithm is sum of 3 terms:
- N_flops * time_per_flop
- N_words / bandwidth
- N_messages * latency
→ Communication-avoiding algorithms come with significant speedups
• Some examples
- Up to 12x faster for 2.5D matmul on 64K core IBM BG/P
- Up to 3x faster for tensor contractions on 2K core Cray XE/6
- Up to 6.2x faster for All-Pairs-Shortest-Path on 24K core Cray CE6
OUTLINE
• Data Locality
- The Penalty of Stride
- High Dimensional Arrays
• Block Matrix Multiplication
DATA LOCALITY
• Data locality is key for improving per-core
performance,
• Memory hierarchy has 4 levels,
• Processor looks for needed data in memory
hierarchy,
• Simple or complex manipulations can increase
speedup,
• Blocking version of mxm can increase
computational intensity.
DATA LOCALITY
The Penalty of Stride > 1?
• Data should be arranged for unit stride access,
• Not doing so can result in a severe performance penalty
Example:
1 do i = 1, N*i_stride, i_stride
2     mean = mean + a(i)
3 end do
• Compiled with all optimization and vectorization disabled (-O0)
• Compiled with -O2, which activates some optimizations
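The same effect can be seen from Python with NumPy. The sketch below (array size and stride values are arbitrary illustration choices) sums strided views of one array:

```python
import timeit
import numpy as np

a = np.arange(10_000_000, dtype=np.float64)

# A strided view touches 1/stride of the elements, but each element it
# touches still drags a full cache line from memory, so the time *per
# element* grows with the stride.
for stride in (1, 4, 16, 64):
    n = len(a[::stride])
    t = timeit.timeit(lambda s=stride: a[::s].sum(), number=20)
    print(f"stride={stride:3d}  elements={n:8d}  time/element={t / (20 * n):.2e} s")
```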
DATA LOCALITY
The Penalty of Stride: CPU time
DATA LOCALITY
High Dimensional Arrays
• High-dimensional arrays are stored as a contiguous sequence of elements
→ Fortran uses column-major ordering
→ C uses row-major ordering
mxm in Fortran, N = 1000
• Naive version: CPU time 1660.6 (msec)
• Transpose version: CPU time 1139.8 (msec)
BLOCK MATRIX MULTIPLICATION
mxm example: Using block version (cache optimization)
1 for ii in range(0, n, nb):
2     for jj in range(0, n, nb):
3         for kk in range(0, n, nb):
4             for i in range(ii, min(ii+nb, n)):
5                 for j in range(jj, min(jj+nb, n)):
6                     for k in range(kk, min(kk+nb, n)):
7                         c[i][j] += a[i][k] * b[k][j]
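A quick way to check that a blocked loop nest computes the same product is to compare it against NumPy's matmul. In this sketch the three innermost scalar loops are replaced by a NumPy block product so it also runs in reasonable time (sizes and block size are arbitrary):

```python
import numpy as np

def matmul_blocked(a, b, nb=64):
    # Blocked multiply following the loop nest above; slices past the end
    # of an axis truncate automatically, which handles the edge blocks.
    n = a.shape[0]
    c = np.zeros((n, n))
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                c[ii:ii+nb, jj:jj+nb] += a[ii:ii+nb, kk:kk+nb] @ b[kk:kk+nb, jj:jj+nb]
    return c

rng = np.random.default_rng(0)
a = rng.random((200, 200))
b = rng.random((200, 200))
assert np.allclose(matmul_blocked(a, b), a @ b)
```

The blocked and unblocked versions accumulate in different orders, hence the tolerance-based comparison rather than exact equality.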
BLOCK MATRIX MULTIPLICATION
mxm block version: CPU time & Bandwidth
SUMMARY
• Access contiguous, stride-one
memory addresses
• Emphasize cache reuse
• Use data structures that improve
locality
• Minimize communication across
different memory levels
• Use parallelism to improve locality
OUTLINE
• About Python
• Python is slow!
• Profiling a Python code
ABOUT PYTHON
• Python was created by Guido van Rossum in 1991 (latest version 3.11, released 24/10/2022)
• Python is simple
• Python is fully featured
• Python is readable
• Python is extensible
• Python is ubiquitous, portable, and free
• Python has many third party libraries, tools, and a large community
→ But Python is slow!
→ When does it really matter?
PYTHON IS SLOW
When does it matter?
• Is my code fast?
• How many CPU hours?
• Problems on the system?
• How much effort is it to make it run faster?
PROFILING A PYTHON CODE: WHY?
• Code bottlenecks
• "Premature optimization is the root of all evil." (D. Knuth)
• "First make it work. Then make it right. Then make it fast." (K. Beck)
• How?
PROFILING A PYTHON CODE: PROFILERS
• Deterministic vs. statistical profiling
- a deterministic profiler monitors all events (every call and return)
- a statistical profiler samples the call stack at time intervals
• The level at which resources are measured: module, function, or line level
• Profile viewers
PROFILING A PYTHON CODE: TOOLS
• Inbuilt timing modules
• profile and cProfile
• pstats
• line_profiler
• snakeviz
PROFILING A PYTHON CODE: USE CASE
1 def linspace(start, stop, n):
2     step = float(stop - start) / (n - 1)
3     return [start + i*step for i in range(n)]
4
5 def mandel(c, maxiter):
6     z = c
7     for n in range(maxiter):
8         if abs(z) > 2:
9             return n
10         z = z*z + c
11     return n
12
13 def mandel_set(xmin=-2.0, xmax=0.5, ymin=-1.25, ymax=1.25,
14                width=1000, height=1000, maxiter=80):
15     r = linspace(xmin, xmax, width)
16     i = linspace(ymin, ymax, height)
17     n = [[0]*width for _ in range(height)]
18     for x in range(width):
19         for y in range(height):
20             n[y][x] = mandel(complex(r[x], i[y]), maxiter)
21     return n
PROFILING A PYTHON CODE: TIMEIT
The very naive way
1 import timeit
2
3 start_time = timeit.default_timer()
4 mandel_set()
5 end_time = timeit.default_timer()
6 # Time taken in seconds
7 elapsed_time = end_time - start_time
8
9 print('> Elapsed time', elapsed_time)
or using the %timeit magic command
1 [In] %timeit mandel_set()
2 [Out] 3.01 s +/- 84.6 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
PROFILING A PYTHON CODE: PRUN
1 [In] %prun -s cumulative mandel_set()
which is, in console mode, equivalent to
1 python -m cProfile -s cumulative mandel.py
1 25214601 function calls in 5.151 seconds
2
3 Ordered by: cumulative time
4
5    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
6         1    0.000    0.000    5.151    5.151  {built-in method builtins.exec}
7         1    0.002    0.002    5.151    5.151  <string>:1(<module>)
8         1    0.291    0.291    5.149    5.149  <ipython-input-4-9421bc2016cb>:13(mandel_set)
9   1000000    3.461    0.000    4.849    0.000  <ipython-input-4-9421bc2016cb>:5(mandel)
10 24214592    1.388    0.000    1.388    0.000  {built-in method builtins.abs}
11        1    0.008    0.008    0.008    0.008  <ipython-input-4-9421bc2016cb>:17(<listcomp>)
12        2    0.000    0.000    0.000    0.000  <ipython-input-4-9421bc2016cb>:1(linspace)
13        2    0.000    0.000    0.000    0.000  <ipython-input-4-9421bc2016cb>:3(<listcomp>)
14        1    0.000    0.000    0.000    0.000  {method 'disable' of '_lsprof.Profiler' objects}
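Outside IPython, the same statistics can be collected and filtered programmatically with cProfile and pstats. This is a minimal sketch with a stand-in workload (`work` is a placeholder; in practice, profile `mandel_set()` the same way):

```python
import cProfile
import io
import pstats

def work():
    # Stand-in workload for illustration.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time and print only the top 5 entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Restricting the output with `print_stats(5)` is often more readable than the full table dumped by `python -m cProfile`.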
PROFILING A PYTHON CODE: LINE LEVEL
• Use the line_profiler package
1 [In] %load_ext line_profiler
2 [In] %lprun -f mandel mandel_set()
1 Timer unit: 1e-06 s
2 Total time: 12.4456 s
3 File: <ipython-input-2-9421bc2016cb>
4 Function: mandel at line 5
5 Line #      Hits       Time  Per Hit  % Time  Line Contents
6 ==============================================================
7      5                                        def mandel(c, maxiter):
8      6   1000000   250304.0      0.3     1.1      z = c
9      7  24463110  6337732.0      0.3    27.7      for n in range(maxiter):
10     8  24214592  8327289.0      0.3    36.5          if abs(z) > 2:
11     9    751482   201108.0      0.3     0.9              return n
12    10  23463110  7658255.0      0.3    33.5          z = z*z + c
13    11    248518    65444.0      0.3     0.3      return n
PROFILING A PYTHON CODE: LINE LEVEL
This can be done in console mode as well
1 @profile
2 def mandel(c, maxiter):
3     z = c
4     for n in range(maxiter):
5         if abs(z) > 2:
6             return n
7         z = z*z + c
8     return n
Then on the command line
1 kernprof -l -v mandel.py
Then
1 python3 -m line_profiler mandel.py.lprof
PROFILING A PYTHON CODE: MEMORY
• Use the memory_profiler package
1 [In] %load_ext memory_profiler
2 [In] %mprun -f mandel mandel_set()
1 Line # Mem usage Increment Occurrences Line Contents
2 =============================================================
3 8 118.2 MiB -39057.7 MiB 1000000 def mandel(c, maxiter):
4 9 118.2 MiB -39175.5 MiB 1000000 z = c
5 10 118.2 MiB -293081.8 MiB 24463110 for n in range(maxiter):
6 11 118.2 MiB -292425.7 MiB 24214592 if abs(z) > 2:
7 12 118.2 MiB -38519.6 MiB 751482 return n
8 13 118.2 MiB -253906.1 MiB 23463110 z = z*z + c
9 14 118.2 MiB -656.4 MiB 248518 return n
PROFILING A PYTHON CODE: MEMORY
• Use the memory_profiler package
1 @profile
2 def mandel(c, maxiter):
3     z = c
4     for n in range(maxiter):
5         if abs(z) > 2:
6             return n
7         z = z*z + c
8     return n
Then on the command line
1 mprof run mandel.py
Then
1 mprof plot
Or
1 python3 -m memory_profiler mandel.py
OUTLINE
• Accelerate a Python code
- Using Numpy
- Using Cython
- Using Numba
- Using Pyccel
• Some Benchmarks
ACCELERATE A PYTHON CODE: NUMPY
• Library for scientific computing in Python,
• High-performance multidimensional array object,
• Integrates C, C++, and Fortran codes in Python,
• Uses multithreading.
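The speedup comes mainly from pushing loops down into compiled code. A small comparison (array size chosen arbitrarily) of the same reduction written as a Python-level loop and as a single vectorized NumPy expression:

```python
import timeit
import numpy as np

n = 1_000_000
x = np.arange(n, dtype=np.float64)

# Same reduction, two ways: the Python-level loop pays interpreter
# overhead per element; the NumPy expression runs in compiled code.
t_loop = timeit.timeit(lambda: sum(v * v for v in x), number=1)
t_vec = timeit.timeit(lambda: float((x * x).sum()), number=1)
print(f"python loop: {t_loop:.3f} s   numpy: {t_vec:.5f} s")
```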
ACCELERATE A PYTHON CODE: NUMPY VS LISTS
1 import numpy, time
2
3 size = 1000000
4
5 print("Concatenation:")
6 list1 = [i for i in range(size)]; list2 = [i for i in range(size)]
7
8 array1 = numpy.arange(size); array2 = numpy.arange(size)
9
10 # List
11 initialTime = time.time()
12 list1 = list1 + list2
13 # calculating execution time
14 print("Time taken by Lists :", (time.time() - initialTime), "seconds")
15
16 # NumPy array
17 initialTime = time.time()
18 array = numpy.concatenate((array1, array2), axis=0)
19 # calculating execution time
20 print("Time taken by NumPy Arrays :", (time.time() - initialTime), "seconds")
1 Concatenation:
2 Time taken by Lists : 0.021048307418823242 seconds
3 Time taken by NumPy Arrays : 0.009451150894165039 seconds
ACCELERATE A PYTHON CODE: CYTHON
• Cython is an optimizing static compiler for:
• the Python programming language
• the Cython programming language (based on Pyrex)
• Cython gives you the combined power of Python and C.
ACCELERATE A PYTHON CODE: CYTHON
• Python
1 def mandelbrot(m, size, iterations):
2     for i in range(size):
3         for j in range(size):
4             c = -2 + 3./size*j + 1j*(1.5 - 3./size*i)
5             z = 0
6             for n in range(iterations):
7                 if np.abs(z) <= 10:
8                     z = z*z + c; m[i, j] = n
9                 else:
10                     break
ACCELERATE A PYTHON CODE: CYTHON
• Cython
1 def mandelbrot_cython(int[:,::1] m, int size, int iterations):
2     cdef int i, j, n
3     cdef complex z, c
4     for i in range(size):
5         for j in range(size):
6             c = -2 + 3./size*j + 1j*(1.5 - 3./size*i)
7             z = 0
8             for n in range(iterations):
9                 if z.real**2 + z.imag**2 <= 100:
10                     z = z*z + c; m[i, j] = n
11                 else:
12                     break
ACCELERATE A PYTHON CODE: CYTHON
• Execution time
1 %%timeit -n1 -r1
2 m = np.zeros(s, dtype=np.int32)
3 mandelbrot(m, size, iterations)
4 >> 12.2 s +/- 0 ns per loop (mean +/- std. dev. of 1 run, 1 loop each)
5
6
7 %%timeit -n1 -r1
8 m = np.zeros(s, dtype=np.int32)
9 mandelbrot_cython(m, size, iterations)
10 >> 29.8 ms +/- 0 ns per loop (mean +/- std. dev. of 1 run, 1 loop each)
ACCELERATE A PYTHON CODE: NUMBA
• Open-source just-in-time (JIT) compiler for Python functions.
• Uses the LLVM library as the compiler backend.
ACCELERATE A PYTHON CODE: NUMBA
• Python
1 import numpy as np
2
3 def do_sum():
4     acc = 0.
5     for i in range(10000000):
6         acc += np.sqrt(i)
7     return acc
• Numba
1 import numpy as np
2 from numba import njit
3
4 @njit
5 def do_sum_numba():
6     acc = 0.
7     for i in range(10000000):
8         acc += np.sqrt(i)
9     return acc
1 Time for Pure Python Function: 7.724030017852783
2 Time for Numba Function: 0.015453100204467773
ACCELERATE A PYTHON CODE: PYCCEL
• Pyccel is a static compiler for Python 3, using Fortran or C as a backend language.
• Python function:
1 import numpy as np
2
3 def do_sum_pyccel():
4     acc = 0.
5     for i in range(10000000):
6         acc += np.sqrt(i)
7     return acc
ACCELERATE A PYTHON CODE: PYCCEL (F90)
• Compilation using Fortran:
1 pyccel --language=fortran pyccel_example.py
1 module pyccel_example
2 use, intrinsic :: ISO_C_Binding, only : i64 => C_INT64_T, f64 => C_DOUBLE
3 implicit none
4 contains
5 ! ........................................
6 function do_sum_pyccel() result(acc)
7
8     implicit none
9     real(f64) :: acc
10     integer(i64) :: i
11     acc = 0.0_f64
12     do i = 0_i64, 9999999_i64, 1_i64
13         acc = acc + sqrt(Real(i, f64))
14     end do
15     return
16 end function do_sum_pyccel
17 ! ........................................
18 end module pyccel_example
1 Time for Pure Python Function: 7.400242328643799
2 Time for Pyccel Function: 0.01545262336730957
ACCELERATE A PYTHON CODE: PYCCEL (C)
• Compilation using C:
1 pyccel --language=c pyccel_example.py
1 #include "pyccel_example.h"
2 #include <stdlib.h>
3 #include <math.h>
4 #include <stdint.h>
5 /* ........................................ */
6 double do_sum_pyccel(void)
7 {
8     int64_t i;
9     double acc;
10     acc = 0.0;
11     for (i = 0; i < 10000000; i += 1)
12     {
13         acc += sqrt((double)(i));
14     }
15     return acc;
16 }
17 /* ........................................ */
SOME BENCHMARKS
Rosen-Der
Tool        | Python | Cython   | Numba   | Pythran  | Pyccel-gcc | Pyccel-intel
Timing (µs) | 229.85 | 2.06     | 4.73    | 2.07     | 0.98       | 0.64
Speedup     | -      | × 111.43 | × 48.57 | × 110.98 | × 232.94   | × 353.94
Black-Scholes
Tool        | Python | Cython | Numba   | Pythran | Pyccel-gcc | Pyccel-intel
Timing (µs) | 180.44 | 309.67 | 3.0     | 1.1     | 1.04       | 6.56 × 10^-2
Speedup     | -      | × 0.58 | × 60.06 | × 163.8 | × 172.35   | × 2748.71
Laplace
Tool        | Python | Cython | Numba        | Pythran      | Pyccel-gcc   | Pyccel-intel
Timing (µs) | 57.71  | 7.98   | 6.46 × 10^-2 | 6.28 × 10^-2 | 8.02 × 10^-2 | 2.81 × 10^-2
Speedup     | -      | × 7.22 | × 892.02     | × 918.56     | × 719.32     | × 2048.65