Intro to MPI using Python: Parallel Theory & MPI Overview
November 17, 2024
Presented by:
Nicholas A. Danes, PhD
Computational Scientist
Research Computing Group, Mines IT
Preliminaries
• HPC Experience (one of these):
• Know the basics of
• Linux Shell
• Python 3
• Scientific Computing
• Active HPC User
• Mines specific: Wendian, Mio
• Off-premise: Cloud, NSF Access, CU Boulder Alpine, etc.
• Previously taken our “Intro to HPC” workshop
• Offered once per semester
Review of Parallel vs Serial Computing
• When a program uses a single process (“task”) with 1 core
(“cpu”), we say it is a serial computing program.
• When a program uses multiple cores, we say it is a parallel
computing program.
• Typically, we try to optimize a serial computing program
before trying to write it in parallel
• For this workshop, we’ll assume the serial code is already in good shape and focus on the parallel side
Parallel Programming Models
• Shared vs Distributed Memory Programming
• Shared (e.g. OpenMP)
• All CPU cores have access to the same pool of memory (RAM)
• Typically, all CPU cores are on the same CPU node
• Ideal for multi-threaded loops
• Distributed-memory (e.g. MPI)
• Each CPU core is given access to a specific pool of memory, which may or may not be shared
• A “communicator” designates how each CPU core can talk to another CPU core
• CPU cores do not have to live on the same CPU node
[Figure: Shared memory parallelism (1 task, 4 threads): CPU cores #1-#4 all access one shared memory pool. Distributed memory parallelism (4 tasks, 1 thread per task): CPU cores #1-#4 each access their own memory partition (Part #1-#4).]
Overview of MPI
• MPI stands for Message Passing Interface, a standard for exchanging data (called
messages) between processes, provided as a library.
• Different libraries have implemented the MPI standard:
• OpenMPI
• MPICH
• Intel MPI
• Typically used with C, C++ and Fortran
• The objects that send and receive messages are separated by memory
• They can be entire CPU nodes, or CPU cores (or even a GPU!)
• Because each task’s memory is separate, a rank can in principle send messages anywhere, as long as
there is a layer of network communication connecting the endpoints
• MPI most commonly uses InfiniBand for node-to-node communication
• Intra-node communication goes through the node’s shared memory, depending on the CPU architecture
• Called the vader BTL in Open MPI
• There are many moving parts involving networking for MPI
• For more information: easybuild_tech_talks_01_OpenMPI_part2_20200708.pdf (open-mpi.org)
Heuristics for writing MPI Programs: Overview
• Typically, MPI programs take a single program, multiple data (SPMD)
model approach
• Single program: Encapsulate all desired functions and routines under one program
• Multiple data: The single program is duplicated with multiple copies of data, and each copy runs
on the system as its own process.
• Think about your largest data size and how it can be broken up into
smaller chunks
• The multiple processes can then communicate (i.e. share data) using MPI
library functions called from the user’s code
• MPI data communication steps should be brought to a minimum, as they
can slow down performance significantly.
Common Use Case #1: Perfectly Parallel
Computations
• Perfectly (or trivially) parallel programs are ones that do not require
any MPI communication functions
• MPI is still useful, since it allows the program to run across more than
one computer/compute node
• Examples include:
• Matrix/Vector Addition
• Markov Chain Monte Carlo (MCMC) Simulations
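As an illustration of the perfectly parallel case, here is a minimal mpi4py sketch (not from the workshop materials; the file name and the random-sampling workload are made up for illustration) in which every rank does independent work and never communicates:

```python
# perfectly_parallel.py - illustrative sketch: each rank does independent work
# with no MPI communication beyond knowing its own rank.
# Run with, e.g.: mpirun -n 4 python perfectly_parallel.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank draws its own random samples (think of one independent experiment
# per rank), seeded by its rank so the streams differ.
rng = np.random.default_rng(seed=rank)
samples = rng.normal(size=1_000_000)

# No communication is required: each rank just reports (or saves) its own result.
print(f"rank {rank}/{size}: sample mean = {samples.mean():.5f}")
```

Because no messages are exchanged, adding more ranks scales essentially linearly, limited only by how many independent pieces of work exist.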
Common Use Case #2: Domain Decomposition for
Partial Differential Equations
• Solving a spatial partial differential equation
• Domain is a 1-3D mesh with multiple grid points/cells that can be broken up using
domain decomposition.
• Each processor contains a subset of the domain’s mesh and solves the numerical
problem for the differential equation on that subdomain
• Derivatives in differential equations typically use finite difference/volume/element
approximations, which require knowing values of a function around the evaluated
grid point
• This can require data from other processors
• MPI can be used to send grid data on the edges of the decomposed domain to the
other processors
• Commonly referred to as “ghost” cells/nodes/volumes
• Popular frameworks handle tracking these grid points within the mesh object
• parMETIS, SCOTCH, PETSc, Ansys Fluent
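To make the ghost-cell idea concrete, here is a minimal sketch (illustrative only; the 1D grid, the local size, and the file name are assumptions, not part of the slides) of a halo exchange between neighboring ranks using mpi4py:

```python
# halo_exchange_1d.py - illustrative sketch of exchanging "ghost" cells
# between neighboring ranks for a 1D decomposed grid.
# Run with, e.g.: mpirun -n 4 python halo_exchange_1d.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

nlocal = 8                       # interior cells owned by this rank (assumed size)
u = np.zeros(nlocal + 2)         # +2 ghost cells, one on each side
u[1:-1] = rank                   # fill the interior with a rank-dependent value

# Neighbors; MPI.PROC_NULL turns the send/receive into a no-op at the boundaries.
left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send my leftmost interior cell to the left neighbor and receive the right
# neighbor's leftmost cell into my right ghost cell; then mirror the exchange.
comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)

print(f"rank {rank}: ghost values = ({u[0]}, {u[-1]})")
```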
Important MPI concepts
• Initialize – MPI must be explicitly started in the code
• Helps MPI identify what resources were requested
• Rank – How each process is labeled/tracked
• Common practice: # of ranks = # of CPU cores requested
• Other practices: 1 compute node per rank, 1 GPU card per rank
• Size – Total number of ranks
• In most MPI-only programs, size = number of processors requested
• Finalize – Close MPI within the program
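A minimal mpi4py sketch of these concepts (illustrative, not from the workshop files; note that mpi4py handles Initialize and Finalize automatically on import and exit):

```python
# hello_mpi.py - minimal sketch of initialize / rank / size / finalize.
# Run with, e.g.: mpirun -n 4 python hello_mpi.py
from mpi4py import MPI  # importing mpi4py initializes MPI automatically

comm = MPI.COMM_WORLD    # default communicator
rank = comm.Get_rank()   # this process's label, 0 .. size-1
size = comm.Get_size()   # total number of ranks

print(f"Hello from rank {rank} of {size}")
# MPI is finalized automatically when the interpreter exits
```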
Important MPI concepts
• Communicator – How ranks know their relation to others
• “MPI_COMM_WORLD” – Every rank knows every other rank
• “MPI_COMM_SELF” – Every rank knows itself
• Communication Types
• Point-to-Point – Synchronized MPI function between ranks
• Send/Receive – Every send must have a receive
• Calls can be blocking or non-blocking
• Collective – MPI function on all ranks
• Broadcast – One rank sends data to all other ranks
• Scatter – One rank sends a chunk of data to each rank
• Gather – One rank receives data from all other ranks
• One-sided
• Not covering this
[Figure: three ranks (0, 1, 2) fully connected under MPI_COMM_WORLD vs. each rank alone under MPI_COMM_SELF]
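The communication types above can be sketched in a few lines of mpi4py (an illustrative example, not from the workshop files; the data being passed is made up):

```python
# comm_patterns.py - illustrative sketch of point-to-point and collective calls,
# using mpi4py's lowercase (pickle-based) methods.
# Run with, e.g.: mpirun -n 4 python comm_patterns.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Point-to-point: every send must have a matching receive (blocking calls here).
if size > 1:
    if rank == 0:
        comm.send({"msg": "hello from 0"}, dest=1, tag=11)
    elif rank == 1:
        data = comm.recv(source=0, tag=11)
        print("rank 1 received:", data)

# Broadcast: rank 0 sends the same data to all other ranks.
params = {"dt": 0.01} if rank == 0 else None
params = comm.bcast(params, root=0)

# Scatter: rank 0 sends one chunk to each rank; Gather collects one value per rank.
chunks = list(range(size)) if rank == 0 else None
my_chunk = comm.scatter(chunks, root=0)
all_squares = comm.gather(my_chunk ** 2, root=0)

if rank == 0:
    print("gathered:", all_squares)
```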
MPI with Python: mpi4py
• mpi4py is a Python library that provides MPI bindings for Python in an object-oriented way,
modeled on the MPI-2 C++ bindings
• Supports various Python objects for communication:
• NumPy arrays (via the buffer interface)
• Pickled objects (lists, dictionaries, etc.)
• Documentation: https://mpi4py.readthedocs.io/en/stable/
• We will be using mpi4py for this entire workshop!
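A short sketch (illustrative, not part of the workshop files) of the two object types mentioned above: uppercase methods (Send/Recv) for buffer-like NumPy arrays, lowercase methods (send/recv) for pickled Python objects:

```python
# buffers_vs_pickle.py - illustrative sketch of mpi4py's two communication styles.
# Run with, e.g.: mpirun -n 2 python buffers_vs_pickle.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_size() < 2:
    raise SystemExit("This sketch needs at least 2 ranks")
rank = comm.Get_rank()

if rank == 0:
    # Buffer interface: fast, no pickling, requires a pre-allocated receive buffer.
    a = np.arange(10, dtype="d")
    comm.Send([a, MPI.DOUBLE], dest=1, tag=1)
    # Pickle interface: works for lists, dicts, etc., at some serialization cost.
    comm.send({"label": "metadata", "n": 10}, dest=1, tag=2)
elif rank == 1:
    a = np.empty(10, dtype="d")
    comm.Recv([a, MPI.DOUBLE], source=0, tag=1)
    meta = comm.recv(source=0, tag=2)
    print("rank 1 got array sum", a.sum(), "and metadata", meta)
```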
mpi4py vs Other Parallel Python Options
• mpi4py alternatives – These also implement the MPI standard in Python
• PyPar: https://github.com/daleroberts/pypar
• Scientific Python: https://github.com/khinsen/ScientificPython/
• pyMPI: https://sourceforge.net/projects/pympi/
• mpi4py.futures – part of mpi4py; see the mpi4py documentation
• Based on the standard library’s concurrent.futures worker pools; mpi4py.futures lets the worker pool span multiple
nodes.
• Multiprocessing – spawns multiple processes (called workers) which can distribute
work for a function
• Easier to implement, but limited to single machine/node
• There are some communication options: see the multiprocessing documentation
(https://docs.python.org/3/library/multiprocessing.html)
• Dask – Provides a full parallel job scheduler framework in Python
• More high-level and communication is more implicit
• Task-scheduling and works well with Jupyter Notebooks
• Can be used in combination with MPI (Dask-MPI)
• More details: https://www.dask.org/
Lab #1 (15-20 min):
1. Setting up mpi4py anaconda environment
2. Running our first programs
Today’s files:
/sw/examples/MPI_Workshop_Nov172024.tar.gz
Basic Parallel Computing Theory
• We use parallelization to improve performance of
scientific codes
• How do we measure that?
• Can we predict performance based on various factors?
• Serial performance
• Hardware
• Problem size
• Can we determine how the problem scales as we increase
compute resources?
Measuring Parallel Performance
• $P$ – Number of processors (“ranks”)
• $n$ – Problem size (e.g. $n$ is the number of mesh cells)
• $T_{P,\max}$ – Max wall time with $P$ processors
• $T_{P,\mathrm{avg}}$ – Average wall time across $P$ processors
• $T_{P,m}$ – Wall time from the $m$-th out of $P$ processors
• $S_P$ – Speedup with $P$ processors
• $E_P$ – Efficiency with $P$ processors
• $\beta_P$ – Load balance with $P$ processors
Speed-up, Efficiency, & Load-Balancing
• Speed-up: the ratio of the serial wall time to the parallel (with $P$ processors) wall time:
$$S_P = \frac{T_{1,\max}}{T_{P,\max}}$$
• When $S_P = P$, the speed-up is ideal.
• Efficiency:
$$E_P = \frac{S_P}{P}$$
• When $E_P = 1$, the efficiency is ideal.
• Load-balancing:
$$\beta_P = \frac{T_{P,\mathrm{avg}}}{T_{P,\max}}$$
• When $\beta_P = 1$, the load balance is ideal.
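These three metrics are easy to compute from measured wall times; a small sketch (illustrative, with made-up timings) follows:

```python
# scaling_metrics.py - illustrative sketch computing the quantities defined above
# from per-rank wall times.
def scaling_metrics(serial_time, parallel_times):
    """serial_time: T_{1,max}; parallel_times: list of per-rank wall times T_{P,m}."""
    P = len(parallel_times)
    t_max = max(parallel_times)
    t_avg = sum(parallel_times) / P
    S = serial_time / t_max    # speedup S_P
    E = S / P                  # efficiency E_P
    beta = t_avg / t_max       # load balance beta_P
    return S, E, beta

# Hypothetical timings (seconds): a serial run and a 4-rank run.
print(scaling_metrics(100.0, [27.0, 26.5, 28.0, 27.5]))
```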
Basic Parallel Computing Theory: Amdahl’s Law
• In 1967, Gene Amdahl proposed a way to predict how much a code
can scale due to a serial bottleneck [3].
• Amdahl’s Law can be summarized with the following equation
relating to speedup:
$$S_{P,\mathrm{Am}} = \frac{1}{F_s + \frac{1 - F_s}{P}}$$
where $F_s$ is the theoretical serial fraction, the proportion of the
runtime of a code that is run with only 1 processor.
Basic Parallel Computing Theory: Amdahl’s Law
• Amdahl’s law shows a severe constraint on parallel scalability if a large portion of your code runs in serial.
• The plot on the right shows Amdahl’s Law with $P = 1024$ processors:
• If the serial fraction is about 0.5% of the runtime, then we see about a 167x speedup, implying 167/1024 ≈ 16.3% parallel efficiency.
• If the serial fraction is about 10% of the runtime, then the speedup drops to about 10, i.e. 10/1024 ≈ 0.97% parallel efficiency (both data points are reproduced in the sketch below).
• Main takeaway: Amdahl’s Law states that
minimizing the time a code spends in serial is
crucial for scaling up your parallel program.
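A quick sketch (illustrative) that reproduces the two data points quoted above from Amdahl's formula:

```python
# amdahl.py - illustrative sketch evaluating Amdahl's Law for P = 1024.
def amdahl_speedup(serial_fraction, P):
    # S_{P,Am} = 1 / (F_s + (1 - F_s) / P)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / P)

for fs in (0.005, 0.10):
    S = amdahl_speedup(fs, 1024)
    print(f"F_s = {fs:.3f}: speedup ~ {S:.0f}, parallel efficiency ~ {S / 1024:.1%}")
```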
Amdahl’s Law Limitations
• Amdahl’s Law makes many assumptions about your compute
situation
• Doesn’t account for hardware limitations
• CPU configuration (cache, memory, etc)
• Disk performance (read/write speeds, etc)
• The fraction of the code spent in parallel could also depend on
the number of processors, i.e.
$$1 - F_s = F_P = F_P(P)$$
• It assumes that your problem size is fixed
• In practice, when performing a benchmark with an increasing number
of processors and a fixed problem size, we call this strong scaling.
Gustafson’s Law
• In response, John Gustafson argued that the
assumptions behind Amdahl’s Law were not
appropriate for all parallel workloads [4].
• In particular, the serial time spent by the processor
was not independent of the number of processors
• More processors used on a CPU means the
cores will compete for memory bandwidth
• Gustafson approximated the speedup by assuming the
parallel part of the program scales linearly with the
number of processors:
$$S_{P,\mathrm{Gu}} = P + (1 - P)F_s$$
• This equation is often referred to as scaled speedup.
• When the problem size is increased linearly with the number
of processors, we call this weak scaling (the scaled speedup formula is evaluated in the sketch below).
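A companion sketch (illustrative) evaluating the scaled speedup formula for the same serial fractions used in the Amdahl example:

```python
# gustafson.py - illustrative sketch evaluating Gustafson's scaled speedup.
def gustafson_speedup(serial_fraction, P):
    # S_{P,Gu} = P + (1 - P) * F_s
    return P + (1 - P) * serial_fraction

for fs in (0.005, 0.10):
    print(f"F_s = {fs:.3f}, P = 1024: scaled speedup ~ {gustafson_speedup(fs, 1024):.0f}")
```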
Lab #2 (15-20 min):
Calculating pi in parallel using Leibniz’s
formula
Today’s files:
/sw/examples/MPI_Workshop_Nov172024.tar.gz
References
• [1] https://www.cs.uky.edu/~jzhang/CS621/chapter7.pdf
• [2] https://www.youtube.com/watch?v=pDBIoil-LTk
• [3] https://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
• [4] Gustafson, John L. “Reevaluating Amdahl’s law.”
Communications of the ACM 31, no. 5 (1988): 532-533:
http://www.johngustafson.net/pubs/pub13/amdahl.htm
• [5] https://xlinux.nist.gov/dads/HTML/singleprogrm.html