PARALLEL COMPUTING IN CFD
Milovan Perić
CoMeT Continuum Mechanics Technologies GmbH
&
Institute of Ship Technology, Ocean Engineering and Transport
Systems, University of Duisburg-Essen
Introduction
• In the 1970s, mainframe computers with vector processors were
the fastest machines…
• However, many algorithms were difficult or impossible to
vectorize, so other ways of speeding up the computations
in a scalable manner were sought…
• Parallel computing was one option – and it has become the
dominant mode of computing today (even cell phones
have quad-core processors nowadays…).
• Parallel CFD started in the early 1980s, using experimental parallel
computers (there was nothing on the market yet).
• University of Erlangen in Germany had one such computer
called DIRMU (Distributed Reconfigurable Multiprocessor).
History of Parallel CFD – I
• DIRMU had 24 processors, and each could read from the private
memory of 7 of its neighbors.
• It was thus suitable for CFD on structured grids using domain
decomposition, where in 3D each grid block would have 6
neighbor blocks…
• Unfortunately, its processors quickly became outdated and there
was no successor model…
[Figure: DIRMU parallel computer, exhibited in the Regional Computer Centre Erlangen]
History of Parallel CFD – II
• In 1985, Germany launched the SUPRENUM project (supercomputer
for numerical applications), but it was not continued after the
first phase…
• More successful were the private companies Parsytec (Germany)
and Meiko (UK): they built and sold many systems based on
special processors called “Transputers”.
• Transputer systems died out when the latest processor model,
the T9000, failed to meet its specifications…
• Several other companies that built parallel computers (Kendall
Square Research and Thinking Machines Corporation, USA) also
no longer exist…
• The breakthrough came with clusters built from the standard
processors used in PCs; nowadays the largest clusters are
built by companies that also make PCs (Dell, IBM, HP…).
History of Parallel CFD – III
• In the early days, porting codes from serial to parallel computers
was a tedious job…
• Several communication libraries were introduced (TCGMSG,
PVM, MPI), designed to help programmers by hiding the low-level
communication code.
• Eventually, MPI (Message Passing Interface) became the de-facto
standard for inter-processor communication.
• On the hardware side, there were also several options, of which
only a few survived:
– Ethernet (in various flavors),
– InfiniBand.
• In the early 1990s, the European Union supported the parallelization
of commercial engineering software (including the CFD software STAR-CD).
Parallelization Concepts – I
• Parallelization at loop level is subject to Amdahl’s law and thus
not very efficient…
• Several other concepts were tried, but they do not reach
efficiency above 50%.
• The standard scalable approach in parallel CFD is based on
domain decomposition.
• The solution domain is split into contiguous subdomains and
each subdomain is assigned to one processor.
• In FV, subdomain boundaries correspond to CV-boundaries.
Each processor computes the solution in its subdomain.
• However, both the discretization and the solution process
require some data that is computed by neighbor processors
(cells next to subdomain boundary refer to one or two layers of
cells on the other side).
Parallelization Concepts – II
• The shared-memory concept allows access to such data, but
memory access becomes a bottleneck – not scalable…
• Distributed memory is the standard concept these days: each
processor has its private memory for data it computes and for a
copy of data it needs from neighbors…
• Data along subdomain boundaries is exchanged typically once
per inner iteration – this constitutes local communication.
• This communication is scalable: it takes place in parallel
between pairs of neighboring processors (see the sketch below).
• Local communication depends only on the number of neighbor
subdomains – not on the total number of processors.
• Global communication is also required – gathering of some
information by the master process and broadcasting it to all processors…
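• A minimal sketch of such a halo (ghost-cell) exchange, assuming MPI and, for brevity, a 1-D chain of subdomains; the names (phi, nlocal, exchange_halo) are illustrative and not taken from any particular code:

#include <mpi.h>

/* Each rank stores its interior values in phi[1..nlocal] and keeps copies of
 * one neighbor cell per side in the ghost cells phi[0] and phi[nlocal+1]. */
void exchange_halo(double *phi, int nlocal, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first interior value to the left neighbor while receiving the
     * right neighbor's first interior value into my right ghost cell ...    */
    MPI_Sendrecv(&phi[1],          1, MPI_DOUBLE, left,  0,
                 &phi[nlocal + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* ... and the mirror operation for the left ghost cell.                 */
    MPI_Sendrecv(&phi[nlocal],     1, MPI_DOUBLE, right, 1,
                 &phi[0],          1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

• The exchanges proceed in parallel between neighbor pairs; in 3D the same pattern is applied to the (typically 6) face-neighbor blocks, with whole boundary layers of cells packed into each message.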
Parallelization Concepts – III
• Examples of global communication (see the sketch below) are:
– Computation of residual norm to estimate iteration errors
(gathering of norms from subdomains);
– Broadcasting of convergence criterion decision;
– Computation of scalar products of two vectors (e.g. in conjugate-
gradient type solvers).
• Global communication is not perfectly scalable – the effort
grows as the total number of processors increases…
• Both local and global communication can often be overlapped
with computation that does not require the exchanged data
(if supported by the hardware).
• Optimizing the communication overhead requires re-writing
parts of the code (more details later)…
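• A sketch of the global operations listed above, again assuming MPI; the routine names are illustrative. Every rank contributes its local partial sum and an all-reduce combines them, so – unlike the halo exchange – all processors take part:

#include <math.h>
#include <mpi.h>

/* Global L2 norm of the residual, assembled from local partial sums. */
double residual_norm(const double *res, int nlocal, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; ++i)
        local += res[i] * res[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return sqrt(global);
}

/* Global scalar product of two vectors, as needed in conjugate-gradient
 * type solvers. */
double scalar_product(const double *a, const double *b, int nlocal, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; ++i)
        local += a[i] * b[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

• In this sketch MPI_Allreduce already delivers the result to every rank, so the convergence decision can be taken identically everywhere; a gather by the master followed by a broadcast is an equivalent alternative. Non-blocking variants (MPI_Iallreduce) allow such reductions to be overlapped with computation that does not need the result.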
Main Influencing Factors – I
• The main factors affecting the efficiency of communication
are:
– Latency (setup time for communication): needed to initialize
communication between two processors;
– Data-transfer rate (bandwidth of the communication channel);
– Amount of data to be transferred.
• The total efficiency depends on the ratio of communication to
computing time – thus processor computing speed is also
important.
• Mathematical models to estimate the efficiency of parallel
computing can be built when the algorithm and the above
parameters are known (a simple cost model is sketched below)…
• Some options may be invoked differently depending on the
hardware.
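• A small, self-contained model of the communication overhead built from the three factors above; all parameter values are placeholders that would have to be measured for the actual hardware and code:

#include <stdio.h>

int main(void)
{
    /* Hypothetical hardware/code parameters (to be measured in practice): */
    double t_lat  = 2.0e-6;   /* latency (setup time) per message [s]        */
    double beta   = 1.0e-10;  /* transfer time per byte (1/bandwidth) [s/B]  */
    double t_calc = 5.0e-7;   /* computing time per cell and inner iteration */
    int    nmsg   = 6;        /* messages per iteration (6 neighbor blocks)  */

    /* Estimate the efficiency for cubic subdomains of decreasing size,
     * i.e. for more and more processors working on a fixed grid.           */
    for (int m = 64; m >= 8; m /= 2) {
        long   ncells = (long)m * m * m;                  /* cells per subdomain */
        long   halo   = 6L * m * m;                       /* one layer per face  */
        double t_comm = nmsg * t_lat + halo * 8.0 * beta; /* 8 bytes per value   */
        double t_cpu  = ncells * t_calc;
        printf("%3d^3 cells/subdomain: parallel efficiency ~ %.4f\n",
               m, t_cpu / (t_cpu + t_comm));
    }
    return 0;
}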
Main Influencing Factors – II
• Explicit methods are easy to parallelize because the new solution is
computed using only old data…
• Data needs to be exchanged once per time step (after the solution
has been updated); there is no global communication (see the sketch below)…
• The sequence of operations and the solution are identical on one
and on many processors…
• However, even explicit pressure-based methods require the
solution of a Poisson-equation for pressure, so an equation
system needs to be solved – like in implicit methods.
• Implicit methods are more difficult to parallelize; they usually
require adaptations to the iteration matrix, so the sequence of
operations and the solution are not identical on one and on
many processors…
The number of iterations per time step may need to
be increased if too many processors are used!
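• A sketch of an explicit update (1-D diffusion standing in for a real scheme), assuming the exchange_halo() routine sketched earlier; the new values depend only on old data, so one halo exchange per time step suffices and no global communication is needed:

#include <mpi.h>

void exchange_halo(double *phi, int nlocal, MPI_Comm comm);  /* sketched earlier */

/* One explicit time step: exchange ghost values once, then purely local work. */
void explicit_step(double *phi, double *phi_new, int nlocal,
                   double alpha, MPI_Comm comm)
{
    exchange_halo(phi, nlocal, comm);             /* once per time step  */
    for (int i = 1; i <= nlocal; ++i)             /* purely local update */
        phi_new[i] = phi[i] + alpha * (phi[i-1] - 2.0 * phi[i] + phi[i+1]);
}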
Domain Decomposition in Space – I
• There are only a few solvers for linear equation systems which
run the same in parallel as in serial mode.
• The obvious one is the Jacobi method, but it is almost never used.
• The so-called “red-black” Gauss-Seidel method is another
solver that needs no adaptation on structured grids (see the sketch below).
• ILU-type solvers can in principle be parallelized so that they
execute the same sequence of operations, but only on
structured grids and in a not really scalable way…
• The conjugate-gradient method can also be parallelized
without modification, but only without pre-conditioning; also
almost never used in that form...
• All commonly used solvers run slightly differently on parallel
computers, depending on the number of subdomains.
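• A sketch of the red-black idea in 1-D (3-point stencil), assuming the exchange_halo() routine sketched earlier; within one color no cell depends on another cell of the same color, so each half-sweep is order-independent and the parallel result matches the serial one:

#include <mpi.h>

void exchange_halo(double *phi, int nlocal, MPI_Comm comm);  /* sketched earlier */

/* One red-black Gauss-Seidel sweep for -phi[i-1] + 2 phi[i] - phi[i+1] = b[i].
 * first_color is the global color (0 or 1) of local cell i = 1, so that the
 * coloring stays consistent across subdomain boundaries. */
void redblack_sweep(double *phi, const double *b, int nlocal,
                    int first_color, MPI_Comm comm)
{
    for (int color = 0; color < 2; ++color) {
        exchange_halo(phi, nlocal, comm);         /* refresh ghost values  */
        for (int i = 1; i <= nlocal; ++i)         /* update one color only */
            if ((first_color + i - 1) % 2 == color)
                phi[i] = 0.5 * (phi[i-1] + phi[i+1] + b[i]);
    }
}

• Note that the price for serial/parallel equivalence is one halo exchange per color, i.e. more communication per sweep than the lagged approach described on the following slides.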
Domain Decomposition in Space – II
• The usual modification of the iterative solution method is to
lag the data from neighbor subdomains by one iteration.
• Thus, at the mth iteration in subdomain i, the variable values from
neighbor subdomains that appear in the algebraic equations are taken
from iteration m-1 and treated as known.
• This corresponds to splitting the coefficient matrix A into
diagonal block-matrices Aii (which involve only variables from
subdomain i) and off-diagonal block-matrices Aij (which involve the
variables from subdomain j appearing in the equations of subdomain i).
• The iteration matrix M is also modified correspondingly; it
usually contains only the diagonal blocks Mii, so that the mth inner
iteration in subdomain i solves (with neighbor values lagged by one
iteration; a code sketch follows after the figure):

Mii (φi^m – φi^(m-1)) = bi – Σj Aij φj^(m-1)
Domain Decomposition in Space – III
[Figure: structure of the coefficient and iteration matrices for a single domain vs. two subdomains]
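• A sketch of the corresponding parallel inner iteration, in the same illustrative 1-D setting and using the exchange_halo() routine from before: the ghost values (neighbor data from iteration m-1) are refreshed once, then the local sweep treats them as known; here the local solver is a plain Gauss-Seidel pass, but any solver could be substituted:

#include <mpi.h>

void exchange_halo(double *phi, int nlocal, MPI_Comm comm);  /* sketched earlier */

/* One inner iteration on "my" subdomain: neighbor values enter only through
 * the ghost cells, i.e. only the diagonal block Aii is treated implicitly. */
void parallel_inner_iteration(double *phi, const double *b, int nlocal,
                              MPI_Comm comm)
{
    exchange_halo(phi, nlocal, comm);   /* neighbor data lagged by one iteration */
    for (int i = 1; i <= nlocal; ++i)   /* local Gauss-Seidel sweep              */
        phi[i] = 0.5 * (phi[i-1] + phi[i+1] + b[i]);
}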
Domain Decomposition in Space – IV
• This approach is generic – any iteration solver can easily be
adapted to it.
• Local communication is required after each inner iteration to
update the neighbor data stored in private memory.
• In the limit of each subdomain containing just one CV/node,
the solver would reduce to the Jacobi method.
• The number of CVs/nodes per subdomain is typically much
larger than the number of subdomains, so the degradation is usually modest…
• However, when the number of processors becomes very large,
the solver performance becomes much worse than on a
single processor.
• Multigrid methods help improve solver performance, but then
part of the work must be done on fewer processors…
Domain Decomposition in Time – I
• Implicit methods that perform outer iterations within a time
step offer another parallelization possibility – solving for
multiple time steps in parallel.
• Usually, one starts the computation of a new time step only when the
solution at the current time step is finished; all solutions from
previous time steps needed in the algorithm are then known.
• However, one can start computation for the new time step as
soon as the first outer iteration on the current time step is
finished (which provides the first estimate of the solution).
• We then have multiple processors operating on the same
spatial subdomain, but on different time levels.
Domain Decomposition in Time – II
• The equation solved at the mth outer iteration for time step
tn+1 then uses, as “old” values, the provisional solutions currently
available from the processors working on the preceding time steps
(k denotes the time level).

[Figure: structure of the global matrix equation when solving for 4 time steps in parallel]
Domain Decomposition in Time – III
• Local communication for time-parallelism involves one send
and one receive of the complete field per outer iteration
(coarse-grain communication; see the sketch below).
• Local communication takes place in parallel between pairs of
processors.
• There is no global communication associated with time-
parallel computation.
• The use of provisional old data affects the convergence of the outer
iterations; if too many time steps are computed in parallel, the
number of required outer iterations increases.
• Rule of thumb: the number of time steps that can usefully be
executed in parallel is about half the number of outer iterations per
time step.
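• A sketch of the time-parallel exchange, with one MPI rank per time step for simplicity (in practice each time step is also decomposed in space); per outer iteration every rank passes its provisional field forward in time and receives the updated provisional old-time-level field. The field arrays and do_outer_iteration() are illustrative placeholders:

#include <mpi.h>

static void do_outer_iteration(double *phi, const double *phi_old, int n)
{
    /* Placeholder for one outer iteration of the implicit scheme for "my"
     * time step, using phi_old as the provisional old-time-level field.  */
    (void)phi; (void)phi_old; (void)n;
}

void time_parallel_outer_loop(double *phi, double *phi_old, int n,
                              int nouter, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int prev = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  /* earlier time step */
    int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;  /* later time step   */

    for (int m = 0; m < nouter; ++m) {
        /* Coarse-grain exchange: one send and one receive of the whole field
         * per outer iteration. On the first rank the receive is a no-op and
         * phi_old keeps the converged solution of the preceding window.     */
        MPI_Sendrecv(phi,     n, MPI_DOUBLE, next, m,
                     phi_old, n, MPI_DOUBLE, prev, m,
                     comm, MPI_STATUS_IGNORE);
        do_outer_iteration(phi, phi_old, n);
    }
}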
Efficiency of Parallel Computing – I
• The performance is measured by the speed-up Sn or the efficiency En:

Sn = Ts / Tn ,   En = Sn / n = Ts / (n Tn)

• Ts is the execution time of the best serial algorithm on one
processor (not the execution time of the parallel algorithm on 1 proc.!);
• Tn is the execution time of the parallel algorithm on n
processors.
• Ideal speed-up equals number of processors and ideal
efficiency is 1 (or 100%).
• Usually, the speed-up (or efficiency) is lower than ideal but,
depending on the hardware, one can sometimes obtain even higher values
(usually due to caching of data).
Efficiency of Parallel Computing – II
• Processors are usually synchronized at the beginning of each
iteration – there are thus idle times, because one iteration may
last longer on some processors than on others…
• Reasons for uneven load: unequal numbers of cells per
subdomain, different boundary conditions or local phenomena,
branches in the algorithm, different numbers of neighbor
subdomains…
• For a single processor, the computing time can be expressed as:

Ts = Ns tcalc ,

where Ns is the number of operations of the serial algorithm and tcalc
the average time per operation.
• For a parallel algorithm executed on n processors, the time on the most
heavily loaded processor is:

Tn = Nn tcalc + Tcom ,

where Nn is the number of operations it performs and Tcom is the
communication time which halts computation.
Efficiency of Parallel Computing – III
• The total efficiency can thus be re-written as the product of three factors:

Etot = Enum Epar Elb

• Enum is the numerical efficiency, which accounts for the higher
demand for computing operations by the parallelized
algorithm.
• Epar is the parallel efficiency, which accounts for
the time spent on communication, during
which computation has to be halted.
• Elb is the load-balancing efficiency, which accounts for
idle times due to uneven load.
Efficiency of Parallel Computing – IV
• When parallelization is performed in both space and time, the
overall efficiency is the product of spatial and temporal total
efficiencies.
• The total efficiency is easily obtained by measuring total
execution times on a single processor and on n processors.
• The parallel efficiency can be approximately determined by
measuring the execution times when the subdomains are of
equal size and the number of inner and outer iterations is fixed.
• To determine numerical efficiency, one needs either to count
operations or divide the total efficiency by the product of
parallel and load-balancing efficiency.
• Load-balancing efficiency can be estimated from the ratio of the
numbers of CVs per subdomain, but there are other effects that
can be substantial (a small worked example follows below)…
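• A small worked example, with made-up timings, of how the individual efficiencies could be extracted from such measurements; the way the measurements are combined (an equal-load run, a run with communication switched off or subtracted) is one possible instantiation of the procedure described above, and all numbers are purely hypothetical:

#include <stdio.h>

int main(void)
{
    int    n      = 64;      /* number of processors                          */
    double T_s    = 6400.0;  /* best serial algorithm on 1 processor [s]      */
    double T_n    = 125.0;   /* parallel algorithm on n processors [s]        */
    double T_n_eq = 118.0;   /* as T_n, but with equal-size subdomains and a
                                fixed number of inner/outer iterations [s]    */
    double T_n_nc = 105.0;   /* as T_n_eq, but with the communication time
                                switched off or subtracted [s]                */

    double E_tot = T_s / (n * T_n);        /* total efficiency                */
    double E_par = T_n_nc / T_n_eq;        /* parallel efficiency (comm.)     */
    double E_lb  = T_n_eq / T_n;           /* load-balancing efficiency       */
    double E_num = E_tot / (E_par * E_lb); /* numerical efficiency (derived)  */

    printf("E_tot = %.3f  E_num = %.3f  E_par = %.3f  E_lb = %.3f\n",
           E_tot, E_num, E_par, E_lb);
    return 0;
}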
Efficiency of Parallel Computing – V
• When multigrid solvers are used, one needs to agglomerate
subdomains on coarse grid levels…
• Some processors are then left idle while the coarsest grid levels are
visited by fewer processors.
• Lagrangian multiphase and other models may also lead to
different computing loads per processor (e.g. dynamic adaptive
grid refinement, or overlapping grids which move)…
• Some communication can be overlapped with computation…
• For massive parallelism, numerical efficiency and global
communication are the limiting factors.
• Fortunately, massively parallel flow problems are usually
transient – the numerical efficiency then does not suffer much.
CFD on Graphics Cards
• Graphics cards contain many processors that are efficient at simple
operations.
• CFD-codes have been ported to or developed for graphics
cards – mostly FD-methods for structured grids and Lattice-
Boltzmann methods.
• Commercial codes have been tested but not seriously used on
graphics cards…
• Some parts of algorithms on unstructured grids are inefficient
on graphics cards due to indirect addressing – memory access
becomes a bottleneck…
• Porting general-purpose codes to graphics cards is not the hot
topic today, but hardware is changing and one day this may
change…
Examples of Parallel Performance – I
[Figure: parallel performance example – segregated solution, LES, AMG solver, flamelet-based combustion model, 692 million cells, STAR-CCM+ software; 42 times faster execution on 64 times more processors]
Examples of Parallel Performance – III
[Figure: parallel performance example – coupled solution, k-ε turbulence model, AMG solver, 1.02 billion cells, STAR-CCM+ software; the coupled solver shows super-linear speed-up]