PARALLEL COMPUTING IN CFD
Milovan Perić
CoMeT Continuum Mechanics Technologies GmbH
&
Institute of Ship Technology, Ocean Engineering and Transport
Systems, University of Duisburg-Essen
Introduction
• In the 1970s, mainframe computers with vector processors were
the fastest machines…
• However, many algorithms were difficult or impossible to
vectorize, so other ways of speeding up the computations
in a scalable manner were sought…
• Parallel computing was one option – and it has become the
dominant mode of computing today (even cell phones
have quad-core processors nowadays…).
• Parallel CFD started in the early 1980s, using experimental parallel
computers (there was nothing on the market yet).
• University of Erlangen in Germany had one such computer
called DIRMU (Distributed Reconfigurable Multiprocessor).
History of Parallel CFD – I
• DIRMU had 24 processors, and each could read from the private
memory of 7 of its neighbors.
• It was thus suitable for CFD on structured grids using domain
decomposition, where in 3D each grid block would have 6
neighbor blocks…
• Unfortunately, its processors quickly became outdated and there
was no successor model…
[Figure: DIRMU parallel computer, exhibited in the Regional Computer Centre Erlangen]
History of Parallel CFD – II
• In 1985, Germany launched the SUPRENUM project (supercomputer
for numerical applications), but it was not continued after the
first phase…
• More successful were the private companies Parsytec (Germany)
and Meiko (UK): they built and sold many systems based on
special processors called “Transputers”.
• Transputer systems died out when the latest processor model,
the T9000, failed to meet its specifications…
• Several other companies that built parallel computers (Kendall
Square Research and Thinking Machines Corporation, USA) also
no longer exist…
• The breakthrough came with clusters built from the standard
processors used in PCs; nowadays the largest clusters are
built by companies that also make PCs (Dell, IBM, HP…).
History of Parallel CFD – III
• In the early days, porting codes from serial to parallel computers
was a tedious job…
• Several communication libraries were introduced (TCGMSG,
PVM, MPI), designed to help programmers by hiding the low-level
communication code.
• Eventually, MPI (Message Passing Interface) became the de-facto
standard for inter-processor communication.
• On the hardware side, there were also several options, of which
only a few survived:
– Ethernet (in various flavors),
– InfiniBand.
• In the early 1990s, the European Union supported the parallelization
of commercial engineering software (including the CFD software STAR-CD).
Parallelization Concepts – I
• Parallelization at loop level is subject to Amdahl’s law and thus
not very efficient…
• Several other concepts were tried, but they do not reach
efficiency above 50%.
• The standard scalable approach in parallel CFD is based on
domain decomposition.
• The solution domain is split into contiguous subdomains and
each subdomain is assigned to one processor.
• In FV, subdomain boundaries correspond to CV-boundaries.
Each processor computes the solution in its subdomain.
• However, both the discretization and the solution process
require some data that is computed by neighbor processors
(cells next to subdomain boundary refer to one or two layers of
cells on the other side).
Parallelization Concepts – II
• The shared-memory concept allows access to such data, but
memory access becomes a bottleneck – not scalable…
• Distributed memory is the standard concept these days: each
processor has its private memory for data it computes and for a
copy of data it needs from neighbors…
• Data along subdomain boundaries is exchanged typically once
per inner iteration – this constitutes local communication.
• This communication is scalable: it takes place in parallel
between pairs of neighboring processors (see the sketch below).
• Local communication depends only on the number of neighbor
subdomains – not on the total number of processors.
• Global communication is also required – gathering of some
information by the master process and broadcasting it to all processors…
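• A minimal sketch of such a halo (ghost-cell) exchange, assuming MPI and, for brevity, a 1-D chain of subdomains; the names (phi, nlocal, exchange_halo) are illustrative and not taken from any particular code:

#include <mpi.h>

/* Each rank stores its interior values in phi[1..nlocal] and keeps copies of
 * one neighbor cell per side in the ghost cells phi[0] and phi[nlocal+1]. */
void exchange_halo(double *phi, int nlocal, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first interior value to the left neighbor while receiving the
     * right neighbor's first interior value into my right ghost cell ...    */
    MPI_Sendrecv(&phi[1],          1, MPI_DOUBLE, left,  0,
                 &phi[nlocal + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* ... and the mirror operation for the left ghost cell.                 */
    MPI_Sendrecv(&phi[nlocal],     1, MPI_DOUBLE, right, 1,
                 &phi[0],          1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

• The exchanges proceed in parallel between neighbor pairs; in 3D the same pattern is applied to the (typically 6) face-neighbor blocks, with whole boundary layers of cells packed into each message.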
Parallelization Concepts – III
• Examples of global communication (see the sketch below) are:
– Computation of residual norm to estimate iteration errors
(gathering of norms from subdomains);
– Broadcasting of convergence criterion decision;
– Computation of scalar products of two vectors (e.g. in conjugate-
gradient type solvers).
• Global communication is not perfectly scalable – the effort
grows as the total number of processors increases…
• Both local and global communication can often be overlapped
with computation that does not require the exchanged data
(if supported by the hardware).
• Optimizing the communication overhead requires re-writing
parts of the code (more details later)…
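• A sketch of the global operations listed above, again assuming MPI; the routine names are illustrative. Every rank contributes its local partial sum and an all-reduce combines them, so – unlike the halo exchange – all processors take part:

#include <math.h>
#include <mpi.h>

/* Global L2 norm of the residual, assembled from local partial sums. */
double residual_norm(const double *res, int nlocal, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; ++i)
        local += res[i] * res[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return sqrt(global);
}

/* Global scalar product of two vectors, as needed in conjugate-gradient
 * type solvers. */
double scalar_product(const double *a, const double *b, int nlocal, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; ++i)
        local += a[i] * b[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

• In this sketch MPI_Allreduce already delivers the result to every rank, so the convergence decision can be taken identically everywhere; a gather by the master followed by a broadcast is an equivalent alternative. Non-blocking variants (MPI_Iallreduce) allow such reductions to be overlapped with computation that does not need the result.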
Main Influencing Factors – I
• The main factors affecting the efficiency of communication
are:
– Latency (setup time for communication): needed to initialize
communication between two processors;
– Data-transfer rate (bandwidth of the communication channel);
– Amount of data to be transferred.
• The total efficiency depends on the ratio of communication to
computing time – thus processor computing speed is also
important.
• Mathematical models to estimate the efficiency of parallel
computing can be built when the algorithm and the above
parameters are known (a simple cost model is sketched below)…
• Some options may be invoked differently depending on the
hardware.
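• A small, self-contained model of the communication overhead built from the three factors above; all parameter values are placeholders that would have to be measured for the actual hardware and code:

#include <stdio.h>

int main(void)
{
    /* Hypothetical hardware/code parameters (to be measured in practice): */
    double t_lat  = 2.0e-6;   /* latency (setup time) per message [s]        */
    double beta   = 1.0e-10;  /* transfer time per byte (1/bandwidth) [s/B]  */
    double t_calc = 5.0e-7;   /* computing time per cell and inner iteration */
    int    nmsg   = 6;        /* messages per iteration (6 neighbor blocks)  */

    /* Estimate the efficiency for cubic subdomains of decreasing size,
     * i.e. for more and more processors working on a fixed grid.           */
    for (int m = 64; m >= 8; m /= 2) {
        long   ncells = (long)m * m * m;                  /* cells per subdomain */
        long   halo   = 6L * m * m;                       /* one layer per face  */
        double t_comm = nmsg * t_lat + halo * 8.0 * beta; /* 8 bytes per value   */
        double t_cpu  = ncells * t_calc;
        printf("%3d^3 cells/subdomain: parallel efficiency ~ %.4f\n",
               m, t_cpu / (t_cpu + t_comm));
    }
    return 0;
}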
Main Influencing Factors – II
• Explicit methods are easy to parallelize because the new solution is
computed using only old data…
• Data needs to be exchanged once per time step (after the solution
has been updated); there is no global communication (see the sketch below)…
• The sequence of operations and the solution are identical on one
and on many processors…
• However, even explicit pressure-based methods require the
solution of a Poisson-equation for pressure, so an equation
system needs to be solved – like in implicit methods.
• Implicit methods are more difficult to parallelize; they usually
require adaptations to the iteration matrix, so the sequence of
operations and the solution are not identical on one and on
many processors…
The number of iterations per time step may need to
be increased if too many processors are used!
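• A sketch of an explicit update (1-D diffusion standing in for a real scheme), assuming the exchange_halo() routine sketched earlier; the new values depend only on old data, so one halo exchange per time step suffices and no global communication is needed:

#include <mpi.h>

void exchange_halo(double *phi, int nlocal, MPI_Comm comm);  /* sketched earlier */

/* One explicit time step: exchange ghost values once, then purely local work. */
void explicit_step(double *phi, double *phi_new, int nlocal,
                   double alpha, MPI_Comm comm)
{
    exchange_halo(phi, nlocal, comm);             /* once per time step  */
    for (int i = 1; i <= nlocal; ++i)             /* purely local update */
        phi_new[i] = phi[i] + alpha * (phi[i-1] - 2.0 * phi[i] + phi[i+1]);
}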
Domain Decomposition in Space – I
• There are only a few solvers for linear equation systems which
run the same in parallel as in serial mode.
• The obvious one is the Jacobi method, but it is almost never used.
• The so-called “red-black” Gauss-Seidel method is another
solver that needs no adaptation on structured grids (see the sketch below).
• ILU-type solvers can in principle be parallelized so that they
execute the same sequence of operations, but only on
structured grids and in a not really scalable way…
• The conjugate-gradient method can also be parallelized
without modification, but only without pre-conditioning; also
almost never used in that form...
• All commonly used solvers run slightly differently on parallel
computers, depending on the number of subdomains.
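• A sketch of the red-black idea in 1-D (3-point stencil), assuming the exchange_halo() routine sketched earlier; within one color no cell depends on another cell of the same color, so each half-sweep is order-independent and the parallel result matches the serial one:

#include <mpi.h>

void exchange_halo(double *phi, int nlocal, MPI_Comm comm);  /* sketched earlier */

/* One red-black Gauss-Seidel sweep for -phi[i-1] + 2 phi[i] - phi[i+1] = b[i].
 * first_color is the global color (0 or 1) of local cell i = 1, so that the
 * coloring stays consistent across subdomain boundaries. */
void redblack_sweep(double *phi, const double *b, int nlocal,
                    int first_color, MPI_Comm comm)
{
    for (int color = 0; color < 2; ++color) {
        exchange_halo(phi, nlocal, comm);         /* refresh ghost values  */
        for (int i = 1; i <= nlocal; ++i)         /* update one color only */
            if ((first_color + i - 1) % 2 == color)
                phi[i] = 0.5 * (phi[i-1] + phi[i+1] + b[i]);
    }
}

• Note that the price for serial/parallel equivalence is one halo exchange per color, i.e. more communication per sweep than the lagged approach described on the following slides.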
Domain Decomposition in Space – II
• The usual modification of the iterative solution method is to
lag the data from neighbor subdomains by one iteration.
• Thus, at the mth iteration in subdomain i, the variable values from
neighbor subdomains that appear in the algebraic equations are taken
from iteration m-1 and treated as known.
• This corresponds to splitting the coefficient matrix A into
diagonal block-matrices Aii (which involve only variables from
subdomain i) and off-diagonal block-matrices Aij (which involve the
variables from subdomain j appearing in the equations of subdomain i).
• The iteration matrix M is also modified correspondingly; it
usually contains only the diagonal blocks Mii, so that the mth inner
iteration in subdomain i solves (with neighbor values lagged by one
iteration; a code sketch follows after the figure):

Mii (φi^m – φi^(m-1)) = bi – Σj Aij φj^(m-1)
Domain Decomposition in Space – III
[Figure: structure of the coefficient and iteration matrices for a single domain vs. two subdomains]
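• A sketch of the corresponding parallel inner iteration, in the same illustrative 1-D setting and using the exchange_halo() routine from before: the ghost values (neighbor data from iteration m-1) are refreshed once, then the local sweep treats them as known; here the local solver is a plain Gauss-Seidel pass, but any solver could be substituted:

#include <mpi.h>

void exchange_halo(double *phi, int nlocal, MPI_Comm comm);  /* sketched earlier */

/* One inner iteration on "my" subdomain: neighbor values enter only through
 * the ghost cells, i.e. only the diagonal block Aii is treated implicitly. */
void parallel_inner_iteration(double *phi, const double *b, int nlocal,
                              MPI_Comm comm)
{
    exchange_halo(phi, nlocal, comm);   /* neighbor data lagged by one iteration */
    for (int i = 1; i <= nlocal; ++i)   /* local Gauss-Seidel sweep              */
        phi[i] = 0.5 * (phi[i-1] + phi[i+1] + b[i]);
}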
Domain Decomposition in Space – IV
• This approach is generic – any iteration solver can easily be
adapted to it.
• Local communication is required after each inner iteration to
update the neighbor data stored in private memory.
• In the limit of each subdomain containing just one CV/node,
the solver would reduce to the Jacobi method.
• The number of CVs/nodes per subdomain is typically much
larger than the number of subdomains, so the degradation is usually modest…
• However, when the number of processors becomes very large,
the solver performance becomes much worse than on a
single processor.
• Multigrid methods help improve solver performance, but then
part of the work must be done on fewer processors…
Domain Decomposition in Time – I
• Implicit methods that perform outer iterations within a time
step offer another parallelization possibility – solving for
multiple time steps in parallel.
• Usually, one starts the computation of a new time step only when the
solution at the current time step is finished; all solutions from
previous time steps needed in the algorithm are then known.
• However, one can start computation for the new time step as
soon as the first outer iteration on the current time step is
finished (which provides the first estimate of the solution).
• We then have multiple processors operating on the same
spatial subdomain, but on different time levels.
Domain Decomposition in Time – II
• The equation solved at the mth outer iteration for time step
tn+1 then uses, as “old” values, the provisional solutions currently
available from the processors working on the preceding time steps
(k denotes the time level).

[Figure: structure of the global matrix equation when solving for 4 time steps in parallel]
Domain Decomposition in Time – III
• Local communication for time-parallelism involves one send
and one receive of the complete field per outer iteration
(coarse-grain communication; see the sketch below).
• Local communication takes place in parallel between pairs of
processors.
• There is no global communication associated with time-
parallel computation.
• The use of provisional old data affects the convergence of the outer
iterations; if too many time steps are computed in parallel, the
number of required outer iterations increases.
• Rule of thumb: the number of time steps that can usefully be
executed in parallel is about half the number of outer iterations per
time step.
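• A sketch of the time-parallel exchange, with one MPI rank per time step for simplicity (in practice each time step is also decomposed in space); per outer iteration every rank passes its provisional field forward in time and receives the updated provisional old-time-level field. The field arrays and do_outer_iteration() are illustrative placeholders:

#include <mpi.h>

static void do_outer_iteration(double *phi, const double *phi_old, int n)
{
    /* Placeholder for one outer iteration of the implicit scheme for "my"
     * time step, using phi_old as the provisional old-time-level field.  */
    (void)phi; (void)phi_old; (void)n;
}

void time_parallel_outer_loop(double *phi, double *phi_old, int n,
                              int nouter, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int prev = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  /* earlier time step */
    int next = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;  /* later time step   */

    for (int m = 0; m < nouter; ++m) {
        /* Coarse-grain exchange: one send and one receive of the whole field
         * per outer iteration. On the first rank the receive is a no-op and
         * phi_old keeps the converged solution of the preceding window.     */
        MPI_Sendrecv(phi,     n, MPI_DOUBLE, next, m,
                     phi_old, n, MPI_DOUBLE, prev, m,
                     comm, MPI_STATUS_IGNORE);
        do_outer_iteration(phi, phi_old, n);
    }
}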
Efficiency of Parallel Computing – I
• The performance is measured by the speed-up Sn or the efficiency En:

Sn = Ts / Tn ,   En = Sn / n = Ts / (n Tn)

• Ts is the execution time of the best serial algorithm on one
processor (not the execution time of the parallel algorithm on 1 proc.!);
• Tn is the execution time of the parallel algorithm on n
processors.
• Ideal speed-up equals number of processors and ideal
efficiency is 1 (or 100%).
• Usually, the speed-up (or efficiency) is lower than ideal but,
depending on the hardware, one can sometimes obtain even higher values
(usually due to caching of data).
Efficiency of Parallel Computing – II
• Processors are usually synchronized at the beginning of each
iteration – there are thus idle times, because one iteration may
last longer on some processors than on others…
• Reasons for uneven load: unequal numbers of cells per
subdomain, different boundary conditions or local phenomena,
branches in the algorithm, different numbers of neighbor
subdomains…
• For a single processor, the computing time can be expressed as:

Ts = Ns tcalc ,

where Ns is the number of operations of the serial algorithm and tcalc
the average time per operation.
• For a parallel algorithm executed on n processors, the time on the most
heavily loaded processor is:

Tn = Nn tcalc + Tcom ,

where Nn is the number of operations it performs and Tcom is the
communication time which halts computation.
Efficiency of Parallel Computing – III
• The total efficiency can thus be re-written as the product of three factors:

Etot = Enum Epar Elb

• Enum is the numerical efficiency, which accounts for the higher
demand for computing operations by the parallelized
algorithm.
• Epar is the parallel efficiency, which accounts for
the time spent on communication, during
which computation has to be halted.
• Elb is the load-balancing efficiency, which accounts for
idle times due to uneven load.
Efficiency of Parallel Computing – IV
• When parallelization is performed in both space and time, the
overall efficiency is the product of spatial and temporal total
efficiencies.
• The total efficiency is easily obtained by measuring total
execution times on a single processor and on n processors.
• The parallel efficiency can be approximately determined by
measuring the execution times when the subdomains are of
equal size and the number of inner and outer iterations is fixed.
• To determine numerical efficiency, one needs either to count
operations or divide the total efficiency by the product of
parallel and load-balancing efficiency.
• Load-balancing efficiency can be estimated from the ratio of the
numbers of CVs per subdomain, but there are other effects that
can be substantial (a small worked example follows below)…
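• A small worked example, with made-up timings, of how the individual efficiencies could be extracted from such measurements; the way the measurements are combined (an equal-load run, a run with communication switched off or subtracted) is one possible instantiation of the procedure described above, and all numbers are purely hypothetical:

#include <stdio.h>

int main(void)
{
    int    n      = 64;      /* number of processors                          */
    double T_s    = 6400.0;  /* best serial algorithm on 1 processor [s]      */
    double T_n    = 125.0;   /* parallel algorithm on n processors [s]        */
    double T_n_eq = 118.0;   /* as T_n, but with equal-size subdomains and a
                                fixed number of inner/outer iterations [s]    */
    double T_n_nc = 105.0;   /* as T_n_eq, but with the communication time
                                switched off or subtracted [s]                */

    double E_tot = T_s / (n * T_n);        /* total efficiency                */
    double E_par = T_n_nc / T_n_eq;        /* parallel efficiency (comm.)     */
    double E_lb  = T_n_eq / T_n;           /* load-balancing efficiency       */
    double E_num = E_tot / (E_par * E_lb); /* numerical efficiency (derived)  */

    printf("E_tot = %.3f  E_num = %.3f  E_par = %.3f  E_lb = %.3f\n",
           E_tot, E_num, E_par, E_lb);
    return 0;
}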
Efficiency of Parallel Computing – V
• When multigrid solvers are used, one needs to agglomerate
subdomains on coarse grid levels…
• Some processors are then left idle while the coarsest grid levels are
visited by fewer processors.
• Lagrangian multiphase and other models may also lead to
different computing loads per processor (e.g. dynamic adaptive
grid refinement, or overlapping grids which move)…
• Some communication can be overlapped with computation…
• For massive parallelism, numerical efficiency and global
communication are the limiting factors.
• Fortunately, massively parallel flow problems are usually
transient – the numerical efficiency then does not suffer much.
CFD on Graphics Cards
• Graphics cards contain many processors that are efficient at simple
operations.
• CFD-codes have been ported to or developed for graphics
cards – mostly FD-methods for structured grids and Lattice-
Boltzmann methods.
• Commercial codes have been tested but not seriously used on
graphics cards…
• Some parts of algorithms on unstructured grids are inefficient
on graphics cards due to indirect addressing – memory access
becomes a bottleneck…
• Porting general-purpose codes to graphics cards is not the hot
topic today, but hardware is changing and one day this may
change…
Examples of Parallel Performance – I
[Figure: parallel performance example – segregated solution, LES, AMG solver, flamelet-based combustion model, 692 million cells, STAR-CCM+ software; 42 times faster execution on 64 times more processors]
Examples of Parallel Performance – III
[Figure: parallel performance example – coupled solution, k-ε turbulence model, AMG solver, 1.02 billion cells, STAR-CCM+ software; the coupled solver shows super-linear speed-up]