HPC Lecture 3

High Performance Computing (HPC) utilizes parallel processing to efficiently run complex applications, often exceeding teraflop performance levels. It is crucial for scientific discoveries and industrial advancements, enabling faster computations for various fields such as AI and climate modeling. However, HPC faces challenges including high costs, scalability issues, and the need for specialized programming and data management techniques.

High Performance Computing

Introduction
• It is the use of parallel processing for running advanced
application programs efficiently, reliably, and quickly.
• HPC systems function above a teraflop (10^12 floating-point operations per second).
• High-performance computing is occasionally used as a
synonym for supercomputing.
• Although technically a supercomputer is a system that
performs at or near the currently highest operational rate for
computers. Some supercomputers work at more than a
petaflop (10^15 floating-point operations per second). The most
common users of HPC systems are scientific researchers, engineers, and academic
institutions. Some government agencies, particularly the military,
are also relying on HPC for complex applications.
• High Performance Computing (HPC) refers to the
practice of combining computing power to deliver
far greater performance than a typical desktop or
workstation, in order to solve complex problems.

• Processors, memory, disks, and the OS are the elements of
high-performance computers, which are really clusters of
computers called nodes.
• Each individual computer in a commonly configured small
cluster has between one and four processors.
• Nodes work together to solve a problem larger than any one
computer can easily solve.
• These nodes are so connected that they can communicate
with each other in order to produce some meaningful work.
• There are two popular HPC operating systems, i.e., Linux and
Windows.
• Most installations are on Linux because of the Linux legacy in
supercomputers and large-scale machines, but one can choose
according to his/her requirements.
Importance of High performance Computing :
• It is used for scientific discoveries, game-changing innovations,
and to improve quality of life.
• It is a foundation for scientific & industrial advancements.
• As technologies like IoT, AI, and 3D imaging evolve, the
amount of data used by organizations increases
exponentially; to increase the ability of a computer to handle it,
we use high-performance computers.
• HPC is used to solve complex modeling problems in a spectrum
of disciplines. It includes AI, Nuclear Physics, Climate
Modelling, etc.
• HPC is applied to business uses, data warehouses & transaction
processing.
• Need of High performance Computing :
• It will complete a time-consuming operation in less time.
• It will complete an operation under a tight deadline and perform a
high number of operations per second.
• It enables fast computing: we can compute in parallel over a lot of
computational elements (CPUs, GPUs, etc.), connected by a very fast
network between the elements.

• Need of ever increasing Performance :


• Climate modeling
• Drug discovery
• Data Analysis
• Protein folding
• Energy research
How Does HPC Work?

• User/Scheduler → Compute cluster → Data storage

• To create a high-performance computing architecture, multiple
computer servers are networked together to form a compute
cluster.
• Algorithms and software programs are executed simultaneously on
the servers, and the cluster is networked to data storage to retrieve
the results.
• All of these components work together to complete a diverse set of
tasks.
• To achieve maximum efficiency, each module must keep pace with
others, otherwise, the performance of the entire HPC infrastructure
would suffer.
Challenges with HPC
• Cost: The cost of the hardware, software, and energy
consumption is enormous, making HPC systems exceedingly
expensive to create and operate. Additionally, the setup and
management of HPC systems require qualified workers,
which raises the overall cost.

• Scalability: HPC systems must be made scalable so they may
be modified or expanded as necessary to meet shifting
demands. But creating a scalable system is a difficult
endeavour that necessitates thorough planning and
optimization.
• Data Management: Data management can be
difficult when using HPC systems since they produce
and process enormous volumes of data. These data
must be stored and accessed using sophisticated
networking and storage infrastructure, as well as
tools for data analysis and visualization.

• Programming: Parallel programming techniques,
which can be more difficult than conventional
programming approaches, are frequently used in
HPC systems. It might be challenging for developers
to learn how to create and optimise algorithms for
parallel processing.
• Support for software and tools: To function
effectively, HPC systems need specific software and
tools. The options available to users may be
constrained by the fact that not all software and tools
are created to function with HPC equipment.

• Power consumption and cooling: To maintain the
hardware functioning at its best, specialised cooling
technologies are needed for HPC systems’ high heat
production. Furthermore, HPC systems consume a lot
of electricity, which can be expensive and difficult to
maintain.
Difference Between High Performance Computing and High
Throughput Computing

• HPC: HPC is defined as the type of computing that makes use of multiple computer processors in order to perform complex computations parallelly.
  HTC: HTC is defined as a type of computing that parallelly executes a large number of simple and computationally independent tasks.

• HPC: HPC consists of running large-scale, complex, and computationally intensive applications that need significant resources and memory.
  HTC: HTC consists of running a large number of tasks that are independent, small in size, and do not require a large amount of memory and resources.

• HPC: It is designed to provide maximum performance and speed for large tasks.
  HTC: HTC is designed to increase the number of tasks that need to be completed in a given amount of time.

• HPC: For resource management of processes, HPC makes use of job schedulers and resource managers.
  HTC: For resource management of processes, HTC makes use of distributed management of resources.

• HPC: To reduce the risk of data loss and data corruption, HPC systems have a complex fault tolerance mechanism.
  HTC: In HTC systems, the failure of an individual task does not affect any other running processes.

• HPC: HPC scales up when few users are running together.
  HTC: HTC systems scale horizontally for simple tasks and require less computational speed.

• HPC: HPC can be used in applications such as engineering design, weather forecasting, drug discovery, etc.
  HTC: HTC can be used in applications such as bioinformatics, research applications, etc.
Parallel Computing Vs Distributed Computing

1. Parallel Computing: Many operations are performed simultaneously.
   Distributed Computing: System components are located at different locations.
2. Parallel Computing: A single computer is required.
   Distributed Computing: Uses multiple computers.
3. Parallel Computing: Multiple processors perform multiple operations.
   Distributed Computing: Multiple computers perform multiple operations.
4. Parallel Computing: It may have shared or distributed memory.
   Distributed Computing: It has only distributed memory.
5. Parallel Computing: Processors communicate with each other through a bus.
   Distributed Computing: Computers communicate with each other through message passing.
6. Parallel Computing: Improves the system performance.
   Distributed Computing: Improves system scalability, fault tolerance and resource sharing capabilities.
Instruction-level parallelism (ILP)
• ILP refers to the architecture in which multiple operations
can be performed in parallel within a particular process, which has
its own set of resources – address space, registers,
identifiers, state, and program counters.
• It refers to the compiler design techniques and
processors designed to execute operations, like memory
load and store, integer addition, and float multiplication,
in parallel to improve the performance of the processors.
• ILP is measured as the average number of instructions run per
step of this parallel execution.
ILP Vs concurrency
In ILP there is a single specific thread of execution of
a process.
Concurrency involves the assignment of multiple
threads to a CPU's cores, either in strict alternation or in
true parallelism if there are enough CPU cores,
ideally one core for each runnable thread.

ILP Vs Pipeline processing
Pipeline processing breaks instruction execution down
into stages, whereas ILP focuses on executing
multiple instructions at the same time.
• There are two approaches to instruction-level
parallelism: Hardware and Software.
• Hardware level works upon dynamic parallelism,
whereas the software level works on static parallelism.
• Dynamic parallelism means the processor decides at run
time which instructions to execute in parallel, whereas
static parallelism means the compiler decides which
instructions to execute in parallel.
• Architectures that exploit ILP are VLIW (Very Long
Instruction Word) and superscalar architectures.
• ILP processors have the same execution hardware as RISC
processors.
• Example:
Consider the following program:
E=A+B
F=C+D
M=E*F
• Operation 3 depends on the results of operations 1
and 2, so it cannot be calculated until both of them
are completed.
• Operations 1 and 2 do not depend on any other
operation, so they can be calculated simultaneously.
• If we assume that each operation can be completed
in one unit of time then these three instructions can
be completed in a total of two units of time.
• ILP = 3/2
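The same example can be written as a short C function; this is a hedged sketch (the function name is illustrative), with comments marking which operations a compiler or superscalar processor could issue in the same cycle.

    /* The three operations from the example above. */
    void ilp_example(double A, double B, double C, double D, double *M) {
        double E = A + B;   /* operation 1: independent, can issue in cycle 1   */
        double F = C + D;   /* operation 2: independent, can issue in cycle 1   */
        *M = E * F;         /* operation 3: depends on E and F, runs in cycle 2 */
    }                       /* 3 instructions in 2 cycles, so ILP = 3/2 = 1.5   */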
• ILP allows the compiler and the processor to overlap the
execution of multiple instructions or even to change the
order in which instructions are executed.
• How much ILP exists in programs is very application
specific.
• In certain fields, such as graphics and scientific
computing the amount can be very large. Workloads
such as cryptography may exhibit much less parallelism.
Micro-architectural techniques that are
used to enhance ILP

• Instruction pipelining is used for
implementing instruction-level parallelism within a
single processor.
• Superscalar execution, VLIW (Very long instruction words)
and the explicitly parallel instruction
computing concepts, in which multiple execution
units are used to execute multiple instructions in
parallel.
• A superscalar processor is a CPU that implements
ILP within a single processor.
• A scalar processor can execute at most one single
instruction per clock cycle, whereas a superscalar processor can
execute more than one instruction during a clock cycle by
simultaneously dispatching multiple instructions to
different execution units on the processor.
• Superscalar processor allows more throughput (the
number of instructions that can be executed in a unit of
time) than would otherwise be possible at a given clock
rate.
• Superscalar and Pipelining execution are considered
different performance enhancement techniques.
• The Superscalar executes multiple instructions in parallel by
using multiple execution units, whereas the Pipelining
executes multiple instructions in the same execution unit in
parallel by dividing the execution unit into different phases.
• Out-of-order execution (or dynamic execution) where
a processor executes instructions in an order governed
by the availability of input data and execution
units, rather than by their original order in a program.
• Instructions execute in any order that does not violate
data dependencies. This technique is independent of
both pipelining and superscalar execution.
• It is used in high-performance central processing
units to make use of instruction cycles that would
otherwise be wasted.
• The processor can avoid being idle while waiting for
the preceding instruction to complete and can, in the
meantime, process the next instructions that are able
to run immediately and independently.
Data Hazards
• When more than one instruction references
a particular location as an operand, either by
reading it (as an input) or by writing to it (as
an output), executing those instructions in an
order different from the original program
order can lead to three kinds of data hazards:

1. Read-after-write (RAW)
2. Write-after-write (WAW)
3. Write-after-read (WAR)
• Read-after-write (RAW): a read from a register or memory
location must return the value placed there by the last
write in program order, not some other write.
• Also called true dependency or flow dependency, and
requires the instructions to execute in program order.
• It occurs when an instruction depends on the result of a
previous instruction.
• 1. A = 3; 2. B = A; 3. C = B
• Instruction 3 is truly dependent on instruction 2, as the
final value of C depends on the instruction updating B.
• Instruction 2 is truly dependent on instruction 1, as the
final value of B depends on the instruction updating A.
• Since instruction 3 is truly dependent upon instruction 2
and instruction 2 is truly dependent on instruction 1,
instruction 3 is also truly dependent on instruction 1.
• Write-after-write (WAW): successive writes to a particular
register or memory location must leave that location
containing the result of the second write.
• This can be resolved by squashing/cancelling the first
write if necessary.
• WAW dependencies are also known as output
dependencies.
• Occurs when the ordering of instructions will affect the
final output value of a variable.
• Example:
• 1. B = 3; 2. A = B + 1; 3. B = 7
• there is an output dependency between instructions 3 and
1.
• Changing the ordering of instructions in this example will
change the final value of A.
• Write-after-read (WAR): a read from a register or memory
location must return the last prior value written to that
location, and not one written programmatically after the
read.
• This is a sort of false dependency that can be resolved by
renaming. WAR dependencies are also known as anti-
dependencies.
• Occurs when an instruction requires a value that is later
updated.
• Example:
• 1. B = 3; 2. A = B + 1; 3. B = 7
• instruction 2 anti-depends on instruction 3 — the ordering
of these instructions cannot be changed, nor can they be
executed in parallel (possibly changing the instruction
ordering), as this would affect the final value of A.
• Register renaming is a technique that abstracts
logical registers from physical registers.
• Every logical register has a set of physical registers
associated with it.
• When a machine language instruction refers to a
particular logical register, the processor transposes
this name to one specific physical register.
• This technique is used to eliminate false data
dependencies arising from the reuse of registers by
successive instructions that do not have any real
data dependencies between them.
• The elimination of these false data dependencies
reveals more instruction-level parallelism in an
instruction stream.
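As an illustration (a hedged sketch reusing the earlier example, with B2 standing in for a freshly allocated physical register), renaming the second write of B removes the WAR and WAW false dependencies, so only the true RAW dependence remains.

    void renaming_example(void) {
        /* Original program order: 1. B = 3;  2. A = B + 1;  3. B = 7;      */
        int B  = 3;       /* instruction 1                                   */
        int A  = B + 1;   /* instruction 2: true (RAW) dependence on 1       */
        int B2 = 7;       /* instruction 3 renamed: no conflict with 1 or 2  */
        (void)A; (void)B2;    /* silence unused-variable warnings            */
        /* Later reads of logical B are mapped to B2; instructions 2 and 3   */
        /* can now execute in parallel or out of order.                      */
    }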
• Speculative execution is an optimization technique where
a system performs some task that may not be needed.
• Work is done before it is known whether it is actually needed, so as
to prevent a delay that would have to be incurred by doing the work
after it is known that it is needed.
• Ex.of speculative execution is control flow speculation (the order in
which the computer executes statements) where instructions are
executed before the target of the control flow instruction is
determined.
• Speculative execution driven by value prediction, memory
dependence prediction and cache latency prediction.
• Branch prediction attempts to guess whether a conditional jump will
be taken or not.
• Branch target prediction attempts to guess the target of a taken
conditional or unconditional jump before it is computed by decoding
and executing the instruction itself.
• Branch prediction and branch target prediction are often combined
into the same circuitry.
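While the branch predictor itself works in hardware, a programmer can sometimes state which way a condition usually goes. The following hedged sketch (a GCC/Clang-specific builtin, not from the slides) lets the compiler lay out the machine code so that the common path is the straight-line, well-predicted one:

    #include <stdio.h>

    #define UNLIKELY(x) __builtin_expect(!!(x), 0)  /* hint: condition is rarely true */

    int process(int value) {
        if (UNLIKELY(value < 0)) {          /* rare error path              */
            fprintf(stderr, "negative input\n");
            return -1;
        }
        return value * 2;                   /* common, well-predicted path  */
    }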
Instruction Level Parallelism (ILP) Architecture

• Instruction Level Parallelism is achieved when multiple
operations are performed in a single cycle, which is done
by either executing them simultaneously or by utilizing
gaps between two successive operations that are created
due to the latencies.
• The decision of when to execute an operation depends on
the compiler rather than the hardware.
• But the extent of the compiler’s control depends on the
type of ILP architecture where information regarding
parallelism given by the compiler to hardware via the
program varies.
Classification of ILP Architectures

• Sequential Architecture: Here, the program is not expected to
explicitly convey any information regarding parallelism to
hardware, like superscalar architecture.
• Dependence Architectures: Here, the program explicitly
mentions information regarding dependencies between
operations, like dataflow architecture.
• Independence Architecture: Here, the program gives information
regarding which operations are independent of each other so
that they can be executed instead of the 'nops'.
• In order to apply ILP, the compiler and hardware must determine
data dependencies, independent operations, and the scheduling of
these independent operations, assignment of functional units,
and registers to store data.
Techniques To Enhance
Instruction-level Parallelism

• Reordering - the idea of reordering instructions is
to pick another independent instruction (which
will be executed anyway) to fill the gap, so we do
not need to waste clock cycles waiting and doing
nothing.
• Out-of-Order Execution
• Speculative Execution
• Branch Prediction
Advantages of Instruction-Level Parallelism

• Improved Performance: ILP can significantly improve the


performance of processors by allowing multiple instructions to be
executed simultaneously or out-of-order. This can lead to faster
program execution and better system throughput.
• Efficient Resource Utilization: ILP can help to efficiently utilize
processor resources by allowing multiple instructions to be executed
at the same time. This can help to reduce resource wastage and
increase efficiency.
• Reduced Instruction Dependency: ILP can help to reduce the number
of instruction dependencies, which can limit the amount of
instruction-level parallelism that can be exploited. This can help to
improve performance and reduce bottlenecks.
• Increased Throughput: ILP can help to increase the overall throughput
of processors by allowing multiple instructions to be executed
simultaneously or out-of-order. This can help to improve the
performance of multi-threaded applications and other parallel
processing tasks.
Disadvantages of Instruction-Level Parallelism
• Increased Complexity: Implementing ILP can be complex
and requires additional hardware resources, which can
increase the complexity and cost of processors.
• Instruction Overhead: ILP can introduce additional
instruction overhead, which can slow down the execution
of some instructions and reduce performance.
• Data Dependency: Data dependency can limit the amount
of instruction-level parallelism that can be exploited. This
can lead to lower performance and reduced throughput.
• Reduced Energy Efficiency: ILP can reduce the energy
efficiency of processors by requiring additional hardware
resources and increasing instruction overhead. This can
increase power consumption and result in higher energy
costs.
• Processor arrays were an early development in parallel computing;
examples are the vector computers.
• A Vector Computer is a CPU system designed to operate simultaneously on
all elements of a one-dimensional array, compared to an ordinary
processor which works on a single scalar only.
• For a = b + c, a sequential computer will consider a, b, and c to be
single-valued variables, add b and c, and store the result in a.
• In a vector computer, a is an array or vector of, say, 100 elements, and
similarly b and c have 100 elements. The same operation is applied over
all 100 elements of b and c; there are effectively multiple processing
units doing the same operation on different elements of the vector.
• So, instead of processing the different elements of the vector one by one,
all elements of the vector are worked on at once.
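A minimal sketch of the same idea in C (the names and the array length of 100 are illustrative): a scalar processor steps through this loop element by element, while a vector machine, or a compiler auto-vectorizing with SIMD instructions, applies the same addition to many elements at once.

    #define N 100

    void vector_add(double a[N], const double b[N], const double c[N]) {
        /* On a scalar CPU this loop handles one element per iteration.   */
        /* A vector computer performs the same a[i] = b[i] + c[i]         */
        /* operation on whole chunks of the arrays in a single vector     */
        /* instruction.                                                   */
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];
    }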
• In a Vector Computer there is a CPU which has its own memory and
I/O devices.
• This CPU is connected to a scalar bus and then to an interconnect
network switch and this interconnect network switch further
connects different processors which have their own small memory
units.
• They are again connected to a global result bus and this result bus is
connected to the CPU
• That means data goes one by one from the CPU over the scalar bus and
resides in the memories of the different vector processors present in the
processor array. These processor arrays have different CPUs, and
each CPU is connected with its own small memory part.
• So, this scalar memory bus is connected to the memories of the different
processors, each processor is also connected with the
interconnection bus, and the instructions go from the CPU directly
to each of these processors.
• They work on the small amount of memory associated with them,
and the result goes to a global bus which returns it to the CPU.
Processor Arrays
• If the distributed multiprocessors do not map
to the same memory, they are called multi
computers; i.e., multiple computers connected
together through an interconnect network
switch.

• It has multiple CPUs each one has its own local


memory and local cache.

• Local memories do not map to a single global


memory; but they are connected to a single
common file system through I/O switch and
internet.
• We can make an intranet and connect many
computers to make multi computer systems
and they can perform distributed computing
HPC computations using that system.

• This is a scalable system because they are


different computers connected through a
network switch.

• We can increase the number of computers


using higher end switches and can have 10,000
or more computers connected together.
• Asymmetric Multi Computer: there is a user who can
talk to the internet and through the internet he can
access the front-end computer and the front-end
computer is connected with an interconnection
network, which is connected with many other
computers.

• So, if any job is given it is given to the front-end


computer, it launches it to the slave computers
attached to it. This is called an asymmetric multi
computer.

• Users can only log on to the front-end computer and


instruct it to run a parallel job at the back-end systems.
• The front-end computer only needs to talk to the internet and user;
therefore, it has an updated operating system with all the
functionalities.

• The back ends are only designed to run some of the parallel applications
which are launched by the front-end computer to them. So, they may
have a primitive operating system and may not need to have even
graphic support, they are only supposed to do some of the arithmetic
logical operations as instructed by the front-end computer.

• Each of these computers have their own memory and they are
connected through the internet connect network and front-end
computer to a file system where each of the computers data can be
written down.

• But while operating, they are operating only on the local memory given
to them and they do not point to a single common address space and
this is an asymmetric multi-computer.
• Symmetric Multi Computer where the internet is
connected to the interconnection network and it can
access any of the computers and these computers can
communicate with other computers and make a group
among them and launch a job.

• Each node has full functionality and can launch parallel jobs
alongside the others.

• A very well developed cluster and we can access any of


these computers log on to any of these computers and
launch a job.
HYBRID HPC CLUSTER
Distributed Memory, Distributed Node System
• Hybrid HPC cluster is a distributed memory, distributed node
system, and inside each node there are multiple multiprocessors.
So, each node is itself a shared memory multiprocessor or UMA
(Uniform Memory Access).

• Multiple computers are connected via high speed myrinet


switches. Myrinet switches are high speed network
communication switches.

• Each computer has multiple processors with shared memory


connected through interconnectors or the same bus. This is a
scalable system it can connect near million processors.

• There is user internet which can connect to the switch Ethernet


to a front-end computer and then through the network switch,
the front end computer can launch jobs into multiple computers
in between them, and each computer is like today’s multi core
computer.
• Each of the computers is a multiprocessor also which is
connected to a local memory, but this is a shared memory
space for all the processors inside the computer.

• Apart from the Ethernet, they are connected through myrinet


which is a much faster network connection and by this we can
connect thousands of computers and each computer can have
a few hundred processors inside it, as a multiprocessor system.
So, by this we can have a near million processor system.

• Each computer may be connected with GPUs or coprocessor
accelerators to give better speed.

• In a leading supercomputer this multiprocessor computer


arrangement is there and each computer is connected with a
GPU or coprocessor. These are the most developed HPC
architectures.
Classification Of Parallel Architecture:
Flynn’s Taxonomy
• Computer architecture is an evolving area, what we see
today’s GPU was not there 20 years back.

• And what was the fastest supercomputer 20 years back,


probably if we consider its performance, it will probably not
be among top 5000 supercomputers today.

• Flynn’s taxonomy is something which classified the parallel


computing architecture, and even today any of the high-
performance computing platforms can be classified in one
of the classifications given by Professor Flynn in 1966.
Flynn’s Taxonomy
• SISD, is very much like a
sequential computer.

• Computers in this category can


decode a single instruction in unit
time.

• There is a processing unit and a data pool; some instruction is
given to it, and once an instruction reaches the processing unit,
it takes data from the data pool and processes that
information.
• It is a non-parallel computing method: there is one data
pool, the CPU takes data from one particular location in
RAM, and one CPU accesses one data item, receives one
instruction, and processes it.
• Check In Desk In The Airport

• You go to the check in desk, present your ticket, the person


sitting at the check in desk looks into the data pools, sees
whether what is your ticketing number PNR, what should
be the seat number printed on the boarding pass and
gives you back.

• The next person comes, he has to wait behind you, once


you are done.

• This is SISD: a single check-in desk at the airport counter,
and this is the way an ordinary computer or a sequential
computer works.
• In SIMD there is an
array of processors.

• It is like a vector
processor or a
processor array,
which is executing
the same
instruction, but on
different data.

• So, a single
instruction is given
to all the processors.
EXAMPLE

• Many desks are there at an airport check-in counter, and a
supervisor on a megaphone gives the instruction to take
everybody's tickets and look into their check-in data (there
are, say, 4 counters in parallel).
• So, at a go, 4 persons can stand at the 4 counters and give their
tickets; the clerks will check their data, then the supervisor
gives the instruction to print their boarding passes, so 4
boarding passes will be printed and 4 people will be done.
• Say we ask a parallel computer to do a matrix vector
multiplication and each computer will take care of one of the
rows of the matrix.

• So, it is one row per computer, and the entire matrix can be
computed at a go, because each row instead of looking into
running a loop over all the rows each computer is responsible for
one row.

• Many computers do the same work: multiplying the first
element of the row with the first element of the vector, then
adding the product of the second element of the row and the
second element of the vector, and so on, until each row is finished.

• So, SIMD is used in many shared memory and GPU based


applications.
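A hedged sketch of this row-per-worker idea in C with OpenMP (which is introduced later in this lecture); the 4x4 size and the names are illustrative.

    #include <omp.h>
    #define N 4

    /* y = A * x, with each row of A ideally handled by a different worker. */
    void matvec(const double A[N][N], const double x[N], double y[N]) {
        #pragma omp parallel for    /* same instruction stream, different rows of data */
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += A[i][j] * x[j];
            y[i] = sum;
        }
    }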
• Some also consider
this category to be
empty; that means
that this is not
much used in
parallel computing.

• That you have the


same data pool,
but there are
multiple
instructions which
are operated over
the same data
stream.
EXAMPLES

• For the same passenger, different desks are doing different jobs.
Like once the passenger goes, one desk gives his ticket, so, the
PNR number and all the information are obtained, one desk is
processing his ticket, other desk is taking his luggage, other desk
is printing his ticket etcetera. So, instead of one person working
on the same data and doing different operations, different
persons are doing different operations on the same data.

• In income tax department work, a PAN is given and different


financial information is required about how much tax he has
paid, what are his bank transactions, what is his foreign money
exchange etcetera. So, it is the same data, it is the same PAN
against which all this different information can be found out and
differently processing the data.
• MIMD is the most parallelized
version. There are many CPUs,
each CPU can access different
memory, each CPU can work on
different memory locations and
each of them are in parallel
running their own data.

• So, there is a data pool which is


accessed by different
computers, it can be multi
computers with different data.

• It is not the same instruction


pool, it is a different instruction
pool, which is operated by
different computers in parallel.
• These computers connected to each other, they exchange
information, they can pass some messages from one to
another, they can have some synchronization.

• So, many desks working at their own pace with different


data synchronized, but they have a central database.

• Therefore there is some synchronization in between them,


both in terms of data as well in terms of the operations.

• It needs more communication between each processor;


because they have much flexibility and in order to have
synchronization they need to communicate more.
• When ants are moving in a line, each one communicates only with the one ahead or
behind through some chemical exchange, so a smaller amount of communication is
required.

• When you give more flexibility, say 15 students from 15 different backgrounds are
coming to an examination hall and after the exam is over, they will go out and catch
their buses. If they will reach the same destination within the stipulated time, then
they need more communication.

• So, more flexibility means more synchronization, if you need to use the same data
you need more communication. Flexibility means more communication and this is
heavily used in scientific computing.

• Research problems will use MIMD because it is more flexible, you have a much
complex algorithm, you need much more flexibility to parallelize it and get the
efficient parallel performance.

• Top 500 supercomputers follow this architecture. They provide a platform where
you can run programs which are following MIMD architecture.
• MIMD is one of the very important architecture for high performance
scientific computing.

• The first one is a single program with multiple data. The same program runs
in the system. It is the same executable file which runs in the system. And
the same executable file goes to different processors, and it works there.

• However, different instances of the program, through if-else statements, run
on different processors and operate on different data. It is the same program
which has different instances, and depending on the processor number
and on the part of the data that processor is connected to, each instance
operates accordingly.

• So, when we write a program for matrix multiplication or similar matrix
computations, we ask many computers to work in parallel. We do not write
different programs for different computers; we write one program which is
executable over different computers.
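A hedged SPMD sketch in C using MPI (covered alongside OpenMP in this course): every processor runs the same executable, and rank-based if-else branches make different instances do different parts of the work. The splitting scheme shown here is purely illustrative.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which instance am I?      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many instances exist? */

        if (rank == 0) {
            /* the same executable, but this instance coordinates the others */
            printf("master: distributing work to %d processes\n", size - 1);
        } else {
            /* every other instance works on its own slice of the data       */
            printf("worker %d: processing my part of the data\n", rank);
        }

        MPI_Finalize();
        return 0;
    }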
Components of Cluster
• Tasks can be allocated independently to the
processors.

• If no. of processors are less than no. of tasks then


group of tasks may be given to individual processors.

• If there exists dependency among the tasks, first
construct the task dependency graph, and then
allocate the tasks or groups of tasks to the processors.
• It can be observed that the first
task dependency graph allows
more concurrent processes and
hence more favourable for
parallelization.
• But for large and complex
computation just by looking
into these figures we cannot
make a decision which one is
the better version.
• So, we need to have some
metric to find out which is more
favourable for parallelization.
• The number of small tasks we can find out from a large problem
gives a granularity.

• When you get more granularity, parallelizing it will be easier.

• A larger number of fine tasks implies a fine-grained decomposition;
otherwise it is a coarse-grained decomposition.

• In some of the cases, say you are trying to find out something
over a query and you cannot decompose it into many small
numbers of tasks ,it has to be a large piece of work, it is a coarse
grained one.

• If it is fine grained, you can find out many small units of tasks it
will be better for parallelization.
• The number of programs that can run in parallel at any given
time is known as degree of concurrency. So, even if you have
good granularity there can be dependence across the tasks, but
if we can find out no. of tasks can be operated in parallel at one
particular instant, it is the degree of concurrency.

• More is the degree of concurrency more processors can do the


work and can get better parallelization.

• Concurrency is not fixed during the execution, at the first stage


there may be 4 concurrent tasks, in the later part there are 2
concurrent tasks.

• Maximum degree of concurrency is important to decide good


parallelization.
Average degree of
concurrency is given as
the total weights of the
entire process divided by
the length of the critical
path. A shorter critical
path ensures better
concurrency.
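As a small worked illustration (numbers assumed, not from the slides): if the tasks in a dependency graph have weights summing to 18 units of work and the critical path, i.e., the longest chain of dependent tasks, has a length of 6 units, then the average degree of concurrency is 18/6 = 3, meaning that on average three tasks can run in parallel; a shorter critical path would raise this value.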
Task Interaction Graph
• The task dependency graph involving concurrency and granularity does
not consider the requirement of data sharing and communication across
tasks.
• Task interaction graph identifies which of the tasks have sharing of data.
• There has to be cache coherency, and the protocols for cache
coherency can introduce some latency into that process.
• There can be false sharing which will also introduce latency.
• For a distributed memory system if there is sharing then some data has
to be shared from one processor to another processor, some data has to
be physically passed through the interconnectors from one processor to
another processor which will also put certain overhead in the
calculations.
• Therefore, the task interaction graph is important to identify or
estimate how much overhead is going to occur for one particular task
dependency graph and how much data is shared across different
processors.
• Once we have all
the tasks we draw
the task
dependency graph.

• Next is to map the


tasks to the
processors and see
which processors
will run during the
execution and
which task will go
to which of the
processors.
HPC Software Stack
Data-Parallel Model
• Use the data decomposition technique, asking
different processes to work on different
parts of the data in parallel.
• There should be no data dependency across
processes.
• Contentions, false sharing and cache
coherency issues to be looked into.
Task Graph Model
• Some of tasks are dependent on previous tasks which is
not usually seen in data parallel models because it is the
same data in which different tasks can independently
work.

• There are some task dependencies and therefore, looking


into the task dependency model we need to find out the
critical path and get up a parallelized model.

• This is more involved and closer to research experience;
it requires more effort to arrive at a parallel algorithm than a
data-parallel model does.
Work Pool Model
• If we have a small amount of task dependency across
different tasks.

• So get the total amount of work, group different tasks


together and use dynamic load balancing to get part of
the task done by one processor, another part of the task
to be done by another processor.

• Make pools of tasks and assign it to different processors


and this can be done using dynamic load balancing, in
case task dependency is smaller or almost no task
dependency.
Master Slave Model
• In the Master slave model, one processor is called the
Master.
• The master knows that it has many tasks to do, like running
a for loop, and it asks different slaves or different
processors under its own supervision to take care of
some of the tasks or parts of the task given to it.
• So, there is a master task which is forked or broken down
into many slave tasks and different processes take care of
different tasks.
• Challenges of Synchronization and optimization are
there.
Pipeline Model
Hybrid Model
• Any combination of models discussed so far can
give a hybrid model.
• Data parallel model combined with a master-slave
model or pipeline model combination with task
graph model.
• So, whenever there is a possibility of developing a
hybrid model then there is a possibility of being
innovative in terms of designing the algorithms.
• So, we can combine some of these ideas and
make our own model.
• Agglomeration is grouping tasks by looking into the
communication among them. These groups of tasks will go to
different processors through a mapping which will also
take care of load balancing.
• Mapping is to map the groups of tasks looking into
localization and communication overhead, along with taking
care of load balancing, so that the processors have nearly the same
number of tasks.
• Also, mapping takes care of the fact that processors whose
tasks interact with one another are not placed very far from
each other.
• Say a parallel program is running over 8 processors (Diagram
of previous slide).

• The black one is the computation; this includes the essential
computation which was present even in the
serial algorithm and the excess computation added due to our
parallelization of the job.

• The grey one is the inter processor communication and the


white one is the idle time.

• If we add up the total time used by all the processors, in


between 10 to 20 percent of the total time is gone for
something which is due to inter processor communication and
idling.
• As we are increasing number of processors, we should
get faster solution because many computers are
taking care of the large problem and it will be solved
faster.

• But the inter processor communication is also


increased and idling has also been increased due to
non uniform load balancing.

• As we are increasing more processors the load


balancing becomes more difficult and therefore, these
overheads have also increased well.
Performance metrics
• Execution time Tp, defined as how much time parallel
program is taking for execution.
• If we are running a program in 10 processors in one
particular infrastructure, what is the time taken.
• And this is compared with the Serial Programs Time Ts,
that if we do not use a parallel program. For the serial
program Ts is the time taken for execution of the same
problem. Ts is determined by how large is our problem
• Execution time and the speed of the parallel program are
inversely related.
• No. of iterations or how much computation has been done
is also a measurement of speed.
• Consider We have given a problem, then calculate the least time in which
this problem can be solved in a single sequential program in a single
processor.

• An inverse of this will give the sequential speed.

• If we compare parallel speed with the sequential speed we will get speed up
or scalability.

• This is obtained by dividing the speed of parallel programs by the fastest


sequential program for the same problem.

• It may not be the same program running in the same single processor.
Because when we make a parallel program there are many excess computing
computational steps also, but in the simple sequential program these are not
there.

• Also, a problem can be solved in many ways, but we consider the one which
gives us the fastest solution in a single processor.
• Speed up is denoted as S which is equal to Ts/ Tp.
• It is the time taken by the best performance sequential program
divided by the time taken by the parallel program.

• S should always be greater than 1; if it is less than or equal to 1,
then there is no point in using the parallel infrastructure.

• If we draw speed up curve for any general parallel program we


see the speed up is increasing with the number of processors.

• In the ideal case, when we add one more processor we get a
proportional increase in speed-up; the curve should
follow a 45-degree line (directly proportional).
• In a real case, the curve deviates from that 45-degree slope line due
to the overheads.
• As the number of processors increases, we get faster
solutions; however, the gain is not as large as the number of
processors being added.
• Because with increasing in processors the overheads are
increasing and overhead depends on communication between
the processor.
• So, as we are increasing the number of processors, the
communication time or overhead is increasing as well as
synchronization overhead and load imbalance overhead all
these things will add.
• So, the actual speed of the curve will deviate from the 45-
degree slope line and eventually it might come to a maximum
and then it starts showing some reductions.
• Beyond that maximum the speed-up starts to reduce, and it
keeps reducing with increasing numbers of processors
if we use a very large number of processors.
• At that point of time it is not
meaningful to use a large number
of processors.
• So, when it is reaching a maximum
it will probably stop. We cannot
add any more processors then that
is not beneficial.
• It is desirable that this curve
should be much closer to the ideal
curve that the overhead will be
less and we will get better speed
• The best possible parallel behaviour is
the 45-degree line, i.e., linear scalability.
• As we are increasing the number of
processors, the speed is increasing
more than the curve shown here.
This is a much more acceptable
parallel program. These are some
cases when we get super linearity.
• It has more complex issues involving
hardware and memory access. But
our desire is that the curve should be
pushed more towards this line and
that the overhead will be reduced.
• So, what is the maximum limit to
which we can push? If we have
already reached the maximum limit,
we should not spend more time
increasing the performance. It is like
trying to violate laws.
• Efficiency E is given as Speed-up divided by the number of
processors.

• If we are having a large problem to solve and this is running


over a large number of computers and processors, these
computers cost huge amount of money in establishing the set
up as well as for running the setup.
• There is huge operational cost including the energy investment.
• Also they do contribute to our energy scenario as well as to
global warming. Data centers or the hubs in which we host the
parallel computers, they are infamous for sources of global
warming.
• We should be judicious while deciding what is the optimum
number of processors and what is the optimum size of the
computing environment in a computing infrastructure.
• E = S/p = Ts/(p·Tp)
• p·Tp is the sum of the sequential computing time and the computational overhead: p·Tp = Ts + To, where To is the overhead.
• Therefore E = S/p = Ts/(p·Tp) = Ts/(To + Ts) = 1/(1 + To/Ts).
• This To/Ts is the ratio of overhead to sequential computing time; therefore, efficiency is a function of the ratio of overhead and sequential computing time.
• Overhead occurs mainly due to interaction between the processors, i.e., data transfer, and load balancing.
• As we increase the number of processors, the overhead will increase and more data transfer time will be required.
• With more processors, the load balancing effort and the latency due to synchronization also increase.
• E = S/p, therefore S = p·E.
• Also E = 1/(1 + To/Ts), where To is the overhead.
• Therefore S = p/(1 + To/Ts) = 1/(1/p + To/(p·Ts)).
• In case the overhead is very small, say To is a small number close to 0, then S = 1/(1/p + To/(p·Ts)) = p.
• So, if we are using five processors, the computing time is one fifth of the sequential time, which is an ideal and efficient system.
• But it never happens, because S is never equal to p, as To is never equal to 0.
• So, there is a significant amount of overhead and it leads to a reduction in speed-up.
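As a small worked illustration (numbers assumed, not from the slides): if the best sequential program takes Ts = 100 s and the parallel version on p = 8 processors takes Tp = 16 s, then S = Ts/Tp = 100/16 = 6.25 and E = S/p = 6.25/8 = 0.78. The total overhead is To = p·Tp - Ts = 128 - 100 = 28 s, which is consistent with E = 1/(1 + To/Ts) = 1/1.28 = 0.78.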
Comparison of Multiprogramming, Multitasking, Multithreading, and Multiprocessing

• Definition. Multiprogramming: running multiple programs on a single CPU. Multitasking: running multiple tasks (applications) on a single CPU. Multithreading: running multiple threads within a single task (application). Multiprocessing: running multiple processes on multiple CPUs (or cores).
• Resource Sharing. Multiprogramming: resources (CPU, memory) are shared among programs. Multitasking: resources (CPU, memory) are shared among tasks. Multithreading: resources (CPU, memory) are shared among threads. Multiprocessing: each process has its own set of resources (CPU, memory).
• Scheduling. Multiprogramming: uses round-robin or priority-based scheduling to allocate CPU time to programs. Multitasking: uses priority-based or time-slicing scheduling to allocate CPU time to tasks. Multithreading: uses priority-based or time-slicing scheduling to allocate CPU time to threads. Multiprocessing: each process can have its own scheduling algorithm.
• Memory Management. Multiprogramming: each program has its own memory space. Multitasking: each task has its own memory space. Multithreading: threads share memory space within a task. Multiprocessing: each process has its own memory space.
• Context Switching. Multiprogramming: requires a context switch to switch between programs. Multitasking: requires a context switch to switch between tasks. Multithreading: requires a context switch to switch between threads. Multiprocessing: requires a context switch to switch between processes.
• Inter-Process Communication (IPC). Multiprogramming: uses message passing or shared memory for IPC. Multitasking: uses message passing or shared memory for IPC. Multithreading: uses thread synchronization mechanisms (e.g., locks, semaphores) for IPC. Multiprocessing: uses inter-process communication mechanisms (e.g., pipes, sockets) for IPC.
Difference between
Asymmetric and Symmetric
Multiprocessing
• Multiprocessing is the use of two or more central processing
units within a single computer system. Parallel computations are
made possible by multiprocessor systems.

• Asymmetric Multiprocessing: A multiprocessor computer


system where not all of the multiple interconnected central
processing units (CPUs) are treated equally. Here, only a master
processor runs the tasks of the operating system.
• The processors in this instance are in a master-slave relationship.
While the other processors are viewed as slave processors, one
serves as the master or supervisor process.
• In this system, the master processor is responsible for assigning
tasks to the slave processor.
• The disadvantage of this kind of multiprocessing system is the
unequal load placed on the processors. While the other processors
might be idle, one CPU might have a huge job queue.
• Symmetric Multiprocessing: It involves
a multiprocessor computer hardware and
software architecture where two or more
identical processors are connected to a
single, shared main memory, and have full
access to all input and output devices.

• In other words, Symmetric Multiprocessing


is a type of multiprocessing where each
processor is self-scheduling.
1. Asymmetric: In asymmetric multiprocessing, the processors are not treated equally.
   Symmetric: In symmetric multiprocessing, all the processors are treated equally.
2. Asymmetric: Tasks of the operating system are done by the master processor.
   Symmetric: Tasks of the operating system are done by the individual processors.
3. Asymmetric: There is no communication between processors, as they are controlled by the master processor.
   Symmetric: All processors communicate with one another through a shared memory.
4. Asymmetric: The process scheduling approach used is master-slave.
   Symmetric: Processes are taken from a common ready queue.
5. Asymmetric: Asymmetric multiprocessing systems are cheaper.
   Symmetric: Symmetric multiprocessing systems are costlier.
6. Asymmetric: Asymmetric multiprocessing systems are easier to design.
   Symmetric: Symmetric multiprocessing systems are complex to design.
7. Asymmetric: The processors can exhibit different architectures.
   Symmetric: The architecture of each processor is the same.
8. Asymmetric: It is simple, as here only the master processor has access to the data, etc.
   Symmetric: It is complex, as synchronization of the processors is required in order to maintain load balance.
9. Asymmetric: In case a master processor malfunctions, a slave processor continues the execution and is turned into the master processor; when a slave processor fails, other processors take over its task.
   Symmetric: In case of processor failure, there is a reduction in the system's computing capacity.
10. Asymmetric: It is suitable for homogeneous or heterogeneous cores.
    Symmetric: It is suitable for homogeneous cores.
• Thread Parallelism
• Compilation And Execution Of OpenMP (Open MultiProcessing) Program
• OpenMP Programming Model, And
• OpenMP Data Handling Using Shared And Private Variables
Thread-Level Parallelism
(TLP)
• TLP refers to the ability of a processor to execute multiple
threads concurrently, enabling more efficient utilization of the
available resources and delivering remarkable gains in
computational throughput.

• Thread-Level Parallelism takes advantage of the inherent


parallelism within software programs.

• Traditionally, processors executed instructions sequentially,


limiting the potential for speedup, especially in applications
with inherently parallelizable tasks.

• TLP introduces a paradigm shift by allowing a processor to


execute multiple threads simultaneously, effectively breaking
down complex tasks into smaller, more manageable chunks
that can be processed concurrently.
• Types of Thread-Level Parallelism

• Instruction-Level Parallelism (ILP): This form of


TLP focuses on executing multiple instructions from a
single thread in parallel. Techniques such as pipelining
and superscalar architectures fall under this category.

• Data-Level Parallelism (DLP): DLP involves


executing the same operation on multiple data
elements simultaneously, often seen in SIMD (Single
Instruction, Multiple Data) architectures.

• Task-Level Parallelism (TLP):TLP refers to executing


multiple independent threads concurrently. This is
particularly relevant in today’s context, as it aligns
with the trend of increasing processor core counts.
Mechanisms to Exploit
Thread level parallelism(TLP)
• Multicore Processors: One of the most tangible
embodiments of TLP is the advent of multicore processors.
These processors feature multiple independent processing
cores on a single chip, each capable of executing threads
in parallel.

• Simultaneous Multithreading (SMT): SMT, often


referred to as Hyper-Threading, allows a single physical
core to execute multiple threads simultaneously,
effectively increasing core-level thread-level parallelism.

• Task Scheduling and Load Balancing: Efficient thread


scheduling and load balancing algorithms ensure that tasks
are distributed across available cores optimally,
maximizing resource utilization.
Open Multi-Processing (OpenMP)
• Open multi-processing (OpenMP) is a standard, shared-
memory, multi-processing application program interface (API)
for shared-memory parallelism.

• OpenMP is an add-on in a compiler.

• It provides a set of compiler directives, environment variables,


and runtime library routines for threads creation, management,
and synchronization.

• When a parallel region executes, the program creates a number


of threads running concurrently.

• With OpenMP, forked threads have access to shared memory.
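A minimal hedged sketch of an OpenMP parallel region in C (the thread count of 4 is illustrative): the directive forks a team of threads, each of which runs the block concurrently and can access the shared memory.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel num_threads(4)   /* fork: a team of 4 threads runs this block */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                     /* join: execution continues on one thread   */
        return 0;
    }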


Message Passing Interface (MPI)
Versus
Open Multi-Processing (OpenMP)
• MPI: Available from different vendors and gets compiled on Windows, macOS, and Linux operating systems.
  OpenMP: An add-on in a compiler such as a GNU compiler and the Intel compiler.
• MPI: Supports parallel computation for distributed-memory and shared-memory systems.
  OpenMP: Supports parallel computation for shared-memory systems only.
• MPI: A process-based parallelism.
  OpenMP: A thread-based parallelism.
• MPI: With MPI, each process has its own memory space and executes independently from the other processes.
  OpenMP: With OpenMP, threads share the same resources and access shared memory.
• MPI: Processes exchange data by passing messages to each other.
  OpenMP: There is no notion of message-passing; threads access shared memory.
• MPI: Process creation overhead occurs one time.
  OpenMP: It depends on the implementation; more overhead can occur when creating threads to join a task.
HPC clusters can be optimized
automatically and can improve HPC
cluster performance
• Tuning HPC applications can be very difficult.
• There are many settings in the applications, the MPI stack, and the
underlying OS and hardware, making it next to impossible for
performance engineers to fully optimize their applications and systems.
• Tools that employ machine learning techniques, such as those used by the
Concertio Optimizer Studio product, can help get better results more quickly.
• Some commercial software also has some optimization built in, so that if
you build it correctly for a multi-CPU system, it uses its own OpenMP
constructs internally, which we probably cannot see directly.
Thread Based Parallelization
• The threads may run independently in different
processors, and if they are pointing to the same
memory space they are needed to perform
synchronizations and cache coherency.

• Some memory location may be accessed by multiple


threads and therefore sequentiality will be required
there. This is another type of synchronization and
here also cache coherency protocols are required.
• Message passing is a distributed-memory programming model in which
each processor is connected with its own local memory space.
When one processor updates something, it updates only its
local memory space, and at particular instances, when the
programmer decides that these updates are to be communicated to some
other processor, it passes a message or some data through the
network to that other processor. The programmer has to
control when the data passing is done, using various flow
control techniques.
• But in thread parallelism the programmer cannot control much,
because it is left to the hardware and the operating system:
the same memory is shared, and different processors write to
and read from that same memory. So, it depends less on the program
and more on the system how this memory will be accessed by
the different processors.
Example: Matrix Multiplication

• Launch two threads, First thread will multiply first row with
the first column and a second thread will do for the second
row, second column.

• These two elements can be obtained in parallel if we run


two threads in two different processors.

• This is an example of a thread parallelism in that these two


are obtained operating over two columns of one of the input
data. And these operations can be done independently.
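A hedged C/OpenMP sketch of launching two threads for the two elements described above (the 2x2 matrices and the chosen elements are illustrative):

    #include <omp.h>

    void two_elements(const double A[2][2], const double B[2][2], double C[2][2]) {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section   /* thread 1: first row times first column   */
            C[0][0] = A[0][0] * B[0][0] + A[0][1] * B[1][0];

            #pragma omp section   /* thread 2: second row times second column */
            C[1][1] = A[1][0] * B[0][1] + A[1][1] * B[1][1];
        }
    }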
• In the serial part only one thread is active; when the parallel
part comes the thread parallelism forks it into multiple threads
and different threads take care of different parts or different
groups of the main program.

• Therefore, different instruction streams are launched.

• When we have to go to the serial part again, all these threads


join and one thread remains active and we reach to the serial
part.

• This is very important feature of thread parallelism that it's not


that all threads are active throughout the parallel program. At
different instance, we can launch number of threads which will
be active and in the sequential region only one thread is active
• Programmer can add OpenMP constructs over sequential
codes to convert to a multithreaded parallel program.

• If in our own program we identify that certain part can be


parallelized, then we can write OpenMP construct over our
sequential code like the C code, can write some OpenMP lines
and convert it into a multithreaded parallel program.

• So, parallelization of our already existing sequential program


can be done using OpenMP and that is one of the goals of
OpenMP.
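For instance (a hedged sketch; the array, size, and operation are illustrative), a single OpenMP directive placed above an existing sequential C loop converts it into a multithreaded parallel loop:

    #include <omp.h>

    void scale(double *a, int n, double factor) {
        #pragma omp parallel for    /* the one added line: iterations are split among threads */
        for (int i = 0; i < n; i++)
            a[i] = a[i] * factor;   /* the sequential loop body is unchanged                   */
    }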
• Some executable will be running through all the processors
connected with the main address space.

• Operating system will support for shared memory access


of different processors, and multi threading’s that i.e.,
different threads are active over different processors
connected to the same shared memory system has to be
supported by the operating system.

• When it is supported by the operating system, we can send


the instructions to the operating system and that will be
done by OpenMP runtime library.

• This is the systems layer, there is a software systems layer


which is operational over the hardware layer.
• The programmer layer: if we write OpenMP program then this program
will call OpenMP runtime libraries and hence they are interacting with the
operating system that the jobs are launched in different processors
concurrently connected with the same address space.

• When the program is compiled, it calls the OpenMP libraries: the programmer
calls the OpenMP library functions, the compiled job is linked to the OpenMP
libraries, and the OpenMP runtime libraries are asked to interact with the
operating system to execute the program; the programmer can also set some
of the environment variables.

• For example: on how many processors should it run, how many threads will be
active at one point of time, and which thread will be specifically asked to do
something else? This is the programmer's layer; once the program is ready, it is
compiled into an executable which can finally be used by the end user from an
application area. The end user only runs the OpenMP program for the
application he wants to solve.
• This is typical hierarchy of OpenMP software subsystems
where user only knows that it has an application job which
has to be worked.

• A programmer takes the application, writes a program for it,
includes the directives, calls the library functions during
compilation, and also sets the environment variables.

• This will go to the systems layer of the OpenMP library,


runtime libraries will be active and will interact with the OS
and the OS will launch the jobs, as multi-threaded jobs
connected with the shared memory system.
• OpenMP works through different threads, and all these
threads point to a common shared address space.
• All these threads execute a single instruction with multiple
parts of this data.
• Each of the threads executes the same program instance: we write a
single instance and ask multiple threads to execute that
instance.
• So, a single line is written inside the program which is
compiled and given to different threads to execute, and if this
line contains some variable which is local to the thread (as we
are using a shared memory system) these local variables
sometimes become out of scope because there is a shared
memory.
• It is the same variable name which is given to all the threads
because it is the same programming instance which is being
compiled and given to all the threads.
Nested Parallelism
• Nested parallelism is a parallel loop within a parallel loop.
So, it is one more order of parallelization.

• OpenMP parallel regions can be nested inside each other.

• If nested parallelism is disabled, then the new team created


by a thread encountering a parallel construct inside a parallel
region consists only of the encountering thread.

• If nested parallelism is enabled, then the new team may


consist of more than one thread.
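A hedged C/OpenMP sketch of enabling and using nested parallel regions (the thread counts are illustrative; note that omp_set_nested has been superseded by omp_set_max_active_levels in recent OpenMP versions):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_nested(1);                        /* enable nested parallelism             */

        #pragma omp parallel num_threads(2)       /* outer team of 2 threads               */
        {
            int outer = omp_get_thread_num();

            #pragma omp parallel num_threads(2)   /* each outer thread forks an inner team */
            {
                printf("outer thread %d, inner thread %d\n",
                       outer, omp_get_thread_num());
            }
        }
        return 0;
    }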
• With a padding size of 8, on a single
processor the time is almost the same, 2.229
seconds, which matches the single-processor
calculation.
• In 2 processors it is reduced, almost by half
not exactly half because there will be certain
overhead; in 4 processors it is further
reduced; in 8 processors it is also further
reduced.
• We are getting parallel performance. It is
reducing almost by half not exactly by half
because we know that as we increase the
number of processors or number of threads
overheads are there. So, padding with extra
elements gives us good performance.
• But padding is very much dependent on
the padding size and how the padding size
compares with the cache line size of that
particular processor.
• The caveat is - padding requires the detailed knowledge of
cache architecture.
• Once code is executed in a different machine the required
padding size may be different because these 8 elements
which we have put here is sufficient to fill up the cache line.
• In a different computer the cache line size might be
different and with 8 it might not be able to fill up the space,
it might require more elements.
• So, it is not a portable code anymore. It is an architecture
hardware dependent program that is the cache size we
need to know and then we can specify the padding.
• We used a padding size 4 and we saw that the performance
did not improve much because still there is some false
sharing here.
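A hedged C/OpenMP sketch of the padding idea (the per-thread partial-sum array and the pad of 8 doubles are illustrative; the right pad depends on the cache line size of the machine at hand):

    #include <omp.h>
    #define NTHREADS 8
    #define PAD      8    /* 8 doubles = 64 bytes, a common cache line size */

    double sum_array(const double *x, int n) {
        double partial[NTHREADS][PAD];    /* each thread writes only partial[id][0];   */
                                          /* the extra elements keep the sums on       */
                                          /* different cache lines, avoiding false     */
                                          /* sharing                                   */
        for (int i = 0; i < NTHREADS; i++)
            partial[i][0] = 0.0;          /* initialize all slots in case fewer threads run */

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            #pragma omp for
            for (int i = 0; i < n; i++)
                partial[id][0] += x[i];
        }

        double total = 0.0;
        for (int i = 0; i < NTHREADS; i++)
            total += partial[i][0];
        return total;
    }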
• A race condition occurs when two or more threads access
a common resource, e.g., a variable in shared memory,
and the result depends on the order of access of the threads
and the progress of the individual threads.
• A data race happens when two threads access the same
object without synchronization, while a race condition
happens when the order of events affects the correctness of
the program.
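A hedged C/OpenMP sketch of a data race on a shared counter and one way to remove it (the names are illustrative; a reduction clause would be another common fix):

    #include <omp.h>

    int count_race(int n) {
        int counter = 0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            counter++;            /* unsynchronized read-modify-write: a data race */
        return counter;           /* may return less than n                        */
    }

    int count_safe(int n) {
        int counter = 0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp atomic    /* serializes each update, removing the race     */
            counter++;
        }
        return counter;           /* always returns n                              */
    }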
