High Performance Computing
(410250)
What comes to your mind when you see these three pictures of computers?
(Images: Personal Laptop, Gaming Laptop, Super Computer)
The first laptop on the left is a personal laptop, which we use for our day-to-day work.
The gaming laptop has a higher configuration and a graphics card for high-definition games.
Supercomputers are used by scientists and big companies for complex mathematical modelling.
So the main differentiating factor is computing power. Computing power refers to how fast and
capable a computer is at performing tasks and calculations.
Personal Laptop
+ Processor: AMD Ryzen 5 7530U (12 threads, 16 MB L3 cache)
+ Memory: dual-channel capable, upgradable up to 40 GB
+ Storage: 512 GB M.2 SSD

Gaming Laptop
+ Processor: 13th Gen Intel Core i9-13980HX, 2.2 GHz
+ Memory: 16 GB (8 GB SO-DIMM x2) DDR5 4800 MHz, up to 32 GB, 2x SO-DIMM slots
+ Storage: 1 TB PCIe 4.0 NVMe M.2 SSD

Super Computer
+ Peak performance: 200 PFLOPS
+ Number of nodes: 4,608
+ Memory per node: 512 GB DDR4 + 96 GB HBM2
+ Storage: 250 PB IBM Spectrum Scale (GPFS), 2.5 TB/s
+ Power consumption: 13 MW
+ Operating system: Red Hat Enterprise Linux (RHEL) version 7.4
I have sample specifications here, and I want to highlight the key computing differences.
As the computation need increases, processor requirements also increase, which is met by
increasing the number of processors and cores and the cycle frequency.
A typical personal computer will have around 6 cores and up to 16 GB of RAM, whereas a gaming
laptop will have a higher number of cores and more RAM.
But if you look at a supercomputer, the number of processors runs into the thousands,
and the aggregate RAM into petabytes.
High Performance Computing
High Performance Computing (HPC) refers to the use of powerful computers
and parallel processing techniques to solve complex problems or perform
tasks at a much faster rate than traditional computers.
We saw in the last slide what makes computers powerful: the number of processors, their
cores, the frequency, and the RAM.
In this first chapter we will look at parallel processing techniques in detail.
Applications of High Performance Computing
+ Financial institutions - transactions and card fraud detection
+ Bio-sciences and the human genome - drug discovery, disease detection/prevention
+ Computer-aided engineering - automotive design and testing, transportation commerce,
structural outlook, mechanical design
+ Chemical engineering - process and molecular design
+ Digital content creation and distribution - computer-aided graphics in film and media
+ Economics/financial - Wall Street risk analysis, portfolio management, automated trading
+ Electronic design and automation - electronic component design
+ Geosciences and geo-engineering - oil and gas exploration and reservoir modelling
+ Mechanical design and drafting - 2D and 3D design and verification, mechanical modelling
+ Defense and energy - nuclear stewardship, basic and applied research
+ Government labs, universities/academia - basic and applied research
+ Meteorological departments - weather forecasting
Let's look at some of the applications of high performance computing.
Parallel Processing
A parallel computer is a set of processors that are able to work cooperatively
to solve a computational problem.
Parallel computing is a form of computation in which many instructions are
carried out simultaneously, operating on the principle that large problems can
often be divided into smaller ones, which are then solved concurrently (in
parallel).
Here's a simplified example to help illustrate the concept of parallel processing.
Imagine you have a really challenging puzzle to solve, and you want to do it as quickly as
possible. If you try to solve it alone, it might take a long time. However, if you have a
group of friends working together simultaneously, each focusing on a different part of the
puzzle, you can finish much faster.
In the context of computing, traditional computers are like individuals trying to solve the
puzzle on their own. High Performance Computing, on the other hand, is like having a
team of super-fast computers working together to tackle a complex problem.
Serial Processing vs. Parallel Processing

To be run on a single computer having a single CPU:
+ A problem is broken into a discrete series of instructions
+ Instructions are executed one after another
+ Only one instruction may execute at any moment in time

To be run using multiple CPUs:
+ A problem is broken into discrete parts that can be solved concurrently
+ Each part is further broken down to a series of instructions
+ Instructions from each part execute simultaneously on different CPUs
In serial programming, a problem is broken into a series of instructions. Just recall your C
programs: each line has some instructions.
These instructions run one after another, and only one instruction gets executed at any
moment. That is serial programming.
The first thing we need for parallel processing is multiple CPUs.
The problem is broken into parts that can be solved in parallel.
Each part is further broken down into a series of instructions.
Instructions from each part execute simultaneously on different CPUs.
Serial Processing vs. Parallel Processing

Task: count from 1 to 1000.

Serial: Using a regular computer, you would start at 1 and incrementally count each number
one by one until you reach 1000. This process might take some time, but it's manageable
for a personal computer.

Parallel: With HPC, you could divide the task among multiple processors. For instance, if
you have 10 processors, each processor could be responsible for counting a range of 100
numbers. So, Processor 1 counts from 1 to 100, Processor 2 from 101 to 200, and so on.
All processors work simultaneously, and the entire task of counting from 1 to 1000 is
completed much faster compared to a personal computer.
Let's understand this concept with a very simple example. Suppose you want to count from 1 to 1000.
Using a regular computer, you would start at 1 and incrementally count each number one by one
until you reach 1000. This process might take some time, but it's manageable for a personal
computer.
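To make this concrete, here is a minimal C sketch of the same idea using OpenMP (an
assumption for illustration; the slides do not name a specific API). The runtime splits the
loop iterations across the available processor cores, and the reduction combines the
per-core partial results:

    /* parallel_count.c - compile with: gcc -fopenmp parallel_count.c */
    #include <stdio.h>

    int main(void) {
        long total = 0;
        /* Each thread sums its own chunk of 1..1000; the reduction
           clause merges the per-thread partial sums at the end. */
        #pragma omp parallel for reduction(+:total)
        for (int i = 1; i <= 1000; i++) {
            total += i;
        }
        printf("sum of 1..1000 = %ld\n", total); /* 500500 */
        return 0;
    }

With 10 cores, each core handles roughly 100 iterations, just like the 10 processors in the
slide's example.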
Motivating Parallelism
Reasons for Growth:
+ Advancements in specifying and coordinating complex concurrent tasks.
+ Portable algorithms facilitating parallel processing.
+ Specialized execution environments and software development toolkits.
Reasons:
+ Increased Computational Power
+ Enhanced Memory/Disk Speed
+ Improved Data Communication
In recent years, there has been a big improvement in how computers handle multiple tasks at
once, that is, parallel processing.
This is because we've gotten better at organizing and managing complex tasks happening at once,
creating portable algorithms (sets of instructions), using special environments for executing tasks,
and developing toolkits for making software. This progress rests on three main reasons:
1. Increased Computational Power: Modern computers, equipped with CMOS chip-based
processors and advanced networking, have become significantly more powerful. This has driven
the development of applications capable of handling multiple tasks simultaneously.
2. Enhanced Memory/Disk Speed: Progress in hardware interfaces has expedited the transition
from microprocessor creation to the development of entire machines that efficiently execute
parallel tasks.
3. Improved Data Communication: Standardization of programming environments has seen
notable advancements. This ensures that applications designed for parallel processing remain
relevant and useful for an extended period.
Modern Processor
Stored-Program Computer Architecture
Stored-program computer architecture is a design where instructions and data are stored in the same
memory, allowing a central processing unit to sequentially fetch, decode, and execute instructions, enabling
versatile programmability.

(Diagram: Input Device -> Central Processing Unit -> Output Device, with the CPU connected to memory)
"Stored-program computer architecture is like having a recipe book for your computer. In this
analogy, the recipe book Is the computer's memory, and the chef Is the central processing unit
(CPU). Let's break it down:
Memory Unit:
Just like a recipe book contains both instructions and a list of ingredients, the computer's
memory stores both program instructions and data.
Chef (CPU):
The CPU acts like a chef following the instructions in the recipe book. It fetches each step,
processes it, and moves on to the next one.
Fetching and Execution:
Imagine the CPU as a chef turning the pages of the recipe book (fetching), reading the
instructions (decoding), and then cooking accordingly (execution).
Example is personal computer
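The fetch-decode-execute cycle can be sketched in a few lines of C. This toy machine is purely
illustrative (the opcodes and memory layout are invented for the example), but it shows
instructions and data living in the same memory array, exactly as the stored-program concept
requires:

    #include <stdio.h>

    /* Toy stored-program machine: instructions and data share one
       memory array; the CPU loop fetches, decodes, and executes. */
    enum { HALT = 0, LOAD = 1, ADD = 2, PRINT = 3 };

    int main(void) {
        /* memory holds (opcode, operand) pairs */
        int memory[] = { LOAD, 5, ADD, 7, PRINT, 0, HALT, 0 };
        int pc = 0, acc = 0;
        for (;;) {
            int op  = memory[pc];     /* fetch  */
            int arg = memory[pc + 1];
            pc += 2;
            switch (op) {             /* decode + execute */
                case LOAD:  acc = arg;            break;
                case ADD:   acc += arg;           break;
                case PRINT: printf("%d\n", acc);  break;
                case HALT:  return 0;
            }
        }
    }

Running it prints 12: the CPU fetched LOAD 5, then ADD 7, then PRINT, one instruction after
another, just like the chef working through the recipe.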
General-Purpose Cache-Based Microprocessor Architecture
General-purpose Cache-based Microprocessor architecture is a design incorporating a cache memory
hierarchy to enhance data access speed and overall performance in executing a wide range of computational
tasks.
(Diagram: memory hierarchy with CPU, cache, primary memory, and secondary memory; word
transfers move data between the CPU and the fast cache, while main memory is slower)
Again, let's understand this with the same analogy of a chef and a kitchen.
Microprocessor (Chef): The microprocessor is like the chef, responsible for executing instructions and
processing data.
Cache (Countertop): Now, think of the cache as the countertop near the chef. This is where the chef
keeps ingredients they use frequently.
Main Memory (Pantry): The main memory is like the pantry, storing a larger quantity of ingredients.
However, it takes a bit more time for the chef to go to the pantry to get less frequently used ingredients.
Fetching Ingredients (Data):
When the chef needs an ingredient (data), here's what happens:
First, the chef checks the countertop (cache) for commonly used ingredients.
If the ingredient is on the countertop (in the cache), great! It's quickly accessed.
If not, the chef goes to the pantry (main memory) to retrieve the ingredient.
Everyday Products:
Smartphones and Laptops: Just like a chef needs quick access to ingredients, your smartphone and
laptop use cache memory to store frequently accessed data and instructions for faster processing.
Web Browsing:
When you load a webpage, the browser uses cache memory to store elements of the page
for quicker retrieval. It's like having the ingredients ready for the chef without going to the
pantry every time.
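A short C sketch can make the cache effect visible. This is an illustrative micro-benchmark
(the array size and timings are assumptions and will vary by machine): both loops do the same
work, but the first walks memory in the order it is laid out, so each cache line fetched from
main memory is fully used, while the second keeps going back to the "pantry":

    #include <stdio.h>
    #include <time.h>

    #define N 2048
    static double a[N][N];

    int main(void) {
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)        /* row-major walk:     */
            for (int j = 0; j < N; j++)    /* consecutive memory, */
                a[i][j] += 1.0;            /* cache-friendly      */
        clock_t t1 = clock();
        for (int j = 0; j < N; j++)        /* column-major walk: */
            for (int i = 0; i < N; i++)    /* strided memory,    */
                a[i][j] += 1.0;            /* cache-unfriendly   */
        clock_t t2 = clock();
        printf("row-major:    %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-major: %.3fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

On most machines the column-major loop is noticeably slower, even though both loops perform
exactly the same additions.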
Parallel Programming Platforms
Explicit Parallelism
+ Programmer specifically defines and instructs the system on parallel tasks.
+ Programmer actively incorporates parallel constructs or directives into the code.
+ System follows the programmer's explicit instructions for parallel execution.

Implicit Parallelism
+ Automatically identifies and executes tasks concurrently without explicit instructions from
the programmer.
+ Programmer writes regular, step-by-step code without specific parallel constructs.
+ Compiler, runtime system, and hardware work together to find and exploit parallel
opportunities.
Implicit parallelism is a type of parallelism in computing where the system automatically handles
multiple tasks at the same time without you needing to ask for it explicitly.
It means you can write your programs in a regular, step-by-step way, and behind the scenes, the
computer's compiler and hardware work together to find opportunities to speed things up by
doing tasks simultaneously.
So, as an engineer, you focus on your code's logic, and the system takes care of making it run faster
using parallel processing, all without you having to add any special parallel instructions.
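The contrast can be sketched in C (a hedged illustration; the course does not prescribe these
tools). The first loop is explicit parallelism: the programmer adds an OpenMP directive. The
second is the implicit style: plain sequential code that an optimizing compiler (for example,
gcc at -O3, which auto-vectorizes) may still speed up in parallel without any hint in the source:

    /* compile with: gcc -fopenmp -O3 explicit_implicit.c */
    #include <stdio.h>

    #define N 1000000
    static double x[N], y[N];

    int main(void) {
        /* Explicit parallelism: the programmer inserts a directive
           telling the system to split the loop across threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i];

        /* Implicit parallelism: the same kind of loop written
           plainly; the compiler and hardware may vectorize or
           overlap it with no parallel constructs in the source. */
        for (int i = 0; i < N; i++)
            x[i] = y[i] + 1.0;

        printf("%f\n", x[N - 1]);
        return 0;
    }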
Implicit Parallelism - Pipelining Execution
Pipelining in High-Performance Computing
+ Maximizing Processor Utilization
o Utilize the ALU, buses, registers, etc., continuously.
+ Pipelining Concept
o Instructions flow through the processor like a pipe.
o They move through stages to accomplish operations.
+ Continuous Processor Usage
o Each unit handles an instruction, keeping the processor busy.
Imagine your processor is like a well-designed assembly line. Each part of the processor, like the
ALU, buses, and registers, has a specific job. The goal? Keep all these parts busy all the time.
So, what's pipelining? It's like turning your processor into a pipe. Instructions flow through it,
moving from one stage to the next to get the job done. This way, each part of the processor is
always working on something. No downtime.
In simpler terms, it's like a well-oiled machine where instructions smoothly move through different
stages, making sure your processor is always doing something useful.
Overlapping Execution with Pipelining
Implicit Parallelism - Pipelining Execution
+ Non-Pipelined Approach
o Fetch, decode, read, execute, and write run sequentially.
o Hardware sits idle during waiting periods.
+ Pipelining Technique
o Overlap the execution of several instructions.
o Two-stage pipelining example: Fetch and Execute.
+ Benefits
o Faster execution by fetching the next instruction during the current one's execution.
o All units stay busy, preventing idle time.
"Now, let's explore why pipelining is good and how it improves the efficiency of our processors.
In the past, processors followed a step-by-step approach—fetch, decode, read, execute, and write,
one after another. The drawback? Many components of the hardware would remain inactive,
patiently waiting for others to complete their tasks.
In pipelining approach, It's like managing multiple instructions simultaneously. Picture this:
accomplishing two tasks in just two stages—fetching the next instruction while executing the
current one, It's an intelligent method to overlap tasks and maintain a smooth workflow.
What's the result? Quicker execution! Every part of the processor remains engaged, avoiding any
downtime. Think of it as orchestrating a production line where everyone has a role, and the line
keeps moving without interruptions.
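A few lines of C can print the schedule of the two-stage example from the slide. This is only an
illustrative model (real pipelines have more stages and hazards), but it shows the overlap: five
instructions finish in six cycles instead of the ten a strictly serial fetch-then-execute scheme
would need:

    #include <stdio.h>

    /* Toy two-stage pipeline: in each cycle the processor executes
       the previous instruction while fetching the next one. */
    int main(void) {
        int n = 5;                           /* number of instructions */
        for (int cycle = 0; cycle < n + 1; cycle++) {
            printf("cycle %d:", cycle + 1);
            if (cycle < n)  printf("  fetch I%d",   cycle + 1);
            if (cycle >= 1) printf("  execute I%d", cycle);
            printf("\n");
        }
        return 0;
    }

In general, a k-stage pipeline finishes n instructions in about n + k - 1 cycles instead of
n * k, which is where the speedup comes from.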
Implicit Parallelism - Superscalar Execution
+ From Scalar to Superscalar
o Scalar processors had one pipelined unit for integer and one for floating-point operations.
+ Need for Parallelism
o A single pipeline isn't enough for parallelism.
o Pipelines enable parallelism by having multiple instructions at different stages.
o Superscalar processors execute more than one instruction per clock cycle.
o They fetch and decode multiple instructions simultaneously.
So, back in the day, processors were scalar, meaning they had one pipeline for integer operations
and one for floating-point operations. But designers realized that having just one pipeline wasn't
cutting it for getting things done faster. We needed more parallelism.
That's where superscalar came into the picture. It's like having a processor that can do more than
one thing at a time during a single clock cycle. Imagine fetching and decoding multiple instructions
simultaneously. That's the essence of superscalar: making our processors more efficient by doing
multiple tasks at once.
Implicit Parallelism - Superscalar Execution
+ Instruction Level Parallelism (ILP)
o Superscalar architecture exploits Instruction Level Parallelism (ILP).
o Multiple pipelines serve various instructions (e.g., integer and floating-point).
+ Complexity Considerations
o Superscalar scheduler complexity and hardware cost are crucial in processor design.
+ VLIW Solution
o Very Long Instruction Word (VLIW) processors use compile-time analysis.
o Instructions are bundled for concurrent execution, addressing the complexity.

(Diagram: an integer register file and a floating-point register file feeding pipelined integer
functional units and pipelined floating-point functional units)
Let's start with Instruction Level Parallelism (ILP).
We've got multiple pipelines for different instructions like arithmetic, load, and store. It's about
taking advantage of parallelism to speed things up.
Now, here's the catch: making a superscalar processor is not easy. It's complex, and the hardware
cost is something we really need to think about in processor design.
To tackle this, we have something called VLIW, or Very Long Instruction Word, processors. They use
a clever trick at compile time to identify and bundle together instructions that can be executed at
the same time.
It's like putting a bunch of instructions in a very long instruction word to simplify the process.
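Here is a small C illustration of ILP (a sketch only; actual speedups depend on the machine).
A single running sum forms a dependence chain, because each addition needs the previous
result; splitting it into two independent accumulators gives a superscalar core two additions
it can keep in flight at once:

    #include <stdio.h>

    #define N 1000000
    static double a[N];

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* Two independent accumulators break the serial dependence
           chain of a single sum, exposing instruction-level
           parallelism to a superscalar core. */
        double s0 = 0.0, s1 = 0.0;
        for (int i = 0; i < N; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        printf("sum = %.0f\n", s0 + s1); /* 1000000 */
        return 0;
    }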
Implicit Parallelism - VLIW Processor Structure
+ Need for Separate Units
o To perform multiple operations in one execution stage, separate units for each operation
are essential.
+ VLIW Architecture
o Visual representation of separate units for operations (floating-point add, multiply,
branching, integer ALU).
o VLIW (Very Long Instruction Word) executes more than one basic instruction at a time.
o Multiple operations are stored in a single instruction word.
When we want to do multiple things at once in a single execution stage, we need separate units for
each operation. Picture this: for floating-point addition, multiplication, branching, and the integer
ALU, we've got dedicated units. Check out Fig. 1.4.3 for a visual on this.
Now, VLIW stands for Very Long Instruction Word. It's a way for our processors to handle more
than one basic instruction at a time.
How?
By storing multiple operations in a single instruction word. So, when we issue one instruction,
multiple operations kick off simultaneously during the execution cycle of the pipelining process.
Simple, right?
Implicit Parallelism - VLIW Processor Structure
Execution and Compiler Role
+ Simultaneous Operations
o VLIW executes multiple operations simultaneously with one instruction.
+ Compiler's Role
o The compiler identifies parallelism and schedules dependency-free code.
o It resolves dependencies among instructions at compile time.
+ Characteristics
o Multiple independent operations sit in a VLIW instruction, with no flow dependences.
So, VLIW does multiple operations all at once with one instruction, no waiting around. But here's
the trick: the compiler is crucial. It spots where we can run things in parallel and arranges the
code to avoid any dependencies.
So, the compiler is very important, making sure everything plays in harmony. It identifies and
schedules operations that can run side by side, resolving any issues before the program even
runs. One more thing: in a VLIW instruction, all these operations are independent; they don't
rely on each other. It's like having a set of tasks that can be done simultaneously without any
fuss, as the small example below shows.
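The C fragment below is a hypothetical illustration of what a VLIW compiler looks for (C itself
has no VLIW syntax; the bundling happens inside the compiler). The first three statements are
mutually independent, so they could be packed into one long instruction word; the last two form
a dependence chain and cannot be:

    #include <stdio.h>

    int main(void) {
        int b = 1, c = 2, e = 3, f = 4;
        int h[4] = {10, 20, 30, 40};

        /* Independent operations: no result feeds another, so a
           VLIW compiler could bundle all three into one word. */
        int a = b + c;      /* integer add */
        int d = e * f;      /* multiply    */
        int g = h[2];       /* load        */

        /* Dependent chain: each statement consumes the previous
           result, so these must issue in separate words. */
        int x = a + d;
        int w = x * 2;

        printf("%d %d %d %d %d\n", a, d, g, x, w);
        return 0;
    }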
Dichotomy of Parallel Computing Platforms
The division is based on the logical and physical organization of parallel platforms.
Physical organization is the actual hardware organization of a platform.
Logical organization refers to a programmer's view of the platform.

Control Structure
+ The various ways of expressing parallel tasks is known as the control structure.

The Communication Model
+ The mechanisms for specifying interaction between the parallel tasks is called the
communication model.
There are several platforms that facilitate parallel computing.
In this section, the division based on the logical and physical organization of parallel platforms
will be discussed.
Physical organization is the actual hardware organization of a platform. Logical organization
refers to a programmer's view of the platform.
From the programmer's perspective, the two important components of parallel computing are the
control structure and the communication model; a small sketch of the communication side follows.
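As a taste of the communication model, here is a minimal C/OpenMP sketch of the
shared-address-space style (an illustrative assumption; message passing, e.g., MPI, is the
other common model and would use explicit send/receive calls between processes instead):

    /* compile with: gcc -fopenmp shared.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int shared_data = 42;

        /* Shared-address-space model: every thread reads the same
           variable directly, so no explicit messages are needed.
           In a message-passing model each process has private
           memory and would receive this value explicitly. */
        #pragma omp parallel
        {
            printf("thread %d sees %d\n", omp_get_thread_num(), shared_data);
        }
        return 0;
    }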
Physical Organization of Parallel Platforms
Evolution
Let's start with the conventional architecture, representing the traditional uni-processor system.
While some parallel features can improve a single processor's speed, there are limitations.
The foundation of processor architecture traces back to the Von Neumann computer,
characterized by its CPU, memory, and I/O devices.
This system follows the Von Neumann architecture, where the CPU consists of arithmetic and
control units, operating on the stored-program concept. Both program and data share the same
memory unit, each location having a unique address. Execution proceeds sequentially unless the
program explicitly alters this flow.
Fig. 1.8.2 marks the initial steps toward parallelism, introducing lookahead, overlapping fetch and
execute, and parallelism in functions. This latter concept involves two mechanisms: pipelining and
multiple functional units. In the second mechanism, various functional units operate
simultaneously, enhancing processing speed. Vector instructions, akin to massive arrays of data
with a common operation, were initially managed by pipeline processors controlled by software
looping. Subsequently, explicit processors tailored for vector instructions emerged. Two variations
in vector processing are memory-to-memory and register-to-register, with the former utilizing
memory for operand storage and the latter using registers.
The evolution of register-to-register architecture led to the creation of two processor types:
Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD).
These developments signify the gradual integration of parallelism in processors, contributing
to enhanced processing capabilities.
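To make the SIMD idea concrete, here is a small C loop of the kind a vectorizing compiler maps
onto SIMD hardware (an illustrative sketch; with gcc, -O3 typically auto-vectorizes it). One
instruction, the add, is applied across many data elements, which is exactly the Single
Instruction Multiple Data pattern:

    #include <stdio.h>

    #define N 8
    int main(void) {
        float x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        float y[N] = {8, 7, 6, 5, 4, 3, 2, 1};
        float z[N];

        /* Same operation (add) over multiple data elements; a
           SIMD unit processes several of these per instruction. */
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];

        for (int i = 0; i < N; i++) printf("%.0f ", z[i]); /* all 9s */
        printf("\n");
        return 0;
    }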
Physical Organization of Parallel Platforms
Parallel Random Access Machine (PRAM)
Various PRAM models differ in how they handle read or write
conflicts:
+ EREW: Exclusive Read Exclusive Write. p processors can simultaneously read and write the
contents of p distinct memory locations.
+ CREW: Concurrent Read Exclusive Write. p processors can simultaneously read the contents
of p' memory locations, where p' <= p (several processors may read the same location, but
writes remain exclusive).