# Parallel Computing vs. Parallel Processing
Parallel computing is the arrangement of multiple processors in a system to enhance that system's performance.
Parallel processing describes how the system carries out its work, i.e. scheduling, mapping, etc., using multiple processors. It is also concerned with synchronization.
# Parallel Computing
Definition: Parallel computing is a broader concept that encompasses the simultaneous use of multiple compute resources to solve a computational problem. It involves dividing a task into smaller sub-tasks that can be processed concurrently.
Scope: It includes various forms of parallelism, such as data parallelism (processing large datasets simultaneously), task parallelism (executing different tasks concurrently), and pipeline parallelism (stages of a task are processed in parallel).
Applications: Used in scientific computing, simulations, big data processing, and any area requiring high computational power.
# Parallel Processing
Definition: Parallel processing specifically refers to the execution of multiple processes or threads simultaneously. It focuses more on the execution aspect within a parallel computing environment.
Scope: It deals with the methods and architectures used to perform multiple operations or tasks at the same time, often at a finer granularity than parallel computing.
Applications: Commonly found in multi-core processors, distributed systems, and real-time processing tasks.
What is the difference between a CPU and a GPU for parallel computing?
A GPU is very good at data-parallel computing; a CPU is very good at general-purpose parallel processing.
A GPU has thousands of cores; a CPU has fewer than 100 cores.
A GPU has around 40 hyperthreads per core; a CPU has around 2 (sometimes a few more) hyperthreads per core.
A GPU has difficulty executing recursive code; a CPU has fewer problems with it.
• A GPU's lowest-level caches are shared between 8–24 cores for Intel, 64 cores for AMD, and up to 192 cores for NVIDIA. A CPU's lowest-level cache is used by only a single core (2 threads). Each CPU thread can use SIMD, which is data-parallel over about 8–32 work-items (each comparable to a single GPU core/thread).
• A GPU's highest-level cache is only around 5 MB; a CPU's highest-level cache can reach 64 MB or more.
• A GPU is accessed through PCIe and similar bridges and also works through an intermediate API, both of which add latency and programming effort and make very lightly loaded workloads slow. A CPU is easier to start with and well suited to small, random workloads.
• A GPU delivers considerably more compute per unit of energy than a CPU.
• A GPU is built for high throughput; a CPU is built for low latency.
• Integrated GPUs still use an internal PCIe connection to get commands from the CPU, but they read data from RAM directly, so there is still some added latency.
• With APIs like CUDA/OpenCL, a GPU has addressable local memories and registers that are much faster than the lowest-level caches. This makes inter-core communication easier to code. Even the number of available (and array-addressable) private registers per core is much higher than on a CPU (256 vs. 32).
• A single GPU core is very lightweight compared to a single CPU core. One CPU core has 8–16 such pipelines, while one GPU "SM/CU" has 64/128/192 pipelines and is what should really be called a "core".
Threads
Threads are contained within processes.
They allow programmers to divide their
programs into (more or less) independent
tasks.
The hope is that when one thread blocks
because it is waiting on a resource,
another will have work to do and can run.
Figure 2.2: A process and two threads, started by the "master" thread. Starting a thread is called forking; terminating a thread is called joining.
# Hyper-threading
Hyper-threading is a hardware technology that allows a single processor core to handle multiple instruction streams simultaneously, which can improve performance. It does this by presenting each physical core as multiple logical cores, also known as hardware threads, which the operating system treats as if they were physical cores.
# Parallelism
Parallelism is the ability to execute multiple tasks or operations simultaneously, rather than sequentially. Parallelism can be achieved at different levels, such as hardware, software, or network. For example, you can use multiple cores or processors, threads or processes, or asynchronous or non-blocking operations to run parallel tasks.
• Threads
A programming concept that involves creating, running, and terminating threads within a process. Threads share memory and file handles.
• Pthreads
A library that provides tools to manage threads, including functions for creating, terminating, and joining threads. Pthreads is an Application Programming Interface (API) that can be used for shared-memory programming. A minimal sketch of this create/join pattern is shown below.
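A minimal sketch of the Pthreads create/join pattern, assuming 4 threads (an arbitrary choice for this example) and the standard pthread.h API:

```c
#include <stdio.h>
#include <pthread.h>

#define THREAD_COUNT 4   /* arbitrary number of threads for this sketch */

/* Each thread runs this function; the argument carries its rank. */
void *Hello(void *rank) {
    long my_rank = (long) rank;
    printf("Hello from thread %ld of %d\n", my_rank, THREAD_COUNT);
    return NULL;
}

int main(void) {
    pthread_t handles[THREAD_COUNT];

    /* Fork: start the threads. */
    for (long t = 0; t < THREAD_COUNT; t++)
        pthread_create(&handles[t], NULL, Hello, (void *) t);

    /* Join: wait for every thread to terminate. */
    for (long t = 0; t < THREAD_COUNT; t++)
        pthread_join(handles[t], NULL);

    return 0;
}
```

Compile with something like `gcc -o pth_hello pth_hello.c -lpthread`.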
Why use parallelism for APIs?
• One of the main reasons to use parallelism for APIs (Application Programming Interfaces) is to improve the performance and scalability of your applications.
• By sending and receiving multiple requests and responses at the same time, you can reduce the waiting time and increase the throughput of your applications.
• For example, fetching data from several APIs at once rather than one after another, as in the sketch below.
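A rough sketch of issuing several API requests in parallel with Pthreads. The endpoint URLs and the fetch_from_api helper are hypothetical placeholders for whatever HTTP client call your application actually makes:

```c
#include <stdio.h>
#include <pthread.h>

/* Hypothetical stand-in for a real HTTP/client request. */
static void fetch_from_api(const char *url) {
    printf("fetching %s\n", url);
}

static void *fetch_thread(void *arg) {
    fetch_from_api((const char *) arg);   /* each thread issues one request */
    return NULL;
}

int main(void) {
    /* Hypothetical endpoints, for illustration only. */
    const char *urls[] = { "https://api.example.com/users",
                           "https://api.example.com/orders",
                           "https://api.example.com/prices" };
    enum { N = sizeof urls / sizeof urls[0] };
    pthread_t tid[N];

    /* Issue all requests at the same time instead of one after another. */
    for (int i = 0; i < N; i++)
        pthread_create(&tid[i], NULL, fetch_thread, (void *) urls[i]);

    /* Wait until every request has completed. */
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);

    return 0;
}
```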
Why use parallelism for APIs?
• Another reason to use parallelism for APIs is to handle complex or dynamic scenarios that require coordination or synchronization among multiple APIs.
• For example, performing a transaction that spans several APIs.
This can make your applications more reliable and consistent.
Changing times
From 1986 – 2002, microprocessors were
speeding like a rocket, increasing in
performance an average of 50% per year.
Since then, it’s dropped to about 20%
increase per year.
An intelligent solution
Instead of designing and building faster
microprocessors, put multiple processors
on a single integrated circuit.
Now it’s up to the programmers
Adding more processors doesn’t help
much if programmers aren’t aware of
them…
… or don’t know how to use them.
Serial programs don’t benefit from this
approach (in most cases).
Why we need ever-increasing performance
Computational power is increasing, but so are our computation problems and needs.
As our computational power increases, the number of problems that we can seriously consider solving also increases.
Examples include the following:
Climate modeling
In order to better understand climate change, we
need far more accurate computer models, models
that include interactions between the atmosphere, the
oceans, solid land, and the ice caps at the poles.
Protein folding
It’s believed that misfolded proteins may be involved in diseases such as Huntington’s, Parkinson’s, and Alzheimer’s, but our ability to study configurations of complex molecules such as proteins is severely limited by our current computational power.
Drug discovery
There are many drugs that are effective in treating a relatively
small fraction of those suffering from some disease. It’s
possible that we can devise alternative treatments by careful
analysis of the genomes of the individuals for whom the known
treatment is ineffective. This, however, will involve extensive
computational analysis of genomes.
Energy research
Increased computational power will make it possible to
program much more detailed models of technologies such
as wind turbines, solar cells, and batteries. These programs
may provide the information needed to construct far more
efficient clean energy sources.
Data analysis
We generate tremendous amounts of data. By some
estimates, the quantity of data stored worldwide doubles
every two years, but the vast majority of it is largely useless
unless it’s analyzed.
Why we’re building parallel systems
Up to now, performance increases have been attributable to increasing density of transistors.
But there are inherent problems.
A little physics lesson
Smaller transistors = faster processors.
Faster processors = increased power
consumption.
Increased power consumption = increased
heat.
Increased heat = unreliable processors.
Solution
Move away from single-core systems to
multicore processors.
“core” = central processing unit (CPU)
Introducing parallelism!!! Rather than building
ever-faster, more complex, monolithic processors,
the industry has decided to put multiple, relatively
simple, complete processors on a single chip.
Such integrated circuits are called multicore
processors, and core has become synonymous
with central processing unit, or CPU. In this setting
a conventional processor with one CPU is often
called a single-core system.
Why we need to write parallel programs
Running multiple instances of a serial
program often isn’t very useful.
Think of running multiple instances of your
favorite game.
What you really want is for
it to run faster.
Approaches to the serial problem
Rewrite serial programs so that they’re
parallel.
Write translation programs that
automatically convert serial programs into
parallel programs.
This is very difficult to do.
Success has been limited.
More problems
Some coding constructs can be
recognized by an automatic program
generator, and converted to a parallel
construct.
However, it’s likely that the result will be a
very inefficient program.
Sometimes the best parallel solution is to
step back and devise an entirely new
algorithm.
Example
Compute n values and add them together.
Serial solution:
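The serial code itself is not reproduced on this slide, but based on the variable names used in the parallel version that follows, it is essentially a single accumulation loop, written in the same fragment style as the later slides (Compute_next_value stands for whatever work produces each value):

```c
sum = 0;
for (i = 0; i < n; i++) {
    x = Compute_next_value(...);   // produce the next value
    sum += x;                      // add it to the running total
}
```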
Example (cont.)
We have p cores, p much smaller than n.
Each core performs a partial sum of
approximately n/p values.
my_sum = 0;
my_first_i = ...;   // each core's starting index
my_last_i  = ...;   // each core's ending index
// Loop through this core's assigned range of values
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = Compute_next_value(...);   // compute the value for this index
    my_sum += my_x;                   // accumulate the partial sum
}

Each core uses its own private variables and executes this block of code independently of the other cores.
Example (cont.)
After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.
E.g., with 8 cores and n = 24, the calls to Compute_next_value return:
1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9
Example (cont.)
Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated "master" core, which adds them to produce the final result.
Example (cont.)
if (I'm the master core) {
    // Initialize sum with the master's own value
    sum = my_sum;
    // Loop through all other cores and receive their values
    for each core other than myself {
        received_value = receive_value_from_core(core_id);
        sum += received_value;
    }
    // The final sum is computed at the master
} else {
    // Worker cores send their sum to the master
    send_value_to_master(my_sum);
}
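For concreteness, here is a rough sketch of the same master-collects pattern written with MPI (introduced later in the course). The tag value 0 and the use of MPI_COMM_WORLD are just conventional choices, and the partial sums are assumed to be doubles:

```c
#include <mpi.h>

/* my_sum holds this process's partial sum; my_rank and comm_sz come
   from MPI_Comm_rank and MPI_Comm_size. */
double global_sum(double my_sum, int my_rank, int comm_sz) {
    double sum = my_sum, value;

    if (my_rank == 0) {                        /* master core */
        for (int q = 1; q < comm_sz; q++) {    /* receive from every other core */
            MPI_Recv(&value, 1, MPI_DOUBLE, q, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += value;
        }
    } else {                                   /* worker cores send to the master */
        MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    return sum;                                /* meaningful only on rank 0 */
}
```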
Example (cont.)
Before the global sum:

| Core   | 0 | 1  | 2 | 3  | 4 | 5  | 6  | 7  |
|--------|---|----|---|----|---|----|----|----|
| my_sum | 8 | 19 | 7 | 15 | 7 | 13 | 12 | 14 |

Global sum: 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95

After the master adds the received values:

| Core   | 0  | 1  | 2 | 3  | 4 | 5  | 6  | 7  |
|--------|----|----|---|----|---|----|----|----|
| my_sum | 95 | 19 | 7 | 15 | 7 | 13 | 12 | 14 |
But wait!
There’s a much better way
to compute the global sum.
Better parallel algorithm
Don’t make the master core do all the
work.
Share it among the other cores.
Pair the cores so that core 0 adds its result
with core 1’s result.
Core 2 adds its result with core 3’s result,
etc.
Work with odd and even numbered pairs of
cores.
Better parallel algorithm (cont.)
Repeat the process, now with only the evenly ranked cores.
Core 0 adds the result from core 2.
Core 4 adds the result from core 6, etc.
Now the cores divisible by 4 repeat the process, and so forth, until core 0 has the final result. A sketch of this tree-structured sum appears below.
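A rough MPI sketch of the tree-structured sum described above, assuming the number of cores comm_sz is a power of two (the same hedges as the earlier MPI sketch apply):

```c
#include <mpi.h>

/* Tree-structured global sum: at each step, half of the still-active
   cores send their partial sums to a partner and drop out. */
double tree_sum(double my_sum, int my_rank, int comm_sz) {
    double sum = my_sum, value;

    for (int step = 1; step < comm_sz; step *= 2) {
        if (my_rank % (2 * step) == 0) {
            /* Receiver: partner is 'step' ranks above me. */
            MPI_Recv(&value, 1, MPI_DOUBLE, my_rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += value;
        } else if (my_rank % step == 0) {
            /* Sender: partner is 'step' ranks below me; I am done. */
            MPI_Send(&sum, 1, MPI_DOUBLE, my_rank - step, 0, MPI_COMM_WORLD);
            break;
        }
    }
    return sum;   /* the complete sum ends up on rank 0 */
}
```

With 1000 cores, core 0 performs only about ceil(log2(1000)) = 10 receives and additions, which is where the "almost a factor of 100" improvement in the analysis below comes from.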
Multiple cores forming a global sum (figure)
Analysis
In the first example, the master core
performs 7 receives and 7 additions.
In the second example, the master core
performs 3 receives and 3 additions.
The improvement is more than a factor of 2!
Analysis (cont.)
The difference is more dramatic with a
larger number of cores.
If we have 1000 cores:
The first example would require the master to
perform 999 receives and 999 additions.
The second example would only require 10
receives and 10 additions.
That’s an improvement of almost a factor
of 100!
How do we write parallel programs?
Task parallelism
Partition the various tasks carried out in solving the problem among the cores.
Data parallelism
Partition the data used in solving the problem among the cores.
Each core carries out similar operations on its part of the data (see the OpenMP sketch below).
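A minimal data-parallel sketch using OpenMP (one of the three APIs used in this course); the array a and its length n are assumed to be set up by the caller:

```c
#include <omp.h>

/* Data parallelism: OpenMP splits the loop iterations among the threads,
   and the reduction clause combines their private partial sums. */
double parallel_sum(const double a[], int n) {
    double sum = 0.0;

    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += a[i];

    return sum;
}
```

Compile with something like `gcc -fopenmp`.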
Professor P: 15 questions, 300 exams.

Professor P’s grading assistants: TA#1, TA#2, and TA#3.
Division of work – data parallelism
TA#1: 100 exams
TA#2: 100 exams
TA#3: 100 exams
Division of work – task parallelism
TA#1: Questions 1 - 5
TA#2: Questions 6 - 10
TA#3: Questions 11 - 15
Division of work – data parallelism (figure)
Division of work – task parallelism
Tasks:
1) Receiving
2) Addition
Coordination
Cores usually need to coordinate their work.
Communication – one or more cores send their current partial sums to another core.
Load balancing – share the work evenly among the cores so that no core is overloaded.
Synchronization – because each core works at its own pace, make sure that no core gets too far ahead of the rest.
What we’ll be doing
Learning to write programs that are
explicitly parallel.
Using the C language.
Using three different extensions to C.
Message-Passing Interface (MPI)
POSIX Threads (Pthreads)
OpenMP
Type of parallel systems
Shared-memory
The cores can share access to the computer’s
memory.
Coordinate the cores by having them examine
and update shared memory locations.
Distributed-memory
Each core has its own, private memory.
The cores must communicate explicitly by
sending messages across a network.
Type of parallel systems: shared-memory and distributed-memory (figure)
Terminology
Concurrent computing – a program is one
in which multiple tasks can be in progress
at any instant.
Parallel computing – a program is one in which multiple tasks cooperate closely to solve a problem.
Distributed computing – a program may need to cooperate with other programs to solve a problem.
Different APIs (Application Programming Interfaces) are used for programming different types of systems:
MPI is an API for programming distributed-memory MIMD systems.
Pthreads is an API for programming shared-memory MIMD systems.
OpenMP is an API for programming both shared-memory MIMD and shared-memory SIMD systems.
CUDA is an API for programming NVIDIA GPUs, which have aspects of all four of our classifications: shared memory and distributed memory, SIMD and MIMD.
Concurrent, Parallel, Distributed
In concurrent computing, a program is one in which
multiple tasks can be in progress at any instant.
In parallel computing, a program is one in which
multiple tasks cooperate closely to solve a problem.
In distributed computing, a program may need to
cooperate with other programs to solve a problem.
So parallel and distributed programs are concurrent,
but a program such as a multitasking operating
system is also concurrent.
In parallel programming, APIs (Application Programming Interfaces) can be called simultaneously to improve performance and speed up processing. This is done by executing multiple API calls at the same time instead of sequentially.
Some benefits of using parallel APIs:
• Faster response times – parallel APIs can lead to faster response times, which can improve the user experience.
• Optimized resource utilization – parallel APIs can optimize resource utilization by enabling simultaneous data retrieval.
• Handling complex scenarios – parallel APIs can handle complex or dynamic scenarios that require coordination or synchronization among multiple APIs.