# Parallel Computing: Unit 1

Parallel computing involves using multiple processors to enhance performance by dividing tasks into smaller sub-tasks that can be processed concurrently. It encompasses various forms of parallelism, including data, task, and pipeline parallelism, and is applied in fields like scientific computing and big data processing. The document also discusses the differences between CPUs and GPUs in parallel computing, the importance of writing parallel programs, and various methods for achieving parallelism.


# Parallel Computing and Parallel Processing

Parallel computing is, simply put, the arrangement of multiple processors in a system to enhance its performance.

Parallel processing describes how the system carries out that work, i.e., scheduling, mapping, and so on, across the multiple processors. It is also concerned with synchronization.

1
# Parallel Computing

Definition: Parallel computing is a broader concept that encompasses the simultaneous use of multiple compute resources to solve a computational problem. It involves dividing a task into smaller sub-tasks that can be processed concurrently.

Scope: It includes various forms of parallelism, such
as data parallelism (processing large datasets
simultaneously), task parallelism (executing different
tasks concurrently), and pipeline parallelism (stages of
a task are processed in parallel).

Applications: Used in scientific computing,
simulations, big data processing, and any area
requiring high computational power.

2
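The three forms of parallelism listed above can be sketched in a few lines of C with OpenMP (the shared-memory API introduced later in this unit). This is a minimal illustration only; the array size, the process() function, and the two section bodies are assumptions made for the example, not anything from the slides.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000
static double a[N], b[N];

/* Hypothetical per-element work, used only for illustration. */
static double process(double x) { return 2.0 * x + 1.0; }

int main(void) {
    for (int i = 0; i < N; i++) a[i] = i;   /* fill the input array */

    /* Data parallelism: the same operation is applied to different
       chunks of the array by different threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = process(a[i]);

    /* Task parallelism: two different pieces of work run concurrently. */
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            double sum = 0.0;
            for (int i = 0; i < N; i++) sum += b[i];
            printf("sum = %f\n", sum);
        }
        #pragma omp section
        printf("writing report...\n");
    }
    return 0;
}
```

Compiled with an OpenMP-capable compiler (e.g., gcc -fopenmp), the first loop splits the data among threads, while the sections split unrelated tasks among threads.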
# Parallel Processing

Definition: Parallel processing specifically refers to the execution of multiple processes or threads simultaneously. It focuses more on the execution aspect within a parallel computing environment.

Scope: It deals with the methods and architectures used to perform multiple operations or tasks at the same time, often at a finer granularity than parallel computing.

Applications: Commonly found in multi-core processors, distributed systems, and real-time processing tasks.

3
 What is the difference between a CPU and a GPU for parallel computing?
 The GPU is very good at data-parallel computing; the CPU is very good at task-parallel processing.
 A GPU has thousands of cores; a CPU has fewer than 100 cores.
 A GPU has around 40 hyperthreads per core; a CPU has around 2 (sometimes a few more) hyperthreads per core.
 The GPU has difficulty executing recursive code; the CPU has fewer problems with it.
Copyright © 2010, Elsevier Inc. All rights Reserved 4
• GPU lowest-level caches are shared between 8–24 cores for Intel, 64 cores for AMD, and up to 192 cores for Nvidia. A CPU's lowest-level cache is used by only a single core (2 threads). Each CPU thread can use SIMD, which is data-parallel over about 8–32 work-items (each comparable to a single GPU core/thread).
• The GPU's highest-level cache is only around 5 MB; a CPU's highest-level cache can reach 64 MB or more.
• The GPU is accessed through PCIe (or similar bridges) and is driven through an intermediate API; both add latency and programming effort, and they make very lightly loaded work slow. The CPU is easier to begin with and is well suited to small, random workloads.
Copyright © 2010, Elsevier Inc. All rights Reserved 5
• The GPU delivers considerably more compute per unit of electrical energy than the CPU.
• The GPU is built for high throughput; the CPU is built for low latency.
• Integrated GPUs still use an internal PCIe-like connection to get commands from the CPU, but they read data from RAM directly, so some added latency remains.
• With APIs like CUDA/OpenCL, the GPU has addressable local memories and registers that are much faster than the lowest-level caches. This makes inter-core communication easier to code. Even the number of available (and array-addressable) private registers per core is much higher than on a CPU (256 vs. 32).
• A single GPU core is very lightweight compared to a single CPU core: one CPU core has 8–16 such pipelines, while one GPU “SM/CU” has 64/128/192 pipelines and is what should really be called a “core”.
Copyright © 2010, Elsevier Inc. All rights Reserved 6
Threads
 Threads are contained within processes.
 They allow programmers to divide their programs into (more or less) independent tasks.
 The hope is that when one thread blocks because it is waiting on a resource, another will have work to do and can run.

Copyright © 2010, Elsevier Inc. All rights Reserved 7


Copyright © 2010, Elsevier Inc. All rights Reserved 8
Copyright © 2010, Elsevier Inc. All rights Reserved 9
A process and two threads (Figure 2.2): the “master” thread starts the others. Starting a thread is called forking; terminating a thread is called joining.

Copyright © 2010, Elsevier Inc. All rights Reserved 10


# Hyper-threading

Hyper-threading is a hardware technology that allows a single processor to handle multiple tasks simultaneously, which can improve performance. It does this by dividing a CPU's physical cores into virtual cores, also known as threads, which the operating system treats as if they were physical cores.

11
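A quick way to see this from a program is to ask the operating system how many logical processors (hardware threads) it exposes. This is a minimal C sketch for Linux/Unix systems, offered only as an illustration; the count it prints depends on the machine and on whether hyper-threading is enabled.

```c
#include <stdio.h>
#include <unistd.h>   /* sysconf */

int main(void) {
    /* Logical processors currently online; with hyper-threading enabled
       this is typically twice the number of physical cores. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors: %ld\n", logical);
    return 0;
}
```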
# Parallelism
Parallelism is the ability to execute multiple tasks
or operations simultaneously, rather than
sequentially. Parallelism can be achieved at
different levels, such as hardware, software, or
network. For example, you can use multiple cores
or processors, threads or processes, or
asynchronous or non-blocking operations to run
parallel tasks.

12
•Threads
A programming concept that involves creating, running, and terminating threads within a process. Threads share memory and file handles.
•Pthreads
A library that provides tools to manage threads, including functions for creating, terminating, and joining threads. Pthreads is an Application Programming Interface (API) that can be used for shared-memory programming.
Copyright © 2010, Elsevier Inc. All rights Reserved 13
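To make the fork/join idea concrete, here is a minimal Pthreads sketch (an illustration in the spirit of the API just described, not code from the slides): the main thread forks one worker with pthread_create and then joins it with pthread_join.

```c
#include <stdio.h>
#include <pthread.h>

/* Work carried out by the forked thread. */
static void *hello(void *arg) {
    long rank = (long) arg;
    printf("Hello from thread %ld\n", rank);
    return NULL;
}

int main(void) {
    pthread_t tid;

    /* Forking: start a new thread running hello(). */
    if (pthread_create(&tid, NULL, hello, (void *) 1L) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }

    /* Joining: wait for the thread to terminate. */
    pthread_join(tid, NULL);
    printf("Master thread done\n");
    return 0;
}
```

Compile with something like gcc -o hello hello.c -lpthread.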
Why use parallelism for APIs?

• One of the main reasons to use parallelism for APIs (Application Programming Interfaces) is to improve the performance and scalability of your applications.
• By sending and receiving multiple requests and responses at the same time, you can reduce the waiting time and increase the throughput of your applications.
• For example, fetching data from several APIs at once.
14
Why use parallelism for APIs?

• Another reason to use parallelism for APIs is to handle complex or dynamic scenarios that require coordination or synchronization among multiple APIs.
• For example, performing a transaction. This can make your applications more reliable and consistent.

15
Changing times
 From 1986 to 2002, microprocessors were speeding along like a rocket, increasing in performance by an average of 50% per year.
 Since then, the increase has dropped to about 20% per year.

16
An intelligent solution
 Instead of designing and building faster
microprocessors, put multiple processors
on a single integrated circuit.

17
Now it’s up to the programmers
 Adding more processors doesn’t help
much if programmers aren’t aware of
them…
 … or don’t know how to use them.

 Serial programs don’t benefit from this approach (in most cases).

18
Why we need ever-increasing performance
 Computational power is increasing, but so are our computation problems and needs.
 As our computational power increases, the number of problems that we can seriously consider solving also increases.
 Examples include the following.

19
Climate modeling

In order to better understand climate change, we need far more accurate computer models, models that include interactions between the atmosphere, the oceans, solid land, and the ice caps at the poles.

20
Protein folding

It’s believed that misfolded proteins may be involved in diseases such as Huntington’s, Parkinson’s, and Alzheimer’s, but our ability to study configurations of complex molecules such as proteins is severely limited by our current computational power.

21
Drug discovery

There are many drugs that are effective in treating a relatively small fraction of those suffering from some disease. It’s possible that we can devise alternative treatments by careful analysis of the genomes of the individuals for whom the known treatment is ineffective. This, however, will involve extensive computational analysis of genomes.
22
Energy research

Increased computational power will make it possible to program much more detailed models of technologies such as wind turbines, solar cells, and batteries. These programs may provide the information needed to construct far more efficient clean energy sources.

23
Data analysis

We generate tremendous amounts of data. By some estimates, the quantity of data stored worldwide doubles every two years, but the vast majority of it is largely useless unless it’s analyzed.

24
Why we’re building parallel systems
 Up to now, performance increases have been attributable to increasing density of transistors.
 But there are inherent problems.

25
A little physics lesson
 Smaller transistors = faster processors.
 Faster processors = increased power
consumption.
 Increased power consumption = increased
heat.
 Increased heat = unreliable processors.

26
Solution
 Move away from single-core systems to
multicore processors.
 “core” = central processing unit (CPU)
 Introducing parallelism!!! Rather than building
ever-faster, more complex, monolithic processors,
the industry has decided to put multiple, relatively
simple, complete processors on a single chip.
Such integrated circuits are called multicore
processors, and core has become synonymous
with central processing unit, or CPU. In this setting
a conventional processor with one CPU is often
called a single-core system.

27
Why we need to write parallel programs
 Running multiple instances of a serial program often isn’t very useful.
 Think of running multiple instances of your favorite game.
 What you really want is for it to run faster.

28
Approaches to the serial problem
 Rewrite serial programs so that they’re parallel.
 Write translation programs that automatically convert serial programs into parallel programs.
 This is very difficult to do.
 Success has been limited.

29
More problems
 Some coding constructs can be
recognized by an automatic program
generator, and converted to a parallel
construct.
 However, it’s likely that the result will be a
very inefficient program.
 Sometimes the best parallel solution is to
step back and devise an entirely new
algorithm.

30
Example
 Compute n values and add them together.
 Serial solution:
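The serial code itself appeared as an image on the original slide. A minimal sketch of the serial solution, assuming the same Compute_next_value() routine that the following slides use, would be:

```c
sum = 0;
for (i = 0; i < n; i++) {
    x = Compute_next_value(...);  /* compute the i-th value */
    sum += x;                     /* add it to the running total */
}
```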

31
Example (cont.)
 We have p cores, p much smaller than n.
 Each core performs a partial sum of approximately n/p values.

my_sum = 0;
my_first_i = ...;   // Each core's starting index
my_last_i = ...;    // Each core's ending index

// Loop through the assigned range of values
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
    my_x = Compute_next_value(...);  // Compute the value for this index
    my_sum += my_x;                  // Accumulate the sum
}

Each core uses its own private variables and executes this block of code independently of the other cores.

32
Example (cont.)
 After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.

 For example, with 8 cores and n = 24, the calls to Compute_next_value return:
1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9

33
Example (cont.)
 Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated “master” core, which adds them to produce the final result.

34
Example (cont.)

35
Example (cont.)
if (I’m the master core) {
    // Initialize sum with the master's own value
    sum = my_sum;

    // Loop through all other cores and receive their values
    for each core other than myself {
        received_value = receive_value_from_core(core_id);
        sum += received_value;
    }
    // The final sum is now held by the master
} else {
    // Worker cores send their sum to the master
    send_value_to_master(my_sum);
}
36
Example (cont.)

Core 0 1 2 3 4 5 6 7
my_sum 8 19 7 15 7 13 12 14

Global sum
8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95

Core 0 1 2 3 4 5 6 7
my_sum 95 19 7 15 7 13 12 14

37
But wait!
There’s a much better way
to compute the global sum.

38
Better parallel algorithm
 Don’t make the master core do all the
work.
 Share it among the other cores.
 Pair the cores so that core 0 adds its result
with core 1’s result.
 Core 2 adds its result with core 3’s result,
etc.
 Work with odd and even numbered pairs of
cores.
39
Better parallel algorithm (cont.)
 Repeat the process now with only the evenly ranked cores.
 Core 0 adds the result from core 2.
 Core 4 adds the result from core 6, etc.
 Now cores divisible by 4 repeat the process, and so forth, until core 0 has the final result.

40
Multiple cores forming a global
sum

41
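In the same C-like pseudocode style as the earlier master/worker fragment, the tree-structured sum can be sketched as follows (assuming, for simplicity, that the number of cores p is a power of two; send_value_to_core is a hypothetical counterpart of the earlier helpers):

```c
// Tree-structured global sum on p cores with ranks 0 .. p-1.
divisor = 2;          // decides which cores receive in this phase
core_difference = 1;  // distance to this phase's partner
sum = my_sum;

while (divisor <= p) {
    if (my_rank % divisor == 0) {
        // Receiver: add the partner's partial sum to my own.
        partner = my_rank + core_difference;
        sum += receive_value_from_core(partner);
    } else {
        // Sender: pass my partial sum "up the tree" and stop.
        partner = my_rank - core_difference;
        send_value_to_core(partner, sum);
        break;
    }
    divisor *= 2;
    core_difference *= 2;
}
// After about log2(p) phases, core 0 holds the global sum.
```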
Analysis
 In the first example, the master core performs 7 receives and 7 additions.
 In the second example, the master core performs 3 receives and 3 additions.
 The improvement is more than a factor of 2!

42
Analysis (cont.)
 The difference is more dramatic with a larger number of cores.
 If we have 1000 cores:
 The first example would require the master to perform 999 receives and 999 additions.
 The second example would only require 10 receives and 10 additions, since the number of phases in the tree-structured sum grows as log2 of the number of cores, and log2(1000) ≈ 10.
 That’s an improvement of almost a factor of 100!
43
How do we write parallel programs?
 Task parallelism
 Partition the various tasks carried out in solving the problem among the cores.
 Data parallelism
 Partition the data used in solving the problem among the cores.
 Each core carries out similar operations on its part of the data.

44
Professor P

15 questions, 300 exams
45
Professor P’s grading assistants

TA#1, TA#2, TA#3
46
Division of work – data parallelism

Each TA (TA#1, TA#2, TA#3) grades 100 of the 300 exams.
47
Division of work – task parallelism

Each TA (TA#1, TA#2, TA#3) grades one block of questions on every exam: questions 1–5, questions 6–10, or questions 11–15.
48
Division of work – data parallelism

49
Division of work – task parallelism

Tasks: 1) receiving, 2) addition

50
Coordination
 Cores usually need to coordinate their work.
 Communication – one or more cores send
their current partial sums to another core.
 Load balancing – share the work evenly
among the cores so that one is not heavily
loaded.
 Synchronization – because each core works
at its own pace, make sure cores do not get
too far ahead of the rest.

51
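As a tiny illustration of the synchronization problem on a shared-memory system (a sketch only; the names echo the global-sum example and are otherwise assumptions): if every core adds its private my_sum into one shared variable, the updates must be protected so that two threads never modify it at the same time. With Pthreads this can be done with a mutex.

```c
#include <pthread.h>

double global_sum = 0.0;                                /* shared total */
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;   /* protects it  */

/* Called by each thread with its private partial sum. */
void add_my_sum(double my_sum) {
    pthread_mutex_lock(&sum_lock);    /* only one thread at a time ...   */
    global_sum += my_sum;             /* ... may update the shared total */
    pthread_mutex_unlock(&sum_lock);
}
```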
What we’ll be doing
 Learning to write programs that are
explicitly parallel.
 Using the C language.
 Using three different extensions to C.
 Message-Passing Interface (MPI)
 Posix Threads (Pthreads)
 OpenMP

52
Type of parallel systems
 Shared-memory
 The cores can share access to the computer’s
memory.
 Coordinate the cores by having them examine
and update shared memory locations.
 Distributed-memory
 Each core has its own, private memory.
 The cores must communicate explicitly by
sending messages across a network.

53
Type of parallel systems

Shared-memory Distributed-memory

54
Terminology
 Concurrent computing – a program in which multiple tasks can be in progress at any instant.
 Parallel computing – a program in which multiple tasks cooperate closely to solve a problem.
 Distributed computing – a program that may need to cooperate with other programs to solve a problem.

55
Different APIs (Application Programming Interfaces) are used for programming different types of systems
 MPI is an API for programming distributed-memory MIMD systems.
 Pthreads is an API for programming shared-memory MIMD systems.
 OpenMP is an API for programming both shared-memory MIMD and shared-memory SIMD systems.
 CUDA is an API for programming Nvidia GPUs, which have aspects of all four of our classifications: shared memory and distributed memory, SIMD and MIMD.
56
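As a small taste of the distributed-memory style (a sketch, not code from the slides): in MPI the tree-structured global sum from the earlier example is provided as a single collective operation, MPI_Reduce. Here each process's partial sum is faked with a stand-in value just so the program runs on its own.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int my_rank;
    double my_sum, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Stand-in for the partial sum each process would have computed. */
    my_sum = my_rank + 1.0;

    /* Combine every process's my_sum with MPI_SUM; the result
       (a tree-structured global sum) ends up on process 0. */
    MPI_Reduce(&my_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and run with mpiexec -n p, process 0 prints the sum 1 + 2 + ... + p.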
Concurrent, Parallel, Distributed
 In concurrent computing, a program is one in which multiple tasks can be in progress at any instant.
 In parallel computing, a program is one in which multiple tasks cooperate closely to solve a problem.
 In distributed computing, a program may need to cooperate with other programs to solve a problem.

So parallel and distributed programs are concurrent, but a program such as a multitasking operating system is also concurrent.

57
In parallel programming, APIs (Application Programming Interfaces) can be called simultaneously to improve performance and speed up processes. This is done by executing multiple API calls at the same time instead of sequentially.

58
Some benefits of using parallel APIs:
•Faster response times: parallel APIs can lead to faster response times, which can improve the user experience.
•Optimized resource utilization: parallel APIs can optimize resource utilization by enabling simultaneous data retrieval.
•Handling complex scenarios: parallel APIs can handle complex or dynamic scenarios that require coordination or synchronization among multiple APIs.

59
