Parallel Computing
Dr. Gargi Sanket Prabhu
Parallel Architectures
CS & IS, BITS Pilani K K Birla Goa Campus
Amdahl’s Law
• Let f be the fraction of operations in a computation that must be performed
sequentially, where 0 ≤ f ≤ 1
Maximum speedup:
S ≤ 1 / (f + (1 − f)/p)
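As an illustration (a minimal sketch, not part of the original slides), the bound can be evaluated in C to see how the speedup saturates as p grows:

#include <stdio.h>

/* Upper bound on speedup from Amdahl's Law: S <= 1 / (f + (1 - f) / p) */
double amdahl_bound(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    /* f = 0.2 (20% inherently sequential), p = 1, 2, 4, ... processors */
    for (int p = 1; p <= 64; p *= 2)
        printf("p = %2d  S <= %.2f\n", p, amdahl_bound(0.2, p));
    return 0;
}

Even with 64 processors the bound stays below 1/f = 5, which previews the limitations discussed later.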
Example
• If 80% of a program can be parallelized, then the theoretical maximum
speedup when the number of processors is 5 is ___________
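Worked check with the formula above: 80% parallelizable means f = 0.2, and with p = 5 the bound is S ≤ 1 / (0.2 + 0.8/5) = 1 / 0.36 ≈ 2.78.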
Amdahl’s Effect
• As the problem size increases, the fraction f of inherently sequential
operations decreases, making the problem more amenable to
parallelization.
Amdahl’s Law example
• N=1,000,000
• Sequential algorithm marks 2,122,048 cells
• Outputs 78,498 prime numbers
• Solution:
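The figures match a prime sieve over 2 to 1,000,000 (78,498 is the count of primes below one million), so the sketch below assumes a sequential Sieve of Eratosthenes; the exact number of marking operations depends on whether multiples are marked starting from 2p or from p*p, so it may differ slightly from the 2,122,048 quoted above:

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    char *composite = calloc(N + 1, 1);    /* 0 = not yet marked (candidate prime) */
    long marks = 0, primes = 0;

    for (long p = 2; p * p <= N; p++) {
        if (composite[p]) continue;
        for (long m = p * p; m <= N; m += p) {
            composite[m] = 1;              /* mark one cell */
            marks++;                       /* count every marking operation */
        }
    }
    for (long i = 2; i <= N; i++)
        if (!composite[i]) primes++;       /* unmarked cells are the primes */

    printf("markings: %ld, primes: %ld\n", marks, primes);
    free(composite);
    return 0;
}

In an Amdahl-style analysis, the marking loop is treated as the parallelizable work and the remaining steps as the sequential fraction f; the slide leaves the exact split as an exercise.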
Limitations of Amdahl’s Law
• Fixed Problem Size Assumption
• Ignores Communication & Synchronization Overheads
• Sequential Fraction Is Not Always Constant
• Homogeneous Processor Assumption
• Overly Pessimistic for Large Problems
Parallel Architecture Paradigms
Michael Flynn in 1972 gave a taxonomy for categorizing different styles of
computer system architecture:
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data stream (SIMD)
• Multiple instruction stream, single data stream (MISD)
• Multiple instruction stream, multiple data stream (MIMD)
SISD
Single processor executes a single instruction stream on a single data stream
e.g., the classic von Neumann architecture
SIMD : Vector Computing
A single instruction is broadcast to multiple processing units,
each of which operates on a separate data stream
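A sketch of the SIMD idea in C: the same operation is applied element-wise across an array, which a vectorizing compiler (or the OpenMP simd directive used below, assumed to be available) can map onto vector instructions:

#include <stdio.h>

#define N 8

int main(void)
{
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[N];

    /* One logical instruction (add) broadcast across many data elements */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}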
MISD
Multiple processors executing different instructions on the same
data stream
MIMD: Most Advanced Computers
Multiple processors or cores, each capable of executing different
instructions on different data streams independently
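A minimal MIMD sketch using POSIX threads (the data and task names are illustrative): two threads execute different instruction streams on different data at the same time:

#include <stdio.h>
#include <pthread.h>

/* One instruction stream: sum an array */
void *sum_task(void *arg)
{
    int *d = arg, s = 0;
    for (int i = 0; i < 4; i++) s += d[i];
    printf("sum = %d\n", s);
    return NULL;
}

/* A different instruction stream: find the maximum of a different array */
void *max_task(void *arg)
{
    int *d = arg, m = d[0];
    for (int i = 1; i < 4; i++) if (d[i] > m) m = d[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void)
{
    int a[4] = {1, 2, 3, 4}, b[4] = {7, 3, 9, 5};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, a);
    pthread_create(&t2, NULL, max_task, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}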
Memory Access Classification
• An alternative way to classify parallel systems is by how their cores access memory
• This classification focuses on memory sharing and communication between cores
Shared Memory Systems
• Cores share access to a common memory space
• Cores coordinate their tasks by modifying shared memory locations
Shared Memory Computing
Advantages:
• Easier to program due to shared memory access
• Well-suited for symmetric multiprocessing (SMP) systems
• Efficient for tasks that require frequent communication and coordination
between threads or processes.
Challenges:
• Scalability can be limited due to memory contention
• Careful synchronization mechanisms are required to prevent race conditions.
Example
Fragment 0:
  while (x == 0);
  x = 1;
Fragment 1:
  x = 2;
Example
void withdraw(int amount)
{
    if (balance - amount > 0)
        balance -= amount;
}
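The fragment above has a race condition: two threads can both pass the balance check before either subtracts. One common fix, sketched here with a pthread mutex (the lock name is illustrative, not from the slides), makes the check and the update a single critical section:

#include <pthread.h>

int balance = 100;
pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;

void withdraw(int amount)
{
    pthread_mutex_lock(&balance_lock);     /* only one thread at a time from here... */
    if (balance - amount > 0)
        balance -= amount;
    pthread_mutex_unlock(&balance_lock);   /* ...to here */
}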
Example
Atomic:
  Location x = 1;
  Location y = 5;
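One reading of the atomic example: stores to the shared locations x and y must be indivisible, so no other core ever observes a partial update. A minimal sketch with C11 atomics (stdatomic.h, assuming a C11 compiler):

#include <stdio.h>
#include <stdatomic.h>

_Atomic int x = 0;
_Atomic int y = 0;

int main(void)
{
    atomic_store(&x, 1);   /* each store completes as a single indivisible operation */
    atomic_store(&y, 5);
    printf("x = %d, y = %d\n", atomic_load(&x), atomic_load(&y));
    return 0;
}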
Distributed Memory Systems
• Each core has its own private memory
• Cores coordinate their tasks by communicating across a network
Example
Fragment 0:
  x = 5;
  receive(1, y);
  send(1, x + y);
Fragment 1:
  send(0, 10);
  receive(0, x);
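A sketch of the same exchange in MPI (assuming an MPI installation is available; rank 0 plays Fragment 0 and rank 1 plays Fragment 1):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, x, y;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 5;
        MPI_Recv(&y, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* y = 10 */
        x = x + y;
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);                    /* send 15 */
    } else if (rank == 1) {
        int ten = 10;
        MPI_Send(&ten, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* x = 15 */
        printf("rank 1 received x = %d\n", x);
    }

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 2 ./a.out; all coordination goes through explicit messages rather than shared variables.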
Distributed Memory Computing
Advantages:
• Scalable for large systems as memory is distributed
• Suitable for parallel applications with minimal communication between
processes
• Harness the power of a large number of nodes.
Challenges:
• Programming can be more complex due to explicit message passing, data
distribution, and synchronization requirements.
Hybrid Systems
Combine shared-memory nodes with distributed-memory architectures
Common in clusters, where individual nodes are multicore shared-memory
systems connected via a network
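A common hybrid pattern is MPI between distributed-memory nodes plus OpenMP threads inside each shared-memory node; a minimal sketch, assuming both MPI and OpenMP are available:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI handles communication across nodes */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP threads share memory within one node */
    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}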
Memory Access Models
• Uniform Memory Access (UMA)
• Non Uniform Memory Access (NUMA)
• Cache-Only Memory Access (COMA)
Uniform Memory Access
• All processors have equal and uniform access to a single shared
memory pool.
• Memory access times are roughly the same for all processors
• The memory access pattern is similar to that of a single-processor system,
which makes programming simpler.
Uniform Memory Access
• As the number of processors increases, contention for the shared
memory bus can lead to performance bottlenecks.
• UMA is more suited for smaller multiprocessor systems.
Non Uniform Memory Access
• The system is composed of multiple nodes, each containing processors
and a local memory.
• Processors have faster access to their local memory than to remote
memory in other nodes.
• Memory access times can vary based on whether the data is stored locally
or remotely.
Non Uniform Memory Access
• Scales better as the number of processors increases, since each node can
have its own local memory and processors.
• Used in large-scale multiprocessor systems, like servers and high-
performance computing clusters.
Cache-Only Memory Access
• Only cache memories are present; there is no main memory as in UMA or
NUMA
• The main goal is to provide a unified view of the memory while efficiently
distributing data across caches.
Cache-Only Memory Access
• COMA architectures have not gained as much traction as UMA and
NUMA due to their complexity and limited benefits compared to other
memory models.
Summary
Feature            | COMA                               | UMA                                     | NUMA
Memory Access Time | Dynamic and variable               | Uniform                                 | Non-uniform, depending on location
Complexity         | High                               | Low                                     | Medium
Scalability        | Limited by complexity              | Limited by memory bus contention        | High
Performance        | Potentially high for specific apps | Predictable and uniform                 | High with proper optimizations
Support            | Limited                            | Extensive                               | Extensive
Typical Use Cases  | Specialized, research systems      | Small to medium multiprocessor systems  | Large-scale multiprocessor systems
Thank You!