
BCS702 PARALLEL COMPUTING

MODULE-1

Introduction to parallel programming, Parallel hardware and parallel software – Classifications of parallel computers, SIMD systems, MIMD systems, Interconnection networks, Cache coherence, Shared-memory vs. distributed-memory, Coordinating the processes/threads, Shared-memory, Distributed-memory.

1.1 Introduction to Parallel Programming


Introduction to Parallel Computing
Conventionally, computer software was written for serial computing: to solve a problem, an algorithm divides the work into a sequence of discrete instructions, and these instructions are executed on the CPU one by one. Only after one instruction finishes does the next one start.
Parallel computing is the use of multiple processing elements simultaneously to solve a problem. The problem is broken down into parts that are solved concurrently, with every processing resource applied to the work operating at the same time.

Advantages of Parallel Computing over Serial Computing are as follows:


1. It saves time and money: many resources working together reduce run time and can cut costs.
2. Larger problems are often impractical to solve with serial computing.
3. It can take advantage of non-local resources when local resources are limited.
4. Serial computing leaves much of the available computing power idle; parallel computing makes better use of the hardware.

Why parallel computing?


● The real world is dynamic: many things happen at the same time in different places, producing enormous amounts of data to manage.
● Real-world data requires ever more dynamic simulation and modeling, and parallel computing is the key to achieving it.
● Parallel computing provides concurrency and saves time and money.
● Large, complex datasets can be managed practically only with a parallel computing approach.
● It ensures effective utilization of resources: the hardware is used effectively, whereas in serial computation only part of the hardware is used and the rest sits idle.
● It is impractical to implement many real-time systems using serial computing.

Applications of Parallel Computing:


●​ Databases and Data mining.
●​ Real-time simulation of systems.
●​ Science and Engineering.
●​ Advanced graphics, augmented reality, and virtual reality.
What is Parallel Programming?
●​ A programming technique where multiple processes or threads execute simultaneously to
solve a problem faster.
●​ Useful for high-performance computing, scientific simulations, real-time systems, etc.

Goals:
●​ Improve performance by executing multiple instructions concurrently.
●​ Optimize resource utilization across multicore or multiprocessor systems.
●​ Reduce execution time for large-scale or complex computations.

Real-World Analogy
Making Dinner with Friends (Parallel Cooking)
●​ Task: Make rice, curry, and salad.
●​ You alone (Sequential): Cook rice → then curry → then salad.
●​ With 3 friends (Parallel):
o​ You cook rice,
o​ Friend 1 makes curry,
o​ Friend 2 chops salad.
You complete the meal 3× faster!

Parallel Programming Models


❖​ Shared-Memory Model (using OpenMP)
❖​ Distributed-Memory Model (using MPI)
❖ CUDA (Compute Unified Device Architecture) for GPU programming

Shared-Memory Model (using OpenMP)


What is the Shared-Memory Model?
In the shared-memory model, multiple threads (or processes) run in parallel and share the same
address space (memory). All threads can read and write to shared variables, which enables
communication and synchronization.
●​ Memory is global and accessible to all threads.
●​ Threads are typically created and managed within the same process.
●​ Commonly used on multi-core systems where all cores can access the main memory.
OpenMP Overview:
OpenMP (Open Multi-Processing) is an API in C, C++, and Fortran for shared-memory parallel
programming. It uses compiler directives (pragmas), runtime library routines, and environment
variables to control parallel execution.
Key OpenMP Concepts

Concept                                        Description
#pragma omp parallel                           Starts a parallel region (forks threads).
#pragma omp for / #pragma omp parallel for     Splits loop iterations among threads.
shared, private                                Control variable scope (shared across threads or private to each).
#pragma omp critical                           Ensures a block of code is executed by only one thread at a time.
#pragma omp barrier                            Synchronizes all threads at a point.
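To see how critical and barrier appear in practice, here is a small illustrative sketch (the shared counter and the single directive for printing are just for demonstration):
#include <stdio.h>
#include <omp.h>
int main(void)
{
    int counter = 0;
    #pragma omp parallel
    {
        // Only one thread at a time may update the shared counter.
        #pragma omp critical
        counter++;

        // Wait until every thread has finished its update.
        #pragma omp barrier

        // Exactly one thread prints the final value.
        #pragma omp single
        printf("Counter = %d\n", counter);
    }
    return 0;
}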

Example: Parallel Sum of Array Elements
Sequential Code (C):
#include <stdio.h>
int main()
{
    int A[5] = {1, 2, 3, 4, 5};
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        sum += A[i];
    }
    printf("Sum = %d\n", sum);
    return 0;
}
Parallel Version Using OpenMP in C:
#include <stdio.h>
#include <omp.h>
int main()
{
    int A[5] = {1, 2, 3, 4, 5};
    int sum = 0;
    // Split the iterations among threads and combine the per-thread
    // partial sums safely with a reduction (avoids a race on sum).
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 5; i++) {
        sum += A[i];   // Each thread adds its share of the array
    }
    printf("Parallel Sum = %d\n", sum);
    return 0;
}
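Both versions compile with an ordinary C compiler; enabling OpenMP usually requires a flag, for example with GCC or Clang (the file name here is illustrative):
gcc -fopenmp parallel_sum.c -o parallel_sum
./parallel_sum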
Advantages of Shared-Memory Model
●​ Easier communication (no need for message passing).
●​ Efficient for small-to-medium number of processors.
●​ Natural fit for multi-threaded applications.

Distributed-Memory Model (using MPI)

What is the Distributed-Memory Model?


In the distributed-memory model, each process has its own private memory. Processes do not share memory and must communicate explicitly by passing messages.
●​ Used in clusters, supercomputers, and multi-node systems.
●​ Each process is typically an independent program running on different machines or cores.
●​ Data sharing is done using Message Passing Interface (MPI).

What is MPI?
MPI (Message Passing Interface) is a standardized and portable message-passing system to allow
processes to communicate with each other in a distributed-memory environment.
MPI provides:
●​ Point-to-point communication (MPI_Send, MPI_Recv)
●​ Collective communication (MPI_Bcast, MPI_Reduce, etc.)
●​ Process synchronization
●​ Scalability across nodes

Example: Parallel Sum of Array Elements using MPI

Scenario:
We have an array distributed across processes. Each process computes its partial sum, and then all
partial sums are combined using MPI_Reduce.

MPI Code in C:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int local_sum = 0, global_sum = 0;
    int data[4] = {1, 2, 3, 4};             // Assume size = 4 for simplicity

    MPI_Init(&argc, &argv);                 // Initialize MPI
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // Get process rank
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // Get total number of processes

    // Each process handles one element for simplicity
    local_sum = data[rank];

    // Reduce local sums to a global sum in the root process (rank 0)
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Total sum = %d\n", global_sum);
    }

    MPI_Finalize();                         // Finalize MPI
    return 0;
}

Key MPI Functions Used


Function Description
MPI_Init Starts the MPI environment
MPI_Comm_rank Gets the rank (ID) of the process
MPI_Comm_size Gets the total number of processes
MPI_Reduce Gathers and reduces values from all processes
MPI_Finalize Ends the MPI environment

Advantages of Distributed-Memory Model


●​ Highly scalable across many machines.
●​ Works across networks or clusters.
●​ Each process works independently — no race conditions in shared variables.

Disadvantages
●​ Complex programming due to explicit communication.
●​ Message latency and bandwidth issues.
●​ Debugging and synchronization can be harder than shared-memory.
Shared vs Distributed Memory
Feature Shared Memory (OpenMP) Distributed Memory (MPI)
Memory Shared across threads Private to each process
Communication Implicit via memory Explicit via message passing
Scalability Limited to one machine Scales across machines
Programming Easier More complex
Performance Good for few cores Better for large clusters

1.2. Parallel Hardware and Parallel Software


Parallel Hardware
Parallel hardware refers to the physical components (processors, memory systems,
interconnects) that make it possible to run multiple tasks simultaneously.
Why Parallel Hardware?
To solve large or complex problems faster, we need to perform multiple computations at the
same time. Parallel hardware provides the necessary support for this.
Types of Parallel Hardware
There are mainly two categories based on how processors work:
SIMD (Single Instruction Multiple Data)
● All processors execute the same instruction but on different pieces of data.
● Example: Graphics Processing Unit (GPU) – processes thousands of pixels simultaneously.
MIMD (Multiple Instruction Multiple Data)
● Each processor executes different instructions on different data.
● Example: Multicore CPUs, clusters, supercomputers.
Memory Architecture
How do processors access memory?

●​ Shared Memory: All processors access the same memory (e.g., multicore desktops).
●​ Distributed Memory: Each processor has its own local memory (e.g., MPI clusters).
Interconnection Networks
Used to connect processors to each other and to memory:
●​ Bus, Ring, Mesh, Hypercube, Crossbar are common topologies.
●​ Efficient networks are important to avoid delays in communication.
Cache Coherence
In shared-memory systems, it ensures that all processors see the same value of a variable, even
when it is cached.
Parallel Software
Parallel software is the programming and logic that tells the hardware how to run multiple
tasks simultaneously.
Why Parallel Software?
Even with powerful hardware, we need specially designed software and programs to use it
efficiently.
Programming Models
These are the ways we write parallel programs:
Shared Memory Model
●​ Threads share a global memory.
●​ Use OpenMP (Open Multi-Processing) for C/C++ or Fortran.
Distributed Memory Model
●​ Each process has its own memory.
●​ Communication is done using messages.
●​ Use MPI (Message Passing Interface).
Hybrid Model
●​ Combine both OpenMP + MPI for better performance in large systems.
Parallel Languages & Tools
●​ OpenMP – Easy parallelism for loops in C/C++.
●​ MPI – Communication between processes in clusters.
●​ CUDA – GPU programming using NVIDIA hardware.
●​ Chapel, Cilk, TBB – Other tools for parallel development.
Main Concepts in Parallel Software
●​ Task Decomposition: Split the problem into smaller tasks.
●​ Synchronization: Coordinate tasks using barriers, locks.
●​ Communication: Share data between tasks.
●​ Load Balancing: Ensure all processors do equal work.
Difference between Parallel Hardware and Software

Feature Parallel Hardware Parallel Software

What is it? Physical components (CPU, GPU, etc.) Programs and models for parallel execution

Role Performs multiple computations Manages and distributes tasks

Examples Multicore processors, Clusters OpenMP, MPI, CUDA

1.3. Classifications of Parallel Computers


In parallel computing, tasks are divided into smaller subtasks and executed simultaneously to
improve performance and efficiency. Based on how data and instructions are processed, parallel
computers are classified using Flynn’s taxonomy, which is one of the most well-known classification
schemes.
Flynn’s Taxonomy
Michael J. Flynn (1966) classified parallel computers into four categories based on the number of
instruction streams and data streams:
Classification                                Instruction Stream   Data Stream   Example
SISD (Single Instruction, Single Data)        1                    1             Traditional uniprocessor (like old PCs)
SIMD (Single Instruction, Multiple Data)      1                    M             GPU (Graphics Processing Unit), vector processors
MISD (Multiple Instruction, Single Data)      M                    1             Rare; theoretical or used in fault-tolerant systems
MIMD (Multiple Instruction, Multiple Data)    M                    M             Modern multi-core CPUs, clusters, supercomputers

1. SISD (Single Instruction, Single Data)


●​ Working: A single processor executes a single instruction on a single data item at a time.
●​ Architecture: Sequential and traditional von Neumann model.
●​ Example:
o​ Early computers like IBM 7090.
2. SIMD (Single Instruction, Multiple Data)
●​ Working: Multiple processing elements perform the same operation on multiple data
simultaneously.
●​ Best for: Vector processing, image processing, matrix operations.
●​ Example:
o​ GPUs in graphics rendering.
o​ Intel AVX (Advanced Vector Extensions).
// Example: Adding two arrays using SIMD in C (conceptual)
#pragma omp simd
for(int i = 0; i < N; i++)
{
C[i] = A[i] + B[i];
}
3. MISD (Multiple Instruction, Single Data)
●​ Working: Multiple processors execute different instructions on the same data stream.
●​ Usage: Mostly theoretical or used in specialized applications like redundant systems for
reliability (e.g., space systems).
●​ Example:
o​ Fault-tolerant computing in spacecraft (rare in practice).
4. MIMD (Multiple Instruction, Multiple Data)
●​ Working: Multiple processors execute different instructions on different data.
●​ Best for: General-purpose parallel computing.
●​ Two types:
o​ Shared Memory MIMD: All processors access the same global memory (e.g.,
OpenMP).
o​ Distributed Memory MIMD: Each processor has its own memory and they
communicate (e.g., MPI).
// OpenMP example (shared memory MIMD)

#pragma omp parallel


{
​ printf("Thread %d is working\n", omp_get_thread_num());
}
Summary Table:

Type Parallelism Memory Type Use Case


SISD None Single Sequential tasks
SIMD Data Shared Image/audio processing
MISD Fault tolerance Shared Redundant execution
MIMD Task & Data Shared or Distributed General parallel apps

1.4. SIMD and MIMD Systems


1. SIMD (Single Instruction, Multiple Data)
Definition:
SIMD systems allow a single control unit to dispatch one instruction to multiple processing
elements (PEs), each working on different data elements simultaneously.
How does it work?
●​ One instruction is broadcast to all processing units.
●​ Each unit executes the instruction on its own data.
●​ Works well with structured, data-parallel tasks.
Characteristics:

Feature Description
Instruction stream Single
Data stream Multiple
Control Centralized
Synchronization Implicit
Memory Shared or distributed (depends on implementation)
Applications:
●​ Image processing
●​ Signal processing
●​ Machine learning (matrix operations)
●​ Vectorized operations in scientific computing

Real-World Examples:
● GPUs (Graphics Processing Units): Process thousands of pixels using the same shader logic.
●​ Intel AVX/SSE instructions: SIMD vector extensions in CPUs.
●​ NEON in ARM processors: Used in mobile devices.
Example:
Adding two arrays of numbers: Sequential Code
for (int i = 0; i < 4; i++) {
C[i] = A[i] + B[i];
}
SIMD Logic:
SIMD_ADD A[0..3], B[0..3] -> C[0..3]
All additions are done in parallel by SIMD hardware.
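To make this concrete, here is a minimal sketch of the same element-wise addition written with Intel AVX intrinsics; it assumes an AVX-capable x86 CPU and a compiler option such as -mavx, and the 8-element arrays are purely illustrative:
#include <immintrin.h>   // Intel AVX intrinsics
#include <stdio.h>

int main(void)
{
    float A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float B[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float C[8];

    __m256 va = _mm256_loadu_ps(A);     // load 8 floats from A
    __m256 vb = _mm256_loadu_ps(B);     // load 8 floats from B
    __m256 vc = _mm256_add_ps(va, vb);  // one instruction adds all 8 pairs
    _mm256_storeu_ps(C, vc);            // store the 8 results into C

    for (int i = 0; i < 8; i++)
        printf("%.0f ", C[i]);
    printf("\n");
    return 0;
}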

2. MIMD (Multiple Instruction, Multiple Data)


Definition:
MIMD systems consist of multiple independent processors, each executing its own instruction
stream on its own data. This supports both data and task parallelism.
How does it work?
●​ Each processor has its own control unit and executes independently.
●​ Can coordinate via shared memory (SMP) or message passing (MPI).
Characteristics:

Feature Description
Instruction stream Multiple
Data stream Multiple
Control Decentralized
Synchronization Explicit (using threads, barriers, locks, or messages)
Memory Shared (e.g., multicore CPU) or Distributed (e.g., cluster)

Applications:
●​ Web servers
●​ Distributed simulations
●​ High-performance scientific computations
●​ Cloud computing

Real-World Examples:
●​ Multicore CPUs (Intel, AMD)
●​ Clusters (HPC supercomputers)
●​ Cloud infrastructure (AWS, Azure instances)
Example:
● A weather simulation might have:
   o Core 1 calculating wind patterns
   o Core 2 calculating rainfall
   o Core 3 modeling temperature
Each runs different programs on different data.
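A minimal sketch of this kind of task parallelism using OpenMP sections; the three functions are hypothetical placeholders standing in for the wind, rainfall, and temperature models:
#include <stdio.h>
#include <omp.h>

// Placeholder tasks standing in for the three models above.
void compute_wind(void)        { printf("Wind on thread %d\n", omp_get_thread_num()); }
void compute_rainfall(void)    { printf("Rainfall on thread %d\n", omp_get_thread_num()); }
void compute_temperature(void) { printf("Temperature on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    // Each section is a different instruction stream on different data (MIMD style).
    #pragma omp parallel sections
    {
        #pragma omp section
        compute_wind();
        #pragma omp section
        compute_rainfall();
        #pragma omp section
        compute_temperature();
    }
    return 0;
}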

SIMD vs MIMD:
Feature SIMD MIMD
Instruction stream Single Multiple
Data stream Multiple Multiple
Processing elements Uniform, tightly coupled Independent, loosely coupled
Control Centralized Decentralized
Type of parallelism Data parallelism Task and data parallelism
Example systems GPU, Vector processor Multicore CPU, Clusters
Programming model Vectorization, OpenCL, CUDA Threads, MPI, OpenMP
Synchronization Minimal/implicit Complex/explicit
Efficiency Very high for uniform tasks High for complex workloads

1.5. Interconnection Networks

Why do we need Interconnection Networks?

In parallel computing, many processors (CPUs) work together to solve a problem faster. But to
coordinate, they must:

●​ Share data (e.g., partial results)


●​ Send control messages (e.g., synchronization signals)
●​ Access shared memory
This requires a communication system connecting the processors and memories:​
Interconnection Network.

What is an Interconnection Network?

Definition:​
An interconnection network is the hardware and routing infrastructure that enables:
●​ Processor-to-Processor communication (message passing)​

●​ Processor-to-Memory communication (shared memory)​

It consists of:

●​ Nodes: Processors or memories.​

●​ Links: Physical connections (wires, buses).​

●​ Switches/Routers: Devices deciding which path data takes.

Classification of Interconnection Networks

There are two main classes:

Static (Direct) Networks

Static/direct:​
Each processor is directly connected to a fixed set of neighbors.

Properties:
●​ Predictable paths.
●​ Lower complexity.
●​ Good for fixed-size systems.
Examples:

1.​ Linear Array:


i.​ Processors in a line.
ii.​ Each connected to two neighbors (except ends).
iii.​ Diameter: O(n) — worst-case distance grows linearly.
2.​ Ring:
i.​ Ends connected to form a loop.
ii.​ Reduces diameter to n/2.
3.​ Mesh/Grid:
i.​ 2D layout.
ii.​ Used in many supercomputers.
iii.​ Diameter: O(sqrt(n)).
4.​ Torus:
i.​ Mesh + wraparound links.
ii.​ Improves latency.
5.​ Hypercube:
i.​ n-dimensional cube.
ii.​ Each node has log2(n) neighbors.
iii.​ Diameter: log2(n).
iv. Very scalable (see the sketch after this list).
6.​ Tree:
i.​ Hierarchical.
ii.​ Good broadcast capability but bottleneck near root.
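The sketch below makes the hypercube entry concrete: in a d-dimensional hypercube with n = 2^d nodes, two nodes are neighbors exactly when their IDs differ in a single bit, so flipping each bit of a node ID enumerates its log2(n) neighbors (the 3-dimensional example size is arbitrary):
#include <stdio.h>

// Print the neighbors of 'node' in a d-dimensional hypercube (n = 2^d nodes).
// Flipping one bit of the node ID gives one neighbor per dimension.
void hypercube_neighbors(int node, int d)
{
    printf("Node %d neighbors:", node);
    for (int bit = 0; bit < d; bit++)
        printf(" %d", node ^ (1 << bit));   // flip bit 'bit'
    printf("\n");
}

int main(void)
{
    int d = 3;   // 3-dimensional hypercube: 8 nodes, diameter 3
    for (int node = 0; node < (1 << d); node++)
        hypercube_neighbors(node, d);
    return 0;
}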

Dynamic (Indirect) Networks


Dynamic/indirect:​
Nodes connect via switching elements. Paths are set up dynamically.

Properties:

●​ Flexible.
●​ Can connect large numbers of nodes efficiently.
●​ More complex control.
Examples:
1.​ Crossbar Switch:

i.​ Every processor can connect to any memory module via a switch matrix.

ii.​ Very high bandwidth, but cost grows O(n²).​

2.​ Multistage Interconnection Networks (MIN):​

i.​ Multiple layers of switches.


ii.​ Paths configured on demand.
iii.​ Less expensive than a crossbar.​
Examples:
1. Omega Network
2. Banyan Network
3. Clos Network
Multistage interconnection networks are often used in commercial parallel systems.
Key Properties of Interconnection Networks

When designing or evaluating a network, we analyze:

Property Meaning

Degree Number of links per node

Diameter Max number of hops between any two nodes

Bisection Bandwidth Bandwidth across the minimum cut dividing the network in half

Latency Time to deliver a message

Throughput Total data transfer rate of the network

Scalability How well the network grows as you add nodes

Fault Tolerance How robust the network is to failures

Examples of Properties

Topology                    Degree   Diameter         Bisection Bandwidth
Linear Array (n nodes)      2        n - 1            1 link
Ring                        2        n/2              2 links
Mesh (sqrt(n) x sqrt(n))    4        2(sqrt(n) - 1)   sqrt(n) links
Hypercube (n = 2^k)         k        k                n/2 links
Crossbar                    n        1                n links
1.6. Cache Coherence
What is Cache Coherence?
●​ In parallel computing, multiple processors (or cores) often work together to solve a problem.
●​ Each processor usually has its own cache — a small, fast memory that stores frequently
accessed data from the main memory (RAM).
●​ Cache coherence ensures that all processors see the most recent and consistent value of
shared data — no matter which cache holds it.
Why do we need Cache Coherence?
Without cache coherence:
●​ One processor may update a shared variable in its cache.
●​ Other processors may still have an old (stale) value in their caches.
●​ This leads to incorrect results in a parallel program.
Example:
​ // Assume X = 5 initially

Processor 1: X = 10; // updates its cached X


Processor 2: print(X); // still sees X = 5 in its cache => wrong value!
The Cache Coherence Problem
The problem occurs when:
●​ Multiple processors cache the same memory location.
●​ One processor modifies its cached copy.
●​ Other processors don't get informed of this change.
The system becomes incoherent — processors do not agree on the current value of that memory
location.
How is Cache Coherence Achieved?
There are two main approaches:

1.​ Snooping Protocols


●​ All caches watch (snoop) the shared memory bus.
●​ When a processor writes to a cached block:
○​ It broadcasts an update or invalidation to all other caches.
●​ Other caches either:
○​ Update their copies (write-update protocol), or
○​ Invalidate their copies (write-invalidate protocol).

Common protocol: MESI (Modified, Exclusive, Shared, Invalid)


2.​ Directory-Based Protocols
●​ Suitable for large-scale systems (e.g., clusters, NUMA architectures).
●​ A central directory keeps track of which caches have copies of a memory block.
●​ When a processor writes:​
The directory coordinates updates/invalidation of other caches.

Key Protocols (Example: MESI)

MESI tracks each cache line’s state:

●​ M (Modified) → Cache has the only valid copy, and it’s been changed.
●​ E (Exclusive) → Cache has the only valid copy, unchanged.
●​ S (Shared) → Multiple caches have the same valid copy.
●​ I (Invalid) → Cache copy is invalid (stale).

Importance in Parallel Computing


Cache coherence ensures:

●​ Correct execution of parallel programs that share data.


●​ All processors see a consistent view of shared variables.​

Without coherence:

●​ Programs might read stale values.


●​ Results could be unpredictable or wrong.​

Impact on Performance

Cache coherence mechanisms:

●​ Add overhead (extra bus traffic, protocol complexity).​

●​ Can cause false sharing: when processors modify different variables that happen to share a
cache line, leading to unnecessary invalidations.​
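A minimal sketch of how false sharing is usually avoided; the 64-byte figure is an assumption about a typical cache-line size, and the thread and iteration counts are arbitrary. Each thread updates only its own counter, and the padding keeps the counters on separate cache lines so the updates do not invalidate each other:
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4
#define PAD 64   // assumed cache-line size in bytes

// Padded so that each thread's counter lives on its own cache line.
struct padded_counter { long value; char pad[PAD - sizeof(long)]; };

int main(void)
{
    struct padded_counter counters[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) counters[t].value = 0;

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++)
            counters[id].value++;   // no false sharing thanks to the padding
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; t++) total += counters[t].value;
    printf("Total = %ld\n", total);
    return 0;
}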
MESI Protocol State Transitions

Whether the state changes from invalid to exclusive or shared depends on whether other cores hold
the same block (a simple OR logic on the hit signal).

1.​ Invalid to Shared.​


Read miss: issue a read request. If other cores already hold the block, the line is loaded in the Shared state.

2.​ Invalid to Exclusive.​


Read miss: issue a read request. If no other core holds the block, the line is loaded in the Exclusive state.

3.​ Invalid to Modified.​


Write miss: issue a read request and broadcast a message that invalidates Exclusive and Shared copies of the block on other cores. A dirty write-back may be required if another core holds the block in the Modified state.

4.​ Shared to Modified.​


Write hit: broadcast a message to invalidate the Shared copies on other cores.

5.​ Shared to Invalid.​


Snoop hit on a writer.​

6.​ Exclusive to Modified.​


Write hit: no broadcast is needed.
7.​ Exclusive to Invalid.​
Snoop hit on a writer.​

8.​ Modified to Shared.​


Snoop hit on a read: perform a dirty write-back; the reading core obtains the updated copy and moves from Invalid to Shared.

9.​ Modified to Invalid.​


Snoop hit on a write: perform a dirty write-back; the writing core obtains the updated copy and moves from Invalid to Modified.

1.7. Shared-Memory vs. Distributed-Memory

Parallel computing systems are typically classified into shared-memory and distributed-memory
architectures.
Below is a comparison table, followed by detailed explanations:
Aspect               Shared-Memory System                                Distributed-Memory System
Memory Access        All processors share a single global memory space   Each processor has its own private memory
Communication        Via shared variables (loads/stores)                 Via message passing (e.g., MPI)
Programming Model    Easier: threads, OpenMP, Pthreads                   Harder: explicit messages, MPI
Scalability          Limited by memory bandwidth and contention          Highly scalable across nodes
Examples             Multi-core CPUs, SMP machines                       Clusters, supercomputers
Synchronization      Locks, semaphores, barriers                         Explicit coordination in code
Cost                 Often more expensive per node                       Commodity hardware networked together

Shared-Memory Architecture

Definition:
A computing model where multiple processors (or cores) access the same physical
memory.

How does it work?


●​ All threads/processors read and write to a common address space.
●​ Hardware handles cache coherence (keeping caches consistent).
●​ Synchronization (e.g., mutexes) prevents race conditions.​
Shared-Memory Architecture Diagram

Programming:

●​ Easier because you don’t have to think about where the data is stored.​

●​ Typically use:
○​ OpenMP
○​ Pthreads
○​ Java threads

Example Systems:
●​ Multi-core desktop CPUs (Intel, AMD)​

●​ Symmetric Multiprocessors (SMPs)

Advantages:
●​ Simple programming model.​

●​ No explicit communication—data just lives in memory.​

Disadvantages:
●​ Harder to scale beyond a certain number of cores (memory bus contention).​

●​ Hardware cost rises rapidly as you scale cores and memory bandwidth.​
Distributed-Memory Architecture
Definition:
Each processor has its own private memory; processors communicate by sending
messages over a network.
How does it work?
●​ No global memory space.
●​ All data exchanges are explicit: you send/receive data.

Programming:
●​ More complex: you must manage data distribution and communication.
●​ Use:
○​ MPI (Message Passing Interface)
○​ PVM
Example Systems:
●​ Clusters of servers
●​ Supercomputers (e.g., IBM Blue Gene)
Advantages:
●​ Scalable to thousands of nodes.
●​ Commodity hardware can be assembled into powerful clusters.
Disadvantages:
●​ More difficult programming (must explicitly manage communication).
●​ Higher latency to move data between nodes.
Hybrid Systems
Modern HPC systems are often hybrid:
●​ Within a node: shared-memory (multi-core CPUs).
●​ Between nodes: distributed-memory (message passing).​

This is why you might see MPI+OpenMP programs:


●​ MPI for inter-node communication.
●​ OpenMP for intra-node parallelism.
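A minimal sketch of such a hybrid MPI+OpenMP program, assuming an MPI implementation that supports MPI_THREAD_FUNNELED and an OpenMP-capable compiler:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    // Ask MPI for thread support suitable for using OpenMP inside each process.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // MPI handles inter-node parallelism (one process per node);
    // OpenMP handles intra-node parallelism (threads on the node's cores).
    #pragma omp parallel
    {
        printf("Process %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}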

1.8. Coordinating the Processes/Threads


What is Coordination?

In parallel computing, coordination refers to:

●​ Controlling the activities of multiple threads/processes.


●​ Synchronizing their progress.
●​ Managing access to shared resources and data.
●​ Ensuring correct order and consistency.

Without coordination:

●​ Work may be done out of order.


●​ Threads could interfere (write/read conflicts).
●​ Results could be incorrect or inconsistent.

When is Coordination Needed?


Typical situations:

●​ When tasks depend on results from other tasks

●​ When multiple threads update shared variables

●​ When you want to ensure all threads reach a point before continuing

●​ When distributing or collecting data among processes

Coordination in Shared Memory Systems (Threads)

This is common in:

●​ Multicore CPUs
●​ GPUs
●​ Frameworks like OpenMP, Pthreads

Synchronization Primitives

Mutexes (Mutual Exclusion Locks)

●​ Ensure that only one thread enters a critical section at a time.​

●​ Prevent race conditions.

Example:
​ pthread_mutex_lock(&mutex);
// Critical section
pthread_mutex_unlock(&mutex);

​ Spinlocks

●​ Similar to mutexes but busy-wait (continuously check the lock).​

●​ Faster in some situations but can waste CPU cycles.

Semaphores

●​ Counting mechanisms to control access to resources.

●​ Used for signaling between threads.​
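A minimal sketch using POSIX semaphores (assuming a POSIX system and linking with -pthread); the thread count and the limit of two concurrent users are arbitrary:
#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

sem_t slots;   // counting semaphore limiting concurrent access

void *worker(void *arg)
{
    sem_wait(&slots);    // acquire one of the available slots
    printf("Thread %ld using the resource\n", (long)arg);
    sem_post(&slots);    // release the slot
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    sem_init(&slots, 0, 2);   // at most 2 threads use the resource at once
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&slots);
    return 0;
}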


Barriers

●​ Force all threads to wait until everyone reaches a synchronization point.​

●​ After the barrier, all proceed together.

​ OpenMP Example:
#pragma omp parallel
{
​ // Do work
​ #pragma omp barrier
​ // All threads wait here
}
​ Condition Variables

●​ Let threads wait for specific conditions to become true.​

●​ Used with mutexes.​

Atomic Operations

●​ Simple updates (e.g., increment) performed safely without locks.

OpenMP Example:

​ #pragma omp atomic


counter++;

Scheduling and Work Division

Coordination also means deciding:

●​ Which thread does which part of the work.

●​ In what order.​

OpenMP Scheduling:
●​ static: Fixed assignment of iterations.
●​ dynamic: Threads pull work as they finish chunks.
●​ guided: Like dynamic but decreasing chunk sizes.
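A brief sketch of how these clauses appear in code; process() and N are hypothetical placeholders, and the chunk size of 4 is arbitrary:
// Iterations are handed out in chunks of 4 as threads become free.
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    process(i);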
Coordination in Distributed Memory Systems (Processes)
This is common in:

●​ Clusters

●​ Supercomputers

●​ Message Passing Interface (MPI)​

Message Passing

●​ Processes explicitly send and receive messages.

●​ Unlike threads, processes do not share memory.​

Example:

MPI_Send();
MPI_Recv();
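A minimal point-to-point sketch, assuming at least two processes and that rank has been obtained with MPI_Comm_rank as in the earlier examples: rank 0 sends an integer that rank 1 receives.
int value;
if (rank == 0) {
    value = 42;
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // send to rank 1, tag 0
} else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);                           // receive from rank 0, tag 0
    printf("Rank 1 received %d\n", value);
}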

Collective Communication

Operations involving all processes:

●​ Broadcast: Send data from one process to all others.​

●​ Scatter: Distribute chunks of data to processes.​

●​ Gather: Collect data from all processes.​

●​ Reduce: Combine values (sum, max, etc.) across processes.
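A brief sketch combining two of these collectives, assuming MPI has been initialized and rank obtained as in the earlier examples: rank 0 broadcasts a parameter, then a reduction collects a global sum at rank 0.
int param = 0, local = 0, global = 0;
if (rank == 0) param = 100;                           // only the root knows the value initially
MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);     // now every process has param
local = param + rank;                                 // each process computes a partial result
MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) printf("Global sum = %d\n", global);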

​ Barriers

MPI has barriers similar to threads:

​ MPI_Barrier(MPI_COMM_WORLD);

All processes wait until everyone arrives.

Synchronization Challenges

●​ Network latency in communication.​

●​ Correct matching of send/receive operations.​

●​ Deadlocks if communication patterns are mismatched.


Coordination Design Patterns

These common patterns help organize parallel programs:

Fork-Join

●​ A master thread forks worker threads, then joins them.

OpenMP Example:

#pragma omp parallel


{
​ // Parallel work
}
// Implicit join

​ Master-Worker

●​ One process (master) distributes tasks to workers.

Pipeline

●​ Data flows through multiple stages of processing.

​ Data Parallel Reduction

●​ Each thread/process computes partial results, then combines them.

Challenges in Coordination
Race Conditions:
●​ Two or more threads update shared data simultaneously in an unsafe way.
Deadlocks:
●​ Two threads wait indefinitely for each other’s resources.
Livelocks:
●​ Threads keep changing state but make no progress.
Starvation:
●​ A thread never gets CPU time or resources.
False Sharing:
●​ Threads update different variables sharing the same cache line, hurting performance.
Example in OpenMP

Parallel Sum with Reduction and Barrier:

#include <omp.h>
#include <stdio.h>
int main()
{
    int i, N = 1000;
    double sum = 0.0, a[N];

    // Initialize array
    for (i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel
    {
        #pragma omp for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        #pragma omp barrier

        #pragma omp single
        printf("Total sum: %f\n", sum);
    }
    return 0;
}
Explanation:

●​ reduction: Coordinates summing safely.


●​ barrier: Synchronizes threads.
●​ single: Only one thread prints.
1.9. Shared-Memory Programming

What Is Shared-Memory Programming?


In shared-memory parallel computing, multiple threads or processes access a common memory space,
meaning all of them can read and write the same variables in memory.
This is different from distributed-memory systems (like MPI), where each process has its own private
memory, and data must be explicitly sent between processes.

Key Characteristics
Single Address Space

●​ All threads can access the same variables.


●​ Any change by one thread is immediately visible to others (unless caches or synchronization
issues interfere).
Implicit Communication

●​ No need to explicitly send messages—data is in shared memory.


Synchronization Is Critical

●​ Because threads share memory, you must control access to avoid data races (two threads
modifying data simultaneously).
Common Environments

●​ Multicore CPUs (e.g., Intel, AMD)


●​ Shared-memory machines (SMP)
●​ Programming APIs: OpenMP, Pthreads, Cilk, and TBB.

Example Scenario
Suppose you have this array:
int data[1000];
In shared-memory:

● All threads can update data[i].


●​ You need to synchronize updates if multiple threads write to the same index.
Programming Models
OpenMP

●​ The most common shared-memory API in C, C++, Fortran.


● You write parallel regions with pragmas like:

#pragma omp parallel for


for(int i=0; i<N; i++)
{
​ data[i] = data[i]*2;
}

●​ Threads automatically split the loop iterations.


●​ You control scheduling, synchronization, reductions.

Pthreads

●​ POSIX Threads library (C).


●​ Lower-level API—more control, more complexity.
●​ You manually create threads and manage synchronization:

​ pthread_create(&thread_id, NULL, function, args);

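A minimal complete sketch with Pthreads (compile and link with -pthread); the worker function and the thread count of four are illustrative:
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void *worker(void *arg)
{
    long id = (long)arg;
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    // Create the threads, passing each its ID as the argument.
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    // Wait for all threads to finish.
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}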
Challenges in Shared Memory


Data Races

●​ Two threads write the same variable without coordination.


●​ Causes unpredictable results.

False Sharing
Threads modify different variables that happen to be on the same cache line, causing
performance degradation.
Scalability
Performance can degrade when too many threads contend for memory bandwidth or locks.
Synchronization Tools
Shared memory requires careful synchronization to ensure correctness:

Tool Purpose

Mutex/Lock Ensure only one thread enters a critical section.

Atomic Operations Guarantee indivisible updates to variables.

Barriers Wait for all threads to reach a point before continuing.

Condition Variables Allow threads to wait for specific conditions.



Example: Summing an Array with OpenMP
#include <stdio.h>
#include <omp.h>
int main()
{
    int N = 1000;
    int data[N];
    for (int i = 0; i < N; i++) data[i] = 1;

    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
    {
        sum += data[i];
    }
    printf("Sum = %d\n", sum);
    return 0;
}
Why is this safe?
The reduction(+:sum) clause ensures each thread keeps a local copy of sum and combines the results automatically.
Performance Considerations
●​ Minimize critical sections.
●​ Avoid frequent locking.
●​ Use thread-local storage when possible.
●​ Balance workload among threads.

❖​ Shared-Memory Programming is an efficient way to parallelize tasks on multicore systems.


❖​ It simplifies data sharing compared to distributed-memory systems.
❖​ It requires careful synchronization to avoid bugs and performance issues.
❖​ OpenMP makes shared-memory programming much easier than low-level threads.

Characteristics:
●​ Threads/processes share the same address space.
●​ Easy to implement using threads.
●​ Synchronization via mutexes, semaphores, barriers.
●​ Examples: OpenMP, pthreads.

Challenges:
●​ Race conditions
●​ Deadlocks
●​ Cache coherence

1.10. Distributed-Memory Programming


Distributed-Memory Programming in Parallel Computing involves a model where each processor has
its own private memory, and communication between processors is done explicitly by passing
messages.

This model is widely used in clusters, supercomputers, and distributed systems where shared memory
is not physically feasible or scalable.
How does it work?
●​ Processes run on separate nodes (or processors).

●​ There is no global shared memory.​

●​ To share data:​

○​ One process sends a message.​

○​ Another receives it and processes it.


Programming Models

The most common API used:

●​ MPI (Message Passing Interface)​

Other tools:

●​ PVM (Parallel Virtual Machine)​

●​ OpenMPI, MPICH (implementations of MPI)​


Example: MPI Hello World
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);                        // Initialize MPI environment
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);    // Get rank
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);    // Get total processes
    printf("Hello from process %d of %d\n", world_rank, world_size);
    MPI_Finalize();                                // Clean up MPI
    return 0;
}
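To build and run this program, assuming an MPI implementation such as MPICH or Open MPI is installed, the usual commands are:
mpicc mpi_hello.c -o mpi_hello
mpirun -np 4 ./mpi_hello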

Advantages

●​ Scalability: Can run on thousands of nodes.​

●​ Cost-effectiveness: Can be built with commodity hardware.​

●​ Fault Isolation: Errors in one node may not crash the whole system.

Disadvantages

●​ Complex Programming: Requires manual data partitioning and message management.​

●​ Latency: Communication can be slow due to network overhead.​

●​ Debugging Difficulty: Harder to trace bugs across distributed nodes.


When to use distributed memory?

●​ Large-scale scientific computations​

●​ High-performance simulations​

●​ Big data processing on clusters​

●​ Cloud-based parallel computing

Characteristics:
●​ No shared address space.

●​ Communication done via explicit message passing.

●​ Suitable for large-scale computing on clusters.

●​ Examples: MPI, Hadoop (MapReduce model).

Challenges:
●​ Explicit communication management

●​ Higher programming complexity

●​ Data distribution and coordination


FIVE MARK QUESTIONS
1.​ Explain the differences between SIMD and MIMD systems with examples. (BT Level 2 –
Understand)

2.​ Describe various types of interconnection networks and their impact on performance.
(BT Level 2 – Understand)

3.​ Apply the concept of cache coherence in shared-memory systems using the MESI
protocol. (BT Level 3 – Apply)
4. Illustrate how coordination between threads is managed in shared-memory programming. (BT Level 3 – Apply)
5.​ Analyze the advantages and limitations of shared-memory versus distributed-memory
models. (BT Level 4 – Analyze)

TEN MARK QUESTIONS


1.​ Explain the classifications of parallel computers and their significance in modern computing.
(BT Level 2 – Understand)​

2.​ Illustrate the working of SIMD and MIMD systems with diagrams and discuss their practical
applications. (BT Level 3 – Apply)​

3.​ Apply the concept of interconnection networks by comparing different topologies and their effects
on data transfer efficiency. (BT Level 3 – Apply)​

4.​ Analyze the causes and effects of cache coherence problems in shared-memory systems and
explain how coherence protocols address them. (BT Level 4 – Analyze)​

5.​ Evaluate the advantages and limitations of shared-memory and distributed-memory models for
large-scale parallel applications. (BT Level 5 – Evaluate)​

6.​ Assess the effectiveness of thread/process coordination methods in avoiding synchronization


issues and ensuring parallel efficiency. (BT Level 5 – Evaluate)​

7.​ Design a parallel solution for matrix multiplication using a shared-memory model and explain
your approach. (BT Level 6 – Create)​

8.​ Propose a hybrid architecture that combines the benefits of both shared and distributed memory
systems, and justify your design choices. (BT Level 6 – Create)
