
BCS702 PARALLEL COMPUTING

MODULE-1

Introduction to parallel programming, Parallel hardware and parallel software – Classifications of parallel computers, SIMD systems, MIMD systems, Interconnection networks, Cache coherence, Shared-memory vs. distributed-memory, Coordinating the processes/threads, Shared-memory, Distributed-memory.

1.1 Introduction to Parallel Programming


Introduction to Parallel Computing
Conventionally, computer software was written for serial computing: to solve a problem, an algorithm divides the work into a sequence of discrete instructions, and these instructions are executed on the CPU one by one. Only after one instruction finishes does the next one start.
Parallel computing is the use of multiple processing elements simultaneously to solve a problem. The problem is broken down into parts that are solved concurrently, with every processing resource applied to the work operating at the same time.

Advantages of Parallel Computing over Serial Computing are as follows:


1. It saves time and money: many resources working together reduce run time and can cut costs.
2. Larger problems are often impractical to solve with serial computing.
3. It can take advantage of non-local resources when local resources are limited.
4. Serial computing leaves much of the available computing power idle; parallel computing makes better use of the hardware.

Why parallel computing?


● The real world is dynamic: many things happen at the same time in different places, producing enormous amounts of data to manage.
● Real-world data requires ever more dynamic simulation and modeling, and parallel computing is the key to achieving it.
● Parallel computing provides concurrency and saves time and money.
● Large, complex datasets can be managed practically only with a parallel computing approach.
● It ensures effective utilization of resources: the hardware is used effectively, whereas in serial computation only part of the hardware is used and the rest sits idle.
● It is impractical to implement many real-time systems using serial computing.

Applications of Parallel Computing:


●​ Databases and Data mining.
●​ Real-time simulation of systems.
●​ Science and Engineering.
●​ Advanced graphics, augmented reality, and virtual reality.
What is Parallel Programming?
●​ A programming technique where multiple processes or threads execute simultaneously to
solve a problem faster.
●​ Useful for high-performance computing, scientific simulations, real-time systems, etc.

Goals:
●​ Improve performance by executing multiple instructions concurrently.
●​ Optimize resource utilization across multicore or multiprocessor systems.
●​ Reduce execution time for large-scale or complex computations.

Real-World Analogy
Making Dinner with Friends (Parallel Cooking)
●​ Task: Make rice, curry, and salad.
●​ You alone (Sequential): Cook rice → then curry → then salad.
●​ With 3 friends (Parallel):
o​ You cook rice,
o​ Friend 1 makes curry,
o​ Friend 2 chops salad.
You complete the meal 3× faster!

Parallel Programming Models


❖​ Shared-Memory Model (using OpenMP)
❖​ Distributed-Memory Model (using MPI)
❖ CUDA (Compute Unified Device Architecture) for GPU programming

Shared-Memory Model (using OpenMP)


What is the Shared-Memory Model?
In the shared-memory model, multiple threads (or processes) run in parallel and share the same
address space (memory). All threads can read and write to shared variables, which enables
communication and synchronization.
●​ Memory is global and accessible to all threads.
●​ Threads are typically created and managed within the same process.
●​ Commonly used on multi-core systems where all cores can access the main memory.
OpenMP Overview:
OpenMP (Open Multi-Processing) is an API in C, C++, and Fortran for shared-memory parallel
programming. It uses compiler directives (pragmas), runtime library routines, and environment
variables to control parallel execution.
Key OpenMP Concepts

Concept                                        Description
#pragma omp parallel                           Starts a parallel region (forks threads).
#pragma omp for / #pragma omp parallel for     Splits loop iterations among threads.
shared, private                                Control variable scope (shared across threads or private to each).
#pragma omp critical                           Ensures a block of code is executed by only one thread at a time.
#pragma omp barrier                            Synchronizes all threads at a point.
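To see how critical and barrier appear in practice, here is a small illustrative sketch (the shared counter and the single directive for printing are just for demonstration):
#include <stdio.h>
#include <omp.h>
int main(void)
{
    int counter = 0;
    #pragma omp parallel
    {
        // Only one thread at a time may update the shared counter.
        #pragma omp critical
        counter++;

        // Wait until every thread has finished its update.
        #pragma omp barrier

        // Exactly one thread prints the final value.
        #pragma omp single
        printf("Counter = %d\n", counter);
    }
    return 0;
}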

Example: Parallel Sum of Array Elements
Sequential Code (C):
#include <stdio.h>
int main()
{
    int A[5] = {1, 2, 3, 4, 5};
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        sum += A[i];
    }
    printf("Sum = %d\n", sum);
    return 0;
}
Parallel Version Using OpenMP in C:
#include <stdio.h>
#include <omp.h>
int main()
{
    int A[5] = {1, 2, 3, 4, 5};
    int sum = 0;
    // Split the iterations among threads and combine the per-thread
    // partial sums safely with a reduction (avoids a race on sum).
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 5; i++) {
        sum += A[i];   // Each thread adds its share of the array
    }
    printf("Parallel Sum = %d\n", sum);
    return 0;
}
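Both versions compile with an ordinary C compiler; enabling OpenMP usually requires a flag, for example with GCC or Clang (the file name here is illustrative):
gcc -fopenmp parallel_sum.c -o parallel_sum
./parallel_sum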
Advantages of Shared-Memory Model
●​ Easier communication (no need for message passing).
●​ Efficient for small-to-medium number of processors.
●​ Natural fit for multi-threaded applications.

Distributed-Memory Model (using MPI)

What is the Distributed-Memory Model?


In the distributed-memory model, each process has its own private memory. Processes do not share memory and must communicate explicitly by passing messages.
●​ Used in clusters, supercomputers, and multi-node systems.
●​ Each process is typically an independent program running on different machines or cores.
●​ Data sharing is done using Message Passing Interface (MPI).

What is MPI?
MPI (Message Passing Interface) is a standardized and portable message-passing system to allow
processes to communicate with each other in a distributed-memory environment.
MPI provides:
●​ Point-to-point communication (MPI_Send, MPI_Recv)
●​ Collective communication (MPI_Bcast, MPI_Reduce, etc.)
●​ Process synchronization
●​ Scalability across nodes

Example: Parallel Sum of Array Elements using MPI

Scenario:
We have an array distributed across processes. Each process computes its partial sum, and then all
partial sums are combined using MPI_Reduce.

MPI Code in C:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int local_sum = 0, global_sum = 0;
    int data[4] = {1, 2, 3, 4};             // Assume size = 4 for simplicity

    MPI_Init(&argc, &argv);                 // Initialize MPI
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // Get process rank
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // Get total number of processes

    // Each process handles one element for simplicity
    local_sum = data[rank];

    // Reduce local sums to a global sum in the root process (rank 0)
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Total sum = %d\n", global_sum);
    }

    MPI_Finalize();                         // Finalize MPI
    return 0;
}

Key MPI Functions Used


Function Description
MPI_Init Starts the MPI environment
MPI_Comm_rank Gets the rank (ID) of the process
MPI_Comm_size Gets the total number of processes
MPI_Reduce Gathers and reduces values from all processes
MPI_Finalize Ends the MPI environment

Advantages of Distributed-Memory Model


●​ Highly scalable across many machines.
●​ Works across networks or clusters.
●​ Each process works independently — no race conditions in shared variables.

Disadvantages
●​ Complex programming due to explicit communication.
●​ Message latency and bandwidth issues.
●​ Debugging and synchronization can be harder than shared-memory.
Shared vs Distributed Memory
Feature Shared Memory (OpenMP) Distributed Memory (MPI)
Memory Shared across threads Private to each process
Communication Implicit via memory Explicit via message passing
Scalability Limited to one machine Scales across machines
Programming Easier More complex
Performance Good for few cores Better for large clusters

1.2. Parallel Hardware and Parallel Software


Parallel Hardware
Parallel hardware refers to the physical components (processors, memory systems,
interconnects) that make it possible to run multiple tasks simultaneously.
Why Parallel Hardware?
To solve large or complex problems faster, we need to perform multiple computations at the
same time. Parallel hardware provides the necessary support for this.
Types of Parallel Hardware
There are mainly two categories based on how processors work:
SIMD (Single Instruction Multiple Data)
● All processors execute the same instruction but on different pieces of data.
● Example: Graphics Processing Unit (GPU) – processes thousands of pixels simultaneously.
MIMD (Multiple Instruction Multiple Data)
● Each processor executes different instructions on different data.
● Example: Multicore CPUs, clusters, supercomputers.
Memory Architecture
How do processors access memory?

●​ Shared Memory: All processors access the same memory (e.g., multicore desktops).
●​ Distributed Memory: Each processor has its own local memory (e.g., MPI clusters).
Interconnection Networks
Used to connect processors to each other and to memory:
●​ Bus, Ring, Mesh, Hypercube, Crossbar are common topologies.
●​ Efficient networks are important to avoid delays in communication.
Cache Coherence
In shared-memory systems, it ensures that all processors see the same value of a variable, even
when it is cached.
Parallel Software
Parallel software is the programming and logic that tells the hardware how to run multiple
tasks simultaneously.
Why Parallel Software?
Even with powerful hardware, we need specially designed software and programs to use it
efficiently.
Programming Models
These are the ways we write parallel programs:
Shared Memory Model
●​ Threads share a global memory.
●​ Use OpenMP (Open Multi-Processing) for C/C++ or Fortran.
Distributed Memory Model
●​ Each process has its own memory.
●​ Communication is done using messages.
●​ Use MPI (Message Passing Interface).
Hybrid Model
●​ Combine both OpenMP + MPI for better performance in large systems.
Parallel Languages & Tools
●​ OpenMP – Easy parallelism for loops in C/C++.
●​ MPI – Communication between processes in clusters.
●​ CUDA – GPU programming using NVIDIA hardware.
●​ Chapel, Cilk, TBB – Other tools for parallel development.
Main Concepts in Parallel Software
●​ Task Decomposition: Split the problem into smaller tasks.
●​ Synchronization: Coordinate tasks using barriers, locks.
●​ Communication: Share data between tasks.
●​ Load Balancing: Ensure all processors do equal work.
Difference between Parallel Hardware and Software

Feature Parallel Hardware Parallel Software

What is it? Physical components (CPU, GPU, etc.) Programs and models for parallel execution

Role Performs multiple computations Manages and distributes tasks

Examples Multicore processors, Clusters OpenMP, MPI, CUDA

1.3. Classifications of Parallel Computers


In parallel computing, tasks are divided into smaller subtasks and executed simultaneously to
improve performance and efficiency. Based on how data and instructions are processed, parallel
computers are classified using Flynn’s taxonomy, which is one of the most well-known classification
schemes.
Flynn’s Taxonomy
Michael J. Flynn (1966) classified parallel computers into four categories based on the number of
instruction streams and data streams:
Classification                                Instruction Stream   Data Stream   Example
SISD (Single Instruction, Single Data)        1                    1             Traditional uniprocessor (like old PCs)
SIMD (Single Instruction, Multiple Data)      1                    M             GPU (Graphics Processing Unit), vector processors
MISD (Multiple Instruction, Single Data)      M                    1             Rare; theoretical or used in fault-tolerant systems
MIMD (Multiple Instruction, Multiple Data)    M                    M             Modern multi-core CPUs, clusters, supercomputers

1. SISD (Single Instruction, Single Data)


●​ Working: A single processor executes a single instruction on a single data item at a time.
●​ Architecture: Sequential and traditional von Neumann model.
●​ Example:
o​ Early computers like IBM 7090.
2. SIMD (Single Instruction, Multiple Data)
●​ Working: Multiple processing elements perform the same operation on multiple data
simultaneously.
●​ Best for: Vector processing, image processing, matrix operations.
●​ Example:
o​ GPUs in graphics rendering.
o​ Intel AVX (Advanced Vector Extensions).
// Example: Adding two arrays using SIMD in C (conceptual)
#pragma omp simd
for(int i = 0; i < N; i++)
{
C[i] = A[i] + B[i];
}
3. MISD (Multiple Instruction, Single Data)
●​ Working: Multiple processors execute different instructions on the same data stream.
●​ Usage: Mostly theoretical or used in specialized applications like redundant systems for
reliability (e.g., space systems).
●​ Example:
o​ Fault-tolerant computing in spacecraft (rare in practice).
4. MIMD (Multiple Instruction, Multiple Data)
●​ Working: Multiple processors execute different instructions on different data.
●​ Best for: General-purpose parallel computing.
●​ Two types:
o​ Shared Memory MIMD: All processors access the same global memory (e.g.,
OpenMP).
o​ Distributed Memory MIMD: Each processor has its own memory and they
communicate (e.g., MPI).
// OpenMP example (shared memory MIMD)

#pragma omp parallel


{
​ printf("Thread %d is working\n", omp_get_thread_num());
}
Summary Table:

Type Parallelism Memory Type Use Case


SISD None Single Sequential tasks
SIMD Data Shared Image/audio processing
MISD Fault tolerance Shared Redundant execution
MIMD Task & Data Shared or Distributed General parallel apps

1.4. SIMD and MIMD Systems


1. SIMD (Single Instruction, Multiple Data)
Definition:
SIMD systems allow a single control unit to dispatch one instruction to multiple processing
elements (PEs), each working on different data elements simultaneously.
How does it work?
●​ One instruction is broadcast to all processing units.
●​ Each unit executes the instruction on its own data.
●​ Works well with structured, data-parallel tasks.
Characteristics:

Feature Description
Instruction stream Single
Data stream Multiple
Control Centralized
Synchronization Implicit
Memory Shared or distributed (depends on implementation)
Applications:
●​ Image processing
●​ Signal processing
●​ Machine learning (matrix operations)
●​ Vectorized operations in scientific computing

Real-World Examples:
● GPUs (Graphics Processing Units): Process thousands of pixels using the same shader logic.
●​ Intel AVX/SSE instructions: SIMD vector extensions in CPUs.
●​ NEON in ARM processors: Used in mobile devices.
Example:
Adding two arrays of numbers: Sequential Code
for (int i = 0; i < 4; i++) {
C[i] = A[i] + B[i];
}
SIMD Logic:
SIMD_ADD A[0..3], B[0..3] -> C[0..3]
All additions are done in parallel by SIMD hardware.
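To make this concrete, here is a minimal sketch of the same element-wise addition written with Intel AVX intrinsics; it assumes an AVX-capable x86 CPU and a compiler option such as -mavx, and the 8-element arrays are purely illustrative:
#include <immintrin.h>   // Intel AVX intrinsics
#include <stdio.h>

int main(void)
{
    float A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float B[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float C[8];

    __m256 va = _mm256_loadu_ps(A);     // load 8 floats from A
    __m256 vb = _mm256_loadu_ps(B);     // load 8 floats from B
    __m256 vc = _mm256_add_ps(va, vb);  // one instruction adds all 8 pairs
    _mm256_storeu_ps(C, vc);            // store the 8 results into C

    for (int i = 0; i < 8; i++)
        printf("%.0f ", C[i]);
    printf("\n");
    return 0;
}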

2. MIMD (Multiple Instruction, Multiple Data)


Definition:
MIMD systems consist of multiple independent processors, each executing its own instruction
stream on its own data. This supports both data and task parallelism.
How does it work?
●​ Each processor has its own control unit and executes independently.
●​ Can coordinate via shared memory (SMP) or message passing (MPI).
Characteristics:

Feature Description
Instruction stream Multiple
Data stream Multiple
Control Decentralized
Synchronization Explicit (using threads, barriers, locks, or messages)
Memory Shared (e.g., multicore CPU) or Distributed (e.g., cluster)

Applications:
●​ Web servers
●​ Distributed simulations
●​ High-performance scientific computations
●​ Cloud computing

Real-World Examples:
●​ Multicore CPUs (Intel, AMD)
●​ Clusters (HPC supercomputers)
●​ Cloud infrastructure (AWS, Azure instances)
Example:
● A weather simulation might have:
   o Core 1 calculating wind patterns
   o Core 2 calculating rainfall
   o Core 3 modeling temperature
Each runs different programs on different data.
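A minimal sketch of this kind of task parallelism using OpenMP sections; the three functions are hypothetical placeholders standing in for the wind, rainfall, and temperature models:
#include <stdio.h>
#include <omp.h>

// Placeholder tasks standing in for the three models above.
void compute_wind(void)        { printf("Wind on thread %d\n", omp_get_thread_num()); }
void compute_rainfall(void)    { printf("Rainfall on thread %d\n", omp_get_thread_num()); }
void compute_temperature(void) { printf("Temperature on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    // Each section is a different instruction stream on different data (MIMD style).
    #pragma omp parallel sections
    {
        #pragma omp section
        compute_wind();
        #pragma omp section
        compute_rainfall();
        #pragma omp section
        compute_temperature();
    }
    return 0;
}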

SIMD vs MIMD:
Feature SIMD MIMD
Instruction stream Single Multiple
Data stream Multiple Multiple
Processing elements Uniform, tightly coupled Independent, loosely coupled
Control Centralized Decentralized
Type of parallelism Data parallelism Task and data parallelism
Example systems GPU, Vector processor Multicore CPU, Clusters
Programming model Vectorization, OpenCL, CUDA Threads, MPI, OpenMP
Synchronization Minimal/implicit Complex/explicit
Efficiency Very high for uniform tasks High for complex workloads

1.5. Interconnection Networks

Why do we need Interconnection Networks?

In parallel computing, many processors (CPUs) work together to solve a problem faster. But to
coordinate, they must:

●​ Share data (e.g., partial results)


●​ Send control messages (e.g., synchronization signals)
●​ Access shared memory
This requires a communication system connecting the processors and memories:​
Interconnection Network.

What is an Interconnection Network?

Definition:​
An interconnection network is the hardware and routing infrastructure that enables:
●​ Processor-to-Processor communication (message passing)​

●​ Processor-to-Memory communication (shared memory)​

It consists of:

●​ Nodes: Processors or memories.​

●​ Links: Physical connections (wires, buses).​

●​ Switches/Routers: Devices deciding which path data takes.

Classification of Interconnection Networks

There are two main classes:

Static (Direct) Networks

Static/direct:​
Each processor is directly connected to a fixed set of neighbors.

Properties:
●​ Predictable paths.
●​ Lower complexity.
●​ Good for fixed-size systems.
Examples:

1.​ Linear Array:


i.​ Processors in a line.
ii.​ Each connected to two neighbors (except ends).
iii.​ Diameter: O(n) — worst-case distance grows linearly.
2.​ Ring:
i.​ Ends connected to form a loop.
ii.​ Reduces diameter to n/2.
3.​ Mesh/Grid:
i.​ 2D layout.
ii.​ Used in many supercomputers.
iii.​ Diameter: O(sqrt(n)).
4.​ Torus:
i.​ Mesh + wraparound links.
ii.​ Improves latency.
5.​ Hypercube:
i.​ n-dimensional cube.
ii.​ Each node has log2(n) neighbors.
iii.​ Diameter: log2(n).
iv. Very scalable (see the sketch after this list).
6.​ Tree:
i.​ Hierarchical.
ii.​ Good broadcast capability but bottleneck near root.
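The sketch below makes the hypercube entry concrete: in a d-dimensional hypercube with n = 2^d nodes, two nodes are neighbors exactly when their IDs differ in a single bit, so flipping each bit of a node ID enumerates its log2(n) neighbors (the 3-dimensional example size is arbitrary):
#include <stdio.h>

// Print the neighbors of 'node' in a d-dimensional hypercube (n = 2^d nodes).
// Flipping one bit of the node ID gives one neighbor per dimension.
void hypercube_neighbors(int node, int d)
{
    printf("Node %d neighbors:", node);
    for (int bit = 0; bit < d; bit++)
        printf(" %d", node ^ (1 << bit));   // flip bit 'bit'
    printf("\n");
}

int main(void)
{
    int d = 3;   // 3-dimensional hypercube: 8 nodes, diameter 3
    for (int node = 0; node < (1 << d); node++)
        hypercube_neighbors(node, d);
    return 0;
}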

Dynamic (Indirect) Networks


Dynamic/indirect:​
Nodes connect via switching elements. Paths are set up dynamically.

Properties:

●​ Flexible.
●​ Can connect large numbers of nodes efficiently.
●​ More complex control.
Examples:
1.​ Crossbar Switch:

i.​ Every processor can connect to any memory module via a switch matrix.

ii.​ Very high bandwidth, but cost grows O(n²).​

2.​ Multistage Interconnection Networks (MIN):​

i.​ Multiple layers of switches.


ii.​ Paths configured on demand.
iii.​ Less expensive than a crossbar.​
Examples:
1. Omega Network
2. Banyan Network
3. Clos Network
Multistage interconnection networks are often used in commercial parallel systems.
Key Properties of Interconnection Networks

When designing or evaluating a network, we analyze:

Property Meaning

Degree Number of links per node

Diameter Max number of hops between any two nodes

Bisection Bandwidth Bandwidth across the minimum cut dividing the network in half

Latency Time to deliver a message

Throughput Total data transfer rate of the network

Scalability How well the network grows as you add nodes

Fault Tolerance How robust the network is to failures

Examples of Properties

Topology                    Degree   Diameter         Bisection Bandwidth
Linear Array (n nodes)      2        n - 1            1 link
Ring                        2        n/2              2 links
Mesh (sqrt(n) x sqrt(n))    4        2(sqrt(n) - 1)   sqrt(n) links
Hypercube (n = 2^k)         k        k                n/2 links
Crossbar                    n        1                n links
1.6. Cache Coherence
What is Cache Coherence?
●​ In parallel computing, multiple processors (or cores) often work together to solve a problem.
●​ Each processor usually has its own cache — a small, fast memory that stores frequently
accessed data from the main memory (RAM).
●​ Cache coherence ensures that all processors see the most recent and consistent value of
shared data — no matter which cache holds it.
Why do we need Cache Coherence?
Without cache coherence:
●​ One processor may update a shared variable in its cache.
●​ Other processors may still have an old (stale) value in their caches.
●​ This leads to incorrect results in a parallel program.
Example:
​ // Assume X = 5 initially

Processor 1: X = 10; // updates its cached X


Processor 2: print(X); // still sees X = 5 in its cache => wrong value!
The Cache Coherence Problem
The problem occurs when:
●​ Multiple processors cache the same memory location.
●​ One processor modifies its cached copy.
●​ Other processors don't get informed of this change.
The system becomes incoherent — processors do not agree on the current value of that memory
location.
How is Cache Coherence Achieved?
There are two main approaches:

1.​ Snooping Protocols


●​ All caches watch (snoop) the shared memory bus.
●​ When a processor writes to a cached block:
○​ It broadcasts an update or invalidation to all other caches.
●​ Other caches either:
○​ Update their copies (write-update protocol), or
○​ Invalidate their copies (write-invalidate protocol).

Common protocol: MESI (Modified, Exclusive, Shared, Invalid)


2.​ Directory-Based Protocols
●​ Suitable for large-scale systems (e.g., clusters, NUMA architectures).
●​ A central directory keeps track of which caches have copies of a memory block.
●​ When a processor writes:​
The directory coordinates updates/invalidation of other caches.

Key Protocols (Example: MESI)

MESI tracks each cache line’s state:

●​ M (Modified) → Cache has the only valid copy, and it’s been changed.
●​ E (Exclusive) → Cache has the only valid copy, unchanged.
●​ S (Shared) → Multiple caches have the same valid copy.
●​ I (Invalid) → Cache copy is invalid (stale).

Importance in Parallel Computing


Cache coherence ensures:

●​ Correct execution of parallel programs that share data.


●​ All processors see a consistent view of shared variables.​

Without coherence:

●​ Programs might read stale values.


●​ Results could be unpredictable or wrong.​

Impact on Performance

Cache coherence mechanisms:

●​ Add overhead (extra bus traffic, protocol complexity).​

●​ Can cause false sharing: when processors modify different variables that happen to share a
cache line, leading to unnecessary invalidations.​
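A minimal sketch of how false sharing is usually avoided; the 64-byte figure is an assumption about a typical cache-line size, and the thread and iteration counts are arbitrary. Each thread updates only its own counter, and the padding keeps the counters on separate cache lines so the updates do not invalidate each other:
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4
#define PAD 64   // assumed cache-line size in bytes

// Padded so that each thread's counter lives on its own cache line.
struct padded_counter { long value; char pad[PAD - sizeof(long)]; };

int main(void)
{
    struct padded_counter counters[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) counters[t].value = 0;

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++)
            counters[id].value++;   // no false sharing thanks to the padding
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; t++) total += counters[t].value;
    printf("Total = %ld\n", total);
    return 0;
}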
MESI Protocol State Transitions

Whether the state changes from invalid to exclusive or shared depends on whether other cores hold
the same block (a simple OR logic on the hit signal).

1.​ Invalid to Shared.​


Read miss: issue a read request. If other cores already hold the block, the line is loaded in the Shared state.

2.​ Invalid to Exclusive.​


Read miss: issue a read request. If no other core holds the block, the line is loaded in the Exclusive state.

3.​ Invalid to Modified.​


Write miss: issue a read request and broadcast a message that invalidates Exclusive and Shared copies of the block on other cores. A dirty write-back may be required if another core holds the block in the Modified state.

4.​ Shared to Modified.​


Write hit: broadcast a message to invalidate the Shared copies on other cores.

5.​ Shared to Invalid.​


Snoop hit on a writer.​

6.​ Exclusive to Modified.​


Write hit: no broadcast is needed.
7.​ Exclusive to Invalid.​
Snoop hit on a writer.​

8.​ Modified to Shared.​


Snoop hit on a read: perform a dirty write-back; the reading core obtains the updated copy and moves from Invalid to Shared.

9.​ Modified to Invalid.​


Snoop hit on a write: perform a dirty write-back; the writing core obtains the updated copy and moves from Invalid to Modified.

1.7. Shared-Memory vs. Distributed-Memory

Parallel computing systems are typically classified into shared-memory and distributed-memory
architectures.
Below is a comparison table, followed by detailed explanations:
Aspect               Shared-Memory System                                Distributed-Memory System
Memory Access        All processors share a single global memory space   Each processor has its own private memory
Communication        Via shared variables (loads/stores)                 Via message passing (e.g., MPI)
Programming Model    Easier: threads, OpenMP, Pthreads                   Harder: explicit messages, MPI
Scalability          Limited by memory bandwidth and contention          Highly scalable across nodes
Examples             Multi-core CPUs, SMP machines                       Clusters, supercomputers
Synchronization      Locks, semaphores, barriers                         Explicit coordination in code
Cost                 Often more expensive per node                       Commodity hardware networked together

Shared-Memory Architecture

Definition:
A computing model where multiple processors (or cores) access the same physical
memory.

How does it work?


●​ All threads/processors read and write to a common address space.
●​ Hardware handles cache coherence (keeping caches consistent).
●​ Synchronization (e.g., mutexes) prevents race conditions.​
Shared-Memory Architecture Diagram

Programming:

●​ Easier because you don’t have to think about where the data is stored.​

●​ Typically use:
○​ OpenMP
○​ Pthreads
○​ Java threads

Example Systems:
●​ Multi-core desktop CPUs (Intel, AMD)​

●​ Symmetric Multiprocessors (SMPs)

Advantages:
●​ Simple programming model.​

●​ No explicit communication—data just lives in memory.​

Disadvantages:
●​ Harder to scale beyond a certain number of cores (memory bus contention).​

●​ Hardware cost rises rapidly as you scale cores and memory bandwidth.​
Distributed-Memory Architecture
Definition:
Each processor has its own private memory; processors communicate by sending
messages over a network.
How does it work?
●​ No global memory space.
●​ All data exchanges are explicit: you send/receive data.

Programming:
●​ More complex: you must manage data distribution and communication.
●​ Use:
○​ MPI (Message Passing Interface)
○​ PVM
Example Systems:
●​ Clusters of servers
●​ Supercomputers (e.g., IBM Blue Gene)
Advantages:
●​ Scalable to thousands of nodes.
●​ Commodity hardware can be assembled into powerful clusters.
Disadvantages:
●​ More difficult programming (must explicitly manage communication).
●​ Higher latency to move data between nodes.
Hybrid Systems
Modern HPC systems are often hybrid:
●​ Within a node: shared-memory (multi-core CPUs).
●​ Between nodes: distributed-memory (message passing).​

This is why you might see MPI+OpenMP programs:


●​ MPI for inter-node communication.
●​ OpenMP for intra-node parallelism.
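A minimal sketch of such a hybrid MPI+OpenMP program, assuming an MPI implementation that supports MPI_THREAD_FUNNELED and an OpenMP-capable compiler:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    // Ask MPI for thread support suitable for using OpenMP inside each process.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // MPI handles inter-node parallelism (one process per node);
    // OpenMP handles intra-node parallelism (threads on the node's cores).
    #pragma omp parallel
    {
        printf("Process %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}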

1.8. Coordinating the Processes/Threads


What is Coordination?

In parallel computing, coordination refers to:

●​ Controlling the activities of multiple threads/processes.


●​ Synchronizing their progress.
●​ Managing access to shared resources and data.
●​ Ensuring correct order and consistency.

Without coordination:

●​ Work may be done out of order.


●​ Threads could interfere (write/read conflicts).
●​ Results could be incorrect or inconsistent.

When is Coordination Needed?


Typical situations:

●​ When tasks depend on results from other tasks

●​ When multiple threads update shared variables

●​ When you want to ensure all threads reach a point before continuing

●​ When distributing or collecting data among processes

Coordination in Shared Memory Systems (Threads)

This is common in:

●​ Multicore CPUs
●​ GPUs
●​ Frameworks like OpenMP, Pthreads

Synchronization Primitives

Mutexes (Mutual Exclusion Locks)

●​ Ensure that only one thread enters a critical section at a time.​

●​ Prevent race conditions.

Example:
​ pthread_mutex_lock(&mutex);
// Critical section
pthread_mutex_unlock(&mutex);

​ Spinlocks

●​ Similar to mutexes but busy-wait (continuously check the lock).​

●​ Faster in some situations but can waste CPU cycles.

Semaphores

●​ Counting mechanisms to control access to resources.

●​ Used for signaling between threads.​
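A minimal sketch using POSIX semaphores (assuming a POSIX system and linking with -pthread); the thread count and the limit of two concurrent users are arbitrary:
#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

sem_t slots;   // counting semaphore limiting concurrent access

void *worker(void *arg)
{
    sem_wait(&slots);    // acquire one of the available slots
    printf("Thread %ld using the resource\n", (long)arg);
    sem_post(&slots);    // release the slot
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    sem_init(&slots, 0, 2);   // at most 2 threads use the resource at once
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&slots);
    return 0;
}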


Barriers

●​ Force all threads to wait until everyone reaches a synchronization point.​

●​ After the barrier, all proceed together.

​ OpenMP Example:
#pragma omp parallel
{
​ // Do work
​ #pragma omp barrier
​ // All threads wait here
}
​ Condition Variables

●​ Let threads wait for specific conditions to become true.​

●​ Used with mutexes.​

Atomic Operations

●​ Simple updates (e.g., increment) performed safely without locks.

OpenMP Example:

​ #pragma omp atomic


counter++;

Scheduling and Work Division

Coordination also means deciding:

●​ Which thread does which part of the work.

●​ In what order.​

OpenMP Scheduling:
●​ static: Fixed assignment of iterations.
●​ dynamic: Threads pull work as they finish chunks.
●​ guided: Like dynamic but decreasing chunk sizes.
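A brief sketch of how these clauses appear in code; process() and N are hypothetical placeholders, and the chunk size of 4 is arbitrary:
// Iterations are handed out in chunks of 4 as threads become free.
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    process(i);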
Coordination in Distributed Memory Systems (Processes)
This is common in:

●​ Clusters

●​ Supercomputers

●​ Message Passing Interface (MPI)​

Message Passing

●​ Processes explicitly send and receive messages.

●​ Unlike threads, processes do not share memory.​

Example:

MPI_Send();
MPI_Recv();
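A minimal point-to-point sketch, assuming at least two processes and that rank has been obtained with MPI_Comm_rank as in the earlier examples: rank 0 sends an integer that rank 1 receives.
int value;
if (rank == 0) {
    value = 42;
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // send to rank 1, tag 0
} else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);                           // receive from rank 0, tag 0
    printf("Rank 1 received %d\n", value);
}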

Collective Communication

Operations involving all processes:

●​ Broadcast: Send data from one process to all others.​

●​ Scatter: Distribute chunks of data to processes.​

●​ Gather: Collect data from all processes.​

●​ Reduce: Combine values (sum, max, etc.) across processes.
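A brief sketch combining two of these collectives, assuming MPI has been initialized and rank obtained as in the earlier examples: rank 0 broadcasts a parameter, then a reduction collects a global sum at rank 0.
int param = 0, local = 0, global = 0;
if (rank == 0) param = 100;                           // only the root knows the value initially
MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);     // now every process has param
local = param + rank;                                 // each process computes a partial result
MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) printf("Global sum = %d\n", global);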

​ Barriers

MPI has barriers similar to threads:

​ MPI_Barrier(MPI_COMM_WORLD);

All processes wait until everyone arrives.

Synchronization Challenges

●​ Network latency in communication.​

●​ Correct matching of send/receive operations.​

●​ Deadlocks if communication patterns are mismatched.


Coordination Design Patterns

These common patterns help organize parallel programs:

Fork-Join

●​ A master thread forks worker threads, then joins them.

OpenMP Example:

#pragma omp parallel


{
​ // Parallel work
}
// Implicit join

​ Master-Worker

●​ One process (master) distributes tasks to workers.

Pipeline

●​ Data flows through multiple stages of processing.

​ Data Parallel Reduction

●​ Each thread/process computes partial results, then combines them.

Challenges in Coordination
Race Conditions:
●​ Two or more threads update shared data simultaneously in an unsafe way.
Deadlocks:
●​ Two threads wait indefinitely for each other’s resources.
Livelocks:
●​ Threads keep changing state but make no progress.
Starvation:
●​ A thread never gets CPU time or resources.
False Sharing:
●​ Threads update different variables sharing the same cache line, hurting performance.
Example in OpenMP

Parallel Sum with Reduction and Barrier:

#include <omp.h>
#include <stdio.h>
int main()
{
    int i, N = 1000;
    double sum = 0.0, a[N];

    // Initialize array
    for (i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel
    {
        #pragma omp for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        #pragma omp barrier

        #pragma omp single
        printf("Total sum: %f\n", sum);
    }
    return 0;
}
Explanation:

●​ reduction: Coordinates summing safely.


●​ barrier: Synchronizes threads.
●​ single: Only one thread prints.
1.9. Shared-Memory Programming

What Is Shared-Memory Programming?


In shared-memory parallel computing, multiple threads or processes access a common memory space,
meaning all of them can read and write the same variables in memory.
This is different from distributed-memory systems (like MPI), where each process has its own private
memory, and data must be explicitly sent between processes.

Key Characteristics
Single Address Space

●​ All threads can access the same variables.


●​ Any change by one thread is immediately visible to others (unless caches or synchronization
issues interfere).
Implicit Communication

●​ No need to explicitly send messages—data is in shared memory.


Synchronization Is Critical

●​ Because threads share memory, you must control access to avoid data races (two threads
modifying data simultaneously).
Common Environments

●​ Multicore CPUs (e.g., Intel, AMD)


●​ Shared-memory machines (SMP)
●​ Programming APIs: OpenMP, Pthreads, Cilk, and TBB.

Example Scenario
Suppose you have this array:
int data[1000];
In shared-memory:

● All threads can update data[i].


●​ You need to synchronize updates if multiple threads write to the same index.
Programming Models
OpenMP

●​ The most common shared-memory API in C, C++, Fortran.


● You write parallel regions with pragmas like:

#pragma omp parallel for


for(int i=0; i<N; i++)
{
​ data[i] = data[i]*2;
}

●​ Threads automatically split the loop iterations.


●​ You control scheduling, synchronization, reductions.

Pthreads

●​ POSIX Threads library (C).


●​ Lower-level API—more control, more complexity.
●​ You manually create threads and manage synchronization:

​ pthread_create(&thread_id, NULL, function, args);

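A minimal complete sketch with Pthreads (compile and link with -pthread); the worker function and the thread count of four are illustrative:
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void *worker(void *arg)
{
    long id = (long)arg;
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    // Create the threads, passing each its ID as the argument.
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    // Wait for all threads to finish.
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}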
Challenges in Shared Memory


Data Races

●​ Two threads write the same variable without coordination.


●​ Causes unpredictable results.

False Sharing
Threads modify different variables that happen to be on the same cache line, causing
performance degradation.
Scalability
Performance can degrade when too many threads contend for memory bandwidth or locks.
Synchronization Tools
Shared memory requires careful synchronization to ensure correctness:

Tool Purpose

Mutex/Lock Ensure only one thread enters a critical section.

Atomic Operations Guarantee indivisible updates to variables.

Barriers Wait for all threads to reach a point before continuing.

Condition Variables Allow threads to wait for specific conditions.



Example: Summing an Array with OpenMP
#include <stdio.h>
#include <omp.h>
int main()
{
    int N = 1000;
    int data[N];
    for (int i = 0; i < N; i++) data[i] = 1;

    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
    {
        sum += data[i];
    }
    printf("Sum = %d\n", sum);
    return 0;
}
Why is this safe?
The reduction(+:sum) clause ensures each thread keeps a local copy of sum and combines the results automatically.
Performance Considerations
●​ Minimize critical sections.
●​ Avoid frequent locking.
●​ Use thread-local storage when possible.
●​ Balance workload among threads.

❖​ Shared-Memory Programming is an efficient way to parallelize tasks on multicore systems.


❖​ It simplifies data sharing compared to distributed-memory systems.
❖​ It requires careful synchronization to avoid bugs and performance issues.
❖​ OpenMP makes shared-memory programming much easier than low-level threads.

Characteristics:
●​ Threads/processes share the same address space.
●​ Easy to implement using threads.
●​ Synchronization via mutexes, semaphores, barriers.
●​ Examples: OpenMP, pthreads.

Challenges:
●​ Race conditions
●​ Deadlocks
●​ Cache coherence

1.10. Distributed-Memory Programming


Distributed-Memory Programming in Parallel Computing involves a model where each processor has
its own private memory, and communication between processors is done explicitly by passing
messages.

This model is widely used in clusters, supercomputers, and distributed systems where shared memory
is not physically feasible or scalable.
How does it work?
●​ Processes run on separate nodes (or processors).

●​ There is no global shared memory.​

●​ To share data:​

○​ One process sends a message.​

○​ Another receives it and processes it.


Programming Models

The most common API used:

●​ MPI (Message Passing Interface)​

Other tools:

●​ PVM (Parallel Virtual Machine)​

●​ OpenMPI, MPICH (implementations of MPI)​


Example: MPI Hello World
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);                        // Initialize MPI environment
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);    // Get rank
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);    // Get total processes
    printf("Hello from process %d of %d\n", world_rank, world_size);
    MPI_Finalize();                                // Clean up MPI
    return 0;
}
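To build and run this program, assuming an MPI implementation such as MPICH or Open MPI is installed, the usual commands are:
mpicc mpi_hello.c -o mpi_hello
mpirun -np 4 ./mpi_hello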

Advantages

●​ Scalability: Can run on thousands of nodes.​

●​ Cost-effectiveness: Can be built with commodity hardware.​

●​ Fault Isolation: Errors in one node may not crash the whole system.

Disadvantages

●​ Complex Programming: Requires manual data partitioning and message management.​

●​ Latency: Communication can be slow due to network overhead.​

●​ Debugging Difficulty: Harder to trace bugs across distributed nodes.


When to use distributed memory?

●​ Large-scale scientific computations​

●​ High-performance simulations​

●​ Big data processing on clusters​

●​ Cloud-based parallel computing

Characteristics:
●​ No shared address space.

●​ Communication done via explicit message passing.

●​ Suitable for large-scale computing on clusters.

●​ Examples: MPI, Hadoop (MapReduce model).

Challenges:
●​ Explicit communication management

●​ Higher programming complexity

●​ Data distribution and coordination


FIVE MARK QUESTIONS
1.​ Explain the differences between SIMD and MIMD systems with examples. (BT Level 2 –
Understand)

2.​ Describe various types of interconnection networks and their impact on performance.
(BT Level 2 – Understand)

3.​ Apply the concept of cache coherence in shared-memory systems using the MESI
protocol. (BT Level 3 – Apply)
4. Illustrate how coordination between threads is managed in shared-memory programming. (BT Level 3 – Apply)
5.​ Analyze the advantages and limitations of shared-memory versus distributed-memory
models. (BT Level 4 – Analyze)

TEN MARK QUESTIONS


1.​ Explain the classifications of parallel computers and their significance in modern computing.
(BT Level 2 – Understand)​

2.​ Illustrate the working of SIMD and MIMD systems with diagrams and discuss their practical
applications. (BT Level 3 – Apply)​

3.​ Apply the concept of interconnection networks by comparing different topologies and their effects
on data transfer efficiency. (BT Level 3 – Apply)​

4.​ Analyze the causes and effects of cache coherence problems in shared-memory systems and
explain how coherence protocols address them. (BT Level 4 – Analyze)​

5.​ Evaluate the advantages and limitations of shared-memory and distributed-memory models for
large-scale parallel applications. (BT Level 5 – Evaluate)​

6.​ Assess the effectiveness of thread/process coordination methods in avoiding synchronization


issues and ensuring parallel efficiency. (BT Level 5 – Evaluate)​

7.​ Design a parallel solution for matrix multiplication using a shared-memory model and explain
your approach. (BT Level 6 – Create)​

8.​ Propose a hybrid architecture that combines the benefits of both shared and distributed memory
systems, and justify your design choices. (BT Level 6 – Create)
