2018-19 PYQ
Question:=1(a) Discuss the motivation for concurrency in software
1. Performance and Responsiveness
Improved throughput: Concurrency allows multiple tasks to be executed in
overlapping time periods, increasing the overall work done.
Responsiveness in UI applications: In interactive systems (like mobile or desktop
apps), concurrency ensures the user interface remains responsive while background
tasks (e.g., file downloads, data processing) run simultaneously.
2. Efficient Resource Utilization
CPU utilization: Modern CPUs have multiple cores. Concurrency enables software
to leverage these cores effectively, maximizing hardware potential.
I/O waiting: While waiting for I/O operations (like disk or network access),
concurrent systems can switch to other tasks instead of idling.
3. Real-Time and Asynchronous Processing
Real-time systems: Applications like robotics, gaming, or embedded systems require
tasks to run concurrently to meet timing constraints.
Asynchronous workflows: Web servers and cloud applications often handle
thousands of requests concurrently, improving scalability and user experience.
4. Modularity and Separation of Concerns
Decoupling components: Concurrency allows different parts of a system (e.g., data
collection, processing, and logging) to operate independently, improving
maintainability and clarity.
5. Scalability in Distributed Systems
Cloud and microservices: Concurrency is fundamental in distributed architectures,
where services run in parallel across machines or containers to handle large-scale
workloads.
Question:=1(b) Differentiate between symmetric memory architecture and distributed memory architecture.
Symmetric Memory Architecture vs. Distributed Memory Architecture:
| Feature | Symmetric Memory Architecture (SMA) | Distributed Memory Architecture (DMA) |
|---|---|---|
| Memory Access | All processors share a single, global memory space | Each processor has its own private memory |
| Communication | Implicit via shared memory | Explicit via message passing |
| Scalability | Limited scalability due to memory contention | Highly scalable across many nodes |
| Latency | Lower latency for memory access | Higher latency due to network communication |
| Programming Complexity | Easier to program (shared memory model) | More complex (requires explicit communication) |
| Examples | Multi-core desktops, SMP systems | Clusters, supercomputers, MPI-based systems |
| Fault Tolerance | Less fault-tolerant (shared memory is a single point of failure) | More fault-tolerant (nodes are independent) |
Summary
SMA is great for small to medium-scale systems where ease of programming and
low-latency access are important.
DMA shines in large-scale, distributed environments where scalability and fault
tolerance are critical.
Question:=1(c) what do you understand by task decomposition and data
decomposition
Task Decomposition (Functional Decomposition)
This involves breaking a problem into distinct tasks or functions, each performing a specific
part of the overall computation.
Focus: What needs to be done.
Example: In a web server, one task handles incoming requests, another processes
data, and another sends responses.
Use Case: Ideal when different operations can be performed independently or in
parallel.
Goal: Maximize concurrency by identifying independent or loosely coupled tasks.
Data Decomposition (Domain Decomposition)
This involves dividing the data into chunks and performing the same operation on each
chunk in parallel.
Focus: How the data is divided and processed.
Example: In image processing, splitting an image into sections and applying a filter
to each section simultaneously.
Use Case: Best when the same computation is applied to large datasets.
Goal: Improve performance by distributing data across multiple processors.
Combined Use
In real-world applications, both techniques are often used together:
Task decomposition handles different stages of a pipeline.
Data decomposition speeds up each stage by parallelizing data processing.
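To illustrate both styles side by side, here is a minimal OpenMP sketch in C (array and function names are illustrative): the parallel loop applies data decomposition, and the parallel sections apply task decomposition.
c
#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N], b[N], c[N];

    // Data decomposition: the same operation runs on different chunks of the arrays.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    // Task decomposition: distinct pieces of work run concurrently.
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("Task 1: handle input\n");
        #pragma omp section
        printf("Task 2: process data\n");
    }
    return 0;
}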
Question:=1(d) discuss the 2 atomic operations performed on a "lock"
1. Acquire (or Lock)
Purpose: To gain exclusive access to a shared resource.
Behavior:
o If the lock is free, the thread acquires it and proceeds.
o If the lock is already held, the thread is blocked (or spins) until the lock
becomes available.
Atomicity: The check-and-set operation must be atomic to prevent race conditions.
This is often implemented using hardware instructions like Test-and-Set, Compare-
and-Swap (CAS), or Load-Link/Store-Conditional (LL/SC).
2. Release (or Unlock)
Purpose: To relinquish control of the lock so that other threads can acquire it.
Behavior:
o The lock is marked as available.
o If other threads are waiting, one may be awakened to acquire the lock.
Atomicity: Ensures that the lock state is updated without interference from other
threads.
Why Atomicity Matters
Without atomic operations, two threads could simultaneously believe they’ve acquired the
lock, leading to data corruption or undefined behavior. Atomic instructions ensure that
lock acquisition and release are indivisible, preserving correctness in concurrent
environments.
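These two operations can be sketched as a simple spinlock built on C11 atomics; this is a minimal illustration of an atomic acquire/release pair, not a production-quality lock.
c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

// Acquire: atomically test-and-set the flag; spin while it was already set.
void lock_acquire(void) {
    while (atomic_flag_test_and_set(&lock)) {
        /* busy-wait until the holder releases the lock */
    }
}

// Release: atomically clear the flag so another thread can acquire it.
void lock_release(void) {
    atomic_flag_clear(&lock);
}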
Question:=1(e) define convoying
General Definition
Convoying refers to the act of accompanying or escorting a group of vehicles, ships, or
people—typically for protection or coordination.
Example: Military vehicles convoying supply trucks through a conflict zone.
Usage: Common in military, humanitarian, and transportation contexts.
According to the Cambridge Dictionary:
“To travel with a vehicle or group of people to make certain that they arrive safely.”
In Computing (Contextual Note)
In computer science, particularly in concurrent systems, convoying has a more specific
meaning:
🧱 Convoying in Concurrency
Definition: A performance issue where a slow thread holding a lock causes a queue
of waiting threads to build up behind it.
Effect: Even fast threads are delayed, leading to reduced system throughput.
Cause: Often due to poor lock management or thread scheduling.
🔄 Example:
If Thread A holds a lock and is preempted or runs slowly, Threads B, C, and D must wait—
even if they could have completed their tasks quickly.
Question:=1(f) classify the synchronization primitives
1. Mutual Exclusion Primitives
These ensure that only one thread accesses a critical section at a time.
Mutex (Mutual Exclusion Lock): Basic lock that allows only one thread to enter a
critical section.
Spinlock: A lock where threads continuously check (spin) until the lock becomes
available.
Binary Semaphore: A semaphore with only two states (0 and 1), often used like a
mutex.
2. Signaling Primitives
Used for communication between threads—one thread signals another to proceed.
Semaphore: A counter-based signaling mechanism. Can be:
o Binary Semaphore (acts like a mutex)
o Counting Semaphore (allows multiple threads to access a resource)
Condition Variable: Allows threads to wait for certain conditions to be true.
Event: Used to signal one or more threads that an event has occurred.
3. Barriers
Used to synchronize a group of threads at a specific point.
Barrier: All threads must reach the barrier before any can proceed.
Cyclic Barrier: A reusable barrier that resets after all threads reach it.
4. Read/Write Locks
Allow multiple readers or one writer at a time.
Reader-Writer Lock: Optimizes access when reads are more frequent than writes.
5. Atomic Operations
Low-level primitives that perform operations atomically without locks.
Compare-and-Swap (CAS)
Fetch-and-Add
Test-and-Set
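As a small illustration of the atomic-operation category above, the following C11 sketch (the shared counter is hypothetical) increments a value lock-free, once with compare-and-swap and once with fetch-and-add.
c
#include <stdatomic.h>

static atomic_int counter = 0;

// Lock-free increment using compare-and-swap (CAS): retry until the value
// has not changed between the read and the update.
void increment_cas(void) {
    int expected = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
        /* expected now holds the current value; retry */
    }
}

// The same effect with a single fetch-and-add operation.
void increment_faa(void) {
    atomic_fetch_add(&counter, 1);
}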
Question:=1(g) How can thread overhead be minimized?
1. Thread Pooling
What it is: Reusing a fixed number of threads to handle multiple tasks.
Why it helps: Avoids the overhead of frequent thread creation and destruction.
Example: ExecutorService in Java, ThreadPoolExecutor in Python.
2. Use Lightweight Threads or Coroutines
What it is: Use user-space threads (like coroutines or fibers) instead of OS threads.
Why it helps: They have lower context-switching overhead and are more scalable.
Example: async/await in Python, Kotlin coroutines, Go goroutines.
3. Reduce Context Switching
What it is: Minimize the number of times the CPU switches between threads.
Why it helps: Context switching is expensive due to saving/restoring thread states.
How:
o Reduce the number of active threads.
o Avoid unnecessary blocking.
o Use CPU affinity to keep threads on the same core.
4. Efficient Synchronization
What it is: Use fine-grained or lock-free synchronization mechanisms.
Why it helps: Reduces contention and waiting time between threads.
How:
o Use atomic operations.
o Prefer concurrent data structures.
o Avoid holding locks longer than necessary.
5. Batching and Task Granularity
What it is: Combine small tasks into larger ones to reduce scheduling overhead.
Why it helps: Fewer tasks mean fewer thread switches and less overhead.
6. Profile and Tune
What it is: Use profiling tools to identify bottlenecks and optimize thread usage.
Why it helps: Helps you make data-driven decisions about thread management.
Question:=2(a) Illustrate Flynn's classification in detail with a neat and clean diagram.
Flynn's Classification Overview
| Category | Full Form | Instruction Stream | Data Stream | Example |
|---|---|---|---|---|
| SISD | Single Instruction, Single Data | 1 | 1 | Traditional single-core CPU |
| SIMD | Single Instruction, Multiple Data | 1 | Many | GPUs, vector processors |
| MISD | Multiple Instruction, Single Data | Many | 1 | Rare, used in fault-tolerant systems |
| MIMD | Multiple Instruction, Multiple Data | Many | Many | Multi-core CPUs, distributed systems |
Flynn’s Classification – Text Diagram
+----------------+--------------------+--------------------+
| Category | Instruction Stream | Data Stream |
+----------------+--------------------+--------------------+
| SISD | → | → |
| (Single Instr, | One instruction | One data element |
| Single Data) | stream | processed |
+----------------+--------------------+--------------------+
| SIMD | → | → → → |
| (Single Instr, | One instruction | Multiple data |
| Multiple Data)| stream | elements |
+----------------+--------------------+--------------------+
| MISD | → → → | → |
| (Multiple Instr| Multiple instr. | One data stream |
| Single Data) | streams | processed |
+----------------+--------------------+--------------------+
| MIMD | → → → | → → → |
| (Multiple Instr| Multiple instr. | Multiple data |
| Multiple Data)| streams | streams |
+----------------+--------------------+--------------------+
Each arrow (→) represents a stream. This layout helps visualize how instruction and data
streams vary across the four categories.
Question:=2(b) write a note on data flow decomposition and its implications
Data Flow Decomposition: A Note
Data flow decomposition is a parallel programming strategy where a problem is broken
down based on the flow of data between operations or components. Instead of focusing on
tasks or data chunks, this approach emphasizes how data moves through a sequence of
transformations.
Key Concepts
Pipeline structure: Computation is organized as a series of stages, where the output
of one stage becomes the input of the next.
Concurrency through flow: Each stage can be executed concurrently as soon as its
input data is available.
Streaming model: Often used in systems that process continuous streams of data
(e.g., video processing, signal processing).
Implications of Data Flow Decomposition
Advantages
Natural parallelism: Each stage can run in parallel, improving throughput.
Modularity: Each stage is typically a self-contained unit, making the system easier to
understand and maintain.
Scalability: Pipelines can be scaled by replicating stages or distributing them across
processors.
Challenges
Load balancing: If one stage is slower than others, it becomes a bottleneck.
Data dependencies: Complex dependencies between stages can limit parallelism.
Debugging difficulty: Tracing data through asynchronous stages can be harder than
in sequential code.
Real-World Examples
Compiler design: Lexical analysis → Parsing → Semantic analysis → Code
generation.
Multimedia processing: Decode → Filter → Encode.
Big data pipelines: Ingest → Transform → Analyze → Store (e.g., in Apache Spark
or Flink).
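To make the pipeline idea concrete, here is a minimal sketch using OpenMP task dependences (assuming an OpenMP 4.0+ compiler; the stage bodies are placeholders): each stage becomes a task that starts as soon as its input data is available.
c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int raw = 0, filtered = 0;
    #pragma omp parallel
    #pragma omp single
    {
        // Stage 1: produce data (e.g., decode)
        #pragma omp task depend(out: raw)
        raw = 42;
        // Stage 2: runs only after stage 1's output is ready (e.g., filter)
        #pragma omp task depend(in: raw) depend(out: filtered)
        filtered = raw * 2;
        // Stage 3: consumes stage 2's output (e.g., encode/store)
        #pragma omp task depend(in: filtered)
        printf("result = %d\n", filtered);
    }
    return 0;
}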
Question:=2(c) Generalize on semaphores and barriers
Semaphores
A semaphore is a signaling mechanism used to control access to a shared resource by
multiple threads.
Key Characteristics:
Maintains a counter representing the number of available resources.
Two main operations:
o wait() or P(): Decrements the counter. If the counter is zero, the thread is
blocked.
o signal() or V(): Increments the counter and potentially wakes a waiting
thread.
Can be:
o Binary Semaphore (value is 0 or 1): Acts like a mutex.
o Counting Semaphore (value ≥ 0): Allows multiple threads to access a limited
number of resources.
Use Cases:
Controlling access to a pool of resources (e.g., database connections).
Implementing producer-consumer problems.
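A minimal POSIX sketch (using sem_init, sem_wait, and sem_post, with a hypothetical pool of three identical resources):
c
#include <semaphore.h>
#include <stdio.h>

sem_t pool;                      // counting semaphore guarding 3 resources

void use_resource(void) {
    sem_wait(&pool);             // P(): blocks if all 3 resources are in use
    printf("using a resource\n");
    sem_post(&pool);             // V(): releases the resource
}

int main(void) {
    sem_init(&pool, 0, 3);       // not shared across processes (0), initial count 3
    use_resource();
    sem_destroy(&pool);
    return 0;
}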
Barriers
A barrier is a synchronization point where multiple threads or processes must wait until all
have reached the barrier before any can proceed.
Key Characteristics:
Ensures that all threads reach a certain point before continuing.
Often used in parallel algorithms where phases must be synchronized.
Can be:
o One-time barrier: Used once in the program.
o Cyclic barrier: Reusable across multiple synchronization points.
Use Cases:
Parallel matrix computations.
Multi-phase simulations where each phase depends on the completion of the previous
one by all threads.
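A minimal POSIX sketch using pthread_barrier_t, assuming four worker threads that must all finish phase 1 before any begins phase 2:
c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
pthread_barrier_t barrier;

void *worker(void *arg) {
    (void)arg;
    printf("phase 1 done\n");
    pthread_barrier_wait(&barrier);   // wait until all 4 threads arrive
    printf("phase 2 starts\n");
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}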
Comparison Summary
| Feature | Semaphore | Barrier |
|---|---|---|
| Purpose | Resource management | Synchronization point |
| Coordination | Between threads accessing a resource | Among threads reaching a stage |
| Blocking | When resource is unavailable | Until all threads arrive |
| Reusability | Yes | Yes (especially cyclic barriers) |
Question:=2(d) Discuss the four scheduling schemes in OpenMP
What is Scheduling in OpenMP?
In OpenMP, scheduling determines how loop iterations are divided among threads in a
parallel region. The goal is to balance the workload and optimize performance.
The Four Main Scheduling Schemes
| Schedule Type | Description | Use Case |
|---|---|---|
| Static | Divides iterations into equal-sized chunks and assigns them to threads in a round-robin fashion before the loop runs. | Best when all iterations take roughly the same time. |
| Dynamic | Assigns chunks to threads at runtime as they become available. Threads request new chunks after finishing their current one. | Useful when iteration times vary significantly. |
| Guided | Similar to dynamic, but chunk sizes start large and decrease exponentially. | Balances load while reducing the overhead of frequent scheduling. |
| Auto | Leaves the decision to the compiler and runtime system. | When you trust the compiler to choose the best strategy. |
Example in OpenMP (C/C++)
c
#pragma omp parallel for schedule(static, 4)
for (int i = 0; i < 100; i++) {
// loop body
}
You can replace static with dynamic, guided, or auto and adjust the chunk size as needed.
Question:=2(e) Tabulate the differences between deadlock and livelock. Write the conditions to avoid data races.
Difference Between Deadlock and Livelock
| Feature | Deadlock 🛑 | Livelock 🔁 |
|---|---|---|
| Definition | A situation where two or more threads are blocked forever, waiting for each other. | A situation where threads keep changing state in response to each other but make no progress. |
| State | Threads are stuck and do not proceed. | Threads are active but not progressing. |
| Cause | Circular wait on resources. | Overreaction to avoid conflict or deadlock. |
| CPU Usage | Low or zero (threads are blocked). | High (threads are busy but ineffective). |
| Example | Thread A holds Lock 1, waits for Lock 2; Thread B holds Lock 2, waits for Lock 1. | Two threads repeatedly yielding to each other, trying to avoid conflict. |
Conditions to Avoid Data Races
A data race occurs when two or more threads access shared data concurrently, and at least
one access is a write, without proper synchronization.
To avoid data races:
1. Use Mutual Exclusion (Locks)
o Protect shared resources using mutexes or critical sections.
2. Use Atomic Operations
o For simple updates (e.g., counters), use atomic variables or operations.
3. Thread Synchronization
o Use barriers, condition variables, or semaphores to coordinate thread
execution.
4. Avoid Shared State
o Design with thread-local storage or message passing to eliminate shared data.
5. Immutable Data
o Use read-only data structures where possible to avoid concurrent writes.
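For example, a minimal POSIX-threads sketch of condition 1, protecting a shared counter with a mutex (names are illustrative):
c
#include <pthread.h>

long counter = 0;                                        // shared data
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

// Without the lock, concurrent increments would race; with it, each
// read-modify-write of the shared counter is mutually exclusive.
void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}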
Question:=3(a) Explain Amdahl's law and Gustafson's law in detail, with the limitations of each
Amdahl’s Law
Amdahl’s Law describes the theoretical maximum speedup of a program using multiple
processors, assuming a fixed workload.
Formula:
Speedup = 1 / ((1 − P) + P/N)
P: Proportion of the program that can be parallelized
N: Number of processors
Interpretation:
If only a small portion of a program is parallelizable, adding more processors yields
diminishing returns.
Even with infinite processors, speedup is limited by the serial portion.
Limitations:
Assumes a fixed problem size.
Ignores communication and synchronization overhead.
Not realistic for scalable systems where workload grows with resources.
Gustafson’s Law
Gustafson’s Law offers a more optimistic view by assuming that as more processors are
added, the problem size scales accordingly.
Formula:
Speedup = N − (1 − P)(N − 1)
P: Proportion of the program that can be parallelized
N: Number of processors
Interpretation:
As the number of processors increases, we can solve larger problems in the same
amount of time.
More realistic for high-performance computing and scientific simulations.
Limitations:
Assumes perfect scalability of the parallel portion.
May underestimate overhead from communication and memory contention.
Not suitable for applications with strict real-time constraints.
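A brief worked example (assuming P = 0.9 and N = 10) makes the contrast concrete:
Amdahl: Speedup = 1 / ((1 − 0.9) + 0.9/10) = 1 / 0.19 ≈ 5.3
Gustafson: Speedup = 10 − (1 − 0.9)(10 − 1) = 10 − 0.9 = 9.1
Amdahl's fixed-size view caps the speedup near 5×, while Gustafson's scaled-size view predicts close to 9×.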
Summary Comparison
| Feature | Amdahl’s Law 🧩 | Gustafson’s Law 🚀 |
|---|---|---|
| Assumes | Fixed problem size | Scalable problem size |
| Focus | Limits of parallelism | Benefits of scaling |
| Optimism | Conservative | Optimistic |
| Best for | Small-scale parallel systems | Large-scale, scalable systems |
Question:=3(b) What is a thread? Summarize the need for threads and how they communicate inside the OS
What is a Thread?
A thread is the smallest unit of execution within a process. It represents a single sequence of
instructions that can be scheduled and executed independently by the operating system.
A process can have one or more threads.
All threads within a process share the same memory space, file descriptors, and
resources.
Why Threads Are Needed
1. Concurrency: Threads allow multiple tasks to run seemingly at the same time,
improving responsiveness (e.g., UI + background tasks).
2. Resource Sharing: Threads within the same process can easily share data and
resources.
3. Efficiency: Creating and switching between threads is faster than between processes.
4. Scalability: Threads can take advantage of multi-core processors for parallel
execution.
How Threads Communicate in an OS
Since threads share the same address space, they can communicate through:
Shared Memory
Direct access to global variables or heap memory.
Requires synchronization mechanisms (e.g., mutexes, semaphores) to avoid race
conditions.
Thread Synchronization Tools
Mutexes: Ensure only one thread accesses a critical section at a time.
Condition Variables: Allow threads to wait for certain conditions to be true.
Semaphores: Control access to a limited number of resources.
Barriers: Synchronize multiple threads at a common point.
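As an illustration, a minimal POSIX-threads sketch of two threads communicating through shared memory with a mutex and a condition variable (the ready flag is just for illustration):
c
#include <pthread.h>
#include <stdio.h>

int ready = 0;                                        // shared state
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    ready = 1;                    // write shared data under the lock
    pthread_cond_signal(&cv);     // wake the waiting thread
    pthread_mutex_unlock(&m);
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    while (!ready)                // wait until the condition holds
        pthread_cond_wait(&cv, &m);
    printf("data is ready\n");
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}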
Question:=4(a)
discuss the challenges that we face while managing simultaneous activities
Challenges in Managing Simultaneous Activities (Concurrency)
Managing multiple activities or threads running at the same time introduces several
complexities:
1. Race Conditions
Occur when two or more threads access shared data at the same time, and the outcome
depends on the order of execution.
Can lead to unpredictable behavior and hard-to-reproduce bugs.
2. Deadlocks
Happen when two or more threads are waiting for each other to release resources,
causing all to freeze.
Typically caused by circular wait conditions.
3. Livelocks
Threads keep changing state in response to each other but make no progress.
Unlike deadlocks, threads are active but ineffective.
4. Starvation
A thread waits indefinitely because other threads are continuously given preference.
Often due to unfair scheduling or resource allocation.
5. Complex Synchronization
Coordinating access to shared resources requires careful use of locks, semaphores, or
other primitives.
Poor synchronization can lead to performance bottlenecks or bugs.
6. Performance Overhead
Context switching between threads consumes CPU time.
Excessive thread creation or poor load balancing can degrade performance.
7. Testing and Debugging Difficulty
Bugs in concurrent systems are often non-deterministic.
Hard to reproduce and diagnose issues like race conditions or timing bugs.
8. Resource Contention
Multiple threads competing for limited resources (CPU, memory, I/O) can lead to
delays and inefficiencies.
Question:=4(b) Discuss the error diffusion algorithm with C-language code.
What is Error Diffusion?
Error diffusion is a technique used in digital halftoning—converting a grayscale image into
a binary (black-and-white) image while preserving visual detail.
How it works:
For each pixel:
1. Compare the pixel value to a threshold (usually 128).
2. Set the pixel to black (0) or white (255).
3. Calculate the quantization error (original - new value).
4. Distribute the error to neighboring pixels that haven’t been processed yet.
The most common error diffusion method is the Floyd–Steinberg algorithm.
Floyd–Steinberg Error Diffusion Matrix
      X   7
  3   5   1
The error is distributed to neighboring pixels using these weights (divided by 16).
X is the current pixel.
C Code Example
#include <stdio.h>
#include <stdlib.h>

#define WIDTH 256
#define HEIGHT 256

// Clamp an intermediate value into the valid 0–255 pixel range to avoid
// wrap-around when the error is added to an unsigned char.
static unsigned char clamp(int value) {
    if (value < 0) return 0;
    if (value > 255) return 255;
    return (unsigned char)value;
}

void error_diffusion(unsigned char image[HEIGHT][WIDTH]) {
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int old_pixel = image[y][x];
            int new_pixel = old_pixel < 128 ? 0 : 255;   // threshold at 128
            image[y][x] = (unsigned char)new_pixel;
            int error = old_pixel - new_pixel;           // quantization error

            // Distribute the error to unprocessed neighbours (Floyd–Steinberg weights / 16)
            if (x + 1 < WIDTH)
                image[y][x + 1] = clamp(image[y][x + 1] + error * 7 / 16);
            if (x - 1 >= 0 && y + 1 < HEIGHT)
                image[y + 1][x - 1] = clamp(image[y + 1][x - 1] + error * 3 / 16);
            if (y + 1 < HEIGHT)
                image[y + 1][x] = clamp(image[y + 1][x] + error * 5 / 16);
            if (x + 1 < WIDTH && y + 1 < HEIGHT)
                image[y + 1][x + 1] = clamp(image[y + 1][x + 1] + error * 1 / 16);
        }
    }
}
Notes:
Input image should be in grayscale (0–255).
This function modifies the image in-place.
The clamp() helper guards against overflow/underflow when the error is added to neighbouring pixels.
Question:=5(a) Discuss threading APIs for the Microsoft .NET Framework
Threading APIs in Microsoft .NET Framework
The .NET Framework provides several powerful APIs for managing threads and
concurrency:
1. System.Threading.Thread
Low-level threading API.
Allows manual creation and control of threads.
Example:
csharp
Thread t = new Thread(() => Console.WriteLine("Hello from thread!"));
t.Start();
✅ Pros:
Full control over thread lifecycle.
❌ Cons:
More complex and error-prone.
2. ThreadPool (System.Threading.ThreadPool)
Manages a pool of worker threads.
Efficient for short-lived, background tasks.
Example:
csharp
ThreadPool.QueueUserWorkItem(state => Console.WriteLine("From thread pool"));
3. Task Parallel Library (TPL) – System.Threading.Tasks
Introduced in .NET 4.0.
Provides Task and Task<T> for easier and more scalable parallelism.
Example:
csharp
Task.Run(() => Console.WriteLine("Running in a task"));
✅ Pros:
Simplifies parallelism.
Supports continuations and cancellation.
4. async/await (Asynchronous Programming Model)
Built on top of TPL.
Simplifies asynchronous code using async and await keywords.
Example:
csharp
async Task MyMethodAsync() {
await Task.Delay(1000);
Console.WriteLine("Async done");
}
5. Parallel Class (System.Threading.Tasks.Parallel)
Provides parallel loops and invokes.
Example:
csharp
Parallel.For(0, 10, i => Console.WriteLine(i));
6. Synchronization Primitives
Includes Mutex, Monitor, Semaphore, AutoResetEvent, ManualResetEvent, and
ReaderWriterLockSlim.
Question:=5(b) Compare and contrast mutual exclusion (mutex) and locks
Mutex vs. Lock: Comparison Table
| Feature | Mutex (Mutual Exclusion) | Lock |
|---|---|---|
| Definition | A synchronization primitive that enforces mutual exclusion. | A general term for mechanisms that control access to shared resources. |
| Scope | Often refers to OS-level or system-provided constructs. | Can be implemented at language or library level. |
| Ownership | Typically has ownership—only the thread that locked it can unlock it. | May or may not enforce ownership depending on implementation. |
| Blocking Behavior | Blocks the thread until the mutex is available. | Can be blocking or non-blocking (e.g., try-lock). |
| Overhead | Slightly higher due to OS involvement. | Can be lighter if implemented in user space. |
| Examples | pthread_mutex_t in C, Mutex in .NET | lock keyword in C#, std::lock_guard in C++ |
| Use Case | Suitable for inter-process or inter-thread synchronization. | Typically used for intra-process thread synchronization. |
Summary
Mutex is a specific type of lock with strict ownership and often system-level support.
Lock is a broader concept that includes mutexes, spinlocks, read-write locks, etc.
Question:=6(a) Write a note on:
(1) OpenMP library functions
Commonly used OpenMP library functions are part of the OpenMP API for managing parallelism in C, C++, and Fortran:
Common OpenMP Library Functions
| Function | Description |
|---|---|
| omp_get_thread_num() | Returns the thread ID of the calling thread within a team. |
| omp_get_num_threads() | Returns the number of threads in the current team. |
| omp_get_max_threads() | Returns the maximum number of threads available. |
| omp_get_num_procs() | Returns the number of processors available to the program. |
| omp_in_parallel() | Returns non-zero if the code is executing in a parallel region. |
| omp_set_num_threads(int n) | Sets the number of threads to use in the next parallel region. |
| omp_get_wtime() | Returns the elapsed wall clock time (used for timing). |
| omp_get_wtick() | Returns the resolution of omp_get_wtime(). |
| omp_set_dynamic(int flag) | Enables or disables dynamic adjustment of the number of threads. |
| omp_get_dynamic() | Returns whether dynamic adjustment is enabled. |
| omp_set_nested(int flag) | Enables or disables nested parallelism. |
| omp_get_nested() | Returns whether nested parallelism is enabled. |
Example in C
c
#include <omp.h>
#include <stdio.h>
int main() {
omp_set_num_threads(4);
#pragma omp parallel
{
int tid = omp_get_thread_num();
printf("Hello from thread %d\n", tid);
}
return 0;
}
(2) OpenMP environment variables
Common OpenMP Environment Variables
| Variable | Description |
|---|---|
| OMP_NUM_THREADS | Sets the number of threads to use in parallel regions. |
| OMP_DYNAMIC | Enables (TRUE) or disables (FALSE) dynamic adjustment of the number of threads. |
| OMP_NESTED | Enables (TRUE) or disables (FALSE) nested parallel regions. |
| OMP_SCHEDULE | Sets the scheduling policy and chunk size for loops (e.g., static,4 or dynamic). |
| OMP_PROC_BIND | Controls whether threads are bound to processors (TRUE, FALSE, or spread, close, master). |
| OMP_PLACES | Specifies the places (logical processors) where threads can be scheduled. |
| OMP_STACKSIZE | Sets the stack size for threads. |
| OMP_WAIT_POLICY | Defines the behavior of waiting threads (ACTIVE or PASSIVE). |
| OMP_MAX_ACTIVE_LEVELS | Sets the maximum number of nested active parallel regions. |
| OMP_THREAD_LIMIT | Limits the total number of threads used by the program. |
Example Usage (Linux/macOS Shell)
bash
export OMP_NUM_THREADS=8
export OMP_SCHEDULE="dynamic,2"
These variables can be set before running your OpenMP program to control its execution
behavior.
Question:=6(b) generalize on how data and functional parallelism are
handled in shared memory programming with OpenMP
1. Data Parallelism in OpenMP
Data parallelism involves performing the same operation on different parts of a data set
simultaneously.
How OpenMP handles it:
Uses #pragma omp parallel for to divide loop iterations among threads.
Each thread processes a chunk of the data independently.
Example:
c
#pragma omp parallel for
for (int i = 0; i < N; i++) {
A[i] = B[i] + C[i];
}
Threads share memory, so no need to explicitly pass data.
Synchronization is minimal if each thread works on separate data.
2. Functional Parallelism in OpenMP
Functional parallelism (also called task parallelism) involves executing different tasks or
functions in parallel.
How OpenMP handles it:
Uses #pragma omp parallel sections to run independent code blocks
concurrently.
Each section can perform a different function.
Example:
c
#pragma omp parallel sections
{
#pragma omp section
{
process_audio();
}
#pragma omp section
{
process_video();
}
}
Useful when tasks are logically independent.
Threads still share memory, so communication is easy.
Summary
| Aspect | Data Parallelism | Functional Parallelism |
|---|---|---|
| Focus | Splitting data across threads | Splitting tasks/functions across threads |
| OpenMP Construct | #pragma omp parallel for | #pragma omp parallel sections |
| Use Case | Array processing, matrix operations | Multimedia pipelines, I/O + computation |
| Memory Model | Shared memory simplifies data access | Shared memory simplifies coordination |
Question:=7(a) Discuss the current IA-32 architecture. Also state the methods to avoid pipeline stalls on IA-32
IA-32 Architecture Overview
IA-32 (Intel Architecture, 32-bit) is the 32-bit version of the x86 instruction set architecture,
introduced with the Intel 80386 processor. It became the foundation for many generations of
Intel and compatible CPUs.
Key Features:
32-bit Registers: General-purpose registers like EAX, EBX, ECX, etc., support 32-
bit operations.
Segmented Memory Model: Supports both segmented and flat memory models.
Paging Support: Enables virtual memory through 4 KB pages.
Instruction-Level Parallelism: Later IA-32 implementations (e.g., Pentium) support
superscalar execution with multiple pipelines.
Floating Point Unit (FPU): Integrated FPU with its own registers and execution
units.
Caches: Separate instruction and data caches (Harvard architecture style).
Out-of-Order Execution: Instructions can be executed out of order and retired in
order using a Reorder Buffer (ROB).
Branch Prediction: Helps reduce control hazards in pipelined execution.
Avoiding Pipeline Stalls in IA-32
Pipeline stalls occur when the CPU pipeline cannot proceed with the next instruction due to
hazards. Here are methods to minimize them:
1. Out-of-Order Execution
Executes instructions as operands become available, not strictly in program order.
Helps bypass data hazards and improves instruction throughput.
2. Instruction Reordering by Compiler
Compilers can rearrange instructions to avoid dependencies and fill delay slots.
3. Register Renaming
Eliminates false dependencies (Write After Write, Write After Read) by mapping
logical registers to physical ones.
4. Forwarding (Data Bypassing)
Passes results directly from one pipeline stage to another without writing to registers
first.
5. Branch Prediction
Predicts the outcome of conditional branches to avoid control hazards.
Mispredictions can cause stalls, so accurate predictors are crucial.
6. Speculative Execution
Executes instructions before knowing if they are needed, rolling back if the
speculation was incorrect.
7. Hardware Prefetching
Loads data into cache before it’s needed to reduce memory access latency.
Question:=7(b) Define deadlock. Write the conditions under which a deadlock situation may arise. Also discuss synchronization primitives and their challenges in parallel programs.
What is a Deadlock?
A deadlock is a situation in concurrent programming where a group of threads or processes
are each waiting for resources held by the others, and none can proceed. It results in a
complete standstill of execution.
Conditions for Deadlock (Coffman’s Conditions)
A deadlock can occur if all the following four conditions hold simultaneously:
| Condition | Description |
|---|---|
| Mutual Exclusion | At least one resource must be held in a non-shareable mode. |
| Hold and Wait | A process holding at least one resource is waiting to acquire additional resources held by others. |
| No Preemption | Resources cannot be forcibly taken away; they must be released voluntarily. |
| Circular Wait | A set of processes are waiting for each other in a circular chain. |
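Breaking the circular-wait condition is a common avoidance strategy; the minimal POSIX-threads sketch below (lock names are illustrative) always acquires the two locks in the same global order, so no cycle can form.
c
#include <pthread.h>

pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;

// Every thread acquires lock1 before lock2, so a circular wait cannot arise.
void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);
    /* ... critical section using both resources ... */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}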
🧵 Synchronization Primitives in Parallel Programming
Synchronization primitives are tools used to coordinate access to shared resources and avoid
issues like race conditions and deadlocks.
Common Synchronization Primitives:
| Primitive | Description | Use Case |
|---|---|---|
| Mutex (Mutual Exclusion) | Ensures only one thread accesses a critical section at a time. | Protecting shared data. |
| Semaphore | A counter-based lock that allows a limited number of threads to access a resource. | Managing resource pools. |
| Spinlock | A lock where threads wait in a loop ("spin") until the lock becomes available. | Low-latency locking on multi-core systems. |
| Condition Variable | Allows threads to wait for certain conditions to be met. | Thread coordination (e.g., producer-consumer). |
| Barrier | Synchronizes a group of threads at a specific point. | Parallel algorithms with phases. |
| Read-Write Lock | Allows multiple readers or one writer. | Optimizing read-heavy workloads. |
Challenges in Using Synchronization Primitives
Deadlocks: Poor lock ordering or resource management can cause circular waits.
Livelocks: Threads keep reacting to each other but make no progress.
Starvation: Some threads may never acquire the lock if others dominate access.
Performance Overhead: Excessive locking can reduce parallel efficiency.