Multi-Core Architecture and Programming PYQ

2018-19 PYQ

Question:=1(a) Discuss the motivation for concurrency in software.

1. Performance and Responsiveness

 Improved throughput: Concurrency allows multiple tasks to be executed in
overlapping time periods, increasing the overall work done.
 Responsiveness in UI applications: In interactive systems (like mobile or desktop
apps), concurrency ensures the user interface remains responsive while background
tasks (e.g., file downloads, data processing) run simultaneously.

2. Efficient Resource Utilization

 CPU utilization: Modern CPUs have multiple cores. Concurrency enables software
to leverage these cores effectively, maximizing hardware potential.
 I/O waiting: While waiting for I/O operations (like disk or network access),
concurrent systems can switch to other tasks instead of idling.

3. Real-Time and Asynchronous Processing

 Real-time systems: Applications like robotics, gaming, or embedded systems require
tasks to run concurrently to meet timing constraints.
 Asynchronous workflows: Web servers and cloud applications often handle
thousands of requests concurrently, improving scalability and user experience.

4. Modularity and Separation of Concerns

 Decoupling components: Concurrency allows different parts of a system (e.g., data
collection, processing, and logging) to operate independently, improving
maintainability and clarity.

5. Scalability in Distributed Systems

 Cloud and microservices: Concurrency is fundamental in distributed architectures,
where services run in parallel across machines or containers to handle large-scale
workloads.

Question:=1(b) Differentiate between symmetric memory architecture and distributed memory architecture.

Symmetric Memory Architecture vs. Distributed Memory Architecture:


Feature                  Symmetric Memory Architecture (SMA)        Distributed Memory Architecture (DMA)
Memory Access            All processors share a single, global      Each processor has its own private
                         memory space                               memory
Communication            Implicit via shared memory                 Explicit via message passing
Scalability              Limited scalability due to memory          Highly scalable across many nodes
                         contention
Latency                  Lower latency for memory access            Higher latency due to network
                                                                    communication
Programming Complexity   Easier to program (shared memory model)    More complex (requires explicit
                                                                    communication)
Examples                 Multi-core desktops, SMP systems           Clusters, supercomputers, MPI-based
                                                                    systems
Fault Tolerance          Less fault-tolerant (shared memory is a    More fault-tolerant (nodes are
                         single point of failure)                   independent)

Summary

 SMA is great for small to medium-scale systems where ease of programming and
low-latency access are important.
 DMA shines in large-scale, distributed environments where scalability and fault
tolerance are critical.

Question:=1(c) What do you understand by task decomposition and data decomposition?

Task Decomposition (Functional Decomposition)

This involves breaking a problem into distinct tasks or functions, each performing a specific
part of the overall computation.

 Focus: What needs to be done.
 Example: In a web server, one task handles incoming requests, another processes
data, and another sends responses.
 Use Case: Ideal when different operations can be performed independently or in
parallel.
 Goal: Maximize concurrency by identifying independent or loosely coupled tasks.

Data Decomposition (Domain Decomposition)

This involves dividing the data into chunks and performing the same operation on each
chunk in parallel.

 Focus: The data to be processed and how it is partitioned.
 Example: In image processing, splitting an image into sections and applying a filter
to each section simultaneously.
 Use Case: Best when the same computation is applied to large datasets.
 Goal: Improve performance by distributing data across multiple processors.

Combined Use

In real-world applications, both techniques are often used together:

 Task decomposition handles different stages of a pipeline.
 Data decomposition speeds up each stage by parallelizing data processing, as sketched below.
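
A minimal OpenMP sketch of the combined approach (a hedged illustration; the function, array names, and sizes are invented for this example, not taken from any particular application):

c
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    int a[N], b[N];

    /* Data decomposition: the same operation applied to different
     * chunks of the array, split across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * i;

    /* Task decomposition: two independent tasks run concurrently. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < N; i++) b[i] = a[i] + 1; }            /* task 1 */
        #pragma omp section
        { printf("logging on thread %d\n", omp_get_thread_num()); } /* task 2 */
    }
    return 0;
}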

Question:=1(d) Discuss the two atomic operations performed on a "lock".

1. Acquire (or Lock)

 Purpose: To gain exclusive access to a shared resource.
 Behavior:
o If the lock is free, the thread acquires it and proceeds.
o If the lock is already held, the thread is blocked (or spins) until the lock
becomes available.
 Atomicity: The check-and-set operation must be atomic to prevent race conditions.
This is often implemented using hardware instructions like Test-and-Set,
Compare-and-Swap (CAS), or Load-Link/Store-Conditional (LL/SC).

2. Release (or Unlock)

 Purpose: To relinquish control of the lock so that other threads can acquire it.
 Behavior:
o The lock is marked as available.
o If other threads are waiting, one may be awakened to acquire the lock.
 Atomicity: Ensures that the lock state is updated without interference from other
threads.

Why Atomicity Matters

Without atomic operations, two threads could simultaneously believe they’ve acquired the
lock, leading to data corruption or undefined behavior. Atomic instructions ensure that
lock acquisition and release are indivisible, preserving correctness in concurrent
environments.
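
As a concrete illustration, here is a minimal spinlock sketch built on C11's atomic_flag (a teaching example only: no backoff, fairness, or error handling):

c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

/* Acquire: atomically test-and-set the flag; spin while it was already set. */
void lock_acquire(void) {
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* busy-wait until the holder releases */
}

/* Release: atomically clear the flag so one waiting thread can proceed. */
void lock_release(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}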

Question:=1(e) Define convoying.

General Definition

Convoying refers to the act of accompanying or escorting a group of vehicles, ships, or
people—typically for protection or coordination.

 Example: Military vehicles convoying supply trucks through a conflict zone.
 Usage: Common in military, humanitarian, and transportation contexts.

According to the Cambridge Dictionary:


“To travel with a vehicle or group of people to make certain that they arrive safely.”

In Computing (Contextual Note)

In computer science, particularly in concurrent systems, convoying has a more specific
meaning:

🧱 Convoying in Concurrency

 Definition: A performance issue where a slow thread holding a lock causes a queue
of waiting threads to build up behind it.
 Effect: Even fast threads are delayed, leading to reduced system throughput.
 Cause: Often due to poor lock management or thread scheduling.

🔄 Example:

If Thread A holds a lock and is preempted or runs slowly, Threads B, C, and D must wait—
even if they could have completed their tasks quickly.

Question:=1(f) Classify the synchronization primitives.

1. Mutual Exclusion Primitives

These ensure that only one thread accesses a critical section at a time.

 Mutex (Mutual Exclusion Lock): Basic lock that allows only one thread to enter a
critical section.
 Spinlock: A lock where threads continuously check (spin) until the lock becomes
available.
 Binary Semaphore: A semaphore with only two states (0 and 1), often used like a
mutex.

2. Signaling Primitives

Used for communication between threads—one thread signals another to proceed.

 Semaphore: A counter-based signaling mechanism. Can be:
o Binary Semaphore (acts like a mutex)
o Counting Semaphore (allows multiple threads to access a resource)
 Condition Variable: Allows threads to wait for certain conditions to be true.
 Event: Used to signal one or more threads that an event has occurred.

3. Barriers

Used to synchronize a group of threads at a specific point.

 Barrier: All threads must reach the barrier before any can proceed.
 Cyclic Barrier: A reusable barrier that resets after all threads reach it.

4. Read/Write Locks

Allow multiple readers or one writer at a time.

 Reader-Writer Lock: Optimizes access when reads are more frequent than writes.

5. Atomic Operations

Low-level primitives that perform operations atomically without locks.

 Compare-and-Swap (CAS)
 Fetch-and-Add
 Test-and-Set
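
A small C11 sketch of one of these primitives in action: a lock-free counter using fetch-and-add (illustrative only):

c
#include <stdatomic.h>
#include <stdio.h>

static atomic_int counter = 0;

/* Safe to call from many threads at once: the increment is one
 * indivisible fetch-and-add, so no lock is needed. */
void hit(void) {
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

int main(void) {
    hit();
    printf("%d\n", atomic_load(&counter));  /* prints 1 */
    return 0;
}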

Question:=1(g) How can thread overhead be minimized?

1. Thread Pooling

 What it is: Reusing a fixed number of threads to handle multiple tasks.
 Why it helps: Avoids the overhead of frequent thread creation and destruction.
 Example: ExecutorService in Java, ThreadPoolExecutor in Python.

2. Use Lightweight Threads or Coroutines

 What it is: Use user-space threads (like coroutines or fibers) instead of OS threads.
 Why it helps: They have lower context-switching overhead and are more scalable.
 Example: async/await in Python, Kotlin coroutines, Go goroutines.

3. Reduce Context Switching

 What it is: Minimize the number of times the CPU switches between threads.
 Why it helps: Context switching is expensive due to saving/restoring thread states.
 How:
o Reduce the number of active threads.
o Avoid unnecessary blocking.
o Use CPU affinity to keep threads on the same core.

4. Efficient Synchronization

 What it is: Use fine-grained or lock-free synchronization mechanisms.
 Why it helps: Reduces contention and waiting time between threads.
 How:
o Use atomic operations.
o Prefer concurrent data structures.
o Avoid holding locks longer than necessary.

5. Batching and Task Granularity


 What it is: Combine small tasks into larger ones to reduce scheduling overhead.
 Why it helps: Fewer tasks mean fewer thread switches and less overhead.

6. Profile and Tune

 What it is: Use profiling tools to identify bottlenecks and optimize thread usage.
 Why it helps: Helps you make data-driven decisions about thread management.

Question:=2(a) Illustrate Flynn's classification in detail with a neat and clean diagram.

Flynn's Classification Overview

Category  Full Form                            Instruction Stream  Data Stream  Example
SISD      Single Instruction, Single Data      1                   1            Traditional single-core CPU
SIMD      Single Instruction, Multiple Data    1                   Many         GPUs, vector processors
MISD      Multiple Instruction, Single Data    Many                1            Rare; used in fault-tolerant systems
MIMD      Multiple Instruction, Multiple Data  Many                Many         Multi-core CPUs, distributed systems

Diagram of Flynn's Classification

+----------------+--------------------+--------------------+
| Category | Instruction Stream | Data Stream |
+----------------+--------------------+--------------------+
| SISD | → | → |
| (Single Instr, | One instruction | One data element |
| Single Data) | stream | processed |
+----------------+--------------------+--------------------+
| SIMD | → | → → → |
| (Single Instr, | One instruction | Multiple data |
| Multiple Data)| stream | elements |
+----------------+--------------------+--------------------+
| MISD | → → → | → |
| (Multiple Instr| Multiple instr. | One data stream |
| Single Data) | streams | processed |
+----------------+--------------------+--------------------+
| MIMD | → → → | → → → |
| (Multiple Instr| Multiple instr. | Multiple data |
| Multiple Data)| streams | streams |
+----------------+--------------------+--------------------+
Each arrow (→) represents a stream. This layout helps visualize how instruction and data
streams vary across the four categories.

Question:=2(b) Write a note on data flow decomposition and its implications.

Data Flow Decomposition: A Note

Data flow decomposition is a parallel programming strategy where a problem is broken
down based on the flow of data between operations or components. Instead of focusing on
tasks or data chunks, this approach emphasizes how data moves through a sequence of
transformations.

Key Concepts

 Pipeline structure: Computation is organized as a series of stages, where the output
of one stage becomes the input of the next.
 Concurrency through flow: Each stage can be executed concurrently as soon as its
input data is available.
 Streaming model: Often used in systems that process continuous streams of data
(e.g., video processing, signal processing).

Implications of Data Flow Decomposition

Advantages

 Natural parallelism: Each stage can run in parallel, improving throughput.
 Modularity: Each stage is typically a self-contained unit, making the system easier to
understand and maintain.
 Scalability: Pipelines can be scaled by replicating stages or distributing them across
processors.

Challenges

 Load balancing: If one stage is slower than others, it becomes a bottleneck.
 Data dependencies: Complex dependencies between stages can limit parallelism.
 Debugging difficulty: Tracing data through asynchronous stages can be harder than
in sequential code.

Real-World Examples

 Compiler design: Lexical analysis → Parsing → Semantic analysis → Code generation.
 Multimedia processing: Decode → Filter → Encode.
 Big data pipelines: Ingest → Transform → Analyze → Store (e.g., in Apache Spark
or Flink).

Question:=2(c) Generalize on semaphores and barriers.

Semaphores

A semaphore is a signaling mechanism used to control access to a shared resource by
multiple threads.

Key Characteristics:

 Maintains a counter representing the number of available resources.
 Two main operations:
o wait() or P(): Decrements the counter. If the counter is zero, the thread is
blocked.
o signal() or V(): Increments the counter and potentially wakes a waiting
thread.
 Can be:
o Binary Semaphore (value is 0 or 1): Acts like a mutex.
o Counting Semaphore (value ≥ 0): Allows multiple threads to access a limited
number of resources.

Use Cases:

 Controlling access to a pool of resources (e.g., database connections).
 Implementing producer-consumer problems. A short POSIX sketch follows.
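
A minimal POSIX semaphore sketch guarding a pool of three resources (a hedged example: the pool size and names are invented, and error handling is omitted):

c
#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static sem_t pool;              /* counts available resources */

void *worker(void *arg) {
    sem_wait(&pool);            /* P(): blocks if no resource is free */
    printf("using a resource\n");
    sem_post(&pool);            /* V(): returns the resource */
    return NULL;
}

int main(void) {
    pthread_t t[5];
    sem_init(&pool, 0, 3);      /* three resources available initially */
    for (int i = 0; i < 5; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 5; i++) pthread_join(t[i], NULL);
    sem_destroy(&pool);
    return 0;
}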

Barriers

A barrier is a synchronization point where multiple threads or processes must wait until all
have reached the barrier before any can proceed.

Key Characteristics:

 Ensures that all threads reach a certain point before continuing.
 Often used in parallel algorithms where phases must be synchronized.
 Can be:
o One-time barrier: Used once in the program.
o Cyclic barrier: Reusable across multiple synchronization points.

Use Cases:

 Parallel matrix computations.
 Multi-phase simulations where each phase depends on the completion of the previous
one by all threads, as in the sketch below.
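
A corresponding pthread barrier sketch (assumptions: four threads and trivial "phase" work, purely for illustration):

c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
static pthread_barrier_t phase;

void *worker(void *arg) {
    printf("phase 1 work\n");
    pthread_barrier_wait(&phase);   /* no thread starts phase 2 early */
    printf("phase 2 work\n");
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&phase, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    pthread_barrier_destroy(&phase);
    return 0;
}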

Comparison Summary

Feature       Semaphore                              Barrier
Purpose       Resource management                    Synchronization point
Coordination  Between threads accessing a resource   Among threads reaching a stage
Blocking      When resource is unavailable           Until all threads arrive
Reusability   Yes                                    Yes (especially cyclic barriers)

Question:=2(d) Discuss the four scheduling schemes in OpenMP.

What is Scheduling in OpenMP?

In OpenMP, scheduling determines how loop iterations are divided among threads in a
parallel region. The goal is to balance the workload and optimize performance.

The Four Main Scheduling Schemes

Schedule Type  Description                                           Use Case
Static         Divides iterations into equal-sized chunks and        Best when all iterations take
               assigns them to threads in round-robin order          roughly the same time.
               before the loop executes.
Dynamic        Assigns chunks to threads at runtime as they          Useful when iteration times
               become available; threads request new chunks          vary significantly.
               after finishing their current one.
Guided         Similar to dynamic, but chunk sizes start large       Balances load while reducing
               and decrease exponentially.                           the overhead of frequent
                                                                     scheduling.
Auto           Leaves the decision to the compiler and runtime       When you trust the compiler
               system.                                               and runtime to choose the
                                                                     best strategy.

Example in OpenMP (C/C++)

c
#pragma omp parallel for schedule(static, 4)
for (int i = 0; i < 100; i++) {
    // loop body
}

You can replace static with dynamic, guided, or auto and adjust the chunk size as needed.
Question:=2(e) Tabulate the differences between deadlocks and livelocks. Write the conditions to avoid data races.

Difference Between Deadlock and Livelock

Feature     Deadlock 🛑                                    Livelock 🔁
Definition  Two or more threads are blocked forever,       Threads keep changing state in response
            each waiting for the other.                    to each other but make no progress.
State       Threads are stuck and do not proceed.          Threads are active but not progressing.
Cause       Circular wait on resources.                    Overreaction to avoid conflict or deadlock.
CPU Usage   Low or zero (threads are blocked).             High (threads are busy but ineffective).
Example     Thread A holds Lock 1 and waits for Lock 2;    Two threads repeatedly yielding to each
            Thread B holds Lock 2 and waits for Lock 1.    other, trying to avoid conflict.

Conditions to Avoid Data Races

A data race occurs when two or more threads access shared data concurrently, and at least
one access is a write, without proper synchronization.

To avoid data races:

1. Use Mutual Exclusion (Locks)
o Protect shared resources using mutexes or critical sections.
2. Use Atomic Operations
o For simple updates (e.g., counters), use atomic variables or operations.
3. Thread Synchronization
o Use barriers, condition variables, or semaphores to coordinate thread
execution.
4. Avoid Shared State
o Design with thread-local storage or message passing to eliminate shared data.
5. Immutable Data
o Use read-only data structures where possible to avoid concurrent writes.
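
A small C sketch contrasting the first two techniques, a mutex-protected update and an atomic one (both race-free; the variable names are invented for illustration):

c
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long balance = 0;        /* shared state, protected by the mutex */
static atomic_long hits = 0;    /* shared state, protected by atomicity */

void deposit(long amount) {
    pthread_mutex_lock(&m);     /* mutual exclusion around the update */
    balance += amount;          /* read-modify-write is now race-free */
    pthread_mutex_unlock(&m);
}

void count_hit(void) {
    atomic_fetch_add(&hits, 1); /* one indivisible atomic operation */
}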

Question:=3(a) Explain Amdahl's law and Gustafson's law in detail, with the limitations of each.

Amdahl’s Law

Amdahl’s Law describes the theoretical maximum speedup of a program using multiple
processors, assuming a fixed workload.

Formula:
Speedup = 1 / ((1 − P) + P/N)

 P: Proportion of the program that can be parallelized
 N: Number of processors

Interpretation:

 If only a small portion of a program is parallelizable, adding more processors yields
diminishing returns.
 Even with infinite processors, speedup is limited by the serial portion.
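 Worked example (assumed values): with P = 0.9 and N = 10, Speedup = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.26; even as N → ∞, the speedup cannot exceed 1 / (1 − P) = 10.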

Limitations:

 Assumes a fixed problem size.
 Ignores communication and synchronization overhead.
 Not realistic for scalable systems where workload grows with resources.

Gustafson’s Law

Gustafson’s Law offers a more optimistic view by assuming that as more processors are
added, the problem size scales accordingly.

Formula:
Speedup = N − (1 − P)(N − 1)

 P: Proportion of the program that can be parallelized
 N: Number of processors

Interpretation:

 As the number of processors increases, we can solve larger problems in the same
amount of time.
 More realistic for high-performance computing and scientific simulations.
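 Worked example (same assumed values): with P = 0.9 and N = 10, Speedup = 10 − (0.1)(9) = 9.1, reflecting the larger problem solved in the same time.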

Limitations:

 Assumes perfect scalability of the parallel portion.
 May underestimate overhead from communication and memory contention.
 Not suitable for applications with strict real-time constraints.

Summary Comparison
Feature   Amdahl’s Law 🧩               Gustafson’s Law 🚀
Assumes   Fixed problem size            Scalable problem size
Focus     Limits of parallelism         Benefits of scaling
Optimism  Conservative                  Optimistic
Best for  Small-scale parallel systems  Large-scale, scalable systems

Question:=3(b) What is a thread? Summarize the need for threads and how threads communicate inside an OS.

What is a Thread?

A thread is the smallest unit of execution within a process. It represents a single sequence of
instructions that can be scheduled and executed independently by the operating system.

 A process can have one or more threads.
 All threads within a process share the same memory space, file descriptors, and
resources.

Why Threads Are Needed

1. Concurrency: Threads allow multiple tasks to run seemingly at the same time,
improving responsiveness (e.g., UI + background tasks).
2. Resource Sharing: Threads within the same process can easily share data and
resources.
3. Efficiency: Creating and switching between threads is faster than between processes.
4. Scalability: Threads can take advantage of multi-core processors for parallel
execution.

How Threads Communicate in an OS

Since threads share the same address space, they can communicate through:

Shared Memory

 Direct access to global variables or heap memory.
 Requires synchronization mechanisms (e.g., mutexes, semaphores) to avoid race
conditions.
Thread Synchronization Tools

 Mutexes: Ensure only one thread accesses a critical section at a time.
 Condition Variables: Allow threads to wait for certain conditions to be true.
 Semaphores: Control access to a limited number of resources.
 Barriers: Synchronize multiple threads at a common point.
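
A minimal pthread sketch of threads communicating through shared memory with a condition variable (a one-slot producer-consumer; the names and single-slot design are illustrative assumptions):

c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;
static int data, has_data = 0;    /* shared between threads */

void produce(int value) {
    pthread_mutex_lock(&m);
    data = value;                 /* communicate via shared memory */
    has_data = 1;
    pthread_cond_signal(&ready);  /* wake a waiting consumer */
    pthread_mutex_unlock(&m);
}

int consume(void) {
    pthread_mutex_lock(&m);
    while (!has_data)             /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready, &m);
    int value = data;
    has_data = 0;
    pthread_mutex_unlock(&m);
    return value;
}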

Question:=4(a) Discuss the challenges we face while managing simultaneous activities.

Challenges in Managing Simultaneous Activities (Concurrency)

Managing multiple activities or threads running at the same time introduces several
complexities:

1. Race Conditions

 Occur when two or more threads access shared data at the same time, and the outcome
depends on the order of execution.
 Can lead to unpredictable behavior and hard-to-reproduce bugs.

2. Deadlocks

 Happen when two or more threads are waiting for each other to release resources,
causing all to freeze.
 Typically caused by circular wait conditions.

3. Livelocks

 Threads keep changing state in response to each other but make no progress.
 Unlike deadlocks, threads are active but ineffective.

4. Starvation

 A thread waits indefinitely because other threads are continuously given preference.
 Often due to unfair scheduling or resource allocation.

5. Complex Synchronization

 Coordinating access to shared resources requires careful use of locks, semaphores, or
other primitives.
 Poor synchronization can lead to performance bottlenecks or bugs.

6. Performance Overhead
 Context switching between threads consumes CPU time.
 Excessive thread creation or poor load balancing can degrade performance.

7. Testing and Debugging Difficulty

 Bugs in concurrent systems are often non-deterministic.
 Hard to reproduce and diagnose issues like race conditions or timing bugs.

8. Resource Contention

 Multiple threads competing for limited resources (CPU, memory, I/O) can lead to
delays and inefficiencies.

Question:=4(b) Discuss the error diffusion algorithm with C-language code.

What is Error Diffusion?

Error diffusion is a technique used in digital halftoning—converting a grayscale image into
a binary (black-and-white) image while preserving visual detail.

How it works:

 For each pixel:
1. Compare the pixel value to a threshold (usually 128).
2. Set the pixel to black (0) or white (255).
3. Calculate the quantization error (original - new value).
4. Distribute the error to neighboring pixels that haven’t been processed yet.

The most common error diffusion method is the Floyd–Steinberg algorithm.

Floyd–Steinberg Error Diffusion Matrix


      X   7
  3   5   1

 The error is distributed to neighboring pixels using these weights (divided by 16).
 X is the current pixel.

C Code Example

#include <stdio.h>
#include <stdlib.h>

#define WIDTH 256
#define HEIGHT 256

/* Clamp a value into the valid pixel range [0, 255]. */
static int clamp(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

void error_diffusion(unsigned char image[HEIGHT][WIDTH]) {
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int old_pixel = image[y][x];
            int new_pixel = old_pixel < 128 ? 0 : 255;  /* threshold */
            image[y][x] = (unsigned char)new_pixel;
            int error = old_pixel - new_pixel;          /* quantization error */

            /* Distribute the error to unprocessed neighbours using the
             * Floyd–Steinberg weights (x/16), clamping to avoid wrap-around. */
            if (x + 1 < WIDTH)
                image[y][x + 1] = clamp(image[y][x + 1] + error * 7 / 16);
            if (x - 1 >= 0 && y + 1 < HEIGHT)
                image[y + 1][x - 1] = clamp(image[y + 1][x - 1] + error * 3 / 16);
            if (y + 1 < HEIGHT)
                image[y + 1][x] = clamp(image[y + 1][x] + error * 5 / 16);
            if (x + 1 < WIDTH && y + 1 < HEIGHT)
                image[y + 1][x + 1] = clamp(image[y + 1][x + 1] + error * 1 / 16);
        }
    }
}

Notes:

 Input image should be in grayscale (0–255).
 This function modifies the image in-place.
 The clamp() helper prevents overflow/underflow when the error is added to neighbours.

Question:=5(a) Discuss threading APIs for the Microsoft .NET Framework.

Threading APIs in Microsoft .NET Framework

The .NET Framework provides several powerful APIs for managing threads and
concurrency:

1. System.Threading.Thread

 Low-level threading API.
 Allows manual creation and control of threads.
 Example:

csharp

Thread t = new Thread(() => Console.WriteLine("Hello from thread!"));
t.Start();

✅ Pros:

 Full control over thread lifecycle.

❌ Cons:

 More complex and error-prone.


2. ThreadPool (System.Threading.ThreadPool)

 Manages a pool of worker threads.
 Efficient for short-lived, background tasks.
 Example:

csharp

ThreadPool.QueueUserWorkItem(state => Console.WriteLine("From thread pool"));

3. Task Parallel Library (TPL) – System.Threading.Tasks

 Introduced in .NET 4.0.
 Provides Task and Task<T> for easier and more scalable parallelism.
 Example:

csharp

Task.Run(() => Console.WriteLine("Running in a task"));

✅ Pros:

 Simplifies parallelism.
 Supports continuations and cancellation.

4. async/await (Asynchronous Programming Model)

 Built on top of TPL.
 Simplifies asynchronous code using async and await keywords.
 Example:

csharp

async Task MyMethodAsync() {
    await Task.Delay(1000);
    Console.WriteLine("Async done");
}

5. Parallel Class (System.Threading.Tasks.Parallel)

 Provides parallel loops (Parallel.For, Parallel.ForEach) and Parallel.Invoke.
 Example:

csharp

Parallel.For(0, 10, i => Console.WriteLine(i));

6. Synchronization Primitives
 Includes Mutex, Monitor, Semaphore, AutoResetEvent, ManualResetEvent, and
ReaderWriterLockSlim.

Question:=5(b) Compare and contrast mutual exclusion (mutex) and locks.

Mutex vs. Lock: Comparison Table

Feature            Mutex (Mutual Exclusion)                       Lock
Definition         A synchronization primitive that enforces      A general term for mechanisms that control
                   mutual exclusion.                              access to shared resources.
Scope              Often refers to OS-level or system-provided    Can be implemented at language or library
                   constructs.                                    level.
Ownership          Typically has ownership: only the thread       May or may not enforce ownership,
                   that locked it can unlock it.                  depending on implementation.
Blocking Behavior  Blocks the thread until the mutex is           Can be blocking or non-blocking (e.g.,
                   available.                                     try-lock).
Overhead           Slightly higher due to OS involvement.         Can be lighter if implemented in user space.
Examples           pthread_mutex_t in C, Mutex in .NET            lock keyword in C#, std::lock_guard in C++
Use Case           Suitable for inter-process or inter-thread     Typically used for intra-process thread
                   synchronization.                               synchronization.

Summary

 Mutex is a specific type of lock with strict ownership and often system-level support.
 Lock is a broader concept that includes mutexes, spinlocks, read-write locks, etc.

Question:=6(a) Write a note on:

(1) OpenMP library functions

OpenMP provides a set of library functions as part of the OpenMP API for managing
parallelism in C, C++, and Fortran. Commonly used ones are listed below.

Common OpenMP Library Functions


Function                    Description
omp_get_thread_num()        Returns the thread ID of the calling thread within a team.
omp_get_num_threads()       Returns the number of threads in the current team.
omp_get_max_threads()       Returns the maximum number of threads available.
omp_get_num_procs()         Returns the number of processors available to the program.
omp_in_parallel()           Returns non-zero if the code is executing in a parallel region.
omp_set_num_threads(int n)  Sets the number of threads to use in the next parallel region.
omp_get_wtime()             Returns the elapsed wall clock time (used for timing).
omp_get_wtick()             Returns the resolution of omp_get_wtime().
omp_set_dynamic(int flag)   Enables or disables dynamic adjustment of the number of threads.
omp_get_dynamic()           Returns whether dynamic adjustment is enabled.
omp_set_nested(int flag)    Enables or disables nested parallelism.
omp_get_nested()            Returns whether nested parallelism is enabled.

Example in C

c
#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("Hello from thread %d\n", tid);
    }

    return 0;
}

(2) OpenMP environment variables

Common OpenMP Environment Variables

Variable               Description
OMP_NUM_THREADS        Sets the number of threads to use in parallel regions.
OMP_DYNAMIC            Enables (TRUE) or disables (FALSE) dynamic adjustment of the
                       number of threads.
OMP_NESTED             Enables (TRUE) or disables (FALSE) nested parallel regions.
OMP_SCHEDULE           Sets the scheduling policy and chunk size for runtime-scheduled
                       loops (e.g., static,4 or dynamic).
OMP_PROC_BIND          Controls whether threads are bound to processors (TRUE, FALSE,
                       or spread, close, master).
OMP_PLACES             Specifies the places (logical processors) where threads can be
                       scheduled.
OMP_STACKSIZE          Sets the stack size for threads.
OMP_WAIT_POLICY        Defines the behavior of waiting threads (ACTIVE or PASSIVE).
OMP_MAX_ACTIVE_LEVELS  Sets the maximum number of nested active parallel regions.
OMP_THREAD_LIMIT       Limits the total number of threads used by the program.

Example Usage (Linux/macOS Shell)

bash
export OMP_NUM_THREADS=8
export OMP_SCHEDULE="dynamic,2"

These variables can be set before running your OpenMP program to control its execution
behavior.

Question:=6(b) Generalize on how data and functional parallelism are handled in shared-memory programming with OpenMP.

1. Data Parallelism in OpenMP

Data parallelism involves performing the same operation on different parts of a data set
simultaneously.

How OpenMP handles it:

 Uses #pragma omp parallel for to divide loop iterations among threads.
 Each thread processes a chunk of the data independently.

Example:
c
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    A[i] = B[i] + C[i];
}

 Threads share memory, so no need to explicitly pass data.
 Synchronization is minimal if each thread works on separate data.

2. Functional Parallelism in OpenMP

Functional parallelism (also called task parallelism) involves executing different tasks or
functions in parallel.
How OpenMP handles it:

 Uses #pragma omp parallel sections to run independent code blocks concurrently.
 Each section can perform a different function.

Example:
c
#pragma omp parallel sections
{
    #pragma omp section
    {
        process_audio();
    }
    #pragma omp section
    {
        process_video();
    }
}

 Useful when tasks are logically independent.
 Threads still share memory, so communication is easy.

Summary

Aspect            Data Parallelism                      Functional Parallelism
Focus             Splitting data across threads         Splitting tasks/functions across threads
OpenMP Construct  #pragma omp parallel for              #pragma omp parallel sections
Use Case          Array processing, matrix operations   Multimedia pipelines, I/O + computation
Memory Model      Shared memory simplifies data access  Shared memory simplifies coordination

Question:=7(a) Discuss the current IA-32 architecture. Also state the methods to avoid pipeline stalls on IA-32.

IA-32 Architecture Overview

IA-32 (Intel Architecture, 32-bit) is the 32-bit version of the x86 instruction set architecture,
introduced with the Intel 80386 processor. It became the foundation for many generations of
Intel and compatible CPUs.

Key Features:
 32-bit Registers: General-purpose registers like EAX, EBX, ECX, etc., support 32-bit operations.
 Segmented Memory Model: Supports both segmented and flat memory models.
 Paging Support: Enables virtual memory through 4 KB pages.
 Instruction-Level Parallelism: Later IA-32 implementations (e.g., Pentium) support
superscalar execution with multiple pipelines.
 Floating Point Unit (FPU): Integrated FPU with its own registers and execution
units.
 Caches: Separate instruction and data caches (Harvard architecture style).
 Out-of-Order Execution: Instructions can be executed out of order and retired in
order using a Reorder Buffer (ROB).
 Branch Prediction: Helps reduce control hazards in pipelined execution.


Avoiding Pipeline Stalls in IA-32

Pipeline stalls occur when the CPU pipeline cannot proceed with the next instruction due to
hazards. Here are methods to minimize them:

1. Out-of-Order Execution

 Executes instructions as operands become available, not strictly in program order.
 Helps bypass data hazards and improves instruction throughput.

2. Instruction Reordering by Compiler

 Compilers can rearrange instructions to avoid dependencies and fill delay slots.

3. Register Renaming

 Eliminates false dependencies (Write After Write, Write After Read) by mapping
logical registers to physical ones.

4. Forwarding (Data Bypassing)

 Passes results directly from one pipeline stage to another without writing to registers
first.

5. Branch Prediction

 Predicts the outcome of conditional branches to avoid control hazards.
 Mispredictions can cause stalls, so accurate predictors are crucial.

6. Speculative Execution

 Executes instructions before knowing if they are needed, rolling back if the
speculation was incorrect.
7. Hardware Prefetching

 Loads data into cache before it’s needed to reduce memory access latency.

Question:=7(b) Define deadlock. Write the conditions under which a deadlock situation may arise. Also discuss synchronization primitives and their challenges in parallel programs.

What is a Deadlock?

A deadlock is a situation in concurrent programming where a group of threads or processes
are each waiting for resources held by the others, and none can proceed. It results in a
complete standstill of execution.

Conditions for Deadlock (Coffman’s Conditions)

A deadlock can occur if all the following four conditions hold simultaneously:

Condition         Description
Mutual Exclusion  At least one resource must be held in a non-shareable mode.
Hold and Wait     A process holding at least one resource is waiting to acquire
                  additional resources held by others.
No Preemption     Resources cannot be forcibly taken away; they must be released
                  voluntarily.
Circular Wait     A set of processes are waiting for each other in a circular chain.
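
A hedged C sketch of breaking the circular-wait condition by always acquiring locks in one global order (the lock names and two-resource scenario are invented for illustration):

c
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Every thread takes lock_a before lock_b. A cycle such as
 * "thread 1 holds lock_a, thread 2 holds lock_b" can then never
 * form, so the circular-wait condition is eliminated. */
void transfer(void) {
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... critical section touching both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}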

🧵 Synchronization Primitives in Parallel Programming

Synchronization primitives are tools used to coordinate access to shared resources and avoid
issues like race conditions and deadlocks.

Common Synchronization Primitives:

Primitive           Description                                      Use Case
Mutex (Mutual       Ensures only one thread accesses a critical      Protecting shared data.
Exclusion)          section at a time.
Semaphore           A counter-based lock that allows a limited       Managing resource pools.
                    number of threads to access a resource.
Spinlock            A lock where threads wait in a loop ("spin")     Low-latency locking on
                    until the lock becomes available.                multi-core systems.
Condition Variable  Allows threads to wait for certain conditions    Thread coordination (e.g.,
                    to be met.                                       producer-consumer).
Barrier             Synchronizes a group of threads at a specific    Parallel algorithms with
                    point.                                           phases.
Read-Write Lock     Allows multiple readers or one writer.           Optimizing read-heavy
                                                                     workloads.

Challenges in Using Synchronization Primitives

 Deadlocks: Poor lock ordering or resource management can cause circular waits.
 Livelocks: Threads keep reacting to each other but make no progress.
 Starvation: Some threads may never acquire the lock if others dominate access.
 Performance Overhead: Excessive locking can reduce parallel efficiency.
