
ADVANCED COMPUTER ARCHITECTURE

UNIT 1:
➢ Theory of Parallelism:
1. Introduction

The theory of parallelism in advanced computer architecture explains how multiple tasks or operations can be executed simultaneously to improve performance, reduce execution time, and maximize hardware utilization. Instead of executing instructions one after another (sequential processing), parallelism allows concurrent execution at different levels of a computing system.

2. Types of Parallelism

Parallelism can be broadly classified into two dimensions:

(a) Instruction-Level Parallelism (ILP)

• Refers to executing multiple instructions at the same time within a single processor.
• Achieved through pipelining, superscalar execution, out-of-order execution, and
branch prediction.
• Example: Modern CPUs can fetch, decode, and execute multiple instructions per
clock cycle.

(b) Process / Thread-Level Parallelism (TLP)

• Involves executing multiple threads or processes simultaneously.


• Supported by multicore processors, multiprocessors, and distributed systems.
• Example: Running multiple applications on different cores of a CPU.

(c) Data-Level Parallelism (DLP)

• Exploits the fact that the same operation can be applied to multiple data elements
simultaneously.
• Implemented using vector processors, SIMD (Single Instruction Multiple Data),
and GPUs.
• Example: Image processing where the same filter is applied to millions of pixels at
once.
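
A minimal sketch of this idea (assuming NumPy is available; the array size and the brighten-and-clip filter are illustrative), showing the same operation applied to every pixel at once instead of looping element by element:

```python
# Minimal sketch of data-level parallelism: one operation applied to many
# data elements at once. NumPy dispatches the whole-array expression to
# vectorized (SIMD-style) routines instead of an explicit Python loop.
import numpy as np

# Illustrative "image": a 256x256 array of 8-bit pixel intensities.
image = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)

# Scalar (sequential) view: brighten each pixel one at a time.
def brighten_sequential(img, delta=30):
    out = img.astype(np.int16)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = min(out[i, j] + delta, 255)
    return out.astype(np.uint8)

# Data-parallel view: the same "+ delta, clip to 255" filter is applied to
# every pixel simultaneously by a single vectorized expression.
def brighten_parallel(img, delta=30):
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

assert np.array_equal(brighten_sequential(image), brighten_parallel(image))
```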
3. Models of Parallelism

Several models describe how parallelism is achieved:

• Flynn’s Taxonomy (SISD, SIMD, MISD, MIMD) → Classifies architectures by instruction and data streams.
• Pipeline Model → Breaks instruction execution into stages for faster throughput.
• Multiprocessor Model → Multiple CPUs working together with shared or distributed
memory.
• Multithreading Model → Single CPU core executes multiple threads concurrently.

4. Benefits of Parallelism

• Increased Performance – Executes more instructions per unit time.


• Reduced Latency – Tasks finish faster.
• Scalability – Systems can be expanded with more cores or processors.
• Efficiency – Better utilization of hardware resources.

5. Challenges in Parallelism

• Synchronization & Communication Overhead (between processors/threads).


• Load Balancing (ensuring all processors get equal work).
• Amdahl’s Law (limits maximum speedup due to sequential portions of a program).
• Memory Bottlenecks (cache coherence, bandwidth issues).

6. Applications

• High-performance computing (HPC)


• Artificial Intelligence (AI) and Machine Learning
• Scientific simulations (weather forecasting, molecular modeling)
• Big Data and Cloud Computing
• Graphics rendering and gaming

➢ THE STATE OF COMPUTING:


1. Introduction

The state of computing in advanced computer architecture refers to the current trends,
developments, and challenges in designing high-performance systems. As the demand for
faster processing, massive data handling, artificial intelligence, and cloud computing grows,
architectures must evolve beyond traditional single-core designs.
2. Evolution of Computing Architectures

• Past (Single-core Era): Early computers relied on sequential execution, where one
instruction was processed at a time.
• Transition (Moore’s Law & Pipelining): Increased transistor counts enabled faster
processors, pipelining, and superscalar execution.
• Present (Multicore & Parallelism): Due to power and heat limitations, focus shifted
to multicore processors, GPUs, and heterogeneous architectures.
• Future (Quantum & Neuromorphic): Emerging technologies are pushing
computing toward quantum processors, AI-driven accelerators, and brain-
inspired chips.

3. Current State of Computing

(a) Multicore and Manycore Systems

• Modern CPUs contain multiple cores to allow parallel execution.


• Manycore architectures (like GPUs with thousands of cores) dominate AI, graphics,
and big data.

(b) Heterogeneous Computing

• Combination of CPU + GPU + FPGA + AI accelerators.


• Each unit handles tasks it is best suited for (e.g., GPU for matrix operations, CPU for
sequential logic).

(c) High-Performance Computing (HPC)

• Supercomputers use millions of cores for scientific simulations, climate modeling, and space research.
• Example: Exascale computing (capable of 10^18 operations per second) is the new
milestone.

(d) Cloud and Edge Computing

• Distributed architectures enable scalable on-demand computing.


• Edge computing brings computation closer to users for real-time processing in IoT
and 5G.

(e) AI and Machine Learning Integration

• Specialized hardware (TPUs, NPUs) accelerates deep learning.


• Parallelism and data-driven architectures power applications like natural language
processing, computer vision, and autonomous systems.
4. Challenges in Current State

• Power and Heat Dissipation: Limits frequency scaling.


• Memory Bottlenecks: Need faster memory hierarchies and cache coherence
mechanisms.
• Programming Complexity: Parallel and distributed systems are harder to program
and debug.
• Security Concerns: Advanced systems face new vulnerabilities (side-channel
attacks, cloud data breaches).

5. Future Outlook

• Quantum Computing for solving intractable problems.


• Neuromorphic Computing mimicking brain networks.
• Energy-efficient architectures using low-power designs.
• Integration of AI into architecture design (self-optimizing chips).

➢MULTIPROCESSORS AND MULTICOMPUTERS:

1. Introduction

As computing needs grew, single-processor systems became insufficient to handle large-scale, complex, and time-critical applications. Advanced computer architecture thus adopted multiprocessor and multicomputer systems to achieve high performance, scalability, and reliability through parallel processing.

2. Multiprocessors

• A multiprocessor system consists of two or more processors sharing a common memory and input/output system.
• Processors work cooperatively to execute multiple tasks simultaneously.
• Typically classified as tightly coupled systems since all processors share a single
address space.

Characteristics

• Shared memory model.


• Symmetric or asymmetric multiprocessing.
• Cache coherence protocols to maintain consistency.
• Easier to program compared to distributed systems.
Advantages

• Faster execution by parallelism.


• Shared resources → efficient communication.
• High throughput for scientific and commercial workloads.

Disadvantages

• Limited scalability (due to shared memory bottleneck).


• Complexity of synchronization and coherence.

3. Multicomputers

• A multicomputer system consists of multiple independent computers (nodes) interconnected through a high-speed communication network.
• Each node has its own private memory and operating system.
• They are loosely coupled systems, relying on message passing for communication.

Characteristics

• Distributed memory model.


• Nodes communicate via interconnection networks (mesh, hypercube, torus, etc.).
• High scalability (can add more nodes easily).

Advantages

• Scalable to thousands of processors.


• Fault-tolerant (failure of one node doesn’t crash the entire system).
• Suitable for massively parallel workloads (AI, big data, simulations).

Disadvantages

• Communication overhead due to message passing.


• Programming complexity (distributed computing model).
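
A minimal sketch of the message-passing style used by multicomputers (using Python's multiprocessing module; the node count and workload are illustrative): each worker owns private data and communicates only by sending messages over a queue.

```python
# Minimal sketch of message passing: each "node" computes on its own private
# memory and exchanges only messages over an interconnect (here, a Queue).
from multiprocessing import Process, Queue

def worker(node_id, chunk, result_queue):
    partial_sum = sum(chunk)               # compute on private data
    result_queue.put((node_id, partial_sum))   # communicate by message passing

if __name__ == "__main__":
    data = list(range(1, 1001))            # illustrative workload
    num_nodes = 4
    chunk_size = len(data) // num_nodes
    results = Queue()

    procs = [
        Process(target=worker,
                args=(n, data[n * chunk_size:(n + 1) * chunk_size], results))
        for n in range(num_nodes)
    ]
    for p in procs:
        p.start()
    total = sum(results.get()[1] for _ in procs)   # gather partial sums
    for p in procs:
        p.join()

    print("Distributed sum:", total)       # 500500
```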

4. Key Differences: Multiprocessors vs Multicomputers

Feature Multiprocessors Multicomputers


Memory Shared memory Distributed memory
Coupling Tightly coupled Loosely coupled
Communication Shared memory access Message passing
Scalability Limited (due to memory bottleneck) Highly scalable
Programming Easier (single memory space) Complex (distributed)
Examples Multi-core CPUs, IBM Power servers Cluster systems, supercomputers
5. Applications

• Multiprocessors: Transaction processing, real-time systems, databases, operating


system kernels.
• Multicomputers: Scientific simulations, weather modeling, AI/ML training, cloud &
grid computing.

➢MULTIVECTOR AND SIMD COMPUTERS:

1. Introduction

As applications in scientific computing, engineering, AI, and multimedia demand huge amounts of data processing, architectures were developed to exploit data-level parallelism
(DLP). Two such approaches are multivector computers and SIMD (Single Instruction,
Multiple Data) computers, both designed to perform the same operation on multiple data
elements simultaneously.

2. SIMD (Single Instruction, Multiple Data) Computers

• Definition: A SIMD computer executes a single instruction across multiple data streams at the same time.
• Architecture: One control unit directs multiple processing elements (PEs). Each PE
works on a different piece of data.
• Working Principle: The instruction is broadcast to all processors, and they operate
on their local data simultaneously.

Characteristics

• Exploits fine-grained data parallelism.


• Highly efficient for array/vector operations.
• Simple control logic (single instruction stream).

Examples

• GPUs (Graphics Processing Units).


• Classic SIMD machines like ILLIAC IV, Connection Machine.
• Intel SSE, AVX instructions in CPUs.

Applications
• Image and video processing.
• Matrix operations in scientific computing.
• Neural network training and inference.

3. Multivector Computers

• Definition: A multivector computer is designed to operate on entire vectors (arrays of data) as single operands, rather than individual scalar values.
• Architecture: Contains vector registers and specialized pipelines for performing
vectorized arithmetic and logical operations.
• Working Principle: A vector instruction specifies operations on long data arrays,
reducing instruction fetch and decode overhead.

Characteristics

• Uses vector registers (can hold dozens or hundreds of elements).


• Highly optimized for long, repetitive computations.
• Supports pipelined execution of vector operations.

Examples

• Cray-1 supercomputer (classic vector processor).


• Modern CPUs with vector extensions (Intel AVX, ARM NEON).

Applications

• Scientific simulations (weather, physics, molecular modeling).


• Numerical linear algebra (matrix multiplication, FFT).
• Big data analytics.

4. Comparison: Multivector vs SIMD

Feature | Multivector Computers | SIMD Computers
Data Handling | Operates on vectors as operands | Operates on multiple data elements in parallel
Memory Model | Vector registers store long arrays | Each PE works on its own data memory
Instruction Control | One vector instruction → multiple element operations | One instruction → multiple processors execute
Flexibility | More flexible, supports complex vector operations | Simpler, suited for uniform operations
Examples | Cray-1, NEC SX series | GPUs, Intel SSE/AVX, ILLIAC IV
5. Conclusion

• SIMD computers focus on executing the same instruction across multiple


independent data streams, making them ideal for tasks like graphics and AI.
• Multivector computers extend this concept by handling entire arrays of data with
vector instructions, making them highly efficient in scientific and engineering
applications.
Both exploit data-level parallelism and remain crucial in modern CPUs, GPUs, and
supercomputers

➢ PRAM AND VLSI MODELS:

1. Introduction
In advanced computer architecture, different models are used to analyze, design, and
evaluate parallel computing systems. Two important theoretical and practical models are:

1. PRAM (Parallel Random Access Machine) Model – a theoretical model for parallel algorithms.
2. VLSI (Very Large Scale Integration) Model – a hardware model for designing
efficient parallel architectures.

2. PRAM (Parallel Random Access Machine) Model


• Definition: PRAM is a theoretical model of parallel computation where multiple
processors operate synchronously and share a single global memory with unit-time
access.
• It extends the classical RAM model into a parallel setting.

Characteristics

• Consists of P processors, each with local registers.


• All processors share a global memory.
• Works in lock-step synchronous mode (one instruction per clock cycle across
processors).
• Ignores practical constraints like communication delays.
Types of PRAM (based on memory access rules)

1. EREW (Exclusive Read Exclusive Write): No two processors can read/write the
same memory cell at the same time.
2. CREW (Concurrent Read Exclusive Write): Multiple processors can read the same
memory cell, but only one can write at a time.
3. ERCW (Exclusive Read Concurrent Write): Processors read exclusively but can
write concurrently.
4. CRCW (Concurrent Read Concurrent Write): Multiple processors can read and
write simultaneously (with rules like priority write, common write, etc.).
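
A minimal sketch of a PRAM-style computation, simulated sequentially (the lock-step rounds and memory layout are illustrative): an EREW parallel sum that reduces n values in O(log n) synchronous steps.

```python
# Minimal sketch of an EREW PRAM-style parallel sum, simulated sequentially.
# In each synchronous round every "processor" reads two distinct cells and
# writes one distinct cell (exclusive read / exclusive write), so n values
# are reduced in O(log n) parallel steps.
def pram_parallel_sum(values):
    cells = list(values)            # the shared global memory
    n = len(cells)
    step = 1
    while step < n:
        # All processors operate in lock-step within this round.
        for i in range(0, n - step, 2 * step):
            cells[i] = cells[i] + cells[i + step]   # exclusive write to cells[i]
        step *= 2
    return cells[0]

print(pram_parallel_sum(range(1, 9)))   # 36, computed in log2(8) = 3 rounds
```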

Advantages

• Simple and powerful model for designing parallel algorithms.


• Provides a basis for complexity analysis (time × number of processors).

Limitations

• Not realistic (assumes unit-time memory access and no communication delay).


• Hardware implementation is difficult due to memory bottlenecks.

3. VLSI (Very Large Scale Integration) Model


• Definition: The VLSI model is a hardware-based computational model used to
measure the efficiency of algorithms in terms of chip area and time when
implemented in silicon.
• Based on Thompson’s VLSI complexity model (1979).

Characteristics

• Measures complexity using:


o Area (A): The physical silicon area used on a chip.
o Time (T): Execution time of computation.
o Area-Time Product (AT): Used as the efficiency metric.
• Focuses on layout efficiency and parallelism trade-offs.
• Assumes data is transmitted through wires on a chip, hence communication cost is
significant.

Applications

• Chip design for processors, GPUs, and AI accelerators.


• Efficient layout for circuits implementing sorting, FFT, matrix multiplication.
• Optimization of interconnection networks (mesh, torus, hypercube).

Advantages

• Provides a practical metric for hardware design.


• Balances parallelism and silicon cost.

Limitations

• Purely hardware-oriented (does not capture software-level issues).


• Difficult to apply directly to large distributed systems.

4. Comparison: PRAM vs VLSI Models


Feature | PRAM Model | VLSI Model
Nature | Theoretical model of parallel computation | Hardware (chip) design model
Memory | Assumes global shared memory | Distributed across chip components
Cost Metric | Time × Processors | Area × Time (AT) product
Realism | Idealized (no communication delay) | Realistic (wire length, chip area considered)
Usage | Algorithm design & analysis | Hardware/circuit layout optimization

5. Conclusion
• PRAM model provides a theoretical foundation for designing and analyzing parallel
algorithms, though it is abstract and not directly realizable.
• VLSI model provides a practical framework for implementing these algorithms on
real hardware, focusing on chip area and time efficiency.
Together, they bridge the gap between parallel algorithm design and hardware
implementation in advanced computer architecture.

➢ ARCHITECTURAL DEVELOPMENT
TRACKS:

1. Introduction
The progress of computer architecture has followed certain development tracks (or
directions), shaped by the need for higher performance, scalability, energy efficiency, and
specialized applications. These tracks represent how architectures evolved from simple
sequential machines to highly parallel and heterogeneous systems.

2. Major Architectural Development Tracks


(a) Instruction-Level Parallelism (ILP) Track

• Focuses on executing multiple instructions simultaneously within a single processor.
• Techniques: Pipelining, superscalar execution, out-of-order execution, branch
prediction, VLIW (Very Long Instruction Word).
• Example: Modern Intel and AMD CPUs use ILP heavily.
• Limitation: Diminishing returns due to complexity, power, and heat issues.

(b) Thread-Level and Process-Level Parallelism (TLP/PLP) Track

• Uses multiple processors/cores to execute independent threads or processes concurrently.
• Techniques: Multicore processors, Simultaneous Multithreading (SMT),
Multiprocessors.
• Example: Quad-core, octa-core CPUs in desktops and servers.
• Advantage: Scalable performance improvement.
• Challenge: Programming complexity (synchronization, concurrency control).

(c) Data-Level Parallelism (DLP) Track

• Exploits parallelism by applying the same operation to multiple data elements simultaneously.
• Techniques: Vector processors, SIMD (Single Instruction Multiple Data), GPUs,
TPUs.
• Example: GPUs for AI and image processing.
• Key Role: Accelerates scientific computing, multimedia, and AI/ML workloads.

(d) Memory and Storage Development Track

• Focus on overcoming the memory bottleneck between CPU and memory.


• Developments: Cache hierarchy, virtual memory, NUMA (Non-Uniform Memory
Access), high-bandwidth memory (HBM).
• Example: DDR → GDDR → HBM memory systems in HPC and GPUs.
(e) Interconnection and Communication Track

• Deals with efficient processor-to-processor and processor-to-memory communication.
• Developments: Bus systems, crossbar switches, mesh, torus, hypercube, Network-
on-Chip (NoC).
• Example: Supercomputers with high-speed interconnects like InfiniBand.

(f) Specialized / Application-Specific Track

• Development of domain-specific architectures optimized for particular tasks.


• Examples: AI accelerators (Google TPU, NVIDIA Tensor Cores), DSPs (Digital
Signal Processors), Neuromorphic chips, Quantum processors.
• Advantage: Energy-efficient and high performance for targeted workloads.

3. Summary Table
Track | Focus | Examples
Instruction-Level Parallelism | Speed within single processor | Superscalar, VLIW CPUs
Thread/Process-Level Parallelism | Parallelism across threads/cores | Multicore CPUs, Multiprocessors
Data-Level Parallelism | Same operation on large data sets | SIMD, GPUs, Vector Processors
Memory/Storage | Reducing latency & bottlenecks | Cache, HBM, NUMA
Interconnection Networks | Efficient communication | Mesh, Torus, NoC
Specialized Architectures | Task-specific acceleration | TPUs, DSPs, Quantum Chips

4. Conclusion
Architectural development in advanced computing has branched into multiple tracks, each
addressing performance, efficiency, and scalability from a different perspective. While ILP
and multicore processors remain general-purpose, DLP and specialized accelerators dominate
domains like AI, scientific computing, and big data. Future development tracks point
toward quantum computing, neuromorphic systems, and energy-efficient architectures.
➢ PROGRAM AND NETWORK
PROPERTIES:

1. Introduction
In parallel and distributed computing, the performance and efficiency of a system depend
on two key aspects:

1. Program Properties – the characteristics of the program that determine how well it
can be parallelized.
2. Network Properties – the characteristics of the interconnection network that affect
communication among processors.

Understanding both is crucial in designing efficient multiprocessor and multicomputer systems.

2. Program Properties
These describe how a program behaves in terms of parallelism, communication, and
execution requirements.

(a) Degree of Parallelism

• Measures how much of the program can be executed in parallel.


• Determined using Amdahl’s Law (sequential bottleneck) and Gustafson’s Law
(scalability with more processors).

(b) Grain Size (Granularity)

• Fine-grain programs → Small tasks, frequent communication (e.g., SIMD).


• Coarse-grain programs → Large tasks, less communication overhead (e.g.,
multiprocessors).

(c) Communication-to-Computation Ratio

• Ratio of time spent in communication vs computation.


• High ratio → communication dominates, performance degrades.
• Low ratio → program is computation-intensive, more scalable.

(d) Regularity of Parallelism


• Regular programs: predictable execution patterns (matrix operations, image
processing).
• Irregular programs: dynamic, unpredictable (graph algorithms, AI search
problems).

(e) Synchronization Requirements

• Programs may need frequent synchronization points (barriers, locks), which affect
efficiency.

3. Network Properties
These define how the interconnection network impacts the execution of parallel programs.

(a) Topology

• Defines how processors and memory modules are connected.


• Examples: Bus, Crossbar, Mesh, Torus, Hypercube, Tree, Network-on-Chip
(NoC).

(b) Diameter

• Maximum number of hops (links) required between any two nodes.


• Lower diameter → faster communication.

(c) Bisection Bandwidth

• The minimum bandwidth between two halves of the network.


• Higher bandwidth → better parallel performance.
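
A minimal sketch (using the standard textbook formulas; N = 64 is illustrative) comparing diameter and bisection width for a ring, a 2-D mesh, and a hypercube:

```python
# Minimal sketch of the textbook formulas for diameter and bisection width
# of some standard static topologies (N = number of nodes). Bisection
# bandwidth is the bisection width multiplied by the per-link bandwidth.
import math

def ring(N):
    return {"diameter": N // 2, "bisection_width": 2}

def mesh_2d(N):                      # assumes N is a perfect square (k x k mesh)
    k = int(math.isqrt(N))
    return {"diameter": 2 * (k - 1), "bisection_width": k}

def hypercube(N):                    # assumes N is a power of two
    d = int(math.log2(N))
    return {"diameter": d, "bisection_width": N // 2}

for name, fn in [("ring", ring), ("2-D mesh", mesh_2d), ("hypercube", hypercube)]:
    print(f"{name:10s} N=64 -> {fn(64)}")
# The hypercube has the lowest diameter (6) and the highest bisection width (32).
```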

(d) Latency and Throughput

• Latency: Time taken for a message to travel from source to destination.


• Throughput: Number of messages the network can handle simultaneously.

(e) Fault Tolerance

• Ability of the network to continue functioning despite failures.


• Important in large-scale HPC and cloud systems.

(f) Scalability

• Ability to add more processors/nodes without major performance degradation.


4. Relationship between Program and Network Properties
• A fine-grain program needs a low-latency, high-bandwidth network.
• A coarse-grain program can tolerate slower networks.
• Scalability of programs depends on both degree of parallelism and network
scalability.
• Poor network design may bottleneck highly parallel programs, reducing overall
system efficiency.

5. Conclusion
• Program properties (parallelism, granularity, communication ratio) define how
much parallel speedup can be achieved.
• Network properties (topology, latency, bandwidth, scalability) determine how
effectively processors can communicate.
• The balance of these two factors is essential for high-performance computing
(HPC), AI, big data, and distributed systems.

➢ CONDITIONS OF PARALLELISM:

Introduction
Parallelism is the foundation of modern high-performance computing. However, not all tasks
can be executed in parallel. To exploit Instruction-Level, Data-Level, Thread-Level, and
Process-Level Parallelism, certain conditions must be satisfied. These conditions are
generally derived from data dependencies, resource constraints, and control flow
requirements.

2. Major Conditions of Parallelism


(a) Data Dependency Condition

Parallel execution is possible only if tasks are not dependent on the results of each other.
Types of data dependencies:

1. True Dependency (Read After Write – RAW):


o Instruction B needs the result of Instruction A.
o Must wait → prevents parallelism.
2. Anti-Dependency (Write After Read – WAR):
o Instruction B writes to a variable after Instruction A reads it.
o May cause hazards in parallel pipelines.
3. Output Dependency (Write After Write – WAW):
o Two instructions write to the same variable.
o Execution order matters.

✔ Parallelism exists only if data dependencies are removed or minimized (via renaming,
loop unrolling, speculation, etc.).
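
A minimal sketch of dependence classification (the register names and instruction encodings are illustrative, not a real ISA): given the registers each instruction reads and writes, report RAW, WAR, and WAW dependencies.

```python
# Minimal sketch: classify the dependence between two instructions, given the
# registers each one reads and writes (instr1 comes first in program order).
def classify_dependence(instr1, instr2):
    """instr = {'reads': set_of_regs, 'writes': set_of_regs}"""
    deps = []
    if instr1["writes"] & instr2["reads"]:
        deps.append("RAW (true dependency)")
    if instr1["reads"] & instr2["writes"]:
        deps.append("WAR (anti-dependency)")
    if instr1["writes"] & instr2["writes"]:
        deps.append("WAW (output dependency)")
    return deps or ["independent - may execute in parallel"]

i1 = {"reads": {"r2", "r3"}, "writes": {"r1"}}   # r1 = r2 + r3
i2 = {"reads": {"r1", "r4"}, "writes": {"r5"}}   # r5 = r1 * r4  (reads r1)
print(classify_dependence(i1, i2))               # ['RAW (true dependency)']
```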

(b) Resource Dependency Condition

• Parallelism requires sufficient hardware resources (ALUs, registers, memory modules, I/O).
• Example: Two instructions needing the same multiplier unit cannot execute
simultaneously.
• Solution: Multiple functional units and superscalar architectures.

(c) Control Dependency Condition

• Occurs when program execution depends on branching/decision-making.


• Example: if-else conditions may prevent parallel instruction scheduling.
• Solution: Branch prediction, speculative execution, predication.

(d) Bernstein’s Conditions

Formally, parallelism is possible between two processes P1 and P2 if:

• Input sets (I) and Output sets (O) of both processes do not conflict.
• The conditions are:
1. $I_1 \cap O_2 = \emptyset$ (P1 does not read values written by P2)
2. $I_2 \cap O_1 = \emptyset$ (P2 does not read values written by P1)
3. $O_1 \cap O_2 = \emptyset$ (P1 and P2 do not write to the same variable)

If these are satisfied, P1 and P2 can execute in parallel.
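
A minimal sketch of checking Bernstein's conditions on the input (read) and output (write) sets of two processes (the variable names are illustrative):

```python
# Minimal sketch of Bernstein's conditions for two processes, given their
# input (read) and output (write) variable sets.
def bernstein_parallel(I1, O1, I2, O2):
    return (not (I1 & O2)) and (not (I2 & O1)) and (not (O1 & O2))

# P1: c = a + b   -> I1 = {a, b}, O1 = {c}
# P2: d = a * e   -> I2 = {a, e}, O2 = {d}
print(bernstein_parallel({"a", "b"}, {"c"}, {"a", "e"}, {"d"}))   # True: parallel OK

# P3: a = c + 1   -> reads c (written by P1) and writes a (read by P1)
print(bernstein_parallel({"a", "b"}, {"c"}, {"c"}, {"a"}))        # False
```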

(e) Granularity Condition

• Refers to the size of tasks relative to communication overhead.


• Fine-grain tasks → High communication, less efficient.
• Coarse-grain tasks → Better parallel efficiency.

3. Practical Considerations
• Amdahl’s Law: The maximum speedup is limited by the sequential portion of a
program.
• Load Balancing: Processors must be evenly loaded to prevent idle time.
• Synchronization: Proper coordination among processors is needed to maintain
correctness.

4. Conclusion
The conditions of parallelism are determined mainly by data, resource, and control
dependencies, along with Bernstein’s formal model. For efficient parallel computing,
dependencies must be minimized, resources must be adequate, and workloads balanced.
These conditions guide the design of parallel algorithms, compilers, and architectures in
modern computing systems

➢ PROGRAM PARTITIONING AND SCHEDULING:

1. Introduction
In parallel and distributed computing, a program must be broken into smaller tasks and
then assigned to processors for execution. This process involves:

1. Program Partitioning – dividing a program into parallel tasks.


2. Program Scheduling – mapping those tasks onto processors to ensure efficient
execution.

Both steps are critical for achieving high performance, load balancing, and minimal
communication overhead.

2. Program Partitioning
Partitioning means breaking a large program into smaller tasks, modules, or processes that
can be executed concurrently.

Objectives

• Maximize parallel execution.


• Minimize inter-task communication.
• Ensure balanced workload distribution.

Methods of Partitioning

1. Functional Partitioning
o Divide program based on functionality.
o Example: One task for input, another for computation, another for output.
2. Data Partitioning
o Divide data among processors; each processor performs the same operation on
different data.
o Example: Matrix multiplication (each processor handles a block of the
matrix).
3. Recursive Partitioning
o Break tasks into smaller subtasks repeatedly until tasks fit processor limits.
o Example: Divide-and-conquer algorithms (QuickSort, FFT).
4. Domain Partitioning
o Divide problem space into sub-domains, each solved in parallel.
o Example: Weather simulation grids, finite element analysis.
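
A minimal sketch of data partitioning (block decomposition; the worker count and the squaring operation are illustrative): each worker applies the same operation to its own contiguous block of the data.

```python
# Minimal sketch of data partitioning: the same operation is applied by each
# worker to its own block of a 1-D array (block decomposition).
from concurrent.futures import ProcessPoolExecutor

def block_partition(data, num_workers):
    """Split data into num_workers nearly equal contiguous blocks."""
    base, extra = divmod(len(data), num_workers)
    blocks, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks

def square_block(block):
    return [x * x for x in block]      # same operation on different data

if __name__ == "__main__":
    data = list(range(16))
    blocks = block_partition(data, 4)
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial = list(pool.map(square_block, blocks))
    print([x for part in partial for x in part])
```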

Granularity in Partitioning

• Fine-grained: Many small tasks → high parallelism but high communication overhead.
• Coarse-grained: Few large tasks → less overhead but limited parallelism.

3. Program Scheduling
After partitioning, tasks must be scheduled on processors. Scheduling decides execution
order, mapping, and timing.

Objectives

• Minimize overall execution time.


• Balance the workload across processors.
• Reduce communication and synchronization overhead.

Types of Scheduling

1. Static Scheduling
o Tasks are assigned before execution.
o Simple, less overhead.
o Example: Round-robin, block partitioning.
2. Dynamic Scheduling
o Tasks assigned at runtime based on system load.
o More flexible, adapts to workload changes.
o Example: Work stealing, load balancing schedulers.

Common Scheduling Techniques

• List Scheduling: Tasks are ordered by priority and assigned to processors.


• Loop Scheduling: Iterations of a loop distributed among processors.
• Task Graph Scheduling: Tasks represented as nodes in a graph; dependencies guide
scheduling.
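
A minimal sketch of list scheduling for independent tasks (task durations and processor count are illustrative): tasks are ordered by priority (longest first) and each is assigned to the currently least-loaded processor.

```python
# Minimal sketch of list scheduling: sort tasks by priority (longest
# processing time first) and place each on the processor with the least
# accumulated work so far.
import heapq

def list_schedule(task_times, num_procs):
    procs = [(0, p) for p in range(num_procs)]      # (finish_time, processor_id)
    heapq.heapify(procs)
    assignment = {p: [] for p in range(num_procs)}

    for task, t in sorted(enumerate(task_times), key=lambda kv: -kv[1]):
        finish, p = heapq.heappop(procs)            # least-loaded processor
        assignment[p].append(task)
        heapq.heappush(procs, (finish + t, p))

    makespan = max(finish for finish, _ in procs)
    return assignment, makespan

tasks = [4, 7, 2, 5, 3, 6]                          # illustrative task durations
print(list_schedule(tasks, num_procs=2))            # balanced mapping, makespan 14
```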

4. Relationship Between Partitioning and Scheduling


• Partitioning decides what tasks exist.
• Scheduling decides when and where they execute.
• Good partitioning without good scheduling may still cause load imbalance or idle
processors.

5. Conclusion
• Program Partitioning ensures a program is broken into manageable, parallelizable
tasks.
• Program Scheduling ensures these tasks are efficiently mapped onto processors.
• Together, they determine the performance, scalability, and efficiency of parallel
systems in advanced computer architecture.

Applications include HPC, AI/ML training, real-time systems, cloud computing, and
scientific simulations.

➢ PROGRAM FLOW MECHANISMS:

1. Introduction
In computer architecture, program flow mechanisms define how instructions are executed
and controlled in a processor. They determine the order of instruction execution, support
parallelism, and help maximize CPU utilization.

Modern architectures use advanced flow mechanisms to overcome control hazards, data
hazards, and branch penalties while exploiting parallel execution.

2. Major Program Flow Mechanisms


(a) Control Flow Mechanism

• Traditional mechanism where instructions are executed sequentially as determined by the program counter (PC).
• Flow changes with branching, loops, function calls.
• Limitation: Causes stalls in pipelines due to branch misprediction.
• Solutions: Branch prediction, speculative execution, predication.

(b) Data Flow Mechanism

• In dataflow architecture, instructions execute only when their input operands are
available (instead of strict program order).
• No central program counter → execution driven by data availability.
• Advantage: Naturally exposes parallelism.
• Example: Used in scientific computing, functional programming models.

(c) Demand-Driven / Reduction Flow

• Also called lazy evaluation.


• Instructions are executed only when their results are needed.
• Reduces unnecessary computation.
• Example: Used in some functional languages (Haskell) and AI inference engines.

(d) Systolic Flow

• Instructions and data move through a network of processors in a rhythmic fashion (like blood flow in the body).
• Each processor performs a simple operation and passes data to the next.
• Highly efficient for matrix operations, DSP, and AI accelerators.
• Example: Google TPU uses systolic arrays for deep learning.
(e) Speculative Flow

• Execution of instructions ahead of actual program flow based on predictions (e.g., branch prediction).
• If prediction correct → saves time.
• If wrong → rollback is required.
• Widely used in superscalar and out-of-order processors.

(f) Multithreaded Flow

• Program execution flow is divided into multiple threads, which may run
concurrently.
• Helps hide latency (memory stalls, I/O delays).
• Implemented in SMT (Simultaneous Multithreading), GPUs, and multicore
CPUs.
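
A minimal sketch of latency hiding with multiple threads (the sleep call stands in for a memory or I/O stall; the delays are illustrative): while one thread waits, others keep making progress.

```python
# Minimal sketch of multithreaded flow hiding latency: four 1-second waits
# overlap, so total elapsed time is about 1 second rather than 4.
import threading
import time

def slow_fetch(name, delay, results):
    time.sleep(delay)                  # stands in for a long-latency operation
    results[name] = f"{name} done after {delay}s"

results = {}
t0 = time.time()
threads = [
    threading.Thread(target=slow_fetch, args=(f"req{i}", 1.0, results))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
print(f"elapsed: {time.time() - t0:.1f}s")
```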

3. Comparison of Flow Mechanisms


Flow Mechanism | Execution Driven By | Advantage | Example Use
Control Flow | Program Counter (PC) | Simple, sequential execution | General-purpose CPUs
Data Flow | Data availability | Exposes high parallelism | Scientific computing
Demand/Reduction | Need for result | Avoids unnecessary computation | Functional programming
Systolic Flow | Data movement in array | Efficient for matrix/AI ops | TPUs, DSPs
Speculative Flow | Prediction-based execution | Hides control hazards | Modern superscalar CPUs
Multithreaded Flow | Multiple concurrent threads | Better resource utilization | GPUs, SMT processors

4. Conclusion
Program flow mechanisms are essential for efficient instruction execution in advanced
computer architectures.

• Control flow dominates traditional CPUs.


• Data flow and demand-driven models enable high parallelism.
• Systolic, speculative, and multithreaded flows power modern HPC, AI, and GPU-
based systems.
By combining these mechanisms, modern architectures achieve higher throughput, lower
latency, and better scalability.

➢ SYSTEM INTERCONNECT
ARCHITECTURES:

1. Introduction
In multiprocessor and multicomputer systems, multiple processors, memory modules, and
I/O devices must communicate efficiently.
The design of system interconnect architecture defines how these components are linked
and how data is transferred among them.

It plays a vital role in performance, scalability, latency, and bandwidth of parallel computing systems.

2. Requirements of Interconnect Architecture


• High Bandwidth – support simultaneous data transfers.
• Low Latency – minimize communication delays.
• Scalability – should support large number of processors.
• Fault Tolerance – handle failures gracefully.
• Cost-effectiveness – optimized complexity vs. performance.

3. Classification of Interconnect Architectures


(A) Bus-Based Interconnect

• A shared communication path where all processors and memory modules are
connected.
• Only one transfer at a time.
• Advantages: Simple, low cost.
• Disadvantages: Limited scalability, bus contention.
• Example: Early multiprocessors, small SMP systems.
(B) Crossbar Switch

• Each processor connected to each memory module via dedicated switches.


• Allows multiple simultaneous transfers.
• Advantages: High performance, low contention.
• Disadvantages: Expensive, complex (scales poorly with large N).
• Example: High-end servers, vector supercomputers.

(C) Multistage Interconnection Networks (MINs)

• Data passes through multiple switching stages between processors and memory.
• Popular topologies:
o Omega Network
o Butterfly Network
o Clos Network
• Advantages: Lower cost than crossbar, supports parallel transfers.
• Disadvantages: Blocking may occur (two transfers needing same path).

(D) Point-to-Point Interconnects

• Each processor/node connected directly to a subset of other processors using network topologies.
• Examples:
o Ring – simple, low-cost, but high latency for distant nodes.
o Mesh/2D-Torus – scalable, widely used in supercomputers.
o Hypercube – each node connected to log(N) neighbors, low diameter.
• Advantages: Highly scalable, good fault tolerance.
• Disadvantages: More complex routing, higher latency compared to crossbar.

(E) Hierarchical Interconnects

• Combine multiple interconnect schemes for better scalability.


• Example: Cluster of SMPs – each cluster uses a bus or crossbar internally, and
clusters are interconnected by mesh or tree.
• Used in large-scale HPC clusters and cloud data centers.

4. Examples in Modern Systems


• NUMA Systems (Non-Uniform Memory Access): Use point-to-point links like
AMD’s HyperTransport, Intel’s QuickPath Interconnect (QPI).
• GPUs & AI Accelerators: Use NVLink, Infinity Fabric for high bandwidth.
• Supercomputers: Use 3D Torus (IBM Blue Gene), Dragonfly (Cray XC series).

5. Comparison Table
Architecture Cost Performance Scalability Example Use
Bus Low Low Poor Small SMPs
Crossbar High High Poor (expensive) Supercomputers (small scale)
Multistage Networks Medium Medium-High Moderate Parallel computers
Mesh / Torus Medium High High HPC, GPUs
Hypercube Medium High High Research systems
Hierarchical Variable High Very High Cloud clusters, datacenters

6. Conclusion
System interconnect architectures are the backbone of multiprocessor and multicomputer
performance.

• Small systems → Bus-based.


• Medium systems → Crossbar or MIN.
• Large HPC/AI clusters → Mesh, Torus, Hypercube, Dragonfly.

Efficient interconnect design ensures low communication overhead, scalability, and maximum parallelism utilization in advanced computer architectures.
UNIT 2:

➢ PRINCIPLES OF SCALABLE
PERFORMANCE:

1. Introduction
In advanced computer systems, scalability means that performance should increase
proportionally when resources (processors, memory, interconnects) are increased.

The principles of scalable performance define the rules and design techniques that ensure
parallel systems can deliver higher throughput without bottlenecks.

2. Key Principles
(A) Workload Scalability

• A system is scalable if it can handle increasing problem sizes efficiently.


• The performance should grow with the number of processors (N).
• Requires parallelizable workloads (not dominated by sequential parts).
• Related to Amdahl’s Law and Gustafson’s Law.

(B) Balanced System Design

• Balance between:
o Computation power (CPU speed)
o Memory capacity & bandwidth
o I/O throughput
o Interconnect bandwidth
• Example: Adding more processors without increasing memory bandwidth leads to
bottlenecks.
(C) Efficient Resource Utilization

• All processors, memory modules, and network links should be kept busy.
• Avoid idle processors due to poor scheduling, communication delays, or load
imbalance.
• Techniques: Dynamic scheduling, load balancing, multithreading.

(D) Minimize Communication Overhead

• Performance should not be limited by communication costs between processors.


• Use low-latency, high-bandwidth interconnects.
• Reduce synchronization and message-passing delays.
• Example: Mesh/Torus interconnects in supercomputers.

(E) Scalable Synchronization

• Locks, barriers, and message passing must scale with processor count.
• Avoid centralized control (single lock = bottleneck).
• Use distributed synchronization, lock-free algorithms, atomic operations.

(F) Locality of Reference

• Programs should maximize data locality (access nearby memory more often).
• Reduce remote memory accesses in distributed systems.
• Caching, NUMA-aware memory allocation, and data partitioning improve scalability.

(G) Fault Tolerance and Reliability

• As systems scale, probability of failures increases.


• Scalable performance requires checkpointing, redundancy, error recovery to
maintain reliability.

3. Performance Models Supporting Scalability


• Amdahl’s Law → Limits speedup due to sequential fraction.
• Gustafson’s Law → Suggests scaling problem size enables higher speedup.
• Isoefficiency Function → Defines how problem size must grow to maintain
efficiency as processors increase.
4. Design Techniques for Scalability
• Parallel algorithms with minimal communication.
• Hierarchical system interconnects (mesh, torus, dragonfly).
• Dynamic load balancing.
• NUMA-aware memory design.
• Cache coherence protocols optimized for many cores.

5. Conclusion
The principles of scalable performance ensure that as processors and resources increase,
the system continues to deliver proportional improvements in throughput.

A scalable architecture balances computation, memory, communication, and synchronization while minimizing bottlenecks.

Thus, scalability is the foundation of modern HPC systems, cloud datacenters, and AI
accelerators.

➢ Performance Metrics and Measures

1. Introduction

Performance in computer architecture refers to how fast, efficient, and scalable a system is
in executing programs.
To evaluate and compare systems, we use metrics (quantitative values) and measures
(methods of evaluation).

2. Key Performance Metrics

(A) Execution Time (Response Time / Latency)

• Total time taken to complete a program.


• Formula:

$\text{Execution Time} = \dfrac{\text{Number of Instructions} \times \text{CPI}}{\text{Clock Rate}}$

where CPI = Cycles per Instruction.

• Lower execution time → higher performance.
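
A minimal sketch of the execution-time formula above, with illustrative numbers:

```python
# Minimal sketch of: Execution Time = (Instruction Count x CPI) / Clock Rate.
def execution_time(instruction_count, cpi, clock_rate_hz):
    return instruction_count * cpi / clock_rate_hz

# e.g., 2 billion instructions, average CPI of 1.5, on a 3 GHz clock:
t = execution_time(2e9, 1.5, 3e9)
print(f"{t:.2f} s")    # 1.00 s
```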

(B) Throughput (Bandwidth)

• Number of tasks completed per unit time.


• Important in servers, parallel systems, and multiprogramming.
• Example: A CPU completing 4 billion instructions per second has 4 GIPS
throughput.

(C) Speedup (S)

• Ratio of performance improvement compared to a baseline system.


• Formula:

$\text{Speedup} = \dfrac{\text{Execution Time}_{\text{old}}}{\text{Execution Time}_{\text{new}}}$

• Used in parallel computing to measure benefits of multiple processors.

(D) Efficiency (E)

• Measures how well multiple processors are utilized.


• Formula:

$\text{Efficiency} = \dfrac{\text{Speedup}}{\text{Number of Processors}}$

• High efficiency → less overhead, better scalability.
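
A minimal sketch combining the speedup and efficiency formulas above (the timings and processor count are illustrative):

```python
# Minimal sketch of Speedup = T_old / T_new and Efficiency = Speedup / N.
def speedup(t_old, t_new):
    return t_old / t_new

def efficiency(speedup_value, num_processors):
    return speedup_value / num_processors

t_serial, t_parallel, n = 120.0, 20.0, 8       # seconds, seconds, processors
s = speedup(t_serial, t_parallel)              # 6.0
e = efficiency(s, n)                           # 0.75 -> 75% of ideal
print(f"Speedup = {s:.1f}, Efficiency = {e:.2f}")
```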

(E) Utilization

• Percentage of time resources (CPU, memory, I/O) are active vs idle.


• Example: If CPU is busy 80% of the time → utilization = 0.8.
(F) CPI (Cycles per Instruction)

• Average number of clock cycles needed per instruction.


• Formula:

$\text{CPI} = \dfrac{\text{Total Clock Cycles}}{\text{Instruction Count}}$

• Lower CPI → better performance.

(G) MIPS (Million Instructions Per Second)

• Execution rate of instructions.


• Formula:

$\text{MIPS} = \dfrac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}}$

• Example: CPU executes 200M instructions in 1 sec → 200 MIPS.


• Limitation: Doesn’t account for instruction complexity.

(H) FLOPS (Floating Point Operations Per Second)

• Performance measure in scientific & AI applications.


• Example: 1 TFLOPS = 10¹² floating point operations per second.

(I) Reliability and Availability

• Reliability: Probability system works correctly without failure.


• Availability: Percentage of time system is operational.
• Important in datacenters, HPC, and mission-critical systems.

3. Parallel Performance Metrics

• Amdahl’s Law (Speedup Limit):

$\text{Speedup}(N) = \dfrac{1}{S + \frac{1-S}{N}}$

where S = sequential fraction.

• Gustafson’s Law (Scalability):


$\text{Speedup}(N) = N - S(N-1)$

• Isoefficiency Metric: Defines how problem size must increase with processors to
keep efficiency constant.

4. Performance Measurement Techniques

• Benchmarking: Running standard programs (SPEC, LINPACK, TPC).


• Simulation & Profiling: Measuring instruction mix, CPI, cache hits/misses.
• Monitoring Tools: Hardware counters, OS utilities (perf, top, etc.).

5. Conclusion

Performance metrics and measures provide a quantitative basis to:

• Compare different architectures,


• Identify bottlenecks,
• Improve scalability.

Execution time, throughput, speedup, efficiency, CPI, and FLOPS are core metrics, while
benchmarks and models (Amdahl, Gustafson) guide real-world evaluation.

➢ PARALLEL PROCESSING AND APPLICATIONS:

1. Introduction
• Parallel Processing refers to the simultaneous execution of multiple instructions or
tasks by dividing a problem into smaller parts and processing them concurrently.
• It overcomes the limitations of sequential execution and improves performance,
throughput, and scalability.
• Enabled by multiprocessors, multicomputers, SIMD, MIMD, vector processors,
GPUs, and cloud clusters.

2. Need for Parallel Processing


• Increasing demand for high performance computing (HPC).
• Physical limits of clock speed (heat, power consumption).
• Exploiting data-level, instruction-level, and task-level parallelism.
• Essential for real-time, large-scale, and data-intensive applications.

3. Types of Parallel Processing


1. Bit-level Parallelism – Operates on multiple bits simultaneously.
2. Instruction-level Parallelism (ILP) – Superscalar processors, pipelining.
3. Data-level Parallelism (DLP) – SIMD, vector processors, GPUs.
4. Task-level Parallelism (TLP) – Independent processes run in parallel (MIMD).
5. Thread-level Parallelism – Multithreading, multicore processors.

4. Architectures Supporting Parallel Processing


• Multiprocessors (Shared Memory) → Symmetric (SMP), NUMA.
• Multicomputers (Distributed Memory) → Message-passing clusters.
• SIMD Systems → Array processors, GPUs.
• MIMD Systems → Parallel servers, supercomputers.
• Hybrid Systems → CPU + GPU + accelerators.

5. Applications of Parallel Processing


(A) Scientific and Engineering Applications

• Weather forecasting & climate modeling.


• Molecular dynamics and drug discovery.
• Computational fluid dynamics (aerospace, automotive).
• Seismic data analysis for oil and gas exploration.

(B) Artificial Intelligence & Machine Learning

• Deep learning model training (GPUs, TPUs).


• Natural language processing (chatbots, translators).
• Computer vision (face recognition, medical imaging).

(C) Big Data and Analytics

• Real-time data mining and pattern recognition.


• Parallel database queries.
• Large-scale simulations and graph analytics.
(D) Industry and Engineering

• CAD/CAM simulations.
• Robotics and automation.
• Real-time control systems.

(E) Everyday Applications

• Parallel rendering in gaming and multimedia.


• Video encoding and streaming.
• Cloud services (Google, AWS, Azure).

6. Advantages of Parallel Processing


• Higher speedup and throughput.
• Handles large, complex problems.
• Cost-effective using clusters and GPUs.
• Improves system reliability with redundancy.

7. Challenges
• Programming complexity (parallel algorithms, synchronization).
• Communication overhead.
• Load balancing issues.
• Scalability limitations (Amdahl’s Law).

8. Conclusion
Parallel processing is the foundation of modern computing, powering everything from
scientific research to AI applications.
By leveraging multiprocessors, multicomputers, SIMD/MIMD architectures, and
scalable algorithms, advanced computer architectures can meet the growing demand for
high-speed, data-intensive, and real-time applications.
➢ SPEED UP PERFORMANCE LAWS:

1. Introduction
• Speedup measures how much faster a parallel system is compared to a single-
processor system.
• Performance laws help in evaluating, predicting, and optimizing parallel processing
systems.
• They explain limits of parallelism, scalability, and efficiency.

2. Amdahl’s Law (Fixed Workload Law)


• Focuses on the limits of speedup due to the sequential part of a program.
• Formula:

$\text{Speedup}(N) = \dfrac{1}{S + \frac{(1-S)}{N}}$

where:

o $S$ = fraction of sequential execution,
o $N$ = number of processors.
• Key Insight:
o Even with infinite processors, maximum speedup = $\frac{1}{S}$.
o Example: If 10% of a program is sequential ($S = 0.1$), maximum speedup = 10.

Shows parallelism is limited by the sequential bottleneck.
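
A minimal sketch of Amdahl's Law using the example above ($S = 0.1$; the processor counts are illustrative):

```python
# Minimal sketch of Amdahl's Law: Speedup(N) = 1 / (S + (1 - S) / N).
def amdahl_speedup(S, N):
    """S = sequential fraction, N = number of processors."""
    return 1.0 / (S + (1.0 - S) / N)

for N in (2, 8, 64, 1024):
    print(f"N={N:5d}  speedup={amdahl_speedup(0.1, N):.2f}")
# Speedup approaches (but never exceeds) 1/S = 10 as N grows.
```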

3. Gustafson’s Law (Scaled Speedup Law)


• Considers scaling problem size with more processors.
• Formula:

$\text{Speedup}(N) = N - S \times (N - 1)$

where:

o $S$ = sequential fraction,
o $N$ = number of processors.
• Key Insight:
o As workload increases with processor count, speedup grows nearly linearly.
o More realistic for large-scale parallel computing.

Shows parallelism is more beneficial when workload grows with system size.

4. Karp–Flatt Metric (Serial Fraction Estimation)


• Helps measure parallel overheads (synchronization, communication).
• Formula:

$e = \dfrac{\frac{1}{\text{Speedup}(N)} - \frac{1}{N}}{1 - \frac{1}{N}}$

where $e$ = effective serial fraction.

Useful for diagnosing performance bottlenecks in real systems.
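
A minimal sketch of the Karp–Flatt metric (the measured speedup and processor count are illustrative):

```python
# Minimal sketch of the Karp-Flatt metric: estimate the effective serial
# fraction e from a measured speedup on N processors.
def karp_flatt(measured_speedup, N):
    return (1.0 / measured_speedup - 1.0 / N) / (1.0 - 1.0 / N)

# Suppose a program achieves a speedup of 6.0 on 8 processors:
e = karp_flatt(6.0, 8)
print(f"effective serial fraction e = {e:.3f}")   # about 0.048
# If e grows as N increases, parallel overhead (not just the algorithm's
# serial part) is what limits scalability.
```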

5. Sun and Ni’s Law (Memory-Bounded Speedup)


• Extends Amdahl and Gustafson by considering memory constraints.
• Performance depends not only on processor count but also on memory capacity and
bandwidth.

Important in data-intensive applications (AI, Big Data).

6. Isoefficiency Law (Scalability Measure)


• Describes how problem size must increase with processors to maintain efficiency.
• If workload doesn’t scale, efficiency drops with larger $N$.

Guides design of scalable parallel algorithms.

7. Comparison of Speedup Laws


Law | Focus | Insight
Amdahl’s Law | Fixed problem size | Limits of parallelism due to sequential fraction.
Gustafson’s Law | Scaled problem size | Speedup grows with workload and processors.
Karp–Flatt Metric | Measured speedup efficiency | Estimates parallel overheads.
Sun & Ni’s Law | Memory-bounded workload | Highlights role of memory in scalability.
Isoefficiency | Algorithm scalability | Defines workload growth needed to sustain efficiency.

8. Conclusion
• Amdahl’s Law → pessimistic (bottleneck focus).
• Gustafson’s Law → optimistic (scalable workload).
• Karp–Flatt, Sun & Ni, Isoefficiency → practical for real-world HPC.

Together, these laws guide the design, evaluation, and optimization of advanced
computer architectures for parallelism.

➢ SCALABILITY ANALYSIS AND APPROACHES:

1. Introduction
• Scalability = ability of a parallel computer or algorithm to maintain performance
when resources (processors, memory, workload) increase.
• A system is scalable if speedup and efficiency improve proportionally with added
resources.
• Key for High-Performance Computing (HPC), Cloud, Big Data, and AI systems.

2. Scalability Metrics
1. Speedup (S): Ratio of serial execution time to parallel execution time.
2. Efficiency (E): Speedup per processor.

$E = \dfrac{\text{Speedup}}{N}$
3. Isoefficiency Function: Defines how problem size must grow with processors to
maintain efficiency.
4. Cost: Product of processors and execution time.
5. Scalability Factor: How performance increases when scaling both workload and
resources.

3. Scalability Analysis
To analyze scalability, we look at:

• Amdahl’s Law: Shows upper limits due to sequential parts (not scalable beyond
bottleneck).
• Gustafson’s Law: Demonstrates scalability when workload grows with processors.
• Isoefficiency Analysis: Evaluates how problem size must scale with resources.
• Bottleneck Identification: Communication, synchronization, memory bandwidth.

4. Approaches to Scalability
(A) Hardware Approaches

• Multiprocessors (SMP, NUMA): Efficient shared memory access.


• Multicomputers (Clusters): Distributed memory with message passing.
• Network Topologies: Mesh, Torus, Hypercube for reduced communication delay.
• Memory Hierarchy Optimization: Cache coherence, scalable interconnects.
• Specialized Accelerators: GPUs, TPUs for scalable parallel workloads.

(B) Software Approaches

• Parallel Algorithms: Designed to minimize sequential parts and overhead.


• Load Balancing: Distributing work evenly among processors.
• Efficient Synchronization: Minimizing locks and barriers.
• Message Passing (MPI) / Shared Memory (OpenMP): Programming models for
scalable parallelism.
• Task Scheduling: Static and dynamic scheduling for optimal resource use.

(C) Hybrid Approaches

• Heterogeneous Systems: CPU + GPU co-processing.


• Cloud & Distributed Computing: Elastic scaling of workloads.
• Grid and Cluster Computing: Integrating multiple systems for scalability.

5. Challenges in Scalability
• Communication Overhead: More processors = more data exchange.
• Synchronization Delays: Locks and barriers reduce efficiency.
• Load Imbalance: Some processors idle while others overloaded.
• Memory Bottleneck: Limited bandwidth and latency.
• Energy and Cost Constraints: Scaling hardware is expensive and power-hungry.

6. Conclusion
• Scalability is a key measure of parallel system effectiveness.
• An ideal scalable system maintains high efficiency, low overhead, and balanced
workload as processors increase.
• Achieved by combining hardware improvements (multiprocessors, networks) and
software optimizations (parallel algorithms, scheduling, load balancing).
• Future Scalability Approaches: AI-driven scheduling, quantum computing
integration, and energy-efficient parallelism.

➢ HARDWARE TECHNOLOGIES:

1. Introduction
• Hardware technologies form the foundation of parallel and scalable computing.
• They determine performance, energy efficiency, scalability, and communication
speed in modern architectures.
• Advances in processor design, memory systems, interconnection networks, and
accelerators drive next-generation computing.

2. Key Hardware Technologies


(A) Processor Technologies

• Multicore Processors: Multiple cores on a single chip for parallel execution.


• Manycore Architectures: Dozens or hundreds of cores (e.g., GPUs, Intel Xeon Phi).
• Superscalar Processors: Issue multiple instructions per clock cycle.
• Pipelining: Instruction-level parallelism (ILP) with multiple pipeline stages.
• Vector Processors: Operate on arrays of data in a single instruction (SIMD).
• Heterogeneous Processors: Combine CPUs, GPUs, and accelerators.
(B) Memory Technologies

• Cache Hierarchies (L1, L2, L3): Reduce latency in accessing frequently used data.
• NUMA (Non-Uniform Memory Access): Memory distributed across nodes for
scalability.
• High-Bandwidth Memory (HBM): 3D-stacked memory for faster data transfer
(used in GPUs, AI chips).
• Non-Volatile Memory (NVM): Combines speed of RAM with persistence of
storage.
• Memory Coherence Protocols: Ensure consistency across multiple processors
(MESI, MOESI).

(C) Interconnection Technologies

• Bus-Based Systems: Traditional shared communication, limited scalability.


• Crossbar Switches: High-speed connections between processors and memory.
• Network-on-Chip (NoC): Scalable interconnect for multicore processors.
• Cluster Interconnects: InfiniBand, Ethernet, Myrinet for HPC clusters.
• Optical Interconnects: High-speed, low-latency communication using light signals.

(D) Parallel Accelerators

• GPUs (Graphics Processing Units): Thousands of SIMD cores for massive data
parallelism.
• TPUs (Tensor Processing Units): Google’s AI accelerator for deep learning.
• FPGAs (Field Programmable Gate Arrays): Customizable hardware for specific
workloads.
• ASICs (Application-Specific Integrated Circuits): Highly optimized chips for tasks
like AI or cryptography.
• Quantum Processors: Exploit quantum mechanics for parallel computation
(emerging).

(E) Storage and I/O Technologies

• Solid-State Drives (SSDs): Faster access than HDDs, essential for HPC.
• Parallel File Systems: (Lustre, GPFS) for large-scale data access.
• NVMe over Fabrics: High-speed I/O across networks.

(F) Power and Cooling Technologies


• Dynamic Voltage and Frequency Scaling (DVFS): Balances performance and
energy.
• Liquid Cooling Systems: Required in large-scale supercomputers.
• Low-Power Architectures (ARM, RISC-V): Optimized for energy-efficient
parallelism.

3. Trends in Hardware Technologies


• Moore’s Law Slowing: Need for 3D chip stacking and new architectures.
• Heterogeneous Computing: Integration of CPUs, GPUs, and accelerators.
• Neuromorphic Hardware: Brain-inspired chips (IBM TrueNorth, Intel Loihi).
• Optical & Photonic Computing: Future for ultra-fast interconnects.
• Quantum Computing: Next-generation paradigm beyond classical scaling.

4. Conclusion
• Hardware technologies in advanced computer architecture enable parallelism, speed,
scalability, and efficiency.
• Innovations in processors, memory, interconnects, and accelerators are shaping the
future of HPC, AI, and cloud systems.
• Future progress lies in heterogeneous, energy-efficient, and specialized hardware.

PROCESS AND MEMORY HIERACHY:

1. Introduction
• In modern computer systems, processors
(CPUs/cores) and memory systems are
organized in a hierarchy to balance speed,
cost, and capacity.
• The hierarchy ensures that frequently used
data is quickly accessible while large data
sets are stored efficiently.
• Together, they enable high-performance
parallel computing.

2. Processor Hierarchy
Processors are organized to exploit parallelism
at multiple levels:
1.Instruction Level (ILP):
o Techniques like pipelining and
superscalar execution.
o Multiple instructions executed per
cycle.
2.Thread Level (TLP):
o Multithreading (e.g., SMT, Hyper-
Threading).
o Multiple threads share processor
resources.
3.Core Level (CMP – Chip Multiprocessors):
o Multiple cores on a single chip (dual-

core, quad-core, many-core).


o Support for parallel execution of tasks.
4.Processor Level:
o Multiprocessor systems (SMP,
NUMA).
o Multiple processors connected via
shared memory or interconnect
networks.
5.Cluster Level:
o Multiple computers connected in a
distributed system.
o Parallelism across nodes (e.g.,
supercomputers, data centers).

3. Memory Hierarchy
Memory is arranged to minimize latency and
cost, while maximizing speed and capacity.
Memory Hierarchy Levels:
1.Registers:
o Inside the CPU, fastest, smallest storage (nanoseconds).
o Store operands for immediate execution.
2.Cache Memory:
o L1 Cache: Closest to the core, smallest but fastest.
o L2 Cache: Larger, slower than L1.
o L3 Cache: Shared across multiple cores, bigger but slower.
o Reduces the “memory wall” problem.
3.Main Memory (RAM – DRAM):
o Holds active programs and data.
o Slower than cache but larger in size.
4.Secondary Storage (SSD/HDD):
o Permanent storage, large capacity.
o Much slower than main memory.
5.Tertiary Storage (Cloud/Archival, Tape):
o Very large, cheap storage for backups.
o Extremely slow access.

4. Process–Memory Interaction
• Processor Speed vs. Memory Speed Gap
(Memory Wall): CPUs are much faster than
memory, requiring caches to reduce
latency.
• Locality of Reference: Programs repeatedly
access the same memory locations (spatial
and temporal locality) → basis for caching
(illustrated in the sketch after this list).
• NUMA (Non-Uniform Memory Access):
In multiprocessors, memory is distributed,
and access speed depends on location.
• Virtual Memory: Provides abstraction so
programs use more memory than physically
available, managed via paging.
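To make the locality-of-reference point concrete, the minimal C sketch below (array size and function names are illustrative assumptions) sums the same matrix twice. The row-major loop touches consecutive addresses and benefits from spatial locality, so on typical cached hardware it runs noticeably faster than the column-major loop, which jumps N elements between accesses.

#include <stdio.h>

#define N 1024
static double a[N][N];

/* Row-major traversal: consecutive addresses, good spatial locality. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: stride of N doubles, far more cache misses. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}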

5. Importance in Advanced Computer Architecture
• Optimized hierarchy = high throughput,
low latency, and efficient scalability.
• Parallel applications rely heavily on fast
inter-core communication and memory
bandwidth.
• Modern systems use HBM (High
Bandwidth Memory), 3D-stacked memory,
and memory coherence protocols to
enhance performance.
6. Conclusion
• Processor hierarchy exploits parallelism
(ILP, TLP, multiprocessors, clusters).
• Memory hierarchy bridges the gap between
fast processors and slow memory, ensuring
efficiency.
• Together, they form the backbone of
advanced computing performance in HPC,
AI, and big data systems.

➢ ADVANCED PROCESSOR
TECHNOLOGY:

1. Introduction
• Advanced processor technologies aim to improve performance, efficiency, and
scalability of modern computer systems.
• Focus areas include instruction-level parallelism, multicore design, heterogeneous
processing, pipelining, and specialized accelerators.
• They are the backbone of high-performance computing (HPC), AI, cloud, and
embedded systems.

2. Key Advanced Processor Technologies


(A) Superscalar Processors

• Issue multiple instructions per clock cycle.


• Uses instruction-level parallelism (ILP).
• Requires branch prediction, register renaming, and out-of-order execution.
• Example: Intel Core i7, AMD Ryzen.

(B) Pipelined Processors

• Divide instruction execution into stages (fetch, decode, execute, memory, write-
back).
• Multiple instructions execute concurrently in different pipeline stages.
• Increases throughput but may face hazards (data, control, structural).

(C) VLIW (Very Long Instruction Word) Processors

• Compiler packs multiple independent operations into one long instruction.


• Parallelism handled at compile-time (not hardware).
• Used in DSPs (Digital Signal Processors) and Itanium processors.

(D) Multicore and Manycore Processors

• Multicore: 2–16 cores per chip (mainstream CPUs).


• Manycore: Dozens to thousands of cores (GPUs, accelerators).
• Enable thread-level parallelism (TLP).
• Example: AMD EPYC (64 cores), NVIDIA GPUs.

(E) Heterogeneous Processors

• Combine different types of cores for different workloads.


• Example:
o CPU + GPU (NVIDIA CUDA, AMD ROCm).
o ARM big.LITTLE architecture (high-power + low-power cores).
o AI accelerators (TPUs, NPUs).

(F) Out-of-Order Execution (OOE)

• Instructions executed based on availability of operands, not strict order.


• Improves pipeline utilization.
• Requires register renaming and scheduling logic.

(G) Speculative Execution


• Processor predicts and executes instructions before knowing branch outcome.
• If prediction is correct → speed-up.
• If wrong → rollback and discard.
• Used in modern Intel/AMD CPUs (Spectre/Meltdown vulnerabilities exploit this).

(H) Parallel and Vector Processors

• Vector Processors (SIMD): Operate on arrays in one instruction.


• SIMT (Single Instruction, Multiple Threads): Used in GPUs.
• Enable data-level parallelism for AI, graphics, and HPC workloads.

(I) Emerging Technologies

• Quantum Processors: Exploit quantum states for massive parallelism.


• Neuromorphic Processors: Brain-inspired architectures (Intel Loihi, IBM
TrueNorth).
• RISC-V Processors: Open-source ISA enabling customization.

3. Trends in Advanced Processor Technology


• Shift from frequency scaling → parallelism scaling (Moore’s Law slowing).
• Energy-efficient processors (ARM, RISC-V gaining popularity).
• 3D Chip stacking (Chiplets, HBM with CPUs/GPUs).
• Domain-specific accelerators (AI, ML, cryptography).

4. Conclusion
• Advanced processor technologies focus on parallelism (ILP, TLP, DLP), energy
efficiency, and specialization.
• They are essential for scalable, high-performance computing in fields like AI, big
data, cloud, and supercomputing.
• The future lies in heterogeneous, domain-specific, and quantum-inspired
architectures.
➢ SUPERSCALAR AND VECTOR PROCESSORS:

1. Introduction
In advanced computer architecture, superscalar and vector processors are two powerful
techniques designed to enhance parallel execution and improve performance.

• Superscalar processors exploit Instruction-Level Parallelism (ILP).


• Vector processors exploit Data-Level Parallelism (DLP).

2. Superscalar Processors
Definition:

A superscalar processor can issue and execute multiple instructions per clock cycle using
multiple functional units.

Characteristics:

• Uses instruction pipelines (Fetch, Decode, Execute).


• Employs branch prediction, register renaming, and out-of-order execution.
• Executes scalar instructions but in parallel.

Advantages:

• Higher instruction throughput.


• General-purpose: works on any program.

Examples:

• Intel Pentium, Core i7, AMD Ryzen.

3. Vector Processors
Definition:

A vector processor executes a single instruction on multiple data elements


simultaneously (SIMD model).
Characteristics:

• Operates on vectors/arrays of data instead of single values.


• Uses vector registers and vector pipelines.
• Specially designed for scientific, AI, and multimedia workloads.

Advantages:

• Very high performance for repetitive numerical computations.


• Reduces instruction overhead by processing large data sets with one instruction.

Examples:

• Cray supercomputers, modern GPUs (SIMD/SIMT).

4. Superscalar vs Vector Processors (Comparison Table)


Feature | Superscalar Processor | Vector Processor
Parallelism Type | Instruction-Level Parallelism (ILP) | Data-Level Parallelism (DLP)
Operation | Executes multiple independent instructions | Executes one instruction on multiple data
Hardware Support | Multiple functional units, branch prediction | Vector registers, vector pipelines
Programming | Works with normal scalar code | Requires vectorizable code (loops, arrays)
Best Suited For | General-purpose workloads (OS, apps) | Scientific computing, AI, graphics
Example | Intel Core, AMD Ryzen CPUs | Cray supercomputers, GPUs (CUDA/OpenCL)

5. Conclusion
• Superscalar processors improve speed by exploiting ILP across multiple
instructions.
• Vector processors accelerate performance by exploiting DLP across large datasets.
• Modern systems often combine both:
o CPUs → Superscalar execution.
o GPUs/Accelerators → Vector/SIMD processing.
UNIT-3

SHARED MEMORY ORGANISATIONS:

1. Introduction
• In parallel computer systems, processors often need to communicate and share
data.
• A shared memory organization provides a common address space accessible by all
processors.
• It enables inter-processor communication through read/write operations instead of
explicit message passing.

2. Characteristics of Shared Memory Systems


• Single global memory accessible to all processors.
• Uniform address space: Each processor uses the same address to access a memory
location.
• Cache coherence protocols (like MESI) are needed to maintain data consistency.
• Synchronization mechanisms (locks, semaphores, barriers) prevent race conditions.

3. Types of Shared Memory Organizations


1. Uniform Memory Access (UMA)
o All processors have equal access time to memory.
o Memory is centralized.
o Suitable for small-scale multiprocessors.
o Example: Symmetric Multiprocessing (SMP).
2. Non-Uniform Memory Access (NUMA)
o Memory is distributed across nodes but forms a single logical address space.
o Accessing local memory is faster than remote memory.
o Scales better for large systems.
o Example: Modern multi-core servers (AMD EPYC, Intel Xeon NUMA
systems).
3. Cache-Only Memory Architecture (COMA)
o No centralized memory, each node’s memory acts like a large cache.
o Data migrates dynamically to where it is needed.
o Useful for reducing remote memory bottlenecks.

4. Advantages
• Easier programming model (shared variables instead of explicit messaging).
• Faster communication compared to message-passing systems.
• Efficient for tightly coupled multiprocessors.

5. Challenges
• Scalability issues: Memory bus can become a bottleneck as processor count
increases.
• Cache coherence overhead: Maintaining consistency across many caches is
complex.
• Synchronization delays: Multiple processors competing for the same data can cause
contention.

6. Applications
• Widely used in:
o Multicore CPUs (Intel, AMD processors).
o High-performance servers.
o Parallel programming models like OpenMP and Pthreads.
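As a small illustration of the shared-variable model used by OpenMP, the C sketch below (array size and variable names are illustrative assumptions; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp) lets all cores accumulate into one shared result through a reduction instead of exchanging messages.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) x[i] = 1.0;   /* data lives in the shared address space */

    /* Threads split the loop; OpenMP combines the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}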

7. Conclusion
• Shared memory organizations are fundamental in multiprocessor systems.
• UMA works for small multiprocessors, while NUMA and COMA improve
scalability for larger systems.
• Despite challenges like coherence and contention, shared memory remains a core
architecture for parallel computing.
➢ SEQUENTIAL AND WEAK
CONSISTENCY MODELS:

1. Introduction
In shared memory multiprocessor systems, multiple processors may read and write shared
variables.

• To ensure correctness, the system must define how memory operations appear to all
processors.
• This is governed by memory consistency models.

2. Sequential Consistency (SC)


Definition:

A memory system is sequentially consistent if:

“The result of execution is the same as if all operations were executed in some sequential
order, and the operations of each processor appear in this sequence in the order issued by
the program.” — Lamport (1979)

Key Properties:

• All processors see memory operations in the same global order.


• Program order is preserved.
• Easy for programmers to reason about.

Example:

If Processor P1 executes A=1; B=1;


and Processor P2 executes print(B); print(A);
→ Under SC, P2 cannot see B=1 before A=1.

Advantages:

• Intuitive and simple.


• Easy to debug and predict.

Disadvantages:

• Expensive to implement in hardware.


• Reduces performance because it restricts out-of-order execution and aggressive
memory optimizations.

3. Weak Consistency (WC)


Definition:

A system is weakly consistent if:

• Memory operations may not be immediately visible to all processors.


• Synchronization operations (locks, barriers, fences) enforce consistency.

Key Properties:

• Relaxed ordering of memory operations.


• Writes may be delayed or reordered until a synchronization point.
• Requires programmers to insert explicit synchronization instructions.

Example:

If P1 writes A=1; B=1;


and P2 reads B then A without synchronization,
→ P2 may see B=1 but A=0.
Consistency is only guaranteed after a synchronization barrier.
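The same pattern can be sketched with C11 atomics (the variable names A and B and the thread bodies are illustrative assumptions, not a definitive implementation). With memory_order_seq_cst the reader can never observe B=1 together with A=0; with memory_order_relaxed that outcome is allowed unless explicit fences or acquire/release operations mark the synchronization points.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int A = 0, B = 0;

int writer(void *arg) {                   /* plays the role of P1 */
    (void)arg;
    /* Sequentially consistent stores: every thread agrees A=1 happened before B=1. */
    atomic_store_explicit(&A, 1, memory_order_seq_cst);
    atomic_store_explicit(&B, 1, memory_order_seq_cst);
    /* Under a weak model these would be memory_order_relaxed stores, with an
       atomic_thread_fence(memory_order_release) inserted before publishing B. */
    return 0;
}

int reader(void *arg) {                   /* plays the role of P2 */
    (void)arg;
    int b = atomic_load_explicit(&B, memory_order_seq_cst);
    int a = atomic_load_explicit(&A, memory_order_seq_cst);
    printf("B=%d A=%d\n", b, a);          /* "B=1 A=0" is impossible under SC ordering */
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, writer, NULL);
    thrd_create(&t2, reader, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}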

Advantages:

• Higher performance (allows caching, pipelining, out-of-order execution).


• Scales better for large multiprocessors.

Disadvantages:

• Harder to program.
• Bugs may occur if synchronization is missing.

4. Comparison: Sequential vs Weak Consistency


Feature | Sequential Consistency (SC) | Weak Consistency (WC)
Ordering | Strict global ordering | Relaxed ordering
Programmer Effort | Easy (automatic consistency) | Hard (requires explicit sync)
Performance | Lower (restricts optimizations) | Higher (allows reordering, caching)
Use Case | Debugging, correctness, simple models | High-performance, large systems

5. Conclusion
• Sequential consistency ensures correctness but limits performance.
• Weak consistency improves scalability and speed but requires careful
synchronization.
• Modern architectures (x86, ARM, GPUs) use weaker consistency models with
explicit synchronization instructions to balance performance and correctness.

➢ PIPELINING AND SUPERSCALAR TECHNIQUES:

1. Introduction
Modern processors aim to achieve high performance by exploiting parallelism in
instruction execution. Two important methods are:

• Pipelining → Improves throughput by overlapping instruction execution.


• Superscalar techniques → Allow multiple instructions to be issued and executed per
clock cycle.

2. Pipelining
Definition

• A technique in which the instruction execution process is divided into stages, and
different instructions are executed in parallel at different stages.
• Similar to an assembly line in manufacturing.

Basic Pipeline Stages (RISC example)

1. IF – Instruction Fetch
2. ID – Instruction Decode & Operand Fetch
3. EX – Execute (ALU operations)
4. MEM – Memory Access
5. WB – Write Back
Advantages

• Increases throughput (more instructions per unit time).


• Better utilization of processor resources.

Challenges

• Hazards limit efficiency:


o Structural hazards – resource conflicts.
o Data hazards – dependencies between instructions.
o Control hazards – branch mispredictions.

3. Superscalar Techniques
Definition

• A superscalar processor can issue and execute multiple instructions per clock
cycle by using multiple functional units.

Key Features

• Instruction Fetching: Can fetch multiple instructions per cycle.


• Instruction Issue: Can dispatch multiple instructions to functional units.
• Dynamic Scheduling: Out-of-order execution to maximize resource usage.
• Multiple ALUs, FPUs, and Load/Store units.

Example:

• Intel Pentium (dual-issue superscalar).


• Modern processors (Intel Core, AMD Ryzen) → up to 4–8 instructions per cycle.

Advantages

• Higher Instruction Level Parallelism (ILP).


• Better performance compared to simple pipelines.

Challenges

• Complex hardware design.


• Dependency checking and instruction scheduling overhead.
• Diminishing returns beyond a certain instruction issue width.

4. Comparison: Pipelining vs Superscalar


Feature | Pipelining | Superscalar
Parallelism Type | Overlaps stages of multiple instructions | Executes multiple instructions per cycle
Throughput | 1 instruction per cycle (ideal) | >1 instruction per cycle
Complexity | Simple, easier to design | More complex (scheduling, hazard handling)
Goal | Increase throughput via overlap | Increase ILP via multiple issue

5. Conclusion
• Pipelining improves performance by overlapping instruction execution stages.
• Superscalar techniques go further, allowing multiple instructions to be executed in
parallel per cycle.
• Together, these form the backbone of modern CPU architectures, enabling high-
performance computing.

➢ LINEAR PIPELINE PROCESSORS:

1. Introduction
• Pipelining is a fundamental technique used to improve CPU performance.
• A Linear Pipeline Processor is the simplest form of pipeline where:
o Processing stages are arranged linearly (sequentially).
o Each stage performs a part of the computation.
o Instructions (or data) flow step by step through these stages, just like in an
assembly line.

2. Definition
A Linear Pipeline Processor is a sequence of processing stages connected in a linear
manner, where:

• Each stage receives input from its immediate predecessor.


• Produces output for its immediate successor.
• No feedback paths or branching inside the pipeline.
3. Structure
• Divided into k stages:
o Each stage performs a sub-operation.
o Operates in parallel with other stages (on different instructions).
• Example (Instruction Pipeline):
1. IF – Instruction Fetch
2. ID – Instruction Decode
3. EX – Execution
4. MEM – Memory Access
5. WB – Write Back

At a given clock cycle:

• Stage 1 works on instruction i+4.


• Stage 2 works on instruction i+3.
• … and so on.

4. Characteristics
• Linear flow of data/instructions.
• Single entry and single exit.
• No branching or feedback loops within the pipeline.
• Throughput improves as multiple instructions are overlapped.

5. Advantages
1. High throughput – One result per cycle (after initial filling).
2. Simple design – Easy to implement and control.
3. Resource utilization – Each stage works in parallel.
4. Scalability – Can add more stages to increase performance.

6. Limitations
1. Pipeline hazards:
o Structural (resource conflicts).
o Data (RAW, WAR, WAW dependencies).
o Control (branch misprediction).
2. Stalling occurs if dependencies are not resolved.
3. Performance gain is limited by the slowest stage.
4. Not flexible for irregular tasks (works best for repetitive tasks).
7. Applications
• Instruction pipelines in RISC processors.
• Arithmetic pipelines (e.g., floating-point adders, multipliers).
• Image and signal processing where sequential stages are common.

8. Example
Suppose we have a 4-stage linear pipeline:

1. Fetch → 2. Decode → 3. Execute → 4. Write Back

➢ For a stream of 5 instructions:


➢ Instruction 1 finishes in 4 cycles.
➢ After pipeline fill, one instruction finishes per cycle.

➢ NON-LINEAR PIPELINE
PROCESSORS:

1. Introduction
• A pipeline processor can be linear (simple sequential flow) or non-linear (complex
paths).
• In contrast to linear pipelines, which have a single input → sequential stages →
single output,
Non-Linear Pipeline Processors have:
o Multiple inputs/outputs,
o Branching, merging, or feedback paths.

This allows them to handle more complex computations beyond straight-line processing.

2. Definition
A Non-Linear Pipeline Processor is a pipeline in which:
• Stages are not arranged in a strict sequence.
• Data/instructions may split into multiple paths, recombine, or even loop back.
• It supports conditional execution, parallel branches, and feedback loops.

3. Structure
• Composed of stages connected in non-linear ways:
o Branching → One stage may feed into multiple next stages.
o Merging → Multiple paths may join into one stage.
o Feedback → Output of a stage may return as input to a previous stage.

Examples:

• Arithmetic pipelines for iterative operations.


• Multipath instruction pipelines (handling conditional branches).
• VLSI circuits with complex control flow.

4. Characteristics
• Flexible execution paths.
• Supports parallel as well as conditional tasks.
• Higher hardware complexity compared to linear pipelines.
• Can improve performance for non-sequential tasks.

5. Advantages
1. Can handle complex operations (loops, conditionals, multi-output computations).
2. Greater flexibility compared to linear pipelines.
3. Supports parallel branches for higher throughput.
4. Useful in VLSI design, multiprocessors, and complex data processing.

6. Limitations
1. High complexity in design and control logic.
2. Pipeline hazards are harder to manage (especially with feedback loops).
3. Difficult to balance stages (some paths may become bottlenecks).
4. Requires sophisticated scheduling and dependency resolution.
7. Applications
• Computer graphics and multimedia → where multiple paths of computation exist.
• Signal and image processing → where iterative and conditional operations are
common.
• VLSI and multiprocessor systems → for flexible dataflow architectures.
• Instruction pipelines with branch prediction and speculative execution.

8. Comparison: Linear vs Non-Linear Pipelines


Feature | Linear Pipeline | Non-Linear Pipeline
Structure | Sequential, single path | Branching, merging, feedback allowed
Complexity | Simple | Complex
Flexibility | Low | High
Performance | Good for regular tasks | Better for irregular/complex tasks
Hazard Handling | Easier | More difficult
Applications | RISC instruction pipelines, ALUs | VLSI design, multipath processors, graphics

➢ INSTRUCTION PIPELINE DESIGN:

1. Introduction
• Instruction execution in a processor involves multiple steps (fetch, decode, execute,
etc.).
• Instead of executing one instruction at a time, pipelining allows overlapping these
steps for multiple instructions.
• Instruction Pipeline Design is the method of organizing CPU stages to execute
several instructions concurrently, improving throughput.

2. Basic Idea
• Divide instruction execution into k stages.
• Each stage performs a specific sub-task.
• Different instructions are processed simultaneously in different stages.
• Works like an assembly line in a factory.

Example (5-stage RISC pipeline):

1. IF (Instruction Fetch) – get instruction from memory.


2. ID (Instruction Decode & Register Read) – decode opcode, read registers.
3. EX (Execute / ALU Operation) – perform arithmetic/logical operation.
4. MEM (Memory Access) – read/write data from memory.
5. WB (Write Back) – store results back to registers.

3. Instruction Pipeline Design Steps


1. Partitioning of instruction cycle
o Divide instruction execution into stages of nearly equal duration.
2. Stage organization
o Assign hardware units to each stage (e.g., ALU in EX stage).
3. Control design
o Ensure instructions flow correctly through stages.
4. Hazard handling mechanisms
o Detect and resolve conflicts (stalling, forwarding, branch prediction).
5. Performance optimization
o Balance stages, minimize delays, maximize throughput.

4. Hazards in Instruction Pipeline


1. Structural Hazards – Resource conflicts (e.g., one memory for both instruction and
data).
2. Data Hazards – Dependencies between instructions:
o RAW (Read After Write)
o WAR (Write After Read)
o WAW (Write After Write)
3. Control Hazards – Arise from branch/jump instructions.

Solutions:

• Hardware techniques: Forwarding, hazard detection units, branch prediction.


• Software techniques: Instruction scheduling, loop unrolling.

5. Performance Metrics
• Pipeline Throughput (TP): Number of instructions completed per unit time.
• Pipeline Latency (L): Time taken for one instruction to complete (pipeline depth ×
clock cycle).
• Speedup (S): Ratio of execution time without pipeline to execution time with
pipeline.
o Ideal speedup = number of stages.
• Efficiency (E): E = Actual speedup / Ideal speedup.
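A quick worked example (assuming all k stages take one clock cycle and no stalls occur): n instructions finish in k + n - 1 cycles instead of n × k, so S = (n × k) / (k + n - 1). For k = 5 and n = 100, S = 500 / 104 ≈ 4.8, giving an efficiency of about 4.8 / 5 = 0.96.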

6. Advantages
• High instruction throughput.
• Better utilization of CPU hardware.
• Increases system performance without faster clock speeds.
• Foundation for RISC design and superscalar architectures.

7. Limitations
• Suffers from hazards → stalls and bubbles reduce performance.
• Complexity increases with deeper pipelines.
• Not every instruction benefits equally (e.g., branches cause disruptions).

8. Applications
• Used in modern RISC processors (MIPS, ARM).
• Basis for superscalar processors (multiple pipelines in parallel).
• Used in vector processors and VLIW architectures.

➢ ARITHMETIC PIPELINE DESIGN:


1. Introduction
• An Arithmetic Pipeline is a type of processor pipeline designed specifically for
performing arithmetic operations (e.g., addition, multiplication, division, floating-
point calculations).
• The main goal is to overlap the execution of arithmetic sub-operations so that
multiple operations can be carried out simultaneously.
• Widely used in scientific computing, graphics, and signal processing where
repetitive arithmetic operations are common.

2. Definition
Arithmetic Pipeline Design is the process of dividing an arithmetic operation into
sequential stages, where each stage performs a part of the operation, allowing multiple
operands to be processed concurrently.

Example: A floating-point addition can be divided into stages:

1. Align exponents
2. Add mantissas
3. Normalize result
4. Round and store

3. Basic Structure
An arithmetic pipeline consists of:

• Input stage – Receives operands.


• Intermediate stages – Perform partial arithmetic computations.
• Output stage – Produces the final result.

Each stage is separated by pipeline registers to hold intermediate results.

4. Example Designs
(a) Floating-Point Addition Pipeline

1. Exponent Comparison – Align exponents of two numbers.


2. Mantissa Alignment – Shift mantissa of smaller exponent.
3. Mantissa Addition/Subtraction – Perform binary operation.
4. Normalization – Adjust result to standard floating-point form.
5. Rounding – Round to nearest representable value.
6. Result Write-Back – Store final output.
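A small worked example of these stages (operand values chosen only for illustration): to add 1.5 × 2^3 and 1.25 × 2^1, the smaller operand is first aligned to 0.3125 × 2^3, the mantissas are added to give 1.8125 × 2^3 = 14.5, the result is already normalized (mantissa in [1, 2)), and after rounding it is written back.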

(b) Floating-Point Multiplication Pipeline

1. Exponent Addition – Add exponents of operands.


2. Mantissa Multiplication – Multiply mantissas.
3. Normalization – Adjust result to valid floating-point format.
4. Rounding – Apply rounding rules.
5. Result Storage – Write back to register/memory.

5. Performance
• Latency (L): Time taken for one operation to complete.
• Throughput (TP): Number of operations completed per unit time.
o After the pipeline is filled, one arithmetic operation can be completed per
clock cycle.
• Speedup: Pipeline can speed up arithmetic-intensive tasks significantly.

6. Advantages
• High throughput for repetitive arithmetic tasks.
• Efficient hardware utilization.
• Parallelism at the micro-operation level.
• Very useful in scientific and engineering applications requiring floating-point
operations.

7. Limitations
• Complexity of pipeline control increases.
• Pipeline hazards (e.g., data dependency between arithmetic operations).
• Fixed pipeline stages → not flexible for all types of arithmetic operations.
• Uneven stage delays can cause imbalance and reduce efficiency.

8. Applications
• Floating-point units (FPUs) in modern processors.
• Digital Signal Processing (DSP) – Fast FFT, convolution, filtering.
• Graphics Processing Units (GPUs) – Vector/matrix arithmetic.
• Scientific supercomputers – Weather forecasting, simulations, AI computations.
➢ SUPERSCALAR PIPELINE DESIGN:

1. Introduction
• Superscalar architecture improves CPU performance by allowing multiple
instructions to be issued and executed in parallel per clock cycle.
• Unlike scalar processors (which fetch/issue one instruction per cycle), a superscalar
processor may fetch/issue 2, 4, or more instructions per cycle.
• To achieve this, the CPU uses parallel pipelines and advanced hardware techniques
for dependency checking and scheduling.

2. Definition
A Superscalar Pipeline is a processor design that uses multiple parallel execution
pipelines within the CPU to execute more than one instruction per clock cycle, subject to
dependency and resource availability.

3. Key Features
1. Multiple Instruction Fetch & Decode Units – Can fetch/issue several instructions at
once.
2. Parallel Execution Pipelines – ALUs, FPUs, and Load/Store units operate
simultaneously.
3. Dynamic Scheduling – Detects instruction dependencies at runtime (Tomasulo’s
algorithm, scoreboarding).
4. Out-of-Order Execution – Instructions may execute out of program order to
maximize pipeline utilization.
5. Speculative Execution – Predicts branches to avoid pipeline stalls.

4. Structure of Superscalar Pipeline


A superscalar pipeline consists of:

1. Instruction Fetch Unit (IFU): Fetches multiple instructions per cycle.


2. Instruction Decode / Dispatch Unit: Decodes instructions and checks for
dependencies.
3. Instruction Scheduling Unit: Determines which instructions can issue
simultaneously.
4. Multiple Execution Units:
o Integer ALU pipelines
o Floating-point pipelines
o Load/Store pipelines
o Branch units
5. Commit Unit: Ensures results are written back in correct program order
5. Example
A 4-way superscalar processor can issue and execute up to 4 instructions per cycle:

• Cycle 1: Fetch 4 instructions


• Cycle 2: Decode & issue all 4 (if no hazards)
• Cycle 3+: Parallel execution in separate pipelines

6. Performance
• Ideal Speedup: If an n-way superscalar CPU can issue n instructions per cycle →
theoretical speedup = n × scalar performance.
• Practical Limitation: Dependency hazards, cache misses, and branch mispredictions
reduce actual speedup.

7. Advantages
• High Instruction-Level Parallelism (ILP) → better throughput.
• Faster execution of general-purpose programs.
• Transparent to programmers (no need for explicit parallel code).

8. Limitations
• Hardware complexity (dependency checking, branch prediction).
• Diminishing returns – beyond 4–8 pipelines, parallelism is limited by available ILP
in programs.
• Higher power consumption and chip area.

9. Applications
• Almost all modern general-purpose CPUs (Intel Core, AMD Ryzen, ARM Cortex,
Apple M-series).
• High-performance servers and workstations.
• Advanced embedded processors.
UNIT-4

➢ PARALLEL AND SCALABLE ARCHITECTURES:
1. Introduction
• Modern computing requires handling large-scale, data-intensive, and high-
performance applications (e.g., AI, big data, simulations).
• Single-processor (uniprocessor) systems have performance limits due to power,
speed, and memory bottlenecks.
• Parallel Architectures exploit multiple processing units to work simultaneously.
• Scalable Architectures ensure that performance increases proportionally when more
processors/resources are added.

2. Parallel Architecture
Parallel architecture is the organization of a computer system that uses multiple processors
or processing elements to execute tasks concurrently.

Types of Parallelism:

1. Data Parallelism – Same operation performed on different data (e.g., SIMD, GPUs).
2. Task Parallelism – Different tasks executed in parallel (e.g., MIMD
multiprocessors).
3. Pipeline Parallelism – Breaking tasks into pipeline stages (e.g., superscalar, VLIW).

Classification (Flynn’s Taxonomy):

• SISD – Single Instruction, Single Data (traditional CPU).


• SIMD – Single Instruction, Multiple Data (vector processors, GPUs).
• MISD – Multiple Instruction, Single Data (rare, fault-tolerant systems).
• MIMD – Multiple Instruction, Multiple Data (multiprocessors, clusters).

3. Scalable Architecture
A system is scalable if its performance grows with the addition of more processors/resources.
Scalability Aspects:

1. Processor Scalability – Add more CPUs without major redesign.


2. Memory Scalability – Support larger memory efficiently.
3. Interconnect Scalability – High-speed communication among processors.
4. Software Scalability – Programs and OS must utilize hardware scaling.

4. Models of Parallel and Scalable Architectures


1. Shared Memory Systems
o Processors share a common memory.
o Easy programming model.
o Example: SMP (Symmetric Multiprocessors).
2. Distributed Memory Systems
o Each processor has private memory.
o Communication via message passing.
o Example: Cluster computing, MPI.
3. Hybrid Systems
o Combination of shared + distributed memory.
o Example: Modern supercomputers (NUMA, cc-NUMA).

5. Interconnection Networks for Scalability


• Bus-based systems (low scalability).
• Crossbar switches (high cost, but high performance).
• Multistage networks (Omega, Butterfly, Clos networks).
• Mesh/Hypercube topologies (used in supercomputers).

6. Principles of Scalability
• Minimize communication overhead.
• Avoid bottlenecks (memory, network).
• Balance computation and communication.
• Use load balancing strategies.

7. Applications
• Supercomputers – Weather prediction, molecular modeling.
• AI/ML Systems – Training deep neural networks on GPUs/TPUs.
• Big Data Processing – Hadoop, Spark on distributed clusters.
• Cloud Computing – Scalable virtualization and container systems.

8. Advantages
• Handles large-scale problems.
• High throughput and reduced execution time.
• Scalable to meet future demands.

9. Limitations
• Complex hardware and software design.
• Communication overhead in large-scale systems.
• Amdahl’s Law limits maximum speedup.
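To make the Amdahl's Law point concrete (a standard formulation, with the 95% figure chosen only for illustration): if a fraction p of a program is parallelizable over n processors, the speedup is S(n) = 1 / ((1 - p) + p/n). With p = 0.95 and n = 100, S ≈ 1 / (0.05 + 0.0095) ≈ 16.8, far short of the ideal 100×.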

➢ MULTIPROCESSORS AND
MULTICOMPUTERS:

1. Introduction
• To achieve parallel processing, computer systems can be designed as either:
1. Multiprocessors → Shared-memory systems.
2. Multicomputers → Distributed-memory systems.
• Both are used to improve performance, scalability, and fault tolerance in advanced
computing.

2. Multiprocessors
A multiprocessor system is a computer with two or more CPUs that share a common
memory and are connected by a high-speed bus or interconnection network.

Characteristics:

• Shared Memory: All processors access the same global memory.


• Uniform Address Space: Same address used by all CPUs for same data.
• Communication: Via shared memory (load/store operations).
• Tightly Coupled: Processors closely connected.

Types:

1. Symmetric Multiprocessors (SMP)


o All processors are equal and share memory.
o Example: Modern multi-core CPUs.
2. Asymmetric Multiprocessors (AMP)
o One master processor controls others (used in older systems).

Advantages:

• Easy to program (shared memory).


• Fast inter-processor communication.

Limitations:

• Scalability issues (memory contention, bus bottleneck).


• Requires cache coherence protocols.

3. Multicomputers
A multicomputer system consists of multiple independent computers (nodes), each with its
own private memory and CPUs, connected via a communication network.

Characteristics:

• Distributed Memory: Each processor has local memory.


• No Shared Address Space: Communication done via message passing.
• Loosely Coupled: Each node can work independently.

Examples:

• Clusters (Beowulf clusters, Hadoop clusters).


• Massively Parallel Processors (MPPs).

Advantages:

• Highly scalable (can add more nodes).


• Fault tolerance (failure in one node doesn’t stop entire system).

Limitations:

• Programming is harder (explicit message passing with MPI).


• Communication overhead can be high.
4. Comparison Table
Feature | Multiprocessors (Shared Memory) | Multicomputers (Distributed Memory)
Memory | Shared global memory | Private local memory
Communication | Via shared memory | Via message passing (network)
Coupling | Tightly coupled | Loosely coupled
Scalability | Limited | Highly scalable
Programming Model | Easier (OpenMP, threads) | Complex (MPI, PVM)
Examples | SMP servers, multi-core CPUs | Clusters, supercomputers

5. Applications
• Multiprocessors: Real-time systems, databases, operating systems.
• Multicomputers: Supercomputing, scientific simulations, big data processing, AI
training.

6. Conclusion
• Multiprocessors → Best for small-to-medium scale systems where ease of
programming is important.
• Multicomputers → Best for large-scale parallel computing requiring scalability and
distributed control.

MULTIPROCESSOR SYSTEM
INTERCONNECTS:

1. Introduction
• In multiprocessor systems, multiple CPUs (and caches) must communicate with
each other and with shared memory.
• The interconnection network defines how processors, memory modules, and I/O
devices are linked.
• The efficiency of these interconnects determines:
o Performance (latency, bandwidth).
o Scalability (number of processors supported).
o Reliability (fault tolerance).
2. Types of Interconnects
A. Shared Bus Interconnect

• All processors, memory modules, and I/O devices are connected to a single common
bus.
• Only one transfer at a time is allowed.

Advantages:

• Simple and cost-effective.


• Easy to implement cache coherence.

Limitations:

• Bus contention → poor scalability (good only up to ~8–16 CPUs).

B. Crossbar Switch Interconnect

• Provides direct paths between any processor and memory module.


• Multiple transfers can happen simultaneously (if no conflicts).

Advantages:

• High bandwidth.
• Eliminates bus bottleneck.

Limitations:

• Expensive (requires P × M crosspoint switches for P processors and M memory
modules).

C. Multistage Interconnection Networks (MINs)

• Built using small 2×2 switches arranged in stages.


• Examples: Omega, Banyan, Clos, Butterfly networks.

Advantages:

• More cost-effective than crossbar.


• Supports parallel transfers.

Limitations:

• Blocking can occur if two paths need the same switch simultaneously.

D. Mesh and Torus Networks


• Processors arranged in a grid (2D or 3D).
• Each node connected to its neighbors.

Advantages:

• Scalable.
• Local communication efficient.

Limitations:

• Longer communication paths (higher latency).

E. Hypercube Interconnect

• Processors connected as vertices of a d-dimensional cube.


• Each node connects to d neighbors.

Advantages:

• Very scalable (2^d nodes).


• Low diameter (short communication paths).

Limitations:

• More complex wiring as dimension increases.
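Hypercube nodes are conventionally numbered so that neighbors differ in exactly one bit, which makes neighbor lookup and routing distance simple bit operations. A minimal C sketch (the function names are illustrative):

#include <stdio.h>

/* Print the d neighbors of a node in a d-dimensional hypercube:
   neighbor k is obtained by flipping bit k of the node id. */
void print_neighbors(unsigned node, unsigned d) {
    for (unsigned k = 0; k < d; k++)
        printf("node %u <-> node %u (dimension %u)\n", node, node ^ (1u << k), k);
}

/* Minimum hop count between two nodes = number of bit positions where they differ. */
unsigned hypercube_distance(unsigned a, unsigned b) {
    unsigned diff = a ^ b, hops = 0;
    while (diff) { hops += diff & 1u; diff >>= 1; }
    return hops;
}

int main(void) {
    print_neighbors(5, 4);                               /* node 0101 in a 4-cube of 16 nodes */
    printf("distance(5, 10) = %u hops\n", hypercube_distance(5, 10));
    return 0;
}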

3. Classification
• Static Interconnects: Fixed links (Mesh, Torus, Hypercube).
• Dynamic Interconnects: Use switching elements (Bus, Crossbar, MINs).

4. Performance Metrics
• Bandwidth: Amount of data transferred per unit time.
• Latency: Time to deliver a message.
• Scalability: Ability to support more processors without bottlenecks.
• Fault Tolerance: Ability to reroute in case of failures.

5. Applications
➢ Shared Bus → Small multiprocessor systems (SMPs).
➢ Crossbar / MINs → Medium-sized multiprocessors.
➢ Mesh, Torus, Hypercube → Large-scale parallel computers and supercomputers.
➢ CACHE COHERENCE AND
SYNCHRONIZATION MECHANISMS:
1. Introduction
• In multiprocessor systems, each processor often has a local cache to reduce memory
access latency.
• When multiple caches store copies of the same memory block, inconsistencies can
arise if one cache updates the data while others still hold stale values.
• To ensure correct execution of parallel programs, we need:
o Cache coherence protocols → maintain consistency of shared data.
o Synchronization mechanisms → control access to shared variables and
prevent race conditions.

2. Cache Coherence Problem


• Definition: A system is coherent if any read returns the most recent write of a data
item.
• Example of problem:
o CPU1 updates variable X = 10 in its cache.
o CPU2 still sees old value X = 5 from its cache.
o → Inconsistency occurs.

3. Cache Coherence Protocols


A. Directory-Based Protocols

• A central directory keeps track of which caches store copies of a block.


• When a cache updates data, the directory ensures invalidation or update messages are
sent to others.

Pros: Scales better for large systems.


Cons: Directory may become a bottleneck.

B. Snoopy Protocols

• All caches monitor (snoop) a shared bus to observe memory operations.


• If one cache modifies data, others invalidate or update their copies.

Two main types:


1. Write-Invalidate: Writer invalidates other copies before modifying.
2. Write-Update: Writer updates other copies with new data.

Pros: Simple, works well in small bus-based multiprocessors.


Cons: Poor scalability due to bus traffic.

4. Memory Consistency Models


• Define the order in which memory operations (loads/stores) appear to execute across
processors.
• Common models:
o Sequential Consistency → All operations appear in a single global order.
o Weak Consistency → Synchronization primitives enforce ordering, allowing
better performance.

5. Synchronization Mechanisms
Used to control concurrent access to shared resources.

A. Locks

• Ensure mutual exclusion (only one processor enters a critical section).


• Implemented using hardware instructions like Test-and-Set (TAS), Compare-and-
Swap (CAS).
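A minimal C11 spinlock built on the test-and-set idea above (the function names are illustrative; this is a sketch, not a production lock, since it spins without backoff):

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* Spin until test-and-set returns 0, meaning this caller set the flag and owns the lock. */
void spin_lock(void) {
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;                                  /* busy-wait while another processor holds it */
}

/* Clearing the flag releases the critical section to other processors. */
void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

A critical section is then simply bracketed by spin_lock(); ... spin_unlock();.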

B. Semaphores

• Generalized locks with counters, used for resource management.

C. Barriers

• Force all processors to wait until each has reached a synchronization point before
continuing.

D. Monitors & Condition Variables

• High-level synchronization constructs for structured access to shared data.

6. Challenges
• Coherence overhead (extra communication).
• False sharing: Different variables in the same cache block cause unnecessary
invalidations (illustrated in the sketch after this list).
• Scalability: More processors → higher complexity.
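The false-sharing issue can be seen with per-thread counters: if two logically private counters occupy the same cache block, every update by one core invalidates the block in the other core's cache. A hedged C sketch (the 64-byte line size and the struct names are assumptions typical of current CPUs):

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Bad layout: both counters share one cache line, so updates ping-pong the line. */
struct counters_shared {
    uint64_t c0;
    uint64_t c1;
};

/* Better layout: padding gives each counter its own cache line. */
struct counters_padded {
    uint64_t c0;
    char pad0[CACHE_LINE - sizeof(uint64_t)];
    uint64_t c1;
    char pad1[CACHE_LINE - sizeof(uint64_t)];
};

int main(void) {
    printf("shared: %zu bytes, padded: %zu bytes\n",
           sizeof(struct counters_shared), sizeof(struct counters_padded));
    return 0;
}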

7. Conclusion
• Cache coherence ensures data consistency across multiple caches.
• Synchronization mechanisms prevent race conditions and enable correct parallel
execution.
• Together, they are fundamental for achieving correctness and performance in
multiprocessor systems.

➢ THREE GENERATIONS OF
MULTICOMPUTERS:
1. Introduction
• Multicomputers are message-passing parallel computers where each processor has:
o Private memory (no global shared memory).
o Interconnection network for communication.
• Unlike multiprocessors (shared memory), multicomputers use explicit message
passing for data exchange.
• Their evolution is classified into three generations based on technology and
architecture.

2. First Generation Multicomputers (1980s)


• Architecture:
o Based on bus or simple ring interconnects.
o Used commercial microprocessors with small private memory.
• Programming model: Message Passing (using libraries).
• Examples:
o Intel iPSC/1 (hypercube).
o Cosmic Cube.
• Limitations:
o Limited scalability (tens of processors).
o High communication latency.
o Low bandwidth networks.
3. Second Generation Multicomputers (1990s)
• Architecture:
o Advanced direct interconnection networks (mesh, torus, hypercube).
o Improved communication hardware (low-latency switches).
o Hundreds to thousands of processors.
• Software support:
o Better message-passing standards (PVM, MPI introduced).
o Parallel file systems and runtime support.
• Examples:
o Intel Paragon.
o nCUBE-2.
o Thinking Machines CM-5.
• Improvements:
o Scalability improved significantly.
o Higher bandwidth interconnects.

4. Third Generation Multicomputers (2000s – Present)


• Architecture:
o Large-scale clusters and supercomputers with commodity processors (Intel,
AMD).
o Use of high-speed interconnects (InfiniBand, Myrinet).
o Distributed memory but integrated with distributed shared memory models
(hybrid).
• Software support:
o Standardized MPI + OpenMP hybrid programming.
o Advanced compilers and runtime systems.
• Examples:
o IBM Blue Gene.
o Cray XT series.
o Modern supercomputers (TOP500 systems).
• Features:
o Millions of processing cores.
o Petaflop and exaflop performance.
o Power-efficient architectures.

5. Comparison of Generations
Feature | 1st Gen (1980s) | 2nd Gen (1990s) | 3rd Gen (2000s–Now)
Processors | Tens | Hundreds–Thousands | Millions (clusters)
Interconnect | Bus, Ring, Cube | Mesh, Torus, Hypercube | High-speed networks
Programming Model | Basic message passing | MPI, PVM support | MPI + Hybrid models
Performance | GFLOPS | TFLOPS | PFLOPS → ExaFLOPS
Examples | Cosmic Cube, Intel iPSC/1 | Intel Paragon, nCUBE-2 | IBM Blue Gene, Cray XT

6. Conclusion
• First Gen: Prototype and early research machines (limited scalability).
• Second Gen: High-performance interconnects, adoption in scientific computing.
• Third Gen: Commodity-based large-scale clusters, powering today’s
supercomputers.
• Multicomputers evolved from experimental parallelism to practical large-scale
HPC systems.

➢ MESSAGE-PASSING MECHANISMS:


1. Introduction
• In multicomputer systems (distributed memory), processors do not share memory.
• Communication happens via explicit message passing across the interconnection
network.
• This mechanism is essential for parallel processing, synchronization, and data
exchange.

2. Characteristics of Message Passing


1. Explicit Communication – processes send and receive data explicitly.
2. Point-to-Point or Collective – communication may be one-to-one or one-to-many.
3. Synchronization – requires coordination between sender and receiver.
4. Latency and Bandwidth – performance depends on communication cost.

3. Basic Message-Passing Operations


1. Send – transmits data from one processor to another.
2. Receive – accepts data from another processor.
3. Broadcast/Multicast – sends a message to multiple processors.
4. Barrier Synchronization – ensures all processes reach the same point before
continuing.

4. Types of Message Passing


(a) Synchronous Message Passing

• Sender waits until receiver is ready.


• Ensures reliable communication but increases latency.
• Example: Blocking send() in MPI.

(b) Asynchronous Message Passing

• Sender transmits and continues execution without waiting.


• Receiver collects message later.
• Increases parallelism but may require buffering.

5. Mechanisms
1. Direct Communication
o Processes communicate by naming each other explicitly.
o Example: send(P1, data) → receive(P2, data).
2. Indirect Communication (via Mailboxes/Ports)
o Messages delivered to a mailbox/queue.
o Processes retrieve from it asynchronously.
3. Buffered vs. Unbuffered
o Buffered → messages stored temporarily in queues.
o Unbuffered → sender/receiver must be synchronized.
4. Reliability Mechanisms
o Error detection, acknowledgment, retransmission.
o Essential for large distributed systems.

6. Message-Passing Libraries
• MPI (Message Passing Interface) – standard for scientific computing.
• PVM (Parallel Virtual Machine) – earlier system for heterogeneous clusters.
• Features include point-to-point and collective operations.
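A minimal MPI example in C of the point-to-point and barrier operations described above (the message contents and tag value are illustrative; run with at least two ranks, e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                               /* data to transmit */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);       /* blocking send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Barrier(MPI_COMM_WORLD);                                  /* barrier synchronization */
    MPI_Finalize();
    return 0;
}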

7. Applications
• High-performance computing (HPC).
• Distributed simulations.
• Cloud and cluster computing.
• Data-parallel applications like weather forecasting, AI training, etc.

8. Advantages and Limitations


Advantages

• High scalability (works well on thousands of nodes).


• Fault isolation (no shared memory bottleneck).
• Explicit control of communication.

Limitations

• Higher programming complexity.


• Communication overhead (latency, bandwidth).
• Requires synchronization for correctness.

9. Conclusion
• Message-passing mechanisms provide a foundation for distributed memory
parallel computing.
• Modern HPC relies on MPI for scalable performance.
• Trade-off: More scalable than shared memory, but harder to program.

MULTIVECTOR AND SIMD COMPUTERS:


1. Introduction
• Parallelism in processors can be achieved at multiple levels.
• Two important architectures are:
o Multivector Computers → based on vector processing.
o SIMD Computers → based on Single Instruction, Multiple Data.

Both are widely used for scientific computing, AI, image processing, and HPC
applications.
2. Multivector Computers
• Definition: Multivector processors are systems capable of executing multiple vector
instructions concurrently.
• Vector processing means handling entire arrays (vectors) of data with a single
instruction.
• Example: Instead of adding two numbers, it adds two vectors element-wise.

Characteristics

• Operates on vectors (1D arrays of data).


• Exploits data-level parallelism.
• Vector pipelines allow high throughput.
• Can issue and process multiple vector operations simultaneously.

Advantages

• Reduces loop overhead in scientific computations.


• Efficient for matrix operations, simulations, weather modeling.

Examples

• Cray supercomputers (e.g., Cray-1).


• Modern GPUs have vector processing elements.

3. SIMD Computers
• Definition: SIMD stands for Single Instruction, Multiple Data.
• A single instruction is broadcast to multiple processing elements, each working on
different data simultaneously.

Characteristics

• One control unit directs many processing elements.


• All processors execute the same instruction but on different data streams.
• Works best for data-parallel problems.

Advantages

• High efficiency in array and matrix computations.


• Reduces control overhead (one instruction, many data points).
• Good for graphics, AI, multimedia processing.

Examples

• GPU architectures (NVIDIA CUDA, AMD GCN).


• Intel SSE/AVX instructions in CPUs.
• Early SIMD machines like ILLIAC IV, Connection Machine.

4. Comparison: Multivector vs SIMD


Feature | Multivector Computers | SIMD Computers
Parallelism Type | Vector-level | Data-level
Execution Unit | Vector pipelines | Multiple PEs
Instruction Control | Multiple vector instructions | Single instruction for all
Best For | Scientific simulations, matrix/vector ops | Graphics, AI, multimedia
Examples | Cray supercomputers | GPUs, Intel AVX

5. Applications
• Multivector → Scientific computing, engineering simulations, physics.
• SIMD → Graphics rendering, machine learning, image/video processing.

6. Conclusion
• Both Multivector and SIMD architectures exploit data parallelism.
• Multivector = focuses on vector pipelines and multiple vector ops.
• SIMD = focuses on applying one instruction across multiple data streams.
• Together, they form the backbone of modern HPC, AI, and GPU computing.
UNIT-5
VECTOR PROCESSING PRINCIPLES:
1. Introduction
• Vector Processing is a form of data-level parallelism where a single instruction
operates on a set of data elements (vector) instead of a single scalar value.
• Example: Instead of adding two numbers, a vector processor can add two arrays
element by element in one instruction.
• Widely used in scientific computing, simulations, multimedia, and AI
applications.

2. Basic Principles of Vector Processing


(a) Vector Instructions

• Operate on entire arrays (vectors) rather than individual data.


• Example: VADD A, B, C → Add vector B and vector C, store result in vector A.

(b) Pipelining

• Vector processors use deep pipelines so that multiple elements are processed
concurrently.
• Each pipeline stage handles one element at a time, producing high throughput.

(c) Vector Registers

• Special registers store long sequences of data (vectors).


• Instead of fetching data for every scalar operation, the processor loads/stores entire
vectors at once.

(d) Memory Access Patterns

• Vector processors optimize memory operations by supporting stride addressing


(accessing elements at regular intervals).
• Example: Accessing every 4th element of an array efficiently.

(e) Data-Level Parallelism

• Exploits parallelism in data operations (e.g., matrix multiplication, image filtering).


• Works best when the same operation is applied to large data sets.
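As a concrete illustration, the plain C loops below (the function names and the stride of 4 are illustrative assumptions) are exactly the kind of code a vectorizing compiler, or a hand-coded vector instruction such as the VADD example above, turns into data-parallel operations:

#include <stddef.h>
#include <stdio.h>

/* y[i] = a*x[i] + y[i]: the same operation applied to every element, so a vector
   unit (or SIMD extension) can process many elements per instruction. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Strided access (every 4th element): vector processors handle this with
   stride addressing; irregular patterns need gather/scatter support. */
void scale_stride4(size_t n, float a, float *x) {
    for (size_t i = 0; i < n; i += 4)
        x[i] *= a;
}

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0};
    saxpy(8, 2.0f, x, y);
    scale_stride4(8, 10.0f, x);
    printf("y[7] = %.1f, x[4] = %.1f\n", y[7], x[4]);    /* 16.0 and 50.0 */
    return 0;
}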
3. Features of Vector Processors
• Vectorization: Converting loops into vector operations.
• Chaining: Output of one vector instruction directly fed into another without waiting
for completion.
• Masking: Selectively enable/disable vector elements for conditional operations.
• Gather/Scatter: Support irregular memory access patterns.

4. Advantages
• High throughput for repetitive numeric computations.
• Reduces instruction fetch/decode overhead.
• Efficient for large datasets in scientific and multimedia applications.
• Simplifies parallel programming compared to message passing.

5. Limitations
• Not suitable for small or irregular data sets.
• Performance depends on vector length (longer vectors = better efficiency).
• More expensive hardware (vector registers, pipelines).

6. Applications
• Scientific computing: Physics simulations, weather forecasting.
• Engineering: Finite element analysis, CAD.
• AI & ML: Matrix-vector multiplications in neural networks.
• Graphics & Multimedia: Image filtering, video encoding.

7. Conclusion
• Vector processing is a cornerstone of advanced computer architecture, exploiting
data-level parallelism.
• Key principles include vector instructions, pipelining, chaining, masking, and
memory optimizations.
• Though challenged by irregular data patterns, vector processors form the foundation
for modern GPUs, SIMD units, and high-performance computing systems.
MULTIVECTOR MULTIPROCESSORS:
1. Introduction
• A multivector multiprocessor is a parallel computing system that combines the
power of multiprocessing (multiple CPUs working together) with vector processing
(operating on arrays of data).
• It extends the SIMD (Single Instruction, Multiple Data) and MIMD (Multiple
Instruction, Multiple Data) paradigms by supporting multiple vector processors
working in parallel.
• Goal: Achieve massive data-level and task-level parallelism for scientific, AI, and
big data applications.

2. Architecture
• Consists of multiple vector processors (VPUs), each with:
o Vector registers (to store long arrays).
o Vector pipelines (for arithmetic operations like add, multiply, divide).
o Scalar unit (to handle control and scalar data).
• Processors are connected via a multiprocessor interconnect (shared memory,
NUMA, or message-passing network).
• Supports both vector instructions and multiprocessor coordination.

3. Features
1. Parallel Vector Units: Multiple vector processors execute vector operations
simultaneously.
2. Task + Data Parallelism: Supports MIMD task distribution while exploiting
SIMD vector execution within each task.
3. Chaining Across Processors: Output of one vector unit can be fed to another without
delay.
4. Scalable Interconnects: Uses high-speed interconnects like crossbar switches,
hypercube, mesh, or fat-tree networks.
5. Shared or Distributed Memory Models: Works with UMA, NUMA, or distributed
memory depending on system size.

4. Advantages
• High throughput for scientific and engineering workloads.
• Exploits both fine-grained (vector) and coarse-grained (multiprocessor)
parallelism.
• Scalable to large numbers of processors.
• Reduces instruction overhead (one vector instruction = many operations).

5. Limitations
• Complex interconnection network and synchronization mechanisms needed.
• Performance drops if workloads are not vectorizable.
• High hardware cost (vector pipelines + multiprocessor interconnects).

6. Applications
• Supercomputers (e.g., Cray-2, NEC SX series, Fujitsu VP).
• Scientific simulations (climate modeling, astrophysics, quantum mechanics).
• AI & ML workloads (matrix multiplications, tensor computations).
• Big Data Analytics & Multimedia (image/video processing).

7. Conclusion
• Multivector multiprocessors are a powerful architectural model combining
multiprocessing with vector processing.
• They achieve massive speedup for large-scale scientific and engineering problems
by exploiting data-level parallelism at multiple levels.
• Though expensive and complex, they represent a key step in the evolution toward
modern GPUs, AI accelerators, and exascale supercomputers.

COMPOUND VECTOR PROCESSING:


1. Introduction
• Compound Vector Processing (CVP) is an advanced vector processing technique
where multiple vector operations are combined and executed in parallel or
pipelined fashion within a processor.
• Unlike simple vector processing (which executes one vector instruction at a time),
CVP chains multiple vector operations together to increase throughput and reduce
execution time.
• It’s mainly used in supercomputers and high-performance processors for
workloads like scientific simulations, AI, and data analytics.
2. Concept
• In a normal vector processor:
o One vector instruction (e.g., VADD A, B, C) performs an operation across
elements of arrays.
• In compound vector processing:
o Multiple vector instructions (e.g., VADD, VSUB, VMUL) are issued together as a
compound instruction set.
o The output of one instruction can immediately feed into another without
storing back to memory (called vector chaining).

Example:
Instead of executing separately:

VADD V1 = A + B
VMUL V2 = V1 * C
VSUB V3 = V2 - D

Compound vector processing chains them, reducing overhead.
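In conventional C terms, chaining corresponds to fusing the three passes into one loop so intermediate values stay in registers instead of being stored and reloaded between operations; a hedged sketch (the function and array names mirror the pseudo-instructions above and are otherwise illustrative):

#include <stddef.h>

/* Unchained: three separate passes with two temporary arrays written to memory. */
void unchained(size_t n, const float *A, const float *B, const float *C,
               const float *D, float *V1, float *V2, float *V3) {
    for (size_t i = 0; i < n; i++) V1[i] = A[i] + B[i];   /* VADD */
    for (size_t i = 0; i < n; i++) V2[i] = V1[i] * C[i];  /* VMUL */
    for (size_t i = 0; i < n; i++) V3[i] = V2[i] - D[i];  /* VSUB */
}

/* Chained/compound: each intermediate result feeds the next operation directly,
   analogous to forwarding between vector functional units. */
void chained(size_t n, const float *A, const float *B, const float *C,
             const float *D, float *V3) {
    for (size_t i = 0; i < n; i++) {
        float v1 = A[i] + B[i];
        float v2 = v1 * C[i];
        V3[i] = v2 - D[i];
    }
}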

3. Features
1. Vector Chaining – Intermediate results pass directly between functional units.
2. Pipelined Compound Operations – Multiple vector operations are executed in a
pipeline fashion.
3. Parallel Execution – Different functional units (adders, multipliers, dividers) work in
parallel.
4. Reduced Instruction Overhead – Fewer instructions need to be fetched and
decoded.
5. High Data Reuse – Avoids unnecessary memory accesses.

4. Advantages
• Higher throughput due to parallelism across multiple vector operations.
• Reduced memory traffic (results are forwarded directly).
• Better performance on scientific workloads (matrix operations, PDE solvers).
• Exploits data locality by reusing operands.

5. Challenges
• Requires complex control logic to manage dependencies between chained operations.
• Limited benefit if workload is not vectorizable.
• Hardware cost is higher due to multiple pipelines and forwarding paths.

6. Applications
• Supercomputers (e.g., Cray vector processors, NEC SX series).
• Scientific computing – weather forecasting, computational physics, fluid dynamics.
• AI & ML – matrix multiplications, tensor contractions.
• Graphics & Multimedia – image filtering, transformations.

7. Conclusion
• Compound Vector Processing enhances traditional vector processing by executing
multiple vector operations together using chaining and pipelining.
• It significantly boosts performance in scientific and AI workloads.
• Though complex and costly, it paved the way for modern GPUs and AI accelerators
that also rely on compound vector-like parallel execution.

SIMD COMPUTER ORGANISATIONS:


1. Introduction
• SIMD (Single Instruction, Multiple Data) is a parallel computing model where a
single instruction is executed simultaneously on multiple data elements.
• It is highly efficient for vectorizable tasks such as graphics, image processing, matrix
operations, and scientific simulations.
• SIMD is widely used in vector processors, GPUs, and multimedia instruction sets
like Intel SSE, AVX, and ARM NEON.

2. Characteristics of SIMD Computers


• One Control Unit (CU): Issues the same instruction to multiple processing elements
(PEs).
• Multiple Processing Elements (PEs): Operate in parallel on different pieces of data.
• Synchronous Operation: All PEs execute the same instruction at the same time.
• Efficient for Data-Parallel Tasks: Works best when the same computation is applied
to large datasets.
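The lock-step model can be sketched as a toy simulation in C: one "control unit" picks a single
operation, and every "processing element" applies that same operation to its own data element.
The enum and function names below are invented for illustration and do not correspond to any
real machine.

#include <stdio.h>

enum op { OP_ADD, OP_MUL };

/* One instruction is "broadcast"; conceptually, all PEs act on their own
   element at the same time (here simulated by a loop). */
void broadcast(enum op instr, const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i++) {
        switch (instr) {
        case OP_ADD: out[i] = a[i] + b[i]; break;
        case OP_MUL: out[i] = a[i] * b[i]; break;
        }
    }
}

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, r[4];
    broadcast(OP_ADD, a, b, r, 4);          /* one instruction, four data elements */
    for (int i = 0; i < 4; i++) printf("%g ", r[i]);
    printf("\n");                           /* prints: 11 22 33 44 */
    return 0;
}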
3. Types of SIMD Organizations
a) Array Processors

• Organized as a 2D array of processors, each connected to its local memory.
• A single control unit broadcasts instructions to all PEs.
• Example: ILLIAC IV (one of the earliest SIMD array processors).

b) Vector Processors

• Operates on vectors (arrays of data elements) rather than individual scalar data.
• Uses vector registers to store large datasets and perform operations in a pipelined
manner.
• Example: Cray vector supercomputers.

c) SIMD Extensions in CPUs

• Modern CPUs integrate SIMD instructions into their instruction sets.
• Examples: Intel SSE/AVX, ARM NEON, IBM AltiVec.
• Allow parallel execution of operations on multiple integers/floats within a single CPU
core.
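As a concrete illustration of such extensions, the sketch below uses Intel AVX intrinsics from
<immintrin.h> to add eight floats with one instruction. It assumes an AVX-capable x86 CPU and
compilation with a flag such as -mavx (GCC/Clang); it is a minimal example, not tuned
production code.

#include <immintrin.h>   /* AVX intrinsics */
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float r[8];

    __m256 va = _mm256_loadu_ps(a);      /* load 8 floats into a 256-bit register */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vr = _mm256_add_ps(va, vb);   /* one instruction adds all 8 lanes */
    _mm256_storeu_ps(r, vr);

    for (int i = 0; i < 8; i++) printf("%g ", r[i]);   /* 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}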

d) GPUs (Graphics Processing Units)

• Designed with thousands of small PEs operating under SIMD/SIMT (Single Instruction,
Multiple Threads).
• Highly optimized for parallel workloads like graphics rendering, AI, and scientific
computing.

4. Memory Organizations in SIMD


1. Shared Memory – All PEs share the same global memory.
2. Distributed Memory – Each PE has its own local memory.
3. Hybrid Memory – Combination of shared and distributed memory.

5. Advantages
• High throughput for data-parallel problems.
• Reduced instruction overhead (one instruction operates on many data).
• Energy-efficient (same control for multiple operations).
• Ideal for multimedia, AI, and scientific computing.
6. Challenges
• Not suitable for control-parallel problems (where different tasks need different
instructions).
• Data alignment issues (data must be structured for parallel access).
• Scalability limitations compared to MIMD (Multiple Instruction, Multiple Data)
systems.

7. Applications
• Scientific computing – matrix multiplication, weather modeling, simulations.
• Multimedia & graphics – video encoding, image filtering, rendering.
• AI & Machine Learning – tensor/matrix computations, deep learning acceleration.
• Cryptography & signal processing – FFT, encryption/decryption.

8. Conclusion
• SIMD computer organizations exploit data parallelism by applying the same
instruction across large datasets.
• Early SIMD systems like array processors influenced modern vector
supercomputers, SIMD CPU extensions, and GPUs.

THE CONNECTION MACHINE CM-5:

1. Introduction
• The Connection Machine CM-5 is a massively parallel supercomputer developed
by Thinking Machines Corporation (TMC) in the early 1990s.
• It was designed to support scalable parallelism for scientific, AI, and large-scale
computational problems.
• Unlike its predecessors (CM-1 and CM-2, which were SIMD-based), the CM-5
adopted a MIMD architecture but could emulate SIMD when required.
2. Architecture
• Type: Massively Parallel MIMD (Multiple Instruction, Multiple Data).
• Processing Elements (PEs):
o Each PE was a SPARC processor with local memory.
o Could range from 32 to 16,384 processors depending on system
configuration.
• Interconnection Network:
o Used a fat-tree network (highly scalable and low-latency).
o Allowed efficient communication between thousands of processors.
• Memory Model:
o Each node had local memory (distributed memory model).
o Global operations were achieved through the network.

3. Features
• Scalable Design: Performance scaled almost linearly with added processors.
• Hybrid Programming Models:
o Supported both data-parallel (SIMD-like) and task-parallel (MIMD)
execution.
• High Bandwidth Communication: Fat-tree interconnect minimized bottlenecks.
• Peak Performance:
o Installed systems delivered performance in the tens of GFLOPS; a 1,024-node
CM-5 topped the first TOP500 list in June 1993.
o The architecture was designed to scale toward teraflop-level peak
performance, making the CM-5 one of the fastest systems of its time.

4. Software Environment
• Programming Languages:
o CM Fortran (data-parallel extensions of Fortran).
o C* (an extended version of C for parallel programming).
• OS Support: UNIX-based operating systems.

5. Applications
• Weather forecasting and climate modeling.
• Molecular dynamics and computational chemistry.
• Image processing and computer vision.
• Large-scale simulations in physics and engineering.
• AI research and machine learning (early neural networks).
6. Legacy
• The CM-5 was among the first commercial supercomputers to successfully
combine MIMD parallelism with scalable interconnection networks.
• Influenced the design of later cluster-based supercomputers and parallel
architectures.
• Made famous by appearing in the movie Jurassic Park (1993) as the park’s
"supercomputer".
