Advanced Computer Architecture
UNIT 1:
➢ Theory of Parallelism:
1. Introduction
2. Types of Parallelism
(a) Instruction-Level Parallelism (ILP)
• Refers to executing multiple instructions at the same time within a single processor.
• Achieved through pipelining, superscalar execution, out-of-order execution, and branch prediction.
• Example: Modern CPUs can fetch, decode, and execute multiple instructions per clock cycle.
(b) Data-Level Parallelism (DLP)
• Exploits the fact that the same operation can be applied to multiple data elements simultaneously.
• Implemented using vector processors, SIMD (Single Instruction Multiple Data), and GPUs.
• Example: Image processing where the same filter is applied to millions of pixels at once (see the sketch below).
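To make the data-level parallelism example concrete, here is a minimal C sketch (an illustration of mine, not from the notes) of the kind of loop that SIMD units and GPUs accelerate: the same brightness operation is applied independently to every pixel, so the hardware can process many elements per instruction.

#include <stddef.h>

/* Apply the same scaling operation to every pixel of a grayscale image.
   Each iteration is independent, so a vectorizing compiler, SIMD unit, or GPU
   can process many pixels simultaneously (data-level parallelism). */
void brighten(unsigned char *pixels, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++) {
        float v = pixels[i] * gain;                  /* same operation per element */
        pixels[i] = (v > 255.0f) ? 255 : (unsigned char)v;
    }
}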
3. Models of Parallelism
4. Benefits of Parallelism
5. Challenges in Parallelism
6. Applications
The state of computing in advanced computer architecture refers to the current trends,
developments, and challenges in designing high-performance systems. As the demand for
faster processing, massive data handling, artificial intelligence, and cloud computing grows,
architectures must evolve beyond traditional single-core designs.
2. Evolution of Computing Architectures
• Past (Single-core Era): Early computers relied on sequential execution, where one
instruction was processed at a time.
• Transition (Moore’s Law & Pipelining): Increased transistor counts enabled faster
processors, pipelining, and superscalar execution.
• Present (Multicore & Parallelism): Due to power and heat limitations, focus shifted
to multicore processors, GPUs, and heterogeneous architectures.
• Future (Quantum & Neuromorphic): Emerging technologies are pushing
computing toward quantum processors, AI-driven accelerators, and brain-
inspired chips.
5. Future Outlook
1. Introduction
2. Multiprocessors
Characteristics
Disadvantages
3. Multicomputers
Characteristics
Advantages
Disadvantages
1. Introduction
Characteristics
Examples
Applications
• Image and video processing.
• Matrix operations in scientific computing.
• Neural network training and inference.
3. Multivector Computers
Characteristics
Examples
Applications
1. Introduction
In advanced computer architecture, different models are used to analyze, design, and
evaluate parallel computing systems. Two important theoretical and practical models are:
Characteristics
1. EREW (Exclusive Read Exclusive Write): No two processors can read/write the
same memory cell at the same time.
2. CREW (Concurrent Read Exclusive Write): Multiple processors can read the same
memory cell, but only one can write at a time.
3. ERCW (Exclusive Read Concurrent Write): Processors read exclusively but can
write concurrently.
4. CRCW (Concurrent Read Concurrent Write): Multiple processors can read and
write simultaneously (with rules like priority write, common write, etc.).
Advantages
Limitations
Characteristics
Applications
Advantages
Limitations
5. Conclusion
• PRAM model provides a theoretical foundation for designing and analyzing parallel
algorithms, though it is abstract and not directly realizable.
• VLSI model provides a practical framework for implementing these algorithms on
real hardware, focusing on chip area and time efficiency.
Together, they bridge the gap between parallel algorithm design and hardware
implementation in advanced computer architecture.
➢ ARCHITECTURAL DEVELOPMENT
TRACKS:
1. Introduction
The progress of computer architecture has followed certain development tracks (or
directions), shaped by the need for higher performance, scalability, energy efficiency, and
specialized applications. These tracks represent how architectures evolved from simple
sequential machines to highly parallel and heterogeneous systems.
3. Summary Table
Track | Focus | Examples
Instruction-Level Parallelism | Speed within a single processor | Superscalar, VLIW CPUs
Thread/Process-Level Parallelism | Parallelism across threads/cores | Multicore CPUs, Multiprocessors
Data-Level Parallelism | Same operation on large data sets | SIMD, GPUs, Vector Processors
Memory/Storage | Reducing latency & bottlenecks | Cache, HBM, NUMA
Interconnection Networks | Efficient communication | Mesh, Torus, NoC
Specialized Architectures | Task-specific acceleration | TPUs, DSPs, Quantum Chips
4. Conclusion
Architectural development in advanced computing has branched into multiple tracks, each
addressing performance, efficiency, and scalability from a different perspective. While ILP
and multicore processors remain general-purpose, DLP and specialized accelerators dominate
domains like AI, scientific computing, and big data. Future development tracks point
toward quantum computing, neuromorphic systems, and energy-efficient architectures.
➢ PROGRAM AND NETWORK
PROPERTIES:
1. Introduction
In parallel and distributed computing, the performance and efficiency of a system depend
on two key aspects:
1. Program Properties – the characteristics of the program that determine how well it
can be parallelized.
2. Network Properties – the characteristics of the interconnection network that affect
communication among processors.
2. Program Properties
These describe how a program behaves in terms of parallelism, communication, and
execution requirements.
• Programs may need frequent synchronization points (barriers, locks), which affect
efficiency.
3. Network Properties
These define how the interconnection network impacts the execution of parallel programs.
(a) Topology
(b) Diameter
(f) Scalability
5. Conclusion
• Program properties (parallelism, granularity, communication ratio) define how
much parallel speedup can be achieved.
• Network properties (topology, latency, bandwidth, scalability) determine how
effectively processors can communicate.
• The balance of these two factors is essential for high-performance computing
(HPC), AI, big data, and distributed systems.
➢ CONDITIONS OF PARALLELISM:
Introduction
Parallelism is the foundation of modern high-performance computing. However, not all tasks
can be executed in parallel. To exploit Instruction-Level, Data-Level, Thread-Level, and
Process-Level Parallelism, certain conditions must be satisfied. These conditions are
generally derived from data dependencies, resource constraints, and control flow
requirements.
Parallel execution is possible only if tasks are not dependent on the results of each other.
Types of data dependencies:
• Flow (RAW – Read After Write) dependence: one task needs a result produced by another.
• Anti (WAR – Write After Read) dependence: a task overwrites data that another task still needs to read.
• Output (WAW – Write After Write) dependence: two tasks write to the same location.
✔ Parallelism exists only if data dependencies are removed or minimized (via renaming,
loop unrolling, speculation, etc.).
Bernstein's Conditions (for two processes P1 and P2):
• Input sets (I) and Output sets (O) of both processes do not conflict.
• The conditions are:
1. I1 ∩ O2 = ∅ (P1 does not read values written by P2)
2. I2 ∩ O1 = ∅ (P2 does not read values written by P1)
3. O1 ∩ O2 = ∅ (P1 and P2 do not write to the same variable)
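As a concrete illustration (my own example, not from the notes), the following C fragment shows one pair of statements that satisfies Bernstein's conditions and one that violates them; the read sets (I) and write sets (O) are noted in the comments.

#include <stdio.h>

int main(void)
{
    int a, b = 1, c = 2, d, e = 3, f = 4, x;

    /* Pair 1 – independent statements; Bernstein's conditions hold:
       P1: I1 = {b, c}, O1 = {a}   P2: I2 = {e, f}, O2 = {d}
       I1∩O2 = ∅, I2∩O1 = ∅, O1∩O2 = ∅ → P1 and P2 may run in parallel. */
    a = b + c;   /* P1 */
    d = e * f;   /* P2 */

    /* Pair 2 – flow (RAW) dependence: I2 ∩ O1 = {a} ≠ ∅,
       so the second statement cannot run in parallel with the first. */
    a = b + c;   /* P1 writes a */
    x = a * 2;   /* P2 reads a  */

    printf("%d %d %d\n", a, d, x);
    return 0;
}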
3. Practical Considerations
• Amdahl’s Law: The maximum speedup is limited by the sequential portion of a
program.
• Load Balancing: Processors must be evenly loaded to prevent idle time.
• Synchronization: Proper coordination among processors is needed to maintain
correctness.
4. Conclusion
The conditions of parallelism are determined mainly by data, resource, and control
dependencies, along with Bernstein’s formal model. For efficient parallel computing,
dependencies must be minimized, resources must be adequate, and workloads balanced.
These conditions guide the design of parallel algorithms, compilers, and architectures in
modern computing systems
1. Introduction
In parallel and distributed computing, a program must be broken into smaller tasks and
then assigned to processors for execution. This process involves:
Both steps are critical for achieving high performance, load balancing, and minimal
communication overhead.
2. Program Partitioning
Partitioning means breaking a large program into smaller tasks, modules, or processes that
can be executed concurrently.
Objectives
Methods of Partitioning
1. Functional Partitioning
o Divide program based on functionality.
o Example: One task for input, another for computation, another for output.
2. Data Partitioning
o Divide data among processors; each processor performs the same operation on
different data.
o Example: Matrix multiplication (each processor handles a block of the
matrix).
3. Recursive Partitioning
o Break tasks into smaller subtasks repeatedly until tasks fit processor limits.
o Example: Divide-and-conquer algorithms (QuickSort, FFT).
4. Domain Partitioning
o Divide problem space into sub-domains, each solved in parallel.
o Example: Weather simulation grids, finite element analysis.
Granularity in Partitioning
3. Program Scheduling
After partitioning, tasks must be scheduled on processors. Scheduling decides execution
order, mapping, and timing.
Objectives
Types of Scheduling
1. Static Scheduling
o Tasks are assigned before execution.
o Simple, less overhead.
o Example: Round-robin, block partitioning.
2. Dynamic Scheduling
o Tasks assigned at runtime based on system load.
o More flexible, adapts to workload changes.
o Example: Work stealing, load balancing schedulers.
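A minimal OpenMP sketch (my own illustration, not part of the notes) of the static vs. dynamic choice above: schedule(static) fixes the iteration-to-thread mapping before the loop starts, while schedule(dynamic) hands out chunks at runtime so faster threads pick up more work.

#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    double sum = 0.0;

    /* Static scheduling: iterations are divided among threads before execution. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += i * 0.5;

    /* Dynamic scheduling: chunks of 16 iterations are assigned at runtime,
       which balances load when iteration costs are uneven. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += i * 0.25;

    printf("sum = %f\n", sum);
    return 0;
}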
5. Conclusion
• Program Partitioning ensures a program is broken into manageable, parallelizable
tasks.
• Program Scheduling ensures these tasks are efficiently mapped onto processors.
• Together, they determine the performance, scalability, and efficiency of parallel
systems in advanced computer architecture.
Applications include HPC, AI/ML training, real-time systems, cloud computing, and
scientific simulations.
1. Introduction
In computer architecture, program flow mechanisms define how instructions are executed
and controlled in a processor. They determine the order of instruction execution, support
parallelism, and help maximize CPU utilization.
Modern architectures use advanced flow mechanisms to overcome control hazards, data
hazards, and branch penalties while exploiting parallel execution.
2. Dataflow Mechanism
• In dataflow architecture, instructions execute only when their input operands are
available (instead of strict program order).
• No central program counter → execution driven by data availability.
• Advantage: Naturally exposes parallelism.
• Example: Used in scientific computing, functional programming models.
3. Multithreaded Flow Mechanism
• Program execution flow is divided into multiple threads, which may run
concurrently.
• Helps hide latency (memory stalls, I/O delays).
• Implemented in SMT (Simultaneous Multithreading), GPUs, and multicore
CPUs.
4. Conclusion
Program flow mechanisms are essential for efficient instruction execution in advanced
computer architectures.
➢ SYSTEM INTERCONNECT
ARCHITECTURES:
1. Introduction
In multiprocessor and multicomputer systems, multiple processors, memory modules, and
I/O devices must communicate efficiently.
The design of system interconnect architecture defines how these components are linked
and how data is transferred among them.
(A) Shared Bus
• A shared communication path where all processors and memory modules are
connected.
• Only one transfer at a time.
• Advantages: Simple, low cost.
• Disadvantages: Limited scalability, bus contention.
• Example: Early multiprocessors, small SMP systems.
(B) Crossbar Switch
• A grid of switches provides a dedicated path between every processor and every memory module, so multiple transfers can proceed in parallel.
• Advantages: High bandwidth, eliminates the bus bottleneck.
• Disadvantages: Expensive (requires P × M switches), poor scalability.
(C) Multistage Interconnection Networks (MIN)
• Data passes through multiple switching stages between processors and memory.
• Popular topologies:
o Omega Network
o Butterfly Network
o Clos Network
• Advantages: Lower cost than crossbar, supports parallel transfers.
• Disadvantages: Blocking may occur (two transfers needing same path).
5. Comparison Table
Architecture | Cost | Performance | Scalability | Example Use
Bus | Low | Low | Poor | Small SMPs
Crossbar | High | High | Poor (expensive) | Supercomputers (small scale)
Multistage Networks | Medium | Medium-High | Moderate | Parallel computers
Mesh / Torus | Medium | High | High | HPC, GPUs
Hypercube | Medium | High | High | Research systems
Hierarchical | Variable | High | Very High | Cloud clusters, datacenters
6. Conclusion
System interconnect architectures are the backbone of multiprocessor and multicomputer
performance.
➢ PRINCIPLES OF SCALABLE
PERFORMANCE:
1. Introduction
In advanced computer systems, scalability means that performance should increase
proportionally when resources (processors, memory, interconnects) are increased.
The principles of scalable performance define the rules and design techniques that ensure
parallel systems can deliver higher throughput without bottlenecks.
2. Key Principles
(A) Workload Scalability
(B) Balanced System Design
• Balance between:
o Computation power (CPU speed)
o Memory capacity & bandwidth
o I/O throughput
o Interconnect bandwidth
• Example: Adding more processors without increasing memory bandwidth leads to
bottlenecks.
(C) Efficient Resource Utilization
• All processors, memory modules, and network links should be kept busy.
• Avoid idle processors due to poor scheduling, communication delays, or load
imbalance.
• Techniques: Dynamic scheduling, load balancing, multithreading.
(D) Scalable Synchronization
• Locks, barriers, and message passing must scale with processor count.
• Avoid centralized control (single lock = bottleneck).
• Use distributed synchronization, lock-free algorithms, atomic operations.
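To illustrate the lock-free/atomic approach above, here is a small sketch of mine using C11 atomics: many threads update a shared counter with atomic operations instead of a central lock.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define THREADS 8
#define ITERS   100000

static atomic_long counter;            /* shared, updated without a lock */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add(&counter, 1); /* hardware atomic read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", (long)atomic_load(&counter));  /* expected 800000 */
    return 0;
}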
(E) Data Locality
• Programs should maximize data locality (access nearby memory more often).
• Reduce remote memory accesses in distributed systems.
• Caching, NUMA-aware memory allocation, and data partitioning improve scalability.
5. Conclusion
The principles of scalable performance ensure that as processors and resources increase,
the system continues to deliver proportional improvements in throughput.
Thus, scalability is the foundation of modern HPC systems, cloud datacenters, and AI
accelerators.
1. Introduction
Performance in computer architecture refers to how fast, efficient, and scalable a system is
in executing programs.
To evaluate and compare systems, we use metrics (quantitative values) and measures
(methods of evaluation).
(E) Utilization
• Isoefficiency Metric: Defines how problem size must increase with processors to
keep efficiency constant.
5. Conclusion
Execution time, throughput, speedup, efficiency, CPI, and FLOPS are core metrics, while
benchmarks and models (Amdahl, Gustafson) guide real-world evaluation.
1. Introduction
• Parallel Processing refers to the simultaneous execution of multiple instructions or
tasks by dividing a problem into smaller parts and processing them concurrently.
• It overcomes the limitations of sequential execution and improves performance,
throughput, and scalability.
• Enabled by multiprocessors, multicomputers, SIMD, MIMD, vector processors,
GPUs, and cloud clusters.
6. Applications
• CAD/CAM simulations.
• Robotics and automation.
• Real-time control systems.
7. Challenges
• Programming complexity (parallel algorithms, synchronization).
• Communication overhead.
• Load balancing issues.
• Scalability limitations (Amdahl’s Law).
8. Conclusion
Parallel processing is the foundation of modern computing, powering everything from
scientific research to AI applications.
By leveraging multiprocessors, multicomputers, SIMD/MIMD architectures, and
scalable algorithms, advanced computer architectures can meet the growing demand for
high-speed, data-intensive, and real-time applications.
➢ SPEED UP PERFORMANCE LAWS:
1. Introduction
• Speedup measures how much faster a parallel system is compared to a single-
processor system.
• Performance laws help in evaluating, predicting, and optimizing parallel processing
systems.
• They explain limits of parallelism, scalability, and efficiency.
2. Amdahl's Law
Speedup with n processors: S(n) = 1 / ((1 − f) + f / n)
where:
• f = fraction of the program that can be parallelized
• n = number of processors
Even with unlimited processors, the speedup is capped at 1 / (1 − f), so the sequential part is the bottleneck.
3. Gustafson's Law
Scaled speedup: S(n) = n − α(n − 1)
where:
• α = sequential fraction of the scaled workload
• n = number of processors
Shows parallelism is more beneficial when workload grows with system size.
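A small C sketch of my own to compare the two laws numerically (the fractions f and alpha below are assumed example values):

#include <stdio.h>

/* Amdahl: fixed workload, f = parallelizable fraction */
static double amdahl(double f, int n)        { return 1.0 / ((1.0 - f) + f / n); }

/* Gustafson: workload grows with n, alpha = serial fraction */
static double gustafson(double alpha, int n) { return n - alpha * (n - 1); }

int main(void)
{
    double f = 0.95, alpha = 0.05;
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d   Amdahl = %6.2f   Gustafson = %7.2f\n",
               n, amdahl(f, n), gustafson(alpha, n));
    return 0;
}

For f = 0.95, the Amdahl speedup never exceeds 1 / (1 − 0.95) = 20, while the Gustafson scaled speedup keeps growing with n.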
8. Conclusion
• Amdahl’s Law → pessimistic (bottleneck focus).
• Gustafson’s Law → optimistic (scalable workload).
• Karp–Flatt, Sun & Ni, Isoefficiency → practical for real-world HPC.
Together, these laws guide the design, evaluation, and optimization of advanced
computer architectures for parallelism.
1. Introduction
• Scalability = ability of a parallel computer or algorithm to maintain performance
when resources (processors, memory, workload) increase.
• A system is scalable if speedup and efficiency improve proportionally with added
resources.
• Key for High-Performance Computing (HPC), Cloud, Big Data, and AI systems.
2. Scalability Metrics
1. Speedup (S): Ratio of serial execution time to parallel execution time.
2. Efficiency (E): Speedup per processor.
E = Speedup / N (where N is the number of processors)
3. Isoefficiency Function: Defines how problem size must grow with processors to
maintain efficiency.
4. Cost: Product of processors and execution time.
5. Scalability Factor: How performance increases when scaling both workload and
resources.
3. Scalability Analysis
To analyze scalability, we look at:
• Amdahl’s Law: Shows upper limits due to sequential parts (not scalable beyond
bottleneck).
• Gustafson’s Law: Demonstrates scalability when workload grows with processors.
• Isoefficiency Analysis: Evaluates how problem size must scale with resources.
• Bottleneck Identification: Communication, synchronization, memory bandwidth.
4. Approaches to Scalability
(A) Hardware Approaches
5. Challenges in Scalability
• Communication Overhead: More processors = more data exchange.
• Synchronization Delays: Locks and barriers reduce efficiency.
• Load Imbalance: Some processors idle while others overloaded.
• Memory Bottleneck: Limited bandwidth and latency.
• Energy and Cost Constraints: Scaling hardware is expensive and power-hungry.
6. Conclusion
• Scalability is a key measure of parallel system effectiveness.
• An ideal scalable system maintains high efficiency, low overhead, and balanced
workload as processors increase.
• Achieved by combining hardware improvements (multiprocessors, networks) and
software optimizations (parallel algorithms, scheduling, load balancing).
• Future Scalability Approaches: AI-driven scheduling, quantum computing
integration, and energy-efficient parallelism.
➢ HARDWARE TECHNOLOGIES:
1. Introduction
• Hardware technologies form the foundation of parallel and scalable computing.
• They determine performance, energy efficiency, scalability, and communication
speed in modern architectures.
• Advances in processor design, memory systems, interconnection networks, and
accelerators drive next-generation computing.
• Cache Hierarchies (L1, L2, L3): Reduce latency in accessing frequently used data.
• NUMA (Non-Uniform Memory Access): Memory distributed across nodes for
scalability.
• High-Bandwidth Memory (HBM): 3D-stacked memory for faster data transfer
(used in GPUs, AI chips).
• Non-Volatile Memory (NVM): Combines speed of RAM with persistence of
storage.
• Memory Coherence Protocols: Ensure consistency across multiple processors
(MESI, MOESI).
• GPUs (Graphics Processing Units): Thousands of SIMD cores for massive data
parallelism.
• TPUs (Tensor Processing Units): Google’s AI accelerator for deep learning.
• FPGAs (Field Programmable Gate Arrays): Customizable hardware for specific
workloads.
• ASICs (Application-Specific Integrated Circuits): Highly optimized chips for tasks
like AI or cryptography.
• Quantum Processors: Exploit quantum mechanics for parallel computation
(emerging).
• Solid-State Drives (SSDs): Faster access than HDDs, essential for HPC.
• Parallel File Systems: (Lustre, GPFS) for large-scale data access.
• NVMe over Fabrics: High-speed I/O across networks.
4. Conclusion
• Hardware technologies in advanced computer architecture enable parallelism, speed,
scalability, and efficiency.
• Innovations in processors, memory, interconnects, and accelerators are shaping the
future of HPC, AI, and cloud systems.
• Future progress lies in heterogeneous, energy-efficient, and specialized hardware.
➢ PROCESSOR AND MEMORY HIERARCHY:
1. Introduction
• In modern computer systems, processors (CPUs/cores) and memory systems are organized in a hierarchy to balance speed, cost, and capacity.
• The hierarchy ensures that frequently used data is quickly accessible while large data sets are stored efficiently.
• Together, they enable high-performance parallel computing.
2. Processor Hierarchy
Processors are organized to exploit parallelism at multiple levels:
1. Instruction Level (ILP):
o Techniques like pipelining and superscalar execution.
o Multiple instructions executed per cycle.
2. Thread Level (TLP):
o Multithreading (e.g., SMT, Hyper-Threading).
o Multiple threads share processor resources.
3. Core Level (CMP – Chip Multiprocessors):
o Multiple cores on a single chip (dual-core, quad-core, etc.).
3. Memory Hierarchy
Memory is arranged to minimize latency and cost, while maximizing speed and capacity.
Memory Hierarchy Levels:
1. Registers:
o Inside the CPU, fastest, smallest storage (nanoseconds).
o Store operands for immediate execution.
2. Cache Memory:
o L1 Cache: Closest to the core, smallest but fastest.
o L2 Cache: Larger, slower than L1.
4. Processor–Memory Interaction
• Processor Speed vs. Memory Speed Gap (Memory Wall): CPUs are much faster than memory, requiring caches to reduce latency.
• Locality of Reference: Programs repeatedly access the same memory locations (spatial and temporal locality) → basis for caching.
• NUMA (Non-Uniform Memory Access): In multiprocessors, memory is distributed, and access speed depends on location.
• Virtual Memory: Provides abstraction so programs use more memory than physically available, managed via paging.
➢ ADVANCED PROCESSOR
TECHNOLOGY:
1. Introduction
• Advanced processor technologies aim to improve performance, efficiency, and
scalability of modern computer systems.
• Focus areas include instruction-level parallelism, multicore design, heterogeneous
processing, pipelining, and specialized accelerators.
• They are the backbone of high-performance computing (HPC), AI, cloud, and
embedded systems.
2. Pipelining
• Divide instruction execution into stages (fetch, decode, execute, memory, write-
back).
• Multiple instructions execute concurrently in different pipeline stages.
• Increases throughput but may face hazards (data, control, structural).
4. Conclusion
• Advanced processor technologies focus on parallelism (ILP, TLP, DLP), energy
efficiency, and specialization.
• They are essential for scalable, high-performance computing in fields like AI, big
data, cloud, and supercomputing.
• The future lies in heterogeneous, domain-specific, and quantum-inspired
architectures.
➢ SUPER SCALAR AND VECTOR
PROCESSOR:
1. Introduction
In advanced computer architecture, superscalar and vector processors are two powerful
techniques designed to enhance parallel execution and improve performance.
2. Superscalar Processors
Definition:
A superscalar processor can issue and execute multiple instructions per clock cycle using
multiple functional units.
Characteristics:
Advantages:
Examples:
3. Vector Processors
Definition:
Advantages:
Examples:
5. Conclusion
• Superscalar processors improve speed by exploiting ILP across multiple
instructions.
• Vector processors accelerate performance by exploiting DLP across large datasets.
• Modern systems often combine both:
o CPUs → Superscalar execution.
o GPUs/Accelerators → Vector/SIMD processing.
UNIT-3
1. Introduction
• In parallel computer systems, processors often need to communicate and share
data.
• A shared memory organization provides a common address space accessible by all
processors.
• It enables inter-processor communication through read/write operations instead of
explicit message passing.
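A minimal OpenMP sketch (illustrative, not from the notes) of this shared-memory style of communication: all threads read and write the same array through a common address space, with no explicit messages.

#include <omp.h>
#include <stdio.h>

#define N 8

int main(void)
{
    int shared_data[N];                /* lives in the common address space */

    /* Each thread writes its own slots of the shared array ... */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        shared_data[i] = i * i;

    /* ... and any thread (here the master thread) can read every slot afterwards. */
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += shared_data[i];
    printf("sum of squares = %ld\n", sum);
    return 0;
}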
4. Advantages
• Easier programming model (shared variables instead of explicit messaging).
• Faster communication compared to message-passing systems.
• Efficient for tightly coupled multiprocessors.
5. Challenges
• Scalability issues: Memory bus can become a bottleneck as processor count
increases.
• Cache coherence overhead: Maintaining consistency across many caches is
complex.
• Synchronization delays: Multiple processors competing for the same data can cause
contention.
6. Applications
• Widely used in:
o Multicore CPUs (Intel, AMD processors).
o High-performance servers.
o Parallel programming models like OpenMP and Pthreads.
7. Conclusion
• Shared memory organizations are fundamental in multiprocessor systems.
• UMA works for small multiprocessors, while NUMA and COMA improve
scalability for larger systems.
• Despite challenges like coherence and contention, shared memory remains a core
architecture for parallel computing.
➢ SEQUENTIAL AND WEAK
CONSISTENCY MODELS:
1. Introduction
In shared memory multiprocessor systems, multiple processors may read and write shared
variables.
• To ensure correctness, the system must define how memory operations appear to all
processors.
• This is governed by memory consistency models.
“The result of execution is the same as if all operations were executed in some sequential
order, and the operations of each processor appear in this sequence in the order issued by
the program.” — Lamport (1979)
Key Properties:
Example:
Advantages:
Disadvantages:
Key Properties:
Example:
Advantages:
Disadvantages:
• Harder to program.
• Bugs may occur if synchronization is missing.
5. Conclusion
• Sequential consistency ensures correctness but limits performance.
• Weak consistency improves scalability and speed but requires careful
synchronization.
• Modern architectures (x86, ARM, GPUs) use weaker consistency models with
explicit synchronization instructions to balance performance and correctness.
1. Introduction
Modern processors aim to achieve high performance by exploiting parallelism in
instruction execution. Two important methods are:
2. Pipelining
Definition
• A technique in which the instruction execution process is divided into stages, and
different instructions are executed in parallel at different stages.
• Similar to an assembly line in manufacturing.
1. IF – Instruction Fetch
2. ID – Instruction Decode & Operand Fetch
3. EX – Execute (ALU operations)
4. MEM – Memory Access
5. WB – Write Back
Advantages
Challenges
3. Superscalar Techniques
Definition
• A superscalar processor can issue and execute multiple instructions per clock
cycle by using multiple functional units.
Key Features
Example:
Advantages
Challenges
5. Conclusion
• Pipelining improves performance by overlapping instruction execution stages.
• Superscalar techniques go further, allowing multiple instructions to be executed in
parallel per cycle.
• Together, these form the backbone of modern CPU architectures, enabling high-
performance computing.
1. Introduction
• Pipelining is a fundamental technique used to improve CPU performance.
• A Linear Pipeline Processor is the simplest form of pipeline where:
o Processing stages are arranged linearly (sequentially).
o Each stage performs a part of the computation.
o Instructions (or data) flow step by step through these stages, just like in an
assembly line.
2. Definition
A Linear Pipeline Processor is a sequence of processing stages connected in a linear
manner, where:
4. Characteristics
• Linear flow of data/instructions.
• Single entry and single exit.
• No branching or feedback loops within the pipeline.
• Throughput improves as multiple instructions are overlapped.
5. Advantages
1. High throughput – One result per cycle (after initial filling).
2. Simple design – Easy to implement and control.
3. Resource utilization – Each stage works in parallel.
4. Scalability – Can add more stages to increase performance.
6. Limitations
1. Pipeline hazards:
o Structural (resource conflicts).
o Data (RAW, WAR, WAW dependencies).
o Control (branch misprediction).
2. Stalling occurs if dependencies are not resolved.
3. Performance gain is limited by the slowest stage.
4. Not flexible for irregular tasks (works best for repetitive tasks).
7. Applications
• Instruction pipelines in RISC processors.
• Arithmetic pipelines (e.g., floating-point adders, multipliers).
• Image and signal processing where sequential stages are common.
8. Example
Suppose we have a 4-stage linear pipeline:
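A representative worked example (numbers of my own): take k = 4 stages, each completing in one clock cycle τ, and n = 100 instructions.
• Non-pipelined time = n × k × τ = 400τ.
• Pipelined time = (k + n − 1) × τ = 103τ (4 cycles to fill the pipeline, then one result per cycle).
• Speedup = 400τ / 103τ ≈ 3.9, which approaches the ideal value of k = 4 as n grows.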
➢ NON-LINEAR PIPELINE
PROCESSORS:
1. Introduction
• A pipeline processor can be linear (simple sequential flow) or non-linear (complex
paths).
• In contrast to linear pipelines, which have a single input → sequential stages →
single output,
Non-Linear Pipeline Processors have:
o Multiple inputs/outputs,
o Branching, merging, or feedback paths.
This allows them to handle more complex computations beyond straight-line processing.
2. Definition
A Non-Linear Pipeline Processor is a pipeline in which:
• Stages are not arranged in a strict sequence.
• Data/instructions may split into multiple paths, recombine, or even loop back.
• It supports conditional execution, parallel branches, and feedback loops.
3. Structure
• Composed of stages connected in non-linear ways:
o Branching → One stage may feed into multiple next stages.
o Merging → Multiple paths may join into one stage.
o Feedback → Output of a stage may return as input to a previous stage.
Examples:
4. Characteristics
• Flexible execution paths.
• Supports parallel as well as conditional tasks.
• Higher hardware complexity compared to linear pipelines.
• Can improve performance for non-sequential tasks.
5. Advantages
1. Can handle complex operations (loops, conditionals, multi-output computations).
2. Greater flexibility compared to linear pipelines.
3. Supports parallel branches for higher throughput.
4. Useful in VLSI design, multiprocessors, and complex data processing.
6. Limitations
1. High complexity in design and control logic.
2. Pipeline hazards are harder to manage (especially with feedback loops).
3. Difficult to balance stages (some paths may become bottlenecks).
4. Requires sophisticated scheduling and dependency resolution.
7. Applications
• Computer graphics and multimedia → where multiple paths of computation exist.
• Signal and image processing → where iterative and conditional operations are
common.
• VLSI and multiprocessor systems → for flexible dataflow architectures.
• Instruction pipelines with branch prediction and speculative execution.
1. Introduction
• Instruction execution in a processor involves multiple steps (fetch, decode, execute,
etc.).
• Instead of executing one instruction at a time, pipelining allows overlapping these
steps for multiple instructions.
• Instruction Pipeline Design is the method of organizing CPU stages to execute
several instructions concurrently, improving throughput.
2. Basic Idea
• Divide instruction execution into k stages.
• Each stage performs a specific sub-task.
• Different instructions are processed simultaneously in different stages.
• Works like an assembly line in a factory.
Solutions:
5. Performance Metrics
• Pipeline Throughput (TP): Number of instructions completed per unit time.
• Pipeline Latency (L): Time taken for one instruction to complete (pipeline depth ×
clock cycle).
• Speedup (S): Ratio of execution time without pipeline to execution time with
pipeline.
o Ideal speedup = number of stages.
• Efficiency (E): E = Actual speedup / Ideal speedup.
6. Advantages
• High instruction throughput.
• Better utilization of CPU hardware.
• Increases system performance without faster clock speeds.
• Foundation for RISC design and superscalar architectures.
7. Limitations
• Suffering from hazards → stalls, bubbles reduce performance.
• Complexity increases with deeper pipelines.
• Not every instruction benefits equally (e.g., branches cause disruptions).
8. Applications
• Used in modern RISC processors (MIPS, ARM).
• Basis for superscalar processors (multiple pipelines in parallel).
• Used in vector processors and VLIW architectures.
➢ ARITHMETIC PIPELINE DESIGN:
2. Definition
Arithmetic Pipeline Design is the process of dividing an arithmetic operation into
sequential stages, where each stage performs a part of the operation, allowing multiple
operands to be processed concurrently.
Example: floating-point addition is typically divided into four stages:
1. Align exponents
2. Add mantissas
3. Normalize result
4. Round and store
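A small worked example of these four stages (values of my own): adding A = 1.5 × 2³ and B = 1.25 × 2¹.
1. Align exponents: rewrite B as 0.3125 × 2³ so both operands share exponent 3.
2. Add mantissas: 1.5 + 0.3125 = 1.8125.
3. Normalize: 1.8125 × 2³ is already normalized (mantissa between 1 and 2).
4. Round and store: result = 1.8125 × 2³ = 14.5, which matches 12 + 2.5.
In a pipelined FPU, while one operand pair is in the add-mantissas stage, the next pair can already have its exponents aligned.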
3. Basic Structure
An arithmetic pipeline consists of:
4. Example Designs
(a) Floating-Point Addition Pipeline
5. Performance
• Latency (L): Time taken for one operation to complete.
• Throughput (TP): Number of operations completed per unit time.
o After the pipeline is filled, one arithmetic operation can be completed per
clock cycle.
• Speedup: Pipeline can speed up arithmetic-intensive tasks significantly.
6. Advantages
• High throughput for repetitive arithmetic tasks.
• Efficient hardware utilization.
• Parallelism at the micro-operation level.
• Very useful in scientific and engineering applications requiring floating-point
operations.
7. Limitations
• Complexity of pipeline control increases.
• Pipeline hazards (e.g., data dependency between arithmetic operations).
• Fixed pipeline stages → not flexible for all types of arithmetic operations.
• Uneven stage delays can cause imbalance and reduce efficiency.
8. Applications
• Floating-point units (FPUs) in modern processors.
• Digital Signal Processing (DSP) – Fast FFT, convolution, filtering.
• Graphics Processing Units (GPUs) – Vector/matrix arithmetic.
• Scientific supercomputers – Weather forecasting, simulations, AI computations.
➢ SUPERSCALAR PIPELINE DESIGN:
1. Introduction
• Superscalar architecture improves CPU performance by allowing multiple
instructions to be issued and executed in parallel per clock cycle.
• Unlike scalar processors (which fetch/issue one instruction per cycle), a superscalar
processor may fetch/issue 2, 4, or more instructions per cycle.
• To achieve this, the CPU uses parallel pipelines and advanced hardware techniques
for dependency checking and scheduling.
2. Definition
A Superscalar Pipeline is a processor design that uses multiple parallel execution
pipelines within the CPU to execute more than one instruction per clock cycle, subject to
dependency and resource availability.
3. Key Features
1. Multiple Instruction Fetch & Decode Units – Can fetch/issue several instructions at
once.
2. Parallel Execution Pipelines – ALUs, FPUs, and Load/Store units operate
simultaneously.
3. Dynamic Scheduling – Detects instruction dependencies at runtime (Tomasulo’s
algorithm, scoreboarding).
4. Out-of-Order Execution – Instructions may execute out of program order to
maximize pipeline utilization.
5. Speculative Execution – Predicts branches to avoid pipeline stalls.
6. Performance
• Ideal Speedup: If an n-way superscalar CPU can issue n instructions per cycle →
theoretical speedup = n × scalar performance.
• Practical Limitation: Dependency hazards, cache misses, and branch mispredictions
reduce actual speedup.
7. Advantages
• High Instruction-Level Parallelism (ILP) → better throughput.
• Faster execution of general-purpose programs.
• Transparent to programmers (no need for explicit parallel code).
8. Limitations
• Hardware complexity (dependency checking, branch prediction).
• Diminishing returns – beyond 4–8 pipelines, parallelism is limited by available ILP
in programs.
• Higher power consumption and chip area.
9. Applications
• Almost all modern general-purpose CPUs (Intel Core, AMD Ryzen, ARM Cortex,
Apple M-series).
• High-performance servers and workstations.
• Advanced embedded processors.
UNIT-4
2. Parallel Architecture
Parallel architecture is the organization of a computer system that uses multiple processors
or processing elements to execute tasks concurrently.
Types of Parallelism:
1. Data Parallelism – Same operation performed on different data (e.g., SIMD, GPUs).
2. Task Parallelism – Different tasks executed in parallel (e.g., MIMD
multiprocessors).
3. Pipeline Parallelism – Breaking tasks into pipeline stages (e.g., superscalar, VLIW).
3. Scalable Architecture
A system is scalable if its performance grows with the addition of more processors/resources.
Scalability Aspects:
6. Principles of Scalability
• Minimize communication overhead.
• Avoid bottlenecks (memory, network).
• Balance computation and communication.
• Use load balancing strategies.
7. Applications
• Supercomputers – Weather prediction, molecular modeling.
• AI/ML Systems – Training deep neural networks on GPUs/TPUs.
• Big Data Processing – Hadoop, Spark on distributed clusters.
• Cloud Computing – Scalable virtualization and container systems.
8. Advantages
• Handles large-scale problems.
• High throughput and reduced execution time.
• Scalable to meet future demands.
9. Limitations
• Complex hardware and software design.
• Communication overhead in large-scale systems.
• Amdahl’s Law limits maximum speedup.
➢ MULTIPROCESSORS AND
MULTICOMPUTERS:
1. Introduction
• To achieve parallel processing, computer systems can be designed as either:
1. Multiprocessors → Shared-memory systems.
2. Multicomputers → Distributed-memory systems.
• Both are used to improve performance, scalability, and fault tolerance in advanced
computing.
2. Multiprocessors
A multiprocessor system is a computer with two or more CPUs that share a common
memory and are connected by a high-speed bus or interconnection network.
Characteristics:
Types:
Advantages:
Limitations:
3. Multicomputers
A multicomputer system consists of multiple independent computers (nodes), each with its
own private memory and CPUs, connected via a communication network.
Characteristics:
Examples:
Advantages:
Limitations:
5. Applications
• Multiprocessors: Real-time systems, databases, operating systems.
• Multicomputers: Supercomputing, scientific simulations, big data processing, AI
training.
6. Conclusion
• Multiprocessors → Best for small-to-medium scale systems where ease of
programming is important.
• Multicomputers → Best for large-scale parallel computing requiring scalability and
distributed control.
MULTIPROCESSOR SYSTEM
INTERCONNECTS:
1. Introduction
• In multiprocessor systems, multiple CPUs (and caches) must communicate with
each other and with shared memory.
• The interconnection network defines how processors, memory modules, and I/O
devices are linked.
• The efficiency of these interconnects determines:
o Performance (latency, bandwidth).
o Scalability (number of processors supported).
o Reliability (fault tolerance).
2. Types of Interconnects
A. Shared Bus Interconnect
• All processors, memory modules, and I/O devices are connected to a single common
bus.
• Only one transfer at a time is allowed.
Advantages:
Limitations:
B. Crossbar Switch Interconnect
Advantages:
• High bandwidth.
• Eliminates bus bottleneck.
Limitations:
• Expensive (requires P × M switches for P processors and M memory modules).
C. Multistage Interconnection Networks (MIN)
Advantages:
• Lower cost than a crossbar; supports several simultaneous transfers.
Limitations:
• Blocking can occur if two paths need the same switch simultaneously.
D. Mesh / Torus Interconnect
Advantages:
• Scalable.
• Local communication efficient.
Limitations:
E. Hypercube Interconnect
Advantages:
Limitations:
3. Classification
• Static Interconnects: Fixed links (Mesh, Torus, Hypercube).
• Dynamic Interconnects: Use switching elements (Bus, Crossbar, MINs).
4. Performance Metrics
• Bandwidth: Amount of data transferred per unit time.
• Latency: Time to deliver a message.
• Scalability: Ability to support more processors without bottlenecks.
• Fault Tolerance: Ability to reroute in case of failures.
5. Applications
➢ Shared Bus → Small multiprocessor systems (SMPs).
➢ Crossbar / MINs → Medium-sized multiprocessors.
➢ Mesh, Torus, Hypercube → Large-scale parallel computers and supercomputers.
➢ CACHE COHERENCE AND
SYNCHRONIZATION MECHANISM:
1. Introduction
• In multiprocessor systems, each processor often has a local cache to reduce memory
access latency.
• When multiple caches store copies of the same memory block, inconsistencies can
arise if one cache updates the data while others still hold stale values.
• To ensure correct execution of parallel programs, we need:
o Cache coherence protocols → maintain consistency of shared data.
o Synchronization mechanisms → control access to shared variables and
prevent race conditions.
B. Snoopy Protocols
5. Synchronization Mechanisms
Used to control concurrent access to shared resources.
A. Locks
B. Semaphores
C. Barriers
• Force all processors to wait until each has reached a synchronization point before
continuing.
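A minimal Pthreads sketch (my own illustration) of the lock and barrier mechanisms above: a mutex serializes updates to a shared counter, and a barrier makes every thread wait until all have finished their updates.

#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define ITERS   10000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);     /* lock: exclusive access to counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    pthread_barrier_wait(&barrier);    /* barrier: wait for all threads */
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    pthread_barrier_init(&barrier, NULL, THREADS);
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    printf("counter = %ld\n", counter); /* expected: 40000 */
    return 0;
}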
6. Challenges
• Coherence overhead (extra communication).
• False sharing: Different variables in the same cache block cause unnecessary
invalidations.
• Scalability: More processors → higher complexity.
7. Conclusion
• Cache coherence ensures data consistency across multiple caches.
• Synchronization mechanisms prevent race conditions and enable correct parallel
execution.
• Together, they are fundamental for achieving correctness and performance in
multiprocessor systems.
➢ THREE GENERATIONS OF
MULTICOMPUTERS:
1. Introduction
• Multicomputers are message-passing parallel computers where each processor has:
o Private memory (no global shared memory).
o Interconnection network for communication.
• Unlike multiprocessors (shared memory), multicomputers use explicit message
passing for data exchange.
• Their evolution is classified into three generations based on technology and
architecture.
5. Comparison of Generations
Feature | 1st Gen (1980s) | 2nd Gen (1990s) | 3rd Gen (2000s–Now)
Processors | Tens | Hundreds–Thousands | Millions (clusters)
Interconnect | Bus, Ring, Cube | Mesh, Torus, Hypercube | High-speed networks
Programming Model | Basic message passing | MPI, PVM support | MPI + Hybrid models
Performance | GFLOPS | TFLOPS | PFLOPS → ExaFLOPS
Examples | Cosmic Cube, Intel iPSC/1 | Intel Paragon, nCUBE-2 | IBM Blue Gene, Cray XT
6. Conclusion
• First Gen: Prototype and early research machines (limited scalability).
• Second Gen: High-performance interconnects, adoption in scientific computing.
• Third Gen: Commodity-based large-scale clusters, powering today’s
supercomputers.
• Multicomputers evolved from experimental parallelism to practical large-scale
HPC systems.
➢ MESSAGE-PASSING MECHANISMS:
5. Mechanisms
1. Direct Communication
o Processes communicate by naming each other explicitly.
o Example: send(P1, data) → receive(P2, data).
2. Indirect Communication (via Mailboxes/Ports)
o Messages delivered to a mailbox/queue.
o Processes retrieve from it asynchronously.
3. Buffered vs. Unbuffered
o Buffered → messages stored temporarily in queues.
o Unbuffered → sender/receiver must be synchronized.
4. Reliability Mechanisms
o Error detection, acknowledgment, retransmission.
o Essential for large distributed systems.
6. Message-Passing Libraries
• MPI (Message Passing Interface) – standard for scientific computing.
• PVM (Parallel Virtual Machine) – earlier system for heterogeneous clusters.
• Features include point-to-point and collective operations.
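A minimal MPI point-to-point sketch in C (illustrative; assumes an MPI installation and a launch such as mpirun -np 2): rank 0 sends an integer to rank 1, which must receive it explicitly — the message-passing style described above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Point-to-point send: destination rank 1, message tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive of one integer from rank 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}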
7. Applications
• High-performance computing (HPC).
• Distributed simulations.
• Cloud and cluster computing.
• Data-parallel applications like weather forecasting, AI training, etc.
Limitations
9. Conclusion
• Message-passing mechanisms provide a foundation for distributed memory
parallel computing.
• Modern HPC relies on MPI for scalable performance.
• Trade-off: More scalable than shared memory, but harder to program.
➢ MULTIVECTOR AND SIMD COMPUTERS:
1. Introduction
Both multivector and SIMD computers are widely used for scientific computing, AI, image processing, and HPC applications.
2. Multivector Computers
• Definition: Multivector processors are systems capable of executing multiple vector
instructions concurrently.
• Vector processing means handling entire arrays (vectors) of data with a single
instruction.
• Example: Instead of adding two numbers, it adds two vectors element-wise.
Characteristics
Advantages
Examples
3. SIMD Computers
• Definition: SIMD stands for Single Instruction, Multiple Data.
• A single instruction is broadcast to multiple processing elements, each working on
different data simultaneously.
Characteristics
Advantages
Examples
5. Applications
• Multivector → Scientific computing, engineering simulations, physics.
• SIMD → Graphics rendering, machine learning, image/video processing.
6. Conclusion
• Both Multivector and SIMD architectures exploit data parallelism.
• Multivector = focuses on vector pipelines and multiple vector ops.
• SIMD = focuses on applying one instruction across multiple data streams.
• Together, they form the backbone of modern HPC, AI, and GPU computing.
UNIT-5
VECTOR PROCESSING PRINCIPLES:
1. Introduction
• Vector Processing is a form of data-level parallelism where a single instruction
operates on a set of data elements (vector) instead of a single scalar value.
• Example: Instead of adding two numbers, a vector processor can add two arrays
element by element in one instruction.
• Widely used in scientific computing, simulations, multimedia, and AI
applications.
(b) Pipelining
• Vector processors use deep pipelines so that multiple elements are processed
concurrently.
• Each pipeline stage handles one element at a time, producing high throughput.
4. Advantages
• High throughput for repetitive numeric computations.
• Reduces instruction fetch/decode overhead.
• Efficient for large datasets in scientific and multimedia applications.
• Simplifies parallel programming compared to message passing.
5. Limitations
• Not suitable for small or irregular data sets.
• Performance depends on vector length (longer vectors = better efficiency).
• More expensive hardware (vector registers, pipelines).
6. Applications
• Scientific computing: Physics simulations, weather forecasting.
• Engineering: Finite element analysis, CAD.
• AI & ML: Matrix-vector multiplications in neural networks.
• Graphics & Multimedia: Image filtering, video encoding.
7. Conclusion
• Vector processing is a cornerstone of advanced computer architecture, exploiting
data-level parallelism.
• Key principles include vector instructions, pipelining, chaining, masking, and
memory optimizations.
• Though challenged by irregular data patterns, vector processors form the foundation
for modern GPUs, SIMD units, and high-performance computing systems.
MULTIVECTOR MULTIPROCESSORS:
1. Introduction
• A multivector multiprocessor is a parallel computing system that combines the
power of multiprocessing (multiple CPUs working together) with vector processing
(operating on arrays of data).
• It extends the SIMD (Single Instruction, Multiple Data) and MIMD (Multiple
Instruction, Multiple Data) paradigms by supporting multiple vector processors
working in parallel.
• Goal: Achieve massive data-level and task-level parallelism for scientific, AI, and
big data applications.
2. Architecture
• Consists of multiple vector processors (VPUs), each with:
o Vector registers (to store long arrays).
o Vector pipelines (for arithmetic operations like add, multiply, divide).
o Scalar unit (to handle control and scalar data).
• Processors are connected via a multiprocessor interconnect (shared memory,
NUMA, or message-passing network).
• Supports both vector instructions and multiprocessor coordination.
3. Features
1. Parallel Vector Units: Multiple vector processors execute vector operations
simultaneously.
2. Task + Data Parallelism: Supports MIMD task distribution while exploiting
SIMD vector execution within each task.
3. Chaining Across Processors: Output of one vector unit can be fed to another without
delay.
4. Scalable Interconnects: Uses high-speed interconnects like crossbar switches,
hypercube, mesh, or fat-tree networks.
5. Shared or Distributed Memory Models: Works with UMA, NUMA, or distributed
memory depending on system size.
4. Advantages
• High throughput for scientific and engineering workloads.
• Exploits both fine-grained (vector) and coarse-grained (multiprocessor)
parallelism.
• Scalable to large numbers of processors.
• Reduces instruction overhead (one vector instruction = many operations).
5. Limitations
• Complex interconnection network and synchronization mechanisms needed.
• Performance drops if workloads are not vectorizable.
• High hardware cost (vector pipelines + multiprocessor interconnects).
6. Applications
• Supercomputers (e.g., Cray-2, NEC SX series, Fujitsu VP).
• Scientific simulations (climate modeling, astrophysics, quantum mechanics).
• AI & ML workloads (matrix multiplications, tensor computations).
• Big Data Analytics & Multimedia (image/video processing).
7. Conclusion
• Multivector multiprocessors are a powerful architectural model combining
multiprocessing with vector processing.
• They achieve massive speedup for large-scale scientific and engineering problems
by exploiting data-level parallelism at multiple levels.
• Though expensive and complex, they represent a key step in the evolution toward
modern GPUs, AI accelerators, and exascale supercomputers.
➢ COMPOUND VECTOR PROCESSING:
1. Introduction
• Compound vector processing executes a sequence of related vector operations as one chained operation, so intermediate results flow directly between functional units instead of being written back to memory.
Example:
Instead of executing separately:
VADD V1 = A + B
VMUL V2 = V1 * C
VSUB V3 = V2 - D
a compound (chained) execution forwards V1 and V2 directly from the adder to the multiplier and subtractor, computing V3 = (A + B) * C - D in one continuous pipeline pass.
3. Features
1. Vector Chaining – Intermediate results pass directly between functional units.
2. Pipelined Compound Operations – Multiple vector operations are executed in a
pipeline fashion.
3. Parallel Execution – Different functional units (adders, multipliers, dividers) work in
parallel.
4. Reduced Instruction Overhead – Fewer instructions need to be fetched and
decoded.
5. High Data Reuse – Avoids unnecessary memory accesses.
4. Advantages
• Higher throughput due to parallelism across multiple vector operations.
• Reduced memory traffic (results are forwarded directly).
• Better performance on scientific workloads (matrix operations, PDE solvers).
• Exploits data locality by reusing operands.
5. Challenges
• Requires complex control logic to manage dependencies between chained operations.
• Limited benefit if workload is not vectorizable.
• Hardware cost is higher due to multiple pipelines and forwarding paths.
6. Applications
• Supercomputers (e.g., Cray vector processors, NEC SX series).
• Scientific computing – weather forecasting, computational physics, fluid dynamics.
• AI & ML – matrix multiplications, tensor contractions.
• Graphics & Multimedia – image filtering, transformations.
7. Conclusion
• Compound Vector Processing enhances traditional vector processing by executing
multiple vector operations together using chaining and pipelining.
• It significantly boosts performance in scientific and AI workloads.
• Though complex and costly, it paved the way for modern GPUs and AI accelerators
that also rely on compound vector-like parallel execution.
➢ SIMD COMPUTER ORGANIZATIONS:
b) Vector Processors
• Operates on vectors (arrays of data elements) rather than individual scalar data.
• Uses vector registers to store large datasets and perform operations in a pipelined
manner.
• Example: Cray vector supercomputers.
5. Advantages
• High throughput for data-parallel problems.
• Reduced instruction overhead (one instruction operates on many data).
• Energy-efficient (same control for multiple operations).
• Ideal for multimedia, AI, and scientific computing.
6. Challenges
• Not suitable for control-parallel problems (where different tasks need different
instructions).
• Data alignment issues (data must be structured for parallel access).
• Scalability limitations compared to MIMD (Multiple Instruction, Multiple Data)
systems.
7. Applications
• Scientific computing – matrix multiplication, weather modeling, simulations.
• Multimedia & graphics – video encoding, image filtering, rendering.
• AI & Machine Learning – tensor/matrix computations, deep learning acceleration.
• Cryptography & signal processing – FFT, encryption/decryption.
8. Conclusion
• SIMD computer organizations exploit data parallelism by applying the same
instruction across large datasets.
• Early SIMD systems like array processors influenced modern vector
supercomputers, SIMD CPU extensions, and GPUs.
➢ THE CONNECTION MACHINE CM-5:
1. Introduction
• The Connection Machine CM-5 is a massively parallel supercomputer developed
by Thinking Machines Corporation (TMC) in the early 1990s.
• It was designed to support scalable parallelism for scientific, AI, and large-scale
computational problems.
• Unlike its predecessors (CM-1 and CM-2, which were SIMD-based), the CM-5
adopted a MIMD architecture but could emulate SIMD when required.
2. Architecture
• Type: Massively Parallel MIMD (Multiple Instruction, Multiple Data).
• Processing Elements (PEs):
o Each PE was a SPARC processor with local memory.
o Could range from 32 to 16,384 processors depending on system
configuration.
• Interconnection Network:
o Used a fat-tree network (highly scalable and low-latency).
o Allowed efficient communication between thousands of processors.
• Memory Model:
o Each node had local memory (distributed memory model).
o Global operations were achieved through the network.
3. Features
• Scalable Design: Performance scaled almost linearly with added processors.
• Hybrid Programming Models:
o Supported both data-parallel (SIMD-like) and task-parallel (MIMD)
execution.
• High Bandwidth Communication: Fat-tree interconnect minimized bottlenecks.
• Peak Performance:
o Initially achieved several GFLOPS (billions of floating-point operations per
second).
o Later upgraded to teraflop-scale performance, making it one of the fastest
systems of its time.
4. Software Environment
• Programming Languages:
o CM Fortran (data-parallel extensions of Fortran).
o C* (an extended version of C for parallel programming).
• OS Support: UNIX-based operating systems.
5. Applications
• Weather forecasting and climate modeling.
• Molecular dynamics and computational chemistry.
• Image processing and computer vision.
• Large-scale simulations in physics and engineering.
• AI research and machine learning (early neural networks).
6. Legacy
• The CM-5 was among the first commercial supercomputers to successfully
combine MIMD parallelism with scalable interconnection networks.
• Influenced the design of later cluster-based supercomputers and parallel
architectures.
• Made famous by appearing in the movie Jurassic Park (1993) as the park’s
"supercomputer".