
CHRIST

Deemed to be University

Computer Organization & Architecture


(CS435P)

Excellence and Service



Unit-4 PARALLELISM

● Parallel processing challenges – Flynn's classification – SISD, MIMD,
SIMD, SPMD, and Vector Architectures – Hardware multithreading – Multi-core
processors and other Shared Memory Multiprocessors – Introduction to
Graphics Processing Units, Clusters, Warehouse Scale Computers and other
Message-Passing Multiprocessors.


What is Parallelism?
● Doing things simultaneously
  ○ The same thing or different things
  ○ Solving one larger problem
● Serial computing
  ○ The problem is broken into a stream of instructions that are executed
    sequentially, one after another, on a single processor.
  ○ Only one instruction executes at a time.
● Parallel computing
  ○ The problem is divided into parts that can be solved concurrently.
  ○ Each part is further broken into a stream of instructions.
  ○ Instructions from different parts execute simultaneously.
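The serial vs. parallel decomposition above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function names and the four-way split are arbitrary, and CPython threads only model the decomposition (they do not give true CPU parallelism for this workload).

```python
from concurrent.futures import ThreadPoolExecutor

# Serial computing: one stream of instructions, one element at a time.
def serial_sum(xs):
    total = 0
    for x in xs:          # a single instruction stream, executed sequentially
        total += x
    return total

# Parallel computing: split the problem into parts solved concurrently,
# then combine the partial results.
def parallel_sum(xs, parts=4):
    size = len(xs) // parts          # assumes len(xs) divisible by parts
    chunks = [xs[i * size:(i + 1) * size] for i in range(parts)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial = pool.map(serial_sum, chunks)   # parts run concurrently
    return sum(partial)

data = list(range(1, 101))
print(serial_sum(data), parallel_sum(data))   # both print 5050
```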


Serial computation

• Traditionally, serial computation used a single computer with a single
Central Processing Unit (CPU).
• In serial computation, a large problem is broken into smaller parts, but these
sub-parts are executed one by one.
• Only a single instruction may execute at a time, so solving a large problem
takes a lot of time.


Serial computation (contd.)


Parallel Computing


Different forms of parallel computing

• Bit level
• Instruction level
• Data parallelism
• Task parallelism


Advantages of Parallel Computing

● Solves large problems easily.
● Saves money and time.
● Data are transmitted fast.
● Provides concurrency.
● Enables communication among cooperating tasks.
● Gives good performance.
● Allows the best hardware and software primitives to be chosen.


Use of Parallel Computing

• Atmosphere, Earth, Environment


• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion,
photonics
• Bioscience, Biotechnology, Genetics
• Chemistry, Molecular Sciences
• Geology, Seismology
• Mechanical Engineering - from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics


Use of Parallel Computing


● Scientific Computing.
○ Numerically Intensive Simulations
• Database Operations and Information Systems
○ Web based services, Web search engines, Online transaction
processing.

○ Client and inventory database management, Data mining, MIS

○ Geographic information systems, Seismic data Processing


• Artificial intelligence, Machine Learning, Deep Learning
• Real time systems and Control Applications
○ Hardware and Robotic Control, Speech processing, Pattern
Recognition.


Parallel Computer Architectural Model

Parallel architectures are classified into two categories:

○ Shared memory
○ Distributed memory


Flynn’s Classification
SISD, MIMD, SIMD, SPMD, and Vector
● SISD or Single Instruction stream, Single Data stream. A
uniprocessor.
● MIMD or Multiple Instruction streams, Multiple Data streams. A
multiprocessor.
● SPMD or Single Program, Multiple Data streams. The conventional
MIMD programming model, where a single program runs across all
processors.
● SIMD or Single Instruction stream, Multiple Data streams. The same
instruction is applied to many data streams, as in a vector processor or
array processor.
● Data-level parallelism: parallelism achieved by operating on
independent data.

A Taxonomy of Parallel Processor Architectures


SISD (single-instruction single-data streams)

• SISD is a serial, non-parallel computer system. This is the most
common type of computer. This computer system uses only a single instruction
stream and a single data stream.
• Single Instruction: only one instruction stream is acted on by the CPU.
• Single Data: only one data stream is used as input.


SISD (contd.)

● Block Diagram of SISD


SIMD architecture

SIMD architecture (contd.)
• A type of parallel computer
• Single instruction: All processing units execute the same instruction at any given
clock cycle
• Multiple data: Each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of regularity,
such as graphics/image processing.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
○ Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2,
ILLIAC IV

○ Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP,
NEC SX-2, Hitachi S820, ETA10
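The "single instruction, multiple data" idea above can be illustrated with a short Python sketch. This is not from the slides: the function name is invented, and the list comprehension only models the lockstep behaviour, where every processing unit applies the same operation to its own data element.

```python
# SIMD: one instruction ("add scalar") applied in lockstep to many data elements.
def simd_add_scalar(vector, scalar):
    # Conceptually, every processing unit executes the same "add"
    # on a different element during the same clock cycle.
    return [x + scalar for x in vector]

# An SISD machine would instead issue one add per element, one after another.
print(simd_add_scalar([1, 2, 3, 4], 10))   # [11, 12, 13, 14]
```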

MIMD architecture


MIMD architecture (contd.)

• Currently, the most common type of parallel computer. Most modern computers fall
into this category.
• Multiple Instruction: every processor may be executing a different instruction stream
• Multiple Data: every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-deterministic
• Examples: most current supercomputers, networked parallel computer clusters and
"grids", multi-processor SMP computers, multi-core PCs.


MIMD (with shared memory)


MIMD (with distributed memory)


Multiple Instruction, Single Data (MISD)

● A single data stream is fed into multiple processing units.


● Each processing unit operates on the data independently via independent instruction
streams.
● Few actual examples of this class of parallel computer have ever existed. One is the
experimental Carnegie-Mellon C.mmp computer (1971).
● Some conceivable uses might be:
○ Multiple frequency filters operating on a single signal stream; multiple
cryptography algorithms attempting to crack a single coded message.


Instruction and Data Streams

Vector Processors

Vector Processors
• Highly pipelined function units
• Data streams between vector registers (each holding multiple elements) and the
function units

○ Data collected from memory into registers

○ Results stored from registers to memory


• Example: Vector extension to MIPS

○ 32 × 64-element registers (64-bit elements)

○ Vector instructions

• lv, sv: load/store to/from vector registers

• addv.d: add vectors of double

• addvs.d: add scalar to each element of vector of double


• Significantly reduces instruction-fetch bandwidth
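The semantics of the vector instructions named above can be modelled in a few lines of Python. This is a conceptual sketch only; the function names mirror the MIPS vector mnemonics from the slide, and the flat `mem` list is a stand-in for main memory.

```python
VLEN = 64   # 64-element vector registers, as in the vector extension above

def lv(memory, base):                 # lv: load a vector register from memory
    return memory[base:base + VLEN]

def addv_d(v1, v2):                   # addv.d: element-wise vector + vector
    return [a + b for a, b in zip(v1, v2)]

def addvs_d(v, s):                    # addvs.d: scalar added to every element
    return [a + s for a in v]

def sv(memory, base, v):              # sv: store a vector register to memory
    memory[base:base + VLEN] = v

mem = list(range(200))
va = lv(mem, 0)                       # one instruction loads 64 elements
vb = lv(mem, 64)
vc = addv_d(va, vb)                   # one instruction performs 64 additions
sv(mem, 128, vc)
print(mem[128], mem[129])             # 64 66
```

A scalar ISA would need a 64-iteration loop (load, load, add, store, branch) for the same work, which is why the instruction-fetch bandwidth savings are so large.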

Vector Processors
• In computing, a vector processor or array processor is a central processing unit (CPU)
that implements an instruction set containing instructions that operate on
one-dimensional arrays of data called vectors.
• This contrasts with scalar processors, whose instructions operate on single data items.
• Vector processors can greatly improve performance on certain workloads, notably
numerical simulation and similar tasks.
• Rather than having 64 ALUs perform 64 additions simultaneously, like the old array
processors, the vector architectures pipelined the ALU to get good performance at
lower cost.


Vector versus Scalar

Vector instructions have several important properties compared to conventional


instruction set architectures, which are called scalar architectures in this context:

• A single vector instruction is equivalent to executing an entire loop. The
instruction fetch and decode bandwidth needed is dramatically reduced.
• Hardware does not have to check for data hazards within a vector instruction.
• Vector architectures and compilers have a reputation for making it much easier
than MIMD multiprocessors to write efficient applications that contain
data-level parallelism.


Vector versus Scalar

● Hardware need only check for data hazards between two vector instructions once per
vector operand.
● The cost of the latency to main memory is seen only once for the entire vector, rather
than once for each word of the vector.
● Control hazards that would normally arise from the loop branch are non-existent.
● The savings in instruction bandwidth and hazard checking plus the efficient use of
memory bandwidth give vector architectures advantages in power and energy versus
scalar architectures.

Vector Computation
• Mathematical problems involving physical processes present particular
difficulties for computation
○ Aerodynamics, seismology, meteorology

○ Continuous field simulation

• High precision
• Repeated floating point calculations on large arrays of numbers
• Supercomputers handle these types of problem
○ Hundreds of millions of flops

○ $10-15 million

○ Optimised for calculation rather than multitasking and I/O

○ Limited market

■ Research, government agencies, meteorology


● Array processor
○ Alternative to a supercomputer

○ Configured as a peripheral to a mainframe or minicomputer

○ Runs just the vector portion of problems


Vector Addition Example


Approaches

• General-purpose computers rely on iteration to do vector calculations

• In the example, this needs six calculations
• Vector processing
○ Assumes it is possible to operate on a one-dimensional vector of data
○ All elements in a particular row can be calculated in parallel

• Parallel processing
○ Independent processors functioning in parallel

○ Use FORK N to start an individual process at location N

○ JOIN N causes N independent processes to join and merge following the JOIN
■ The O/S co-ordinates JOINs
■ Execution is blocked until all N processes have reached the JOIN
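The FORK/JOIN idea above maps directly onto thread creation and joining. A minimal Python sketch, with invented worker code: FORK corresponds to starting a thread at a given "location" (here, a function), and JOIN blocks until all N independent activities have finished.

```python
import threading

results = {}

def worker(n):
    results[n] = n * n        # each forked activity computes independently

# FORK N: start four independent activities.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()

# JOIN N: execution is blocked here until all N activities have finished.
for t in threads:
    t.join()

print(results)   # {0: 0, 1: 1, 2: 4, 3: 9}
```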


Vector Processor


Vector Processor Classification

• Memory to memory architecture


• Register to register architecture


Vector Processor Classification


Memory to memory architecture
• In memory to memory architecture, source operands, intermediate and final results
are retrieved (read) directly from the main memory.
• For memory to memory vector instructions, the information of the base address, the
offset, the increment, and the vector length must be specified in order to enable
streams of data transfers between the main memory and pipelines.
• The processors like TI-ASC, CDC STAR-100, and Cyber-205 have vector
instructions in memory to memory formats.
• The main points about memory to memory architecture are:
○ There is no limitation on vector size

○ Speed is comparatively slow in this architecture


Vector Processor Classification


Register to register architecture
• In register to register architecture, operands and results are retrieved indirectly from
the main memory through the use of a large number of vector registers or scalar
registers.
• The processors like Cray-1 and the Fujitsu VP-200 use vector instructions in register
to register formats.
• The main points about register to register architecture are:
○ Register to register architecture has limited size.
○ Speed is very high compared to the memory to memory architecture.
○ The hardware cost is high in this architecture.


Symmetric Multiprocessors
● A stand alone computer with the following characteristics
○ Two or more similar processors of comparable capacity
○ Processors share same memory and I/O
○ Processors are connected by a bus or other internal connection
○ Memory access time is approximately the same for each processor
○ All processors share access to I/O
■ Either through same channels or different channels giving paths to
same devices
○ All processors can perform the same functions (hence symmetric)
○ System controlled by an integrated operating system
■ Providing interaction between processors
■ Interaction at job, task, file and data element levels


Symmetric Multiprocessors


SMP Advantages

● Performance
○ If some work can be done in parallel
● Availability
○ Since all processors can perform the same functions, failure of a single
processor does not halt the system
● Incremental growth
○ User can enhance performance by adding additional processors
● Scaling
○ Vendors can offer range of products based on number of processors


Block Diagram of Tightly Coupled Multiprocessor


Symmetric Multiprocessor Organization


Multithreading: Basics

● Thread
○ Instruction stream with state (registers and memory)
○ Register state is also called “thread context”
● Threads could be part of the same process (program) or from different programs
○ Threads in the same program share the same address space (shared memory
model)
● Traditionally, the processor keeps track of the context of a single thread
● Multitasking: when a new thread needs to be executed, the old thread's context in
hardware is written back to memory and the new thread's context is loaded


Multithreading: Basics

• The most important measure of performance for a processor is the rate at which it
executes instructions. This can be expressed as
MIPS rate = f * IPC
• where f is the processor clock frequency, in MHz, and IPC (instructions per cycle) is
the average number of instructions executed per cycle.
• Accordingly, designers have pursued the goal of increased performance on two
fronts:

○ increasing clock frequency and

○ increasing the number of instructions executed or, more properly, the number
of instructions that complete during a processor cycle.
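The formula MIPS rate = f × IPC can be checked with a small worked example. The numbers here are hypothetical, chosen only for illustration:

```python
# MIPS rate = f * IPC, with f the clock frequency in MHz and IPC the
# average instructions completed per cycle.
def mips_rate(f_mhz, ipc):
    return f_mhz * ipc

# A hypothetical 3000 MHz (3 GHz) processor averaging 2 instructions
# per cycle executes 6000 million instructions per second.
print(mips_rate(3000, 2))   # 6000
```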


Multithreading: Basics

• Designers have increased IPC by using an instruction pipeline and then by using
multiple parallel instruction pipelines in a superscalar architecture.
• With pipelined and multiple-pipeline designs, the principal problem is to maximize
the utilization of each pipeline stage.
• An alternative approach, which allows for a high degree of instruction-level
parallelism without increasing circuit complexity or power consumption, is called
multithreading.
• In essence, the instruction stream is divided into several smaller streams, known as
threads, such that the threads can be executed in parallel.


Multithreading: Basics

• Process: An instance of a program running on a computer. A process embodies two


key characteristics:

○ Resource ownership: A process includes a virtual address space to hold the


process image; the process image is the collection of program, data, stack, and
attributes that define the process.

○ Scheduling/execution: The execution of a process follows an execution path


(trace) through one or more programs.
• Process switch: An operation that switches the processor from one process to another,
by saving all the process control data, registers, and other information for the first and
replacing them with the process information for the second


Multithreading: Basics

• Thread: A dispatchable unit of work within a process. It includes a processor context


(which includes the program counter and stack pointer) and its own data area for a
stack (to enable subroutine branching).
• Thread switch: The act of switching processor control from one thread to another
within the same process

Hardware Multithreading
● General idea: Have multiple thread contexts in a single processor
○ When the hardware executes from those hardware contexts determines the
granularity of multithreading
● Why?
○ To tolerate latency (initial motivation)

• Latency of memory operations, dependent instructions, branch


resolution

• By utilizing processing resources more efficiently


○ To improve system throughput

• By exploiting thread-level parallelism

• By improving superscalar processor utilization

○ To reduce context switch penalty


Hardware Multithreading

● Benefit
+ Latency tolerance
+ Better hardware utilization (when?)
+ Reduced context switch penalty
● Cost
- Requires multiple thread contexts to be implemented in hardware
(area, power, latency cost)
- Usually reduced single-thread performance
- Resource sharing, contention
- Switching penalty (can be reduced with additional hardware)


Types of Multithreading

● Fine-grained (Interleaved multithreading)


○ Cycle by cycle
● Coarse-grained (Blocked multithreading)

○ Switch on event (e.g., cache miss)

○ Switch on quantum/timeout
● Simultaneous multithreading (SMT)
○ Instructions from multiple threads executed concurrently in the same cycle
● Chip multiprocessing
○ In this case, multiple cores are implemented on a single chip and each core
handles separate threads


Fine-grained Multithreading

• Idea: Switch to another thread every cycle such that no two instructions from the
same thread are in the pipeline concurrently
• Improves pipeline utilization by taking advantage of multiple threads
• Alternative way of looking at it: Tolerates the control and data dependency latencies
by overlapping the latency with useful work from other threads


Fine-grained Multithreading
● Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from different
threads
+ Improved system throughput, latency tolerance, utilization
● Disadvantages
- Extra hardware complexity: multiple hardware contexts, thread selection logic
- Reduced single thread performance (one instruction fetched every N cycles)
- Resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)
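The cycle-by-cycle switching described above can be simulated with a toy scheduler. This sketch is illustrative only (the thread contents and function name are invented); it shows why no two consecutive pipeline slots hold instructions from the same thread, and why each thread sees only 1/Nth of the issue bandwidth.

```python
# Toy fine-grained multithreading schedule: a different hardware thread
# context is selected round-robin on every cycle.
def fine_grained_schedule(threads, cycles):
    # threads: list of per-thread instruction lists.
    # Returns the (thread id, instruction) issued on each cycle.
    issued, pc = [], [0] * len(threads)
    for cycle in range(cycles):
        t = cycle % len(threads)            # round-robin thread selection
        if pc[t] < len(threads[t]):
            issued.append((t, threads[t][pc[t]]))
            pc[t] += 1
    return issued

sched = fine_grained_schedule([["A0", "A1"], ["B0", "B1"]], 4)
print(sched)   # [(0, 'A0'), (1, 'B0'), (0, 'A1'), (1, 'B1')]
```

Note that thread 0's instructions A0 and A1 are two cycles apart: with N threads, a single thread issues at most once every N cycles, which is the single-thread performance cost listed above.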


Coarse-grained Multithreading

● Idea: When a thread is stalled due to some event, switch to a different hardware context
○ Switch-on-event multithreading
● Possible stall events
○ Cache misses
○ Synchronization events (e.g., load an empty location)
○ FP operations


Fine-grained vs. Coarse-grained MT

● Fine-grained advantages
+ Simpler to implement, can eliminate dependency checking, branch
prediction logic completely
+ Switching need not have any performance overhead (i.e. dead cycles)
+ Coarse-grained requires a pipeline flush or a lot of hardware to save pipeline
state
→ Higher performance overhead with deep pipelines and large
windows
● Disadvantages
- Low single thread performance: each thread gets 1/Nth of the bandwidth of the
pipeline


Simultaneous multithreading (SMT)

● SMT is a variation on hardware multithreading that uses the resources of a
multiple-issue, dynamically scheduled pipelined processor to exploit thread-level
parallelism at the same time it exploits instruction-level parallelism.
● The key insight that motivates SMT is that multiple-issue processors often have more
functional unit parallelism available than most single threads can effectively use.
● Furthermore, with register renaming and dynamic scheduling, multiple instructions
from independent threads can be issued without regard to the dependences among them;
the resolution of the dependences can be handled by the dynamic scheduling capability

Approaches to Explicit Multithreading
● Interleaved
○ Fine-grained
○ Processor deals with two or more thread contexts at a time
○ Switching thread at each clock cycle
○ If thread is blocked it is skipped
● Blocked
○ Coarse-grained
○ Thread executed until event causes delay
○ E.g. cache miss
○ Effective on in-order processor
○ Avoids pipeline stall
● Simultaneous (SMT)
○ Instructions simultaneously issued from multiple threads to execution units of
superscalar processor
● Chip multiprocessing
○ Processor is replicated on a single chip
○ Each processor handles separate threads
Scalar Processor Approaches
• Single-threaded scalar
○ Simple pipeline

○ No multithreading
• Interleaved multithreaded scalar

○ Easiest multithreading to implement

○ Switch threads at each clock cycle

○ Pipeline stages kept close to fully occupied

○ Hardware needs to switch thread context between cycles


• Blocked multithreaded scalar
○ Thread executed until latency event occurs

○ Would stop pipeline

○ Processor switches to another thread



Scalar Diagrams


Multiple Instruction Issue Processors (1)

● Superscalar
○ No multithreading
● Interleaved multithreading superscalar:
○ Each cycle, as many instructions as possible issued from single thread
○ Delays due to thread switches eliminated
○ Number of instructions issued in cycle limited by dependencies
● Blocked multithreaded superscalar
○ Instructions from one thread
○ Blocked multithreading used


Multiple Instruction Issue Diagram (1)


Multiple Instruction Issue Processors (2)

● Very long instruction word (VLIW)

○ E.g. IA-64

○ Multiple instructions in single word

○ Typically constructed by compiler

○ Operations that may be executed in parallel in same word

○ May pad with no-ops


● Interleaved multithreading VLIW
○ Similar efficiencies to interleaved multithreading on superscalar architecture
● Blocked multithreaded VLIW
○ Similar efficiencies to blocked multithreading on superscalar architecture


Parallel, Simultaneous Execution of Multiple Threads

● Simultaneous multithreading
○ Issue multiple instructions at a time
○ One thread may fill all horizontal slots
○ Instructions from two or more threads may be issued
○ With enough threads, can issue the maximum number of instructions on each cycle
● Chip multiprocessor
○ Multiple processors
○ Each has two-issue superscalar processor
○ Each processor is assigned thread
■ Can issue up to two instructions per cycle per thread


Parallel Diagram


Examples

• Some Pentium 4

○ Intel calls it hyperthreading

○ SMT with support for two threads

○ Single multithreaded processor, logically two processors


• IBM Power5
○ High-end PowerPC

○ Combines chip multiprocessing with SMT

○ Chip has two separate processors

○ Each supporting two threads concurrently using SMT


Intel® Hyper-Threading Technology

• Intel® Hyper-Threading Technology is a hardware innovation that allows more than


one thread to run on each core. More threads means more work can be done in
parallel.
• How does Hyper-Threading work?
• When Intel® Hyper-Threading Technology is active, the CPU exposes two execution
contexts per physical core. This means that one physical core now works like two
“logical cores” that can handle different software threads.
• The ten-core Intel® Core™ i9-10900K processor, for example, has 20 threads when
Hyper-Threading is enabled.
• Two logical cores can work through tasks more efficiently than a traditional
single-threaded core. By taking advantage of idle time when the core would formerly be
waiting for other tasks to complete, Intel® Hyper-Threading Technology improves
CPU throughput (by up to 30% in server applications).


Clusters

• Alternative to SMP
• High performance
• High availability
• Server applications
• A group of interconnected whole computers
• Working together as unified resource
• Illusion of being one machine
• Each computer called a node


Clusters Benefits

• Absolute scalability
• Incremental scalability
• High availability
• Superior price/performance


Cluster Configurations - Standby Server, No Shared Disk


Cluster Configurations - Shared Disk


Clustering Methods: Benefits and Limitations


Operating Systems Design Issues


● Failure Management
○ High availability
○ Fault tolerant
○ Failover
■ Switching applications & data from the failed system to an alternative
within the cluster
○ Failback
■ Restoration of applications and data to the original system
after the problem is fixed
● Load balancing
○ Incremental scalability
○ Automatically include new computers in scheduling
○ Middleware needs to recognise that processes may switch between machines


Parallelizing
● Single application executing in parallel on a number of machines in cluster
○ Compiler
• Determines at compile time which parts can be executed in parallel
• Split off for different computers
○ Application
• Application written from scratch to be parallel
• Message passing to move data between nodes
• Hard to program
• Best end result
○ Parametric computing
• If a problem is repeated execution of algorithm on different sets of data
• e.g. simulation using different scenarios
• Needs effective tools to organize and run


Cluster Computer Architecture


Cluster Middleware

● Unified image to user


○ Single system image
● Single point of entry
● Single file hierarchy
● Single control point
● Single virtual networking
● Single memory space
● Single job management system
● Single user interface
● Single I/O space
● Single process space
● Checkpointing
● Process migration


Cluster v. SMP

• Both provide multiprocessor support to high demand applications.


• Both available commercially
○ SMP available for longer
• SMP:
○ Easier to manage and control
○ Closer to single processor systems
• Scheduling is main difference
• Less physical space
• Lower power consumption
• Clustering:
○ Superior incremental & absolute scalability
○ Superior availability
■ Redundancy


Shared Memory Multiprocessor

• Alternative to SMP & clustering


• Uniform memory access (UMA)
○ All processors have access to all parts of memory
■ Using load & store
○ Access time to all regions of memory is the same
○ Access time to memory for different processors same
○ As used by SMP
● Nonuniform memory access (NUMA)
○ All processors have access to all parts of memory
■ Using load & store
○ Access time of processor differs depending on region of memory
○ Different processors access different regions of memory at different speeds
● Cache-coherent NUMA (CC-NUMA)
○ Cache coherence is maintained among the caches of the various processors
○ Significantly different from SMP and clusters


Shared Memory Multiprocessors

● SHARED MEMORY MULTIPROCESSOR (SMP) A parallel processor with a


single address space, implying implicit communication with loads and stores.
● Single address space multiprocessors come in two styles.
● UNIFORM MEMORY ACCESS (UMA) A multiprocessor in which accesses to
main memory take about the same amount of time no matter which processor requests
the access and no matter which word is asked.
● NON UNIFORM MEMORY ACCESS (NUMA) A type of single address space
multiprocessor in which some memory accesses are much faster than others
depending on which processor asks for which word.


Motivation

● SMP has practical limit to number of processors


○ Bus traffic limits to between 16 and 64 processors
● In clusters each node has own memory
○ Apps do not see large global memory
○ Coherence maintained by software not hardware
● NUMA retains SMP flavour while giving large scale multiprocessing
○ e.g. Silicon Graphics Origin NUMA 1024 MIPS R10000 processors
● Objective is to maintain transparent system wide memory while permitting
multiprocessor nodes, each with own bus or internal interconnection system


NUMA Pros & Cons


• Effective performance at higher levels of parallelism than SMP
• No major software changes
• Performance can break down if there is too much access to remote memory
○ Can be avoided by:
■ L1 & L2 cache design reducing all memory access
● Needs good temporal locality in software
■ Good spatial locality in software
■ Virtual memory management moving pages to the nodes that are using
them most
● Not transparent
○ Page allocation, process allocation and load balancing changes needed
● Availability?


CC-NUMA Operation
• Each processor has own L1 and L2 cache
• Each node has own main memory
• Nodes connected by some networking facility
• Each processor sees single addressable memory space
• Memory request order:

○ L1 cache (local to processor)

○ L2 cache (local to processor)

○ Main memory (local to node)

○ Remote memory
■ Delivered to requesting (local to processor) cache
● Automatic and transparent
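The memory request order above can be modelled as a simple lookup cascade. This sketch is purely illustrative (the sets standing in for caches and memories are invented); it shows only where a request is satisfied, not the latencies involved.

```python
# Model of the CC-NUMA lookup order: L1, then L2, then the node's local
# main memory, then remote memory over the interconnect.
def resolve(addr, l1, l2, local_mem, remote_mem):
    for name, store in (("L1", l1), ("L2", l2), ("local", local_mem)):
        if addr in store:
            return name
    return "remote"   # fetched from a remote node, then cached locally

l1, l2, local = {0x10}, {0x10, 0x20}, {0x10, 0x20, 0x30}
print(resolve(0x20, l1, l2, local, set()))   # L2
print(resolve(0x99, l1, l2, local, set()))   # remote
```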


CC-NUMA Organization


Some key characteristics of how GPUs differ from CPUs:
● Graphics processing unit ( G P U ) A processor optimized for 2D and 3D graphics,
video, visual computing, and display.
● GPUs are accelerators that supplement a CPU, so they do not need be able to perform
all the tasks of a CPU.
● The programming interfaces to GPUs are high-level application programming
interfaces (APIs), such as OpenGL and Microsoft's DirectX, coupled with high-level
graphics shading languages, such as NVIDIA's C for Graphics (Cg) and Microsoft's
High Level Shader Language (HLSL).
● Graphics processing involves drawing vertices of 3D geometry primitives such as
lines and triangles and shading or rendering pixel fragments of geometric primitives.
Video games, for example, draw 20 to 30 times as many pixels as vertices.
● To render millions of pixels per frame rapidly, the GPU evolved to execute many
threads from vertex and pixel shader programs in parallel.
● The graphics data types are vertices, consisting of (x, y, z, w) coordinates, and pixels,
consisting of (red, green, blue, alpha) color components.
● The working set can be hundreds of megabytes, and it does not show the same
temporal locality as data does in mainstream applications. Moreover, there is a great
deal of data-level parallelism in these tasks.

These differences led to different styles of architecture:

● Biggest difference is that GPUs do not rely on multilevel caches to overcome the long
latency to memory, as do CPUs.
● GPUs rely on extensive parallelism to obtain high performance, implementing many
parallel processors and many concurrent threads.
● The GPU main memory is thus oriented toward bandwidth rather than latency.


Why Compute Unified Device Architecture (CUDA) and GPU


Computing?
● The key to programmability of the hardware is NVIDIA's CUDA (Compute Unified
Device Architecture) programming language, which enables the programmer to write
C programs to execute on GPUs.
● GPU computing Using a GPU for computing via a parallel programming language
and API.
● GPGPU Using a GPU for general-purpose computation via a traditional graphics
API and graphics pipeline.
● A CUDA program is a unified C/C++ program for a heterogeneous CPU and GPU
system. It executes on the CPU and dispatches parallel work to the GPU.
● Work consists of a data transfer from main memory and a thread dispatch.
● The CUDA compiler allocates registers to each thread, under the constraint that the
registers per thread times threads per thread block does not exceed the 8192 registers
per multiprocessor.
● CUDA A scalable parallel programming model and language based on C/C++. It is a
parallel programming platform for GPUs and multicore CPUs.
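The register constraint stated above (registers per thread × threads per thread block must not exceed the 8192 registers per multiprocessor) can be checked with simple arithmetic. A minimal sketch; the sample register counts are hypothetical:

```python
# CUDA register constraint from the slide:
# registers_per_thread * threads_per_block <= 8192 per multiprocessor.
REGISTERS_PER_MULTIPROCESSOR = 8192

def max_threads_per_block(registers_per_thread):
    return REGISTERS_PER_MULTIPROCESSOR // registers_per_thread

def block_fits(registers_per_thread, threads_per_block):
    return registers_per_thread * threads_per_block <= REGISTERS_PER_MULTIPROCESSOR

# A kernel using 16 registers per thread can run up to 512 threads per block;
# at 32 registers per thread, only 256.
print(max_threads_per_block(16), max_threads_per_block(32))   # 512 256
```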
