Chapter 8
PARALLEL PROCESSING
Multiple Processor Organization
Single instruction, single data (SISD) stream
Single processor executes a single instruction stream to operate on data stored in a single memory
Uniprocessors fall into this category
Single instruction, multiple data (SIMD) stream
A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis
Vector and array processors fall into this category
Multiple instruction, single data (MISD) stream
A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence
Not commercially implemented
Multiple instruction, multiple data (MIMD) stream
A set of processors simultaneously execute different instruction sequences on different data sets
SMPs, clusters, and NUMA systems fit this category
Figure 8.1
Figure 8.2
Symmetric Multiprocessor (SMP)
A stand-alone computer with the following characteristics:
• Two or more similar processors of comparable capacity
• Processors share the same memory and I/O facilities and are connected by a bus or other internal connection; memory access time is approximately the same for each processor
• All processors share access to I/O devices, either through the same channels or through different channels giving paths to the same devices
• All processors can perform the same functions (hence “symmetric”)
• System controlled by an integrated operating system that provides interaction between processors and their programs at the job, task, file, and data element levels
Multiprogramming
and
Multiprocessing
Figure 8.3
Figure 8.4
Symmetric Multiprocessor
Organization
Figure 8.5
The bus organization has several
attractive features:
Simplicity
Simplest approach to multiprocessor organization
Flexibility
Generally easy to expand the system by attaching more
processors to the bus
Reliability
The bus is essentially a passive medium and the failure of any
attached device should not cause failure of the whole system
Disadvantages of the bus organization:
Main drawback is performance
All memory references pass through the common bus
Performance is limited by bus cycle time
Each processor should have cache memory
Reduces the number of bus accesses
Leads to problems with cache coherence
If a word is altered in one cache it could conceivably invalidate a
word in another cache
To prevent this the other processors must be alerted that an
update has taken place
Typically addressed in hardware rather than the operating system
Multiprocessor Operating System
Design Considerations
Simultaneous concurrent processes
OS routines need to be reentrant to allow several processors to execute the same OS code simultaneously
OS tables and management structures must be managed properly to avoid deadlock or invalid operations
Scheduling
Any processor may perform scheduling so conflicts must be avoided
Scheduler must assign ready processes to available processors
Synchronization
With multiple active processes having potential access to shared address spaces or I/O resources, care must be
taken to provide effective synchronization
Synchronization is a facility that enforces mutual exclusion and event ordering
Memory management
In addition to dealing with all of the issues found on uniprocessor machines, the OS needs to exploit the available
hardware parallelism to achieve the best performance
Paging mechanisms on different processors must be coordinated to enforce consistency when several processors
share a page or segment and to decide on page replacement
Reliability and fault tolerance
OS should provide graceful degradation in the face of processor failure
Scheduler and other portions of the operating system must recognize the loss of a processor and restructure
accordingly
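The synchronization point above — a facility that enforces mutual exclusion — can be sketched in a few lines. This is a minimal illustration using Python's standard `threading.Lock`, not anything from the text itself: two threads update a shared counter, and the lock keeps each increment atomic.

```python
import threading

# Minimal mutual-exclusion sketch: a lock guards a shared counter
# updated concurrently by two threads.
counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:              # only one thread in the critical section at a time
            counter += 1

t1 = threading.Thread(target=increment, args=(100_000,))
t2 = threading.Thread(target=increment, args=(100_000,))
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)  # → 200000
```

Without the lock, the read-modify-write on `counter` could interleave between threads and lose updates — exactly the invalid operation the OS tables discussion warns about.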
Cache Coherence
Software Solutions
Attempt to avoid the need for additional hardware circuitry
and logic by relying on the compiler and operating system to
deal with the problem
Attractive because the overhead of detecting potential
problems is transferred from run time to compile time, and
the design complexity is transferred from hardware to
software
However, compile-time software approaches generally must make
conservative decisions, leading to inefficient cache utilization
Cache Coherence
Hardware-Based Solutions
Generally referred to as cache coherence protocols
These solutions provide dynamic recognition at run time of
potential inconsistency conditions
Because the problem is only dealt with when it actually arises
there is more effective use of caches, leading to improved
performance over a software approach
Approaches are transparent to the programmer and the
compiler, reducing the software development burden
Can be divided into two categories:
Directory protocols
Snoopy protocols
Directory Protocols
Collect and maintain information about copies of data in caches
Directory stored in main memory
Requests are checked against the directory and appropriate transfers are performed
Effective in large-scale systems with complex interconnection schemes
Creates a central bottleneck
Snoopy Protocols
Distribute the responsibility for maintaining cache coherence
among all of the cache controllers in a multiprocessor
A cache must recognize when a line that it holds is shared with other
caches
When updates are performed on a shared cache line, it must be
announced to other caches by a broadcast mechanism
Each cache controller is able to “snoop” on the network to observe
these broadcast notifications and react accordingly
Suited to bus-based multiprocessors because the shared bus
provides a simple means for broadcasting and snooping
Care must be taken that the increased bus traffic required for
broadcasting and snooping does not cancel out the gains from the
use of local caches
Two basic approaches have been explored:
Write invalidate
Write update (or write broadcast)
Write Invalidate
Multiple readers, but only one writer at a time
When a write is required, all other caches of the line are
invalidated
Writing processor then has exclusive (cheap) access until
line is required by another processor
Most widely used in commercial multiprocessor systems
such as the Pentium 4 and PowerPC
State of every line is marked as modified, exclusive, shared
or invalid
For this reason the write-invalidate protocol is called MESI
Write Update
Can be multiple readers and writers
When a processor wishes to update a shared line the word to
be updated is distributed to all others and caches containing
that line can update it
Some systems use an adaptive mixture of both write-
invalidate and write-update mechanisms
MESI Protocol
To provide cache consistency on an SMP the data cache
supports a protocol known as MESI:
Modified
The line in the cache has been modified and is available only in
this cache
Exclusive
The line in the cache is the same as that in main memory and is
not present in any other cache
Shared
The line in the cache is the same as that in main memory and may
be present in another cache
Invalid
The line in the cache does not contain valid data
Table 8.1
MESI Cache Line States
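As an illustrative toy model (names and structure are this sketch's own, not from the text), the four MESI states and the write-invalidate transitions can be simulated for a single cache line shared between two caches:

```python
# Toy MESI sketch for one cache line.
# States: 'M' modified, 'E' exclusive, 'S' shared, 'I' invalid.

class Cache:
    def __init__(self):
        self.state = 'I'

def read(reader, others):
    """Local read: on a miss, any other holder drops to Shared;
    the reader loads Exclusive if it is the only holder, else Shared."""
    if reader.state == 'I':
        holders = [c for c in others if c.state != 'I']
        for c in holders:
            c.state = 'S'           # snooping caches observe the read
        reader.state = 'S' if holders else 'E'

def write(writer, others):
    """Local write: broadcast invalidates every other copy (write invalidate)."""
    for c in others:
        c.state = 'I'
    writer.state = 'M'              # line now dirty, held only here

a, b = Cache(), Cache()
read(a, [b])    # a misses, no other holder → a Exclusive
read(b, [a])    # b misses, a holds the line → both Shared
write(a, [b])   # a writes → a Modified, b Invalid
print(a.state, b.state)  # → M I
```

A real protocol also handles write-back of a Modified line when another processor reads it; this sketch only shows the state bookkeeping that Figure 8.6 diagrams.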
MESI State Transition Diagram
Figure 8.6
Multithreading and Chip
Multiprocessors
Processor performance can be measured by the rate at which it
executes instructions
MIPS rate = f * IPC
f = processor clock frequency, in MHz
IPC = average instructions per cycle
Performance can be increased by raising the clock frequency and
by increasing the number of instructions that complete during a cycle
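The MIPS relationship above can be computed directly. A minimal sketch (the function name is illustrative):

```python
def mips_rate(f_mhz: float, ipc: float) -> float:
    """MIPS rate = f * IPC, with f the processor clock frequency in MHz
    and IPC the average number of instructions completed per cycle."""
    return f_mhz * ipc

# A 2000 MHz (2 GHz) processor averaging 1.5 instructions per cycle:
print(mips_rate(2000, 1.5))  # → 3000.0
```

The two levers in the formula are exactly the two avenues named above: clock frequency (f) and instruction-level parallelism (IPC).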
Multithreading
Allows for a high degree of instruction-level parallelism without
increasing circuit complexity or power consumption
Instruction stream is divided into several smaller streams, known as
threads, that can be executed in parallel
Definitions of Threads and Processes
Thread in multithreaded processors may or may not be the same as the concept of software threads in a multiprogrammed operating system
Thread is concerned with scheduling and execution, whereas a process is concerned with both scheduling/execution and resource ownership
Process:
• An instance of a program running on a computer
• Two key characteristics: resource ownership and scheduling/execution
Process switch:
• Operation that switches the processor from one process to another by saving all the process control data, registers, and other information for the first and replacing them with the process information for the second
Thread:
• Dispatchable unit of work within a process
• Includes processor context (which includes the program counter and stack pointer) and data area for stack
• Executes sequentially and is interruptible so that the processor can turn to another thread
Thread switch:
• The act of switching processor control between threads within the same process
• Typically less costly than a process switch
Implicit and Explicit
Multithreading
All commercial processors and most
experimental ones use explicit multithreading
Concurrently execute instructions from different
explicit threads
Interleave instructions from different threads on
shared pipelines or parallel execution on parallel
pipelines
Implicit multithreading is concurrent execution of multiple
threads extracted from a single sequential program
Implicit threads defined statically by compiler or
dynamically by hardware
Approaches to Explicit
Multithreading
Interleaved (fine-grained)
Processor deals with two or more thread contexts at a time
Switching thread at each clock cycle
If a thread is blocked it is skipped
Blocked (coarse-grained)
Thread executed until an event causes delay
Effective on an in-order processor
Avoids pipeline stall
Simultaneous (SMT)
Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
Chip multiprocessing
Processor is replicated on a single chip
Each processor handles separate threads
Advantage is that the available logic area on a chip is used effectively
Approaches to Executing Multiple Threads
Figure 8.7
Example Systems
Pentium 4
More recent models of the Pentium 4 use a multithreading technique that Intel refers to as hyperthreading
Approach is to use SMT with support for two threads
Thus the single multithreaded processor is logically two processors
IBM Power5
Chip used in high-end PowerPC products
Combines chip multiprocessing with SMT
Has two separate processors, each of which is a multithreaded processor capable of supporting two threads concurrently using SMT
Designers found that having two two-way SMT processors on a single chip provided superior performance to a single four-way SMT processor
Power5 Instruction Data Flow
Figure 8.8
Clusters
Alternative to SMP as an approach to providing
high performance and high availability
Particularly attractive for server applications
Defined as:
A group of interconnected whole computers working
together as a unified computing resource that can
create the illusion of being one machine
(The term whole computer means a system that can run
on its own, apart from the cluster)
Each computer in a cluster is called a node
Benefits:
Absolute scalability
Incremental scalability
High availability
Superior price/performance
Cluster
Configurations
Figure 8.9
Table 8.2
Clustering Methods: Benefits and Limitations
Operating System Design Issues
How failures are managed depends on the clustering method used
Two approaches:
Highly available clusters
Fault tolerant clusters
Failover
The function of switching applications and data resources over from a failed system
to an alternative system in the cluster
Failback
Restoration of applications and data resources to the original system once it
has been fixed
Load balancing
Incremental scalability
Automatically include new computers in scheduling
Middleware needs to recognize that processes may switch between machines
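The failover and failback functions described above can be sketched as a small reassignment routine. This is a hypothetical illustration only — node names, the health list, and the `failover` helper are all this sketch's inventions, standing in for real cluster middleware:

```python
# Hypothetical failover sketch: a watchdog moves each application off
# any node reported unhealthy and onto an idle healthy node.
def failover(assignment, healthy):
    """assignment maps app -> node; healthy lists nodes still alive."""
    standby = [n for n in healthy if n not in assignment.values()]
    for app, node in assignment.items():
        if node not in healthy and standby:
            assignment[app] = standby.pop(0)   # switch app to alternative node
    return assignment

apps = {"db": "node1", "web": "node2"}
print(failover(apps, healthy=["node2", "node3"]))
# → {'db': 'node3', 'web': 'node2'}
```

Failback is the same operation run in reverse once the repaired node reappears in the healthy list; real middleware must additionally migrate application state and data resources, which this sketch omits.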
Parallelizing Computation
Effective use of a cluster requires executing software from a single application in parallel
Three approaches are:
Parallelizing compiler
• Determines at compile time which parts of an application can be executed in parallel
• These are then split off to be assigned to different computers in the cluster
Parallelized application
• Application written from the outset to run on a cluster and uses message passing to move data between cluster nodes
Parametric computing
• Can be used if the essence of the application is an algorithm or program that must be executed a large number of times, each time with a different set of starting conditions or parameters
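Parametric computing, the last approach above, can be sketched with a pool of local worker processes standing in for cluster nodes (the `simulate` kernel and its parameters are placeholders, not from the text):

```python
from multiprocessing import Pool

# Parametric-computing sketch: the same kernel is executed many times,
# once per parameter set, with runs farmed out across workers
# (local processes here, cluster nodes in practice).
def simulate(params):
    start, rate = params
    return start * rate            # placeholder for a real model run

if __name__ == "__main__":
    parameter_sets = [(x, 1.5) for x in range(8)]
    with Pool(4) as pool:
        results = pool.map(simulate, parameter_sets)
    print(results)  # → [0.0, 1.5, 3.0, 4.5, 6.0, 7.5, 9.0, 10.5]
```

Because each run is independent, this style of workload needs no message passing between runs — which is what makes it the easiest of the three approaches to spread across a cluster.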
Cluster Computer Architecture
Figure 8.10
Example
100-Gbps
Ethernet
Configuration
for Massive
Blade Server
Site
Figure 8.11
Clusters Compared to SMP
Both provide a configuration with multiple processors to support high-demand applications
Both solutions are available commercially
SMP
• Easier to manage and configure
• Much closer to the original single-processor model for which nearly all applications are written
• Less physical space and lower power consumption
• Well established and stable
Clustering
• Far superior in terms of incremental and absolute scalability
• Superior in terms of availability
• All components of the system can readily be made highly redundant
Nonuniform Memory Access
(NUMA)
Alternative to SMP and clustering
Uniform memory access (UMA)
All processors have access to all parts of main memory using loads and stores
Access time to all regions of memory is the same
Access time to memory for different processors is the same
Nonuniform memory access (NUMA)
All processors have access to all parts of main memory using loads and stores
Access time of processor differs depending on which region of main memory
is being accessed
Different processors access different regions of memory at different speeds
Cache-coherent NUMA (CC-NUMA)
A NUMA system in which cache coherence is maintained among the caches of
the various processors
Motivation
SMP has a practical limit to the number of processors that can be used
• Bus traffic limits this to between 16 and 64 processors
In clusters each node has its own private main memory
• Applications do not see a large global memory
• Coherency is maintained by software rather than hardware
NUMA retains the SMP flavor while giving large-scale multiprocessing
Objective with NUMA is to maintain a transparent system-wide memory while permitting multiple multiprocessor nodes, each with its own bus or internal interconnect system
CC-NUMA Organization
Figure 8.12
NUMA Pros and Cons
Pros
• Main advantage of a CC-NUMA system is that it can deliver effective performance at higher levels of parallelism than SMP, without requiring major software changes
• Bus traffic on any individual node is limited to a demand that the bus can handle
Cons
• If many of the memory accesses are to remote nodes, performance begins to break down
• Does not transparently look like an SMP; software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system
• Concern with availability
Vector Computation
There is a need for computers to solve mathematical problems of
physical processes in disciplines such as aerodynamics, seismology,
meteorology, and atomic, nuclear, and plasma physics
Need for high precision and a program that repetitively performs
floating point arithmetic calculations on large arrays of numbers
Most of these problems fall into the category known as continuous-field
simulation
Supercomputers were developed to handle these types of problems
However they have limited use and a limited market because of their price tag
There is a constant demand to increase performance
Array processor
Designed to address the need for vector computation
Configured as peripheral devices by both mainframe and minicomputer users
to run the vectorized portions of programs
Vector Addition Example
Figure 8.13
Matrix Multiplication
(C = A * B)
Figure 8.14
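As a plain scalar sketch of the C = A * B computation, the triple loop below shows the structure a vector facility exploits: the inner product loop is the repetitive floating-point work that can be pipelined or issued as vector operations, one result element per (i, j) pair. (The code is this sketch's own, not taken from the figure.)

```python
# Scalar matrix multiplication C = A * B; the innermost loop is the
# vectorizable kernel of repeated multiply-accumulate operations.
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):          # vectorizable inner loop
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(A, B))  # → [[19.0, 22.0], [43.0, 50.0]]
```

A vector processor replaces the inner loop with pipelined vector multiply and add instructions; an array processor assigns the independent (i, j) elements to parallel processing elements.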
Approaches to
Vector
Computation
Figure 8.15
Pipelined Processing
of Floating-Point
Operations
Figure 8.16
A Taxonomy of
Computer Organizations
Figure 8.17
IBM 3090 with
Vector Facility
Figure 8.18
Alternative
Programs
for Vector
Calculation
Figure 8.19
Registers for the IBM
3090 Vector Facility
Figure 8.20
Table 8.3
IBM 3090 Vector Facility:
Arithmetic and Logical Instructions
Summary
Chapter 8: Parallel Processing
Multiple processor organizations
• Types of parallel processor systems
• Parallel organizations
Symmetric multiprocessors
• Organization
• Multiprocessor operating system design considerations
Cache coherence and the MESI protocol
• Software solutions
• Hardware solutions
• The MESI protocol
Multithreading and chip multiprocessors
• Implicit and explicit multithreading
• Approaches to explicit multithreading
• Example systems
Clusters
• Cluster configurations
• Operating system design issues
• Cluster computer architecture
• Blade servers
• Clusters compared to SMP
Nonuniform memory access
• Motivation
• Organization
• NUMA pros and cons
Vector computation
• Approaches to vector computation
• IBM 3090 vector facility