Chapter 7

The document discusses different types of multiprocessor and parallel computing systems, including their advantages and challenges. It describes multiprocessors that have multiple processors, multicore microprocessors that have multiple processor cores, and computer clusters connected over a network that can function as a single large multiprocessor. The document also discusses shared memory multiprocessors and message passing architectures, noting it can be difficult to write parallel programs that efficiently utilize multiple processors.


CpE 440
Computer Architecture
Dr. Haithem Al-Mefleh
Computer Engineering Department
Yarmouk University, Second 2020-2021

Multicores, Multiprocessors,
and Clusters


• Multiprocessor – a computer system with at least 2 processors
  • If 1 processor breaks, the others continue
  • Performance, reliability, availability

• Job-level parallelism (or process-level parallelism)
  • Different programs run on different processors

• Parallel processing program
  • 1 program runs on different processors

• Cluster – a number of computers connected over a LAN that work together as one large multiprocessor

• Multicore microprocessor – a microprocessor that contains multiple processors (cores) in a single integrated circuit

• Parallel programming
  • Programs must execute efficiently in both performance and power


Hardware & Software

• Challenge – making effective use of parallel hardware

• Parallel processing program (parallel software) = sequential or concurrent software running on parallel hardware

Difficulty of Writing Parallel Processing Programs

• It is difficult to write software that uses multiple processors to complete 1 task faster
• The problem gets worse as the number of processors increases

• You must get better performance and efficiency, or you might as well use a sequential program on a uniprocessor, since that is easier:
  uniprocessors already exploit superscalar/out-of-order/… techniques without the programmer's involvement


• Why are parallel processing programs so much harder to write than sequential programs?!

• Communication overhead
• Scheduling
• Load balancing – dividing the work equally
• Synchronization time

• Even small sequential parts must be parallelized for a program to make good use of many cores
  • e.g., to reach a speedup of 90 with 100 processors, the sequential part can be at most about 0.1% (0.001) of the original execution time
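A quick worked check of that 0.1% figure, assuming the usual Amdahl's Law form (the algebra is added here, not on the slide):

Speedup = 1 / ((1 − F_parallel) + F_parallel/100) = 90
⇒ (1 − F_parallel) + F_parallel/100 = 1/90
⇒ F_parallel ≈ 0.999, so the sequential part is about 1 − 0.999 = 0.001 = 0.1%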


Getting good speedup on a multiprocessor while keeping the problem size fixed (strong scaling) is harder than getting good speedup by growing the problem size as well (weak scaling).

Example: sum of 10 scalars plus the sum of a pair of square matrices; time for one addition = t

10 × 10 matrices (100 additions):
• Single processor: 10t + 100t = 110t
• 10 processors: 10t + 100t/10 = 20t → Speedup = 110t/20t = 5.5; (5.5/10)×100% = 55% of the potential speedup
• 100 processors: 10t + 100t/100 = 11t → Speedup = 110t/11t = 10; (10/100)×100% = 10% of the potential speedup

100 × 100 matrices (10,000 additions):
• Single processor: 10t + 10000t = 10010t
• 10 processors: 10t + 10000t/10 = 1010t → Speedup = 10010t/1010t ≈ 9.9; 99% of the potential speedup
• 100 processors: 10t + 10000t/100 = 110t → Speedup = 10010t/110t ≈ 91; 91% of the potential speedup


With perfect load balance in the previous example, each of the 100 processors handles 1% of the load, and Speedup ≈ 91.

If 1 processor gets 2% of the load (2% × 10,000 = 200 additions), the other 99 share 9,800 additions:
• Time = max(200t, 9800t/99) + 10t = 210t → Speedup = 10010t/210t ≈ 48

If 1 processor gets 5% of the load (500 additions), the other 99 share 9,500 additions:
• Time = max(500t, 9500t/99) + 10t = 510t → Speedup = 10010t/510t ≈ 20

Shared Memory Multiprocessors


How to simplify the task? One option: SMP

• A single physical address space that all processors share
• Variables can be made available at any time to any processor
• Independent jobs can still run in their own virtual address spaces
• Processors communicate through shared variables

2 styles of SMP:
• UMA – Uniform Memory Access
  • Main memory access takes the same time no matter which processor requests it and no matter which word is requested

• NUMA – Non-Uniform Memory Access
  • Access time depends on which processor requests which word
  • The programming challenges are harder
  • Can scale to larger sizes
  • Can have lower latency to nearby memory


Synchronization
• Processors must coordinate when sharing data
• A lock is one mechanism – only one processor accesses a shared data item at a time
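A minimal sketch of a lock in software, assuming POSIX threads (the shared counter, thread count, and iteration count are illustrative, not from the slides):

```c
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                 /* shared data item */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* only one thread inside at a time */
        shared_counter++;             /* the protected shared update */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t th[4];
    for (int i = 0; i < 4; i++) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(th[i], NULL);
    printf("%ld\n", shared_counter);  /* always 400000 with the lock held */
    return 0;
}
```

Without the lock, the four increments race and the final count is usually wrong; the lock serializes access to the shared variable.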

• Step 1 – divide the data into equal subsets; each processor sums its own subset


• Step 2 – reduction: combine the partial sums, divide-and-conquer style (see the sketch below)
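A minimal shared-memory sketch of both steps with POSIX threads: each thread sums an equal subset (Step 1), then a divide-and-conquer tree combines the partial sums (Step 2). The names (NPROC, partial, sum_worker) and sizes are illustrative assumptions:

```c
#include <pthread.h>
#include <stdio.h>

#define NPROC 4          /* "processors" (threads); a power of two for the tree */
#define N     100000     /* number of elements to sum; divisible by NPROC */

static double A[N];
static double partial[NPROC];      /* shared array of partial sums */
static pthread_barrier_t bar;

static void *sum_worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NPROC), hi = lo + (N / NPROC);

    /* Step 1: each thread sums its own equal subset. */
    double s = 0.0;
    for (long i = lo; i < hi; i++) s += A[i];
    partial[id] = s;

    /* Step 2: tree reduction; half the threads add pairs each round. */
    for (long half = NPROC / 2; half > 0; half /= 2) {
        pthread_barrier_wait(&bar);   /* wait for the previous round's writes */
        if (id < half) partial[id] += partial[id + half];
    }
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1.0;   /* known total: N */

    pthread_barrier_init(&bar, NULL, NPROC);
    pthread_t th[NPROC];
    for (long i = 0; i < NPROC; i++) pthread_create(&th[i], NULL, sum_worker, (void *)i);
    for (long i = 0; i < NPROC; i++) pthread_join(th[i], NULL);
    pthread_barrier_destroy(&bar);

    printf("sum = %.0f\n", partial[0]);        /* final sum lands in partial[0] */
    return 0;
}
```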

Clusters and Other Message-Passing Multiprocessors


• Each processor has its own private physical address space

• Processors communicate with message passing (sends and receives; acknowledgments are possible)

• Some applications run well on either shared or private address spaces

• Disadvantages
  • Cost of administering a cluster of n machines ≈ cost of administering n independent machines
  • Cost of administering a shared-memory multiprocessor with n processors ≈ cost of administering 1 machine

• Processors are interconnected using the I/O interconnect of each computer, rather than the faster memory interconnect

• Overhead of dividing the memory – n machines means n separate memories and n copies of the OS


• There are 100 subsets → send one subset to each machine

• Each computer finds the sum of its own subset


• Reduction – add the partial sums (see the sketch below)
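A minimal message-passing sketch of the same sum, written with MPI (the subset size and data values are illustrative assumptions; the slides do not prescribe MPI):

```c
#include <mpi.h>
#include <stdio.h>

#define N_PER_RANK 100   /* illustrative subset size per machine */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each machine holds its own private subset in its local memory. */
    double subset[N_PER_RANK];
    for (int i = 0; i < N_PER_RANK; i++) subset[i] = 1.0;

    /* Each computer finds the sum of its own subset... */
    double part = 0.0;
    for (int i = 0; i < N_PER_RANK; i++) part += subset[i];

    /* ...then a reduction adds the partial sums via messages. */
    double total = 0.0;
    MPI_Reduce(&part, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %.0f\n", total);
    MPI_Finalize();
    return 0;
}
```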

• Better availability
  • Much easier to disconnect a machine, reinstall, replace, …

• Whole computers and independent, scalable networks → easier to expand the system without bringing down the application running on top of the cluster

• Lower cost, high availability, improved power efficiency, and rapid, incremental expandability →
  • Clusters are attractive to service providers for the World Wide Web


Hardware Multithreading


• Multiple threads share the functional units of a single processor in an overlapping way
  • When one thread stalls, switch to another one quickly
  • Keep a copy of the state of each thread
  • Memory can be shared through virtual memory mechanisms, which already support multiprogramming

• 2 approaches
  - Fine-grained: interleave the threads, switching between them on every instruction
  - Coarse-grained: run individual threads until a costly stall, at the price of pipeline start-up overhead on each switch


Simultaneous Multithreading – SMT

• A variation on hardware multithreading
• Uses the resources of a multiple-issue, dynamically scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism

• Multiple instructions from independent threads can be issued without regard to the dependencies among them; register renaming + dynamic scheduling resolve the dependencies
• Execute instructions from multiple threads, leaving it to the hardware to associate instruction slots and renamed registers with their threads


SISD, MIMD, SIMD, SPMD, and Vector

A categorization of parallel hardware based on the number of instruction streams and the number of data streams.

SISD – a uniprocessor

MIMD – a multiprocessor
• Processors can run different programs, or
• 1 program whose behavior varies per processor through conditional statements
  • SPMD (Single Program Multiple Data)


SIMD – a single instruction applied to many data streams
• Vector and array processors
• 1 add instruction → send 64 data streams to 64 ALUs → 64 sums in 1 clock cycle
• All units are synchronized and share 1 PC
• Reduces the cost of the control unit over dozens of execution units
• Reduces the size of program memory – only 1 copy of the code
• Works best on arrays in for loops over identically structured data
• Data-level parallelism

SIMD in x86: Multimedia Extensions

• MMX and SSE instructions
• Improve the performance of multimedia programs
• The instructions allow the hardware to have many simultaneous ALUs, or to split a wide ALU into many simultaneous narrower ALUs
  • a 64-bit ALU = two 32-bit ALUs = four 16-bit ALUs = eight 8-bit ALUs
• Loads/stores are as wide as the widest ALU
• The width of the operation and of the registers is encoded in the opcode
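A minimal sketch of the split-ALU idea using SSE intrinsics in C: one 128-bit instruction acts as four simultaneous 32-bit float additions (the array values are illustrative):

```c
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);        /* wide load: 4 floats at once */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);     /* one instruction, four adds */
    _mm_storeu_ps(c, vc);               /* wide store of the 4 sums */

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);   /* 11 22 33 44 */
    return 0;
}
```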


Vector
• Pipelined ALUs
• Get data into vector registers, operate on the elements sequentially through the pipeline, store the results back to memory
• Vector registers
• One vector instruction is like an entire loop
• The hardware doesn't have to check for data hazards within the same vector
• Control hazards from loop branches are nonexistent
• The number of elements is kept in a separate register
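For contrast, here is the kind of loop a vector processor can execute as just a few vector instructions (vector load, multiply-add, vector store): the classic DAXPY kernel, shown in plain C as an illustrative sketch.

```c
#include <stdio.h>

/* y = a*x + y over n elements. On a vector machine, a block of
   iterations becomes one vector instruction, so there are no
   per-iteration branch (control) hazards and no data-hazard checks
   between elements of the same vector. */
void daxpy(long n, double a, const double *x, double *y) {
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double x[8] = {1, 1, 1, 1, 1, 1, 1, 1}, y[8] = {0};
    daxpy(8, 2.0, x, y);
    printf("y[0] = %g\n", y[0]);   /* prints 2 */
    return 0;
}
```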

Introduction to Graphics
Processing Units (GPU)


• Originally, processors were connected to graphics displays
  • Graphics consumed an increasing share of processing time
  • To improve it, controllers were added to accelerate 2D and 3D graphics
  • The game market grew rapidly
  • → Graphics Processing Units (GPUs)

• A GPU supplements a CPU – it does not need to perform all tasks
  • It may perform some tasks poorly or not at all
• Heterogeneous combination – CPU + GPU are not identical processors
• Programming interface – high-level application programming interfaces (APIs) + high-level graphics shading languages
  • OpenGL, DirectX
  • NVIDIA's C for Graphics (Cg), Microsoft's High Level Shader Language (HLSL)
• The work: drawing vertices of 3D geometry primitives such as lines, and shading or rendering pixel fragments
• Each vertex or pixel can be drawn/rendered independently
  • → Threads
• The data types are vertices, consisting of (x, y, z, w) coordinates, and pixels, consisting of (red, green, blue, alpha) color components


• The working set can be hundreds of megabytes and does not show the same temporal locality as the data in mainstream applications
• There is a great deal of data-level parallelism

• GPUs do not rely on multilevel caches to overcome memory latency
  • They rely on having enough threads in flight

• They rely on extensive parallelism for high performance – many parallel processors and many concurrent threads
  • Each GPU processor is highly multithreaded

• Main memory is oriented toward bandwidth, not latency

• Heterogeneous (CPU + GPU) rather than identical processors

• Historically SIMD instructions
  • Recently, the focus is on scalar instructions – to improve programmability and efficiency

• There was no support for double-precision floating-point arithmetic – it was not needed in graphics applications


General Purpose GPUs (GPGPUs)

• Use the GPU for general applications – for performance
• C-inspired programming languages make it possible to write directly for the GPUs
  • Brook – a streaming language for GPUs
  • NVIDIA's CUDA – write C programs that execute on GPUs, with some restrictions

• Also used for parallel programming in general

Introduction to Multiprocessor Network Topologies

Multicore chips → networks on chips to connect the cores


• Cost depends on
  • the number of switches
  • the number of links per switch
  • the width (number of bits) per link
  • the length of the links

• Performance
  • Throughput – maximum number of messages in a given time
  • Latency to send and receive messages
  • Contention
  • …
• Fault tolerance
• Power efficiency

• Links are bidirectional,…
• Each node is a processor-memory node

• Bus
  • Total BW = BW of the bus = 2 × BWlink
  • Bisection BW = BWlink


• Ring
  • Total BW = P × BWlink
  • Bisection BW = 2 × BWlink

• Fully Connected
  • Each processor has a bidirectional link to every other processor
  • Total BW = P × (P − 1)/2 × BWlink
  • Bisection BW = (P/2)² × BWlink
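A quick numeric check of the bandwidth formulas above, in units of one link's bandwidth (P = 64 is an arbitrary illustrative size):

```c
#include <stdio.h>

int main(void) {
    int P = 64;  /* number of processor-memory nodes (illustrative) */

    /* Ring: total BW = P links; a bisection cuts 2 links. */
    printf("ring:            total = %4d  bisection = %4d\n", P, 2);

    /* Fully connected: total BW = P*(P-1)/2 links; bisection = (P/2)^2. */
    printf("fully connected: total = %4d  bisection = %4d\n",
           P * (P - 1) / 2, (P / 2) * (P / 2));
    return 0;
}
```

For P = 64 this prints a total of 2016 links and a bisection of 1024 for the fully connected network, against 64 and 2 for the ring, which is why full connectivity is so expensive at scale.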


Fallacies and Pitfalls


• Do not forget to try the "Check Yourself" sections
  • Answers are given at the end of the chapter


Any questions/comments?

Thank you