Parallel Computing Platforms
Chieh-Sen (Jason) Huang
Department of Applied Mathematics
National Sun Yat-sen University
Thanks to Ananth Grama, Anshul Gupta, George Karypis, and Vipin
Kumar for providing slides.
Topic Overview
• Implicit Parallelism: Trends in Microprocessor Architectures
• Limitations of Memory System Performance
Scope of Parallelism
• Conventional architectures coarsely comprise a processor, a
memory system, and a datapath.
• Each of these components presents significant performance
bottlenecks.
• Parallelism addresses each of these components in significant
ways.
• Different applications utilize different aspects of parallelism
– e.g., data-intensive applications utilize high aggregate
throughput, server applications utilize high aggregate network
bandwidth, and scientific applications typically utilize high
processing and memory system performance.
• It is important to understand each of these performance
bottlenecks.
Implicit Parallelism: Trends in Microprocessor
Architectures
• Microprocessor clock speeds have posted impressive gains
over the past two decades (two to three orders of magnitude).
• Higher levels of device integration have made available a
large number of transistors.
• The question of how best to utilize these resources is an
important one.
• Current processors use these resources in multiple functional
units and execute multiple instructions in the same cycle.
• The precise manner in which these instructions are selected
and executed provides impressive diversity in architectures.
• We shall not discuss these topics any further here. (For details,
see Grama's slides.)
Limitations of Memory System Performance
• The memory system, and not processor speed, is often the
bottleneck for many applications.
• Memory system performance is largely captured by two
parameters, latency and bandwidth.
• Latency is the time from the issue of a memory request to the
time the data is available at the processor.
• Bandwidth is the rate at which data can be pumped to the
processor by the memory system.
Memory System Performance: Bandwidth and
Latency
• It is very important to understand the difference between
latency and bandwidth.
• Consider the example of a fire-hose. If the water comes out
of the hose two seconds after the hydrant is turned on, the
latency of the system is two seconds.
• Once the water starts flowing, if the hydrant delivers water at
the rate of 5 gallons/second, the bandwidth of the system is 5
gallons/second.
• If you want immediate response from the hydrant, it is important
to reduce latency.
• If you want to fight big fires, you want high bandwidth.
Memory Latency: An Example
Consider a processor operating at 1 GHz (1 ns clock)
connected to a DRAM with a latency of 100 ns (no caches).
Assume that the processor has two multiply-add units and is
capable of executing four instructions in each cycle of 1 ns. The
following observations follow:
• The peak processor rating is 4 GFLOPS (two multiply-add units ×
2 FLOPs per unit per cycle × 1 GHz).
• Since the memory latency is equal to 100 cycles and block
size is one word, every time a memory request is made, the
processor must wait 100 cycles before it can process the data.
Memory Latency: An Example
On the above architecture, consider the problem of
computing a dot-product of two vectors.
• A dot-product computation performs one multiply-add on
a single pair of vector elements, i.e., each floating point
operation requires one data fetch.
• It follows that the peak speed of this computation is limited to
one floating point operation every 100 ns, or a speed of 10
MFLOPS, a very small fraction of the peak processor rating!
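To make this concrete, here is a minimal sketch of the dot-product
loop being analyzed (the function name and arguments are
illustrative, not from the original slides):

// Each multiply-add (2 FLOPs) consumes one element of a and one of b,
// i.e., one data fetch per FLOP. With a 100 ns DRAM latency and no
// cache, every iteration stalls on memory, not on the ALUs.
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}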
Improving Effective Memory Latency Using Caches
• Caches are small and fast memory elements between the
processor and DRAM.
• This memory acts as low-latency, high-bandwidth storage.
• If a piece of data is repeatedly used, the effective latency of
this memory system can be reduced by the cache.
• The fraction of data references satisfied by the cache is called
the cache hit ratio of the computation on the system.
• The cache hit ratio achieved by a code on a memory system often
determines its performance.
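As a back-of-the-envelope illustration (the 90% hit ratio is an
assumed figure): with a 1 ns cache and a 100 ns DRAM, the effective
access latency is approximately 0.9 × 1 ns + 0.1 × 100 ns = 10.9 ns,
roughly a tenfold improvement over going to DRAM on every access.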
Impact of Caches: Example
Consider the architecture from the previous example. In this
case, we introduce a cache of size 32 KB with a latency of 1 ns or
one cycle. We use this setup to multiply two matrices A and B of
dimensions 32 × 32. We have carefully chosen these numbers so
that the cache is large enough to store matrices A and B, as well
as the result matrix C.
Impact of Caches: Example (continued)
The following observations can be made about the problem:
• Fetching the two matrices into the cache corresponds to
fetching 2K words, which takes approximately 200 µs (2000 ×
100 ns).
• Multiplying two n × n matrices takes 2n³ operations. For our
problem, this corresponds to 64K operations, which can be
performed in 16K cycles (or 16 µs) at four instructions per cycle.
• The total time for the computation is therefore approximately
the sum of the time for load/store operations and the time for the
computation itself, i.e., 200 + 16 = 216 µs.
• This corresponds to a peak computation rate of 64K FLOPs / 216 µs,
or 303 MFLOPS.
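For reference, a minimal sketch of the computation being analyzed
(array names mirror the slides; assuming 8-byte words, A, B, and C
together occupy 24 KB and fit in the 32 KB cache):

const int N = 32;
double A[N][N], B[N][N], C[N][N];

// Classic triple loop: 2*N^3 = 64K floating point operations.
// Once A and B are cache-resident, every operand access is a
// one-cycle cache hit.
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }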
Impact of Caches
• In our example, we had O(n²) data accesses and O(n³)
computation. This asymptotic difference makes the above
example particularly desirable for caches.
• Repeated references to the same data item correspond to
temporal locality.
• Aggressive caching exploits two forms of locality:
1. Spatial locality: access data in blocks.
2. Temporal locality: reuse data that is already loaded.
Impact of Caches
• Loop unrolling
/* Four independent partial sums keep both multiply-add units busy
   (assumes n is a multiple of 4). */
for (i = 0; i < n; i += 4) {
    sum1 = sum1 + a[i]   * b[i];
    sum2 = sum2 + a[i+1] * b[i+1];
    sum3 = sum3 + a[i+2] * b[i+2];
    sum4 = sum4 + a[i+3] * b[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
• Blocked matrix multiplication
[ A1 A2 ] [ B1 B2 ]   [ A1·B1 + A2·B3   A1·B2 + A2·B4 ]
[ A3 A4 ] [ B3 B4 ] = [ A3·B1 + A4·B3   A3·B2 + A4·B4 ]
• Homework: Compute the FLOPS of the loop unrolling and
blocked matrix multiplication examples.
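For the blocked case, here is a minimal sketch under stated
assumptions (row-major storage, n a multiple of the tile size BS,
and C zero-initialized on entry; BS = 32 is just an illustrative
choice, to be tuned to the cache size):

// Blocked (tiled) matrix multiply, C += A * B, all matrices n x n.
// Each pass works on BS x BS tiles small enough to stay cache-resident,
// so every element fetched into the cache is reused BS times.
void matmul_blocked(const double *A, const double *B, double *C, int n) {
    const int BS = 32;
    for (int bi = 0; bi < n; bi += BS)
        for (int bj = 0; bj < n; bj += BS)
            for (int bk = 0; bk < n; bk += BS)
                for (int i = bi; i < bi + BS; i++)
                    for (int j = bj; j < bj + BS; j++) {
                        double sum = C[i*n + j];
                        for (int k = bk; k < bk + BS; k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = sum;
                    }
}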
Impact of Memory Bandwidth
[Figure: the architecture of a GPU.]
• Memory bandwidth is determined by the bandwidth of the
memory bus as well as the memory units.
• Memory bandwidth can be improved by increasing the size of
memory blocks.
Impact of Memory Bandwidth: Example
Consider the same setup as before, except in this case, the
block size is 4 words instead of 1 word. We repeat the dot-product
computation in this scenario:
• Assuming that the vectors are laid out linearly in memory, eight
FLOPs (four multiply-adds) can be performed in 200 cycles.
• This is because a single memory access fetches four
consecutive words in the vector.
• Therefore, two accesses can fetch four elements of each of
the vectors. This corresponds to a FLOP every 25 ns, for a peak
speed of 40 MFLOPS.
Impact of Memory Bandwidth
• It is important to note that increasing the block size does not
change the latency of the system.
• Physically, the scenario illustrated here can be viewed as a
wide data bus (4 words or 128 bits) connected to multiple
memory banks.
• In practice, such wide buses are expensive to construct.
• In a more practical system, consecutive words are sent on the
memory bus on subsequent bus cycles after the first word is
retrieved.
Impact of Memory Bandwidth
• The above examples clearly illustrate how increased bandwidth
results in higher peak computation rates.
• The data layouts were assumed to be such that consecutive
data words in memory were used by successive instructions
(spatial locality of reference).
• If we take a data-layout centric view, computations must be
reordered to enhance spatial locality of reference.
Impact of Memory Bandwidth: Example
Consider the following code fragment:
for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (i = 0; i < 1000; i++)
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
The code fragment sums columns of the matrix b into a vector
column_sum.
#include <iostream>
using namespace std;

int main() {
    int a[2][3];    // row-major: elements of each row are contiguous
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 3; j++)
            cout << '\t' << &a[i][j];
        cout << endl;
    }
}
-----------------------------------------------------------
0x7fff9987c700  0x7fff9987c704  0x7fff9987c708
0x7fff9987c70c  0x7fff9987c710  0x7fff9987c714
Impact of Memory Bandwidth: Example
• The vector column_sum is small and easily fits into the cache.
• The matrix b is accessed in column order.
• The strided access results in very poor performance.
[Figure: Multiplying a matrix with a vector: (a) multiplying
column-by-column, keeping a running sum; (b) computing each
element of the result as a dot product of a row of the matrix with
the vector.]
Impact of Memory Bandwidth: Example
We can fix the above code as follows:
for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
In this case, the matrix is traversed in row order, and
performance can be expected to be significantly better.
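One way to observe this effect directly is a small timing harness
(a sketch using std::chrono; the 1000 × 1000 size matches the
slides, and absolute numbers will vary by machine):

#include <chrono>
#include <iostream>

const int N = 1000;
double b[N][N], column_sum[N];

int main() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = 1.0;

    // Column-order traversal: stride-N access, poor spatial locality.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i++) {
        column_sum[i] = 0.0;
        for (int j = 0; j < N; j++)
            column_sum[i] += b[j][i];
    }
    auto t1 = std::chrono::steady_clock::now();

    // Row-order traversal: consecutive words, good spatial locality.
    for (int i = 0; i < N; i++)
        column_sum[i] = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            column_sum[i] += b[j][i];
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "column order: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms, row order: "
              << std::chrono::duration<double, std::milli>(t2 - t1).count()
              << " ms" << std::endl;
}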
Memory System Performance: Summary
The series of examples presented in this section illustrate the
following concepts:
• Exploiting spatial and temporal locality in applications is
critical for amortizing memory latency and increasing effective
memory bandwidth.
• The ratio of the number of operations to the number of memory
accesses is a good indicator of anticipated tolerance to
memory bandwidth (see the worked comparison below).
• Memory layout and appropriate organization of computation
can make a significant impact on spatial and temporal
locality.
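As a worked illustration of the second point: the dot product
performs 2n FLOPs on 2n input words, a ratio of 1, whereas n × n
matrix multiplication performs 2n³ FLOPs on roughly 3n² words, a
ratio of about 2n/3. This asymptotic gap is precisely why the
matrix-multiplication example benefited so dramatically from the
cache while the dot product did not.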