High Performance Scientific Computing
Lecture 4
S. Gopalakrishnan
Memory Issues
Memory Hierarchy

A typical hierarchy runs from fastest and costliest per byte to slowest and cheapest:

  registers -> cache -> main memory (DRAM) -> disk (virtual memory)

Memory Latency Problem: the Processor-DRAM Performance Gap
(Motivation for the memory hierarchy)

[Figure: processor vs. DRAM performance, 1980-2000, log scale. µProc performance grows ~60%/yr (2x every 1.5 years); DRAM performance grows only ~5%/yr (2x every 15 years). The processor-memory performance gap therefore grows ~50% per year.]

! Notice that the data width changes between levels: 8 B (CPU-cache), 32 B (cache-main memory), 4 KB (main memory-disk pages)
! Bandwidth: transfer rate between the various levels
  • CPU-Cache: 24 GB/s
  • Cache-Main memory: 0.5-6.4 GB/s
  • Main memory-Disk: 187 MB/s (Serial ATA/1500)

Source: ECE232, UMass Amherst. Adapted from Computer Organization and Design, Patterson & Hennessy, UCB; Kundu, Koren, UMass.
Virtual Memory and Paging

[Figure: each process has its own virtual memory space, mapped page by page onto physical memory (RAM); pages belonging to another process's memory, or not currently resident, live on disk. Source: wikipedia.org]
Memory Hierarchy: Terminology

! Hit: the data appears in the upper level, in block X
! Hit Rate: the fraction of memory accesses found in the upper level
! Miss: the data must be retrieved from a block in the lower level (Block Y)
! Miss Rate = 1 - (Hit Rate)
! Hit Time: time to access the upper level, which consists of the time to determine hit/miss plus the upper-level access time
! Miss Penalty: time to replace a block in the upper level plus the time to deliver the block to the processor
! Note: Hit Time << Miss Penalty

[Figure: the processor exchanges Block X with the upper level; on a miss, Block Y is brought from the lower level into the upper level.]
Current Memory Hierarchy

Processor (control, datapath, registers) -> L1 cache (on chip) -> L2 cache -> main memory -> secondary storage (disk)

                 Regs      L1 Cache  L2 Cache  Main Memory  Disk
Speed (ns):      1         2         6         100          10,000,000
Size (MB):       0.0005    0.1       1-4       1000-6000    500,000
Cost ($/MB):     --        $10       $3        $0.01        $0.002
Technology:      Regs      SRAM      SRAM      DRAM         Disk

• Cache - main memory: bridges the speed gap
• Main memory - disk (virtual memory): provides capacity
Introduction to Parallel Programming

Shared-Memory Processing

Each processor can access the entire data space.

– Pros
  • Easier to program
  • Amenable to automatic parallelism
  • Can be used to run large-memory serial programs
– Cons
  • Expensive
  • Difficult to implement at the hardware level
  • Processor count limited by contention/coherency (currently around 512)
  • Watch out for the "NU" part of "NUMA"
Distributed-Memory Machines

! Each node in the computer has a locally addressable memory space
! The computers are connected together via some high-speed network
  – Infiniband, Myrinet, Giganet, etc.

• Pros
  – Really large machines: size limited only by gross physical considerations:
    • Room size
    • Cable lengths (tens of meters)
    • Power/cooling capacity
    • Money!
  – Cheaper to build and run
• Cons
  – Harder to program: data locality must be managed explicitly
MPPs (Massively Parallel Processors)

Distributed memory at the largest scale; often shared memory at the lower levels of the hierarchy.

• IBM BlueGene/L (LLNL)
  – 131,072 700 MHz processors
  – 256 MB of RAM per processor
  – Balanced compute speed with interconnect
• Red Storm (Sandia National Labs)
  – 12,960 dual-core 2.4 GHz Opterons
  – 4 GB of RAM per processor
  – Proprietary SeaStar interconnect
Comparison of CPU vs. GPU Architecture

The two reflect fundamentally different design philosophies: the CPU devotes much of its chip area to control logic and cache serving a few powerful ALUs, while the GPU devotes most of its area to a large grid of simple ALUs, with comparatively little control logic and cache.

[Figure: CPU (large Control and Cache blocks, few ALUs, DRAM) vs. GPU (many ALUs, DRAM). Source: Prof. Wen-mei W. Hwu, UIUC; © Wen-mei W. Hwu and David Kirk/NVIDIA.]
Parallelization: a GPU vs. CPU Analogy

It is more effective to deliver pizzas with many light-duty scooters than with one big truck; similarly, it is effective to use many lightweight GPU processors for parallel tasks rather than a few heavyweight CPU cores.
Performance Advantage of GPUs

• An enlarging peak performance advantage:
  – Calculation: ~1 TFLOPS (GPU) vs. ~100 GFLOPS (CPU); roughly 1 TFLOP on a desktop
  – Memory bandwidth: 100-150 GB/s (GPU) vs. 32-64 GB/s (CPU)
• GPU in every PC and workstation: massive volume and potential

Courtesy: John Owens. Source: top500.org
Compute Unified Device Architecture (CUDA)

• CUDA is a set of APIs (application program interfaces) for using GPUs for general-purpose computing
• Developed and released by NVIDIA; works only on NVIDIA GPU hardware
• Works on commercial GPUs as well as on GPUs specialized for scientific computing (Tesla)
• The CUDA compiler supports the C programming language; extensions to Fortran are possible
• An open-source alternative is OpenCL
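To give a flavor of CUDA's C extensions, here is a minimal vector-add sketch (a standard introductory example, not from the lecture; the array size and launch configuration are arbitrary choices, and it requires an NVIDIA GPU and the nvcc compiler):

```cuda
#include <stdio.h>

/* __global__ marks a kernel: code compiled for the GPU and launched from
   the host. Each of the many lightweight GPU threads handles one element. */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;

    /* Unified (managed) memory is accessible from both CPU and GPU. */
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Launch enough 256-thread blocks to cover all n elements. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();  /* wait for the GPU before reading results */

    printf("c[0] = %.1f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `<<<blocks, threads>>>` launch syntax is the main C extension: it maps the loop over elements onto the GPU's many ALUs, matching the scooters-vs-truck analogy above.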