CUDA
A technology that can make supercomputers personal
Presented by
Kunal Garg
2507276
UIET KU
Kurukshetra, India
SUPERCOMPUTER
A supercomputer is a computer that is at the
frontline of current processing capacity, particularly
speed of calculation.
Supercomputers are used for highly calculation-
intensive tasks.
GPU
A graphics processing unit, or GPU
(also called a visual processing unit,
VPU), is a specialized processor that
offloads 3D or 2D graphics rendering
from the microprocessor.
Used in embedded systems,
mobile phones, personal computers,
workstations, and game consoles
GPU Computing
The excellent floating-point
performance of GPUs led to
the advent of General-Purpose
Computing on GPUs (GPGPU)
GPU computing is the use
of a GPU to do general
purpose scientific and
engineering computing
The model for GPU
computing is to use a CPU
and GPU together in a
heterogeneous computing
model.
Problems in
GPU Programming
Required graphics-oriented languages and APIs
Difficult for users to program general applications for the GPU
CUDA
CUDA is
an acronym for Compute Unified Device Architecture
a parallel computing architecture
the parallel computing engine in NVIDIA GPUs
CUDA
CUDA works with industry-standard C
Write a program for one thread
Instantiate it on many parallel threads
Familiar programming model and language
CUDA is a scalable parallel programming model
Program runs on any number of processors
without recompiling
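The "write a program for one thread" idea can be sketched with a minimal kernel (the `saxpy` name and launch parameters here are illustrative, not from the slides):

```cuda
// Each thread runs the same scalar code; the launch below
// instantiates it on many parallel threads at once.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)                                      // guard extra threads
        y[i] = a * x[i] + y[i];
}

// Launch: <<<blocks, threadsPerBlock>>> decides how many copies run.
// The same binary scales to any number of processors:
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```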
Advantages of CUDA
CUDA has the following advantages over
traditional GPGPU using graphics APIs:
Scattered reads
Shared memory
Faster downloads and readbacks to and from the GPU
Full support for integer and bitwise operations
CUDA Programming Model
Parallel code (kernel) is launched and executed
on a device by many threads
Threads are grouped into thread blocks
Parallel code is written for a thread
Each thread is free to execute a unique code path
Built-in thread and block ID variables
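The built-in ID variables and per-thread code paths above can be illustrated with a small sketch (the `whoAmI` kernel is a hypothetical example):

```cuda
// Built-in variables available inside every kernel:
//   threadIdx - thread index within its block
//   blockIdx  - block index within the grid
//   blockDim  - threads per block
//   gridDim   - blocks in the grid
__global__ void whoAmI(int *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread is free to follow its own code path:
    if (threadIdx.x == 0)
        out[gid] = -1;      // block leaders take one branch
    else
        out[gid] = gid;     // all other threads take another
}
```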
CUDA Architecture
The CUDA Architecture
Consists of several
components
Parallel compute engines
OS kernel-level support
User-mode driver
ISA
Tesla 10 Series
CUDA Computing with Tesla T10
240 SP processors at 1.45 GHz: 1 TFLOPS peak
30 DP processors at 1.44 GHz: 86 GFLOPS peak
128 threads per processor: 30,720 threads total
Thread Hierarchy
Threads launched for a parallel section
are partitioned into
thread blocks
Grid = all blocks for a given
launch
Thread block is a group of
threads that can
Synchronize their execution
Communicate via shared
memory
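Block-level cooperation through shared memory and barrier synchronization can be sketched as follows (the `blockSum` kernel and the fixed block size of 256 are illustrative assumptions):

```cuda
// Threads in one block cooperate via shared memory and
// __syncthreads(); threads in different blocks cannot.
__global__ void blockSum(const float *in, float *blockResults)
{
    __shared__ float buf[256];              // visible to the whole block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                        // wait for all loads

    // Tree reduction: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();                    // barrier after each step
    }
    if (tid == 0)
        blockResults[blockIdx.x] = buf[0];  // one partial sum per block
}
```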
Execution Model
Warps and Half Warps
Threads execute in lockstep groups of 32 called warps;
a half warp is the first or second 16 threads of a warp
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)
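A typical host-side allocation and transfer flow, combining the calls above with `cudaMemcpy` (error checking omitted for brevity):

```cuda
// Host (CPU) code managing device (GPU) memory.
int n = 1024;
size_t nbytes = n * sizeof(float);
float *h_a = (float *)malloc(nbytes);   // host buffer
float *d_a = NULL;                      // device buffer

cudaMalloc((void **)&d_a, nbytes);      // allocate on the GPU
cudaMemset(d_a, 0, nbytes);             // zero the device buffer
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);  // upload
// ... launch kernels that read/write d_a ...
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);  // readback
cudaFree(d_a);                          // release device memory
free(h_a);
```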
Next Generation CUDA Architecture
The next-generation CUDA architecture, codenamed
Fermi, is the most advanced GPU architecture ever
built. Its features include:
• 512 CUDA cores
• 3.2 billion transistors
• NVIDIA Parallel DataCache Technology
• NVIDIA GigaThread Engine
• ECC support
Applications
Accelerated rendering of 3D graphics
Video Forensics
Molecular Dynamics
Computational Chemistry
Life Sciences
Bioinformatics
Electrodynamics
Medical Imaging
Oil and gas
Weather and Ocean Modeling
Electronic Design Automation
Video Imaging
Video Acceleration
Why should I use a GPU as a Processor?
Compared to the latest quad-core CPU, Tesla 20-series
GPU computing processors deliver equivalent
performance at 1/20th the power consumption and
1/10th the cost
When a computational fluid dynamics problem is
solved, it takes
9 minutes on a Tesla S870 (4 GPUs)
12 hours on one 2.5 GHz CPU core
Double Precision Performance
Intel Core i7 980XE (CPU): 107.6 GFLOPS
AMD Hemlock 5970 (GPU): 928 GFLOPS
NVIDIA Tesla S2050 & S2070 (GPU): 2.1 to 2.5 TFLOPS
Tesla C1060 (GPU): 933 GFLOPS
GeForce 8800 GTX (GPU): 346 GFLOPS
Core 2 Duo E6600 (CPU): 38 GFLOPS
Athlon 64 X2 4600+ (CPU): 19 GFLOPS
After all, it’s your personal supercomputer