Neural Network Accelerators: CS223 Computer Architecture & Organization

The document discusses the architecture and organization of neural network accelerators, focusing on Tiled Chip Many-Core Processors (TCMP) and various dataflow architectures. It highlights the challenges posed by the Von Neumann bottleneck and the memory wall, emphasizing the need for specialized hardware to efficiently process deep neural networks. It also outlines future directions for computer architecture, advocating for data-centric designs and the importance of skilled professionals in the field.

CS223 Computer Architecture & Organization

Neural Network Accelerators

John Jose
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati
Tiled Chip Many-Core Processor (TCMP)

• Figure: a 4 x 4 grid of tiles, each pairing a processing element (PE) with a router (R) and an L2 cache bank; inset: a PE's core pipeline with fetch, branch prediction, register file, L1 I-cache and I-TLB, decode, issue, out-of-order scheduler, execution units, control logic, L1 D-cache and D-TLB, and load/store unit. PE: Processing Element; R: Router.
Input Buffered NoC Router

• Figure: microarchitecture of an input-buffered NoC router — five input ports (from East, West, North, South, and the local PE), each with virtual-channel buffers (VC 0-2) and a VC identifier; control logic comprising a routing unit (RC), VC allocator (VA), and switch allocator (SA); and a 5 x 5 crossbar driving the five output ports (to East, West, North, South, and the local PE). The PE attached to a router may be a core, an L2 bank, a memory controller, etc.
AI In Daily Life and AI on Hardware
Von Neumann Bottleneck
❖ Simple fetch cycle
❖ Imagine: MOV R2, [1000]
❖ Problem: doing this billions of times!
❖ Different speeds of CPU and memory

• Figure: the CPU-memory bottleneck (image: https://www.pinterest.com/pin/620019073694949552/)
Memory Wall

❖ Different speeds of CPU and memory
❖ Fetching anything is slow
❖ Doing it over and over is a nightmare!

Computer Architecture: A Quantitative Approach, 5th Ed.; https://developer20.com/memory-wall-problem/
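One way to feel the memory wall is to compare cache-friendly and cache-hostile access patterns over the same array. A minimal sketch (the array size and the random-gather pattern are arbitrary illustrative choices; absolute timings vary by machine):

```python
import time
import numpy as np

N = 1 << 24                      # 16M elements (~128 MB of float64)
data = np.zeros(N)

# Sequential pass: caches and the hardware prefetcher hide memory latency.
start = time.perf_counter()
s = data.sum()
seq = time.perf_counter() - start

# Random-order pass: most accesses miss in cache and pay full DRAM
# latency, so the CPU mostly waits on memory.
idx = np.random.permutation(N)
start = time.perf_counter()
s = data[idx].sum()
rnd = time.perf_counter() - start

print(f"sequential: {seq:.3f}s  random: {rnd:.3f}s  slowdown: {rnd/seq:.1f}x")
```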
Turing Tariff
❖ The cost of performing functions on general-purpose hardware
❖ GP hardware can perform any function
❖ But not necessarily efficiently

https://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture/Lectures/pdfs/
Neural Networks

• Figure: a biological neuron

Neural Networks

• Figures: an artificial neuron; a neural network

Neural Networks

• Figures: an artificial neuron as a multiply-accumulate (MAC) unit; a layer as a matrix-vector product

https://medium.com/@DannyDenenberg/linear-algebra-for-deep-learning-3a4f38a82ba7
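The figures' point can be stated in a few lines of code: a layer of artificial neurons is a matrix-vector product (a bank of MAC operations) followed by a nonlinearity. A minimal NumPy sketch (the sizes and the ReLU choice are illustrative):

```python
import numpy as np

def layer(x, W, b):
    # Each output neuron j computes a MAC: sum_i W[j, i] * x[i] + b[j].
    # Stacking all neurons gives one matrix-vector product.
    return np.maximum(W @ x + b, 0.0)   # ReLU activation

rng = np.random.default_rng(0)
x = rng.standard_normal(4)              # 4 inputs
W = rng.standard_normal((3, 4))         # 3 neurons, 4 weights each
b = np.zeros(3)
print(layer(x, W, b))                   # 3 activations
```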
Deep Neural Network

[H. Lee et al., Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks, Communications of the ACM, 2011]
Deep Neural Network
❖ Two phases: training and inference
❖ Training: determine the weights and biases
❖ Inference: apply the weights to determine the output
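As a toy illustration of the two phases, the sketch below trains a single linear neuron by gradient descent and then runs inference with the frozen weights (the data, loss, and learning rate are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))        # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])       # "true" relation to recover

# Training phase: iteratively adjust weights to reduce the loss.
w = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
    w -= 0.1 * grad                          # gradient-descent step

# Inference phase: weights are frozen; just apply them to new input.
x_new = np.array([0.5, 1.0, -1.0])
print("learned w:", w.round(3), " prediction:", x_new @ w)
```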
Neural Network Training

• Training
Neural Network Inference

• Inference
Tiled Chip Many-Core Processor (TCMP)
• Figure (repeated from earlier): the 4 x 4 TCMP tile grid — PE: Processing Element; R: Router

Most of the system is dedicated to storing and moving data


Deep Neural Network
Convolution

Convolutions account for more than 90% of the overall computation in a typical DNN and dominate its runtime and energy consumption.
• Figures (sequence): a kernel sliding over the input feature map, one multiply-accumulate window per output element
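A direct (unoptimized) 2D convolution makes the cost structure explicit: a deep loop nest of MACs, which is why convolutional layers dominate runtime. A minimal single-channel sketch (shapes are illustrative; real layers add channels, strides, and padding):

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D convolution of input x (H x W) with kernel k (R x S)."""
    H, W = x.shape
    R, S = k.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # One output element = R*S multiply-accumulates.
            out[i, j] = np.sum(x[i:i+R, j:j+S] * k)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3)) / 9.0                 # 3x3 averaging kernel
print(conv2d(x, k))                       # 3x3 output map
```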
Neural Network Accelerators

• Specialized Hardware • Neural Networks • NN Accelerators


Neural Network Dataflow Accelerators
❖ Dataflow Architectures
❖ Temporal Architectures
❖ Spatial Architectures
Dataflow Architectures
❖ Temporal Architectures
❖ CPUs and GPUs
❖ Centralized control and memory
❖ ALUs cannot communicate with each other directly
❖ Spatial Architectures
❖ FPGAs and ASICs
❖ Distributed control and memory
❖ ALUs / PEs can communicate with each other directly

Vivienne Sze et al., Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proceedings of the IEEE, 2017
Dataflow Architectures
❖ Weight Stationary
❖ Weights are kept in the PE
❖ Input pixels and partial sums move
❖ Input Stationary
❖ Inputs are kept in the PE
❖ Weights and partial sums move
❖ Output Stationary
❖ Partial sums are kept in the PE
❖ Input pixels and weights move
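This taxonomy is easiest to see as loop orderings: whatever the inner loop reuses from PE-local storage is what stays "stationary". A schematic 1D-convolution sketch (the names and sizes are illustrative, not taken from any specific accelerator):

```python
import numpy as np

x = np.arange(8.0)                 # inputs
w = np.array([1.0, 2.0, 3.0])      # weights
out_ws = np.zeros(6)
out_os = np.zeros(6)

# Weight stationary: each weight w[s] is held in a PE while inputs
# and partial sums stream past it.
for s in range(3):                 # outer loop: pin one weight
    for i in range(6):             # stream inputs / partial sums
        out_ws[i] += w[s] * x[i + s]

# Output stationary: each partial sum out[i] is held in a PE while
# inputs and weights stream past it.
for i in range(6):                 # outer loop: pin one output
    for s in range(3):             # stream weights / inputs
        out_os[i] += w[s] * x[i + s]

assert np.allclose(out_ws, out_os)  # same math, different data movement
```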
Eyeriss

• Eyeriss Accelerator

Yu-Hsin Chen et al., Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, IEEE Journal of Solid-State Circuits, 2017
SIMBA

• Chiplet architecture of SIMBA

Shao et al., Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture, IEEE MICRO, 2019
Near- / In-Memory Methods
❖ Moving data costs more than the ALU operation on it
❖ Bring data close to compute
❖ Physically closer – Near-Memory
❖ Hybrid Memory Cube
❖ Same place – In-Memory
❖ Resistive Crossbars

• Data Movement Cost Comparison

Vivienne Sze et al., Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proceedings of the IEEE, 2017
In-Memory Computing
❖ Storage and processing in the same place
❖ Data movement is eliminated
❖ Saves time and energy
❖ Requires dedicated circuits

❖ In the context of neural networks:
❖ The MAC is performed where the data / weights are stored

• Memristor crossbar for matrix-vector multiplication
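In a memristor crossbar, weights are programmed as device conductances G, input activations are applied as row voltages V, and Ohm's law plus Kirchhoff's current law deliver the matrix-vector product as column currents I = GᵀV in a single analog step. A minimal numerical model of that idea (the conductance values and sizes are illustrative):

```python
import numpy as np

# Weights stored as conductances: G[i, j] sits at the crossing of
# input row i and output column j.
G = np.array([[0.1, 0.4],
              [0.3, 0.2],
              [0.5, 0.1]])         # 3 inputs x 2 outputs (siemens, illustrative)

V = np.array([1.0, 0.5, -0.2])    # input activations applied as row voltages

# Ohm's law per device (I = V * G) and Kirchhoff's current law per
# column wire sum the products "for free": one analog step per column.
I = G.T @ V                        # column currents = matrix-vector product
print(I)                           # an ADC would digitize these currents
```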
In-Memory Computing

• Artificial Neuron • Equivalent Circuit

In-Memory Computing

Effectively a weight-stationary dataflow! (The weights never move: they stay in the crossbar as conductances while inputs and partial sums stream through.)

• Layer of Neural Network • Equivalent Circuit
SIAM

• Chiplet architecture of SIAM

Krishnan et al., SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks, ACM Transactions on Embedded Computing Systems, 2021
XRBench

❖ Examples of XR workloads
❖ Multi-Task Multi-Model (MTMM) execution
❖ Real-time execution
❖ QoS requirements pose a challenge

Kwon et al., XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse, Proceedings of Machine Learning and Systems, 2023
Classification of MTMM

❖ Cascade / Cas-MTMM
❖ One model executes after another
❖ The output of one model is the input to the next
❖ Concurrent / Con-MTMM
❖ Models execute in parallel
❖ Their inputs may be the same or different
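The two classes map directly onto sequential chaining versus parallel dispatch. A minimal scheduling sketch (the model functions and the thread-pool choice are placeholders, not taken from XRBench):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder "models": each is just a function from input to output.
def detector(frame):   return f"boxes({frame})"
def tracker(boxes):    return f"tracks({boxes})"
def segmenter(frame):  return f"mask({frame})"

frame = "frame0"

# Cascade (Cas-MTMM): one model after another; each output feeds the next.
tracks = tracker(detector(frame))

# Concurrent (Con-MTMM): independent models run in parallel on the
# same (or different) inputs.
with ThreadPoolExecutor() as pool:
    f_det = pool.submit(detector, frame)
    f_seg = pool.submit(segmenter, frame)
    results = (f_det.result(), f_seg.result())

print(tracks, results)
```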
HASP

Li et al., HASP: Hierarchical Asynchronous Parallelism for Multi-NN Tasks, IEEE Transactions on Computers, 2024

Big-Little Chiplets

Krishnan et al., Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture, IEEE ICCAD, 2022
Handle Data Well
❖ Ensure data does not overwhelm the components
❖ via intelligent algorithms
❖ via intelligent architectures
❖ via whole-system designs: algorithm-architecture-devices
Data-Centric Architectures
❖ Process data where it resides
❖ Processing in and near memory structures
❖ Low-latency & low-energy data access
❖ Low latency memory
❖ Low energy memory
❖ Low-cost data storage & processing
❖ High capacity memory at low cost: hybrid memory, compression
❖ Intelligent data management
❖ Intelligent controllers handling robustness, security, cost, scaling
The Way Forward
❖ Data-centric system design & intelligence spread around
❖ Do not center everything around traditional computation units
❖ Better cooperation across layers of the system
❖ Careful co-design of components and layers: system/arch/device
❖ Better, richer, more expressive and flexible interfaces
❖ Better-than-worst-case design
❖ Do not optimize for the worst case; optimize for the common case
❖ Heterogeneity in design (specialization, asymmetry)
❖ Enables a more efficient design (No one size fits all)
How to explore computer architecture?
❖ Refer to IEEE/ACM transactions & journals
❖ IEEE TCAD, IEEE TVLSI, IEEE TOC
❖ ACM TODAES, ACM TECS, ACM TACO
❖ JPDC, JSC, JSA, CAL, ESL
❖ Refer to top-tier conferences
❖ ISCA, HPCA, MICRO, ASPLOS, PACT, DATE, DAC, ICCAD
❖ ICCD, ISVLSI, ASPDAC, VLSI-SoC, GLSVLSI, NOCS, NoCArc
❖ HiPC, VLSID, VDAT, ISED
How to explore computer architecture?
❖ Familiarize yourself with open-source architectural simulators
❖ gem5, Multi2Sim, Sniper, Tejas
❖ BookSim, DRAMSim, USIMM, GPGPU-Sim
❖ CACTI, Orion
❖ Model the architecture in simulators, implement it in HDLs, verify the sub-modules on an FPGA kit, and explore further …
Summary
❖ Multicore processors and on-chip clouds are going to become an integral part of future digital technologies.

❖ Understanding the hardware of such systems will help us design with conceptual clarity.

❖ Our country needs good computer architects and processor design engineers with hands-on exposure to the VLSI design flow to cater to the growing demand for skilled personnel in this domain.
[email protected]
http://www.iitg.ac.in/johnjose/
