CS223 Computer Architecture & Organization
Neural Network Accelerators
John Jose
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati
Tiled Chip Many-Core Processor (TCMP)
[Figure: a 4 x 4 mesh of tiles, each pairing a Processing Element (PE) with a Router (R) and an L2 cache bank; an inset expands one PE's pipeline - Fetch, Branch Prediction, L1 I-Cache and TLB, Register File, Decode, Issue & OoO Scheduler, Execution Unit, Control Logic, L1 D-Cache and TLB, Load/Store. Legend: PE = Processing Element, R = Router.]
Input Buffered NoC Router
[Figure: microarchitecture of one mesh router - five input ports (From East, West, North, South, and the local PE), each with virtual-channel buffers (VC 0 - VC 2) and a VC identifier; control logic comprising the Routing Unit (RC), VC Allocator (VA), and Switch Allocator (SA); and a 5 x 5 crossbar driving the five output ports (To East, West, North, South, and PE). PE = Processing Element (cores, L2 banks, memory controllers, etc.).]
AI In Daily Life and AI on Hardware
Von Neumann Bottleneck
❖ Simple fetch cycle
❖ Imagine: MOV R2, [1000]
❖ Problem: doing this billions of times!
❖ Different speeds of CPU and memory create the bottleneck
https://www.pinterest.com/pin/620019073694949552/
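To make the cost concrete, here is a minimal Python sketch of the fetch cycle, assuming illustrative latencies (a ~2 GHz core and a ~100 ns memory access; both numbers are placeholders, not measurements):

```python
# Minimal sketch of the von Neumann bottleneck (illustrative numbers only).
CPU_CYCLE_NS = 0.5         # assumed: one execute cycle on a ~2 GHz core
MEMORY_LATENCY_NS = 100.0  # assumed: one trip over the bus to DRAM

def total_time_ns(num_instructions: int) -> float:
    """Time if every instruction (e.g., MOV R2, [1000]) is fetched from memory."""
    per_instruction = MEMORY_LATENCY_NS + CPU_CYCLE_NS
    return num_instructions * per_instruction

# Doing this billions of times: the core sits idle ~99.5% of the time.
print(total_time_ns(1_000_000_000) / 1e9, "seconds for a billion instructions")
print(MEMORY_LATENCY_NS / (MEMORY_LATENCY_NS + CPU_CYCLE_NS), "fraction spent waiting")
```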
Memory Wall
❖ Different Speeds of CPU and Memory
❖ Fetching anything is slow
❖ Doing it over and over is a nightmare!
Computer Architecture: A Quantitative Approach, 5th Ed. https://developer20.com/memory-wall-problem/
Turing Tariff
❖ Cost of performing functions using general-purpose hardware
❖ General-purpose hardware can perform any function
❖ But not necessarily efficiently
https://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture/Lectures/pdfs/
Neural Networks
[Figures: a biological neuron, the artificial neuron modeled on it, and networks built from artificial neurons]
• Artificial neuron: computes a MAC; a layer computes a matrix-vector product
https://medium.com/@DannyDenenberg/linear-algebra-for-deep-learning-3a4f38a82ba7
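As a hedged sketch of that equivalence (sizes and values are made up): a single artificial neuron reduces to a MAC loop, and a full layer reduces to one matrix-vector product.

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: a multiply-accumulate (MAC) plus bias, then ReLU."""
    acc = 0.0
    for xi, wi in zip(x, w):
        acc += xi * wi             # the MAC at the heart of every neuron
    return max(0.0, acc + b)       # ReLU activation

def layer(x, W, b):
    """A layer of neurons = one MAC per neuron = a matrix-vector product."""
    return np.maximum(0.0, W @ x + b)

x = np.array([1.0, 2.0, 3.0])                # illustrative input
W = np.array([[0.1, 0.2, 0.3],               # one weight row per neuron
              [0.4, 0.5, 0.6]])
b = np.array([0.0, -1.0])
print(neuron(x, W[0], b[0]))                 # 1.4: a single neuron
print(layer(x, W, b))                        # [1.4 2.2]: the whole layer
```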
Deep Neural Network
[H. Lee et al., Unsupervised learning of hierarchical representations with convolutional deep belief networks, Communications of the ACM, 2011]
Deep Neural Network
❖ Two phases: training and inference
❖ Training: determine the weights and biases
❖ Inference: apply the learned weights to determine the output
Neural Network Training
• Training
Neural Network Inference
• Inference
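A minimal sketch of the two phases on a toy one-weight model (made-up data and plain gradient descent; real training backpropagates through many layers):

```python
import numpy as np

# Toy task: learn y = 2x. The weight is what training must determine.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs

# --- Training: determine the weight via gradient descent on squared error ---
w = 0.0
for _ in range(100):
    grad = np.mean(2.0 * (w * xs - ys) * xs)  # d/dw of mean squared error
    w -= 0.1 * grad                           # gradient-descent update
print("learned weight:", w)                   # converges to ~2.0

# --- Inference: apply the learned weight to a new input ---
print("prediction for x = 5:", w * 5.0)       # ~10.0
```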
Tiled Chip Many-Core Processor (TCMP)
[Figure: the 4 x 4 TCMP mesh again - each tile pairs a Processing Element (PE) with a Router (R). Legend: PE = Processing Element, R = Router.]
Most of the system is dedicated to storing and moving data
Deep Neural Network
Convolution
Convolutions account for more than 90% of the overall computation and dominate runtime and energy consumption.
Convolution
[Figures: step-by-step sliding-window illustration of a convolution over an input feature map]
DNN Computation
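A minimal sketch of the convolution these slides animate (single channel, unit stride, no padding; sizes are illustrative):

```python
import numpy as np

def conv2d(ifmap, kernel):
    """Direct 2D convolution: slide the kernel over the input feature map."""
    H, W = ifmap.shape
    R, S = kernel.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for oy in range(out.shape[0]):            # each output row
        for ox in range(out.shape[1]):        # each output column
            for ky in range(R):               # kernel rows
                for kx in range(S):           # kernel columns: one MAC each
                    out[oy, ox] += ifmap[oy + ky, ox + kx] * kernel[ky, kx]
    return out

ifmap = np.arange(25, dtype=float).reshape(5, 5)   # illustrative 5x5 input
kernel = np.ones((3, 3)) / 9.0                     # 3x3 averaging filter
print(conv2d(ifmap, kernel))                       # 3x3 output feature map
```

The four nested loops, multiplied further by channels, filters, and batch size in a real layer, are exactly why convolutions dominate runtime and energy.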
Neural Network Accelerators
• Specialized hardware for neural networks → NN accelerators
Neural Network Data Flow Accelerators
❖ Dataflow Architectures
❖ Temporal Architectures
❖ Spatial Architectures
Dataflow Architectures
❖ Temporal architectures
❖ CPUs and GPUs
❖ Centralized control and memory
❖ ALUs cannot communicate with each other
❖ Spatial architectures
❖ FPGAs and ASICs
❖ Distributed control and memory
❖ ALUs / PEs can communicate with each other
Vivienne Sze et al., Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proceedings of the IEEE, 2017
Dataflow Architectures
❖ Weight Stationary
❖ Weight kept in PE
❖ Input Pixels and Partial Sums move
❖ Input Stationary
❖ Input kept in PE
❖ Weights and Partial Sums move
❖ Output Stationary
❖ Partial Sums kept in PE
❖ Input Pixels and Weights move
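As a software-level sketch of the weight-stationary case (loop order only, not real hardware; the other two dataflows simply pin a different operand):

```python
import numpy as np

def weight_stationary(W, X):
    """Weight-stationary loop order for out = W @ X.

    W: (M, K) weights; X: (K, N) batch of N input vectors.
    Each weight W[m, k] is fetched once (outer loops) and reused across
    all N inputs (inner loop): the weight stays put in the PE while
    input pixels and partial sums move past it.
    """
    M, K = W.shape
    _, N = X.shape
    out = np.zeros((M, N))
    for m in range(M):
        for k in range(K):
            w = W[m, k]                   # weight loaded into the PE once
            for n in range(N):            # inputs stream past the fixed weight
                out[m, n] += w * X[k, n]  # partial sums accumulate and move
    return out

W = np.array([[1.0, 2.0], [3.0, 4.0]])
X = np.array([[5.0, 7.0], [6.0, 8.0]])
print(weight_stationary(W, X))            # same result as W @ X
```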
Eyeriss
• Eyeriss Accelerator
Yu-Hsin Chen et al., Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, IEEE Journal of Solid-State Circuits, 2017
SIMBA
•Chiplet Architecture of SIMBA
Shao et al., Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture, IEEE/ACM MICRO, 2019
Near / In- Memory Methods
❖ Data movement is costlier than an ALU operation
❖ Bring data close to compute
❖ Physically closer – Near-Memory
❖ Hybrid Memory Cube
❖ Same place – In-Memory
❖ Resistive Crossbars
• Data Movement Cost Comparison
Vivienne Sze et al., Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proceedings of the IEEE, 2017
In-Memory Computing
❖ Storage and Processing in same place
❖ Data movement eliminated
❖ Saves Time and Energy
❖ Dedicated circuits
❖ In the context of neural networks:
• Memristor crossbar for matrix-vector multiplication
❖ The MAC is performed in place
❖ Data / weights are stored in the same cells
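A hedged numerical sketch of the crossbar MAC: weights are programmed as conductances G, inputs arrive as voltages V, and Ohm's law per cell plus Kirchhoff's current law per bit line yield output currents I = G · V, one matrix-vector product in a single analog step (values are illustrative; a real crossbar also needs DACs/ADCs and must tolerate device noise).

```python
import numpy as np

# Weights programmed as memristor conductances (siemens), shape (outputs, inputs).
G = np.array([[1e-3, 2e-3, 3e-3],      # illustrative conductance values
              [4e-3, 5e-3, 6e-3]])

# Inputs applied as voltages on the word lines.
V = np.array([0.1, 0.2, 0.3])

# Ohm's law per cell (i = g * v) and Kirchhoff's current law per bit line
# (currents sum) perform the whole matrix-vector product in one analog step.
I = G @ V
print(I)   # output-line currents, i.e. the MAC results: [0.0014 0.0032]
```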
In-Memory Computing
• Artificial neuron and its equivalent circuit
In-Memory Computing
Effectively a weight-stationary dataflow!
• A layer of a neural network and its equivalent crossbar circuit
SIAM
• Chiplet architecture of SIAM
Krishnan et al., SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks, ACM Transactions on Embedded Computing Systems, 2021
XRBench
❖ Examples
❖ Multi-Task Multi-Model (MTMM)
❖ Real-time execution
❖ QoS poses a challenge
Kwon et al., XRBench: An extended reality (XR) machine learning benchmark suite for the metaverse, Proceedings of Machine Learning and Systems, 2023
Classification of MTMM
❖ Cascade / Cas-MTMM
❖ One model runs after another
❖ Output of one model is the input to the next
❖ Concurrent / Con-MTMM
❖ Models execute in parallel
❖ Inputs may be the same or different
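A minimal scheduling sketch of the two classes (the model functions are hypothetical stand-ins for DNNs in an XR pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in "models"; each would be a DNN in a real XR pipeline.
def detector(frame):     return f"boxes({frame})"
def classifier(boxes):   return f"labels({boxes})"
def eye_tracker(frame):  return f"gaze({frame})"
def hand_tracker(frame): return f"pose({frame})"

frame = "frame0"

# Cascade (Cas-MTMM): one model after another; each output feeds the next.
print(classifier(detector(frame)))            # labels(boxes(frame0))

# Concurrent (Con-MTMM): independent models run in parallel, same input here.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(m, frame) for m in (eye_tracker, hand_tracker)]
    print([f.result() for f in futures])      # ['gaze(frame0)', 'pose(frame0)']
```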
HASP
Li et al., HASP: Hierarchical Asynchronous Parallelism for Multi-NN Tasks, IEEE Transactions on Computers, 2024
Big-Little Chiplets
Krishnan et al., Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture, IEEE ICCAD, 2022
Handle Data Well
❖ Ensure data does not overwhelm the components
❖ via intelligent algorithms
❖ via intelligent architectures
❖ via whole-system designs: algorithm-architecture-devices
Data-Centric Architectures
❖ Process data where it resides
❖ Processing in and near memory structures
❖ Low-latency & low-energy data access
❖ Low latency memory
❖ Low energy memory
❖ Low-cost data storage & processing
❖ High capacity memory at low cost: hybrid memory, compression
❖ Intelligent data management
❖ Intelligent controllers handling robustness, security, cost, scaling
The Way Forward
❖ Data-centric system design & intelligence spread around
❖ Do not center everything around traditional computation units
❖ Better cooperation across layers of the system
❖ Careful co-design of components and layers: system/arch/device
❖ Better, richer, more expressive and flexible interfaces
❖ Better-than-worst-case design
❖ Do not optimize for the worst case; optimize for the common case
❖ Heterogeneity in design (specialization, asymmetry)
❖ Enables a more efficient design (No one size fits all)
How to explore computer architecture ?
❖ Refer to IEEE/ACM transactions & journals
❖ IEEE TCAD, IEEE TVLSI, IEEE TOC
❖ ACM TODAES, ACM TECS, ACM TACO
❖ JPDC, JSC, JSA, CAL, ESL
❖ Refer to top tier conferences
❖ ISCA, HPCA, MICRO, ASPLOS, PACT, DATE, DAC, ICCAD
❖ ICCD, ISVLSI, ASPDAC, VLSI-SoC, GLSVLSI, NOCS, NoCArc
❖ HiPC, VLSID, VDAT, ISED
How to explore computer architecture ?
❖ Familiarize yourself with open-source architectural simulators
❖ gem5, Multi2Sim, Sniper, Tejas
❖ BookSim, DRAMSim, USIMM, GPGPU-Sim
❖ CACTI, Orion
❖ Model the architecture in simulators, implement it using HDLs,
verify sub-modules on an FPGA kit, and explore further …
Summary
❖ Multicore processors and on-chip clouds are going to become an
integral part of future digital technologies.
❖ Understanding the hardware of such systems will help us design
with conceptual clarity.
❖ Our country needs good computer architects and processor design
engineers with hands-on exposure to the VLSI design flow to cater to the
growing demand for skilled personnel in this domain.
[email protected]http://www.iitg.ac.in/johnjose/