CS223 Computer Architecture & Organization
Neural Network Accelerators
John Jose
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati
Tiled Chip Many-Core Processor (TCMP)
[Figure: a 4 x 4 mesh of tiles, each pairing a Processing Element (PE) with a Router (R) and an L2 cache bank; an inset expands one PE's pipeline - Fetch, Branch Prediction, L1 I-Cache and TLB, Register File, Decode, Issue & OoO Scheduler, Execution Unit, Control Logic, L1 D-Cache and TLB, Load/Store. Legend: PE = Processing Element, R = Router.]
Input Buffered NoC Router
[Figure: microarchitecture of one mesh router - five input ports (From East, West, North, South, and the local PE), each with virtual-channel buffers (VC 0 - VC 2) and a VC identifier; control logic comprising the Routing Unit (RC), VC Allocator (VA), and Switch Allocator (SA); and a 5 x 5 crossbar driving the five output ports (To East, West, North, South, and PE). PE = Processing Element (cores, L2 banks, memory controllers, etc.).]
AI In Daily Life and AI on Hardware
Von Neumann Bottleneck
❖ Simple fetch cycle
❖ Imagine: MOV R2, [1000]
❖ Problem: doing this billions of times!
❖ Different speeds of CPU and memory create the bottleneck
https://www.pinterest.com/pin/620019073694949552/
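To make the cost concrete, here is a minimal Python sketch of the fetch cycle, assuming illustrative latencies (a ~2 GHz core and a ~100 ns memory access; both numbers are placeholders, not measurements):

```python
# Minimal sketch of the von Neumann bottleneck (illustrative numbers only).
CPU_CYCLE_NS = 0.5         # assumed: one execute cycle on a ~2 GHz core
MEMORY_LATENCY_NS = 100.0  # assumed: one trip over the bus to DRAM

def total_time_ns(num_instructions: int) -> float:
    """Time if every instruction (e.g., MOV R2, [1000]) is fetched from memory."""
    per_instruction = MEMORY_LATENCY_NS + CPU_CYCLE_NS
    return num_instructions * per_instruction

# Doing this billions of times: the core sits idle ~99.5% of the time.
print(total_time_ns(1_000_000_000) / 1e9, "seconds for a billion instructions")
print(MEMORY_LATENCY_NS / (MEMORY_LATENCY_NS + CPU_CYCLE_NS), "fraction spent waiting")
```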
Memory Wall
❖ Different Speeds of CPU and Memory
❖ Fetching anything is slow
❖ Doing it over and over is a nightmare!
Computer Architecture: A Quantitative Approach, 5th Ed. https://developer20.com/memory-wall-problem/
Turing Tariff
❖ Cost of performing functions using general-purpose hardware
❖ General-purpose hardware can perform any function
❖ But not necessarily efficiently
https://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture/Lectures/pdfs/
Neural Networks
[Figures: a biological neuron, the artificial neuron modeled on it, and networks built from artificial neurons]
• Artificial neuron: computes a MAC; a layer computes a matrix-vector product
https://medium.com/@DannyDenenberg/linear-algebra-for-deep-learning-3a4f38a82ba7
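As a hedged sketch of that equivalence (sizes and values are made up): a single artificial neuron reduces to a MAC loop, and a full layer reduces to one matrix-vector product.

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: a multiply-accumulate (MAC) plus bias, then ReLU."""
    acc = 0.0
    for xi, wi in zip(x, w):
        acc += xi * wi             # the MAC at the heart of every neuron
    return max(0.0, acc + b)       # ReLU activation

def layer(x, W, b):
    """A layer of neurons = one MAC per neuron = a matrix-vector product."""
    return np.maximum(0.0, W @ x + b)

x = np.array([1.0, 2.0, 3.0])                # illustrative input
W = np.array([[0.1, 0.2, 0.3],               # one weight row per neuron
              [0.4, 0.5, 0.6]])
b = np.array([0.0, -1.0])
print(neuron(x, W[0], b[0]))                 # 1.4: a single neuron
print(layer(x, W, b))                        # [1.4 2.2]: the whole layer
```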
Deep Neural Network
[H. Lee et al., Unsupervised learning of hierarchical representations with convolutional deep belief networks, Communications of the ACM, 2011]
Deep Neural Network
❖ Two phases: training and inference
❖ Training: determine the weights and biases
❖ Inference: apply the learned weights to determine the output
Neural Network Training
• Training
Neural Network Inference
• Inference
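A minimal sketch of the two phases on a toy one-weight model (made-up data and plain gradient descent; real training backpropagates through many layers):

```python
import numpy as np

# Toy task: learn y = 2x. The weight is what training must determine.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs

# --- Training: determine the weight via gradient descent on squared error ---
w = 0.0
for _ in range(100):
    grad = np.mean(2.0 * (w * xs - ys) * xs)  # d/dw of mean squared error
    w -= 0.1 * grad                           # gradient-descent update
print("learned weight:", w)                   # converges to ~2.0

# --- Inference: apply the learned weight to a new input ---
print("prediction for x = 5:", w * 5.0)       # ~10.0
```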
Tiled Chip Many-Core Processor (TCMP)
[Figure: the 4 x 4 TCMP mesh again - each tile pairs a Processing Element (PE) with a Router (R). Legend: PE = Processing Element, R = Router.]
Most of the system is dedicated to storing and moving data
Deep Neural Network
Convolution
Convolutions account for more than 90% of the overall computation and dominate runtime and energy consumption.
Convolution
[Figures: step-by-step sliding-window illustration of a convolution over an input feature map]
DNN Computation
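A minimal sketch of the convolution these slides animate (single channel, unit stride, no padding; sizes are illustrative):

```python
import numpy as np

def conv2d(ifmap, kernel):
    """Direct 2D convolution: slide the kernel over the input feature map."""
    H, W = ifmap.shape
    R, S = kernel.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for oy in range(out.shape[0]):            # each output row
        for ox in range(out.shape[1]):        # each output column
            for ky in range(R):               # kernel rows
                for kx in range(S):           # kernel columns: one MAC each
                    out[oy, ox] += ifmap[oy + ky, ox + kx] * kernel[ky, kx]
    return out

ifmap = np.arange(25, dtype=float).reshape(5, 5)   # illustrative 5x5 input
kernel = np.ones((3, 3)) / 9.0                     # 3x3 averaging filter
print(conv2d(ifmap, kernel))                       # 3x3 output feature map
```

The four nested loops, multiplied further by channels, filters, and batch size in a real layer, are exactly why convolutions dominate runtime and energy.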
Neural Network Accelerators
• Specialized hardware for neural networks → NN accelerators
Neural Network Data Flow Accelerators
❖ Dataflow Architectures
❖ Temporal Architectures
❖ Spatial Architectures
Dataflow Architectures
❖ Temporal architectures
❖ CPUs and GPUs
❖ Centralized control and memory
❖ ALUs cannot communicate with each other
❖ Spatial architectures
❖ FPGAs and ASICs
❖ Distributed control and memory
❖ ALUs / PEs can communicate with each other
Vivienne Sze et al., Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proceedings of the IEEE, 2017
Dataflow Architectures
❖ Weight Stationary
❖ Weight kept in PE
❖ Input Pixels and Partial Sums move
❖ Input Stationary
❖ Input kept in PE
❖ Weights and Partial Sums move
❖ Output Stationary
❖ Partial Sums kept in PE
❖ Input Pixels and Weights move
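As a software-level sketch of the weight-stationary case (loop order only, not real hardware; the other two dataflows simply pin a different operand):

```python
import numpy as np

def weight_stationary(W, X):
    """Weight-stationary loop order for out = W @ X.

    W: (M, K) weights; X: (K, N) batch of N input vectors.
    Each weight W[m, k] is fetched once (outer loops) and reused across
    all N inputs (inner loop): the weight stays put in the PE while
    input pixels and partial sums move past it.
    """
    M, K = W.shape
    _, N = X.shape
    out = np.zeros((M, N))
    for m in range(M):
        for k in range(K):
            w = W[m, k]                   # weight loaded into the PE once
            for n in range(N):            # inputs stream past the fixed weight
                out[m, n] += w * X[k, n]  # partial sums accumulate and move
    return out

W = np.array([[1.0, 2.0], [3.0, 4.0]])
X = np.array([[5.0, 7.0], [6.0, 8.0]])
print(weight_stationary(W, X))            # same result as W @ X
```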
Eyeriss
• Eyeriss Accelerator
Yu-Hsin Chen et al., Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, IEEE Journal of Solid-State Circuits, 2017
SIMBA
•Chiplet Architecture of SIMBA
Shao et al., Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture, IEEE/ACM MICRO, 2019
Near / In- Memory Methods
❖ Data movement is costlier than an ALU operation
❖ Bring data close to compute
❖ Physically closer – Near-Memory
❖ Hybrid Memory Cube
❖ Same place – In-Memory
❖ Resistive Crossbars
• Data Movement Cost Comparison
Vivienne Sze et al., Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proceedings of the IEEE, 2017
In-Memory Computing
❖ Storage and Processing in same place
❖ Data movement eliminated
❖ Saves Time and Energy
❖ Dedicated circuits
❖ In the context of neural networks:
• Memristor crossbar for matrix-vector multiplication
❖ The MAC is performed in place
❖ Data / weights are stored in the same cells
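A hedged numerical sketch of the crossbar MAC: weights are programmed as conductances G, inputs arrive as voltages V, and Ohm's law per cell plus Kirchhoff's current law per bit line yield output currents I = G · V, one matrix-vector product in a single analog step (values are illustrative; a real crossbar also needs DACs/ADCs and must tolerate device noise).

```python
import numpy as np

# Weights programmed as memristor conductances (siemens), shape (outputs, inputs).
G = np.array([[1e-3, 2e-3, 3e-3],      # illustrative conductance values
              [4e-3, 5e-3, 6e-3]])

# Inputs applied as voltages on the word lines.
V = np.array([0.1, 0.2, 0.3])

# Ohm's law per cell (i = g * v) and Kirchhoff's current law per bit line
# (currents sum) perform the whole matrix-vector product in one analog step.
I = G @ V
print(I)   # output-line currents, i.e. the MAC results: [0.0014 0.0032]
```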
In-Memory Computing
• Artificial neuron and its equivalent circuit
In-Memory Computing
Effectively a weight-stationary dataflow!
• A layer of a neural network and its equivalent crossbar circuit
SIAM
• Chiplet architecture of SIAM
Krishnan et al., SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks, ACM Transactions on Embedded Computing Systems, 2021
XRBench
❖ Examples
❖ Multi-Task Multi-Model (MTMM)
❖ Real-time execution
❖ QoS poses a challenge
Kwon et al., XRBench: An extended reality (XR) machine learning benchmark suite for the metaverse, Proceedings of Machine Learning and Systems, 2023
Classification of MTMM
❖ Cascade / Cas-MTMM
❖ One model runs after another
❖ Output of one model is the input to the next
❖ Concurrent / Con-MTMM
❖ Models execute in parallel
❖ Inputs may be the same or different
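A minimal scheduling sketch of the two classes (the model functions are hypothetical stand-ins for DNNs in an XR pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in "models"; each would be a DNN in a real XR pipeline.
def detector(frame):     return f"boxes({frame})"
def classifier(boxes):   return f"labels({boxes})"
def eye_tracker(frame):  return f"gaze({frame})"
def hand_tracker(frame): return f"pose({frame})"

frame = "frame0"

# Cascade (Cas-MTMM): one model after another; each output feeds the next.
print(classifier(detector(frame)))            # labels(boxes(frame0))

# Concurrent (Con-MTMM): independent models run in parallel, same input here.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(m, frame) for m in (eye_tracker, hand_tracker)]
    print([f.result() for f in futures])      # ['gaze(frame0)', 'pose(frame0)']
```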
HASP
Li et al., HASP: Hierarchical Asynchronous Parallelism for Multi-NN Tasks, IEEE Transactions on Computers, 2024
Big-Little Chiplets
Krishnan et al., Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture, IEEE ICCAD, 2022
Handle Data Well
❖ Ensure data does not overwhelm the components
❖ via intelligent algorithms
❖ via intelligent architectures
❖ via whole-system designs: algorithm-architecture-devices
Data-Centric Architectures
❖ Process data where it resides
❖ Processing in and near memory structures
❖ Low-latency & low-energy data access
❖ Low latency memory
❖ Low energy memory
❖ Low-cost data storage & processing
❖ High capacity memory at low cost: hybrid memory, compression
❖ Intelligent data management
❖ Intelligent controllers handling robustness, security, cost, scaling
The Way Forward
❖ Data-centric system design & intelligence spread around
❖ Do not center everything around traditional computation units
❖ Better cooperation across layers of the system
❖ Careful co-design of components and layers: system/arch/device
❖ Better, richer, more expressive and flexible interfaces
❖ Better-than-worst-case design
❖ Do not optimize for the worst case; optimize for the common case
❖ Heterogeneity in design (specialization, asymmetry)
❖ Enables a more efficient design (No one size fits all)
How to explore computer architecture ?
❖ Refer to IEEE/ACM transactions & journals
❖ IEEE TCAD, IEEE TVLSI, IEEE TOC
❖ ACM TODAES, ACM TECS, ACM TACO
❖ JPDC, JSC, JSA, CAL, ESL
❖ Refer to top tier conferences
❖ ISCA, HPCA, MICRO, ASPLOS, PACT, DATE, DAC, ICCAD
❖ ICCD, ISVLSI, ASPDAC, VLSI-SoC, GLSVLSI, NOCS, NoCArc
❖ HiPC, VLSID, VDAT, ISED
How to explore computer architecture ?
❖ Familiarize yourself with open-source architectural simulators
❖ gem5, Multi2Sim, Sniper, Tejas
❖ BookSim, DRAMSim, USIMM, GPGPU-Sim
❖ CACTI, Orion
❖ Model the architecture in simulators, implement it using HDLs,
verify sub-modules on an FPGA kit, and explore further …
Summary
❖ Multicore processors and on-chip clouds are going to become an
integral part of future digital technologies.
❖ Understanding the hardware of such systems will help us design
with conceptual clarity.
❖ Our country needs good computer architects and processor design
engineers with hands-on exposure to the VLSI design flow to cater to the
growing demand for skilled personnel in this domain.
[email protected]http://www.iitg.ac.in/johnjose/