Building Smart SoCs
Using Virtual Prototyping for the Design and SoC Integration of Deep
Learning Accelerators
Holger Keding
Solutions Architect
© Accellera Systems Initiative 1
Agenda
• Deep Learning Market and Technology Trends
• How to Design a Deep Learning Accelerator (DLA)
• Analytical Performance Modeling
• Shift Left Architecture Analysis and Optimization with Virtual Prototyping
• Example
• Import network algorithms as prototxt and generate an analytical model spreadsheet
• Find suitable configuration and scaling parameters in the analytical model
• Validate first results and explore the architecture for dynamic and power aspects using
Virtual Platforms
• Summary
Increasing number of AI Accelerators
Source: Qualcomm AI Day Speaker Presentation 2019
Deep Learning Technology Trends
New Neural Network algorithms
– Higher accuracy, lower size, and less processing
– But: less data re-use, fewer cycles per byte
Neural Network Compiler optimizations
– Loop tiling, unrolling, and parallelization
– Splitting and fusing of Neural Network layers
– Memory layout optimization across layers
– Optimized code generation to utilize available hardware accelerators
Deep Learning Accelerator optimizations
– Schedule workload on parallel hardware engines
– Optimize and reduce data transfers to and from memory
[Diagram: Neural Network → Neural Network Compiler → AI SoC with Deep Learning Accelerator, multi-core CPU, SRAM, IO, interconnect, and DDR/HBM]
AI SoC Design Challenges
Brute-force Processing of Huge Data Sets
• Choosing the right algorithm and architecture: CPU, GPU, FPGA, vector DSP, ASIP
– CNN graphs evolve fast and time to market is short, so one cannot optimize for a single graph
– Joint design of algorithm, compiler, and target architecture
– Joint optimization of power, performance, accuracy, and cost
• Highly parallel compute drives memory requirements
– High on-chip and chip-to-chip bandwidth at low latency
– High memory bandwidth requirements for parameters and layer-to-layer communication
• Performance analysis requires realistic workloads to consider dynamic effects
– Scheduling of AI operators on parallel processing elements
– Unpredictable interconnect and memory access latencies
Large Design Space drives Differentiation by
AI Algorithm & Architecture
Agenda
• Deep Learning Market and Technology Trends
• How to Design a Deep Learning Accelerator (DLA)
• Analytical Performance Modeling
• Shift Left Architecture Analysis and Optimization with Virtual Prototyping
• Example
• Import network algorithms as prototxt and generate an analytical model spreadsheet
• Find suitable configuration and scaling parameters in the analytical model
• Validate first results and explore the architecture for dynamic and power aspects using
Virtual Platforms
• Summary
How to design a DLA?
Modeling options, connected by refine steps in one direction and validate/back-annotate loops in the other:
Analytical Models
+ Good first order
+ Results within minutes
– Omits dynamic effects
High-Level Architecture (~ varying accuracy)
+ Good for hardware exploration
+ Simulations in minutes/hours
Functional LT Model (VDK)
+ Good for SW development
+ Simulations in minutes/hours
+ Trace ops, memory accesses
– Low timing accuracy
RTL Simulation
+ Perfect accuracy
– High computational needs
– High turn-around costs
Analytical Performance Models
Simple Example: Amdahl’s Law [1]
[1] Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities (1967)
• Simple insightful formula, with restricted applicability, though.
• “All models are wrong but some are useful” (George Box, 1978)
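Amdahl's law itself is one line of math; as a minimal sketch (the function and the example numbers below are illustrative, not from the talk), it shows why the "restricted applicability" caveat matters for massively parallel accelerators:

```python
# Amdahl's law: overall speedup when a fraction p of the work can be
# spread over n processing elements. A minimal sketch, not tool code.
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallelizable work, 128 PEs give only ~17x speedup;
# the serial fraction dominates.
print(amdahl_speedup(0.95, 128))
```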
Analytical Models – Roofline Models (1)
Theoretical maximum compute power p(freq_clk, #resources) is reached with ILP or SIMD; with only thread-level parallelism, the observed performance stays below it.
Operational intensity example: 2 operations / 8 bytes fetched = 0.25 ops/byte
Roofline: an insightful visual performance model for multicore architectures (Williams, Waterman, Patterson, 2009)
Analytical Models – Roofline Models (2)
Below the compute roof, attainable performance is bounded by op_intensity · mem_bandwidth_peak; in the roofline plot this bound is a line whose slope is the maximum memory bandwidth, capped by the theoretical maximum compute power.
Operational intensity example: 2 operations / 8 bytes fetched = 0.25 ops/byte
Analytical Models – Roofline Models (3)
Kernels left of the ridge point are memory bound; kernels right of it are compute bound.
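The roofline bound reduces to a single min() expression. A small sketch, with assumed (not measured) compute and bandwidth roofs:

```python
# Roofline model: attainable performance is the minimum of the compute
# roof and the memory-bound ceiling (operational intensity times peak
# memory bandwidth). The roof values below are illustrative assumptions.
def attainable_gflops(op_intensity, peak_gflops, peak_mem_gbs):
    return min(peak_gflops, op_intensity * peak_mem_gbs)

PEAK_GFLOPS = 512.0   # assumed compute roof
PEAK_MEM_GBS = 64.0   # assumed peak memory bandwidth

# 0.25 ops/byte (2 ops per 8 fetched bytes, as on the slide) -> memory bound:
print(attainable_gflops(0.25, PEAK_GFLOPS, PEAK_MEM_GBS))   # 16.0
# Ridge point is at 512/64 = 8 ops/byte; above it -> compute bound:
print(attainable_gflops(16.0, PEAK_GFLOPS, PEAK_MEM_GBS))   # 512.0
```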
Example: Analytical Model for CNN Convolutional Layer (1)
Conv1 of AlexNet – maths textbook convolution algorithm:

for(row=0; row<oh; row++){
  for(col=0; col<ow; col++){
    for(k=0; k<oc; k++){
      for(ti=0; ti<ic; ti++){
        for(i=0; i<kh; i++){
          for(j=0; j<kw; j++){
            L: outputfm[k][row][col] +=
                 kernels[k][ti][i][j] * inputfm[ti][sw*row+i][sh*col+j];
}}}}}}

n_MAC = oh · ow · oc · kw · kh · ic = 55 · 55 · 96 · 11 · 11 · 3 = 105,415,200
Example: Analytical Model for CNN Convolutional Layer (2)
Conv1 of AlexNet – but here we assume an unlimited amount of local memory:
n_MAC = oh · ow · oc · kw · kh · ic = 55 · 55 · 96 · 11 · 11 · 3 = 105,415,200
d_MAC = d_ifmap + d_kernel = (iw · ih · ic + kw · kh · ic · oc) · B_i ≈ 0.38 MB
⇒ Operational Intensity I = n_MAC / d_MAC ≈ 278 ops/B
Example: Analytical Model for CNN Convolutional Layer (3)
Conv1 of AlexNet – opposite extreme: we assume no local memory:
n_MAC = oh · ow · oc · kw · kh · ic = 55 · 55 · 96 · 11 · 11 · 3 = 105,415,200
d_MAC = 2 · n_MAC · B_i ≈ 420 MB
⇒ Operational Intensity I = n_MAC / d_MAC = 1/4 ops/B
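The two extremes can be checked with a few lines of arithmetic. A sketch that reproduces the slide numbers, assuming a 227×227 Conv1 input and 2 bytes per element (both assumptions, chosen to match the figures above):

```python
# Operational intensity of AlexNet Conv1 under the two extreme
# memory assumptions from the slides.
oh, ow, oc = 55, 55, 96   # output height/width/channels
kw, kh, ic = 11, 11, 3    # kernel width/height, input channels
iw, ih = 227, 227         # assumed Conv1 input size
Bi = 2                    # assumed bytes per element

n_mac = oh * ow * oc * kw * kh * ic            # 105,415,200 MACs

# Unlimited local memory: input feature map and kernels fetched once.
d_best = (iw * ih * ic + kw * kh * ic * oc) * Bi
# No local memory: every MAC fetches both operands from DRAM.
d_worst = 2 * n_mac * Bi

print(n_mac)              # 105415200
print(n_mac / d_best)     # ~278 ops/byte
print(n_mac / d_worst)    # 0.25 ops/byte
```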
Example: Analytical Model for CNN Convolutional Layer (4)
Conv1 of AlexNet – practical setup: limited amount of local memory.
Maths textbook convolution algorithm:

for(row=0; row<oh; row++){
  for(col=0; col<ow; col++){
    for(k=0; k<oc; k++){
      for(ti=0; ti<ic; ti++){
        for(i=0; i<kh; i++){
          for(j=0; j<kw; j++){
            L: outputfm[k][row][col] +=
                 kernels[k][ti][i][j] * inputfm[ti][sw*row+i][sh*col+j];
}}}}}}
Example: Analytical Model for CNN Convolutional Layer (5)
Conv1 of AlexNet – with very simple tiling
Practical setup: limited amount of local memory, tiled along width + height + channel + kernel.
Example: Analytical Model for CNN Convolutional Layer (6)
Conv1 with tiling
Source: Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, Chen Zhang, 2015
Now it gets more tricky: we take into account non-integer ratios between the tiling parameters and the channel dimensions. Tiling also brings the operational intensity closer to the optimum HW utilization point.
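A sketch of how such a tiled analytical model counts off-chip traffic, loosely in the spirit of the Zhang 2015 formulation cited above. The tile sizes, the reuse pattern, and the 2-bytes-per-element assumption are all illustrative, not the talk's actual numbers:

```python
# Estimate DRAM traffic for one conv layer under simple tiling; the
# ceil() calls handle the non-integer tile/dimension ratios mentioned
# on the slide.
from math import ceil

def conv_traffic_bytes(M, N, R, C, K, Tm, Tn, Tr, Tc, Bi=2):
    """M/N: output/input channels, R/C: output rows/cols, K: kernel
    size, Tm/Tn/Tr/Tc: tile sizes, Bi: bytes per element (assumed)."""
    # Trip counts over the tiles (non-integer ratios -> ceil).
    tm, tn = ceil(M / Tm), ceil(N / Tn)
    tr, tc = ceil(R / Tr), ceil(C / Tc)
    # Each (tr, tc, tm, tn) tile loads an input patch and a weight tile;
    # each (tr, tc, tm) tile writes one output tile.
    d_in = tm * tn * tr * tc * (Tn * (Tr + K - 1) * (Tc + K - 1)) * Bi
    d_wght = tm * tn * tr * tc * (Tm * Tn * K * K) * Bi
    d_out = tm * tr * tc * (Tm * Tr * Tc) * Bi
    return d_in + d_wght + d_out

# AlexNet-Conv1-like shape with a modest (assumed) tiling:
traffic = conv_traffic_bytes(M=96, N=3, R=55, C=55, K=11,
                             Tm=48, Tn=3, Tr=11, Tc=11)
n_mac = 55 * 55 * 96 * 11 * 11 * 3
print(n_mac / traffic)  # operational intensity, between the two extremes
```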
Example: Analytical Model, Mapping Conv to HW Resources
The number of MAC cells can be configured to scale peak performance up or down; MAC cell number and depth should match the tiling parameters.
[Roofline plots: attainable performance over operational intensity (operations/byte)]
Analytical Model as Python-Generated Spreadsheet
Expressions represent both the algorithm and the HW → calculate attainable performance while exploring different numbers of MAC cells and their depth.
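A minimal sketch of what such a generated spreadsheet could look like: one row per (MAC cells, depth) configuration with its peak and roofline-attainable performance. The clock, bandwidth, and intensity values are assumptions for illustration, and plain CSV stands in for the actual spreadsheet format:

```python
# Generate a tiny "analytical model spreadsheet" as CSV.
import csv

FREQ_GHZ = 1.0        # assumed clock frequency
MEM_BW_GBS = 25.6     # assumed peak DDR bandwidth
OP_INTENSITY = 42.0   # ops/byte, e.g. from a tiled conv model

with open("dla_model.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["mac_cells", "depth", "peak_gops", "attainable_gops"])
    for mac_cells in (16, 32, 64, 128):
        for depth in (8, 16):
            # 2 ops (multiply + accumulate) per MAC per cycle
            peak = 2 * mac_cells * depth * FREQ_GHZ
            attainable = min(peak, OP_INTENSITY * MEM_BW_GBS)
            w.writerow([mac_cells, depth, peak, attainable])
```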
Analytical Model Summary
What is achieved and what comes next?
What we have seen:
+ Good first-order analysis of static effects
+ Results within minutes
~ Requires deep understanding of both algorithm and architecture
What is not covered:
– Implementation overhead is hard to predict and not 'priced in' in the first round
– Omits dynamic effects
– Joint performance and power analysis is difficult
How to design a DLA?
Modeling options, connected by refine steps in one direction and validate/back-annotate loops in the other:
Analytical Models
+ Good first order
+ Results within minutes
– Omits dynamic effects
High-Level Architecture (~ varying accuracy)
+ Good for hardware exploration
+ Simulations in minutes/hours
Functional LT Model (VDK)
+ Good for SW development
+ Simulations in minutes/hours
+ Trace ops, memory accesses
– Low timing accuracy
RTL Simulation
+ Perfect accuracy
– High computational needs
– High turn-around costs
Shift Left Architecture Analysis and Optimization
Translate the Neural Network into an NN workload model, map it onto a model of the AI SoC (Deep Learning Accelerator, multi-core CPU, SRAM, IO, interconnect, DDR/HBM), and explore power and performance; the results feed back into the Neural Network Compiler and the mapping.
Platform Architect Ultra
Providing a Comprehensive Library of Generic and Vendor-Specific Models
Flow: capture workload model → capture architecture model → analyze power & performance → exploration.
Interconnect models
• Generic: SBL-TLM2-FT (AXI), SBL-GCCI (ACE, CHI)
• IP-specific: Arteris FlexNoC & Ncore, Arm AHB/APB, Arm PL300, Arm SBL-301, Arm SBL-400, Synopsys DW AXI
Memory subsystems
• Generic multiport memory controller (GMPMC)
• DesignWare uMCTL2 memory controller
• DesignWare LPDDR5 memory controller
• RTL co-simulation/emulation
Traffic, processors, RTL
• Task-based and trace-based workload models
• User traffic scenarios
• Cycle-accurate processor models for ARM, ARC, Tensilica, CEVA
• Co-simulate with RTL
Workload Modeling and Mapping
• Workload Model
– Task-level parallelism and dependencies
– Characterized with processing cycles and memory accesses (e.g. cycles: 2000, rd_bytes: 0x200, wr_bytes: 0)
• SoC Platform Model
– Accurate SystemC transaction-level models of processing elements, interconnect, and memory
• Map workload to platform
• Analyze performance metrics
– End-to-end constraints
– Workload activity
– Utilization of resources
– Interconnect metrics: latency, throughput, contention, outstanding transactions, …
[Diagram: task graph (Task A; Task B: read image; Task C: read kernel; Task D: proc conv) mapped and recorded on a virtual prototype with ACC, DMA, interconnect, and memory subsystem]
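The task-graph idea above can be sketched in a few lines: each task carries processing cycles plus read/write bytes, and dependencies give a critical-path estimate. The task numbers and the memory-cost factor are illustrative assumptions, not Platform Architect output:

```python
# Minimal characterized task graph in the spirit of the workload model.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    cycles: int
    rd_bytes: int = 0
    wr_bytes: int = 0
    deps: list = field(default_factory=list)

a = Task("A", cycles=0, rd_bytes=0x200)                       # e.g. fetch descriptor
b = Task("B (read image)", cycles=0, rd_bytes=0x200, deps=[a])
c = Task("C (read kernel)", cycles=0, rd_bytes=0x100, deps=[a])
d = Task("D (proc conv)", cycles=2000, deps=[b, c])

def finish_cycles(t: Task, mem_cycles_per_byte=0.1) -> float:
    """Critical path: a task starts after all its dependencies finish;
    its duration is compute cycles plus a simple memory-access cost."""
    start = max((finish_cycles(p, mem_cycles_per_byte) for p in t.deps),
                default=0.0)
    mem = (t.rd_bytes + t.wr_bytes) * mem_cycles_per_byte
    return start + t.cycles + mem

print(finish_cycles(d))
```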
System Level Power Modeling
• Workload Model
– Task-level parallelism and dependencies
– Characterized with processing cycles and memory accesses
• SoC Platform Model
– Accurate SystemC transaction-level models of processing elements, interconnect, and memory
• System-level Power Overlay Model
– Define a power state machine per component (e.g. sleep, idle, active; page miss/hit for memory)
– Bind IP power models to the Virtual Prototype
– Measure power and performance based on real activity and utilization, with energy/power recording
[Diagram: task graph mapped onto the virtual prototype; per-component power state machines]
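The power overlay principle reduces to integrating per-state power over the time spent in each state. A sketch with assumed state names, power values, and an assumed activity trace (not real IP power model data):

```python
# Power overlay sketch: power per state, energy integrated from the
# time spent in each state during a simulation run.
POWER_MW = {"sleep": 1.0, "idle": 20.0, "active": 250.0}

# (state, duration in ms) as it might be recorded from a virtual
# prototype run; values are illustrative.
trace = [("idle", 2.0), ("active", 5.0), ("idle", 1.0),
         ("active", 3.0), ("sleep", 4.0)]

energy_uj = sum(POWER_MW[s] * dt for s, dt in trace)  # mW * ms = uJ
total_ms = sum(dt for _, dt in trace)
avg_power_mw = energy_uj / total_ms

print(energy_uj, avg_power_mw)
```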
Platform Architect Ultra AI Exploration Pack (XP)
Exploration & optimization of AI designs
• Automated generation of workloads from AI frameworks
– AI Operator Library for Neural Network modeling, e.g. Convolution, Matmul, MaxPool, BatchNorm, etc.
– Example workload model of the ResNet50 Neural Network
– Utility to convert a prototxt description to a CNN workload model using the AI operator library
• AI-centric HW architecture model library
– VPUs configured to represent AI compute and DMA engines
– Interconnect and memory subsystem models
– Example performance model of the NVIDIA Deep Learning Accelerator (NVDLA)
• AI-centric analysis views: memory + processing utilization
Workload Model of One Convolution Layer
Tasks: read input, read coefficients, calculate convolutions, write output feature maps. The model is parameterized by AI algorithm params, mapping params, and workload params; the scaling parameters reflect the DLA architecture and can be taken from the analytical model.
Agenda
• Deep Learning Market and Technology Trends
• How to Design a Deep Learning Accelerator (DLA)
• Analytical Performance Modeling
• Shift Left Architecture Analysis and Optimization with Virtual Prototyping
• Example
• Import network algorithms as prototxt and generate an analytical model spreadsheet
• Find suitable configuration and scaling parameters in the analytical model
• Validate first results and explore the architecture for dynamic and power aspects using
Virtual Platforms
• Summary
Example: Resnet-18 (Inference) with NVDLA
Import the Resnet18 Neural Network as prototxt, generate the Resnet18 task graph (ResNet-18 workload model generated with AI-XP), and map it onto the NVDLA platform.
Goals: 100 ms latency, minimize power, minimize energy
Optimize hardware configuration:
– SIMD width
– Burst size, outstanding transactions
– Speed of DDR memory and of data path
Example: Brief Overview of NVDLA
Convolution Engine (CONV_CORE)
• Works on two sets of data: offline-trained kernels (weights) and input
features (images)
• Configurable MAC units and convolution buffer (RAM)
• Executes operations such as tf.nn.conv2d
Single Data Point Processor (SDP)
• Applies linear and non-linear (activation) functions to individual data points.
• Executes e.g. tf.nn.batch_normalization, tf.nn.bias_add, tf.nn.elu, tf.nn.relu,
tf.sigmoid, tf.tanh, and more.
Planar Data Processor (PDP)
• Applies common CNN spatial operations such as min/max/avg pooling
• Executes e.g. tf.nn.avg_pool, tf.nn.max_pool, tf.nn.pool.
Cross-channel Data Processor (CDP)
• Processes data from different channels/features, e.g. local response normalization
(LRN) function
• Executes e.g. tf.nn.local_response_normalization
Data Reshape Engine (RUBIK)
• Performs data format transformations (splitting, slicing, merging, …)
• Executes e.g. tf.nn.conv2d_transpose, tf.concat, tf.slice, etc.
VP Simulation Results of Initial Configuration
[Analysis views: task trace, transaction trace, DDR utilization, resource utilization, throughput, outstanding transactions]
Performance is limited by processing → use a wider SIMD data path.
Simulation Reveals Implementation Effects… (1)
Differences between calculated and measured data read/write amounts
AlexNet (Norm1):
Expected: 580,800 bytes
Measured: 654,720 bytes
Inflation by ~12.72%: "dark bandwidth"
Simulation Reveals Implementation Effects… (2)
Differences between calculated and measured execution time
Convolutional Layers 1&2 of LeNet on NVDLA
Back-Annotate Simulation Findings To Analytical Model
The Caffe .prototxt feeds both the Platform Architect simulation model and the spreadsheet/analytical model; findings from simulation are back-annotated into the analytical model.
Impact of SIMD Width on Performance
[Resource utilization of CONV datapath (yellow), CONV DMA (red), and other components for SIMD widths 8, 16, 32, 64, and 128]
With narrow SIMD the design is processing bound; performance gains diminish with wider SIMD, and at SIMD 128 the CONV DMA load makes the design memory-bandwidth bound while the CONV PE load drops.
DDR Memory Bandwidth and Power Improvement
[Resource utilization (DMA, Conv PE) and power consumption (Conv PE power, DDR power) for SIMD-64 vs SIMD-128]
SIMD-128 is 25% faster and its total energy is 10% lower, while SIMD-64 has 20% lower average power.
Resnet 18 Example Sweep
Goal: 100 ms latency, minimize power & energy
Sweep parameters:
– Burst size: 16, 32
– Outstanding transactions: 4, 8
– DDR memory speed: DDR4-1866, DDR4-2400
– Clock frequency of data path: 1, 1.33, 2 GHz
– SIMD width: 64, 128 operations per cycle
Followed by sensitivity and root-cause analysis.
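The sweep above is the cross-product of the listed parameter values; a small sketch of enumerating it (a real sweep would launch one virtual-platform simulation per configuration):

```python
# Enumerate the sweep configurations from the slide's parameter lists.
from itertools import product

sweep = {
    "burst_size": [16, 32],
    "outstanding_tx": [4, 8],
    "ddr_speed": ["DDR4-1866", "DDR4-2400"],
    "datapath_ghz": [1.0, 1.33, 2.0],
    "simd_width": [64, 128],
}

configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(configs))  # 2*2*2*3*2 = 48 simulation runs
```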
Sweep Over Hardware Parameters, Latency
[Latency results across the sweep: outstanding transactions, data-path GHz, SIMD width, DDR4 speed, burst size]
Power/Performance/Energy Trade-off Analysis
[Trade-off plot over outstanding transactions, datapath GHz, SIMD width, burst size, and DDR speed; the optimal solution is highlighted]
Example: Resnet-18 with NVDLA
Generate the task graph from the Resnet18 Neural Network and map it onto the NVDLA platform.
Goal:
– 100 ms latency, minimize energy
Optimized hardware configuration:
– SIMD width: 128 operations per cycle
– Burst size: 32 bytes
– Outstanding transactions: 8
– Speed of DDR memory: DDR4-1866
– Speed of data path: 1 GHz
Summary
• Be fast and get it right!
• Shift Left with Virtual Prototyping
• Joint Optimization of Algorithm,
Architecture, and Compiler
[Flow: Neural Network → generate task graph → analytical model → map onto virtual HW platform → analyze power/performance and sensitivity → explore & refine]
Thank You
Questions