Accelerating Design Cycles with Software Workload
Models and Cycle Accurate Hardware Models
A Home Gateway SoC Case Study
Vikrant Kapila Systems Architect
Ingo Volkening Systems Architect
Anant Raj Gupta Systems Architect
Intel, Singapore
June 26-27, 2019
SNUG India – Bangalore
SoC Design Evaluation
Methods of Predicting Performance and Power
Task Graphs
Agenda
• Predicting SoC Performance using Software Workload Models
✓ Introduction to Task Graphs
✓ Introduction to HW-SW co-design Methodology
• Home Gateway SoC Case Study
✓ Task Graph Creation and Validation
✓ Performance Simulations for Next Generation Architecture
✓ Limitations of Current HW-SW co-design methodology
• Q&A
Predicting SoC Performance Using Software Workload Models
Defining Application Task Graph
TASK GRAPH - Definition
A task graph is an execution graph in which each node is an atomic task and each edge represents a dependency between the tasks' inputs and outputs.
TASK GRAPH - Purpose
A task graph for an application is created to give the designer reliable early estimates of the expected system performance.
TASK GRAPH - Granularity
Each atomic task in a task graph can represent a function, a thread, or a unique stack-level function call, depending on use-case requirements.
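As an illustration, a task graph at this granularity can be captured in a few lines. This is a minimal sketch with hypothetical field names (not the actual SPEED/TGG format), reusing the Task A/Task B numbers from the methodology figure on the next slide:

    # Minimal sketch of a task graph; field names are hypothetical, not the
    # actual SPEED/TGG representation.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Task:
        name: str
        cycles: int        # characterized processing cycles
        load_pct: float    # fraction of cycles doing memory loads
        store_pct: float   # fraction of cycles doing memory stores
        deps: List["Task"] = field(default_factory=list)  # input dependencies

    # Nodes are atomic tasks; edges are input/output dependencies.
    task_a = Task("Task A", cycles=500, load_pct=0.20, store_pct=0.10)
    task_b = Task("Task B", cycles=2000, load_pct=0.10, store_pct=0.05,
                  deps=[task_a])  # Task B consumes Task A's output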
HW-SW Co-Design Methodology
[Figure: HW-SW co-design flow, from inputs through task graph generation and SW-HW mapping to workload generation on a cycle-accurate platform model]

INPUTS: OS traces (Linux, Windows, Android, QNX) with SW filters and CPU load; software-driven data-path load (WAN/WiFi/LAN traffic, e.g., 64-byte packets at 10 Gbps); stochastic load.

Task Graph Generation (TGG): independent tasks such as Task A (500 cycles, 20 % load, 10 % store) and Task B (2000 cycles, 10 % load, 5 % store), plus further tasks (Task C, Task D), capturing:
✓ Task-level parallelism and dependencies
✓ Characterization with processing cycles and memory accesses
✓ Activation rates

SW-HW Mapping: data flows and core affinities map the x86 workload model onto the cycle-accurate multi-processor SoC platform model (multi-core CPU with L2, DDR, NoCs, DMA, SRAM buffers, CNN accelerator, WAV Gfast; Synopsys VPUs characterized as Intel CPUs for the x86 program; ARM VDKs).

NOTE: INPUTS/software traces are generated using the N-1 design platform.
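A hedged sketch of the SW-HW mapping step described above; the affinity map and helper are hypothetical, standing in for the co-design tooling:

    # Sketch of the SW-HW mapping step: pin tasks to cores of the platform
    # model before workload generation. Task and affinity names are hypothetical.
    CORE_AFFINITY = {"Task A": 0, "Task B": 1}  # explicit core pinning

    def map_workload(task_names, num_cores):
        """Return a per-core task list; unpinned tasks are spread round-robin."""
        schedule = {core: [] for core in range(num_cores)}
        next_core = 0
        for name in task_names:
            if name in CORE_AFFINITY:
                schedule[CORE_AFFINITY[name]].append(name)
            else:
                schedule[next_core].append(name)
                next_core = (next_core + 1) % num_cores
        return schedule

    print(map_workload(["Task A", "Task B", "Task C", "Task D"], num_cores=2))
    # {0: ['Task A', 'Task C'], 1: ['Task B', 'Task D']}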
Home Gateway SoC Case Study
Software Workload Creation for Linux SoCs
✓ Standard packet-processing flow for a NIC (network interface card)

INPUT
A NAS application transfer from a client to the NAS device over an Ethernet connection to the home gateway unit. A NAS transfer rate of X MiBps is observed on the reference SoC.

TRACING (Linux, PMU sampling driver)
1. Linux Ftrace
2. PMU statistics

OUTPUT (Ftrace + PMU fed into Task Graph Generation)
1. SAMBA task graph
2. Background task graph
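The deck does not detail its PMU sampling driver; as a stand-in sketch, the same class of PMU statistics can be collected with Linux perf while the transfer runs:

    # Sketch: collecting PMU statistics with Linux perf while the NAS transfer
    # is in flight. The event list is illustrative; the deck's actual PMU
    # sampling driver is not shown.
    import subprocess

    events = "cycles,instructions,cache-references,cache-misses"
    # System-wide counting for 30 seconds while the SAMBA transfer runs.
    result = subprocess.run(
        ["perf", "stat", "-a", "-e", events, "--", "sleep", "30"],
        capture_output=True, text=True,
    )
    print(result.stderr)  # perf stat writes its counter summary to stderr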
Validating Workload Model
Experiment 1: Standalone Workload Validation
The standalone workload should correctly capture the processing and communication requirements from the input traces.

Validation Metric   Trace        Task Graph   Error
Execution Time      12.259 sec   11.859 sec   3.26 %
DDR Read            5337.54 MB   5350 MB      0.2 %
DDR Write           2002.10 MB   2014 MB      0.5 %

A CPU utilization of 61.25 % is observed.
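The error column is simply the relative deviation of the task-graph model from the measured trace; a quick check of the table's numbers:

    # Quick check of the validation numbers above: relative error of the
    # task-graph model against the measured trace.
    def rel_error(trace_value: float, model_value: float) -> float:
        return abs(trace_value - model_value) / trace_value * 100

    print(f"{rel_error(12.259, 11.859):.2f} %")  # execution time -> 3.26 %
    print(f"{rel_error(5337.54, 5350):.2f} %")   # DDR read -> ~0.2 %
    print(f"{rel_error(2002.10, 2014):.2f} %")   # DDR write -> ~0.6 % (reported as 0.5 %, rounding differs)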
Understanding the Trace
• The tracer is "function".
• The difference between entries written and entries in the buffer (250280 – 140080) is the number of entries that were lost because the buffer filled up.
• The task name is "bash" and the task PID is "1977".
• "000" is the CPU the task was running on; the timestamp is in <secs>.<usecs> format.
• The traced function name is "sys_close", and the parent function that called it is "system_call_fastpath".
• The timestamp is the time at which the function was entered.
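A line in this format can be parsed mechanically; in the sketch below, the sample line is representative of the function tracer's output (it is not taken verbatim from the deck):

    import re

    # One representative function-tracer line, in the format described above:
    #   <task>-<pid> [<cpu>] <secs>.<usecs>: <function> <-<parent function>
    line = "bash-1977  [000] 17284.993652: sys_close <-system_call_fastpath"

    pattern = re.compile(
        r"(?P<task>\S+)-(?P<pid>\d+)\s+"      # task name and PID
        r"\[(?P<cpu>\d+)\]\s+"                # CPU the task ran on
        r"(?P<ts>\d+\.\d+):\s+"               # timestamp at function entry
        r"(?P<func>\S+)\s+<-(?P<parent>\S+)"  # traced function and its caller
    )

    m = pattern.match(line)
    print(m.groupdict())
    # {'task': 'bash', 'pid': '1977', 'cpu': '000', 'ts': '17284.993652',
    #  'func': 'sys_close', 'parent': 'system_call_fastpath'}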
Example Recipe: ftrace
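The recipe itself did not survive extraction; a minimal function-tracer session through tracefs, assuming the usual mount point and root privileges, looks roughly like this:

    # Minimal sketch of driving the ftrace function tracer through tracefs.
    # Assumes tracefs is mounted at /sys/kernel/debug/tracing and root privileges.
    import time

    TRACING = "/sys/kernel/debug/tracing"

    def write(node: str, value: str) -> None:
        with open(f"{TRACING}/{node}", "w") as f:
            f.write(value)

    write("current_tracer", "function")  # select the function tracer
    write("tracing_on", "1")             # start tracing
    time.sleep(10)                       # run the workload of interest here
    write("tracing_on", "0")             # stop tracing

    with open(f"{TRACING}/trace") as f:  # read back the captured trace
        print(f.read())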
EMON Trace
Sample counter line from an EMON trace:
    LONGEST_LAT_CACHE.REFERENCE   4,001,560,225   127,915,172   15,890,489
Flame Graph – Quick Analysis
• Capture: perf record -F 99 -p 13204 -g -- sleep 30
• Each box represents a function in the stack (a
"stack frame").
• The y-axis shows stack depth (number of frames
on the stack). The top box shows the function
that was on-CPU. Everything beneath that is
ancestry. The function beneath a function is its
parent, just like the stack traces shown earlier.
• The x-axis spans the sample population. It
does not show the passing of time from left to
right, as most graphs do. The width of the box
shows the total time it was on-CPU or part of an
ancestry that was on-CPU (based on sample
count).
• Functions with wide boxes may consume more
CPU per execution than those with narrow boxes,
or, they may simply be called more often. The call
count is not shown (or known via sampling).
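To render the flame graph from such a capture, the usual pipeline folds the perf stacks and feeds them to Brendan Gregg's FlameGraph scripts; a sketch (the FlameGraph checkout path is an assumption):

    # Sketch: turning the perf.data captured above into a flame-graph SVG with
    # Brendan Gregg's FlameGraph scripts
    # (https://github.com/brendangregg/FlameGraph).
    import subprocess

    FLAMEGRAPH_DIR = "FlameGraph"  # assumed location of the checkout

    # 'perf script' expands the sampled stacks into text form.
    stacks = subprocess.run(["perf", "script"], capture_output=True,
                            text=True, check=True).stdout
    # stackcollapse-perf.pl folds each stack into a single line with a count.
    folded = subprocess.run([f"{FLAMEGRAPH_DIR}/stackcollapse-perf.pl"],
                            input=stacks, capture_output=True, text=True,
                            check=True).stdout
    # flamegraph.pl renders the folded stacks as an interactive SVG.
    svg = subprocess.run([f"{FLAMEGRAPH_DIR}/flamegraph.pl"],
                         input=folded, capture_output=True, text=True,
                         check=True).stdout
    with open("flame.svg", "w") as f:
        f.write(svg)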
Task Graph Generation Tools
Two task-graph generation tools are available: SPEED (internal) and TGG (3rd party).

Input Trace Type                Workload Abstraction        Comments
OS Trace                        Thread; Thread + Function   SPEED: Linux, Windows; TGG: Linux, Android, QNX
x86 Binary                      Function, Instruction       Intel PIN based
Software Analysis (VDK)         Thread, Function            Virtualizer Software Development Kits
Custom Performance Statistics   Workload Characterization   Instruction count, cache statistics
Ptrace                          Function, Instruction       Linux ptrace
The HW-SW partitioning use case for the SAMBA transfer workload model required kernel tracing support, which is available with SPEED.
Task Graph Validation & Exploration Experiments
Configurable Workload Model
1. BASE_TRANSFER_RATE_MBPS: the NAS rate observed on the host reference platform. We observed X MiBps on our platform.
2. INPUT_TRANSFER_RATE_MBPS: the NAS rate required for the next-generation SoC. We will input 2.5X MiBps for our next-generation SoC.
3. NAS Calibration Factor: based on the underlying extrapolation function, for example linear extrapolation.

Mapping: setting specific core affinities.

NOTE: For the next experiment we assume huge memory bandwidth.
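A sketch of how these three knobs could fit together, assuming linear extrapolation; the numeric values are placeholders, since the deck anonymizes them as X:

    # Sketch of the configurable workload model; parameter names follow the
    # slide, numeric values are placeholders (the deck anonymizes them as X).
    X = 100.0                            # placeholder for the anonymized rate
    BASE_TRANSFER_RATE_MBPS = X          # rate observed on the reference platform
    INPUT_TRANSFER_RATE_MBPS = 2.5 * X   # target rate for the next-gen SoC

    # NAS calibration factor: with linear extrapolation, task activation rates
    # scale proportionally with the requested transfer rate.
    NAS_CALIBRATION = INPUT_TRANSFER_RATE_MBPS / BASE_TRANSFER_RATE_MBPS

    def scaled_activation_rate(base_rate_hz: float) -> float:
        """Activation rate of a task at the requested transfer rate."""
        return base_rate_hz * NAS_CALIBRATION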
Experiment 2: Workload Elasticity (Varying CPU Cores)
The workload should behave in a realistic manner when the number of hardware resources changes.

System configuration: CPU frequency F GHz, INPUT_TRANSFER_RATE X MiBps.

Results:
• 1 core: NAS rate 0.8X MiBps, 100 % utilization
• 2 cores (reference configuration): NAS rate X MiBps, 120 % utilization
• 4 cores: NAS rate X MiBps
OBSERVATIONS: Increased Number of Cores
✓ The single core is almost 100 % utilized, and the NAS transfer rate of 0.8X MiBps is limited by compute load.
✓ Dual core shows better performance, verifying the realistic behaviour of the task graph with respect to compute resources.
✓ The reference design simulation shows strong correlation in CPU utilization (60 % vs. 61.25 %).
✓ Quad core does not improve the NAS rate further, verifying the INPUT_TRANSFER_RATE configuration.
Experiment 3: Workload Elasticity (Increasing CPU Frequency)
The workload should behave in a realistic manner when the hardware compute power changes.

System configuration: CPU frequency 1.25F GHz, INPUT_TRANSFER_RATE X MiBps.

OBSERVATIONS: Increased CPU Frequency
✓ The single core is almost 98 % utilized on average, and the NAS rate improved to X MiBps from 0.8X MiBps in the F GHz core configuration.
✓ Dual-core average utilization decreased from 60 % to 48 % for the same NAS file transfer rate of X MiBps.
Performance Simulations for Next Gen Architecture
The SAMBA workload is mapped to the cycle-accurate next-generation SoC platform (core affinities: 100 %).

Simulation Sweep
Parameter Name      Values
INPUT_NAS_RATE      X MiBps, 2.5X MiBps
CPU_FREQUENCY       F GHz, 1.25F GHz
NUMBER_OF_CORES     1, 2, 4

(The real cycle-accurate HW platform is not shown here; a reference image is used.)
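The sweep amounts to a cross product of the three parameters; a sketch of enumerating the configurations (F and X are placeholders for the anonymized values, and the simulator launch itself is left out):

    # Sketch of the simulation sweep from the table above. The parameter names
    # come from the slide; F and X are placeholders for anonymized values.
    from itertools import product

    F = 1.0    # placeholder; the real frequency is anonymized as "F GHz"
    X = 100.0  # placeholder; the real rate is anonymized as "X MiBps"

    sweep = {
        "INPUT_NAS_RATE":  [X, 2.5 * X],
        "CPU_FREQUENCY":   [F, 1.25 * F],
        "NUMBER_OF_CORES": [1, 2, 4],
    }

    for rate, freq, cores in product(*sweep.values()):
        # Here the workload model would be launched on the cycle-accurate
        # platform with this configuration (launch call omitted).
        print(f"simulating: rate={rate} MiBps, freq={freq} GHz, cores={cores}")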
Simulation Time
[Charts over simulation time for INPUT_NAS_RATE = X MiBps and INPUT_NAS_RATE = 2.5X MiBps]

OBSERVATIONS
✓ With an INPUT_NAS_RATE of X MiBps, increasing the frequency does not impact the performance of multi-core scenarios, as the workload is not compute limited.
✓ With an INPUT_NAS_RATE of 2.5X MiBps, performance improves with increased compute power, be it frequency or number of cores.
NAS Transfer Rate: X MiBps
[Charts: total CPU utilization and NAS transfer rate per configuration. Single core: 100 % utilization, 0.8X MiBps; single core at 1.25F GHz: 98 % utilization; dual core: 120 % utilization, X MiBps; quad core: X MiBps]
OBSERVATIONS
✓ The core utilization of multi-core scenarios decreases with higher frequencies. This indicates that multi-core
scenarios are not compute limited.
✓ Only the single-core configuration is compute limited.
NAS Transfer Rate: 2.5X MiBps
[Charts: total CPU utilization and NAS transfer rate per configuration. Single core: 100 % utilization; dual core: 100 % utilization; quad core: 2X MiBps, limited software parallelism; quad core at 1.25F GHz: best performance, 260 % utilization]
OBSERVATIONS
✓ With an INPUT_NAS_RATE of 2.5X MiBps, the NAS rate improves with frequency and number of cores in all scenarios.
✓ The NAS rate is not limited by resource parallelism, but task-level parallelism is insufficient to fully utilize all four cores.
Results Summary
Hardware Architecture Feedback
➢ The maximum NAS speed of 2.5X MiBps is observed on a quad-core configuration with a total CPU utilization of 265 % at a frequency of 1.25F GHz.
➢ We can derive a requirement of 4 cores to reach the targeted performance of our next architecture design.
Software Architecture Feedback
➢ The software architecture limits full utilization of the available compute resources.
➢ After discussions with the software architecture team, we found that this speed will be further limited by specific core affinities.
Limitations of Methodology
• Cache Size Extrapolation
✓ No definitive way to account for changes in the cache hierarchy.
✓ Limited by the availability of exact memory access patterns.
• VPU CPI Characterization
✓ The generic Virtual Processing Unit (VPU) needs to be manually characterized to mimic the real hardware.
✓ For example, the software performance model comprising tasks is characterized using software instructions, while the CPI for the underlying hardware needs to come from the VPU.
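To make the CPI point concrete, a sketch with hypothetical numbers: a task characterized only by instruction count becomes time on the platform through the VPU's CPI and clock.

    # Sketch: converting a task's instruction count into execution time on the
    # modeled hardware. The CPI is a manual characterization of the VPU; all
    # numbers here are hypothetical.
    def task_time_seconds(instructions: int, cpi: float, freq_ghz: float) -> float:
        cycles = instructions * cpi       # instructions -> processing cycles
        return cycles / (freq_ghz * 1e9)  # cycles -> seconds at the core clock

    # A task of 2e9 instructions at CPI 1.2 on a 2 GHz core:
    print(task_time_seconds(2_000_000_000, cpi=1.2, freq_ghz=2.0))  # 1.2 s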
Conclusion
The HW-SW co-design methodology enables early analysis of Linux SoC architectures:
• Linux workload creation and validation
• Platform mapping and KPI analysis

Benefits
• Predict SoC performance early, including the software workload.
• Define better products and SoCs.
• Reduce schedule risk.
Thank You