GDC AMD Ryzen Processor Software Optimization
SOFTWARE OPTIMIZATION
KEN MITCHELL
AGENDA
• Abstract
• Speaker Biography
• Products
• Microarchitecture
• Data Flow
• Best Practices
• Optimizations
AMD PUBLIC | GDC22 | AMD Ryzen™ Processor Software Optimization | March 2022 2
ABSTRACT
SPEAKER BIOGRAPHY
• Ken Mitchell is a Principal Member of
Technical Staff in the AMD Game
Engineering team where he focuses on
helping game developers utilize AMD
processors efficiently. His previous work
includes automating & analyzing PC
applications for performance projections of
future AMD products as well as developing
benchmarks. Ken studied computer science
at the University of Texas at Austin.
PRODUCTS
AMD RYZEN™ 6000 SERIES MOBILE PROCESSORS
MODEL                 GRAPHICS MODEL     CORES  THREADS  MAX. BOOST CLOCK  BASE CLOCK  DEFAULT TDP
AMD Ryzen™ 9 6980HX   AMD Radeon™ 680M   8      16       Up to 5.0GHz      3.3GHz      45W
AMD Ryzen™ 9 6980HS   AMD Radeon™ 680M   8      16       Up to 5.0GHz      3.3GHz      35W
AMD Ryzen™ 9 6900HX   AMD Radeon™ 680M   8      16       Up to 4.9GHz      3.3GHz      45W
AMD Ryzen™ 9 6900HS   AMD Radeon™ 680M   8      16       Up to 4.9GHz      3.3GHz      35W
AMD Ryzen™ 7 6800H    AMD Radeon™ 680M   8      16       Up to 4.7GHz      3.2GHz      45W
AMD Ryzen™ 7 6800HS   AMD Radeon™ 680M   8      16       Up to 4.7GHz      3.2GHz      35W
AMD Ryzen™ 7 6800U    AMD Radeon™ 680M   8      16       Up to 4.7GHz      2.7GHz      15-28W
AMD Ryzen™ 5 6600H    AMD Radeon™ 660M   6      12       Up to 4.5GHz      3.3GHz      45W
AMD Ryzen™ 5 6600HS   AMD Radeon™ 660M   6      12       Up to 4.5GHz      3.3GHz      35W
AMD Ryzen™ 5 6600U    AMD Radeon™ 660M   6      12       Up to 4.5GHz      2.9GHz      15-28W
AMD RYZEN™ 5000 SERIES DESKTOP PROCESSORS
AMD RYZEN™ THREADRIPPER™ PRO 5000WX SERIES PROCESSORS
MICROARCHITECTURE
“ZEN 3”
• +19% IPC Improvement
• Unified 8-Core CCD
• 32MB L3$ per CCD
• Improved Load Store Unit
• Wider FP & Int
• New Instructions
• Improved SMT fairness
SIMULTANEOUS MULTI-THREADING
• High-performance cores have gaps in utilization which may be filled by additional hardware threads. This is Simultaneous Multi-Threading (SMT).
• Although each hardware thread has its own program counter and architectural register set, they share core resources.
[Diagram: program threads A and B run as hardware threads #1 and #2 on one core. Each hardware thread has its own program counter and architectural register set; both share the scheduler.]
CORE RESOURCE SHARING DEFINITIONS
Category                Definition
Competitively shared    Resource entries are assigned on demand. A thread may use all resource entries.
Statically partitioned  Resource entries are partitioned when entering two-threaded mode. A thread may not use more resource entries than are available in its partition.
CORE RESOURCE SHARING EVOLUTION
DESKTOP CACHE HIERARCHY EVOLUTION
INSTRUCTION SET EVOLUTION
CLFLUSHOPT
WBNOINVD
MONITORX
XSAVEOPT
FSGSBASE
VPCLMUL
OSXSAVE
PCLMUL
RDSEED
XSAVES
XSAVEC
XGETBV
CLZERO
MOVBE
RDRND
SSE4.2
XSAVE
SSE4.1
SSSE3
SMAP
CLWB
SMEP
VAES
AVX2
BMI2
FMA
F16C
ADX
AVX
SHA
AES
BMI
Core
“Zen 3” 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
“Zen 2” 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
“Zen 1” 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
“Jaguar” 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0
SOFTWARE PREFETCH INSTRUCTIONS
Prefetch T0|T1|T2|NTA
• Load a cache line from the specified memory address into the data-cache level specified by the locality reference hint T0, T1, T2, or NTA.
• Lines filled into the L2 cache with PREFETCHNTA are marked for quicker eviction.
[Diagram: Memory (gigabytes), L2 (512 KB), L1 (32 KB). Labels: "Prefetch: fill lines" at L1 and "Aggressively evict NTA lines" at L2.]
HARDWARE PREFETCHERS L1
Category   Definition
L1 Stream  Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
L1 Stride  Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous.
L1 Region  Uses memory access history to fetch additional lines when the data access for a given instruction tends to be followed by a consistent pattern of other accesses within a localized region.
HARDWARE PREFETCHERS L2
Category    Definition
L2 Stream   Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
L2 Up/Down  Uses memory access history to determine whether to fetch the next or previous line for all memory accesses.
AMD PREFERRED CORE
• Some AMD products have cores which are faster than other cores.
• The system BIOS describes the CPPC Highest Performance ranking for each logical processor.
• The Windows Kernel creates a PerformanceSchedulingClass ranking based on this information and uses it during scheduling.
• Logical processor 0 and CCD0 may not be the fastest.
• Testing done by AMD performance labs February 12, 2022 on an AMD reference motherboard equipped with 16GB DDR4-3200MHz, Ryzen™ 9 5950X with Radeon™ RX 6900 XT, Win11 Pro x64 22000.493. Hypothetical example shown. Actual results may vary.
[Chart: PerformanceSchedulingClass (higher is better) per logical processor, 0 through 30.]
DATA FLOW
AMD RYZEN™ 7 6800U MOBILE PROCESSOR
• AMD Ryzen™ 7 6800U, 15W TDP, 8 Cores, 16 Threads, up to 4.7 GHz max boost clock, 2.7 GHz base clock,
integrated GPU.
AMD RYZEN™ 9 5950X DESKTOP PROCESSOR
[Diagram: two CCDs and one IOD.]
• AMD Ryzen™ 9 5950X, 105W TDP, 16 Cores, 32 Threads, up to 4.9 GHz max boost clock, 3.4 GHz base clock.
• Two Core Complex Dies (CCDs). Each CCD has one 32MB L3 cache cluster.
AMD RYZEN™ THREADRIPPER™ PRO 5995WX PROCESSOR
[Diagram: eight CCDs, numbered 0 through 7, and one IOD.]
• AMD Ryzen™ Threadripper™ Pro 5995WX, 280W TDP, 64 Cores, 128 Threads, up to 4.5 GHz boost, 2.7 GHz base.
• Two CCDs per Data Fabric Quadrant shown.
BEST PRACTICES
REDUCE BUILD TIMES
• Performance of UE4.27.2 binaries compiled with Microsoft Visual Studio.
• Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon™ RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 build 21H2, 1920x1080 resolution. Actual results may vary.
[Chart: Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64, time in seconds (less is better) by system configuration. VS2017, Without Virus Exclusion Folders: 231. VS2022, With Virus Exclusion Folders: 119.]
USE THE LATEST COMPILER AND WINDOWS® SDK
• Get the latest build and link time improvements.
• Get the latest library and runtime optimizations.
• Performance of UE4.27.2 binaries compiled with Microsoft Visual Studio.
• Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R RDDR4-3200 at 24-22-22-52) memory, AMD Radeon™ RX 6800 XT GPU with driver 21.10.2 (October 25, 2021), 2TB M.2 NVME SSD, AMD Reference Motherboard, Windows® 11 x64 build 21H2, 1920x1080 resolution. Actual results may vary.
[Chart: Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64, time in seconds (less is better) by Visual Studio Build Tools version. 2017 v15.9.43: 205. 2019 v16.11.9: 121. 2022 v17.0.5: 119.]
ADD VIRUS AND THREAT PROTECTION EXCLUSIONS
• WARNING: Not recommended for CI/CD systems. Exclusions may make your device vulnerable to threats.
• Add project folders to virus and threat protection settings exclusions for faster build times.
• Faster rebuild time after optimization!
• Performance of UE4.27.2 binaries compiled with Microsoft Visual Studio 2022 v17.0.5.
• Testing done by AMD technology labs, February 5, 2022 on the following system. Test configuration: AMD Ryzen™ Threadripper™ PRO 5995WX, Enermax LIQTECH TR4 II series 360mm liquid cooler, 256GB (8 x 32GB 2R
[Chart: Msbuild.exe UE4.sln -target:Engine\UE4:Rebuild -property:Configuration=Shipping -property:Platform=Win64, time in seconds (less is better) by exclusion folder: None, C:\UnrealEngine-4.27.2-release, C:\. Values shown: 150, 119, 119.]
PREFER SHIPPING CONFIGURATION BUILDS FOR CPU PROFILING
• Debug and development builds may greatly reduce performance.
• Stats collection may cause cache pollution.
• Logging may create serialization points.
• Debug builds may disable multi-threading optimizations.
• While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Be sure to disable debug features before you ship!
[Chart: UE4.27.2 InfiltratorDemo DX12 1080p, Average FPS (higher is better). Values shown: 193, 192, 158.]
DISABLE ANTI-TAMPER WHILE CPU PROFILING
• Build a binary similar to the Shipping configuration but without Anti-Tamper or Anti-Cheat, which may prevent CPU profiling tools from properly loading symbols.
AUDIT CONTENT
• Ask artists to recommend profiling scenes of interest!
• For example, an indoor dungeon, an outdoor city, an outdoor forest, large crowds, or a specific time of day.
TEST COLD SHADER CACHE FIRST TIME USER EXPERIENCE
rem Run as administrator
rem Disable Steam Shader Pre-Caching before running this script
rem Reboot after running this script to clear any shaders still in system memory
setlocal enableextensions
cd /d "%~dp0"
rmdir /s /q "%LOCALAPPDATA%\D3DSCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\DxCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\GLCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\VkCache"
rmdir /s /q "%ProgramData%\NVIDIA Corporation\NV_Cache"
rmdir /s /q "%ProgramFiles(x86)%\Steam\steamapps\shadercache"
OPTIMIZATIONS
TOPICS
• Use the AMD Core Counts Sample
• Use Modern Sync APIs
• Avoid False Sharing
• Prefer data access patterns matching hardware prefetcher behaviors
• Use Software Prefetch instructions for linked data structures experiencing cache misses
• Align Memcpy source and destination pointers
• Avoid Penalties while mixing SSE and AVX instructions
• Support Hybrid Graphics
• Use Preferred Video and Audio Codecs
USE THE AMD CORE COUNTS SAMPLE
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Many applications show SMT benefits and use of all logical processors is recommended
• However, games often suffer from SMT and cache contention on the main or render threads during
gameplay
• Creating the thread pool based on physical core count rather than logical processor count may reduce this
contention
• Profile your game to determine the ideal thread count
• Game initialization—including decompressing assets and compiling/warming shaders—may benefit
from logical processors using SMT dual-thread mode
• Game play may prefer physical core count using SMT single-thread mode
• See https://gpuopen.com/learn/cpu-core-counts/
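The guidance above could be sketched as a small helper that picks a worker-thread count per phase. This is only an illustration: `Phase`, `CoreCounts`, `chooseWorkerCount`, and `logicalProcessorCount` are hypothetical names and are not taken from the AMD Core Counts sample. The physical core count is assumed to be obtained elsewhere (for example from the sample linked above), because `std::thread::hardware_concurrency()` reports logical processors only.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical names for illustration; not from the AMD Core Counts sample.
enum class Phase { Initialization, Gameplay };

struct CoreCounts {
    unsigned physical; // physical cores (assumed to be queried elsewhere)
    unsigned logical;  // logical processors, SMT threads included
};

// Pick a worker-thread count for the given phase.
// Initialization (asset decompression, shader warm-up) may benefit from
// every logical processor; gameplay may prefer one worker per physical core.
// Never returns zero.
inline unsigned chooseWorkerCount(const CoreCounts& counts, Phase phase) {
    const unsigned budget =
        (phase == Phase::Initialization) ? counts.logical : counts.physical;
    return std::max(1u, budget);
}

// hardware_concurrency() may return 0 when the count is unknown,
// so treat that case as a single logical processor.
inline unsigned logicalProcessorCount() {
    return std::max(1u, std::thread::hardware_concurrency());
}
```

The result is a starting point only; as the slide says, profile the game to find the ideal thread count.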
USE MODERN SYNC APIS
• Prefer std::mutex which has good performance and low CPU utilization.
• Performance of binaries compiled with Microsoft Visual Studio 2022 v17.0.4.
[Chart: Sync API Test, execution time in milliseconds (less is better) by API.]
USE MODERN SYNC APIS
• Prefer std::mutex which has good performance and low CPU utilization.
• Performance of binaries compiled with Microsoft Visual Studio 2022 v17.0.4.
[Chart: Sync API Test, Total CPU Utilization in percent (less is better) by API.]
USE MODERN SYNC APIS: SHARED CODE
#include "intrin.h"
#include <algorithm>  // std::fill
#include <chrono>
#include <cstdio>     // wprintf
#include <cstdlib>    // strtof, EXIT_SUCCESS
#include <numeric>
#include <thread>
#include <vector>
#include <mutex>
#include <Windows.h>
#define LEN 128

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

void fn(); // each of the following slides provides its own fn()

int main(int argc, char* argv[]) {
    using namespace std::chrono;

    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    std::vector<std::thread> threads = {};
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads.push_back(std::thread(fn));
    }
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i].join();
    }
    auto t1 = high_resolution_clock::now();
    wprintf(L"time (ms): %lli\n",
        duration_cast<milliseconds>(t1 - t0).count());
    return EXIT_SUCCESS;
}
USE MODERN SYNC APIS: BAD USER SPIN LOCK
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN ==
            _InterlockedCompareExchange(
                reinterpret_cast<long*>(pl),
                LOCK_IS_TAKEN, LOCK_IS_FREE)) {
        }
    }
    void Unlock(PLOCK pl) {
        _InterlockedExchange(reinterpret_cast<long*>(pl),
            LOCK_IS_FREE);
    }
}

MyLock::LOCK gLock;

void fn() {
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        MyLock::Lock(&gLock);
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a,
            (float*)(a + LEN), 0.0f);
        MyLock::Unlock(&gLock);
    }
    wprintf(L"result: %f\n", r);
}
USE MODERN SYNC APIS: IMPROVED USER SPIN LOCK
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *pl) ||
            (LOCK_IS_TAKEN ==
                _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
    void Unlock(PLOCK pl) {
        _InterlockedExchange(reinterpret_cast<long*>(pl),
            LOCK_IS_FREE);
    }
}

alignas(64) MyLock::LOCK gLock;

void fn() {
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        MyLock::Lock(&gLock);
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a,
            (float*)(a + LEN), 0.0f);
        MyLock::Unlock(&gLock);
    }
    wprintf(L"result: %f\n", r);
}
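The improved lock above waits on a plain read and only attempts the atomic exchange once the lock looks free, pausing between attempts. The same test-and-test-and-set idea can be sketched portably with `std::atomic`. `SpinLock` and `countWithSpinLock` are hypothetical names, and `std::this_thread::yield()` is used here as a portable stand-in for `_mm_pause()`; that substitution is an assumption, not an equivalent. The slides still recommend `std::mutex` over user spin locks.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical test-and-test-and-set spin lock, for illustration only.
class SpinLock {
public:
    void lock() {
        for (;;) {
            // Wait with plain loads so the cache line stays shared
            // while another thread holds the lock.
            while (taken_.load(std::memory_order_relaxed)) {
                std::this_thread::yield(); // stand-in for _mm_pause()
            }
            // The lock looked free: try to take it with one atomic exchange.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    alignas(64) std::atomic<bool> taken_{false}; // own cache line
};

// Each thread increments a shared counter while holding the lock.
inline long long countWithSpinLock(unsigned numThreads, unsigned iterations) {
    SpinLock lock;
    long long total = 0;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) {
        threads.emplace_back([&] {
            for (unsigned i = 0; i < iterations; ++i) {
                std::lock_guard<SpinLock> guard(lock);
                ++total;
            }
        });
    }
    for (auto& th : threads) th.join();
    return total;
}
```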
USE MODERN SYNC APIS: WAITFORSINGLEOBJECT
// MyLock not required. Let the OS do the work!

HANDLE hMutex;

int main(int argc, char* argv[]) {
    hMutex = CreateMutex(NULL,FALSE,NULL);
    // otherwise main is the same as before.
    // ...
}

void fn() {
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        WaitForSingleObject(hMutex, INFINITE);
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a,
            (float*)(a + LEN), 0.0f);
        ReleaseMutex(hMutex);
    }
    wprintf(L"result: %f\n", r);
}
USE MODERN SYNC APIS: STD::MUTEX
// MyLock not required. Let the OS do the work!
std::mutex mutex;

void fn() {
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        mutex.lock();
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a,
            (float*)(a + LEN), 0.0f);
        mutex.unlock();
    }
    wprintf(L"result: %f\n", r);
}
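The explicit `lock()` and `unlock()` calls above can also be written with a scoped guard, which releases the mutex on every exit path. The sketch below is a reduced illustration, not the benchmark from the slides: `countWithMutex` is a hypothetical helper, and the matrix work is replaced by a simple shared counter.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: numThreads threads each add 1 to a shared counter
// `iterations` times, holding a std::mutex for each update.
inline long long countWithMutex(unsigned numThreads, unsigned iterations) {
    std::mutex m;
    long long total = 0;
    std::vector<std::thread> threads;
    threads.reserve(numThreads);
    for (unsigned t = 0; t < numThreads; ++t) {
        threads.emplace_back([&] {
            for (unsigned i = 0; i < iterations; ++i) {
                // The guard locks here and unlocks when it leaves scope.
                std::lock_guard<std::mutex> guard(m);
                ++total;
            }
        });
    }
    for (auto& th : threads) th.join();
    return total;
}
```

A waiting thread blocks in the operating system instead of spinning, which is consistent with the low CPU utilization the slides attribute to `std::mutex`.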
AVOID FALSE SHARING
• Reduced execution time by 90% after optimization!
• Performance of binaries compiled with Microsoft Visual Studio 2022 v17.0.5.
• Testing done by AMD technology labs, February 5, 2022 on the following system. Test
[Chart: False Sharing Test, execution time in milliseconds (less is better). Value shown: 28,598.]
AVOID FALSE SHARING
#include <chrono>
#include <cstdlib>   // srand, rand, EXIT_SUCCESS
#include <cwchar>    // wprintf
#include <malloc.h>  // _aligned_malloc, _aligned_free
#include <numeric>
#include <thread>
#include <vector>

#if defined (APPLY_OPTIMIZATION)
/* 64 bytes */
struct alignas(64) ThreadData { unsigned long sum; };
#else
/* 4 bytes */
struct ThreadData { unsigned long sum; };
#endif

using namespace std::chrono;
#define NUM_ITER 100000000

void fn(ThreadData* p, size_t seed) {
    srand(static_cast<unsigned int>(seed));
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++) {
        p->sum += rand() % 2;
    }
}

int main(int argc, char* argv[]) {
    int numThreads = std::thread::hardware_concurrency();
    ThreadData* a = static_cast<ThreadData*>(_aligned_malloc(
        numThreads*sizeof(ThreadData), 64));
    if (nullptr == a) return EXIT_FAILURE;
    std::vector<std::thread> threads = {};
    auto t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads.push_back(std::thread(fn, &a[i], i));
    }
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i].join();
    }
    auto t1 = high_resolution_clock::now();
    wprintf(L"time (ms): %lli\n",
        duration_cast<milliseconds>(t1 - t0).count());
    for (size_t i = 0; i < numThreads; ++i) {
        wprintf(L"sum[%llu] = %lu\n", i, (*(a + i)).sum);
    }
    _aligned_free(a);
    return EXIT_SUCCESS;
}
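A portable variant of the padded layout above might look like the following. It is a sketch under stated assumptions: `PaddedCounter` and `sumPerThread` are hypothetical names, the 64-byte line size is taken from the slide, and `std::vector` is assumed to honor the over-aligned element type through C++17 aligned allocation rather than `_aligned_malloc`.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread slot padded and aligned to a 64-byte cache line.
struct alignas(64) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "one slot per cache line");
static_assert(alignof(PaddedCounter) == 64, "slot starts on a line boundary");

// Each thread writes only its own slot, so no two threads share a line.
inline std::vector<unsigned long> sumPerThread(unsigned numThreads,
                                               unsigned long iterations) {
    std::vector<PaddedCounter> slots(numThreads); // C++17 aligned allocation
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) {
        threads.emplace_back([&slots, t, iterations] {
            for (unsigned long i = 0; i < iterations; ++i) {
                slots[t].sum += (i + t) % 2; // deterministic stand-in for rand() % 2
            }
        });
    }
    for (auto& th : threads) th.join();

    std::vector<unsigned long> result;
    for (const auto& s : slots) result.push_back(s.sum);
    return result;
}
```

The `static_assert` lines document the layout assumption at compile time, so removing `alignas(64)` fails the build instead of quietly reintroducing false sharing.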
PREFER DATA ACCESS PATTERNS MATCHING HARDWARE PREFETCHER BEHAVIORS
Streaming
Stride
STREAMING HARDWARE PREFETCHER
Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
STRIDE HARDWARE PREFETCHER
Uses memory access history of individual instructions to fetch additional lines when each access is a constant
distance from the previous.
struct S { double x1, y1, z1, w1; char name[256]; double x2, y2, z2, w2; };
alignas(64) S a[LEN];
// …
double sumX1 = 0.0f, sumX2 = 0.0f;
for (size_t i = 0; i < LEN; i++) {
    sumX1 += a[i].x1; // stride prefetch 0
    sumX2 += a[i].x2; // stride prefetch 1
}
USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA
• Over 60% faster after optimization!
• Performance of binaries compiled with Microsoft Visual Studio 2019 v16.8.3.
• Testing done by AMD technology labs, January 4, 2021 on the following system. Test configuration: AMD Ryzen™ 7 4700G, AMD Wraith Spire Cooler, 16GB (2 x 8GB DDR4-3200 at 22-22-22-52) memory, NVidia GeForce RTX™ 2080 GPU with driver 460.89 (December 15, 2020), 512GB M.2 NVME SSD, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 20H2, 1920x1080 resolution. Actual results may vary.
[Chart: Nvidia PhysX 4.1 KaplaDemo, AMD Ryzen™ 7 4700G, NVidia GeForce RTX™ 2080, Average FPS at start of demo (higher is better). Before optimization: 125. After optimization: 210.]
USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA…
// Copyright (c) 2021 NVIDIA Corporation. All rights reserved
// ConvexRenderer.cpp from https://github.com/NVIDIAGameWorks/PhysX/tree/4.1/physx
void ConvexRenderer::updateTransformations()
{
    for (int i = 0; i < (int)mGroups.size(); i++) {
        ConvexGroup *g = mGroups[i];
        if (g->texCoords.empty())
            continue;
        float* tt = &g->texCoords[0];
        for (int j = 0; j < (int)g->convexes.size(); j++) {
            const Convex* c = g->convexes[j];
#if defined(APPLY_OPTIMIZATION)
            int distance = 4; // TODO find ideal number
            size_t future = (j + distance) % g->convexes.size();
            _mm_prefetch(0x0F8 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mPxActor
            _mm_prefetch(0x100 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mLocalPose
            _mm_prefetch(0x148 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.x
            _mm_prefetch(0x14C + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.y
            _mm_prefetch(0x150 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialOffset.z
            _mm_prefetch(0x164 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mSurfaceMaterialId
            _mm_prefetch(0x160 + (char*)(g->convexes[future]), _MM_HINT_NTA); // mMaterialId
#endif
            PxMat44 pose(c->getGlobalPose());
            float* mp = (float*)pose.front();
            float* ta = tt;
            for (int k = 0; k < 16; k++) {
                *(tt++) = *(mp++);
            }
            PxVec3 matOff = c->getMaterialOffset();
            ta[3] = matOff.x;
            ta[7] = matOff.y;
            ta[11] = matOff.z;
            int idFor2DTex = c->getSurfaceMaterialId();
            int idFor3DTex = c->getMaterialId();
            const int MAX_3D_TEX = 8;
            ta[15] = (float)(idFor2DTex*MAX_3D_TEX + idFor3DTex);
        }
        glBindTexture(GL_TEXTURE_2D, g->matTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, g->texSize,
            g->texSize, GL_RGBA, GL_FLOAT, &g->texCoords[0]);
        glBindTexture(GL_TEXTURE_2D, 0);
    }
}
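The pattern in the PhysX change above, prefetching the object a few iterations ahead while walking an array of pointers, can be reduced to a small self-contained sketch. `Item`, `prefetchNta`, and `sumWithPrefetch` are hypothetical names, and the prefetch distance is a tuning parameter to be found by profiling, as the TODO in the slide code suggests. On compilers without SSE intrinsics the sketch falls back to `__builtin_prefetch` or to no prefetch at all.

```cpp
#include <cstddef>
#include <vector>

#if defined(_M_X64) || defined(_M_IX86) || defined(__SSE__)
#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_NTA
#define SKETCH_HAVE_MM_PREFETCH 1
#endif

// Hypothetical heap object reached through a pointer.
struct Item {
    int value;
};

// Issue a non-temporal prefetch hint where one is available.
inline void prefetchNta(const void* p) {
#if defined(SKETCH_HAVE_MM_PREFETCH)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_NTA);
#elif defined(__GNUC__)
    __builtin_prefetch(p, 0, 0); // read access, no temporal locality
#else
    (void)p; // no prefetch available on this compiler
#endif
}

// Sum the items, hinting the cache about the object `distance`
// iterations ahead. The hint never changes the result.
inline long long sumWithPrefetch(const std::vector<const Item*>& items,
                                 std::size_t distance) {
    long long total = 0;
    const std::size_t n = items.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            prefetchNta(items[i + distance]);
        }
        total += items[i]->value;
    }
    return total;
}
```

Unlike the slide code, which wraps the future index with a modulo, this sketch simply skips the hint near the end of the array; either choice keeps the index in range.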
ALIGN MEMCPY SOURCE AND DESTINATION POINTERS
• Update the compiler for the latest memcpy, memset, and other C runtime optimizations!
• memcpy behavior is undefined if dest and src overlap.
• The compiler may generate Rep Move String (rep movs) instructions, which have defined overlapping behavior.
• alignas(64) may allow faster rep movs microcode.
• alignas(4096) may reduce store-to-load conflicts.
• The processor uses linear address bits 0 through 11 to determine Store-To-Load-Forward eligibility.
• PMCx024 LsBadStatus2 StliOther counts store-to-load conflicts where a load was unable to complete due to a non-forwardable conflict with an older store.
• alignas(4096) may benefit probe filtering on AMD Threadripper™ and EPYC™ processors.
• Aligning to the bit_floor of the copy size may provide a good balance of cache hits and alignment:
• std::clamp<size_t>(std::bit_floor(count), 4, 4096); // count is a size_t
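The clamp expression above can be wrapped in a small helper that turns a copy size into an alignment between 4 and 4096 bytes. `bitFloor` and `chooseAlignment` are hypothetical names. `bitFloor` is a C++17 stand-in written for this sketch; with C++20, `std::bit_floor` from `<bit>` is assumed to give the same result.

```cpp
#include <algorithm>
#include <cstddef>

// Largest power of two that is <= n, or 0 when n is 0.
// C++17 stand-in for C++20 std::bit_floor.
constexpr std::size_t bitFloor(std::size_t n) {
    std::size_t p = 0;
    for (std::size_t bit = 1; bit != 0 && bit <= n; bit <<= 1) {
        p = bit; // remember the last power of two that still fits
    }
    return p;
}

// Hypothetical helper: choose a buffer alignment from the copy size,
// never below 4 bytes and never above a 4096-byte page.
constexpr std::size_t chooseAlignment(std::size_t count) {
    return std::clamp<std::size_t>(bitFloor(count), 4, 4096);
}

static_assert(chooseAlignment(100) == 64, "bit floor of 100 is 64");
static_assert(chooseAlignment(1) == 4, "small copies clamp up to 4");
```

Spelling out the `std::size_t` template argument matters: `std::clamp` deduces one type for all three arguments, so mixing a `std::size_t` value with `int` literals does not compile without it.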
AVOID PENALTIES WHILE MIXING SSE AND AVX INSTRUCTIONS
• There is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data.
• Benchmark execution time was reduced by 60% after VZeroUpper optimization.
• Performance of binaries compiled with Microsoft Visual Studio 2022 v17.0.5.
[Chart: mesh_to_sdf.exe --maxload, AVX2 (8-wide), execution time in milliseconds (less is better). Value shown: 31,512.]
AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS
• Use PMCx00E Floating Point Dispatch Faults > 0 to find code which may be missing VZeroUpper or
VZeroAll instructions during AVX to SSE and SSE to AVX transitions.
• Optimization 1:
• Use the /arch:AVX compiler flag.
• AVX is supported by 94% of users according to the January 2022 Steam Hardware & Software Survey.
• Optimization 2:
• Return a __m256 value using pass-by-reference in the function parameter list rather than the
function return type.
• Optimization 3:
• Use __forceinline on the function definition.
AVOID PENALTY FOR MIXING SSE AND AVX INSTRUCTIONS
// Before Optimization
__m256 udTriangle_sq_precalc_SIMD_8grid(
    const __m256 p_x, const __m256 p_y,
    const __m256 p_z, const tri_precalc_t &pc )
{
    // ...
    __m256 res = _mm256_blendv_ps( res1, res0,
        cmp );
    return res;
}

// After Optimization
void udTriangle_sq_precalc_SIMD_8grid(
    const __m256 p_x, const __m256 p_y,
    const __m256 p_z, const tri_precalc_t& pc,
    __m256 &ret )
{
    // ...
    ret = _mm256_blendv_ps( res1, res0,
        cmp );
}
Before the optimization,
FP_DISPATCH_FAULTS may occur because
there is no VZeroUpper or VZeroAll
instruction during the AVX to SSE transition.
After the optimization,
FP_DISPATCH_FAULTS have been reduced
because there is a VZeroUpper instruction
during the AVX to SSE transition.
SUPPORT HYBRID GRAPHICS
• Use IDXGIFactory6::EnumAdapterByGpuPreference with DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications.
USE PREFERRED VIDEO AND AUDIO CODECS
• Prefer H264 video and AAC audio codecs as recommended
by the Unreal Engine Electra Plugin.
SOFTWARE OPTIMIZATION GUIDES
• AMD Family 19h is “Zen 3”
• See https://developer.amd.com/resources/developer-guides-manuals/
Design faster. Render faster. Iterate faster.
DISCLAIMER AND NOTICES
Disclaimer The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and
typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not
limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security
vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this
information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without
obligation of AMD to notify any person of such revisions or changes. THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO
REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF
ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD is not responsible for any electronic virus or damage or losses therefrom that may be caused by changes or modifications that you make to
your system, including but not limited to antivirus software. Changes to your system configurations and settings, including but not limited to
antivirus software, is done at your sole discretion and under no circumstances will AMD be liable to you for any such changes. You assume all risk
and are solely responsible for any damages that may arise from or are related to changes that you make to your system, including but not limited
to antivirus software.
AMD, the AMD Arrow logo, Ryzen™, Threadripper™, Radeon™, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other
product names used in this publication are for identification purposes only and may be trademarks of their respective companies. Microsoft,
Windows, and Visual Studio are registered trademarks of Microsoft Corporation in the US and/or other countries. Unreal® is a trademark or
registered trademark of Epic Games, Inc. in the United States of America and elsewhere. NVIDIA is a trademark and/or registered trademark of
NVIDIA Corporation in the U.S. and/or other countries. Steam is a trademark and/or registered trademark of Valve Corporation. PCIe is a
registered trademark of PCI-SIG.
AMD products or technologies may include hardware to accelerate encoding or decoding of certain video standards but require the use of
additional programs/applications.
2022 Advanced Micro Devices, Inc. All rights reserved.
DISCLAIMER AND NOTICES
Code sample on slide 48 (ConvexRenderer.cpp, under USE SOFTWARE PREFETCH INSTRUCTIONS FOR LINKED DATA) is modified.
Copyright (c) 2022 NVIDIA Corporation. All rights reserved. Code Sample is licensed subject to the following:
“Redistribution and use in source and binary forms, with or without modification, are permitted provided that the
following conditions are met: Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or other materials provided with the
distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or
promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED
BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.”
MeshToSDF, Copyright 2022 Mikkel Gjoel under MIT License. https://github.com/pixelmager/MeshToSDF
Infiltrator Demo uses the Unreal® Engine. Unreal® is a trademark or registered trademark of Epic Games, Inc. in the
United States of America and elsewhere.
Unreal® Engine, Copyright 1998 – 2022, Epic Games, Inc. All rights reserved.
DISCLAIMER AND NOTICES
• Claim “Zen 3” +19% IPC uplift
• Testing by AMD performance labs as of 09/01/2020. IPC evaluated with a selection of 25 workloads running at a locked 4GHz
frequency on 8-core "Zen 2" Ryzen™ 7 3800XT and "Zen 3" Ryzen™ 7 5800X desktop processors configured with Windows® 10,
NVIDIA GeForce RTX 2080 Ti (451.77), Samsung 860 Pro SSD, and 2x8GB DDR4-3600. Results may vary. R5K-003
• Design faster. Render faster. Iterate faster. Create more, faster with AMD Ryzen™ processors
• Testing by AMD Performance Labs as of September 23, 2020 using a Ryzen™ 9 5950X and Intel Core i9-10900K configured with
DDR4-3600C16 and NVIDIA GeForce RTX 2080 Ti. Results may vary. R5K-039
• The information contained herein is for informational purposes only and is subject to change without notice. Timelines, roadmaps,
and/or product release dates shown herein are plans only and subject to change. "Zen 2" and "Zen 3" are codenames for AMD
architectures, and are not product names. GD-122
• Engineering projections are not a guarantee of final performance. Performance projections by AMD engineering staff based on
expected Ryzen™ Threadripper™ Pro 5000 WX series processors vs Ryzen™ Threadripper™ Pro 3000 WX series processors. Specific
projections are based on reference design platforms and are subject to change when final products are released in market.