High Performance Assembly Programming

Ray Seyfarth

June 29, 2012

64 Bit Intel Assembly Language

© 2011 Ray Seyfarth

Outline

Optimizations common to C/C++ and assembly

Optimizations the compiler can do for you in C, but which you must do yourself in assembly

Optimizations for assembly only

Use a better algorithm

A highly efficient insertion sort is still O(n^2)
Using qsort from C is generally faster
Using the C++ STL sort is faster still
A hash table is O(1) for lookup on average
If you need an ordered dictionary, perhaps the STL map is best
Tuning an O(n^2) algorithm in assembly will not convert it to O(n lg n)
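
As a rough illustration of the library route, here is a minimal C sketch that sorts an array of long values with the standard qsort function. The names compare_long and sort_longs are made up for this example and do not come from the slides.

```c
#include <stdlib.h>

/* Hypothetical comparison callback for qsort: returns <0, 0, or >0.
   Written without subtraction so large values cannot overflow. */
static int compare_long(const void *pa, const void *pb)
{
    long a = *(const long *)pa;
    long b = *(const long *)pb;
    return (a > b) - (a < b);
}

/* Hypothetical wrapper: sort n long values in ascending order
   using the C library's qsort. */
void sort_longs(long *data, size_t n)
{
    qsort(data, n, sizeof data[0], compare_long);
}
```

Calling sort_longs(v, 5) on an array holding 5, -2, 9, 0, 3 leaves it in the order -2, 0, 3, 5, 9.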

Use C or C++

An optimizing compiler will implement nearly all of the general optimizations
It will do them tirelessly, missing almost nothing
Most of a program is not time-critical
Perhaps 10% of a program is worth optimizing
You must usually find a non-obvious technique to get better performance than the compiler
Use the -S option to get an assembly listing
Learn the compiler's tricks
Perhaps you can do the compiler's tricks better

Efficient use of cache

The CPU operates at about 3 GHz
Main memory can provide perhaps 7 bytes per machine cycle
Cache is much faster than main memory
Organize your algorithm to work on data in blocks which fit in cache
The plot below shows time versus array size for computing 10 billion
exclusive or operations
[Plot: Time to Compute XOR. Seconds to process 80 GB versus array size in bytes (10000 to 1e+10)]

Efficient use of cache(2)

The plot below illustrates a dramatic performance gain through better
use of cache
The task was to compute a 1024 x 1024 matrix multiplication
The code was written in C using 6 nested loops
The 3 inner-most loops multiplied one block by another
[Plot: 1024x1024 Matrix Multiplication. MFLOPS versus block size]
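
The code behind the plot is not shown on the slide. The following is only a sketch of one common way to write a blocked multiplication with 6 nested loops, where the 3 inner loops multiply one block by another. The function name matmul_blocked, the row-major layout, and the parameter bs (block size) are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical sketch, not the author's code.
   Computes C += A * B for n x n row-major matrices in bs x bs blocks.
   Assumes bs > 0 and that the caller has zeroed c beforehand.
   n does not have to be a multiple of bs. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = ii + bs < n ? ii + bs : n;
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = kk + bs < n ? kk + bs : n;
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = jj + bs < n ? jj + bs : n;
                /* The 3 inner loops multiply one block by another. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The idea is that a block of each matrix stays in cache while it is reused. Which block size works best depends on the machine, so it is worth timing several values.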

Common sub-expression elimination

The compiler will probably do this better than you

You can examine its generated code and perhaps notice something
you have overlooked
I would bet my money on the compiler with this trick

Strength reduction

This refers to replacing an expensive operation with a cheaper, equivalent one
Dividing an unsigned integer by 8 can be done with a shift right by 3 bits
Getting the remainder after division by 1024 can be done using and with 1023
Rather than using pow(x,3) use x*x*x
Compute x^4 by computing x^2 and then squaring that
Avoid repeated division by a floating point number x by computing 1/x once and multiplying instead
Again the compiler will do this tirelessly
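
The reductions listed above can be sketched in C as follows. All function names are invented for this example, and the integer versions assume unsigned values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers illustrating strength reduction. */

/* Divide by 8 with a 3-bit right shift (valid for unsigned values). */
uint64_t div8(uint64_t x) { return x >> 3; }

/* Remainder after division by 1024 with a mask (1024 is 2^10). */
uint64_t mod1024(uint64_t x) { return x & 1023; }

/* x^3 with two multiplications instead of a call to pow(x, 3). */
double cube(double x) { return x * x * x; }

/* x^4 by squaring x^2: two multiplications rather than three. */
double fourth(double x)
{
    double x2 = x * x;
    return x2 * x2;
}

/* Divide every element by d using one division and n multiplications.
   Results can differ from true division in the last bit. */
void scale_by_reciprocal(double *v, size_t n, double d)
{
    double r = 1.0 / d;
    for (size_t i = 0; i < n; i++) {
        v[i] *= r;
    }
}
```

An optimizing compiler will normally apply the integer cases on its own. It usually applies the reciprocal trick only when relaxed floating point rules are allowed, because the result is not always bit-identical to a division.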

Use registers efficiently

The compiler will do this automatically

Place commonly-used values in registers
If you unroll a loop, use different registers to allow parallel execution
of parts of your computation

Use fewer branches

Branches interrupt the instruction pipeline

The compiler will frequently re-order blocks of code to reduce
branches
Study the compiler's generated code
Use conditional moves for simple computations
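
C has no way to request a conditional move directly, so the sketch below only shows code written in a form that makes one likely. Whether the compiler actually emits a conditional move depends on the compiler and its options, so check the -S listing. The function names are invented for this example.

```c
/* Hypothetical helper: a simple select with no side effects,
   which compilers can often turn into a conditional move. */
long max_long(long a, long b)
{
    return a > b ? a : b;
}

/* Hypothetical helper: count the elements of v below limit without
   an if in the loop body. The comparison yields 0 or 1, which is
   added directly to the count. */
long count_below(const long *v, long n, long limit)
{
    long count = 0;
    for (long i = 0; i < n; i++) {
        count += v[i] < limit;
    }
    return count;
}
```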

Convert loops to branch at the bottom

The compiler generally does this to reduce the number of instructions
in a loop and, especially, the number of branches
Here is a C for loop
    for ( i = 0; i < n; i++ ) {
        x[i] = a[i] + b[i];
    }
By adding an if at the start you can loop with a branch at the bottom
Don't do this in C. The compiler will handle this.
    if ( n > 0 ) {
        i = 0;
        do {
            x[i] = a[i] + b[i];
            i++;
        } while ( i < n );
    }

Unroll loops

Use -funroll-all-loops to have gcc unroll loops

Unrolling means repeated occurrences of the loop body with multiple
parts of the data being processed
Try to make each unrolling use different registers to reduce
instruction dependence
This frees up the CPU to do out-of-order execution
It can do more pipelining and more parallel execution

Assembly code adding numbers in an array, unrolled

The addition is done as 4 sub-sums which are added later
The four unrolled parts accumulate into 4 different registers
.add_words:
        add     rax, [rdi]        ; sub-sum 0
        add     rbx, [rdi+8]      ; sub-sum 1
        add     rcx, [rdi+16]     ; sub-sum 2
        add     rdx, [rdi+24]     ; sub-sum 3
        add     rdi, 32           ; advance past 4 quad-words
        sub     rsi, 4            ; 4 fewer values remain
        jg      .add_words
        add     rcx, rdx          ; combine the 4 sub-sums
        add     rax, rbx
        add     rax, rcx          ; total ends up in rax
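
For comparison, the same idea can be sketched in C with 4 independent partial sums. The function name sum_unrolled is invented for this example. Unlike the assembly loop, this version also handles a count that is not a multiple of 4.

```c
#include <stddef.h>

/* Hypothetical sketch: sum n values using 4 independent partial sums,
   so that the 4 additions in each pass do not depend on each other. */
long sum_unrolled(const long *v, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++) {          /* leftover elements, at most 3 */
        s0 += v[i];
    }
    return (s0 + s1) + (s2 + s3); /* combine the sub-sums at the end */
}
```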

Merge loops
If 2 loops have the same loop limits, consider merging the bodies
There will be less loop overhead
The following 2 loops can be profitably merged
    for ( i = 0; i < 1000; i++ ) a[i] = b[i] + c[i];
    for ( j = 0; j < 1000; j++ ) d[j] = b[j] - c[j];
After merging, the values of b[i] and c[i] can be used twice
    for ( i = 0; i < 1000; i++ ) {
        a[i] = b[i] + c[i];
        d[i] = b[i] - c[i];
    }

Split loops

Didn't I just suggest merging loops?
Sometimes the data is unrelated and merging doesn't help
Perhaps splitting uses cache better
Test your code

Interchange loops
    for ( j = 0; j < n; j++ ) {
        for ( i = 0; i < n; i++ ) {
            x[i][j] = 0;
        }
    }
The previous loop steps through the x array in large increments
The loop below steps through the array one element after the other
Cache fetches are better used
    for ( i = 0; i < n; i++ ) {
        for ( j = 0; j < n; j++ ) {
            x[i][j] = 0;
        }
    }

Move loop-invariant code outside the loop

You can do this in C, but the compiler will do it for you

The assembler does not move loop-invariant code
Again, study the generated code
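
A small before-and-after sketch of the transformation, written in C for readability. The function names and the particular computation are invented for this example.

```c
#include <stddef.h>

/* Hypothetical "before" version: scale * offset never changes inside
   the loop, yet it is written as if recomputed on every iteration. */
void fill_naive(double *out, size_t n, double scale, double offset)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (double)i + scale * offset;
    }
}

/* Hypothetical "after" version: the invariant product is computed
   once, before the loop starts. */
void fill_hoisted(double *out, size_t n, double scale, double offset)
{
    double k = scale * offset;
    for (size_t i = 0; i < n; i++) {
        out[i] = (double)i + k;
    }
}
```

In C an optimizing compiler is expected to turn the first form into the second. In hand-written assembly the second form is what you have to write yourself.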

Remove recursion

Eliminating tail-recursion is generally useful

If removing recursion means simulating the stack that recursion gives you, the recursive version will probably be faster
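
As an illustration of tail-recursion elimination, here is a sketch using Euclid's greatest common divisor algorithm. The example and its function names were chosen for this note and are not from the slides.

```c
/* Hypothetical tail-recursive version: the recursive call is the
   last thing the function does, so nothing needs to be saved. */
unsigned long gcd_recursive(unsigned long a, unsigned long b)
{
    if (b == 0) {
        return a;
    }
    return gcd_recursive(b, a % b);
}

/* Same computation with the tail call replaced by a loop:
   the arguments are updated in place and control jumps back. */
unsigned long gcd_loop(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long r = a % b;
        a = b;
        b = r;
    }
    return a;
}
```

No stack has to be simulated here, which is why this kind of recursion is cheap to remove.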

Eliminate stack frames

Use -fomit-frame-pointer with gcc
Use this for debugged code
Using the rbp register as a frame pointer is optional
Leaf functions don't even need to worry about stack alignment
Unless you are using some local data requiring 16 byte alignment

Inline functions

The compiler can do this painlessly

In assembly you will make your code less readable
Explore using macros

Reduce dependencies to allow super-scalar execution

Use different registers to try to reduce dependencies

The CPU has multiple computational units in 1 core
You can benefit from out-of-order execution
You can get more out of pipelines
You can keep more computational units busy

Use specialized instructions

The compiler will have a harder time doing this than you
There are SIMD integer instructions
There are also SIMD floating point instructions
The AVX instructions are a new feature which allow twice as many
floating point values in the SIMD registers
