Cortex®-A72 Software
Optimization Guide
Date of Issue: March 10, 2015
Copyright ARM Limited 2015. All rights reserved.
Copyright © 2015 ARM. All rights reserved.
ARM UAN 0016A Page 1 of 42
Release Information
The following changes have been made to this Software Optimisation Guide.
Change History
Date Issue Confidentiality Change
1 June 2015 Non-confidential First release
Proprietary Notice
Words and logos marked with ™ or ® are registered trademarks or trademarks of ARM® in the EU and other
countries except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein
may be the trademarks of their respective owners.
Neither the whole nor any part of the information contained in, or the product described in, this document may be
adapted or reproduced in any material form except with the prior written permission of the copyright holder.
The product described in this document is subject to continuous developments and improvements. All particulars
of the product and its use contained in this document are given by ARM in good faith. However, all warranties
implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are
excluded.
This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss
or damage arising from the use of any information in this document, or any error or omission in such information,
or any incorrect use of the product.
Where the term ARM is used it means “ARM or any of its subsidiaries as appropriate”.
Confidentiality Status
This document is Non-Confidential. This document has no restriction on distribution.
Product Status
The information in this document is final, that is, for a developed product.
Web Address
http://www.arm.com
Contents
1 ABOUT THIS DOCUMENT
  1.1 References
  1.2 Terms and abbreviations
  1.3 Document Scope
2 INTRODUCTION
  2.1 Pipeline Overview
3 INSTRUCTION CHARACTERISTICS
  3.1 Instruction Tables
  3.2 Branch Instructions
  3.3 Arithmetic and Logical Instructions
  3.4 Move and Shift Instructions
  3.5 Divide and Multiply Instructions
  3.6 Saturating and Parallel Arithmetic Instructions
  3.7 Miscellaneous Data-Processing Instructions
  3.8 Load Instructions
  3.9 Store Instructions
  3.10 FP Data Processing Instructions
  3.11 FP Miscellaneous Instructions
  3.12 FP Load Instructions
  3.13 FP Store Instructions
  3.14 ASIMD Integer Instructions
  3.15 ASIMD Floating-Point Instructions
  3.16 ASIMD Miscellaneous Instructions
  3.17 ASIMD Load Instructions
  3.18 ASIMD Store Instructions
  3.19 Cryptography Extensions
  3.20 CRC
4 SPECIAL CONSIDERATIONS
  4.1 Dispatch Constraints
  4.2 Conditional Execution
  4.3 Conditional ASIMD
  4.4 Register Forwarding Hazards
  4.5 Load/Store Throughput
  4.6 Load/Store Alignment
  4.7 Branch Alignment
  4.8 Setting Condition Flags
  4.9 Special Register Access
  4.10 AES Encryption/Decryption
  4.11 Fast literal generation
  4.12 PC-relative address calculation
  4.13 FPCR self-synchronization
1 ABOUT THIS DOCUMENT
1.1 References
This document refers to the following documents.
Title | Location
ARM Cortex-A72 MPCore Processor Technical Reference Manual | Infocenter.arm.com
1.2 Terms and abbreviations
This document uses the following terms and abbreviations.
Term | Meaning
ALU | Arithmetic/Logical Unit
ASIMD | Advanced SIMD
µop | Micro-Operation
VFP | Vector Floating Point
1.3 Document Scope
This document provides high-level information about the Cortex®-A72 processor pipeline, instruction performance
characteristics, and special performance considerations. This information is intended to aid those who are
optimizing software and compilers for the Cortex®-A72 processor. For a more complete description of the
Cortex®-A72 processor, please refer to the ARM Cortex®-A72 MPCore Processor Technical Reference Manual,
available at infocenter.arm.com.
2 INTRODUCTION
2.1 Pipeline Overview
The following diagram describes the high-level Cortex®-A72 instruction processing pipeline. Instructions are first
fetched, then decoded into internal micro-operations (µops). From there, the µops proceed through register
renaming and dispatch stages. Once dispatched, µops wait for their operands and issue out-of-order to one of
eight execution pipelines. Each execution pipeline can accept and complete one µop per cycle.
[Figure: Cortex-A72 pipeline. In-order stages: Fetch; Decode, Rename, Dispatch. Out-of-order stages: Issue, followed by eight execution pipelines: Branch, Integer 0, Integer 1, Integer Multi-Cycle, FP/ASIMD 0, FP/ASIMD 1, Load, Store.]
The execution pipelines support different types of operations, as shown below.
Pipeline (mnemonic) | Supported functionality
Branch (B) | Branch µops
Integer 0/1 (I) | Integer ALU µops
Multi-cycle (M) | Integer shift-ALU, multiply, divide, CRC and sum-of-absolute-differences µops
Load (L) | Load and register transfer µops
Store (S) | Store and special memory µops
FP/ASIMD-0 (F0) | ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, crypto µops
FP/ASIMD-1 (F1) | ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, FP sqrt, ASIMD shift µops
3 INSTRUCTION CHARACTERISTICS
3.1 Instruction Tables
This chapter describes high-level performance characteristics for most ARMv8 A32, T32 and A64 instructions. It
includes a series of tables to summarize the effective execution latency and throughput, pipelines utilized, and
special behaviors associated with each group of instructions. Utilized pipelines correspond to the execution
pipelines described in chapter 2.
In the following tables:
• Exec Latency is defined as the minimum latency seen by an operation dependent on an instruction in the
described group.
• Execution Throughput is defined as the maximum throughput (in instructions / cycle) of the specified
instruction group that can be achieved in the entirety of the Cortex®-A72 microarchitecture.
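As a rough illustration of how the two metrics combine, the sketch below computes a lower bound on the cycles needed for a block of identical instructions. The function name and the simplified model (no dispatch limits, no cache misses, no competing instructions) are assumptions made for this example and are not part of the guide.

```python
from fractions import Fraction
import math


def min_cycles(count, exec_latency, throughput, dependent):
    """Lower-bound cycle estimate for `count` identical instructions.

    dependent=True  models a serial chain in which each instruction
                    consumes the previous result (bound by Exec Latency).
    dependent=False models fully independent instructions (bound by
                    Execution Throughput, plus one latency for the last
                    result to become available).
    """
    if count <= 0:
        return 0
    if dependent:
        return count * exec_latency
    # Earliest cycle at which the last instruction can issue.
    last_issue = math.floor(Fraction(count - 1) / Fraction(throughput))
    return last_issue + exec_latency


# Eight basic ALU instructions (Exec Latency 1, Execution Throughput 2).
independent = min_cycles(8, 1, 2, dependent=False)
chained = min_cycles(8, 1, 2, dependent=True)
```

Under this model the independent case is limited by throughput and the chained case by latency, which is why breaking dependency chains usually matters more than the raw instruction count.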
3.2 Branch Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Branch, immed | B | 1 | 1 | B |
Branch, register | BX | 1 | 1 | B |
Branch and link, immed | BL, BLX | 1 | 1 | I0/I1, B |
Branch and link, register | BLX | 1 | 1 | I0/I1, B |
Compare and branch | CBZ, CBNZ | 1 | 1 | B |
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Branch, immed | B | 1 | 1 | B |
Branch, register | BR, RET | 1 | 1 | B |
Branch and link, immed | BL | 1 | 1 | I0/I1, B |
Branch and link, register | BLR | 1 | 1 | I0/I1, B |
Compare and branch | CBZ, CBNZ, TBZ, TBNZ | 1 | 1 | B |
3.3 Arithmetic and Logical Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
ALU, basic | ADD{S}, ADC{S}, ADR, AND{S}, BIC{S}, CMN, CMP, EOR{S}, ORN{S}, ORR{S}, RSB{S}, RSC{S}, SUB{S}, SBC{S}, TEQ, TST | 1 | 2 | I0/I1 |
ALU, shift by immed | (same as above) | 2 | 1 | M |
ALU, shift by register, unconditional | (same as above) | 2 | 1 | M |
ALU, shift by register, conditional | (same as above) | 2 | 1 | I0/I1 |
ALU, branch forms | | +2 | 1 | +B | 1
Note:
1. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds 2 cycles to the latency.
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
ALU, basic | ADD{S}, ADC{S}, AND{S}, BIC{S}, EON, EOR, ORN, ORR, SUB{S}, SBC{S} | 1 | 2 | I0/I1 |
ALU, extend and/or shift | ADD{S}, AND{S}, BIC{S}, EON, EOR, ORN, ORR, SUB{S} | 2 | 1 | M |
Conditional compare | CCMN, CCMP | 1 | 2 | I0/I1 |
Conditional select | CSEL, CSINC, CSINV, CSNEG | 1 | 2 | I0/I1 |
3.4 Move and Shift Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Move, basic | MOV{S}, MOVW, MVN{S} | 1 | 2 | I0/I1 | 1
Move, shift by immed, no setflags | ASR, LSL, LSR, ROR, RRX, MVN | 1 | 2 | I0/I1 |
Move, shift by immed, setflags | ASRS, LSLS, LSRS, RORS, RRXS, MVNS | 2 | 1 | M |
Move, shift by register, no setflags, unconditional | ASR, LSL, LSR, ROR, RRX, MVN | 1 | 2 | I0/I1 |
Move, shift by register, no setflags, conditional | ASR, LSL, LSR, ROR, RRX, MVN | 2 | 1 | I0/I1 |
Move, shift by register, setflags, unconditional | ASRS, LSLS, LSRS, RORS, RRXS, MVNS | 2 | 1 | M |
Move, shift by register, setflags, conditional | ASRS, LSLS, LSRS, RORS, RRXS, MVNS | 2 | 1 | I0/I1 |
Move, top | MOVT | 1 | 2 | I |
(Move, branch forms) | | +2 | 1 | +B | 2
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Address generation | ADR, ADRP | 1 | 2 | I0/I1 | 3
Move immed | MOVN, MOVK, MOVZ | 1 | 2 | I0/I1 | 1
Variable shift | ASRV, LSLV, LSRV, RORV | 1 | 2 | I0/I1 |
Note:
1. Sequential MOVW/MOVT (AArch32) instruction pairs and certain MOVZ/MOVK, MOVK/MOVK (AArch64)
instruction pairs can be executed with one-cycle execute latency and four-instruction/cycle execution
throughput in I0/I1. See Section 4.11 for more details on the instruction pairs that can be merged.
2. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds two cycles to the latency.
3. Sequential ADRP/ADD instruction pairs can be executed with one-cycle execute latency and
four-instruction/cycle execution throughput in I0/I1. See Section 4.12 for more details on the instruction pairs
that can be merged.
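To make the MOVW/MOVT pairing in note 1 concrete, the sketch below splits a 32-bit constant into the two 16-bit immediates and emits the instructions back to back. The helper names are hypothetical and exist only for this illustration. Which pairs the processor actually merges is defined in Section 4.11, so this only shows how an adjacent pair for one destination register could be generated.

```python
def movw_movt_immediates(value):
    """Split a 32-bit constant into (MOVW low half, MOVT high half)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    return value & 0xFFFF, (value >> 16) & 0xFFFF


def format_pair(reg, value):
    """Return the MOVW/MOVT pair as adjacent assembly lines for `reg`."""
    lo, hi = movw_movt_immediates(value)
    # MOVW writes the low halfword and clears the top; MOVT then fills the top.
    return [f"MOVW {reg}, #0x{lo:04X}", f"MOVT {reg}, #0x{hi:04X}"]


pair = format_pair("r0", 0x12345678)
```

Keeping the two instructions adjacent and targeting the same register is the property a code generator would want to preserve when scheduling.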
3.5 Divide and Multiply Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Divide | SDIV, UDIV | 4-12 | 1/12 - 1/4 | M | 1
Multiply | MUL, SMULBB, SMULBT, SMULTB, SMULTT, SMULWB, SMULWT, SMMUL{R}, SMUAD{X}, SMUSD{X} | 3 | 1 | M |
Multiply accumulate | MLA, MLS, SMLABB, SMLABT, SMLATB, SMLATT, SMLAWB, SMLAWT, SMLAD{X}, SMLSD{X}, SMMLA{R}, SMMLS{R} | 3 (1) | 1 | M | 2
Multiply accumulate long | SMLAL, SMLALBB, SMLALBT, SMLALTB, SMLALTT, SMLALD{X}, SMLSLD{X}, UMAAL, UMLAL | 4 (2) | 1/2 | M | 2, 3
Multiply long | SMULL, UMULL | 4 | 1/2 | M | 3
(Multiply, setflags forms) | | +1 | (Same as above) | +I0/I1 | 4
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Divide, W-form | SDIV, UDIV | 4-12 | 1/12 - 1/4 | M | 1
Divide, X-form | SDIV, UDIV | 4-20 | 1/20 - 1/4 | M | 1
Multiply accumulate, W-form | MADD, MSUB | 3 (1) | 1 | M | 2
Multiply accumulate, X-form | MADD, MSUB | 5 (3) | 1/3 | M | 2, 5
Multiply accumulate long | SMADDL, SMSUBL, UMADDL, UMSUBL | 3 (1) | 1 | M | 2
Multiply high | SMULH, UMULH | 6 [3] | 1/4 | M | 6
Note:
1. Integer divides are performed using an iterative algorithm and block any subsequent divide operations until
complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops, allowing
a typical sequence of multiply-accumulate µops to issue one every N cycles (accumulate latency N shown
in parentheses).
3. Long-form multiplies (which produce two result registers) stall the multiplier pipeline for one extra cycle.
4. Multiplies that set the condition flags require an additional integer µop.
5. X-form multiply accumulates stall the multiplier pipeline for two extra cycles.
6. Multiply high operations stall the multiplier pipeline for N extra cycles before any other type M µop can be
issued to that pipeline, with N shown in square brackets.
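The late-forwarding behavior in note 2 can be sketched as a simple cycle model for a chain of multiply-accumulates that all feed the same accumulator. The function name is an assumption for illustration, and the model assumes the multiplicand operands are always ready and that nothing else competes for the M pipeline.

```python
def mac_chain_cycles(num_macs, exec_latency, accumulate_latency):
    """Estimated cycles for a chain of MACs through one accumulator.

    Each MAC after the first can issue `accumulate_latency` cycles after
    its predecessor; the final result appears `exec_latency` cycles after
    the last MAC issues.
    """
    if num_macs <= 0:
        return 0
    return (num_macs - 1) * accumulate_latency + exec_latency


# AArch64 MADD, W-form: Exec Latency 3, accumulate latency 1.
w_form = mac_chain_cycles(10, 3, 1)
# AArch64 MADD, X-form: Exec Latency 5, accumulate latency 3.
x_form = mac_chain_cycles(10, 5, 3)
```

Without late forwarding the same chain would cost `num_macs * exec_latency` cycles, so the accumulate latency in parentheses is the figure that matters for long reductions.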
3.6 Saturating and Parallel Arithmetic Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Parallel arith, unconditional | SADD16, SADD8, SSUB16, SSUB8, UADD16, UADD8, USUB16, USUB8 | 2 | 1 | M |
Parallel arith, conditional | SADD16, SADD8, SSUB16, SSUB8, UADD16, UADD8, USUB16, USUB8 | 2 (4) | 1/2 | M, I0/I1 | 1
Parallel arith with exchange, unconditional | SASX, SSAX, UASX, USAX | 3 | 1 | I0/I1, M |
Parallel arith with exchange, conditional | SASX, SSAX, UASX, USAX | 3 (5) | 1/2 | I0/I1, M | 1
Parallel halving arith | SHADD16, SHADD8, SHSUB16, SHSUB8, UHADD16, UHADD8, UHSUB16, UHSUB8 | 2 | 1 | M |
Parallel halving arith with exchange | SHASX, SHSAX, UHASX, UHSAX | 3 | 1 | I0/I1, M |
Parallel saturating arith | QADD16, QADD8, QSUB16, QSUB8, UQADD16, UQADD8, UQSUB16, UQSUB8 | 2 | 1 | M |
Parallel saturating arith with exchange | QASX, QSAX, UQASX, UQSAX | 3 | 1 | I0/I1, M |
Saturate | SSAT, SSAT16, USAT, USAT16 | 2 | 1 | M |
Saturating arith | QADD, QSUB | 2 | 1 | M |
Saturating doubling arith | QDADD, QDSUB | 3 | 1 | I0/I1, M |
Note:
1. Conditional GE-setting instructions require three extra µops and two additional cycles to conditionally
update the GE field (GE latency shown in parentheses).
3.7 Miscellaneous Data-Processing Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Bit field extract | SBFX, UBFX | 1 | 2 | I0/I1 |
Bit field insert/clear | BFI, BFC | 2 | 1 | M |
Count leading zeros | CLZ | 1 | 2 | I0/I1 |
Pack halfword | PKH | 2 | 1 | M |
Reverse bits/bytes | RBIT, REV, REV16, REVSH | 1 | 2 | I0/I1 |
Select bytes, unconditional | SEL | 1 | 2 | I0/I1 |
Select bytes, conditional | SEL | 2 | 1 | I0/I1 |
Sign/zero extend, normal | SXTB, SXTH, UXTB, UXTH | 1 | 2 | I0/I1 |
Sign/zero extend, parallel | SXTB16, UXTB16 | 2 | 1 | M |
Sign/zero extend and add, normal | SXTAB, SXTAH, UXTAB, UXTAH | 2 | 1 | M |
Sign/zero extend and add, parallel | SXTAB16, UXTAB16 | 4 | 1/2 | M |
Sum of absolute differences | USAD8, USADA8 | 3 | 1 | M |
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Bitfield extract, one reg | EXTR | 1 | 2 | I0/I1 |
Bitfield extract, two regs | EXTR | 3 | 1 | I0/I1, M |
Bitfield move, basic | SBFM, UBFM | 1 | 2 | I0/I1 |
Bitfield move, insert | BFM | 2 | 1 | M |
Count leading | CLS, CLZ | 1 | 2 | I0/I1 |
Reverse bits/bytes | RBIT, REV, REV16, REV32 | 1 | 2 | I0/I1 |
3.8 Load Instructions
The latencies shown in the following table assume the memory access hits in the Level 1 Data Cache.
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Load, immed offset | LDR{T}, LDRB{T}, LDRD, LDRH{T}, LDRSB{T}, LDRSH{T} | 4 | 1 | L |
Load, register offset, plus | LDR, LDRB, LDRD, LDRH, LDRSB, LDRSH | 4 | 1 | L |
Load, register offset, minus | LDR, LDRB, LDRD, LDRH, LDRSB, LDRSH | 5 | 1 | I0/I1, L |
Load, scaled register offset, plus LSL2 | LDR, LDRB | 4 | 1 | L |
Load, scaled register offset, other | LDR, LDRB, LDRH, LDRSB, LDRSH | 5 | 1 | I0/I1, L |
Load, immed pre-indexed | LDR, LDRB, LDRD, LDRH, LDRSB, LDRSH | 4 (1) | 1 | L, I0/I1 | 1
Load, register pre-indexed | LDR, LDRB, LDRH, LDRSB, LDRSH | 4 (1) | 1 | L, I0/I1 | 1
Load, register pre-indexed | LDRD | 5 (2) | 1 | I0/I1, L | 1
Load, scaled register pre-indexed, plus LSL2 | LDR, LDRB | 4 (2) | 1 | I0/I1, L | 1
Load, scaled register pre-indexed, other | LDR, LDRB | 5 (2) | 1 | I0/I1, L | 1
Load, immed post-indexed | LDR{T}, LDRB{T}, LDRD, LDRH{T}, LDRSB{T}, LDRSH{T} | 4 (1) | 1 | L, I0/I1 | 1
Load, register post-indexed | LDR, LDRB, LDRH{T}, LDRSB{T}, LDRSH{T} | 4 (1) | 1 | L, I0/I1 | 1
Load, register post-indexed | LDRD | 4 (2) | 1 | I0/I1, L | 1
Load, register post-indexed | LDRT, LDRBT | 4 (3) | 1 | I0/I1, L, M | 1
Load, scaled register post-indexed | LDR, LDRB | 4 (2) | 1 | I0/I1, L | 1
Load, scaled register post-indexed | LDRT, LDRBT | 4 (3) | 1 | I0/I1, L, M | 1
Preload, immed offset | PLD, PLDW | 4 | 1 | L |
Preload, register offset, plus | PLD, PLDW | 4 | 1 | L |
Preload, register offset, minus | PLD, PLDW | 5 | 1 | I0/I1, L |
Preload, scaled register offset, plus LSL2 | PLD, PLDW | 4 | 1 | L |
Preload, scaled register offset, other | PLD, PLDW | 5 | 1 | I0/I1, L |
Load multiple, no writeback, base reg not in list | LDMIA, LDMIB, LDMDA, LDMDB | 3 + N | 1/N | L | 2
Load multiple, no writeback, base reg in list | LDMIA, LDMIB, LDMDA, LDMDB | 4 + N | 1/N | I0/I1, L | 2
Load multiple, writeback | LDMIA, LDMIB, LDMDA, LDMDB, POP | 3 + N (1) | 1/N | L, I0/I1 | 1, 2
Load, branch forms with addressing mode as register post-indexed (scaled or unscaled) or scaled, register pre-indexed, plus, LSL2 | LDR | 4 (2) | 1 | L, M | 1
Load, branch forms with addressing mode as scaled register, pre-indexed, other | LDR | 5 (2) | 1 | I0/I1, L | 1
(Load, branch forms) | | +2 | | +B | 3
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Load register, literal | LDR, LDRSW, PRFM | 4 | 1 | L |
Load register, unscaled immed | LDUR, LDURB, LDURH, LDURSB, LDURSH, LDURSW, PRFUM | 4 | 1 | L |
Load register, immed post-index | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW | 4 (1) | 1 | L, I0/I1 | 1
Load register, immed pre-index | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW | 4 (1) | 1 | L, I0/I1 | 1
Load register, immed unprivileged | LDTR, LDTRB, LDTRH, LDTRSB, LDTRSH, LDTRSW | 4 | 1 | L |
Load register, unsigned immed | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM | 4 | 1 | L |
Load register, register offset, basic | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM | 4 | 1 | L |
Load register, register offset, scale by 4/8 | LDR, LDRSW, PRFM | 4 | 1 | L |
Load register, register offset, scale by 2 | LDRH, LDRSH | 5 | 1 | I0/I1, L |
Load register, register offset, extend | LDR, LDRB, LDRH, LDRSB, LDRSH, LDRSW, PRFM | 4 | 1 | L |
Load register, register offset, extend, scale by 4/8 | LDR, LDRSW, PRFM | 4 | 1 | L |
Load register, register offset, extend, scale by 2 | LDRH, LDRSH | 5 | 1 | I0/I1, L |
Load pair, immed offset, normal | LDP, LDNP | 4 | 1 | L |
Load pair, immed offset, signed words, base != SP | LDPSW | 5 | 1/2 | I0/I1, L |
Load pair, immed offset, signed words, base = SP | LDPSW | 5 | 1/2 | L |
Load pair, immed post-index, normal | LDP | 4 (1) | 1 | L, I0/I1 | 1
Load pair, immed post-index, signed words | LDPSW | 5 (1) | 1/2 | L, I0/I1 | 1
Load pair, immed pre-index, normal | LDP | 4 (1) | 1 | L, I0/I1 | 1
Load pair, immed pre-index, signed words | LDPSW | 5 (1) | 1/2 | L, I0/I1 | 1
Note:
1. Base register updates are typically completed in parallel with the load operation and with shorter latency
(update latency shown in parentheses).
2. For load multiple instructions, N=floor((num_regs+1)/2).
3. Branch forms are possible when the instruction destination register is the PC. For those cases, an
additional branch µop is required. This adds two cycles to the latency.
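As a worked example of note 2, the sketch below evaluates N and the resulting table entries for AArch32 load multiple instructions. The function names are hypothetical. The 3 + N and 4 + N latencies and the 1/N throughput are taken from the AArch32 table above and assume an L1 data cache hit.

```python
from fractions import Fraction


def ldm_n(num_regs):
    """N = floor((num_regs + 1) / 2) for load multiple instructions."""
    return (num_regs + 1) // 2


def ldm_latency(num_regs, base_in_list=False):
    """Exec Latency: 3 + N, or 4 + N when the base register is in the list
    (no-writeback forms)."""
    return (4 if base_in_list else 3) + ldm_n(num_regs)


def ldm_throughput(num_regs):
    """Execution Throughput in instructions per cycle: 1/N."""
    return Fraction(1, ldm_n(num_regs))


# A five-register LDMIA without writeback, base register not in the list.
n = ldm_n(5)
latency = ldm_latency(5)
throughput = ldm_throughput(5)
```

Because N rounds up, an odd register count costs the same as the next even count, which can guide how registers are grouped in prologue and epilogue code.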
3.9 Store Instructions
The following table describes performance characteristics for standard store instructions. Store µops can issue
after their address operands become available and do not need to wait for data operands. After they are executed,
stores are buffered and committed in the background.
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Store, immed offset | STR{T}, STRB{T}, STRD, STRH{T} | 1 | 1 | S |
Store, register offset, plus | STR, STRB, STRD, STRH | 1 | 1 | S |
Store, register offset, minus | STR, STRB, STRD, STRH | 3 | 1 | I0/I1, S |
Store, scaled register offset, plus LSL2 | STR, STRB | 1 | 1 | S |
Store, scaled register offset, other | STR, STRB | 3 | 1 | I0/I1, S |
Store, immed pre-indexed | STR, STRB, STRD, STRH | 1 (1) | 1 | S, I0/I1 | 1
Store, register pre-indexed, plus | STR, STRB, STRD, STRH | 1 (1) | 1 | S, I0/I1 | 1
Store, register pre-indexed, minus | STR, STRB, STRD, STRH | 3 (2) | 1 | I0/I1, S | 1
Store, scaled register pre-indexed, plus LSL2 | STR, STRB | 1 (2) | 1 | S, M | 1
Store, scaled register pre-indexed, other | STR, STRB | 3 (2) | 1 | I0/I1, S | 1
Store, immed post-indexed | STR{T}, STRB{T}, STRD, STRH{T} | 1 (1) | 1 | S, I0/I1 | 1
Store, register post-indexed | STRH{T}, STRD | 1 (1) | 1 | S, I0/I1 | 1
Store, register post-indexed | STR{T}, STRB{T} | 1 (2) | 1 | S, M | 1
Store, scaled register post-indexed | STR{T}, STRB{T} | 1 (2) | 1 | S, M | 1
Store multiple, no writeback | STMIA, STMIB, STMDA, STMDB | N | 1/N | S | 1, 2
Store multiple, writeback | STMIA, STMIB, STMDA, STMDB, PUSH | N (1) | 1/N | S, I0/I1 | 1, 2
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Store register, unscaled immed | STUR, STURB, STURH | 1 | 1 | S |
Store register, immed post-index | STR, STRB, STRH | 1 (1) | 1 | S, I0/I1 | 1
Store register, immed pre-index | STR, STRB, STRH | 1 (1) | 1 | S, I0/I1 | 1
Store register, immed unprivileged | STTR, STTRB, STTRH | 1 | 1 | S |
Store register, unsigned immed | STR, STRB, STRH | 1 | 1 | S |
Store register, register offset, basic | STR, STRB, STRH | 1 | 1 | S |
Store register, register offset, scaled by 4/8 | STR | 1 | 1 | S |
Store register, register offset, scaled by 2 | STRH | 3 | 1 | I0/I1, S |
Store register, register offset, extend | STR, STRB, STRH | 1 | 1 | S |
Store register, register offset, extend, scale by 4/8 | STR | 1 | 1 | S |
Store register, register offset, extend, scale by 1 | STRH | 3 | 1 | I0/I1, S |
Store pair, immed offset, W-form | STP, STNP | 1 | 1 | S |
Store pair, immed offset, X-form | STP, STNP | 2 | 1/2 | S |
Store pair, immed post-index, W-form | STP | 1 (1) | 1 | S, I0/I1 | 1
Store pair, immed post-index, X-form | STP | 2 (1) | 1/2 | S, I0/I1 | 1
Store pair, immed pre-index, W-form | STP | 1 (1) | 1 | S, I0/I1 | 1
Store pair, immed pre-index, X-form | STP | 2 (1) | 1/2 | S, I0/I1 | 1
Note:
1. Base register updates are typically completed in parallel with the store operation and with shorter latency
(update latency is shown in parentheses).
2. For store multiple instructions, N=floor((num_regs+1)/2).
3.10 FP Data Processing Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
FP absolute value | VABS | 3 | 2 | F0/F1 |
FP arith | VADD, VSUB | 4 | 2 | F0/F1 |
FP compare, unconditional | VCMP, VCMPE | 3 | 1 | F1 |
FP compare, conditional | VCMP, VCMPE | 6 | 1/6 | F0/F1, F1 |
FP convert | VCVT{R}, VCVTB, VCVTT, VCVTA, VCVTM, VCVTN, VCVTP | 3 | 1 | F0 |
FP round to integral | VRINTA, VRINTM, VRINTN, VRINTP, VRINTR, VRINTX, VRINTZ | 3 | 1 | F0 |
FP divide, S-form | VDIV | 6-11 | 2/9-1/2 | F0 | 1
FP divide, D-form | VDIV | 6-18 | 1/16-1/4 | F0 | 1
FP max/min | VMAXNM, VMINNM | 3 | 2 | F0/F1 |
FP multiply | VMUL, VNMUL | 4 | 2 | F0/F1 | 2
FP multiply accumulate | VFMA, VFMS, VFNMA, VFNMS, VMLA, VMLS, VNMLA, VNMLS | 7 (3) | 2 | F0/F1 | 3
FP negate | VNEG | 3 | 2 | F0/F1 |
FP select | VSELEQ, VSELGE, VSELGT, VSELVS | 3 | 2 | F0/F1 |
FP square root, S-form | VSQRT | 6-17 | 2/15-1/2 | F1 | 1
FP square root, D-form | VSQRT | 6-32 | 1/30-1/4 | F1 | 1
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
FP absolute value | FABS | 3 | 2 | F0/F1 |
FP arithmetic | FADD, FSUB | 4 | 2 | F0/F1 |
FP compare | FCCMP{E}, FCMP{E} | 3 | 1 | F1 |
FP divide, S-form | FDIV | 6-11 | 2/9-1/2 | F0 | 1
FP divide, D-form | FDIV | 6-18 | 1/16-1/4 | F0 | 1
FP min/max | FMIN, FMINNM, FMAX, FMAXNM | 3 | 2 | F0/F1 |
FP multiply | FMUL, FNMUL | 4 | 2 | F0/F1 | 2
FP multiply accumulate | FMADD, FMSUB, FNMADD, FNMSUB | 7 (3) | 2 | F0/F1 | 3
FP negate | FNEG | 3 | 2 | F0/F1 |
FP round to integral | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ | 3 | 1 | F0 |
FP select | FCSEL | 3 | 2 | F0/F1 |
FP square root, S-form | FSQRT | 6-17 | 2/15-1/2 | F1 | 1
FP square root, D-form | FSQRT | 6-32 | 1/30-1/4 | F1 | 1
Note:
1. FP divide and square root operations are performed using an iterative algorithm and block subsequent
similar operations to the same pipeline until complete.
2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply µops to the
accumulate operands of an FP multiply-accumulate µop. The latter can potentially be issued one cycle
after the FP multiply µop has been issued.
3. FP multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops,
allowing a typical sequence of multiply-accumulate µops to issue one every N cycles (accumulate latency
N is shown in parentheses).
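One consequence of note 3 is that a single accumulator chain cannot reach the peak FP multiply-accumulate throughput. The sketch below estimates how many independent accumulators would be needed. It is an illustrative model with hypothetical function names, and it assumes each chain can issue one µop every `accumulate_latency` cycles and that nothing else limits issue.

```python
import math
from fractions import Fraction


def chains_to_saturate(accumulate_latency, throughput):
    """Independent accumulator chains needed to sustain peak throughput."""
    return math.ceil(accumulate_latency * Fraction(throughput))


def sustained_rate(chains, accumulate_latency, throughput):
    """Multiply-accumulates per cycle achievable with `chains` accumulators."""
    return min(Fraction(throughput), Fraction(chains, accumulate_latency))


# FP multiply accumulate: accumulate latency 3, Execution Throughput 2.
needed = chains_to_saturate(3, 2)
single_chain_rate = sustained_rate(1, 3, 2)
```

Under these assumptions, a reduction such as a dot product benefits from being split across several partial sums that are combined at the end.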
3.11 FP Miscellaneous Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
FP move, immed | VMOV | 3 | 2 | F0/F1 |
FP move, register | VMOV | 3 | 2 | F0/F1 |
FP transfer, vfp to core reg | VMOV | 5 | 1 | L |
FP transfer, core reg to upper or lower half of vfp D-reg | VMOV | 8 | 1 | L, F0/F1 |
FP transfer, core reg to vfp | VMOV | 5 | 1 | L |
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
FP convert, from vec to vec reg | FCVT, FCVTXN | 3 | 1 | F0 |
FP convert, from gen to vec reg | SCVTF, UCVTF | 8 | 1 | L, F0 |
FP convert, from vec to gen reg | FCVTAS, FCVTAU, FCVTMS, FCVTMU, FCVTNS, FCVTNU, FCVTPS, FCVTPU, FCVTZS, FCVTZU | 8 | 1 | L, F0 |
FP move, immed | FMOV | 3 | 2 | F0/F1 |
FP move, register | FMOV | 3 | 2 | F0/F1 |
FP transfer, from gen to vec reg | FMOV | 5 | 1 | L |
FP transfer, from vec to gen reg | FMOV | 5 | 1 | L |
3.12 FP Load Instructions
The latencies shown assume the memory access hits in the Level 1 Data Cache. Compared to standard loads, an
extra cycle is required to forward results to FP/ASIMD pipelines.
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
FP load, register | VLDR | 5 | 1 | L |
FP load multiple, unconditional | VLDMIA, VLDMDB, VPOP | 4 + N | 1/N | L | 1
FP load multiple, conditional | VLDMIA, VLDMDB, VPOP | 4 + N | 1/N | L | 2
(FP load, writeback forms) | | (1) | Same as before | +I0/I1 | 3
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Load vector reg, literal | LDR | 5 | 1 | L |
Load vector reg, unscaled immed | LDUR | 5 | 1 | L |
Load vector reg, immed post-index | LDR | 5 (1) | 1 | L, I0/I1 | 3
Load vector reg, immed pre-index | LDR | 5 (1) | 1 | L, I0/I1 | 3
Load vector reg, unsigned immed | LDR | 5 | 1 | L |
Load vector reg, register offset, basic | LDR | 5 | 1 | L |
Load vector reg, register offset, scale, S/D-form | LDR | 5 | 1 | L |
Load vector reg, register offset, scale, H/Q-form | LDR | 6 | 1 | I0/I1, L |
Load vector reg, register offset, extend | LDR | 5 | 1 | L |
Load vector reg, register offset, extend, scale, S/D-form | LDR | 5 | 1 | L |
Load vector reg, register offset, extend, scale, H/Q-form | LDR | 6 | 1 | I0/I1, L |
Load vector pair, immed offset, S/D-form | LDP, LDNP | 5 | 1 | L |
Load vector pair, immed offset, Q-form | LDP, LDNP | 6 | 1/2 | L |
Load vector pair, immed post-index, S/D-form | LDP | 5 (1) | 1 | L, I0/I1 | 3
Load vector pair, immed post-index, Q-form | LDP | 6 (1) | 1/2 | L, I0/I1 | 3
Load vector pair, immed pre-index, S/D-form | LDP | 5 (1) | 1 | L, I0/I1 | 3
Load vector pair, immed pre-index, Q-form | LDP | 6 (1) | 1/2 | L, I0/I1 | 3
Note:
1. For FP load multiple instructions, N=floor((num_regs+1)/2).
2. For conditional FP load multiple instructions, N = num_regs.
3. Writeback forms of load instructions require an extra µop to update the base address. This update is
typically performed in parallel with, or prior to, the load µop (update latency is shown in parentheses).
3.13 FP Store Instructions
Store µops can issue after their address operands become available and do not need to wait for data operands.
After they are executed, stores are buffered and committed in the background.
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
FP store, immed offset | VSTR | 1 | 1 | S |
FP store multiple, S-form | VSTMIA, VSTMDB, VPUSH | N | 1/N | S | 1
FP store multiple, D-form | VSTMIA, VSTMDB, VPUSH | N | 1/N | S | 1
(FP store, writeback forms) | | (1) | Same as before | +I0/I1 | 2
Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
Store vector reg, unscaled immed, B/H/S/D-form | STUR | 1 | 1 | S |
Store vector reg, unscaled immed, Q-form | STUR | 2 | 1/2 | S |
Store vector reg, immed post-index, B/H/S/D-form | STR | 1 (1) | 1 | S, I0/I1 | 2
Store vector reg, immed post-index, Q-form | STR | 2 (1) | 1/2 | S, I0/I1 | 2
Store vector reg, immed pre-index, B/H/S/D-form | STR | 1 (1) | 1 | S, I0/I1 | 2
Store vector reg, immed pre-index, Q-form | STR | 2 (1) | 1/2 | I0/I1, S | 2
Store vector reg, unsigned immed, B/H/S/D-form | STR | 1 | 1 | S |
Store vector reg, unsigned immed, Q-form | STR | 2 | 1/2 | I0/I1, S |
Store vector reg, register offset, basic, B/H/S/D-form | STR | 1 | 1 | S |
Store vector reg, register offset, basic, Q-form | STR | 2 | 1/2 | I0/I1, S |
Store vector reg, register offset, scale, H-form | STR | 3 | 1 | I0/I1, S |
Store vector reg, register offset, scale, S/D-form | STR | 1 | 1 | S |
Store vector reg, register offset, scale, Q-form | STR | 4 | 1/2 | I0/I1, S |
Store vector reg, register offset, extend, B/H/S/D-form | STR | 1 | 1 | S |
Store vector reg, register offset, extend, Q-form | STR | 4 | 1/2 | M, S |
Store vector reg, register offset, extend, scale, H-form | STR | 3 | 1 | I0/I1, S |
Store vector reg, register offset, extend, scale, S/D-form | STR | 1 | 1 | S |
Store vector reg, register offset, extend, scale, Q-form | STR | 4 | 1/2 | I0/I1, S |
Store vector pair, immed offset, S-form | STP | 1 | 1 | S |
Store vector pair, immed offset, D-form | STP | 2 | 1/2 | S |
Store vector pair, immed offset, Q-form | STP | 4 | 1/4 | I0/I1, S |
Store vector pair, immed post-index, S-form | STP | 1 (1) | 1 | S, I0/I1 | 2
Store vector pair, immed post-index, D-form | STP | 2 (1) | 1/2 | S, I0/I1 | 2
Store vector pair, immed post-index, Q-form | STP | 4 (1) | 1/4 | S, I0/I1 | 2
Store vector pair, immed pre-index, S-form | STP | 1 (1) | 1 | S, I0/I1 | 2
Store vector pair, immed pre-index, D-form | STP | 2 (1) | 1/2 | S, I0/I1 | 2
Store vector pair, immed pre-index, Q-form | STP | 4 (1) | 1/4 | I0/I1, S | 2
Note:
1. For single-precision store multiple instructions, N=floor((num_regs+1)/2). For double-precision store
multiple instructions, N=(num_regs).
2. Writeback forms of store instructions require an extra µop to update the base address. This update is
typically performed in parallel with, or prior to, the store µop (address update latency is shown in
parentheses).
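Note 1 gives two different formulas for N depending on register width. The sketch below evaluates both, with the table's Exec Latency of N and Execution Throughput of 1/N. The function name and the `double_precision` flag are assumptions made for this illustration.

```python
from fractions import Fraction


def vstm_n(num_regs, double_precision):
    """N for FP store multiple: floor((num_regs + 1) / 2) for S registers,
    num_regs for D registers."""
    return num_regs if double_precision else (num_regs + 1) // 2


def vstm_cost(num_regs, double_precision):
    """Return (Exec Latency, Execution Throughput) from the AArch32 table."""
    n = vstm_n(num_regs, double_precision)
    return n, Fraction(1, n)


# Storing eight registers: single precision versus double precision.
s_form = vstm_cost(8, double_precision=False)
d_form = vstm_cost(8, double_precision=True)
```

Both cases work out to the same rate per 64 bits of data stored, so the formulas differ only because they count registers of different widths.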
3.14 ASIMD Integer Instructions
Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes
ASIMD absolute diff, D-form | VABD | 3 | 2 | F0/F1 |
ASIMD absolute diff, Q-form | VABD | 3 | 1 | F0/F1 |
ASIMD absolute diff accum, D-form | VABA | 4 (1) | 1 | F1 | 2
ASIMD absolute diff accum, Q-form | VABA | 5 (2) | 1/2 | F1 | 2
ASIMD absolute diff accum long | VABAL | 4 (1) | 1 | F1 | 2
ASIMD absolute diff long | VABDL | 3 | 2 | F0/F1 |
ASIMD arith, basic | VADD, VADDL, VADDW, VNEG, VPADD, VPADDL, VSUB, VSUBL, VSUBW | 3 | 2 | F0/F1 |
ASIMD arith, complex | VABS, VADDHN, VHADD, VHSUB, VQABS, VQADD, VQNEG, VQSUB, VRADDHN, VRHADD, VRSUBHN, VSUBHN | 3 | 2 | F0/F1 |
ASIMD compare | VCEQ, VCGE, VCGT, VCLE, VTST | 3 | 2 | F0/F1 |
ASIMD logical | VAND, VBIC, VMVN, VORR, VORN, VEOR | 3 | 2 | F0/F1 |
ASIMD max/min | VMAX, VMIN, VPMAX, VPMIN | 3 | 2 | F0/F1 |
ASIMD multiply, D-form | VMUL, VQDMULH, VQRDMULH | 4 | 1 | F0 |
ASIMD multiply, Q-form | VMUL, VQDMULH, VQRDMULH | 5 | 1/2 | F0 |
ASIMD multiply accumulate, D-form | VMLA, VMLS | 4 (1) | 1 | F0 | 1
ASIMD multiply accumulate, Q-form | VMLA, VMLS | 5 (2) | 1/2 | F0 | 1
ASIMD multiply accumulate long | VMLAL, VMLSL | 4 (1) | 1 | F0 | 1
ASIMD multiply accumulate saturating long | VQDMLAL, VQDMLSL | 4 (2) | 1 | F0 | 1
ASIMD multiply long | VMULL.S, VMULL.I, VMULL.P8, VQDMULL | 4 | 1 | F0 |
ASIMD pairwise add and accumulate | VPADAL | 4 (1) | 1 | F1 | 2
ASIMD shift accumulate | VSRA, VRSRA | 4 (1) | 1 | F1 | 2
ASIMD shift by immed, basic | VMOVL, VSHL, VSHLL, VSHR, VSHRN | 3 | 1 | F1 |
ASIMD shift by immed, complex | VQRSHRN, VQRSHRUN, VQSHL{U}, VQSHRN, VQSHRUN, VRSHR, VRSHRN | 4 | 1 | F1 |
ASIMD shift by immed and insert, basic, D-form | VSLI, VSRI | 3 | 1 | F1 |
ASIMD shift by immed and insert, basic, Q-form | VSLI, VSRI | 4 | 1/2 | F1 |
ASIMD shift by register, basic, D-form | VSHL | 3 | 1 | F1 |
ASIMD shift by register, basic, Q-form | VSHL | 4 | 1/2 | F1 |
ASIMD shift by register, complex, D-form | VQRSHL, VQSHL, VRSHL | 4 | 1 | F1 |
ASIMD shift by register, complex, Q-form | VQRSHL, VQSHL, VRSHL | 5 | 1/2 | F1 |
| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD absolute diff, D-form | SABD, UABD | 3 | 2 | F0/F1 | |
| ASIMD absolute diff, Q-form | SABD, UABD | 3 | 2 | F0/F1 | |
| ASIMD absolute diff accum, D-form | SABA, UABA | 4 (1) | 1 | F1 | 2 |
| ASIMD absolute diff accum, Q-form | SABA, UABA | 5 (2) | 1/2 | F1 | 2 |
| ASIMD absolute diff accum long | SABAL(2), UABAL(2) | 4 (1) | 1 | F1 | 2 |
| ASIMD absolute diff long | SABDL, UABDL | 3 | 2 | F0/F1 | |
| ASIMD arith, basic | ABS, ADD, ADDP, NEG, SADDL(2), SADDLP, SADDW(2), SHADD, SHSUB, SSUBL(2), SSUBW(2), SUB, UADDL(2), UADDLP, UADDW(2), UHADD, UHSUB, USUBW(2) | 3 | 2 | F0/F1 | |
| ASIMD arith, complex | ADDHN(2), RADDHN(2), RSUBHN(2), SQABS, SQADD, SQNEG, SQSUB, SRHADD, SUBHN(2), SUQADD, UQADD, UQSUB, URHADD, USQADD | 3 | 2 | F0/F1 | |
| ASIMD arith, reduce, 4H/4S | ADDV, SADDLV, UADDLV | 3 | 1 | F1 | |
| ASIMD arith, reduce, 8B/8H | ADDV, SADDLV, UADDLV | 6 | 1 | F1, F0/F1 | |
| ASIMD arith, reduce, 16B | ADDV, SADDLV, UADDLV | 6 | 1/2 | F1 | |
| ASIMD compare | CMEQ, CMGE, CMGT, CMHI, CMHS, CMLE, CMLT, CMTST | 3 | 2 | F0/F1 | |
| ASIMD logical | AND, BIC, EOR, MOV, MVN, ORN, ORR | 3 | 2 | F0/F1 | |
| ASIMD max/min, basic | SMAX, SMAXP, SMIN, SMINP, UMAX, UMAXP, UMIN, UMINP | 3 | 2 | F0/F1 | |
| ASIMD max/min, reduce, 4H/4S | SMAXV, SMINV, UMAXV, UMINV | 3 | 1 | F1 | |
| ASIMD max/min, reduce, 8B/8H | SMAXV, SMINV, UMAXV, UMINV | 6 | 1 | F1, F0/F1 | |
| ASIMD max/min, reduce, 16B | SMAXV, SMINV, UMAXV, UMINV | 6 | 1/2 | F1 | |
| ASIMD multiply, D-form | MUL, PMUL, SQDMULH, SQRDMULH | 4 | 1 | F0 | |
| ASIMD multiply, Q-form | MUL, PMUL, SQDMULH, SQRDMULH | 5 | 1/2 | F0 | |
| ASIMD multiply accumulate, D-form | MLA, MLS | 4 (1) | 1 | F0 | 1 |
| ASIMD multiply accumulate, Q-form | MLA, MLS | 5 (2) | 1/2 | F0 | 1 |
| ASIMD multiply accumulate long | SMLAL(2), SMLSL(2), UMLAL(2), UMLSL(2) | 4 (1) | 1 | F0 | 1 |
| ASIMD multiply accumulate saturating long | SQDMLAL(2), SQDMLSL(2) | 4 (2) | 1 | F0 | 1 |
| ASIMD multiply long | SMULL(2), UMULL(2), SQDMULL(2) | 4 | 1 | F0 | |
| ASIMD polynomial (8x8) multiply long | PMULL.8B, PMULL2.16B | 4 | 1 | F0 | 3 |
| ASIMD pairwise add and accumulate | SADALP, UADALP | 4 (1) | 1 | F1 | 2 |
| ASIMD shift accumulate | SRA, SRSRA, USRA, URSRA | 4 (1) | 1 | F1 | 2 |
| ASIMD shift by immed, basic | SHL, SHLL(2), SHRN(2), SLI, SRI, SSHLL(2), SSHR, SXTL(2), USHLL(2), USHR, UXTL(2) | 3 | 1 | F1 | |
| ASIMD shift by immed and insert, basic, D-form | SLI, SRI | 3 | 1 | F1 | |
| ASIMD shift by immed and insert, basic, Q-form | SLI, SRI | 4 | 1/2 | F1 | |
| ASIMD shift by immed, complex | RSHRN(2), SRSHR, SQSHL{U}, SQRSHRN(2), SQRSHRUN(2), SQSHRN(2), SQSHRUN(2), URSHR, UQSHL, UQRSHRN(2), UQSHRN(2) | 4 | 1 | F1 | |
| ASIMD shift by register, basic, D-form | SSHL, USHL | 3 | 1 | F1 | |
| ASIMD shift by register, basic, Q-form | SSHL, USHL | 4 | 1/2 | F1 | |
| ASIMD shift by register, complex, D-form | SRSHL, SQRSHL, SQSHL, URSHL, UQRSHL, UQSHL | 4 | 1 | F1 | |
| ASIMD shift by register, complex, Q-form | SRSHL, SQRSHL, SQSHL, URSHL, UQRSHL, UQSHL | 5 | 1/2 | F1 | |
Note:
1. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops, allowing
a typical sequence of integer multiply-accumulate µops to issue one every cycle or one every other cycle
(accumulate latency is shown in parentheses).
2. Other accumulate pipelines also support late-forwarding of accumulate operands from similar µops,
allowing a typical sequence of such µops to issue one every cycle (accumulate latency is shown in
parentheses).
3. This category includes instructions of the form “PMULL Vd.8H, Vn.8B, Vm.8B” and “PMULL2 Vd.8H,
Vn.16B, Vm.16B”
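The effect of late-forwarding described in notes 1 and 2 can be illustrated with a simplified latency model. The sketch below is an assumption-laden approximation, not a description of the actual pipeline: it assumes a chain of similar µops that depend on each other only through the accumulator, ignores issue-queue and throughput limits, and uses a hypothetical function name.

```python
def accumulate_chain_latency(n_ops, exec_latency, acc_latency):
    """Estimate cycles until the last result of a dependent accumulate chain.

    Simplified model: each µop issues acc_latency cycles after the one it
    accumulates onto, and the final µop needs exec_latency cycles to finish.
    """
    if n_ops < 1:
        raise ValueError("n_ops must be at least 1")
    return exec_latency + (n_ops - 1) * acc_latency
```

Under this model, eight back-to-back D-form MLA µops (latency 4, accumulate latency 1) complete in about 4 + 7 x 1 = 11 cycles, compared with 8 x 4 = 32 cycles if each had to wait for the full execution latency of its predecessor.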
3.15 ASIMD Floating-Point Instructions
| Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD FP absolute value | VABS | 3 | 2 | F0/F1 | |
| ASIMD FP arith, D-form | VABD, VADD, VPADD, VSUB | 4 | 2 | F0/F1 | |
| ASIMD FP arith, Q-form | VABD, VADD, VSUB | 4 | 1 | F0/F1 | |
| ASIMD FP compare | VACGE, VACGT, VACLE, VACLT, VCEQ, VCGE, VCGT, VCLE | 3 | 2 | F0/F1 | |
| ASIMD FP convert, integer, D-form | VCVT, VCVTA, VCVTM, VCVTN, VCVTP | 3 | 1 | F0 | |
| ASIMD FP convert, integer, Q-form | VCVT, VCVTA, VCVTM, VCVTN, VCVTP | 4 | 1/2 | F0 | |
| ASIMD FP convert, fixed, D-form | VCVT | 3 | 1 | F0 | |
| ASIMD FP convert, fixed, Q-form | VCVT | 4 | 1/2 | F0 | |
| ASIMD FP convert, half-precision | VCVT | 7 | 1/2 | F0, F1 | |
| ASIMD FP max/min, D-form | VMAX, VMIN, VPMAX, VPMIN, VMAXNM, VMINNM | 3 | 2 | F0/F1 | |
| ASIMD FP max/min, Q-form | VMAX, VMIN, VMAXNM, VMINNM | 3 | 1 | F0/F1 | |
| ASIMD FP multiply, D-form | VMUL | 4 | 2 | F0/F1 | 2 |
| ASIMD FP multiply, Q-form | VMUL | 4 | 1 | F0/F1 | 2 |
| ASIMD FP multiply accumulate, D-form | VMLA, VMLS, VFMA, VFMS | 7 (3) | 2 | F0/F1 | 1 |
| ASIMD FP multiply accumulate, Q-form | VMLA, VMLS, VFMA, VFMS | 7 (3) | 1 | F0/F1 | 1 |
| ASIMD FP negate | VNEG | 3 | 2 | F0/F1 | |
| ASIMD FP round to integral, D-form | VRINTA, VRINTM, VRINTN, VRINTP, VRINTX, VRINTZ | 3 | 1 | F0 | |
| ASIMD FP round to integral, Q-form | VRINTA, VRINTM, VRINTN, VRINTP, VRINTX, VRINTZ | 4 | 1/2 | F0 | |
| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD FP absolute value | FABS | 3 | 2 | F0/F1 | |
| ASIMD FP arith, normal, D-form | FABD, FADD, FSUB | 4 | 2 | F0/F1 | |
| ASIMD FP arith, normal, Q-form | FABD, FADD, FSUB | 4 | 1 | F0/F1 | |
| ASIMD FP arith, pairwise, D-form | FADDP | 4 | 2 | F0/F1 | |
| ASIMD FP arith, pairwise, Q-form | FADDP | 7 | 2/3 | F0/F1 | |
| ASIMD FP compare | FACGE, FACGT, FCMEQ, FCMGE, FCMGT, FCMLE, FCMLT | 3 | 2 | F0/F1 | |
| ASIMD FP convert, long (F16 to F32) | FCVTL(2) | 7 | 1/2 | F0, F0/F1 | |
| ASIMD FP convert, long (F32 to F64) | FCVTL(2) | 3 | 1 | F0 | |
| ASIMD FP convert, narrow (F32 to F16) | FCVTN(2), FCVTXN(2) | 7 | 1/2 | F0, F0/F1 | |
| ASIMD FP convert, narrow (F64 to F32) | FCVTN(2), FCVTXN(2) | 3 | 1 | F0 | |
| ASIMD FP convert, other, D-form F32 and Q-form F64 | FCVTAS, FCVTAU, FCVTMS, FCVTMU, FCVTNS, FCVTNU, FCVTPS, FCVTPU, FCVTZS, FCVTZU, SCVTF, UCVTF | 3 | 1 | F0 | |
| ASIMD FP convert, other, Q-form F32 | FCVTAS, FCVTAU, FCVTMS, FCVTMU, FCVTNS, FCVTNU, FCVTPS, FCVTPU, FCVTZS, FCVTZU, SCVTF, UCVTF | 4 | 1/2 | F0 | |
| ASIMD FP divide, D-form, F32 | FDIV | 6-11 | 1/9-1/4 | F0 | 3 |
| ASIMD FP divide, Q-form, F32 | FDIV | 12-22 | 1/18-1/10 | F0 | 3 |
| ASIMD FP divide, Q-form, F64 | FDIV | 12-36 | 1/32-1/10 | F0 | 3 |
| ASIMD FP max/min, normal | FMAX, FMAXNM, FMIN, FMINNM | 3 | 2 | F0/F1 | |
| ASIMD FP max/min, pairwise | FMAXP, FMAXNMP, FMINP, FMINNMP | 3 | 2 | F0/F1 | |
| ASIMD FP max/min, reduce | FMAXV, FMAXNMV, FMINV, FMINNMV | 6 | 1 | F0/F1 | |
| ASIMD FP multiply, D-form | FMUL, FMULX | 4 | 2 | F0/F1 | 2 |
| ASIMD FP multiply, Q-form | FMUL, FMULX | 4 | 1 | F0/F1 | 2 |
| ASIMD FP multiply accumulate, D-form | FMLA, FMLS | 7 (3) | 2 | F0/F1 | 1 |
| ASIMD FP multiply accumulate, Q-form | FMLA, FMLS | 7 (3) | 1 | F0/F1 | 1 |
| ASIMD FP negate | FNEG | 3 | 2 | F0/F1 | |
| ASIMD FP round, D-form | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ | 3 | 1 | F0 | |
| ASIMD FP round, Q-form | FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ | 4 | 1/2 | F0 | |
Note:
1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µops,
allowing a typical sequence of floating-point multiply-accumulate µops to issue one every N cycles
(accumulate latency N is shown in parentheses).
2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP multiply µops to
the accumulate operands of an ASIMD FP multiply-accumulate µop. The latter can potentially be issued
one cycle after the ASIMD FP multiply µop has been issued.
3. ASIMD divide operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
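Note 1 implies that a single accumulator chain cannot keep the floating-point multiply-accumulate pipelines busy on its own. A common rule of thumb (an assumption applied here, not a statement from this guide) is that the number of independent accumulators needed is the accumulate latency multiplied by the throughput. The sketch below applies that rule; the function name is hypothetical.

```python
import math
from fractions import Fraction


def accumulators_needed(acc_latency, throughput):
    """Independent accumulator chains needed to sustain `throughput` µops/cycle.

    Rule of thumb: each chain can accept one µop every acc_latency cycles,
    so acc_latency * throughput chains are needed, rounded up.
    """
    if acc_latency < 1 or throughput <= 0:
        raise ValueError("acc_latency and throughput must be positive")
    return math.ceil(acc_latency * Fraction(throughput))
```

With the table values for FMLA (accumulate latency 3), this suggests about six independent accumulators for D-form code (throughput 2) and about three for Q-form code (throughput 1).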
3.16 ASIMD Miscellaneous Instructions
| Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD bitwise insert, D-form | VBIF, VBIT, VBSL | 3 | 2 | F0/F1 | |
| ASIMD bitwise insert, Q-form | VBIF, VBIT, VBSL | 3 | 1 | F0/F1 | |
| ASIMD count, D-form | VCLS, VCLZ, VCNT | 3 | 2 | F0/F1 | |
| ASIMD count, Q-form | VCLS, VCLZ, VCNT | 3 | 1 | F0/F1 | |
| ASIMD duplicate, core reg | VDUP | 8 | 1 | L, F0/F1 | |
| ASIMD duplicate, scalar | VDUP | 3 | 2 | F0/F1 | |
| ASIMD extract | VEXT | 3 | 2 | F0/F1 | |
| ASIMD move, immed | VMOV | 3 | 2 | F0/F1 | |
| ASIMD move, register | VMOV | 3 | 2 | F0/F1 | |
| ASIMD move, narrowing | VMOVN | 3 | 2 | F0/F1 | |
| ASIMD move, saturating | VQMOVN, VQMOVUN | 4 | 1 | F1 | |
| ASIMD reciprocal estimate, D-form | VRECPE, VRSQRTE | 3 | 1 | F0 | |
| ASIMD reciprocal estimate, Q-form | VRECPE, VRSQRTE | 4 | 1/2 | F0 | |
| ASIMD reciprocal step, D-form | VRECPS, VRSQRTS | 7 | 2 | F0/F1 | |
| ASIMD reciprocal step, Q-form | VRECPS, VRSQRTS | 7 | 1 | F0/F1 | |
| ASIMD reverse | VREV16, VREV32, VREV64 | 3 | 2 | F0/F1 | |
| ASIMD swap, D-form | VSWP | 3 | 2 | F0/F1 | |
| ASIMD swap, Q-form | VSWP | 3 | 1 | F0/F1 | |
| ASIMD table lookup, 1 reg | VTBL, VTBX | 3 | 2 | F0/F1 | |
| ASIMD table lookup, 2 reg | VTBL, VTBX | 3 | 2 | F0/F1 | |
| ASIMD table lookup, 3 reg | VTBL, VTBX | 6 | 2 | F0/F1 | |
| ASIMD table lookup, 4 reg | VTBL, VTBX | 6 | 2 | F0/F1 | |
| ASIMD transfer, scalar to core reg, word | VMOV | 5 | 1 | L | |
| ASIMD transfer, scalar to core reg, byte/hword | VMOV | 6 | 1 | L, I0/I1 | |
| ASIMD transfer, core reg to scalar | VMOV | 8 | 1 | L, F0/F1 | |
| ASIMD transpose, D-form | VTRN | 3 | 2 | F0/F1 | |
| ASIMD transpose, Q-form | VTRN | 3 | 1 | F0/F1 | |
| ASIMD unzip/zip, D-form | VUZP, VZIP | 3 | 2 | F0/F1 | |
| ASIMD unzip/zip, Q-form | VUZP, VZIP | 6 | 2/3 | F0/F1 | |
| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD bit reverse | RBIT | 3 | 2 | F0/F1 | |
| ASIMD bitwise insert, D-form | BIF, BIT, BSL | 3 | 2 | F0/F1 | |
| ASIMD bitwise insert, Q-form | BIF, BIT, BSL | 3 | 1 | F0/F1 | |
| ASIMD count, D-form | CLS, CLZ, CNT | 3 | 2 | F0/F1 | |
| ASIMD count, Q-form | CLS, CLZ, CNT | 3 | 1 | F0/F1 | |
| ASIMD duplicate, gen reg | DUP | 8 | 1 | L, F0/F1 | |
| ASIMD duplicate, element | DUP | 3 | 2 | F0/F1 | |
| ASIMD extract | EXT | 3 | 2 | F0/F1 | |
| ASIMD extract narrow | XTN | 3 | 2 | F0/F1 | |
| ASIMD extract narrow, saturating | SQXTN(2), SQXTUN(2), UQXTN(2) | 4 | 1 | F1 | |
| ASIMD insert, element to element | INS | 3 | 2 | F0/F1 | |
| ASIMD move, integer immed | MOVI | 3 | 2 | F0/F1 | |
| ASIMD move, FP immed | FMOV | 3 | 2 | F0/F1 | |
| ASIMD reciprocal estimate, D-form | FRECPE, FRECPX, FRSQRTE, URECPE, URSQRTE | 3 | 1 | F0 | |
| ASIMD reciprocal estimate, Q-form | FRECPE, FRECPX, FRSQRTE, URECPE, URSQRTE | 4 | 1/2 | F0 | |
| ASIMD reciprocal step, D-form | FRECPS, FRSQRTS | 7 | 2 | F0/F1 | |
| ASIMD reciprocal step, Q-form | FRECPS, FRSQRTS | 7 | 1 | F0/F1 | |
| ASIMD reverse | REV16, REV32, REV64 | 3 | 2 | F0/F1 | |
| ASIMD table lookup, D-form | TBL, TBX | 3xN | | F0/F1 | 1 |
| ASIMD table lookup, Q-form | TBL, TBX | 3xN + 3 | | F0/F1 | 1 |
| ASIMD transfer, element to gen reg, word or dword | UMOV | 5 | 1 | L | |
| ASIMD transfer, element to gen reg, others | SMOV, UMOV | 6 | 1 | L, I0/I1 | |
| ASIMD transfer, gen reg to element | INS | 8 | 1 | L, F0/F1 | |
| ASIMD transpose, D-form | TRN1, TRN2 | 3 | 2 | F0/F1 | |
| ASIMD unzip/zip, D-form | UZP1, UZP2, ZIP1, ZIP2 | 3 | 2 | F0/F1 | |
Note:
1. For table lookups (TBL and TBX), N denotes the number of registers in the table.
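As a worked illustration of note 1, the following sketch evaluates the TBL/TBX latency expressions for a given table size. It reads the Q-form entry of the table as "3xN + 3", which is an interpretation of the source layout; the function name is hypothetical.

```python
def tbl_latency(n_table_regs, q_form):
    """Execution latency of AArch64 TBL/TBX for a table of n_table_regs registers."""
    if not 1 <= n_table_regs <= 4:
        raise ValueError("TBL/TBX tables hold between 1 and 4 registers")
    latency = 3 * n_table_regs  # D-form: 3 x N
    if q_form:
        latency += 3            # Q-form: 3 x N + 3 (as read from the table)
    return latency
```

For example, a two-register D-form lookup evaluates to 6 cycles and a four-register Q-form lookup to 15 cycles.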
3.17 ASIMD Load Instructions
The latencies shown assume the memory access hits in the Level 1 Data Cache. Compared to standard loads, an
extra cycle is required to forward results to FP/ASIMD pipelines.
| Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD load, 1 element, multiple, 1 reg | VLD1 | 5 | 1 | L | |
| ASIMD load, 1 element, multiple, 2 reg | VLD1 | 5 | 1 | L | |
| ASIMD load, 1 element, multiple, 3 reg | VLD1 | 6 | 1/2 | L | |
| ASIMD load, 1 element, multiple, 4 reg | VLD1 | 6 | 1/2 | L | |
| ASIMD load, 1 element, one lane | VLD1 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 1 element, all lanes | VLD1 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, multiple, 2 reg | VLD2 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, multiple, 4 reg | VLD2 | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 2 element, one lane, size 32 | VLD2 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, one lane, size 8/16 | VLD2 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, all lanes | VLD2 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 3 element, multiple, 3 reg | VLD3 | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 3 element, one lane, size 32 | VLD3 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 3 element, one lane, size 8/16 | VLD3 | 9 | 2/3 | L, F0/F1 | |
| ASIMD load, 3 element, all lanes | VLD3 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 4 element, multiple, 4 reg | VLD4 | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 4 element, one lane, size 32 | VLD4 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 4 element, one lane, size 8/16 | VLD4 | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 4 element, all lanes | VLD4 | 8 | 1 | L, F0/F1 | |
| (ASIMD load, writeback form) | | (1) | Same as before | +I0/I1 | 1 |
| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD load, 1 element, multiple, 1 reg, D-form | LD1 | 5 | 1 | L | |
| ASIMD load, 1 element, multiple, 1 reg, Q-form | LD1 | 5 | 1 | L | |
| ASIMD load, 1 element, multiple, 2 reg, D-form | LD1 | 5 | 1 | L | |
| ASIMD load, 1 element, multiple, 2 reg, Q-form | LD1 | 6 | 1/2 | L | |
| ASIMD load, 1 element, multiple, 3 reg, D-form | LD1 | 6 | 1/2 | L | |
| ASIMD load, 1 element, multiple, 3 reg, Q-form | LD1 | 7 | 1/3 | L | |
| ASIMD load, 1 element, multiple, 4 reg, D-form | LD1 | 6 | 1/2 | L | |
| ASIMD load, 1 element, multiple, 4 reg, Q-form | LD1 | 8 | 1/4 | L | |
| ASIMD load, 1 element, one lane, B/H/S | LD1 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 1 element, one lane, D | LD1 | 5 | 1 | L | |
| ASIMD load, 1 element, all lanes, D-form, B/H/S | LD1R | 8 | 1 | L, F0/F1 | |
| ASIMD load, 1 element, all lanes, D-form, D | LD1R | 5 | 1 | L | |
| ASIMD load, 1 element, all lanes, Q-form | LD1R | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, multiple, D-form, B/H/S | LD2 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, multiple, Q-form, B/H/S | LD2 | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 2 element, multiple, Q-form, D | LD2 | 6 | 1/2 | L | |
| ASIMD load, 2 element, one lane, B/H | LD2 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, one lane, S | LD2 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, one lane, D | LD2 | 6 | 1 | L | |
| ASIMD load, 2 element, all lanes, D-form, B/H/S | LD2R | 8 | 1 | L, F0/F1 | |
| ASIMD load, 2 element, all lanes, D-form, D | LD2R | 5 | 1 | L | |
| ASIMD load, 2 element, all lanes, Q-form | LD2R | 8 | 1 | L, F0/F1 | |
| ASIMD load, 3 element, multiple, D-form, B/H/S | LD3 | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 3 element, multiple, Q-form, B/H/S | LD3 | 10 | 1/3 | L, F0/F1 | |
| ASIMD load, 3 element, multiple, Q-form, D | LD3 | 8 | 1/4 | L | |
| ASIMD load, 3 element, one lane, B/H | LD3 | 9 | 2/3 | L, F0/F1 | |
| ASIMD load, 3 element, one lane, S | LD3 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 3 element, one lane, D | LD3 | 6 | 1/2 | L | |
| ASIMD load, 3 element, all lanes, D-form, B/H/S | LD3R | 8 | 1 | L, F0/F1 | |
| ASIMD load, 3 element, all lanes, D-form, D | LD3R | 6 | 1/2 | L | |
| ASIMD load, 3 element, all lanes, Q-form, B/H/S | LD3R | 9 | 2/3 | L, F0/F1 | |
| ASIMD load, 3 element, all lanes, Q-form, D | LD3R | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 4 element, multiple, D-form, B/H/S | LD4 | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 4 element, multiple, Q-form, B/H/S | LD4 | 11 | 1/4 | L, F0/F1 | |
| ASIMD load, 4 element, multiple, Q-form, D | LD4 | 8 | 1/4 | L | |
| ASIMD load, 4 element, one lane, B/H | LD4 | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 4 element, one lane, S | LD4 | 8 | 1 | L, F0/F1 | |
| ASIMD load, 4 element, one lane, D | LD4 | 6 | 1/2 | L | |
| ASIMD load, 4 element, all lanes, D-form, B/H/S | LD4R | 8 | 1 | L, F0/F1 | |
| ASIMD load, 4 element, all lanes, D-form, D | LD4R | 6 | 1 | L | |
| ASIMD load, 4 element, all lanes, Q-form, B/H/S | LD4R | 9 | 1/2 | L, F0/F1 | |
| ASIMD load, 4 element, all lanes, Q-form, D | LD4R | 9 | 2/5 | L, F0/F1 | |
| (ASIMD load, writeback form) | | (1) | Same as before | +I0/I1 | 1 |
Note:
1. Writeback forms of load instructions require an extra µop to update the base address. This update is
typically performed in parallel with the load µop (update latency is shown in parentheses).
3.18 ASIMD Store Instructions
Store µops can issue after their address operands are available and do not need to wait for data operands. After
they are executed, stores are buffered and committed in the background.
| Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD store, 1 element, multiple, 1 reg | VST1 | 1 | 1 | S | |
| ASIMD store, 1 element, multiple, 2 reg | VST1 | 2 | 1/2 | S | |
| ASIMD store, 1 element, multiple, 3 reg | VST1 | 3 | 1/3 | S | |
| ASIMD store, 1 element, multiple, 4 reg | VST1 | 4 | 1/4 | S | |
| ASIMD store, 1 element, one lane | VST1 | 3 | 1 | F0/F1, S | |
| ASIMD store, 2 element, multiple, 2 reg | VST2 | 3 | 1/2 | F0/F1, S | |
| ASIMD store, 2 element, multiple, 4 reg | VST2 | 4 | 1/4 | F0/F1, S | |
| ASIMD store, 2 element, one lane | VST2 | 3 | 1 | F0/F1, S | |
| ASIMD store, 3 element, multiple, 3 reg | VST3 | 3 | 1/3 | F0/F1, S | |
| ASIMD store, 3 element, one lane, size 32 | VST3 | 3 | 1/2 | F0/F1, S | |
| ASIMD store, 3 element, one lane, size 8/16 | VST3 | 3 | 1 | F0/F1, S | |
| ASIMD store, 4 element, multiple, 4 reg | VST4 | 4 | 1/4 | F0/F1, S | |
| ASIMD store, 4 element, one lane, size 32 | VST4 | 3 | 1/2 | F0/F1, S | |
| ASIMD store, 4 element, one lane, size 8/16 | VST4 | 3 | 1 | F0/F1, S | |
| (ASIMD store, writeback form) | | +1 | | +I0/I1 | 1 |
| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| ASIMD store, 1 element, multiple, 1 reg, D-form | ST1 | 1 | 1 | S | |
| ASIMD store, 1 element, multiple, 1 reg, Q-form | ST1 | 2 | 1/2 | S | |
| ASIMD store, 1 element, multiple, 2 reg, D-form | ST1 | 2 | 1/2 | S | |
| ASIMD store, 1 element, multiple, 2 reg, Q-form | ST1 | 4 | 1/4 | S | |
| ASIMD store, 1 element, multiple, 3 reg, D-form | ST1 | 3 | 1/3 | S | |
| ASIMD store, 1 element, multiple, 3 reg, Q-form | ST1 | 6 | 1/6 | S | |
| ASIMD store, 1 element, multiple, 4 reg, D-form | ST1 | 4 | 1/4 | S | |
| ASIMD store, 1 element, multiple, 4 reg, Q-form | ST1 | 8 | 1/8 | S | |
| ASIMD store, 1 element, one lane, B/H/S | ST1 | 3 | 1 | F0/F1, S | |
| ASIMD store, 1 element, one lane, D | ST1 | 1 | 1 | S | |
| ASIMD store, 2 element, multiple, D-form, B/H/S | ST2 | 3 | 1/2 | F0/F1, S | |
| ASIMD store, 2 element, multiple, Q-form, B/H/S | ST2 | 4 | 1/4 | F0/F1, S | |
| ASIMD store, 2 element, multiple, Q-form, D | ST2 | 4 | 1/4 | S | |
| ASIMD store, 2 element, one lane, B/H/S | ST2 | 3 | 1 | F0/F1, S | |
| ASIMD store, 2 element, one lane, D | ST2 | 2 | 1/2 | S | |
| ASIMD store, 3 element, multiple, D-form, B/H/S | ST3 | 3 | 1/3 | F0/F1, S | |
| ASIMD store, 3 element, multiple, Q-form, B/H/S | ST3 | 6 | 1/6 | F0/F1, S | |
| ASIMD store, 3 element, multiple, Q-form, D | ST3 | 6 | 1/6 | S | |
| ASIMD store, 3 element, one lane, B/H | ST3 | 3 | 1 | F0/F1, S | |
| ASIMD store, 3 element, one lane, S | ST3 | 3 | 1/2 | F0/F1, S | |
| ASIMD store, 3 element, one lane, D | ST3 | 3 | 1/3 | S | |
| ASIMD store, 4 element, multiple, D-form, B/H/S | ST4 | 4 | 1/4 | F0/F1, S | |
| ASIMD store, 4 element, multiple, Q-form, B/H/S | ST4 | 8 | 1/8 | F0/F1, S | |
| ASIMD store, 4 element, multiple, Q-form, D | ST4 | 8 | 1/8 | S | |
| ASIMD store, 4 element, one lane, B/H | ST4 | 3 | 1 | F0/F1, S | |
| ASIMD store, 4 element, one lane, S | ST4 | 3 | 1/2 | F0/F1, S | |
| ASIMD store, 4 element, one lane, D | ST4 | 4 | 1/4 | S | |
| (ASIMD store, writeback form) | | (1) | Same as before | +I0/I1 | 1 |
Note:
1. Writeback forms of store instructions require an extra µop to update the base address. This update is
typically performed in parallel with the store µop (update latency is shown in parentheses).
3.19 Cryptography Extensions
| Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| Crypto AES ops | AESD, AESE, AESIMC, AESMC | 3 | 1 | F0 | 1, 2 |
| Crypto polynomial (64x64) multiply long | VMULL.P64 | 3 | 1 | F0 | 2 |
| Crypto SHA1 xor ops | SHA1SU0 | 6 | 2 | F0/F1 | |
| Crypto SHA1 fast ops | SHA1H, SHA1SU1 | 3 | 1 | F0 | 2 |
| Crypto SHA1 slow ops | SHA1C, SHA1M, SHA1P | 6 | 1/2 | F0 | 2 |
| Crypto SHA256 fast ops | SHA256SU0 | 3 | 1 | F0 | 2 |
| Crypto SHA256 slow ops | SHA256H, SHA256H2, SHA256SU1 | 6 | 1/2 | F0 | 2 |
| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| Crypto AES ops | AESD, AESE, AESIMC, AESMC | 3 | 1 | F0 | 1, 2 |
| Crypto polynomial (64x64) multiply long | PMULL(2) | 3 | 1 | F0 | 2 |
| Crypto SHA1 xor ops | SHA1SU0 | 6 | 2 | F0/F1 | |
| Crypto SHA1 schedule acceleration ops | SHA1H, SHA1SU1 | 3 | 1 | F0 | 2 |
| Crypto SHA1 hash acceleration ops | SHA1C, SHA1M, SHA1P | 6 | 1/2 | F0 | 2 |
| Crypto SHA256 schedule acceleration op (1 µop) | SHA256SU0 | 3 | 1 | F0 | 2 |
| Crypto SHA256 schedule acceleration op (2 µops) | SHA256SU1 | 6 | 1/2 | F0 | 2 |
| Crypto SHA256 hash acceleration ops | SHA256H, SHA256H2 | 6 | 1/2 | F0 | 2 |
Note:
1. Adjacent AESE/AESMC instruction pairs and adjacent AESD/AESIMC instruction pairs will exhibit the
described performance characteristics. See Section 4.10 for additional details.
2. Crypto execution supports late forwarding of the result from a producer µop to a consumer µop. This results
in a one cycle reduction in latency as seen by the consumer.
3.20 CRC
| Instruction Group | AArch32 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| CRC checksum ops | CRC32, CRC32C | 2 | 1 | M | 1 |

| Instruction Group | AArch64 Instructions | Exec Latency | Execution Throughput | Utilized Pipelines | Notes |
| --- | --- | --- | --- | --- | --- |
| CRC checksum ops | CRC32, CRC32C | 2 | 1 | M | 1 |
Note:
1. CRC execution supports late forwarding of the result from a producer CRC µop to a consumer CRC µop. This
results in a one cycle reduction in latency as seen by the consumer.
4 SPECIAL CONSIDERATIONS
4.1 Dispatch Constraints
Dispatch of µops from the in-order portion to the out-of-order portion of the microarchitecture includes a number of
constraints. It is important to consider these constraints during code generation in order to maximize the effective
dispatch bandwidth and subsequent execution bandwidth of the Cortex®-A72 processor.
The dispatch stage can process up to three µops per cycle, with the following limitations on the number of µops of
each type that can be simultaneously dispatched.
• One µop using the B pipeline
• Up to two µops using the I pipelines
• Up to two µops using the M pipeline
• One µop using the F0 pipeline
• One µop using the F1 pipeline
• Up to two µops using the L or S pipeline
If there are more µops available to be dispatched in a given cycle than can be supported by the constraints above,
µops will be dispatched in oldest-to-youngest age order to the extent allowed by the above.
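The limits above can be explored with a small model that groups a µop stream into dispatch cycles. This is a minimal sketch under stated assumptions: each µop is pre-labelled with one pipeline class (the labels `"B"`, `"I"`, `"M"`, `"F0"`, `"F1"` and `"LS"` are this sketch's own, with `"LS"` standing for the combined L-or-S limit), and dispatch is assumed to stop for the cycle at the first µop that would exceed a limit. The real dispatch logic may behave differently.

```python
from collections import Counter

# Per-cycle limits taken from the list above ("LS" = L or S pipeline).
LIMITS = {"B": 1, "I": 2, "M": 2, "F0": 1, "F1": 1, "LS": 2}
MAX_UOPS_PER_CYCLE = 3


def dispatch_cycles(uops):
    """Group µops (oldest first) into per-cycle dispatch groups."""
    cycles = []
    current = []
    used = Counter()
    for kind in uops:
        if kind not in LIMITS:
            raise ValueError("unknown pipeline class: %r" % (kind,))
        # Assumption: in-order dispatch, so a blocked µop ends the cycle.
        if len(current) == MAX_UOPS_PER_CYCLE or used[kind] == LIMITS[kind]:
            cycles.append(current)
            current = []
            used = Counter()
        current.append(kind)
        used[kind] += 1
    if current:
        cycles.append(current)
    return cycles
```

For instance, `dispatch_cycles(["F0", "F0", "I"])` splits into two cycles under this model because only one F0 µop can dispatch per cycle, which suggests alternating F0 and F1 work where the instruction mix allows it.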
4.2 Conditional Execution
The ARMv8 architecture allows many types of A32 instructions to be conditionally executed based upon condition
flags (N, Z, C, V). If the condition flags satisfy a condition specified in the instruction encoding, an instruction has
its normal effect. If the flags do not satisfy this condition, the instruction acts as a NOP.
This leads to conditional register writes for most types of conditional instructions. In an out-of-order processor
such as the Cortex®-A72 processor, this has two side-effects:
• The first side-effect is that the conditional instruction requires the old value of its destination register as an
input operand.
• The second side-effect is that all subsequent consumers of the conditional instruction destination register
depend upon this operation, regardless of the state of the condition flags (that is, even if the destination
register is unchanged in the event the condition is not met).
These effects should be taken into account when considering whether to use conditional execution for long-
latency operations. The overheads of conditional execution might begin to outweigh the benefits. Consider the
following example.
MULEQ R1, R2, R3
MULNE R1, R2, R4
For this pair of instructions, the second multiply is dependent upon the result of the first multiply, not through one
of its normal input operands (R2 and R4), but through the destination register R1. The combined latency for these
instructions is six cycles, rather than the four cycles that would be required if these instructions were not
conditional (three cycles latency for the first, and one additional cycle for the second which is fully pipelined behind
the first). So if the condition is easily predictable (by the branch predictor), conditional execution can lead to a
performance loss. But if the condition is not easily predictable, conditional execution can lead to a performance
gain because the latency of a branch mispredict is generally higher than the execution latency of conditional
instructions. In general, ARM recommends that conditional instruction forms be considered only for integer
instructions with latency less than or equal to two cycles, loads, and stores.
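The latency arithmetic in the MULEQ/MULNE example can be restated as a short calculation. This is only a sketch of the reasoning in the text; the function name and its `serialized` parameter are assumptions introduced for illustration.

```python
def pair_latency(exec_latency, serialized):
    """Combined latency of two back-to-back instructions of equal latency.

    serialized=True models the conditional case, where the second instruction
    waits for the first through the shared destination register.
    serialized=False models the unconditional case, where the second
    instruction is pipelined one cycle behind the first.
    """
    if serialized:
        return 2 * exec_latency
    return exec_latency + 1
```

With a multiply latency of three cycles, the conditional pair takes 3 + 3 = 6 cycles and the unconditional pair takes 3 + 1 = 4 cycles, matching the figures quoted above.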
4.3 Conditional ASIMD
Conditional execution is architecturally possible for ASIMD instructions in Thumb state using IT blocks. However,
this type of encoding is considered abnormal and is not recommended for Cortex®-A72. It will likely perform
worse than the equivalent unconditional encodings.
4.4 Register Forwarding Hazards
The ARMv8-A architecture allows FP instructions to read and write 32-bit S-registers. In AArch32, each S-register
corresponds to one half (upper or lower) of an overlaid 64-bit D-register. Register forwarding hazards might
occur when one µop reads a D-register or Q-register operand that has recently been written with one or more
S-register results. Consider the following abnormal scenario.
VMOV S0,R0
VMOV S1,R1
VADD D2, D1, D0
The first two instructions write S0 and S1, which correspond to the bottom and top halves of D0. The third
instruction then requires D0 as an input operand. In this scenario, the Cortex®-A72 processor detects that at least
one of the S0/S1 registers overlaid on D0 was previously written, at which point the VADD instruction
is serialized until the prior S-register writes are guaranteed to have been architecturally committed, likely incurring
significant additional latency. Note that after the D0 register has been written as a D-register or Q-register
destination, subsequent consumers of that register will no longer encounter this register-hazard condition, until the
next S-register write, if any.
The Cortex®-A72 processor is able to avoid this register-hazard condition for certain cases. The following rules
describe the conditions under which a register-hazard can occur.
• The producer writes an S-register (not a D[x] scalar)
• The consumer reads an overlapping D-register (not as a D[x] scalar, nor as an implicit operand caused by
conditional execution)
• The consumer is a FP/ASIMD µop (not a store µop)
To avoid unnecessary hazards, ARM recommends that the programmer use D[x] scalar writes when populating
registers prior to ASIMD operations. For example, either of the following instruction forms would safely prevent a
subsequent hazard:
VLD1.32 Dd[x], [address]
VMOV.32 Dd[x], Rt
The Performance Monitor Unit (PMU) in the Cortex®-A72 processor can be used to determine when register
forwarding hazards are actually occurring. The implementation-defined PMU event number 0x12C
(DISP_SWDW_STALL) has been assigned to count the number of cycles spent stalling due to these hazards.
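The three hazard rules listed earlier in this section can be expressed as a simple predicate, which may be useful when reviewing generated code. This is a minimal sketch: the function and parameter names are hypothetical, the producer is assumed to be an S-register write (not a D[x] scalar write), and only D-register consumers are modelled (a Q-register consumer would need to be checked against both of its D halves).

```python
def may_hit_forwarding_hazard(producer_s, consumer_d,
                              consumer_is_fp_asimd=True,
                              consumer_reads_scalar=False,
                              consumer_read_is_implicit=False):
    """Return True if an S-register write followed by a D-register read
    matches the hazard conditions described in this section."""
    if not 0 <= producer_s <= 31:
        raise ValueError("S-register index must be in 0..31")
    if not 0 <= consumer_d <= 31:
        raise ValueError("D-register index must be in 0..31")
    # S(2n) and S(2n+1) are the lower and upper halves of Dn.
    overlaps = producer_s // 2 == consumer_d
    return (overlaps
            and consumer_is_fp_asimd           # not a store µop
            and not consumer_reads_scalar      # not a D[x] scalar read
            and not consumer_read_is_implicit) # not caused by conditional execution
```

For the scenario above, a write to S1 followed by a VADD that reads D0 satisfies the predicate, whereas a write to S2 (which overlays D1) followed by a read of D0 does not.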
4.5 Load/Store Throughput
The Cortex®-A72 processor includes separate load and store pipelines, which allow it to execute one load µop and
one store µop every cycle.
To achieve maximum throughput for memory copy (or similar loops), do the following:
• Unroll the loop to include multiple load and store operations for each iteration, minimizing the overheads
of looping.
• Use discrete, non-writeback forms of load and store instructions (such as LDRD and STRD), interleaving
them so that one load and one store operation can be performed each cycle. Avoid load-multiple/store-
multiple instruction encodings (such as LDM and STM), which lead to separated bursts of load and store
µops which might not allow concurrent use of both the load and store pipelines.
The following example shows a recommended instruction sequence for a long memory copy in AArch32 state:
Loop_start:
SUBS r2,r2,#64
LDRD r3,r4,[r1,#0]
STRD r3,r4,[r0,#0]
LDRD r3,r4,[r1,#8]
STRD r3,r4,[r0,#8]
LDRD r3,r4,[r1,#16]
STRD r3,r4,[r0,#16]
LDRD r3,r4,[r1,#24]
STRD r3,r4,[r0,#24]
LDRD r3,r4,[r1,#32]
STRD r3,r4,[r0,#32]
LDRD r3,r4,[r1,#40]
STRD r3,r4,[r0,#40]
LDRD r3,r4,[r1,#48]
STRD r3,r4,[r0,#48]
LDRD r3,r4,[r1,#56]
STRD r3,r4,[r0,#56]
ADD r1,r1,#64
ADD r0,r0,#64
BGT Loop_start
A recommended copy routine for AArch64 would look similar to the sequence above, but would use LDP/STP
instructions.
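To experiment with the AArch64 variant mentioned above, the following sketch generates an interleaved LDP/STP loop body as text. It is not taken from the guide: the register assignments (x0 destination, x1 source, x2 remaining byte count, x3/x4 scratch) mirror the AArch32 example by assumption, and the function name is hypothetical.

```python
def aarch64_copy_loop(bytes_per_iteration=64):
    """Return assembly lines for an unrolled, interleaved LDP/STP copy loop."""
    if bytes_per_iteration <= 0 or bytes_per_iteration % 16:
        raise ValueError("bytes_per_iteration must be a positive multiple of 16")
    if bytes_per_iteration - 16 > 504:
        raise ValueError("offset exceeds the LDP/STP immediate range for X registers")
    lines = ["Loop_start:", "    SUBS x2,x2,#%d" % bytes_per_iteration]
    for offset in range(0, bytes_per_iteration, 16):
        # Each LDP/STP pair moves 16 bytes; loads and stores alternate so
        # that both pipelines can be kept busy.
        lines.append("    LDP x3,x4,[x1,#%d]" % offset)
        lines.append("    STP x3,x4,[x0,#%d]" % offset)
    lines.append("    ADD x1,x1,#%d" % bytes_per_iteration)
    lines.append("    ADD x0,x0,#%d" % bytes_per_iteration)
    lines.append("    B.GT Loop_start")
    return lines
```

Calling `print("\n".join(aarch64_copy_loop(64)))` emits a loop with four load/store pairs per iteration; the unroll factor that works best in practice should be confirmed by measurement.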
4.6 Load/Store Alignment
The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The
Cortex®-A72 processor handles most unaligned accesses without performance penalties. However, there are
cases which reduce bandwidth or incur additional latency, as described below.
• Load operations that cross a cache-line (64-byte) boundary
• Store operations that cross a 16-byte boundary
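The two boundary conditions above are easy to check for a given access. The sketch below is illustrative only; the function names are assumptions, and addresses and sizes are in bytes.

```python
def crosses_boundary(address, size, boundary):
    """Return True if [address, address + size) spans a multiple of boundary."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return address // boundary != (address + size - 1) // boundary


def load_may_be_penalized(address, size):
    # Loads: crossing a 64-byte cache-line boundary.
    return crosses_boundary(address, size, 64)


def store_may_be_penalized(address, size):
    # Stores: crossing a 16-byte boundary.
    return crosses_boundary(address, size, 16)
```

For example, an 8-byte load at address 60 crosses a cache line, while an 8-byte store at address 12 crosses a 16-byte boundary; aligning either access to its own size avoids both cases.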
4.7 Branch Alignment
Branch instruction and branch target instruction alignment can affect performance. For best-case performance,
consider the following guidelines.
• Try not to include more than two taken branches within the same quadword-aligned quadword of
instruction memory.
• Consider aligning subroutine entry points and branch targets to quadword boundaries, within the bounds
of the code-density requirements of the program. This will ensure that the subsequent fetch can retrieve
four (or a full quadword’s worth of) instructions, maximizing fetch bandwidth following the taken branch.
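The benefit of quadword-aligning a branch target can be sketched numerically. The helper below assumes that a quadword is 16 bytes, that instructions are 4 bytes each, and that the first fetch after a taken branch delivers only the remainder of the aligned quadword containing the target; the last point is inferred from the guideline above rather than stated by it. Function names are hypothetical.

```python
QUADWORD = 16  # bytes


def align_up(address, alignment=QUADWORD):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


def instructions_in_first_fetch(target, insn_size=4):
    """Instructions available from target to the end of its aligned quadword."""
    return (QUADWORD - target % QUADWORD) // insn_size
```

Under these assumptions, a target at 0x100C yields a single instruction from the first fetch, whereas moving it to `align_up(0x100C)`, that is 0x1010, yields four.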
4.8 Setting Condition Flags
The ARM instruction set includes instruction forms that set the condition flags. In addition to compares, many
types of data processing operations set the condition flags as a side-effect. Excessive use of flag-setting
instruction forms might result in performance degradation, therefore ARM recommends that, where possible, non-
flag-setting instructions and instruction-forms are used except where the condition-flag result is explicitly required
for subsequent branches or conditional instructions.
When using the Thumb instruction set, special attention should be given to the use of 16-bit instruction forms.
Many of those (such as moves, adds, shifts) automatically set the condition flags. For best performance, consider
using the 32-bit encodings which include forms that do not set the condition flags, within the bounds of the code-
density requirements of the program.
4.9 Special Register Access
The Cortex®-A72 processor performs register renaming for general purpose registers to enable speculative and
out-of-order instruction execution. However, most special-purpose registers are not renamed. Instructions that
read or write non-renamed registers are subjected to one or more of the following additional execution constraints:
• Non-speculative execution – Instructions can only execute non-speculatively.
• In-order execution – Instructions must execute in-order with respect to other similar instructions, or in
some cases with respect to all instructions.
• Flush side-effects – Instructions trigger a flush side-effect after executing for synchronization.
The table below summarizes various special instructions and the associated execution constraints or side-effects.
| Instructions | Forms | Non-Speculative | In-Order | Flush Side-Effect | Notes |
| --- | --- | --- | --- | --- | --- |
| ISB | | Yes | Yes | Yes | 1 |
| CPS | | Yes | Yes | Yes | 1 |
| SETEND | | Yes | Yes | Yes | 1 |
| MRS (read) | APSR, CPSR | Yes | Yes | No | 1 |
| MRS (read) | SPSR | No | Yes | No | 1 |
| MSR (write) | APSR_nzcvq, CPSR_f | No | No | No | 1, 2, 3 |
| MSR (write) | APSR, CPSR other | Yes | Yes | Yes | 1 |
| MSR (write) | SPSR | Yes | Yes | No | 1 |
| VMRS (read) | FPSCR to APSR_nzcv | No | No | No | 1, 2 |
| VMRS (read) | Other | Yes | Yes | No | 1 |
| VMSR (write) | | Yes | Yes | Yes | 1 |
| VMSR (write) | FPSCR, changing only NZCV | Yes | Yes | No | 1 |
| MRC (read) | | Some | Yes | No | 1, 2, 4 |
| MCR (write) | | Yes | Yes | Some | 1, 4 |
Note:
1. Conditional forms of these instructions for which the condition is not satisfied will not access special
registers or trigger flush side-effects.
2. Conditional forms of these instructions are always executed non-speculatively and in-order to properly
resolve the condition.
3. MSR instructions that write APSR_nzcvq generate a separate µop to write the Q bit. That µop executes
non-speculatively and in-order, whereas the main µop, which writes the NZCV bits, executes as shown in
the table above.
4. A subset of MCR instructions must be executed non-speculatively. A subset of MRC instructions trigger
flush side-effects for synchronization. Those subsets are not documented here.
4.10 AES Encryption/Decryption
The Cortex®-A72 processor can issue one AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined)
with an execution latency of three cycles (see Section 3.19). This means encryption or decryption for at least
three data chunks should be interleaved for maximum performance:
AESE data0, key0
AESMC data0, data0
AESE data1, key0
AESMC data1, data1
AESE data2, key0
AESMC data2, data2
AESE data0, key1
AESMC data0, data0
...
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions provide higher performance when adjacent,
and in the described order, in the program code. Therefore it is important to ensure that these instructions come
in pairs in AES encryption/decryption loops, as shown in the code segment above.
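The ordering recommended above can be sketched as a small C routine that lists the order in which AESE/AESMC
pairs would be emitted when several independent blocks are interleaved round by round. This is only an
illustrative sketch: the type aes_step_t and the function interleave_rounds are hypothetical names, not part
of any ARM library, and each entry stands for one adjacent AESE/AESMC pair operating on data<block> with
key<round>.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one adjacent AESE/AESMC pair on data<block> with key<round>. */
typedef struct {
    int block;
    int round;
} aes_step_t;

/*
 * Fill `out` with the emission order of AESE/AESMC pairs when `n_blocks`
 * independent data blocks are interleaved: all blocks for round 0, then all
 * blocks for round 1, and so on. Returns the number of entries written,
 * which is capped at `cap`.
 */
size_t interleave_rounds(int n_blocks, int n_rounds, aes_step_t *out, size_t cap)
{
    assert(out != NULL);
    size_t n = 0;
    for (int round = 0; round < n_rounds; round++) {
        for (int block = 0; block < n_blocks; block++) {
            if (n == cap)
                return n;          /* output buffer full */
            out[n].block = block;  /* AESE  data<block>, key<round>   */
            out[n].round = round;  /* AESMC data<block>, data<block>  */
            n++;
        }
    }
    return n;
}
```

With n_blocks set to 3, the first four entries are (block 0, round 0), (block 1, round 0), (block 2, round 0)
and (block 0, round 1), which matches the code segment above.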
4.11 Fast literal generation
The Cortex®-A72 processor supports optimized literal generation for 32- and 64-bit code. A typical literal
generation sequence in 32-bit code is:
MOV rX, #bottom_16_bits
MOVT rX, #top_16_bits
In 64-bit code, generating a 32-bit immediate:
MOV wX, #bottom_16_bits
MOVK wX, #top_16_bits, lsl #16
In 64-bit code, generating the bottom half of a 64-bit immediate:
MOV xX, #bottom_16_bits
MOVK xX, #top_16_bits, lsl #16
In 64-bit code, generating the top half of a 64-bit immediate:
MOVK xX, #bits_47_to_32, lsl #32
MOVK xX, #bits_63_to_48, lsl #48
If any of these sequences appear sequentially and in the described order in program code, the two instructions
can be executed at lower latency and higher bandwidth than if they do not appear sequentially in the program
code, enabling 32-bit literals to be generated in a single cycle and 64-bit literals to be generated in two cycles.
Thus it is advantageous to ensure that compilers or programmers writing assembly code schedule these
instruction pairs sequentially.
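As an illustration of how a 64-bit literal decomposes into the two pairs described above, the following C
sketch models the architectural effect of MOV (with a 16-bit immediate) and MOVK on a 64-bit register. The
function names are hypothetical and chosen for this example only. The sketch always emits all four fields;
a real code generator might skip fields that are zero.

```c
#include <assert.h>
#include <stdint.h>

/* Model of MOV xX, #imm16: writes bits [15:0] and clears the rest. */
uint64_t mov_imm16(uint16_t imm)
{
    return (uint64_t)imm;
}

/* Model of MOVK xX, #imm16, lsl #shift: replaces one 16-bit field only. */
uint64_t movk_imm16(uint64_t reg, uint16_t imm, unsigned shift)
{
    assert(shift % 16 == 0 && shift <= 48);
    uint64_t mask = (uint64_t)0xFFFF << shift;
    return (reg & ~mask) | ((uint64_t)imm << shift);
}

/* Split a 64-bit literal into four 16-bit fields, lowest field first. */
void split_literal64(uint64_t value, uint16_t fields[4])
{
    for (int i = 0; i < 4; i++)
        fields[i] = (uint16_t)(value >> (16 * i));
}

/* Rebuild the literal as two adjacent instruction pairs. */
uint64_t build_literal64(const uint16_t fields[4])
{
    /* Pair 1: bottom half of the 64-bit immediate. */
    uint64_t x = mov_imm16(fields[0]);   /* MOV  xX, #bottom_16_bits          */
    x = movk_imm16(x, fields[1], 16);    /* MOVK xX, #top_16_bits, lsl #16    */
    /* Pair 2: top half of the 64-bit immediate. */
    x = movk_imm16(x, fields[2], 32);    /* MOVK xX, #bits_47_to_32, lsl #32  */
    x = movk_imm16(x, fields[3], 48);    /* MOVK xX, #bits_63_to_48, lsl #48  */
    return x;
}
```

Keeping the two instructions of each pair next to each other in the emitted code is what allows the
processor to apply the optimization described above.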
4.12 PC-relative address calculation
The Cortex®-A72 processor supports optimized PC-relative address calculation using the following instruction
sequence:
ADRP xX, #label
ADD xY, xX, #imm
If this sequence appears sequentially and in the described order in program code, the two instructions can be
executed at lower latency and higher bandwidth than if they do not appear sequentially in the program code.
Thus it is advantageous to ensure that compilers or programmers writing assembly code schedule these
instruction pairs sequentially.
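For readers less familiar with this pair, the following C sketch shows how a target address splits into the
two immediates: ADRP supplies a signed offset counted in 4KB pages from the page containing the PC, and the
ADD supplies the low 12 bits of the target. The type and function names are hypothetical and used only for
this illustration. The range check assumes the 21-bit signed page immediate of ADRP.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record holding the two immediates of an ADRP + ADD pair. */
typedef struct {
    int64_t  page_delta;  /* ADRP: signed distance in 4KB pages from the PC's page */
    uint32_t lo12;        /* ADD:  offset of the target within its 4KB page        */
} adrp_add_t;

/* Split `target` into ADRP and ADD immediates for an ADRP located at `pc`. */
adrp_add_t split_pc_relative(uint64_t pc, uint64_t target)
{
    adrp_add_t r;
    r.page_delta = (int64_t)(target >> 12) - (int64_t)(pc >> 12);
    r.lo12 = (uint32_t)(target & 0xFFF);
    return r;
}

/* ADRP encodes a 21-bit signed page count, so the reach is about +/-4GB. */
int fits_adrp(adrp_add_t r)
{
    return r.page_delta >= -(1 << 20) && r.page_delta < (1 << 20);
}

/* Model of the pair: ADRP xX, label followed by ADD xY, xX, #imm. */
uint64_t apply_adrp_add(uint64_t pc, adrp_add_t r)
{
    assert(r.lo12 < 4096);
    uint64_t page = (pc & ~(uint64_t)0xFFF) + ((uint64_t)r.page_delta << 12); /* ADRP */
    return page + r.lo12;                                                     /* ADD  */
}
```

For example, with the ADRP at 0x400123 and a target of 0x412345, the page delta is 0x12 and the low 12 bits
are 0x345.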
4.13 FPCR self-synchronization
Programmers and compiler writers should note that writes to the FPCR register are self-synchronizing, i.e. their
effect on subsequent instructions can be relied upon without an intervening context synchronization operation.