INTRODUCTION TO 8051
MICROCONTROLLER
UNIT – III
• Introduction to ARM processor: ARM architecture, Application specific classification of ARM
family, Pipeline, programming model, memory organization, processor modes, Instruction encoding
format, data processing and arithmetic and branch instructions, call or exceptions in ARM
9 Hours
ARM history
• 1983 Acorn RISC Machine (ARM) developed by Acorn computers
• To replace the 6502 in BBC (British Broadcasting Corporation) computers
• Its simplicity stems from the inexperience of the design team
• Matches the needs of a generalized SoC for reasonable power, performance and
die size
• The first commercial RISC implementation
• Later, a company was founded in November 1990 as Advanced RISC
Machines Ltd, structured as a joint venture between Acorn
Computers, Apple Computer (now Apple Inc.) and VLSI Technology
• The reason was that Apple wanted to use Arm technology but did not
want to base a product on IP from Acorn, which at the time was
considered a competitor. Apple invested the cash, VLSI Technology
provided the tools, and Acorn provided the 12 engineers; with that,
Arm was born
Naming ARM
• ARMxyzTDMIEJFS
• x: series
• y: MMU (memory management unit)
• z: cache memory
• T: Thumb decoder
• D: debugger
• M: Multiplier (fast)
• I: EmbeddedICE (built-in debugger hardware)
• E: Enhanced instructions
• J: Jazelle (hardware support for Java bytecode)
• F: Floating-point co-processor
• S: Synthesizable version (source code version for EDA tools)
Popular ARM architectures
• ARM7TDMI
– 3 pipeline stages (fetch/decode/execute)
– High code density/low power consumption
– One of the most widely used ARM versions (for low-end
systems)
– All ARM cores after ARM7TDMI include TDMI even if
they do not include TDMI in their labels
• ARM9TDMI
– Compatible with ARM7
– 5 stages (fetch/decode/execute/memory/write)
– Separate instruction and data cache
• ARM11
ARM Core comparison
ARM Core Application
ARM is a Reduced Instruction Set Computer (RISC)
• RISC: simple but powerful instructions that execute within a single cycle at high clock
speed.
• Four major design rules:
– Instructions: reduced set/single cycle/fixed length
– Pipeline: decode in one stage/no need for microcode
– Registers: a large set of general-purpose registers
– Load/store architecture: data processing instructions apply to registers only;
load/store to transfer data from memory
• Results in simple design and fast clock rate
• The distinction between RISC and CISC has blurred because CISC implementations
now use RISC concepts
Features of ARM
• Multiprocessing Systems –
ARM processors are designed for use in multiprocessing systems, where more than
one processor is used to process information
• Tightly Coupled Memory –
ARM processors support tightly coupled memory, which has a very fast response time. Its
low latency (quick response) is also useful in cases where cache behaviour is
unpredictable.
• Memory Management –
The ARM processor has a memory management section. This includes the Memory Management Unit
and the Memory Protection Unit. These units are important for
managing memory efficiently.
• Thumb-2 Technology –
Thumb-2 technology was introduced in 2003 to create a variable-length
instruction set. It extends the 16-bit instructions of the original Thumb technology with 32-bit
instructions. It offers better performance than the earlier Thumb technology.
• One cycle execution time –
Each ARM instruction is of fixed length, which allows the next instructions to be fetched
while the present instruction is executing. As a result, ARM aims for a CPI (clock cycles
per instruction) close to one; most data processing instructions execute in a single cycle.
• Pipelining –
Processing of instructions is done in parallel using pipelines. Instructions are broken
down and decoded in one pipeline stage. The pipeline advances one step at a time
to increase throughput (rate of processing).
• Large number of registers –
A large number of registers is used in the ARM processor to reduce the number of
memory accesses. Registers contain data and addresses. They act as a local
memory store for all operations
Architectural inheritance
• At the time the first ARM chip was designed, the only examples of RISC architectures
were the Berkeley RISC I and II and the Stanford MIPS (which stands for
Microprocessor without Interlocking Pipeline Stages).
Features used
The ARM architecture incorporated a number of features from the Berkeley RISC design, but a
number of other features were rejected. Those that were used were:
• a load-store architecture;
• fixed-length 32-bit instructions;
• 3-address instruction formats.
Features rejected
• Register windows.
The register banks on the Berkeley RISC processors incorporated a large number of registers, 32 of
which were visible at any time. The principal problem with register windows is the large chip area
occupied by the large number of registers. This feature was therefore rejected on cost grounds.
• Delayed branches.
Branches cause pipeline problems since they interrupt the smooth flow of instructions. Most RISC
processors ameliorate the problem by using delayed branches where the branch takes effect after the
following instruction has executed. The problem with delayed branches is that they remove the
atomicity of individual instructions.
• Single-cycle execution of all instructions.
Although the ARM executes most data processing instructions in a single clock cycle, many other
instructions take multiple clock cycles. Instead of single-cycle execution of all instructions, the ARM
was designed to use the minimum number of cycles required for memory accesses.
ARM architecture
• Arithmetic Logic Unit (ALU)
The ALU has two 32-bit inputs. The first comes from the register file, whereas the other comes
from the shifter. The status register flags are modified by the ALU outputs.
• Booth Multiplier
The Booth algorithm is a well-known multiplication algorithm for 2’s complement numbers. It
treats positive and negative numbers uniformly.
• Barrel Shifter
The barrel shifter has a 32-bit input to be shifted. This input comes from the register file or
may be immediate data.
• Control Unit
The control unit directs the operation of the whole processor, so its design is one of the
most important parts of the overall design.
• Registers
The register set holds the operands, results and addresses that instructions work on, so
that most operations can be carried out without accessing memory.
• Address incrementer
Increments or decrements the memory address according to the addressing mode.
Pipelining
• The Process of fetching the next instruction while the current
instruction is being executed is called Pipelining.
• Pipelining is supported by the processor to increase the speed of
program execution.
• Increases throughput.
• The pipeline has three stages: fetch, decode and execute
3- Stage Pipeline
• Fetch: The instruction is fetched from memory and placed in the instruction
pipeline
• Decode: The instruction is decoded and datapath control signals prepared for the
next cycle. In this stage the instruction owns the decode logic but not the
datapath
• Execute: The instruction owns the datapath; the register bank is read, an operand
shifted, the ALU result generated and written back into the destination register
Instruction 1:  FETCH   DECODE  EXECUTE
Instruction 2:          FETCH   DECODE  EXECUTE
Instruction 3:                  FETCH   DECODE  EXECUTE
(instructions listed top to bottom, time advancing left to right)
3-stage pipeline organization
The principal components are (datapath)
• The register bank, which stores the processor state. It has
two read ports and one write port which can each be used
to access any register, plus an additional read port and an
additional write port that give special access to r15, the
program counter.
• The barrel shifter, which can shift or rotate one operand
by any number of bits.
• The ALU, which performs the arithmetic and logic
functions required by the instruction set.
• The address register and incrementer, which select and
hold all memory addresses and generate sequential
addresses when required.
• The data registers, which hold data passing to and from
memory.
• The instruction decoder and associated control logic.
Examples on 3 stage pipeline
• The three instructions are placed into the pipeline sequentially
• In the first cycle the core fetches the ADD instruction from memory
• In the second cycle the core fetches the SUB instruction and decodes
the ADD instruction
• In the third cycle, both the SUB and ADD instructions are moved
along the pipeline
• The ADD instruction is executed, the SUB instruction is decoded, and
the CMP instruction is fetched.
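The cycle-by-cycle behaviour described above can be sketched in a few lines of Python. This is only an illustration, assuming every instruction takes a single cycle in each stage and that there are no stalls; the function name pipeline_trace is hypothetical.

```python
def pipeline_trace(instructions, cycles):
    """Return, for each cycle, which instruction occupies each stage."""
    trace = []
    for cycle in range(cycles):
        stages = {}
        # An instruction reaches decode one cycle after fetch,
        # and execute two cycles after fetch.
        for name, lag in (("fetch", 0), ("decode", 1), ("execute", 2)):
            index = cycle - lag
            in_range = 0 <= index < len(instructions)
            stages[name] = instructions[index] if in_range else None
        trace.append(stages)
    return trace


for number, stages in enumerate(pipeline_trace(["ADD", "SUB", "CMP"], 3), start=1):
    print("cycle", number, stages)
```

In the third cycle the sketch reports ADD in execute, SUB in decode and CMP in fetch, matching the description above.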
Multi-cycle instruction in 3 stage pipeline
If, as noted above, instructions fetch the next instruction but one during their first cycle, this
suggests that the PC must point eight bytes (two instructions) ahead of the current
instruction.
The simplest way to view breaks in the ARM pipeline
• All instructions occupy the datapath for one or more adjacent cycles.
• For each cycle that an instruction occupies the datapath, it occupies
the decode logic in the immediately preceding cycle.
• During the first datapath cycle each instruction issues a fetch for the
next instruction but one.
• Branch instructions flush and refill the instruction pipeline.
• In the 3-stage pipeline the PC must point 8 bytes ahead of the current instruction
(PC+8) [each instruction is 32 bits, or 4 bytes, wide]
Performance measure
• The time required for program execution is
$T_{prog} = \frac{N_{inst} \times CPI}{f_{clk}}$
• where $N_{inst}$ is the number of ARM instructions executed in the course of the program,
CPI is the average number of clock cycles per instruction and $f_{clk}$ is the processor's clock
frequency.
• Since $N_{inst}$ is constant for a given program, there are only two ways to increase
performance:
a) Increase the clock rate, $f_{clk}$.
This requires the logic in each pipeline stage to be simplified and, therefore, the number of
pipeline stages to be increased.
b) Reduce the average number of clock cycles per instruction, CPI.
This requires either that instructions which occupy more than one pipeline slot in a 3-stage
pipeline ARM are re-implemented to occupy fewer slots, or that pipeline stalls caused by
dependencies between instructions are reduced, or a combination of both
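The effect of the two options can be checked numerically. The sketch below evaluates execution time as instruction count times CPI divided by clock frequency; the instruction count, CPI values and clock rates are made-up illustrative numbers, not measurements of any ARM core.

```python
def execution_time(n_inst, cpi, f_clk_hz):
    """Program execution time in seconds: (N_inst * CPI) / f_clk."""
    return n_inst * cpi / f_clk_hz


# Illustrative (assumed) figures only.
base = execution_time(1_000_000, 1.9, 50e6)
faster_clock = execution_time(1_000_000, 1.9, 100e6)  # option (a)
lower_cpi = execution_time(1_000_000, 1.5, 50e6)      # option (b)

print(base, faster_clock, lower_cpi)
```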
Memory Bottleneck
• The fundamental problem with reducing the CPI relative to a 3-stage core is
related to the von Neumann bottleneck - any stored-program computer
with a single instruction and data memory will have its performance
limited by the available memory bandwidth.
• A 3-stage ARM core accesses memory on (almost) every clock cycle either
to fetch an instruction or to transfer data
• This means that simply tightening up on the few cycles where the memory is not used
will yield only a small performance gain.
• To get a significantly better CPI the memory system must deliver more than
one value in each clock cycle either by delivering more than 32 bits per
cycle from a single memory or by having separate memories for instruction
and data accesses.
• As a result of the above issues, higher performance ARM cores employ a
5-stage pipeline and have separate instruction and data memories
5 Stage pipeline
• Memory: Data memory is accessed if required. Otherwise the ALU
result is simply buffered for one clock cycle to give the same pipeline
flow for all instructions.
• Write: The results generated by the instruction are written back to
the register file, including any data loaded from memory
Data forwarding
A major source of complexity in the 5-stage pipeline (compared to the
3-stage pipeline) is that, because instruction execution is spread across
three pipeline stages, the only way to resolve data dependencies
without stalling the pipeline is to introduce forwarding paths
• Data dependencies arise when an instruction needs to use the result
of one of its predecessors before that result has returned to the
register file
• Forwarding paths allow results to be passed between stages as soon
as they are available, and the 5-stage ARM pipeline requires each of
the three source operands to be forwarded from any of three
intermediate result registers
Example : ADD r1, r2, r3
ADD r4, r1, r3
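In the example above, the second ADD reads r1 before the first ADD has written it back to the register file. A minimal sketch of how such a read-after-write dependency between adjacent instructions could be detected is shown below. It assumes the simple three-register syntax used in the example; the helper names parse and raw_dependencies are hypothetical.

```python
def parse(instruction):
    """Split 'ADD r1, r2, r3' into (mnemonic, destination, sources)."""
    mnemonic, operands = instruction.split(None, 1)
    registers = [name.strip() for name in operands.split(",")]
    return mnemonic, registers[0], registers[1:]


def raw_dependencies(program):
    """List cases where an instruction reads the register
    written by the instruction immediately before it."""
    hazards = []
    for earlier, later in zip(program, program[1:]):
        _, destination, _ = parse(earlier)
        _, _, sources = parse(later)
        if destination in sources:
            hazards.append((earlier, later, destination))
    return hazards


print(raw_dependencies(["ADD r1, r2, r3", "ADD r4, r1, r3"]))
```

Each reported pair is a place where the pipeline must either stall or forward the result.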
5-stage pipeline organization
Bottleneck of the 5-stage pipeline
• As the pipeline length increases, the amount of work done at each
stage is reduced, which allows the processor to attain a higher
operating frequency.
• The system latency increases because it takes more cycles to fill the
pipeline before the core can execute an instruction
• The increased pipeline length also means there can be data
dependency between certain stages
• Branch instructions also introduce control stall delays into the
pipeline, commonly referred to as the branch penalty
Register Model and
Addressing Modes – ARM
Processor Mode
The ARM has seven basic operating modes:
• Each mode has access to its own stack and a different subset of registers
• Some operations can only be carried out in a privileged mode
• User (USR): unprivileged mode in which most application code runs
• FIQ: fast interrupt request handling
• IRQ: interrupt request handling
• Supervisor (SVC): entered on reset or when a Supervisor Call instruction (SVC) is
executed
• Abort (ABT)
• Undef (UND)
• System (SYS)
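The current mode is held in the low five bits of the CPSR. The sketch below decodes those bits. The encodings are the classic ARM mode values and do not appear on these slides, so treat them as an assumption to check against the reference manual for the core in use; the function names are hypothetical.

```python
# CPSR[4:0] mode encodings for classic ARM cores (assumed, not from the slides).
MODE_BITS = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def current_mode(cpsr):
    """Name the operating mode selected by the low five CPSR bits."""
    return MODE_BITS.get(cpsr & 0x1F, "unknown")


def is_privileged(cpsr):
    """Every listed mode except User is privileged."""
    return current_mode(cpsr) not in ("User", "unknown")


print(current_mode(0x000000D3), is_privileged(0x000000D3))
```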
The ARM Register Set
Current Program Status Register (CPSR)
R14 (LR) and R15 (PC) Registers
• When the processor is executing in ARM state:
– All instructions are 32 bits in length
– All instructions must be word aligned
– Therefore the PC value is stored in bits [31:2] with bits [1:0] equal to zero (as
instructions cannot be halfword or byte aligned).
• R14 is used as the subroutine link register (LR) and stores the return address when
Branch with Link operations are performed, calculated from the PC.
• Thus to return from a linked branch
– MOV r15,r14
or
– MOV pc,lr
Exception Handling
An exception is any condition that needs to halt
normal execution of the instructions
• When an exception occurs, the core:
– Copies CPSR into SPSR_<mode> (saved program status register)
– Sets appropriate CPSR bits: change to ARM state, change to exception mode,
disable interrupts (if appropriate)
– Stores the return address in LR_<mode>
– Sets PC to vector address
• To return, exception handler needs to:
– Restore CPSR from SPSR_<mode>
– Restore PC from LR_<mode>
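The entry and return steps listed above can be modelled with a small Python sketch. The core is represented as a plain dictionary, which is purely an illustrative assumption, as are the function names. The vector addresses and target modes are the classic ARM values and should be checked against the vector table for the core in use. Per-exception adjustments of the return address are left out.

```python
# (vector address, exception mode) for each exception type (classic ARM values).
VECTORS = {
    "reset": (0x00, "svc"),
    "undefined": (0x04, "und"),
    "swi": (0x08, "svc"),
    "prefetch_abort": (0x0C, "abt"),
    "data_abort": (0x10, "abt"),
    "irq": (0x18, "irq"),
    "fiq": (0x1C, "fiq"),
}


def enter_exception(core, kind, return_address):
    vector, mode = VECTORS[kind]
    core["spsr_" + mode] = dict(core["cpsr"])  # copy CPSR into SPSR_<mode>
    core["cpsr"]["thumb"] = False              # change to ARM state
    core["cpsr"]["mode"] = mode                # change to exception mode
    core["cpsr"]["irq_disabled"] = True        # disable interrupts
    if kind in ("reset", "fiq"):
        core["cpsr"]["fiq_disabled"] = True
    core["lr_" + mode] = return_address        # store return address in LR_<mode>
    core["pc"] = vector                        # set PC to the vector address


def return_from_exception(core):
    mode = core["cpsr"]["mode"]
    core["pc"] = core["lr_" + mode]            # restore PC from LR_<mode>
    core["cpsr"] = dict(core["spsr_" + mode])  # restore CPSR from SPSR_<mode>
```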
Vector Address for different exceptions
ARM Instruction Format
Conditional Field in Instructions
• In ARM state, all instructions are conditionally executed according to the
state of the CPSR condition codes and the instruction’s condition field
• This field (bits 31:28) determines the circumstances under which an
instruction is to be executed
• If the state of the C, N, Z and V flags fulfils the conditions encoded by the
field, the instruction is executed, otherwise it is ignored
• There are sixteen possible conditions, each represented by a two-character
suffix that can be appended to the instruction’s mnemonic
• For example, a Branch (B in assembly language) becomes BEQ for "Branch
if Equal", which means the Branch will only be taken if the Z flag is set.
The sixteenth (1111) is reserved, and must not be used.
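As a rough illustration of how the condition field is evaluated against the N, Z, C and V flags, the sketch below lists the fifteen usable conditions in encoding order (0000 to 1110) with the flag test for each. The table and the function name condition_passes are written for this example and are not taken from the slides.

```python
# Each entry maps a suffix to a test on the flags (n, z, c, v).
CONDITIONS = {
    "EQ": lambda n, z, c, v: z,                  # equal
    "NE": lambda n, z, c, v: not z,              # not equal
    "CS": lambda n, z, c, v: c,                  # carry set / unsigned higher or same
    "CC": lambda n, z, c, v: not c,              # carry clear / unsigned lower
    "MI": lambda n, z, c, v: n,                  # negative
    "PL": lambda n, z, c, v: not n,              # positive or zero
    "VS": lambda n, z, c, v: v,                  # overflow
    "VC": lambda n, z, c, v: not v,              # no overflow
    "HI": lambda n, z, c, v: c and not z,        # unsigned higher
    "LS": lambda n, z, c, v: not c or z,         # unsigned lower or same
    "GE": lambda n, z, c, v: n == v,             # signed greater or equal
    "LT": lambda n, z, c, v: n != v,             # signed less than
    "GT": lambda n, z, c, v: not z and n == v,   # signed greater than
    "LE": lambda n, z, c, v: z or n != v,        # signed less or equal
    "AL": lambda n, z, c, v: True,               # always
}


def condition_passes(suffix, n, z, c, v):
    """Return True if an instruction with this suffix would execute."""
    return bool(CONDITIONS[suffix](n, z, c, v))


# BEQ is taken only when the Z flag is set.
print(condition_passes("EQ", n=False, z=True, c=False, v=False))
```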
Using and updating conditional field
• To execute an instruction conditionally, simply postfix it with the
appropriate condition:
• For example an add instruction takes the form:
• ADD r0,r1,r2 ; r0 = r1 + r2 (ADDAL)
• To execute this only if the zero flag is set:
• ADDEQ r0,r1,r2 ; If zero flag set then…
; ... r0 = r1 + r2
• By default, data processing operations do not affect the condition
flags (apart from the comparisons where this is the only effect). To
cause the condition flags to be updated, the S bit of the instruction
needs to be set by postfixing the instruction (and any condition code)
with an “S”.
• For example to add two numbers and set the condition flags:
• ADDS r0,r1,r2 ; r0 = r1 + r2
; ... and set flags
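What "set flags" means for a 32-bit addition can be sketched as follows. This is a simplified software model of the N, Z, C and V results of an ADDS, not a description of the hardware; the function name adds is hypothetical.

```python
MASK = 0xFFFFFFFF


def adds(a, b):
    """Model a 32-bit ADDS: return (result, flags)."""
    a &= MASK
    b &= MASK
    total = a + b
    result = total & MASK
    # Signed overflow: both operands share a sign that the result does not.
    overflow = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    flags = {
        "N": bool(result >> 31),   # result is negative (bit 31 set)
        "Z": result == 0,          # result is zero
        "C": total > MASK,         # carry out of bit 31
        "V": bool(overflow),       # signed overflow
    }
    return result, flags


print(adds(0x7FFFFFFF, 1))
```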
Branch Instructions
• Branch : B{<cond>} label
• Branch with Link : BL{<cond>} sub_routine_label
• The offset for branch instructions is calculated by the assembler:
• By taking the difference between the branch instruction and the target address
minus 8 (to allow for the pipeline).
• This gives a 26-bit offset which is right shifted 2 bits (as the bottom two bits are
always zero because instructions are word-aligned) and stored as a 24-bit field in the
instruction encoding.
• This gives a range of ± 32 Mbytes.
• When executing the branch instruction, the
processor:
• shifts the 24 bit offset left two bits, sign extends
it to 32 bits, and adds it to PC.
• Execution then continues from the new PC, once the
pipeline has been refilled.
• The "Branch with link" instruction implements a
subroutine call by writing PC-4 into the LR of the
current bank.
• i.e. the address of the next instruction following
the branch with link (allowing for the pipeline).
• To return from subroutine, simply need to restore the
PC from the LR:
• MOV pc, lr
• Again, pipeline has to refill before execution
continues.
• The "Branch" instruction does not affect LR.
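The offset arithmetic described above can be sketched in Python. The function names are hypothetical, and the model assumes ARM state with the PC reading as the branch address plus 8.

```python
def encode_branch_offset(branch_address, target_address):
    """Compute the 24-bit offset field as the assembler would."""
    byte_offset = target_address - (branch_address + 8)  # allow for the pipeline
    if byte_offset % 4:
        raise ValueError("target is not word-aligned")
    word_offset = byte_offset >> 2                        # drop the two zero bits
    if not -(1 << 23) <= word_offset < (1 << 23):
        raise ValueError("target is out of branch range")
    return word_offset & 0xFFFFFF


def branch_target(branch_address, field):
    """Recover the target address as the processor would."""
    if field & 0x800000:          # sign extend the 24-bit field
        field -= 1 << 24
    return (branch_address + 8 + (field << 2)) & 0xFFFFFFFF


def link_value(branch_address):
    """Value written to LR by BL: PC - 4, the next instruction."""
    pc = branch_address + 8
    return pc - 4


print(hex(encode_branch_offset(0x8000, 0x8000)))
```

A branch to itself encodes as an offset of -2 words, because the PC is already two instructions ahead.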
Load and Store Instructions
• LDR is used to load something from memory into a register
• STR is used to store something from a register to a memory address
LDR R2, [R0] ; Load the value at the address found in R0 to the destination register R2
STR R2, [R1] ; Stores the value found in R2 to the memory address found in R1
Example on LDR and STR
Load and Store Word or Byte:
Base Register
• The memory location to be accessed is held in a base register
• STR r0, [r1] ; Store contents of r0 to location pointed to
; by contents of r1.
• LDR r2, [r1] ; Load r2 with contents of memory location
; pointed to by contents of r1.
[Figure: r0 (source register for STR) holds 0x5; r1 (base register) holds 0x200; memory
location 0x200 holds 0x5 after the store; r2 (destination register for LDR) holds 0x5
after the load]
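The store and load with a base register can be mimicked with a toy model, using the register values from the figure (r0 = 0x5, r1 = 0x200). The dictionaries and helper names below are illustrative assumptions; memory is modelled as a word-addressed dictionary rather than real byte-addressed memory.

```python
registers = {"r0": 0x5, "r1": 0x200, "r2": 0x0}
memory = {}


def str_word(source, base):
    """STR source, [base]: store a register to the address held in base."""
    memory[registers[base]] = registers[source] & 0xFFFFFFFF


def ldr_word(destination, base):
    """LDR destination, [base]: load a register from the address held in base."""
    registers[destination] = memory[registers[base]]


str_word("r0", "r1")   # STR r0, [r1]
ldr_word("r2", "r1")   # LDR r2, [r1]
print(hex(memory[0x200]), hex(registers["r2"]))
```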
Addressing Modes
Memory Addressing Modes in ARM
ARM Indexed Addressing Modes
shift = direction #integer, where direction is LSL for left shift or LSR for right shift,
and integer is a 5-bit unsigned number specifying the shift amount
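As an illustration of a scaled register offset such as LDR r0, [r1, r2, LSL #2], the sketch below computes the effective address from a base, an index and a shift. The function names are hypothetical, only LSL and LSR are modelled, and the special hardware treatment of a zero shift amount is ignored.

```python
def apply_shift(value, direction, amount):
    """Shift a 32-bit value by a 5-bit amount (0 to 31)."""
    if not 0 <= amount <= 31:
        raise ValueError("shift amount must fit in 5 bits")
    if direction == "LSL":
        return (value << amount) & 0xFFFFFFFF
    if direction == "LSR":
        return (value & 0xFFFFFFFF) >> amount
    raise ValueError("unsupported shift direction")


def effective_address(base, index, direction="LSL", amount=0):
    """Base register plus shifted index register."""
    return (base + apply_shift(index, direction, amount)) & 0xFFFFFFFF


# Word array at 0x1000, element 3: the index is scaled by 4 using LSL #2.
print(hex(effective_address(0x1000, 3, "LSL", 2)))
```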
Relative Addressing
Arithmetic Instructions
ARM Instruction Set
Instruction set encoding
Condition field
Data processing
Data processing instructions
Instruction encoding
Shift operation
Shift and rotate Instructions
Logical shift left
Logical shift right
Arithmetic shift right
Rotate right
Rotate right extended
Arithmetic (Add and subtraction)
ADC
ADD
AND
MOV
MVN
RSB
Example
Example
BIC
Comparison
Multiplication
Flow control instructions
Branch and link
Conditional branch
Load and Store
LDR and STR
Loading constants
• No single ARM instruction can load an arbitrary 32-bit constant into a register, because
ARM instructions are themselves only 32 bits long
• There is a pseudo-instruction for this (LDR Rd, =constant)
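Whether a constant fits directly into a data processing instruction can be tested with the sketch below. It assumes the standard ARM immediate form, an 8-bit value rotated right by an even amount given by a 4-bit field; the rotate field is not described on these slides, so treat it as an assumption. The function names are hypothetical.

```python
def rotate_right(value, amount):
    """Rotate a 32-bit value right by amount bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def encode_immediate(constant):
    """Return (rotate_field, imm8) if the constant is encodable, else None."""
    for rotate_field in range(16):
        # Undo a right rotation of 2 * rotate_field bits.
        imm8 = rotate_right(constant, 32 - 2 * rotate_field)
        if imm8 <= 0xFF:
            return rotate_field, imm8
    return None


print(encode_immediate(0xFF000000))   # fits: 0xFF rotated right by 8
print(encode_immediate(0x12345678))   # does not fit in a single instruction
```

Constants that return None are the ones that need the pseudo-instruction or a multi-instruction sequence.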
ARM Instruction Execution
Data Processing Instructions
• A data processing instruction requires two operands, one of which is
always a register and the other is either a second register or an
immediate value
Example - MOV R0,R2
MOV R1,#0x78 ; the immediate must fit the 8-bit immediate field
MOV R0,R0,LSL #7
• The second operand is passed through the barrel shifter where it is
subject to a general shift operation, then it is combined with the first
operand in the ALU using a general ALU operation
• All these operations take place in a single clock cycle.
• Note also how the PC value in the address register is incremented
and copied back into both the address register and r15 in the register
bank, and the next instruction but one is loaded into the bottom of
the instruction pipeline (i. pipe).
• The immediate value, when required, is extracted from the current
instruction at the top of the instruction pipeline. For data processing
instructions only the bottom eight bits (bits [7:0]) of the instruction
are used in the immediate value.
Data Path Activity During Load and Store Operation
• A data transfer (load or store) instruction computes a
memory address in a manner very similar to the way
a data processing instruction computes its result
• A register is used as the base address, to which is
added (or from which is subtracted) an offset which
again may be another register or an immediate value
Example : LDR r3, address ; uses PC to locate the program memory
LDR r0, [r1]
STR r2,[r3]
• The datapath operations for the two cycles of a data store instruction (STR) with
an immediate offset are shown in the figure.
• Note how the incremented PC value is stored in the register bank at the end of
the first cycle so that the address register is free to accept the data transfer
address for the second cycle, then at the end of the second cycle the PC is fed
back to the address register to allow instruction prefetching to continue.
• It should, perhaps, be noted at this stage that the value sent to the address
register in a cycle is the value used for the memory access in the following
cycle.
• The address register is, in effect, a pipeline register between the processor
datapath and the external memory.
• When the instruction specifies the store of a byte data type, the 'data
out' block extracts the bottom byte from the register and replicates it
four times across the 32-bit data bus.
• External memory control logic can then use the bottom two bits of the
address bus to activate the appropriate byte within the memory system.
• Load instructions follow a similar pattern except that the data from
memory only gets as far as the 'data in' register on the second cycle and
a third cycle is needed to transfer the data from there to the destination
register.
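The byte replication performed by the 'data out' block can be sketched as follows. The function names are hypothetical, and which physical data bus lines correspond to a given byte lane depends on the memory system and endianness, which are not modelled here.

```python
def replicate_byte(register_value):
    """Copy the bottom byte of a register across all four byte lanes."""
    byte = register_value & 0xFF
    return byte * 0x01010101


def byte_lane(address):
    """Bottom two address bits select the byte within the word."""
    return address & 0b11


# Storing the bottom byte of 0x12345678 to address 0x203.
print(hex(replicate_byte(0x12345678)), byte_lane(0x203))
```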
Computing address
• 1st cycle of operation
Store data
• 2nd cycle of operation
At the end of the second cycle the PC is fed back to the
address register to allow instruction
prefetching to continue
Branch
• Branch instructions compute the target address in the first cycle
• A 24-bit immediate field is extracted from the instruction and then
shifted left two bit positions to give a word-aligned offset which is
added to the PC
• The result is issued as an instruction fetch address, and while the
instruction pipeline refills the return address is copied into the link
register (r14) if this is required (that is, if the instruction is a 'branch
with link').
Branch: computing target address in first cycle
Save return address – second cycle
• The third cycle, which is required to
complete the pipeline refilling, is also
used to make a small correction to the
value stored in the link register in
order that it points directly at the
instruction which follows the branch