ESD Unit 2 Notes

This document covers instruction sets in computer architecture, focusing on the differences between Von Neumann and Harvard architectures, and the characteristics of ARM processors. It explains the concepts of CISC and RISC, detailing the advantages of RISC in terms of performance and power efficiency. Additionally, it discusses memory organization, cache performance, and the significance of pipelining in CPU performance.


UNIT II

INSTRUCTION SETS
LEARNING OBJECTIVES

At the end of this unit, students will be able to:

• Differentiate von Neumann and Harvard architectures.
• Implement instruction sets.
• Understand the characteristics of an instruction set.
• Know the versions seen in ARM.
• Understand the instructions of ARM.
• Know memory organization.

2.1 Computer architecture taxonomy


Von Neumann architectures

A block diagram for one type of computer is shown in Figure 2.1. The computing system
consists of a central processing unit (CPU) and a memory. The memory holds both data
and instructions, and can be read or written when given an address. A computer whose
memory holds both data and instructions is known as a von Neumann machine.

Figure 2. 1 A Von Neumann Architecture Computer

The CPU has several internal registers that store values used internally. One of those registers
is the program counter (PC), which holds the address in memory of an instruction. The CPU
fetches the instruction from memory, decodes the instruction, and executes it. The program
counter does not directly determine what the machine does next, but only indirectly by
pointing to an instruction in memory. By changing only the instructions, we can change what
the CPU does. It is this separation of the instruction memory from the CPU that distinguishes
a stored-program computer from a general finite-state machine.

Harvard architectures

An alternative to the von Neumann style of organizing computers is the Harvard
architecture, which is nearly as old as the von Neumann architecture. As shown in

Figure 2.2, a Harvard machine has separate memories for data and program. The
program counter points to program memory, not data memory. As a result, it is harder to
write self-modifying programs (programs that write data values, then use those values as
instructions) on Harvard machines. Harvard architectures are widely used today for one
very simple reason—the separation of program and data memories provides higher
performance for digital signal processing. Processing signals in real time places great
strains on the data access system in two ways: First, large amounts of data flow through
the CPU; and second, that data must be processed at precise intervals, not just when the
CPU gets around to it.

Figure 2. 2 Harvard Architecture


Data sets that arrive continuously and periodically are called streaming data. Having two
memories with separate ports provides higher memory bandwidth; keeping instruction and
data accesses from competing for the same port also makes it easier to move the data at the
proper times. DSPs constitute a large fraction of all microprocessors sold today, and most of
them are Harvard architectures. A single example shows the importance of DSPs: most of the
telephone calls in the world go through at least two DSPs, one at each end of the call.

Comparison
Complex Instruction Set Computer Vs Reduced Instruction Set Computer

CISC machines provided a variety of instructions that may perform very complex tasks,
such as string searching; they also generally used a number of different instruction
formats of varying lengths. One of the advances in the development of high-performance
microprocessors was the concept of reduced instruction set computers (RISC). These
computers tended to provide somewhat fewer and simpler instructions. RISC machines
generally use load/store instruction sets—operations cannot be performed directly on
memory locations, only on registers. The instructions were also chosen so that they could
be efficiently executed in pipelined processors. Early RISC designs substantially
outperformed CISC designs of the period. As it turns out, we can use RISC techniques to
efficiently execute at least a common subset of CISC instruction sets, so the performance
gap between RISC-like and CISC-like instruction sets has narrowed somewhat.

Comparison

• RISC stands for Reduced Instruction Set Computer; CISC stands for Complex
Instruction Set Computer.
• In RISC, memory operations are minimized by using load/store instructions; in CISC,
a single instruction can operate directly on memory.
• RISC instructions are pipelined, so only a few simple instructions are needed to do a
job; CISC offers many addressing modes, and it takes many lines of instructions to do
a particular job.

2.2 ARM Processor

• An ARM processor is one of a family of CPUs based on the RISC (reduced instruction
set computer) architecture developed by Advanced RISC Machines (ARM).
• ARM7 is a von Neumann architecture machine, while ARM9 uses a Harvard
architecture.
• The difference is invisible to the assembly language programmer, except for
possible performance differences.
• ARM makes 32-bit and 64-bit RISC multi-core processors. RISC processors are
designed to perform a smaller number of types of computer instructions so that
they can operate at a higher speed, performing more millions of instructions per
second (MIPS).
• By stripping out unneeded instructions and optimizing pathways, RISC processors
provide outstanding performance at a fraction of the power demand of CISC
(complex instruction set computing) devices.
• ARM processors are extensively used in consumer electronic devices such as
smartphones, tablets, multimedia players and other mobile devices, such as
wearables. Because of their reduced instruction set, they require fewer transistors,
which enables a smaller die size for the integrated circuitry (IC).
• The ARM processor’s smaller size, reduced complexity and lower power
consumption makes them suitable for increasingly miniaturized devices.

ARM processor features include:

• Load/store architecture.
• An orthogonal instruction set.
• Mostly single-cycle execution.
• Enhanced power-saving design.
• 64 and 32-bit execution states for scalable high performance.
• Hardware virtualization support.

Program counter

A program counter is a register in a computer processor that contains the address of the
instruction currently being executed. The program counter appears in the programmer's
model of the ARM processor: it is general-purpose register r15.

Memory Organization of ARM


• The ARM architecture supports two basic types of data:
➢ The standard ARM word is 32 bits long.
➢ The word may be divided into four 8-bit bytes.
• ARM7 allows addresses up to 32 bits long.
• An address refers to a byte, not a word. Therefore, the word 0 in the ARM address
space is at location 0, the word 1 is at 4, the word 2 is at 8, and so on.
• The ARM processor can be configured at power-up to address the bytes in a word
in either
➢ little-endian mode - the lowest-order byte resides in the low-order
bits of the word, or
➢ big-endian mode - the lowest-order byte is stored in the highest-order
bits of the word.

Figure 2. 3 Little Endian and Big Endian Format
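The two byte orders shown in Figure 2.3 can be checked with a short sketch. This is illustrative only (Python's struct module, not the ARM toolchain), packing the same 32-bit word in each order:

```python
import struct

word = 0x12345678  # a 32-bit ARM word

# '<' selects little-endian and '>' big-endian packing of an
# unsigned 32-bit integer ('I').
little = struct.pack("<I", word)
big = struct.pack(">I", word)

print(little.hex())  # 78563412 -> lowest-order byte 0x78 comes first
print(big.hex())     # 12345678 -> lowest-order byte 0x78 comes last
```

In little-endian mode the byte at the lowest address is the least significant byte of the word; big-endian mode reverses that.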

Characteristics of the Instruction set of an ARM processor.

▪ Instructions may be of fixed or variable length.
▪ Various addressing modes are used to supply operands to each instruction.
▪ The instruction format depends on the number of operands.
▪ It also depends on the type of operation performed.

Features of assembly language.

• One instruction per line.
• Labels provide names for addresses (usually in the first column).
• Instructions often start in later columns.
• Columns run to the end of the line.

General-purpose registers in an ARM programming model:


There are 16 general-purpose registers, r0 through r15, in the ARM programming
model.

Purpose of the CPSR

The Current Program Status Register (CPSR) is automatically updated during every
arithmetic, logical, or shifting operation to record properties of the result.

CPSR-Current Program Status Register

• The CPSR is set automatically during every arithmetic, logical, or shifting
operation.

• The top four bits of the CPSR hold useful information about the result of that
arithmetic/logical operation. These bits can be used to easily check the result
of an operation.

Function of the four bits used in the CPSR.

➢ The negative (N) bit is set when the result is negative in two’s-complement
arithmetic.
➢ The zero (Z) bit is set when every bit of the result is zero.

➢ The carry (C) bit is set when there is a carry out of the operation.
➢ The overflow (V) bit is set when an arithmetic operation results in an overflow.
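As an illustration of how the four bits follow from a result, here is a minimal sketch (in Python, not ARM code) that mimics how an ADDS-style 32-bit addition sets N, Z, C, and V; the function name add_flags is ours, not an ARM API:

```python
MASK32 = 0xFFFFFFFF

def add_flags(a, b):
    """Add two 32-bit values; return (result, N, Z, C, V)."""
    full = (a & MASK32) + (b & MASK32)
    result = full & MASK32
    n = (result >> 31) & 1           # negative: top bit of the result is set
    z = 1 if result == 0 else 0      # zero: every bit of the result is zero
    c = 1 if full > MASK32 else 0    # carry out of bit 31
    # overflow: operands share a sign but the result's sign differs
    sa, sb = (a >> 31) & 1, (b >> 31) & 1
    v = 1 if (sa == sb and n != sa) else 0
    return result, n, z, c, v

print(add_flags(0x7FFFFFFF, 1))  # largest positive + 1: N=1, V=1
print(add_flags(0xFFFFFFFF, 1))  # wraps to zero: Z=1, C=1
```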
Assembly language program for the ARM processor

The following code computes x = a + b - c:

ADR r4, a ; get address for a
LDR r0, [r4] ; get value of a
ADR r4, b ; get address for b, reusing r4
LDR r1, [r4] ; load value of b
ADD r3, r0, r1 ; set intermediate result for x to a + b
ADR r4, c ; get address for c
LDR r2, [r4] ; get value of c
SUB r3, r3, r2 ; complete computation of x
ADR r4, x ; get address for x
STR r3, [r4] ; store x at proper location

The following code computes y = (c + d) * (e - f):

ADR r4, c ; get address for c
LDR r0, [r4] ; get value of c
ADR r4, d ; get address for d, reusing r4
LDR r1, [r4] ; load value of d
ADD r3, r0, r1 ; set intermediate result c + d
ADR r4, e ; get address for e
LDR r5, [r4] ; get value of e
ADR r4, f ; get the address for f
LDR r6, [r4] ; get value of f
SUB r6, r5, r6 ; set the second intermediate result e - f
MUL r7, r3, r6 ; multiply r3 and r6 to compute y
ADR r4, y ; get the address of y
STR r7, [r4] ; store y at proper location

Assembly Language code

The following code implements the conditional
if (a < b) { x = 5; y = c + d; } else { x = c - d; }.
First, compute and test the condition:

ADR r4, a ; get address for a
LDR r0, [r4] ; get value of a
ADR r4, b ; get address for b
LDR r1, [r4] ; get value of b
CMP r0, r1 ; compare a < b
BGE fblock ; if a >= b, take branch

The true block follows:

MOV r0, #5 ; generate value for x
ADR r4, x ; get address for x
STR r0, [r4] ; store value of x
ADR r4, c ; get address for c
LDR r0, [r4] ; get value of c
ADR r4, d ; get address for d
LDR r1, [r4] ; get value of d
ADD r0, r0, r1 ; compute c + d
ADR r4, y ; get address for y
STR r0, [r4] ; store value of y
B after ; branch around the false block

The false block follows:

fblock ADR r4, c ; get address for c
LDR r0, [r4] ; get value of c
ADR r4, d ; get address for d
LDR r1, [r4] ; get value of d
SUB r0, r0, r1 ; compute c - d
ADR r4, x ; get address for x
STR r0, [r4] ; store value of x
after ...
2.3 MEMORY SYSTEM MECHANISM

2.3.1 CACHES

CACHE PERFORMANCE

• Caches are invisible in the programming model; they are introduced because
they substantially reduce memory access time when the requested location is in
the cache.

• However, the desired location is not always in the cache because it is considerably
smaller than main memory.

• The extra time required to access a memory location not in the cache is often
called the cache miss penalty. The amount of variation depends on several
factors in the system architecture, but a cache miss is often several clock cycles
slower than a cache hit.
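The effect of the miss penalty on average access time can be made concrete with the standard formula t_av = h*t_cache + (1 - h)*t_main; the hit rate and cycle counts below are hypothetical, chosen only for illustration:

```python
def average_access_time(hit_rate, t_cache, t_main):
    """Average memory access time in cycles:
    t_av = h * t_cache + (1 - h) * t_main."""
    return hit_rate * t_cache + (1 - hit_rate) * t_main

# hypothetical timings: a 1-cycle cache hit, a 10-cycle path to main memory
print(average_access_time(0.9, 1, 10))  # 0.9*1 + 0.1*10 = 1.9 cycles
```

Even a 10% miss rate nearly doubles the average access time here, which is why cache behavior dominates performance estimates.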

The time required to access a memory location depends on whether the requested
location is in the cache. However, as we have seen, a location may not be in the cache for
several reasons.

• Compulsory miss - the location has not been referenced before.

• Conflict miss - two particular memory locations are fighting for the same cache line.

• Capacity miss - the program’s working set is simply too large for the cache.

The contents of the cache can change considerably over the course of execution of a
program. When we have several programs running concurrently on the CPU, we can have
very dramatic changes in the cache contents. It is necessary to examine the behavior of
the programs running on the system to be able to accurately estimate performance when
caches are involved.

2.3.1.1 A direct-mapped cache
The cache consists of cache blocks, each of which includes a tag to show which memory
location is represented by this block, a data field holding the contents of that memory, and a
valid tag to show whether the contents of this cache block are valid.
• An address is divided into three sections.
• The index is used to select which cache block to check.
• The tag is compared against the tag value in the block selected by the index.
• If the address tag matches the tag value in the block, that block includes the
desired memory location.
• If the length of the data field is longer than the minimum addressable unit, then
the lowest bits of the address are used as an offset to select the required value
from the data field.
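The three-way address split described above can be sketched as follows; the geometry (16-byte blocks giving 4 offset bits, 256 lines giving 8 index bits) is a hypothetical example, not a property of any particular ARM cache:

```python
def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, offset) fields
    for a direct-mapped cache lookup."""
    offset = addr & ((1 << offset_bits) - 1)          # lowest bits: byte in block
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # selects the block
    tag = addr >> (offset_bits + index_bits)          # compared against stored tag
    return tag, index, offset

tag, index, offset = split_address(0x0001234C, offset_bits=4, index_bits=8)
print(hex(tag), hex(index), hex(offset))  # 0x12 0x34 0xc
```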

Figure 2. 5 Direct Mapped Cache

• In the above diagram, there is only one block that must be checked to see
whether a location is in the cache.
• If the access is a hit, the data value is read from the cache.
• Two write schemes:
o Write-through - every write changes both the cache and the corresponding
main memory location.
o Write-back - writes change the cache first; main memory is updated only
when the block is removed from the cache, which reduces the number of
writes to main memory.
2.3.1.2 A set-associative cache
• A set-associative cache is characterized by the number of banks or ways it uses,
giving an n-way set-associative cache.
• A set is formed by all the blocks (one for each bank) that share the same index.
Each set is implemented with a direct-mapped cache. A cache request is
broadcast to all banks simultaneously.
• If any of the banks in the set holds the location, the cache reports a hit.
• The set associative cache structure incurs a little extra overhead and is slightly
slower than a direct-mapped cache, but the higher hit rates that it can provide
often compensate.

Figure 2. 6 Set Associative Cache

• The set-associative cache generally provides higher hit rates than the direct-
mapped cache because conflicts between a small number of locations can be
resolved within the cache.

• The set-associative cache is somewhat slower, so the CPU designer has to be
careful that it doesn't slow down the CPU's cycle time too much. A more subtle
problem for embedded programs is that a set-associative cache makes it harder
to predict exactly which locations will be resident in the cache.
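A minimal sketch of the lookup just described, assuming a FIFO replacement policy (real caches may use others) and a hypothetical geometry; the class name and parameters are ours, for illustration only:

```python
class SetAssociativeCache:
    """Minimal n-way set-associative tag store with FIFO replacement."""

    def __init__(self, ways, num_sets, block_bits):
        self.ways = ways
        self.num_sets = num_sets
        self.block_bits = block_bits
        self.sets = [[] for _ in range(num_sets)]  # each set holds up to `ways` tags

    def access(self, addr):
        block = addr >> self.block_bits   # strip the offset within the block
        index = block % self.num_sets     # which set to check
        tag = block // self.num_sets      # identifies the block within the set
        tags = self.sets[index]
        if tag in tags:                   # a match in any bank is a hit
            return "hit"
        if len(tags) == self.ways:        # set full: evict the oldest tag (FIFO)
            tags.pop(0)
        tags.append(tag)
        return "miss"

cache = SetAssociativeCache(ways=2, num_sets=4, block_bits=4)
print(cache.access(0x000))  # miss (cold)
print(cache.access(0x100))  # miss, but it lands in the same set's second bank
print(cache.access(0x000))  # hit: both blocks coexist in the 2-way set
```

Here 0x000 and 0x100 map to the same set; in a direct-mapped cache the second access would evict the first, but the two-way set holds both, so the third access hits. This is the conflict-resolution advantage described above.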

2.5 CPU PERFORMANCE

There are two factors that can substantially influence program performance: pipelining
and caching.

• Pipelining

• Caching
2. 5.1 PIPELINING
• Modern CPUs are designed as pipelined machines in which several instructions
are executed in parallel.

• Pipelining greatly increases the efficiency of the CPU

• A CPU pipeline works best when its contents flow smoothly.

• Some sequences of instructions can disrupt the flow of information in the pipeline
and, temporarily at least, slow down the operation of the CPU.

ARM7 pipeline

The ARM7 has a three-stage pipeline:

1. Fetch: the instruction is fetched from memory.

2. Decode: the instruction’s opcode and operands are decoded to determine what
function to perform.

3. Execute: the decoded instruction is executed.

Figure 2. 7 ARM 7 Pipelining

Each of these operations requires one clock cycle for typical instructions. Thus, a normal
instruction requires three clock cycles to completely execute, known as the latency of
instruction execution.
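The distinction between latency and throughput can be sketched numerically; this assumes an ideal, stall-free flow through the three-stage pipeline:

```python
def pipeline_cycles(num_instructions, stages=3):
    """Cycles for a smooth instruction sequence on a k-stage pipeline:
    the first instruction needs all k stages (its latency); after that,
    one instruction completes per cycle."""
    return stages + (num_instructions - 1)

print(pipeline_cycles(1))   # 3 cycles: the latency of one ARM7 instruction
print(pipeline_cycles(10))  # 12 cycles, not 30: overlap hides most latency
```

Although each instruction still takes three cycles from fetch to completion, the sustained rate approaches one instruction per cycle.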
Pipeline stalls

• The one-cycle-per-instruction completion rate does not hold in every case,
however.
• Execution is extended when an instruction is too complex to complete the
execution phase in a single cycle.
• A multiple load instruction is an example of an instruction that requires several
cycles in the execution phase.
Figure 2. 8 Pipeline Stall example

Data stall

• A data stall occurs during execution of a sequence of instructions that starts
with a load multiple (LDMIA) instruction.
• Because there are two registers to load, the instruction must stay in the execution
phase for two cycles.
• During a multicycle execution, the decode stage is also occupied, because it must
continue to remember the decoded instruction.
• As a result, the SUB instruction is fetched at the normal time but not decoded until
the LDMIA is finishing.
• This delays the fetching of the third instruction, the CMP.

Control Stall

• Branches also introduce control stall delays into the pipeline, commonly referred
to as the branch penalty, as shown in below figure
Figure 2. 9 Control Stall Example

• The decision whether to take the conditional branch BNE is not made until the
third clock cycle of that instruction’s execution, which computes the branch target
address.
• If the branch is taken, the succeeding instruction at PC+4 has been fetched and
started to be decoded.
• When the branch is taken, the branch target address is used to fetch the branch
target instruction.
• Because we have to wait for the execution cycle to complete before knowing the
target, we must throw away two cycles of work on instructions in the path not taken.
• The CPU uses the two cycles between starting to fetch the branch target and
starting to execute that instruction to finish housekeeping tasks related to the
execution of the branch.

One way around this problem is to introduce the delayed branch. In this style of
branch instruction, a fixed number of instructions directly after the branch are always
executed, whether or not the branch is taken. This allows the CPU to keep the pipeline full
during execution of the branch. However, some of those instructions after the delayed
branch may be no-ops. Any instruction in the delayed branch window must be valid for
both execution paths, whether or not the branch is taken. If there are not enough
instructions to fill the delayed branch window, it must be filled with NOPs.
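The cost of the branch penalty can be folded into an average cycles-per-instruction figure; the branch and taken frequencies below are hypothetical, while the two-cycle penalty matches the ARM7 behavior described above:

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction, penalty):
    """Average cycles per instruction once taken-branch penalties
    are included: base CPI plus the expected penalty per instruction."""
    return base_cpi + branch_fraction * taken_fraction * penalty

# hypothetical workload: 20% branches, 60% of them taken, 2-cycle penalty
print(effective_cpi(1.0, 0.20, 0.60, 2))  # 1.0 + 0.2*0.6*2 = 1.24
```

Even modest branch frequencies visibly inflate the average cycle count, which is why techniques such as the delayed branch exist.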

2.6 Revision Questions


1. Write short notes on Von Neumann Architecture.
2. Write short notes on Harvard Architecture.
3. Draw the Block diagram of Harvard Architecture.
4. Draw the Basic Von Neumann Architecture Diagram.
5. Write any two differences between RISC and CISC processors.
6. Compare Von Neumann and Harvard Architecture.
7. Explain the functions of the four bits used in the CPSR.
8. Why is RISC preferred over CISC processors?
9. Draw the multicycle pipeline execution of ARM 7 for the instructions ldma r0, {r2,
r3} and sub r2, r3, r6.
10. Draw the Big Endian Mode format.

11. Write short notes on the PC Stack in the PIC Microcontroller.
12. Write short notes on the Program Counter in the PIC Microcontroller.
13. Explain the 3-stage Pipeline of ARM 7 with neat diagram.
14. With a neat diagram, explain in detail about Direct Mapped Cache.
15. Define the Write-through seen in Direct Map Cache.
16. Define the Write – Back Policy seen in Direct Map Cache.
17. Write an assembly language program for the ARM processor to implement

18. Write an assembly language program for the ARM processor to implement

19. Draw the Memory System Structure of Cache.


20. With a neat sketch, Explain Set Associative Cache in detail.
21. List any three differences between Harvard and von Neumann architectures.
22. Draw the Little Endian mode format.
23. Draw the Big Endian mode format.
24. How many general-purpose registers are there in an ARM programming model?
25. What is the purpose of the CPSR?
26. What is the purpose of ‘Z’ bit?
27. What is the purpose of Overflow flags in CPSR?
28. List the CPSR flags.
29. Differentiate Direct Mapped Cache and Set Associative Cache.
30. Draw the Memory System Structure of Cache.
31. Explain the following with example
• Pipeline Stalls.
• Control Stalls
• Data Stalls.
32. Define Fetch and Decode.
33. Define compulsory Miss
34. Define capacity Miss
35. Define Conflict Miss
