1a) Illustrate the concept of efficient and optimized usage of structure in ARM C
Compiler with respect to arrangement and size.
Every data type has alignment requirements (mandated by the processor architecture, not by the language). A processor's processing word length matches its data bus width; on a 32-bit machine, the processing word size is 4 bytes.
If a 4-byte integer is allocated at an address X that is a multiple of 4, the processor needs only one memory cycle to read the entire integer. If the integer is instead allocated at an address that is not a multiple of 4, it spans two rows of the memory banks, as shown in figure 3.3, and requires two memory read cycles to fetch.
Load and store instructions are only guaranteed to load and store values whose addresses are aligned to the size of the access width.
Therefore ARM compilers will automatically align the start address of a structure to a multiple of the
largest access width used within the structure (usually four or eight bytes) and align entries within
structures to their access width by inserting padding.
Example:
struct {
    char a;
    int b;
    char c;
    short d;
};
For a little-endian memory system the compiler will lay this out adding padding to ensure that the next
object is aligned to the size of that object:
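(Layout reconstructed assuming a 4-byte int and a 2-byte short.)
offset 0       char a
offsets 1-3    padding
offsets 4-7    int b
offset 8       char c
offset 9       padding
offsets 10-11  short d
Total size: 12 bytes.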
To improve the memory usage, you should reorder the elements:
struct {
    char a;
    char c;
    short d;
    int b;
};
This reduces the structure size from 12 bytes to 8 bytes, with the following new layout:
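(Layout reconstructed under the same assumptions.)
offset 0       char a
offset 1       char c
offsets 2-3    short d
offsets 4-7    int b
Total size: 8 bytes.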
The following rules generate a structure with the elements packed for maximum efficiency:
a) Place all 8-bit elements at the start of the structure.
b) Place all 16-bit elements next, then 32-bit, then 64-bit.
c) Place all arrays and larger elements at the end of the structure.
d) If the structure is too big for a single instruction to access all the elements, then group the elements
into substructures. The compiler can maintain pointers to the individual substructures.
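A minimal sketch of rule (d), with hypothetical type and field names; grouping related fields into substructures lets the compiler hold a base pointer to each group in a register:

struct position { short x, y, z, pad; };
struct motion   { int vx, vy, vz, flags; };

struct particle {
    struct position pos;
    struct motion   mov;
};

void reset_position(struct particle *p)
{
    struct position *pp = &p->pos;  /* base pointer kept in a register */
    pp->x = 0;
    pp->y = 0;
    pp->z = 0;
}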
Summary
For Efficient Structure Arrangement we need to consider the following points:
Lay structures out in order of increasing element size. Start the structure with the smallest elements
and finish with the largest.
Avoid very large structures. Instead use a hierarchy of smaller structures.
For portability, manually add padding (that would otherwise appear implicitly) into API structures so that the layout of the structure does not depend on the compiler; a sketch follows this list.
Beware of using enum types in API structures. The size of an enum type is compiler dependent.
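A minimal sketch of the manual-padding point above, with hypothetical field names; the explicit pad byte occupies the space the compiler would otherwise insert implicitly:

/* The layout is now fixed by the source, not by the compiler. */
struct api_message {
    char  type;        /* offset 0 */
    char  pad0;        /* explicit padding (would otherwise be implicit) */
    short length;      /* offsets 2-3 */
    int   payload_id;  /* offsets 4-7 */
};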
1b) Design for implementation an ARM C compiler oriented C program to print the list of all even numbers between 0 and 100.
#include <stdio.h>

int main(void) {
    unsigned int i;
    /* step by 2 so only even values between 0 and 100 are printed */
    for (i = 0; i <= 100; i += 2) {
        printf("%u\n", i);
    }
    return 0;
}
1c) Illustrate the concept of how registers are allocated to optimize the program.
Register allocation is a critical optimization technique in ARM C programming that assigns variables to
processor registers to improve execution speed. Efficient register allocation reduces memory access
latency, minimizes spilling to memory, and enhances overall performance.
Concept Illustration:
1. Basic Principle: The compiler tries to assign frequently used and live variables to the limited set of ARM
registers (r0 to r12, with some reserved). Variables that are actively used within a small scope or loop
are prioritized for register assignment.
2. Allocation Strategy:
• Prioritize Variables in Hot Loops: Variables that are used inside inner loops are given registers to avoid
repeated memory loads/stores.
• Limit the Number of Active Variables: As a general rule, limit the number of live local variables in a function to about 12, to match the available register count and avoid spilling.
3. Example – Loop Optimization:
Suppose you want to sum even numbers between 0 and 100:
#include <stdio.h>

int main(void) {
    int sum = 0;
    unsigned int i;
    for (i = 0; i <= 100; i += 2) {
        sum += i;  /* i and sum stay in registers across iterations */
    }
    printf("Sum: %d\n", sum);
    return 0;
}
• Register Allocation:
The compiler allocates i and sum to registers (r0, r1) to avoid reading and writing to memory during
each iteration.
4. Spilling (if necessary): If more variables are needed than available registers, some variables temporarily spill to memory. The compiler attempts to minimize spilling by:
• Reusing registers when variables go out of scope.
• Prioritizing variables used within loops or critical sections.
5. Impact: Using registers for loop counters and accumulators:
• Eliminates repeated memory loads/stores
• Reduces instruction count
• Achieves faster execution
Summary: Register allocation assigns the most frequently used variables within a scope to the limited ARM registers, reducing memory accesses and instruction overhead and thereby maximizing performance. Efficient register use is crucial in embedded systems where resources are limited.
3a) Discuss the concept of Exception, Exception handling and Vector Table
An exception is an unexpected event during program execution that causes the processor to halt normal operations and handle the situation. The following events can cause an exception:
a) Reset
b) Undefined instruction
c) Software interrupt
d) Prefetch abort
e) Data abort
f) Interrupt request
When an exception occurs, the processor switches to a specific mode, saves its state, jumps to a
handler routine to manage the event, and then resumes normal execution after the issue is addressed.
Exceptions are crucial for system stability and error management.
Exception Handling
Exception handling involves specialized software routines called exception handlers that determine the cause of the exception and execute the necessary response. When an exception occurs, the ARM core automatically switches to a specific processor mode associated with that exception. During this process, the core saves the current program state (the cpsr, Current Program Status Register) into the saved program status register (spsr) banked for that mode, and saves the address of the interrupted instruction (the pc) into a link register (lr). It then loads the program counter (pc) with the address of the corresponding exception handler from the vector table. After servicing the exception, the handler restores the processor's state and resumes normal operation.
Vector Table
The vector table is a table of addresses that the ARM core branches to when an exception is raised.
These addresses contain branch instructions. The memory map address 0x00000000 is reserved for the
vector table, a set of 32-bit words. On some processors the vector table can be optionally located at a
higher address in memory (starting at the offset 0xffff0000).
The branch instruction can be any of the following forms:
B <address>
LDR pc, [pc, #offset]
LDR pc, [pc, #-0xff0]
MOV pc, #immediate
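A minimal C sketch of how a handler can be installed through the vector table, assuming classic ARM vectors at 0x00000000 that hold instructions rather than plain addresses; the constant 0xe59ff000 is the encoding of the LDR pc, [pc, #offset] form above:

#include <stdint.h>

#define IRQ_VECTOR 0x18u  /* address of the IRQ entry in the vector table */

/* Write "LDR pc, [pc, #offset]" at the IRQ vector. The pc reads as
   vector + 8, so offset = (location holding the handler address) -
   vector - 8, and it must fit the instruction's 12-bit immediate. */
static void install_irq_vector(uint32_t handler_addr_location)
{
    uint32_t offset = handler_addr_location - IRQ_VECTOR - 8u;
    *(volatile uint32_t *)IRQ_VECTOR = 0xe59ff000u | offset;
}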
3b) Illustrate the concept of Interrupt Latency and strategy to reduce it.
Interrupt latency is the time interval from an external interrupt request signal being raised to the first fetch of an instruction of the corresponding interrupt service routine (ISR).
Interrupt latency depends on a combination of hardware and software.
The system designer must balance the design to handle multiple simultaneous interrupt sources while minimizing interrupt latency.
If the interrupts are not handled in a timely manner, then the system will exhibit slow response times.
Software handlers have two main methods to minimize interrupt latency.
1) Nested interrupt handler,
2) Prioritization.
Nested interrupt handler
A nested interrupt handler allows other interrupts to occur even while it is servicing an existing interrupt.
This is achieved by re-enabling the interrupts as soon as the interrupt source has been serviced but before the interrupt handling is complete.
Once a nested interrupt has been serviced, control is relinquished to the original interrupt service routine. Figure 4.3 shows a three-level nested interrupt.
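A minimal C sketch of this idea; read_interrupt_source, acknowledge_interrupt, and service are hypothetical stubs standing in for a real interrupt controller, and enable_irq is sketched under 3c:

#include <stdint.h>

/* Hypothetical controller interface - illustrative stubs only. */
static uint32_t read_interrupt_source(void) { return 0; }
static void acknowledge_interrupt(uint32_t src) { (void)src; }
static void service(uint32_t src) { (void)src; }

extern void enable_irq(void);  /* clears the IRQ mask bit; see 3c */

void nested_irq_handler(void)
{
    uint32_t src = read_interrupt_source();
    acknowledge_interrupt(src);  /* service/clear the source first   */
    enable_irq();                /* re-enable IRQ before completion  */
    service(src);                /* longer work can now be preempted */
}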
Prioritization
We can program the interrupt controller to ignore interrupts of the same or lower priority than the interrupt we are presently handling, so only a higher-priority task can interrupt our handler. We then re-enable the interrupts. The processor spends time in the lower-priority interrupts until a higher-priority interrupt occurs, so higher-priority interrupts have a lower average interrupt latency than lower-priority ones.
Prioritization reduces latency by speeding up the completion time of the critical, time-sensitive interrupts.
3c) Discuss Enabling and disabling of IRQ and FIQ via programming CPSR.
The ARM processor core has a simple procedure to manually enable and disable interrupts by
modifying the cpsr when the processor is in a privileged mode.
The procedure uses three ARM instructions:
1) The instruction MRS copies the contents of the cpsr into register r1.
2) The instruction BIC clears the IRQ or FIQ mask bit.
3) The instruction MSR then copies the updated contents of register r1 back into the cpsr, enabling the interrupt request.
Table 4.5 shows how IRQ and FIQ interrupts are enabled.
The postfix _c identifies that the bit field being updated is the control field, bits [7:0] of the cpsr.
Table 4.6 shows procedure to disable or mask an interrupt request.
To enable and disable both the IRQ and FIQ exceptions, the immediate value in the data processing BIC or ORR instruction has to be changed to 0xc0, as the sketch below shows.
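A minimal sketch in C with GCC-style inline assembly, assuming a privileged mode; bit 7 (0x80) masks IRQ and bit 6 (0x40) masks FIQ:

/* Enable IRQ: read the cpsr, clear the I bit (bit 7), and write the
   control field back (the compiler typically emits BIC for & ~mask). */
void enable_irq(void)
{
    unsigned int tmp;
    __asm__ volatile ("MRS %0, cpsr" : "=r"(tmp));
    __asm__ volatile ("MSR cpsr_c, %0" : : "r"(tmp & ~0x80u));
}

/* Disable IRQ: set the I bit so further IRQ requests are masked. */
void disable_irq(void)
{
    unsigned int tmp;
    __asm__ volatile ("MRS %0, cpsr" : "=r"(tmp));
    __asm__ volatile ("MSR cpsr_c, %0" : : "r"(tmp | 0x80u));
}

/* Use 0x40u for FIQ alone, or 0xc0u to affect IRQ and FIQ together. */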
The interrupt request is enabled or disabled only once the MSR instruction has completed the execution stage of the pipeline. Interrupts can still be raised or masked before the MSR completes this stage.
5a) Explain the basic architecture of cache memory.
The basic architecture of cache memory consists of three main parts for each cache line: a directory
store, a data section, and status information.
1. Directory Store (Cache-Tag):
The directory store, often referred to as the cache-tag, is a dedicated storage area within each cache
line that holds a portion of the main memory address, known as the tag. This tag serves to identify the
origin of the data stored in the cache line. When the processor requests data, the cache controller
compares the tag portion of the requested address with the stored cache-tag to determine if the
required data is already present in the cache (a cache hit) or not (a cache miss). This tag comparison is the fundamental lookup step that keeps cache accesses consistent with main memory.
2. Data Section:
The data section contains the actual data fetched from main memory and stored in the cache line.
When the processor accesses a specific memory location, it retrieves the entire cache line from this
data section, which includes multiple words (e.g., four 32-bit words in a line). Loading an entire line at
once exploits the principle of locality of reference, improving access times for subsequent data
requests within the same line. This organization ensures faster data retrieval compared to accessing
main memory directly.
3. Status Bits:
The cache line maintains several status bits that indicate its current state and integrity:
• Valid Bit: This bit indicates whether the cache line contains valid, usable data. If set to '1', the data in
this line is current and can be used by the processor. If it is '0', the data is invalid—possibly because it
has been invalidated or not yet initialized. The valid bit prevents the processor from using stale or
uninitialized data.
• Dirty Bit: This bit indicates whether the data in the cache line has been modified (written to) but not
yet written back to main memory. If the dirty bit is '1', it means the cache contains updated data that
must be written back to main memory before the cache line can be replaced or invalidated. This
ensures data consistency between the cache and main memory during cache line eviction or
replacement.
4. Cache Lines:
A cache is composed of multiple cache lines, each capable of storing a block of data (often multiple
words). Each line is identified by its position within the cache, organized to allow efficient lookup and
management. When a data request occurs, the cache controller identifies the appropriate line using
address fields, and then it compares tags and status bits to determine if the data can be used directly
or needs to be fetched from main memory.
5. Address Fields (Tag, Set Index, Data Index):
The address of a memory request is divided into several fields, each serving a specific purpose within
the cache architecture:
• Tag: The tag is a subset of the address used to identify the specific block in main memory that the
cache line may contain. During a cache lookup, the cache controller compares the address's tag with
the stored cache-tag to verify if the data corresponds to the requested address.
• Set Index: This field determines which set (or group) within the cache to examine. In set-associative or
direct-mapped caches, the set index narrows down the search to a specific subset of cache lines, thus
improving lookup efficiency.
• Data Index: The data index specifies the particular word, byte, or sub-word within the cache line. It
enables the cache controller to select the exact piece of data requested by the processor from within
the cache line.
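A minimal sketch of the address-field split in C, assuming a hypothetical cache with 16-byte lines (four 32-bit words) and 256 sets; the field widths follow from those sizes:

#include <stdint.h>

#define LINE_BYTES 16u   /* 4 words per line -> 4 data-index bits */
#define NUM_SETS   256u  /* 256 sets         -> 8 set-index bits  */

/* Split a 32-bit address into data index, set index, and tag. */
static void split_address(uint32_t addr,
                          uint32_t *tag, uint32_t *set, uint32_t *idx)
{
    *idx = addr % LINE_BYTES;               /* byte within the line */
    *set = (addr / LINE_BYTES) % NUM_SETS;  /* which set to examine */
    *tag = addr / (LINE_BYTES * NUM_SETS);  /* remaining high bits  */
}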
5b) With a neat block diagram explain associative cache (set-associative cache).
An associative cache, specifically a set-associative cache, is a type of cache memory designed to reduce
conflicts and improve hit rates compared to direct-mapped caches.
Key points about set-associative cache:
- The cache is divided into multiple sets.
- Each set contains a fixed number of cache lines, called "ways" (e.g., 4-way, 8-way).
- A memory address is divided into three fields: tag, set index, and data (or word) index.
- The set index determines which set in the cache might contain the data.
- Within that set, the cache checks multiple lines (ways) simultaneously to find a matching tag (using
hardware such as Content Addressable Memory, CAM).
- If a matching tag is found (a hit), the data is retrieved from that cache line.
- If no match (a miss), the cache replaces one of the lines in the set, usually using a replacement policy
like least recently used (LRU).
Advantages:
- Reduced conflict misses compared to direct-mapped caches.
- More flexible placement of data since a memory location can reside in any line within the set.
In Figure 12.8, the cache maps main memory blocks to any of four cache lines (ways) within a set. The set
index points to the group of lines, and the tag comparison determines the exact line containing the data.
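A minimal C sketch of a 4-way set-associative lookup; the sizes and structure names are illustrative assumptions, not a specific ARM cache:

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4u
#define SETS 256u

struct cache_line {
    uint32_t tag;    /* cache-tag from the directory store */
    bool     valid;  /* valid status bit                   */
    uint8_t  data[16];
};

static struct cache_line cache[SETS][WAYS];

/* Return true on a hit. The loop checks each way of the selected set;
   in hardware all tag comparisons happen in parallel (CAM). */
static bool lookup(uint32_t tag, uint32_t set)
{
    for (uint32_t way = 0; way < WAYS; way++) {
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;
    }
    return false;
}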