A Paper Presentation On
INSTRUCTION CACHE COMPRESSION
MODIFICATION OF THE PROCESSOR
We consider a 64-bit SimpleScalar instruction set, which is similar to MIPS. We
assume a simple 5-stage pipeline processor for ease of discussion. In order to compress
the instruction cache, decompression must be done after the cache. Therefore, the CPU
must be modified to include two additional functions:
Address translation from the uncompressed address to the compressed address (CLB)
Instruction decompression (DEC)
These two additional sections are added to the usual 5-stage pipeline as shown in
figure 2.
The CLB section implements address translation from the uncompressed address to the
compressed address for each instruction, and the DEC section implements instruction
decompression.
The Instruction Fetch (IF) stage now fetches the compressed instruction instead of the
uncompressed instruction. The stage lengths for CLB and DEC both depend on the
compression algorithm.
The additional CLB and DEC sections both sit before the Instruction Decode (ID) stage.
Thus, stages within these sections add to the branch penalty. Moreover, these penalties
cannot be reduced by branch prediction, since they occur before the ID stage, where the
instruction opcode is first identified.
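To make the stage arrangement concrete, the small C sketch below enumerates the seven
resulting stages and derives the branch penalty as the number of stages ahead of ID. The
enum, the names, and the printed output are illustrative only and are not part of the
original presentation.

#include <stdio.h>

/* Seven-stage pipeline from the text: CLB and DEC are added around IF,
   ahead of the usual ID/EX/MEM/WB stages.  Names are illustrative. */
enum stage { CLB, IF, DEC, ID, EX, MEM, WB, NUM_STAGES };

static const char *stage_name[NUM_STAGES] =
    { "CLB", "IF", "DEC", "ID", "EX", "MEM", "WB" };

int main(void)
{
    /* A taken branch is recognized in ID, so every stage ahead of ID
       holds a wrong-path instruction and must be flushed. */
    int branch_penalty = ID;          /* stages before ID: CLB, IF, DEC */

    printf("pipeline:");
    for (int s = 0; s < NUM_STAGES; s++)
        printf(" %s", stage_name[s]);
    printf("\nbranch penalty without BCC: %d cycles\n", branch_penalty);
    return 0;
}

Running this prints a branch penalty of three cycles, matching the discussion of the
increased branch penalty later in the text.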
COMPRESSION ALGORITHM
Problems: The compression algorithm used must allow simple CLB and DEC sections, or it
will lead to a large branch penalty. The problems with the usual CCRP compression
algorithm, which compresses byte-serially within each block, are:
For a 64-bit instruction, we require 8 DEC stages to decompress.
Granularity of random access.
With CCRP, random access is possible only at the block level. If the program counter
jumps from one compressed block to another and the target is the very last instruction in
that block, then all preceding instructions within the block must be decompressed before
the target instruction can be fetched. This leads to additional stages of delay.
Modified Algorithm: We devise such an algorithm. It can be thought of as a binary
Huffman code with only two compressed symbol lengths, but because of its similarity to
table lookup we refer to it as Table. It uses four tables, each with 256 16-bit entries.
Each 64-bit instruction is compressed individually as follows (sketched in C after the
steps below):
(i) Divide the 64-bit instruction into four 16-bit sections.
(ii) Search the contents of table 1 for section 1 of the instruction. If found, record
the entry number.
(iii) Repeat step (ii) for section 2 with table 2, and so on for the remaining sections.
(iv) If every section can be replaced by an entry number in its corresponding table, the
instruction is compressed. Otherwise, the instruction remains uncompressed.
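The steps above can be made concrete with the following C sketch. The text only states
that a compressed instruction is one whose four 16-bit sections are all found in their
tables; the 8-bit entry-number encoding, the section ordering, and the function names
below are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>

#define SECTIONS 4
#define ENTRIES  256

/* Four tables of 256 16-bit entries each, one per section; their
   contents are filled by a preprocessing step. */
uint16_t table[SECTIONS][ENTRIES];

/* Look up one 16-bit section in its table; return the entry number,
   or -1 if the value is not present. */
static int find_entry(int section, uint16_t value)
{
    for (int i = 0; i < ENTRIES; i++)
        if (table[section][i] == value)
            return i;
    return -1;
}

/* Compress one 64-bit instruction.  If all four sections are found in
   their tables, the instruction becomes four 8-bit entry numbers
   (32 bits in total) and true is returned; otherwise it stays
   uncompressed. */
bool compress_insn(uint64_t insn, uint32_t *compressed)
{
    uint8_t idx[SECTIONS];

    for (int s = 0; s < SECTIONS; s++) {
        /* Section 0 is taken here as the most significant 16 bits;
           the actual ordering is an assumption. */
        uint16_t sect = (uint16_t)(insn >> (48 - 16 * s));
        int e = find_entry(s, sect);
        if (e < 0)
            return false;          /* any miss leaves it uncompressed */
        idx[s] = (uint8_t)e;
    }

    *compressed = ((uint32_t)idx[0] << 24) | ((uint32_t)idx[1] << 16) |
                  ((uint32_t)idx[2] << 8)  |  (uint32_t)idx[3];
    return true;
}

/* Decompression is the inverse: four independent table lookups that
   hardware can perform in parallel, which is why one DEC stage
   suffices. */
uint64_t decompress_insn(uint32_t compressed)
{
    uint64_t insn = 0;
    for (int s = 0; s < SECTIONS; s++) {
        uint8_t e = (uint8_t)(compressed >> (24 - 8 * s));
        insn |= (uint64_t)table[s][e] << (48 - 16 * s);
    }
    return insn;
}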
A 96-bit LAT (Line Address Table) entry tracks every 64 instructions: 32 bits hold the
compressed address, and each of the remaining 64 bits is set or cleared depending on
whether the corresponding instruction is compressed or not.
Since this method tracks each instruction individually, the inter-block jump problem
disappears. For the DEC section, decompression is either nothing or four table lookups in
parallel, so one stage is sufficient. For the CLB section, translation is simply a CLB
lookup followed by some addition in a tree of adders. This can be done in one cycle, with
the lookup taking one half and the addition taking the other half.
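A sketch of the address translation this LAT format supports is given below. The
uncompressed instruction size of 8 bytes and compressed size of 4 bytes follow the text;
the struct layout, the bit ordering of the flag word, and the function names are
assumptions.

#include <stdint.h>

/* One LAT entry covers 64 consecutive instructions: 32 bits of
   compressed base address plus a 64-bit flag word whose bit i says
   whether instruction i of the group is compressed. */
struct lat_entry {
    uint32_t compressed_base;   /* compressed address of instruction 0 */
    uint64_t compressed_flags;  /* bit i set => instruction i compressed */
};

/* Translate an uncompressed PC into a compressed address.  Uncompressed
   instructions are 8 bytes, so pc / 8 is the instruction index;
   compressed instructions occupy 4 bytes, uncompressed ones 8. */
uint32_t clb_translate(const struct lat_entry *lat, uint32_t pc)
{
    uint32_t index = pc / 8;
    const struct lat_entry *e = &lat[index / 64];
    uint32_t offset = 0;

    /* Hardware reduces these up-to-63 sizes with a tree of adders in
       half a cycle; a loop is used here for clarity. */
    for (uint32_t i = 0; i < index % 64; i++)
        offset += ((e->compressed_flags >> i) & 1) ? 4 : 8;

    return e->compressed_base + offset;
}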
Details: With the Table compression algorithm, compressed instructions can be either 4 or
8 bytes long. This can lead to a misalignment problem for instruction cache access in the
case of an 8-byte instruction. For example, an 8-byte instruction can have its first 4
bytes at the end of one cache line and its last 4 bytes at the start of the next cache
line, which results in two cache accesses for a single instruction fetch. To solve this,
the instruction cache is divided into two banks, each 4 bytes wide. The first bank stores
the upper 4-byte half of the 64-bit word, and the second bank stores the lower half.
Since every 8-byte instruction must have its two 4-byte halves in different banks, simple
logic can be used to ensure each bank is accessed correctly and the instruction is fetched
in one cycle.
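The bank-selection logic can be sketched as follows. The row/bank arithmetic and the
names are assumptions made for illustration, with consecutive 4-byte words assumed to
alternate between the two banks.

#include <stdbool.h>
#include <stdint.h>

/* Two banks, each 4 bytes wide; consecutive 4-byte words are assumed
   to alternate between bank 0 and bank 1. */
struct bank_access {
    uint32_t row[2];   /* row to read in bank 0 and in bank 1 */
    bool     swap;     /* true if the first half comes from bank 1 */
};

/* Plan the fetch of an instruction starting at 4-byte-aligned
   compressed address addr.  An 8-byte instruction whose first half
   sits in bank 1 has its second half in bank 0 of the next row, so
   the two halves never collide in one bank and both banks can be read
   in the same cycle. */
struct bank_access plan_fetch(uint32_t addr, bool eight_bytes)
{
    uint32_t word = addr / 4;              /* 4-byte word index */
    struct bank_access a;

    a.swap   = (word & 1) != 0;            /* odd word: starts in bank 1 */
    a.row[0] = (a.swap && eight_bytes) ? word / 2 + 1 : word / 2;
    a.row[1] = word / 2;
    return a;
}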
Determining the contents of the four tables is critical to achieving good compression
with this method. Between them, the four tables can represent 256^4, or about 4 billion,
distinct instructions. However, for every entry within a table, 256^3, or about 16
million, possible instructions take on that same section value, so it is quantity without
quality.
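The text leaves the choice of table contents open. One plausible preprocessing heuristic,
sketched below purely as an assumption, is to profile the program and keep the 256 most
frequent 16-bit values seen at each section position.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTIONS 4
#define ENTRIES  256
#define VALUES   65536       /* all possible 16-bit section values */

/* Fill table[s] with the 256 most frequent values appearing in section
   s across the program's n instructions (a simple greedy choice; the
   text does not specify the selection method). */
void build_tables(const uint64_t *insns, size_t n,
                  uint16_t table[SECTIONS][ENTRIES])
{
    static uint32_t hist[SECTIONS][VALUES];
    memset(hist, 0, sizeof(hist));

    /* Histogram each section position separately. */
    for (size_t i = 0; i < n; i++)
        for (int s = 0; s < SECTIONS; s++)
            hist[s][(uint16_t)(insns[i] >> (48 - 16 * s))]++;

    /* Select by repeated maximum; adequate for a one-time preprocessing
       pass, and it makes no claim of optimality. */
    for (int s = 0; s < SECTIONS; s++)
        for (int e = 0; e < ENTRIES; e++) {
            uint32_t best = 0, best_count = 0;
            for (uint32_t v = 0; v < VALUES; v++)
                if (hist[s][v] > best_count) {
                    best_count = hist[s][v];
                    best = v;
                }
            table[s][e] = (uint16_t)best;
            hist[s][best] = 0;   /* do not pick the same value twice */
        }
}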
Results for the Modified Algorithm
Figure 3 shows the compression results for Table compression. On average, Table
compresses to 70% of the original size. In comparison to CCRP with Huffman byte-serial
compression at a block size of 8, we lose about 20% on average. This 20% loss is partly
due to the finer granularity of random access we have achieved; in general, the finer the
granularity of random access, the less compression is achieved. This can be observed
again with Gzip's result, which does not provide any random access: on average it
compresses to 20% of the original size, which is 30% better than CCRP with Huffman.
BRANCH COMPENSATION CACHE (BCC)
Problem: As mentioned earlier, in order to perform decompression after the instruction
cache (between the I-cache and the CPU), we add two more stages to the pipeline before
ID: CLB (Cache Lookaside Buffer) and DEC (decompression). As a result, the branch penalty
increases to three cycles. This is illustrated in figure 4.
Figure 4: Increased Branch Penalty
Originally, in the regular five-stage pipeline, the branch penalty is only one stage
(IF); in our seven-stage pipeline, the branch penalty increases to three stages (CLB, IF,
DEC). This is clearly undesirable, so we need a solution.
Solution: The solution is to add a branch compensation cache (BCC) and try to pre-store
the target instructions there. Whenever we encounter a branch (or, more precisely, a PC
jump), we go to the BCC and check whether the target instruction is there. The basic idea
is shown in figure 5.
Figure 5: Using the BCC to Reduce Branch Penalty
In figure 5, a PC jump is detected at the ID stage, so we check the BCC to see if the
target instruction is pre-stored there. If not, we have to go through the CLB, IF and DEC
pipeline stages; if it is, we simply fetch it and keep going, with no branch penalty at
all in that case.
Figure 6: BCC Implementation
Figure 6 shows a more detailed pipeline implementation, where the key points
are:
At the ID stage we find that a jump instruction is taken, so we
must flush the current pipeline and restart.
The restarted pipeline is used to get the target instruction
through the normal stages, i.e. using CLB to get the compressed
address from the regular address, using IF to fetch the
compressed instruction, and using DEC to decompress it into the
target instruction. Simultaneously, we also go to the BCC to
check whether the required target instruction is pre-stored
there. If not, we keep running the restarted pipeline; if it is,
we fetch the target instruction directly from the BCC and
flush/restart the pipeline a second time.
The second restarted pipeline provides the sequential instructions after
the jump. In order to completely eliminate the penalty incurred by a jump, we
require that the instructions at T, T+8, T+16 and T+24 are all pre-stored in the
BCC on a hit. The second restarted pipeline therefore starts its CLB stage with
the instruction at T+32, where T is the target address, T+8 is the address of the
next instruction, and so on. A sketch of this BCC lookup is given below.
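The sketch assumes a direct-mapped BCC whose 8-byte blocks each hold one decompressed
instruction tagged by its uncompressed address, as in the results below; the block count,
names, and fill policy are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>

#define BCC_BLOCKS 256       /* illustrative; figure 7 sweeps this number */

/* One BCC block holds a single decompressed 8-byte instruction, tagged
   by its uncompressed address. */
struct bcc_block {
    bool     valid;
    uint32_t tag;            /* uncompressed instruction address */
    uint64_t insn;           /* decompressed instruction */
};

static struct bcc_block bcc[BCC_BLOCKS];

static struct bcc_block *bcc_block_for(uint32_t addr)
{
    return &bcc[(addr / 8) % BCC_BLOCKS];    /* direct mapped */
}

/* Called when ID detects a taken jump to uncompressed address target.
   The jump penalty is fully hidden only if the instructions at T, T+8,
   T+16 and T+24 are all pre-stored; on such a hit they are returned
   and the second restarted pipeline begins its CLB stage at T + 32. */
bool bcc_lookup(uint32_t target, uint64_t insn_out[4], uint32_t *restart_pc)
{
    for (int i = 0; i < 4; i++) {
        uint32_t addr = target + 8 * i;
        struct bcc_block *b = bcc_block_for(addr);

        if (!b->valid || b->tag != addr)
            return false;    /* miss: keep the first restarted pipeline */
        insn_out[i] = b->insn;
    }
    *restart_pc = target + 32;   /* next sequential instruction after T+24 */
    return true;
}

/* Fill one block once the normal CLB/IF/DEC path has decompressed an
   instruction on the target path, so a later jump to it can hit. */
void bcc_fill(uint32_t addr, uint64_t insn)
{
    struct bcc_block *b = bcc_block_for(addr);
    b->valid = true;
    b->tag   = addr;
    b->insn  = insn;
}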
Result: To quantify the BCC performance, we observed several applications, such as ijpeg
and wave5.
Figure 7: BCC Performance
Here the cache block size is kept at 8 bytes, while the number of blocks varies over 1,
2, 4, 8, ..., 4096 to give different cache sizes. A direct-mapped cache is used and the
replacement policy is least recently used (LRU). From the observed results we can see
that for a fairly small branch compensation cache (1 KB to 2 KB), the branch penalty is
significantly reduced, which indicates that our approach is very effective.
As far as compressed cache performance is concerned, the compressed cache with the Table
algorithm does outperform the uncompressed cache, especially within the region of
instruction cache sizes where the uncompressed cache gets a 10% to 90% hit rate. The
improvement can reach nearly 40%. This is as expected, since the compressed cache
contains more instructions, which translates into a better hit rate. The reason there is
a window of sizes where cache compression is more effective is that this is the thrashing
region, where a small increase in cache size can give a big improvement. Compression in
the cache acts like a virtual increase in cache size: for example, compressing to 70% of
the original size gives a cache that effectively holds about 1/0.7 ≈ 1.4 times as many
instructions. Therefore cache compression is especially effective within this thrashing
region, and the original idea that a compressed instruction cache can lead to a smaller
instruction cache is valid.
CONCLUSIONS
We observe a tradeoff between the granularity of random access and the compression rate:
finer granularity implies less compression. To avoid an excessive branch penalty, a finer
granularity of random access than CCRP's is forced on us in order to compress the cache.
Consequently, compression of the instruction ROM is worse than in the CCRP scenario. This
drawback offsets any saving in the instruction cache, except in the region where the
instruction cache is large relative to the overall system. Instruction cache compression
is effective in embedded systems, in areas where cache performance is more important than
instruction ROM die area. For example, in high-performance computing, a large instruction
cache is useful for improving the hit rate but painful for critical-path analysis. Here,
with instruction cache compression, one could potentially provide cache performance
corresponding to 1/(compression rate - LAT) times the physical size.
Our current Table method can be improved as follows: Provide optimal table entries, which
would enable us to compress more instructions. Since the table entry calculations are
done in preprocessing, the potentially exponential amount of time required may not be a
problem. If it does become a problem, better heuristics can certainly be devised.
Provide multiple levels of tables, which can give more compression. The current Table
method has a lower limit on compression of about 50% of the original size. If multiple
levels of tables are used, for example 4-entry, 16-entry and 256-entry tables, then we
can try to compress with the 4-entry tables first; if that fails, we use the next level
of tables. The advantage here is that tables with fewer entries can compress more, since
an index into such a table needs fewer bits. A sketch of this multi-level scheme is given
below.
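The sketch assumes three levels of 4-, 16-, and 256-entry tables tried smallest first,
with 2-, 4-, and 8-bit indices per section; the bits needed to tag the chosen level are
ignored, and all names are illustrative, since the text only outlines the scheme.

#include <stdint.h>

#define SECTIONS 4
#define LEVELS   3

/* Levels are tried smallest first: a hit in a smaller table needs
   fewer index bits per section and therefore compresses more. */
static const struct {
    int entries;      /* 4, 16 or 256 entries per table */
    int index_bits;   /* 2, 4 or 8 bits per 16-bit section */
} level[LEVELS] = { {4, 2}, {16, 4}, {256, 8} };

/* Return the number of bits an instruction would occupy under the
   multi-level scheme: the smallest level whose tables contain all four
   sections wins, otherwise the instruction stays at 64 bits.  Bits for
   marking the chosen level are ignored for simplicity. */
int compressed_bits(uint64_t insn,
                    const uint16_t tables[LEVELS][SECTIONS][256])
{
    for (int l = 0; l < LEVELS; l++) {
        int ok = 1;
        for (int s = 0; s < SECTIONS && ok; s++) {
            uint16_t sect = (uint16_t)(insn >> (48 - 16 * s));
            ok = 0;
            for (int e = 0; e < level[l].entries; e++)
                if (tables[l][s][e] == sect) { ok = 1; break; }
        }
        if (ok)
            return SECTIONS * level[l].index_bits;   /* 8, 16 or 32 bits */
    }
    return 64;                                       /* stays uncompressed */
}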