A Paper Presentation On
INSTRUCTION CACHE COMPRESSION
MODIFICATION OF THE PROCESSOR
We consider a 64-bit SimpleScalar instruction set, which is similar to MIPS. We
assume a simple 5-stage pipeline processor for ease of discussion. In order to compress
the instruction cache, decompression must be done after the cache. Therefore, the CPU
must be modified to include two additional functions:
Address translation from the uncompressed address to the compressed address (CLB)
Instruction decompression (DEC)
These two additional sections are added to the usual 5-stage pipeline as shown in
figure 2.
The CLB section implements address translation from the uncompressed address to the
compressed address for each instruction, and the DEC section implements instruction
decompression.
The Instruction Fetch (IF) stage now fetches the compressed instruction instead of the
uncompressed instruction. The stage lengths for CLB and DEC both depend on the
compression algorithm.
The additional CLB and DEC sections both sit before the Instruction Decode (ID) stage.
Thus, stages within these sections add to the branch penalty. Moreover, these penalties
cannot be reduced by branch prediction, since they occur before the ID stage, where the
instruction opcode is first identified.
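To make the stage arrangement concrete, the small C sketch below enumerates the seven
resulting stages and derives the branch penalty as the number of stages ahead of ID. The
enum, the names, and the printed output are illustrative only and are not part of the
original presentation.

#include <stdio.h>

/* Seven-stage pipeline from the text: CLB and DEC are added around IF,
   ahead of the usual ID/EX/MEM/WB stages.  Names are illustrative. */
enum stage { CLB, IF, DEC, ID, EX, MEM, WB, NUM_STAGES };

static const char *stage_name[NUM_STAGES] =
    { "CLB", "IF", "DEC", "ID", "EX", "MEM", "WB" };

int main(void)
{
    /* A taken branch is recognized in ID, so every stage ahead of ID
       holds a wrong-path instruction and must be flushed. */
    int branch_penalty = ID;          /* stages before ID: CLB, IF, DEC */

    printf("pipeline:");
    for (int s = 0; s < NUM_STAGES; s++)
        printf(" %s", stage_name[s]);
    printf("\nbranch penalty without BCC: %d cycles\n", branch_penalty);
    return 0;
}

Running this prints a branch penalty of three cycles, matching the discussion of the
increased branch penalty later in the text.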
COMPRESSION ALGORITHM
Problems: The compression algorithm used must allow simple CLB and DEC sections, or it
will lead to a large branch penalty. The problems with the usual CCRP compression
algorithm, which compresses byte-serially within each block, are:
For a 64-bit instruction, we require 8 DEC stages to decompress.
Granularity of random access.
With CCRP, random access is possible only at the block level. If the program counter
jumps from one compressed block to another and the target is the very last instruction in
that block, then all preceding instructions within the block must be decompressed before
the target instruction can be fetched. This leads to additional stages of delay.
Modified Algorithm: We devise such an algorithm. It can be thought of as a binary
Huffman code with only two compressed symbol lengths, but because of its similarity to
table lookup we refer to it as Table. It uses four tables, each with 256 16-bit entries.
Each 64-bit instruction is compressed individually as follows (sketched in C after the
steps below):
(i) Divide the 64-bit instruction into four 16-bit sections.
(ii) Search the contents of table 1 for section 1 of the instruction. If found, record
the entry number.
(iii) Repeat step (ii) for section 2 with table 2, and so on for the remaining sections.
(iv) If every section can be replaced by an entry number in its corresponding table, the
instruction is compressed. Otherwise, the instruction remains uncompressed.
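The steps above can be made concrete with the following C sketch. The text only states
that a compressed instruction is one whose four 16-bit sections are all found in their
tables; the 8-bit entry-number encoding, the section ordering, and the function names
below are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>

#define SECTIONS 4
#define ENTRIES  256

/* Four tables of 256 16-bit entries each, one per section; their
   contents are filled by a preprocessing step. */
uint16_t table[SECTIONS][ENTRIES];

/* Look up one 16-bit section in its table; return the entry number,
   or -1 if the value is not present. */
static int find_entry(int section, uint16_t value)
{
    for (int i = 0; i < ENTRIES; i++)
        if (table[section][i] == value)
            return i;
    return -1;
}

/* Compress one 64-bit instruction.  If all four sections are found in
   their tables, the instruction becomes four 8-bit entry numbers
   (32 bits in total) and true is returned; otherwise it stays
   uncompressed. */
bool compress_insn(uint64_t insn, uint32_t *compressed)
{
    uint8_t idx[SECTIONS];

    for (int s = 0; s < SECTIONS; s++) {
        /* Section 0 is taken here as the most significant 16 bits;
           the actual ordering is an assumption. */
        uint16_t sect = (uint16_t)(insn >> (48 - 16 * s));
        int e = find_entry(s, sect);
        if (e < 0)
            return false;          /* any miss leaves it uncompressed */
        idx[s] = (uint8_t)e;
    }

    *compressed = ((uint32_t)idx[0] << 24) | ((uint32_t)idx[1] << 16) |
                  ((uint32_t)idx[2] << 8)  |  (uint32_t)idx[3];
    return true;
}

/* Decompression is the inverse: four independent table lookups that
   hardware can perform in parallel, which is why one DEC stage
   suffices. */
uint64_t decompress_insn(uint32_t compressed)
{
    uint64_t insn = 0;
    for (int s = 0; s < SECTIONS; s++) {
        uint8_t e = (uint8_t)(compressed >> (24 - 8 * s));
        insn |= (uint64_t)table[s][e] << (48 - 16 * s);
    }
    return insn;
}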
A 96-bit LAT (Line Address Table) entry tracks every 64 instructions: 32 bits hold the
compressed address, and each of the remaining 64 bits is set or cleared depending on
whether the corresponding instruction is compressed or not.
Since this method tracks each instruction individually, the inter-block jump problem
disappears. For the DEC section, decompression is either nothing or four table lookups in
parallel, so one stage is sufficient. For the CLB section, translation is simply a CLB
lookup followed by some addition in a tree of adders. This can be done in one cycle, with
the lookup taking one half and the addition taking the other half.
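A sketch of the address translation this LAT format supports is given below. The
uncompressed instruction size of 8 bytes and compressed size of 4 bytes follow the text;
the struct layout, the bit ordering of the flag word, and the function names are
assumptions.

#include <stdint.h>

/* One LAT entry covers 64 consecutive instructions: 32 bits of
   compressed base address plus a 64-bit flag word whose bit i says
   whether instruction i of the group is compressed. */
struct lat_entry {
    uint32_t compressed_base;   /* compressed address of instruction 0 */
    uint64_t compressed_flags;  /* bit i set => instruction i compressed */
};

/* Translate an uncompressed PC into a compressed address.  Uncompressed
   instructions are 8 bytes, so pc / 8 is the instruction index;
   compressed instructions occupy 4 bytes, uncompressed ones 8. */
uint32_t clb_translate(const struct lat_entry *lat, uint32_t pc)
{
    uint32_t index = pc / 8;
    const struct lat_entry *e = &lat[index / 64];
    uint32_t offset = 0;

    /* Hardware reduces these up-to-63 sizes with a tree of adders in
       half a cycle; a loop is used here for clarity. */
    for (uint32_t i = 0; i < index % 64; i++)
        offset += ((e->compressed_flags >> i) & 1) ? 4 : 8;

    return e->compressed_base + offset;
}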
Details: With the Table compression algorithm, compressed instructions can be either 4 or
8 bytes long. This can lead to a misalignment problem for instruction cache access in the
case of an 8-byte instruction. For example, an 8-byte instruction can have its first 4
bytes at the end of one cache line and its last 4 bytes at the start of the next cache
line, which results in two cache accesses for a single instruction fetch. To solve this,
the instruction cache is divided into two banks, each 4 bytes wide. The first bank stores
the upper 4-byte half of the 64-bit word, and the second bank stores the lower half.
Since every 8-byte instruction must have its two 4-byte halves in different banks, simple
logic can be used to ensure each bank is accessed correctly and the instruction is fetched
in one cycle.
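The bank-selection logic can be sketched as follows. The row/bank arithmetic and the
names are assumptions made for illustration, with consecutive 4-byte words assumed to
alternate between the two banks.

#include <stdbool.h>
#include <stdint.h>

/* Two banks, each 4 bytes wide; consecutive 4-byte words are assumed
   to alternate between bank 0 and bank 1. */
struct bank_access {
    uint32_t row[2];   /* row to read in bank 0 and in bank 1 */
    bool     swap;     /* true if the first half comes from bank 1 */
};

/* Plan the fetch of an instruction starting at 4-byte-aligned
   compressed address addr.  An 8-byte instruction whose first half
   sits in bank 1 has its second half in bank 0 of the next row, so
   the two halves never collide in one bank and both banks can be read
   in the same cycle. */
struct bank_access plan_fetch(uint32_t addr, bool eight_bytes)
{
    uint32_t word = addr / 4;              /* 4-byte word index */
    struct bank_access a;

    a.swap   = (word & 1) != 0;            /* odd word: starts in bank 1 */
    a.row[0] = (a.swap && eight_bytes) ? word / 2 + 1 : word / 2;
    a.row[1] = word / 2;
    return a;
}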
Determining the contents of the four tables is critical to achieving good compression
with this method. Between them, the four tables can represent 256^4, or about 4 billion,
distinct instructions. However, for every entry within a table, 256^3, or about 16
million, possible instructions take on that same section value, so it is quantity without
quality.
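The text leaves the choice of table contents open. One plausible preprocessing heuristic,
sketched below purely as an assumption, is to profile the program and keep the 256 most
frequent 16-bit values seen at each section position.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTIONS 4
#define ENTRIES  256
#define VALUES   65536       /* all possible 16-bit section values */

/* Fill table[s] with the 256 most frequent values appearing in section
   s across the program's n instructions (a simple greedy choice; the
   text does not specify the selection method). */
void build_tables(const uint64_t *insns, size_t n,
                  uint16_t table[SECTIONS][ENTRIES])
{
    static uint32_t hist[SECTIONS][VALUES];
    memset(hist, 0, sizeof(hist));

    /* Histogram each section position separately. */
    for (size_t i = 0; i < n; i++)
        for (int s = 0; s < SECTIONS; s++)
            hist[s][(uint16_t)(insns[i] >> (48 - 16 * s))]++;

    /* Select by repeated maximum; adequate for a one-time preprocessing
       pass, and it makes no claim of optimality. */
    for (int s = 0; s < SECTIONS; s++)
        for (int e = 0; e < ENTRIES; e++) {
            uint32_t best = 0, best_count = 0;
            for (uint32_t v = 0; v < VALUES; v++)
                if (hist[s][v] > best_count) {
                    best_count = hist[s][v];
                    best = v;
                }
            table[s][e] = (uint16_t)best;
            hist[s][best] = 0;   /* do not pick the same value twice */
        }
}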
Results for the Modified Algorithm
Figure 3 shows the compression results for Table compression. On average, Table
compresses to 70% of the original size. In comparison to CCRP with Huffman byte-serial
compression at a block size of 8, we lose about 20% on average. This 20% loss is partly
due to the finer granularity of random access we have achieved; in general, the finer the
granularity of random access, the less compression is achieved. This can be observed
again with Gzip's result, which does not provide any random access: on average it
compresses to 20% of the original size, which is 30% better than CCRP with Huffman.
BRANCH COMPENSATION CACHE (BCC)
Problem: As mentioned earlier, in order to perform decompression after the instruction
cache (between the I-cache and the CPU), we add two more stages to the pipeline before
ID: CLB (Cache Lookaside Buffer) and DEC (decompression). As a result, the branch penalty
increases to three cycles. This is illustrated in figure 4.
Figure 4: Increased Branch Penalty
Originally, in the regular five-stage pipeline, the branch penalty is only one stage
(IF); in our seven-stage pipeline, the branch penalty increases to three stages (CLB, IF,
DEC). This is clearly undesirable, so we need a solution.
Solution: The solution is to add a branch compensation cache (BCC) and try to pre-store
the target instructions there. Whenever we encounter a branch (or, more precisely, a PC
jump), we go to the BCC and check whether the target instruction is there. The basic idea
is shown in figure 5.
Figure 5: Using the BCC to Reduce Branch Penalty
In figure 5, a PC jump is detected at the ID stage, so we check the BCC to see if the
target instruction is pre-stored there. If not, we have to go through the CLB, IF and DEC
pipeline stages; if it is, we simply fetch it and keep going, with no branch penalty at
all in that case.
Figure 6: BCC Implementation
Figure 6 shows a more detailed pipeline implementation, where the key points
are:
At the ID stage we find that a jump instruction is taken, so we
must flush the current pipeline and restart.
The restarted pipeline is used to get the target instruction
through the normal stages, i.e. using CLB to get the compressed
address from the regular address, using IF to fetch the
compressed instruction, and using DEC to decompress it into the
target instruction. Simultaneously, we also go to the BCC to
check whether the required target instruction is pre-stored
there. If not, we keep running the restarted pipeline; if it is,
we fetch the target instruction directly from the BCC and
flush/restart the pipeline a second time.
The second restarted pipeline provides the sequential instructions after
the jump. In order to completely eliminate the penalty incurred by a jump, we
require that the instructions at T, T+8, T+16 and T+24 are all pre-stored in the
BCC on a hit. The second restarted pipeline therefore starts its CLB stage with
the instruction at T+32, where T is the target address, T+8 is the address of the
next instruction, and so on. A sketch of this BCC lookup is given below.
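The sketch assumes a direct-mapped BCC whose 8-byte blocks each hold one decompressed
instruction tagged by its uncompressed address, as in the results below; the block count,
names, and fill policy are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>

#define BCC_BLOCKS 256       /* illustrative; figure 7 sweeps this number */

/* One BCC block holds a single decompressed 8-byte instruction, tagged
   by its uncompressed address. */
struct bcc_block {
    bool     valid;
    uint32_t tag;            /* uncompressed instruction address */
    uint64_t insn;           /* decompressed instruction */
};

static struct bcc_block bcc[BCC_BLOCKS];

static struct bcc_block *bcc_block_for(uint32_t addr)
{
    return &bcc[(addr / 8) % BCC_BLOCKS];    /* direct mapped */
}

/* Called when ID detects a taken jump to uncompressed address target.
   The jump penalty is fully hidden only if the instructions at T, T+8,
   T+16 and T+24 are all pre-stored; on such a hit they are returned
   and the second restarted pipeline begins its CLB stage at T + 32. */
bool bcc_lookup(uint32_t target, uint64_t insn_out[4], uint32_t *restart_pc)
{
    for (int i = 0; i < 4; i++) {
        uint32_t addr = target + 8 * i;
        struct bcc_block *b = bcc_block_for(addr);

        if (!b->valid || b->tag != addr)
            return false;    /* miss: keep the first restarted pipeline */
        insn_out[i] = b->insn;
    }
    *restart_pc = target + 32;   /* next sequential instruction after T+24 */
    return true;
}

/* Fill one block once the normal CLB/IF/DEC path has decompressed an
   instruction on the target path, so a later jump to it can hit. */
void bcc_fill(uint32_t addr, uint64_t insn)
{
    struct bcc_block *b = bcc_block_for(addr);
    b->valid = true;
    b->tag   = addr;
    b->insn  = insn;
}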
Result: To quantify the BCC performance, we observed several applications, such as ijpeg
and wave5.
Figure 7: BCC Performance
Here the cache block size is kept at 8 bytes, while the number of blocks varies over 1,
2, 4, 8, ..., 4096 to give different cache sizes. A direct-mapped cache is used and the
replacement policy is least recently used (LRU). From the observed results we can see
that for a fairly small branch compensation cache (1 KB to 2 KB), the branch penalty is
significantly reduced, which indicates that our approach is very effective.
As far as compressed cache performance is concerned, the compressed cache with the Table
algorithm does outperform the uncompressed cache, especially within the region of
instruction cache sizes where the uncompressed cache gets a 10% to 90% hit rate. The
improvement can reach nearly 40%. This is as expected, since the compressed cache
contains more instructions, which translates into a better hit rate. The reason there is
a window of sizes where cache compression is more effective is that this is the thrashing
region, where a small increase in cache size can give a big improvement. Compression in
the cache acts like a virtual increase in cache size: for example, compressing to 70% of
the original size gives a cache that effectively holds about 1/0.7 ≈ 1.4 times as many
instructions. Therefore cache compression is especially effective within this thrashing
region, and the original idea that a compressed instruction cache can lead to a smaller
instruction cache is valid.
CONCLUSIONS
We observe a tradeoff between the granularity of random access and the compression rate:
finer granularity implies less compression. To avoid an excessive branch penalty, a finer
granularity of random access than CCRP's is forced on us in order to compress the cache.
Consequently, compression of the instruction ROM is worse than in the CCRP scenario. This
drawback offsets any saving in the instruction cache, except in the region where the
instruction cache is large relative to the overall system. Instruction cache compression
is effective in embedded systems, in areas where cache performance is more important than
instruction ROM die area. For example, in high-performance computing, a large instruction
cache is useful for improving the hit rate but painful for critical-path analysis. Here,
with instruction cache compression, one could potentially provide cache performance
corresponding to 1/(compression rate - LAT) times the physical size.
Our current Table method can be improved as follows: Provide optimal table entries, which
would enable us to compress more instructions. Since the table entry calculations are
done in preprocessing, the potentially exponential amount of time required may not be a
problem. If it does become a problem, better heuristics can certainly be devised.
Provide multiple levels of tables, which can give more compression. The current Table
method has a lower limit on compression of about 50% of the original size. If multiple
levels of tables are used, for example 4-entry, 16-entry and 256-entry tables, then we
can try to compress with the 4-entry tables first; if that fails, we use the next level
of tables. The advantage here is that tables with fewer entries can compress more, since
an index into such a table needs fewer bits. A sketch of this multi-level scheme is given
below.
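The sketch assumes three levels of 4-, 16-, and 256-entry tables tried smallest first,
with 2-, 4-, and 8-bit indices per section; the bits needed to tag the chosen level are
ignored, and all names are illustrative, since the text only outlines the scheme.

#include <stdint.h>

#define SECTIONS 4
#define LEVELS   3

/* Levels are tried smallest first: a hit in a smaller table needs
   fewer index bits per section and therefore compresses more. */
static const struct {
    int entries;      /* 4, 16 or 256 entries per table */
    int index_bits;   /* 2, 4 or 8 bits per 16-bit section */
} level[LEVELS] = { {4, 2}, {16, 4}, {256, 8} };

/* Return the number of bits an instruction would occupy under the
   multi-level scheme: the smallest level whose tables contain all four
   sections wins, otherwise the instruction stays at 64 bits.  Bits for
   marking the chosen level are ignored for simplicity. */
int compressed_bits(uint64_t insn,
                    const uint16_t tables[LEVELS][SECTIONS][256])
{
    for (int l = 0; l < LEVELS; l++) {
        int ok = 1;
        for (int s = 0; s < SECTIONS && ok; s++) {
            uint16_t sect = (uint16_t)(insn >> (48 - 16 * s));
            ok = 0;
            for (int e = 0; e < level[l].entries; e++)
                if (tables[l][s][e] == sect) { ok = 1; break; }
        }
        if (ok)
            return SECTIONS * level[l].index_bits;   /* 8, 16 or 32 bits */
    }
    return 64;                                       /* stays uncompressed */
}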