Ullah 2015

The document presents UE-TCAM, an ultra-efficient SRAM-based ternary content-addressable memory (TCAM) architecture designed to address the limitations of traditional TCAMs, such as high power consumption, low storage density, and complexity. The proposed design achieves significant reductions in resource utilization, energy consumption, and latency, while improving speed and throughput, making it suitable for various applications including networking and pattern recognition. The architecture utilizes hybrid partitioning to optimize memory usage and performance, demonstrating a promising alternative to existing SRAM-based TCAM designs.


UE-TCAM: An Ultra Efficient SRAM-based TCAM

Zahid Ullah(1), Manish K. Jaiswal(2), Ray C.C. Cheung(3), and Hayden K.H. So(4)
(1) Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Peshawar, Pakistan
(2, 4) Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong
(3) Department of Electronic Engineering, City University of Hong Kong, Hong Kong
Emails: [email protected](1), [email protected](2), [email protected](3), [email protected](4)

978-1-4799-8641-5/15/$31.00 ©2015 IEEE

Abstract-Ternary content-addressable memories (TCAMs) are high-speed memories; however, compared to static random-access memories (SRAMs), TCAMs suffer from low storage density, relatively slow access time, poor scalability, complex circuitry, and higher cost. To access the benefits of SRAM, several SRAM-based TCAMs, specifically on field-programmable gate array (FPGA) platforms, have been proposed. To further improve the performance of SRAM-based TCAMs, this paper presents UE-TCAM, which reduces memory requirement, latency, and power consumption, and improves speed. An example 512 × 36 design of UE-TCAM has been implemented on a Xilinx Virtex-6 FPGA. Performance evaluation confirms a significant improvement in the proposed UE-TCAM, which achieves a 100% reduction in 18K B-RAMs, 74.67% reduction in SRs, 70.28% reduction in LUTs, 75.76% reduction in energy-delay product, and 60% reduction in latency, and improves speed by 70.85%, compared with the available SRAM-based TCAM.

I. INTRODUCTION

Ternary content-addressable memory (TCAM) provides access to stored data by contents (data word) rather than by an address and outputs the match address. CAM searches its entire memory concurrently to check if that data word is stored anywhere in CAM memory. CAM returns a list of one or more storage addresses where the word was found. The fast search feature is the main reason for using a CAM. The search operation can also be performed in regular random-access memory (RAM) by iteratively reading and comparing entire RAM entries for every search request. As a result, the search time using RAM is significantly longer than that of CAM for the same search request.

The high-speed search operation makes CAM an attractive choice for applications requiring high-speed search such as local-area networks, database management, pattern recognition, and artificial intelligence [1]. Recent applications include real-time pattern matching in virus-detection and intrusion-detection systems, gene pattern searching in bioinformatics, data compression, and image processing [2].

A. Problem statement

Although CAM technology offers the major advantage over standard RAM of a deterministic comparison in constant time, it also has shortcomings. For parallel search operation, CAM needs comparison circuitry in each cell, which dictates that CAM density lags RAM density. A typical TCAM cell has two SRAM cells and a comparison circuit. A table of size $2^n \times w$ needs $2^{n+1} \times w$ SRAM cells and $2^n \times w$ comparison circuits, one for each TCAM cell. For large values of n, the TCAM size increases, which results in prohibitive power consumption, size, and cost, thus nullifying its advantage of high-speed lookup. The comparison circuitry in each cell not only makes TCAM expensive but also adds complexity to the TCAM architecture. The extra logic and capacitive loading due to the massive parallelism lengthen the access time of TCAM, which is over 3.3 times longer than the access time of SRAM [3].

Furthermore, TCAM is not subjected to the intense commercial competition found in the RAM market [4] and has yet to gain a substantial market share. TCAMs are expensive not only due to their low memory cell density but also due to their insignificant market demand, which means they are not mass-produced to drive their cost down. The cost of TCAM is about 30 times more per bit of storage than that of SRAM [3]. In addition, inherited architectural barriers also limit its total chip capacity. Complex integration of memory and logic also makes TCAM testing very time consuming [2].

CAMs have limited pattern retrieval capacity, and CAM technology does not evolve as fast as RAM technology. RAM technology is driven by many applications, particularly computers and consumer electronic products; hence, its cost per bit continuously decreases. CAM technology, by contrast, is considered specialized, and only a modest increase in bit capacity and a modest decrease in cost may be expected in future [5]. TCAM does not scale well in terms of clock rate, power consumption, or chip density, whereas SRAM is scalable and less complex. The throughput of classical TCAMs is also limited by the relatively low speed of TCAMs [6].

B. Motivations and contributions

Field-programmable gate arrays (FPGAs) are widely used in different applications [7] such as image processing [8], [9], networking systems [10], [11], and cryptography computations [12], [13], owing to several benefits such as reconfigurability, massive hardware parallelism, and rapid prototyping capability. SRAM-based FPGAs such as Xilinx Virtex-6 and Virtex-7 [14] provide a high clock rate and a large amount of on-chip dual-port memory with configurable word width. The Xilinx Virtex-7 2000T FPGA is ideally suited for application-specific integrated circuit (ASIC) prototyping.

The Virtex-7 2000T provides equivalent capacity and performance to high-density ASICs, reduces board space requirements and complexity, and furthermore reduces system-level power consumption. The current FPGA technology does
not have hard IPs for the classical TCAMs; however, it does have them for SRAMs. The benefits of SRAM over CAM and the feasibility of FPGA technology have motivated us to pursue innovative designs of TCAM.

The proposed UE-TCAM architecture builds on the succession of prior work on HP-TCAM [15], Z-TCAM [16], and E-TCAM [17]. The proposed work makes the following key contributions.

• The architecture of the proposed TCAM is much simpler: it consists primarily of SRAM units with simple additional logic and is implemented on a state-of-the-art Xilinx FPGA.
• The proposed UE-TCAM brings an enormous reduction in resource utilization. Implementation results illustrate that UE-TCAM attains a 100% reduction in 18K B-RAMs, 74.67% reduction in SRs, and 70.28% reduction in LUTs, compared with the available SRAM-based TCAM.
• Energy/bit/search is a very useful performance metric for TCAM. Compared with the existing SRAM-based TCAM, the proposed TCAM achieves a 58.58% reduction in energy consumption.
• Latency is another important performance metric. UE-TCAM reduces latency by 60%, compared with the available SRAM-based TCAM.
• Compared with the state-of-the-art SRAM-based TCAM design, UE-TCAM also improves speed by 70.85%. It obtains higher throughput with a much simpler architecture.

The proposed work may be used in network systems, web-enabled applications, and cloud computing. Other applications that can benefit from the proposed TCAM are data compression, image recognition processors, voice recognition processors, or any pattern recognition system in general. We expect that CAM technology will become mainstream for many applications in the near future. Thus, the use of CAM technology paves the way for our proposed work in emerging applications.

C. Paper organization

The rest of the paper is organized as follows: Section II discusses related work. Section III explains hybrid partitioning, which realizes architectures of the SRAM-based TCAMs. Section IV presents the architecture of the proposed UE-TCAM. Section V explains UE-TCAM operations. Section VI elaborates operations of the proposed TCAM with examples. Section VII provides implementation and performance evaluation of UE-TCAM. Section VIII concludes the paper and highlights our future work.

II. RELATED WORK

We surveyed the literature on RAM-based CAMs and, to the best of our knowledge, found very few works on the topic. The RAM-based CAM proposed in [4] uses a hashing technique; thus, it inherits the inborn disadvantages of hashing: collisions and bucket overflow. The number of stored elements has a great impact on the performance of the RAM-based CAM in [4]: as the number of stored elements increases, the performance of the method degrades gracefully. Further, the method emulates binary CAM, not TCAM.

The method in [18] also exploits a hashing technique for TCAM. Being based on hashing, it also suffers from collisions and bucket overflow, which needs additional area. If the overflow area has many records, then a search operation may not finish until many buckets are searched. Furthermore, when stored keys contain don't-care bits in the bit positions used for hashing, such keys must be duplicated in multiple buckets, which results in large memory; thus, the memory utilization is not efficient.

Hashing also cannot provide deterministic performance due to potential collisions and is inefficient in handling wildcards [19]. In contrast to the hash-based CAMs, the proposed TCAM provides deterministic search performance and efficiently utilizes memory. SRAM-based pipelined CAMs also take multiple clock cycles to accomplish a search operation, and their memory utilization is also not efficient [20]. In contrast, our proposed TCAM has a deterministic throughput of a single clock cycle and also provides a better utilization of memory.

RAM-based CAMs in [5] and [21] also have unavoidable shortcomings. The size of memory in both methods depends on the number of bits (nob) in the TCAM word. In [5], the required memory size would be $2^{nob}$ bits arranged in a column. The size increases exponentially with the number of bits in the TCAM word. For instance, a 36-bit word needs $2^{36}$ bits, i.e., 64 Gb (8 GB), of RAM. Such a huge memory results in prohibitive area, cost, and power consumption; thus, it makes the method practically infeasible for an arbitrarily large bit pattern. In contrast, the proposed design has a suitable partitioning scheme and efficiently supports arbitrarily large words.

In [21], an increase in the number of bits in the CAM word exponentially increases the memory size to a prohibitive limit, as in [5]. Furthermore, the RAM-based CAM in [21] works only on data arranged in ascending order, which is against the norm of a real application where data are totally random. To arrange data in ascending order, the original order of entries needs to be preserved, which is not considered in this method. If it were considered, the memory and power requirements would further increase. In contrast, our proposed TCAM supports an arbitrarily large bit pattern, preserves the original addresses, and has a suitable partitioning methodology.

The CAM in [22] integrates CAM and RAM to get overall CAM functionality; thus, it inherits the inborn disadvantages of CAM. This scheme arranges the traditional TCAM table into groups based on some distinguishing bits in TCAM words, so that each group can have at most one possible match. Since data in real applications are totally random, making groups would be very time consuming. On the contrary, the proposed method provides a generic TCAM and uses SRAM, not CAM, to emulate overall TCAM functionality.

State-of-the-art SRAM-based TCAMs, namely HP-TCAM [15], Z-TCAM [16], and E-TCAM [17], were published recently. Our proposed UE-TCAM improves on them by lowering memory size, power consumption, and latency and, more importantly, provides higher throughput.
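To make the exponential growth described above for the flat RAM-based scheme of [5] concrete, the following minimal sketch tabulates the memory a single column of $2^{nob}$ bits would need. The function names and the chosen word widths are illustrative assumptions, not taken from the paper; the only input from the text is the $2^{nob}$-bit sizing rule.

```python
# Illustrative arithmetic only: size of a flat column holding one bit per
# possible nob-bit word, as described for the RAM-based CAM in [5].
# Function names and the list of widths are assumptions for this sketch.

def flat_column_bits(nob: int) -> int:
    """Bits required when every possible nob-bit word has its own 1-bit entry."""
    return 2 ** nob


def bits_to_gib(bits: int) -> float:
    """Convert a bit count to gibibytes (1 GiB = 2**30 bytes)."""
    return bits / 8 / 2 ** 30


for nob in (16, 24, 32, 36):
    bits = flat_column_bits(nob)
    print(f"nob = {nob:2d}: {bits:>14,d} bits = {bits_to_gib(bits):.6f} GiB")
```

Every additional word bit doubles the column, which is why a 36-bit word already requires $2^{36}$ bits (64 Gibit, or 8 GiB when expressed in bytes).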
Fig. 1. Conceptual view of hybrid partitioning (HP). (L: # of layers, N: # of vertical partitions).

Fig. 2. Architecture of UE-TCAM. Layer architecture is shown in Fig. 3. (L: # of layers, sw: subword, w: # of bits in subword, C: # of bits in the input word, PMA: potential match address, and MA: match address).

III. HYBRID PARTITIONING

We use hybrid partitioning (HP), shown in Fig. 1, to divide the conventional TCAM table horizontally and vertically to construct hybrid partitions. Vertical partitioning (VP) in HP divides a TCAM word of C bits into N subwords. Horizontal partitioning (HrP) in HP divides each vertical partition into L horizontal partitions by using the original address range of the conventional TCAM table. Thus, HP results in a total of L × N hybrid partitions. The dimensions of each hybrid partition are K × w, where K is the number of original addresses in the partition and w is the number of bits in a subword.

VP keeps the memory requirement as low as possible. HrP cannot be used alone because it requires a very large memory; thus, HrP on its own is not feasible owing to inefficiency in terms of area, power, and cost. However, it conveniently generates layers. Hybrid partitions that span the same address range are grouped in the same layer. For example, HP31, HP32, HP33, ..., and HP3N are in layer 3.

IV. ARCHITECTURE OF UE-TCAM

A. Overall architecture

Fig. 2 shows the overall architecture, where each layer represents the layer architecture given in Fig. 3. UE-TCAM has L layers and a CAM priority encoder (CPE). The output of each layer is a potential match address (PMA). The PMAs are fed to the CPE, which selects the match address (MA) among the PMAs.

B. Layer architecture

The layer architecture of the proposed TCAM is illustrated in Fig. 3. Its components include N SRAM units, a K-bit AND operation, and a layer priority encoder (LPE).

Fig. 3. Architecture of a layer of UE-TCAM. (N: # of subwords, LPE: layer priority encoder, K: width of SRAM unit, sw: subword, w: # of bits in a subword, PMA: potential match address, and MA: match address).

1) SRAM unit: Each SRAM unit has a size of $2^w$ words × K bits, where K is the number of original addresses from the conventional TCAM covered by the unit. The w bits have at most $2^w$ combinations, each of which represents a subword. In the proposed TCAM, each subword acts as an address to its corresponding SRAM unit and invokes the corresponding row of K bits. The composition of the SRAM unit in the proposed architecture is shown in Table I, where a 1 indicates the presence of a subword at an original address.

TABLE I. COMPOSITION OF THE SRAM UNIT IN UE-TCAM (rows: addresses 0 to $2^w - 1$; columns: original address positions 0th to (K-1)th; each entry is 0 or 1)

2) K-bit AND operation: The K-bit rows read out by their corresponding subwords are bit-wise ANDed. A possible PMA is present in the result of the K-bit AND operation, which is forwarded to the LPE to generate the PMA.

3) Layer priority encoder: Since we emulate TCAM and multiple matches may occur in TCAM [23], the LPE is used to select the PMA from the output of the K-bit AND operation.

V. UE-TCAM OPERATIONS

A. Data mapping operation

The traditional TCAM table is logically partitioned column-wise (vertically) and row-wise (horizontally) into TCAM sub-tables using hybrid partitioning [15]. A partition may contain an x (don't-care) bit, which is first expanded into binary bits (0 and 1). Each subword, acting as an address, is applied to its corresponding SRAM unit, and K bits are written at that memory location.
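The mapping step and the layer datapath described so far (SRAM units addressed by subwords, a K-bit AND, a layer priority encoder, and a CAM priority encoder) can be sketched in software as follows. This is a behavioural model under stated assumptions, not the authors' hardware description: the function names, the use of Python integers as K-bit rows, and the rule that the lowest address has the highest priority are all assumptions of this sketch. The four-entry table is the small example the paper works through later, with N = 2, L = 2, K = 2, and w = 2.

```python
from itertools import product


def expand(subword: str) -> list[str]:
    """Expand don't-care bits ('x') into every matching binary subword."""
    choices = [("0", "1") if bit == "x" else (bit,) for bit in subword]
    return ["".join(bits) for bits in product(*choices)]


def map_table(table: list[str], n_parts: int, k: int) -> list[list[list[int]]]:
    """Build units[layer][partition][subword] -> K-bit row stored as an int.

    The leftmost bit of a row corresponds to the first original address of
    the layer. Assumes len(table) is a multiple of k and the word width is
    a multiple of n_parts.
    """
    w = len(table[0]) // n_parts
    layers = len(table) // k
    units = [[[0] * (2 ** w) for _ in range(n_parts)] for _ in range(layers)]
    for addr, word in enumerate(table):
        layer, row_bit = divmod(addr, k)
        for p in range(n_parts):
            for sw in expand(word[p * w:(p + 1) * w]):
                units[layer][p][int(sw, 2)] |= 1 << (k - 1 - row_bit)
    return units


def search(units: list[list[list[int]]], key: str, k: int):
    """Return the match address, or None when every layer mismatches."""
    n_parts = len(units[0])
    w = len(key) // n_parts
    subwords = [int(key[p * w:(p + 1) * w], 2) for p in range(n_parts)]
    pmas = []
    for layer, layer_units in enumerate(units):
        and_k = (1 << k) - 1
        for p, sw in enumerate(subwords):   # K-bit AND over the N rows read out
            and_k &= layer_units[p][sw]
        if and_k:                           # layer priority encoder (assumed:
            first = k - and_k.bit_length()  # lowest address wins)
            pmas.append(layer * k + first)
    return min(pmas) if pmas else None      # CAM priority encoder (same rule)


table = ["0011", "0101", "0x11", "111x"]    # addresses 0..3, 'x' = don't care
units = map_table(table, n_parts=2, k=2)
print(search(units, "0011", k=2))           # both layers hit; address 0 is selected
print(search(units, "1010", k=2))           # no layer matches, so None
```

In hardware all SRAM units of all layers are read in the same cycle; the loops here only model that behaviour sequentially.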
In this way, all the memory units are mapped. A subword in a partition may be present at multiple locations, so its original addresses are mapped to the corresponding bits in their respective memory units. Mapped bits are set high, while the remaining bits are set low.

B. Data searching operation

1) Data searching operation in a layer: Algorithm 1 describes the lookup operation in a layer of the proposed UE-TCAM. The N subwords act as addresses and read out their memory locations from their respective SRAM units; the rows read out are then bit-wise ANDed. The LPE selects the PMA; otherwise, a mismatch occurs in the layer.

2) Overall data searching operation: The overall search operation follows Algorithm 2. A search key applied to UE-TCAM is divided into N subwords, which are searched in their corresponding SRAM units in all layers in parallel. Algorithm 2 uses Algorithm 1 at step 2. The CPE selects the MA among the PMAs; otherwise, a mismatch occurs.

Algorithm 1 Search in a layer of UE-TCAM
Input: N subwords where each subword is of w bits
Output: PMA
1: Read all SRAM units concurrently
2: AND_K = K_1 & K_2 & K_3 ... & K_N
3: LPE selects PMA from AND_K; if no bit of AND_K is high, mismatch occurs

Algorithm 2 Overall search in UE-TCAM
Input: Search key of C bits
Output: MA
1: Divide search key into N subwords; each of w bits
2: All layers use Algorithm 1 in parallel
3: CPE selects MA among the PMAs; if there is no PMA, mismatch occurs

VI. UE-TCAM EXAMPLE

A. Data mapping example

We map Table II to the proposed UE-TCAM. Table II also shows its hybrid partitions. We take a simple example of a 4 × 4 conventional TCAM table and divide it into four hybrid partitions, each of size 2 words × 2 bits.

TABLE II. TRADITIONAL TCAM TABLE WITH HYBRID PARTITIONS

Address   Hybrid partitions            Layer
0         00 (HP11)    11 (HP12)       1
1         01 (HP11)    01 (HP12)       1
2         0x (HP21)    11 (HP22)       2
3         11 (HP21)    1x (HP22)       2

We select N = 2, L = 2, K = 2, and w = 2. After the necessary processing, HP11, HP12, HP21, and HP22 are mapped to their corresponding SRAM units. The mapped memory units are shown in Table III. The mapped bits are high, while the remaining bits are low. For example, subword 00 is available at address 0 in the conventional TCAM table. Subword 00 has therefore been mapped in SRAM unit11, where the bit corresponding to address 0 has been set high.

TABLE III. DATA MAPPING EXAMPLE: SRAM UNITS IN LAYER 1 AND LAYER 2 OF UE-TCAM

                     Layer 1                        Layer 2
                     SRAM unit11    SRAM unit12     SRAM unit21    SRAM unit22
Original addresses   0  1           0  1            2  3           2  3
Address 0            1  0           0  0            1  0           0  0
Address 1            0  1           0  1            1  0           0  0
Address 2            0  0           0  0            0  0           0  1
Address 3            0  0           1  0            0  1           1  1

B. Data searching example

1) Match case: We search the memory units given in Table III. Table IV provides an example of the search operation in layers 1 and 2, where the lookup operation in each layer follows Algorithm 1. Table V provides the overall search operation in UE-TCAM, which follows Algorithm 2. We provide input word 0011 for searching. UE-TCAM finds a match for the input word in layer 1 at location 0 and in layer 2 at location 2. Thus, we have PMA1 = 0 from layer 1 and PMA2 = 2 from layer 2. The CPE selects PMA1 = 0 as MA, considering that it has the highest priority.

TABLE IV. SEARCHING IN LAYER 1 AND LAYER 2 IN UE-TCAM (Subword1 = 00, Subword2 = 11)

Steps   Activity                Layer 1              Layer 2
1       Read out data from      SRAM unit11 = 10     SRAM unit21 = 10
                                SRAM unit12 = 10     SRAM unit22 = 11
2       K-bit ANDing result     10                   10
3       PMAs                    PMA1 = 0             PMA2 = 2

TABLE V. OVERALL DATA SEARCH OPERATION IN UE-TCAM

Steps   Activity
1       Search key 0011: Subword1 = 00 and Subword2 = 11
2       PMA1 = 0 and PMA2 = 2
3       CPE selects address 0 as MA

2) Mismatch case: During a search operation in a layer, a mismatch of the input word occurs when none of the bits is high after the K-bit AND operation. Table VI shows a mismatch case in layer 1.

TABLE VI. MISMATCH CASE WHEN THE RESULT OF K-BIT AND OPERATION IS 0 IN UE-TCAM (Subword1 = 01, Subword2 = 11)

Steps   Activity
1       Read out data from SRAM unit11 = 01
        Read out data from SRAM unit12 = 10
2       K-bit AND operation result = 00
3       Since the result of K-bit AND operation is 0, mismatch has occurred in layer 1.

VII. IMPLEMENTATION RESULTS AND PERFORMANCE EVALUATION

A. Implementation results

A sample 512 × 36 design of the proposed UE-TCAM and of the available SRAM-based TCAMs, namely HP-TCAM [15], Z-TCAM [16], and E-TCAM [17], with L = 4 and N = 4, was implemented on a Xilinx Virtex-6 FPGA. Table VII shows the implementation results of all the SRAM-based TCAMs. Dynamic power consumption for a lookup operation was measured with 1.0 V core voltage and 100 MHz clock speed. We measured power consumption using the Xilinx XPower Analyzer [24]. We generated a switching activity interchange format (SAIF) file,
TABLE VII. IMPLEMENTATION RESULTS ON XILINX VIRTEX-6 FPGA

Results                    HP-TCAM [15]   Z-TCAM [16]   E-TCAM [17]   UE-TCAM
SRs                        2057           665           521           521
LUTs                       5326           1982          1677          1583
Speed (MHz)                118.1          158.88        163.99        201.78
B-RAMs (18K, 36K)          16, 48         16, 32        16, 32        0, 32
Energy (fJ/bit/search)     102.17         58.91         49.54         42.32
EDP (ns.fJ/bit/search)     865.07         370.77        302.09        209.73
Latency (clock cycles)     5              4             3             2
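The percentage figures quoted for UE-TCAM against HP-TCAM [15] can be re-derived from the rows of this table. The sketch below does so with ordinary relative-change formulas; the helper names are hypothetical. It also checks one assumption that the paper does not state explicitly: that the EDP row equals the energy per bit per search multiplied by one clock period at the reported speed. Under that assumption the computed EDP agrees with the tabulated values to within rounding.

```python
# Values copied from Table VII (HP-TCAM [15] and the proposed UE-TCAM).
hp = {"srs": 2057, "luts": 5326, "speed_mhz": 118.1,
      "energy_fj": 102.17, "edp": 865.07, "latency": 5}
ue = {"srs": 521, "luts": 1583, "speed_mhz": 201.78,
      "energy_fj": 42.32, "edp": 209.73, "latency": 2}


def reduction(old: float, new: float) -> float:
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


def improvement(old: float, new: float) -> float:
    """Percentage by which `new` is higher than `old`."""
    return (new - old) / old * 100


def edp_estimate(energy_fj: float, speed_mhz: float) -> float:
    """Assumed EDP model: energy (fJ/bit/search) x one clock period (ns)."""
    return energy_fj * (1000.0 / speed_mhz)


for name in ("srs", "luts", "energy_fj", "edp", "latency"):
    print(f"{name:10s} reduction: {reduction(hp[name], ue[name]):6.2f} %")
print(f"speed improvement:    {improvement(hp['speed_mhz'], ue['speed_mhz']):6.2f} %")
print(f"EDP estimate, UE-TCAM: {edp_estimate(ue['energy_fj'], ue['speed_mhz']):.2f}")
```

Small differences in the last digit against the published percentages are to be expected, since the inputs here are the already rounded table entries.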

Fig. 4. Resource utilization comparison on Xilinx Virtex-6 FPGA. (Horizontal axis: resources on FPGA, i.e., SRs, LUTs, BRAMs (36K), and BRAMs (18K); series: HP-TCAM, Z-TCAM, E-TCAM, and UE-TCAM).

Fig. 5. EDP comparison of the SRAM-based TCAMs. (Horizontal axis: SRAM-based TCAMs, i.e., HP-TCAM, Z-TCAM, E-TCAM, and UE-TCAM; vertical axis: EDP).

which is required for more accurate power estimation. We calculated energy/bit/search using Equation 1, which is an important metric for TCAM. The latency of all the TCAMs includes the priority encoder (PE).

$$\text{Energy/bit/search} = \frac{\text{power}}{\text{frequency} \times \text{total bits}} \qquad (1)$$

B. Performance evaluation

A comparison of the resource utilization, speed, energy consumption, energy-delay product (EDP), and latency of the proposed UE-TCAM with the available SRAM-based TCAMs is given in Table VIII, which shows that the proposed UE-TCAM is efficient in all parameters considered for measurement of TCAM performance.

TABLE VIII. PERFORMANCE EVALUATION OF UE-TCAM
Reduction (%) in resources, energy, EDP, and latency and improvement (%) in speed over the SRAM-based TCAMs

Parameters           HP-TCAM [15]   Z-TCAM [16]   E-TCAM [17]
B-RAMs (18K, 36K)    100, 33.33     100, 0.00     100, 0.00
SRs                  74.67          21.65         0.00
LUTs                 70.28          20.13         0.44
Energy               58.58          28.16         14.58
EDP                  75.76          43.43         30.57
Latency              60             50            33.33
Speed                70.85          27.00         23.04

The reduction in energy consumption is due to the reduction in resource utilization. Furthermore, the proposed TCAM provides higher throughput. The latency of UE-TCAM is one clock cycle without the PE. To the best of our knowledge, UE-TCAM is the first SRAM-based TCAM to achieve a one-clock-cycle lookup operation.

From Table VIII, we can see that UE-TCAM shows substantial improvement over HP-TCAM [15] and Z-TCAM [16], and a smaller but concrete improvement over E-TCAM [17]. The proposed work achieves a 100% reduction in 18K B-RAMs over all the available SRAM-based TCAMs, along with a sound reduction in logic resources. Owing to the reduced utilization of FPGA resources, the proposed UE-TCAM achieves a significant reduction in energy consumption and latency over the existing SRAM-based TCAMs. More importantly, the speed (throughput) also improves markedly.

The bit position table (BPT) and address position table address generator (APTAG) in HP-TCAM [15] are collectively responsible for validating the input subword and generating an address for the corresponding address position table (APT). In Z-TCAM [16], the validation memory (VM) validates the input subword, if present, and the address for the original address table (OAT) is then invoked from the original address table address memory (OATAM). Thus, the collective functionality of VM and OATAM is equivalent to the collective functionality of BPT and APTAG. BPT and APTAG in HP-TCAM, and VM and OATAM in Z-TCAM, consume a lot of resources on the FPGA; thus, they result in higher resource utilization, higher power consumption, and higher latency.

Here a question arises. Can a subword be used as a direct address to a memory unit? Yes, it can. In E-TCAM [17], a subword is used as an address to VM and OAT. If the subword is validated, then it is used as a direct address to its memory unit. VM is used as a validation memory to validate a subword, if present. The validated subword is then used as an address to OAT to retrieve a row. VM is constructed from SRAM and uses FPGA resources. Another question arises here. Can we remove VM? Yes, we can.
Since a subword is used as an address to a memory block, there is no need to validate a subword and to generate an address in UE-TCAM. UE-TCAM removes the resources used by BPT and APTAG in HP-TCAM, by VM and OATAM in Z-TCAM, and by VM in E-TCAM; thus, it brings a significant reduction in resource utilization, power consumption, EDP, and latency and improves speed.

Fig. 4 also shows that UE-TCAM has smaller values for BRAMs, SRs, and LUTs. Exploiting the subword as an address and removing BPT and APTAG bring an enormous reduction in SRs and LUTs. Since APTAG contains a counter and an adder for generating an index (address), removing the APTAG also improves speed in UE-TCAM. Since BPT is an SRAM block and there are N BPTs, removing the N BPTs gives a large reduction in SRAM blocks, which we have achieved in our proposed UE-TCAM. Compared with Z-TCAM, removing VM and OATAM also makes the utilization of memory units more efficient. Similarly, compared with E-TCAM, removing VM also makes the utilization of memory units more efficient.

Fig. 5 provides a comparison of EDP among the SRAM-based TCAMs. The comparison clearly demonstrates the performance improvement of the proposed UE-TCAM over the existing SRAM-based TCAMs. The architecture of UE-TCAM is much simpler and provides higher operating speed, while using minimum FPGA resources when compared with E-TCAM, Z-TCAM, and HP-TCAM.

VIII. CONCLUSIONS AND FUTURE WORK

This paper presented an efficient SRAM-based TCAM architecture, UE-TCAM. We implemented a sample 512 × 36 design of it on a Xilinx Virtex-6 FPGA. UE-TCAM consumes fewer memory resources and fewer logic elements on the FPGA, thus creating a much simpler TCAM structure. Compared with the available SRAM-based TCAMs, UE-TCAM shows a significant reduction in size, power consumption, and latency and provides higher operating speed. For example, when compared with HP-TCAM [15], UE-TCAM brings a 100% reduction in 18K B-RAMs, 74.67% reduction in SRs, 70.28% reduction in LUTs, 75.76% reduction in EDP, and 60% reduction in latency and improves speed by 70.85%.

We understand that SRAM-based TCAM design is a rich field for research and that further investigation is necessary to find more SRAM-based TCAM approaches. We hope that the area will be further enriched by researchers in industry and academia. Our future work includes configuring UE-TCAM for precomparison access mode to further improve power efficiency, and using the proposed TCAM in some applications.

ACKNOWLEDGMENT

This work was partly supported by the Croucher Startup Grant (Grant No. 9500015).

REFERENCES

[1] M. Peng and S. Azgomi, "Content-addressable memory (CAM) and its network applications," in International IC-Taipei proceedings, Altera International Ltd.
[2] N. Mohan, W. Fung, D. Wright, and M. Sachdev, "Design techniques and test methodology for low-power TCAMs," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 14, no. 6, pp. 573-586, 2006.
[3] S. Dharmapurikar, P. Krishnamurthy, and D. Taylor, "Longest prefix matching using bloom filters," Networking, IEEE/ACM Transactions on, vol. 14, no. 2, pp. 397-409, 2006.
[4] P. Mahoney, Y. Savaria, G. Bois, and P. Plante, "Parallel hashing memories: an alternative to content addressable memories," in IEEE-NEWCAS Conference, 2005. The 3rd International, 2005, pp. 223-226.
[5] S. V. Kartalopoulos, "RAM-based associative content-addressable memory device, method of operation thereof and ATM communication switching system employing the same," Patent 6 097 724, August, 2000.
[6] W. Jiang and V. Prasanna, "Parallel IP lookup using multiple SRAM-based pipelines," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, 2008, pp. 1-14.
[7] M. Jaiswal and R. Cheung, "VLSI implementation of double-precision floating-point multiplier using Karatsuba technique," Circuits, Systems, and Signal Processing, vol. 32, no. 1, pp. 15-27, 2013.
[8] Z. Guo, W. Najjar, F. Vahid, and K. Vissers, "A quantitative analysis of the speedup factors of FPGAs over processors," in Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, ser. FPGA '04, 2004, pp. 162-170.
[9] V. Aggarwal, A. D. George, and K. C. Slatton, "Reconfigurable computing with multiscale data fusion for remote sensing," in Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, ser. FPGA '06, 2006, pp. 235-235.
[10] M. Becchi and P. Crowley, "Efficient regular expression evaluation: Theory to practice," in Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ser. ANCS '08, 2008, pp. 50-59.
[11] W. Jiang and V. Prasanna, "Scalable packet classification on FPGA," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 20, no. 9, pp. 1668-1680, 2012.
[12] C. H. Kim, S. Kwon, and C. P. Hong, "FPGA implementation of high performance elliptic curve cryptographic processor over GF(2^163)," J. Syst. Archit., vol. 54, no. 10, pp. 893-900, Oct. 2008.
[13] R. C. C. Cheung, N. Telle, W. Luk, and P. Y. K. Cheung, "Customizable elliptic curve cryptosystems," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 13, no. 9, pp. 1048-1059, Sept. 2005.
[14] Xilinx, "Xilinx FPGAs," http://www.xilinx.com.
[15] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned SRAM-based ternary content addressable memory," Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 59, no. 12, pp. 2969-2979, 2012.
[16] Z. Ullah, M. Jaiswal, and R. Cheung, "Z-TCAM: An SRAM-based architecture for TCAM," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 23, no. 2, pp. 402-406, Feb. 2015.
[17] --, "E-TCAM: An efficient SRAM-based architecture for TCAM," Circuits, Systems, and Signal Processing, vol. 33, no. 10, pp. 3123-3144, 2014.
[18] S. Cho, J. Martin, R. Xu, M. Hammoud, and R. Melhem, "CA-RAM: A high-performance memory substrate for search-intensive applications," in Performance Analysis of Systems Software, 2007. ISPASS 2007. IEEE International Symposium on, 2007, pp. 230-241.
[19] W. Jiang, V. K. Prasanna, and N. Yamagaki, "Decision forest: A scalable architecture for flexible flow matching on FPGA," in Proceedings of the 2010 International Conference on Field Programmable Logic and Applications, ser. FPL '10, 2010, pp. 394-399.
[20] W. Jiang and V. K. Prasanna, "Large-scale wire-speed packet classification on FPGAs," in Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays, ser. FPGA '09, 2009, pp. 219-228.
[21] M. Somasundaram, "Circuits to generate a sequential index for an input number in a pre-defined list of numbers," Patent 7 155 563, December, 2006.
[22] --, "Memory and power efficient mechanism for fast table lookup," Patent 20 060 253 648, November, 2006.
[23] K. Pagiamtzis and A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: a tutorial and survey," Solid-State Circuits, IEEE Journal of, vol. 41, no. 3, pp. 712-727, 2006.
[24] Xilinx, "Xilinx Xpower Analyzer," http://www.xilinx.com.
