Zahid Ullah(1), Manish K. Jaiswal(2), Ray C.C. Cheung(3), and Hayden K.H. So(4)
(1) Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Peshawar, Pakistan
(2,4) Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong
(3) Department of Electronic Engineering, City University of Hong Kong, Hong Kong
Emails: [email protected](1), [email protected](2), [email protected](3), [email protected](4)
Abstract-Ternary content-addressable memories (TCAMs) are high-speed memories; however, compared to static random access memories (SRAMs), TCAMs suffer from low storage density, relatively slow access time, poor scalability, complexity in circuitry, and higher cost. To access the benefits of SRAM, several SRAM-based TCAMs, specifically on field-programmable gate array (FPGA) platforms, were proposed. To further improve the performance of SRAM-based TCAMs, this paper presents UE-TCAM, which reduces memory requirement, latency, and power consumption, and improves speed. An example design of 512 x 36 of UE-TCAM has been implemented on a Xilinx Virtex-6 FPGA. Performance evaluation confirms a significant improvement in the proposed UE-TCAM, which achieves 100% reduction in 18K B-RAMs, 74.67% reduction in slice registers (SRs), 70.28% reduction in LUTs, 75.76% reduction in energy-delay product (EDP), and 60% reduction in latency, and improves speed by 70.85%, compared with the available SRAM-based TCAM.

I. INTRODUCTION
Ternary content-addressable memory (TCAM) provides access to stored data by contents (data word) rather than by an address and outputs the match address. A CAM searches its entire memory concurrently to check whether the given data word is stored anywhere in the CAM memory and returns a list of one or more storage addresses where the word was found. This fast search feature is the main influence behind using a CAM. The search operation can also be performed in regular random access memory (RAM) by iteratively reading and comparing all RAM entries for every search request. As a result, the search time using RAM is significantly longer than that of a CAM for the same search request.
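As a software analogy (not the paper's hardware design; the function and variable names below are illustrative), the following Python sketch contrasts the two lookup styles: a CAM-like index that maps a stored word directly to its addresses versus a RAM-style scan over every entry.

def cam_lookup(cam_index, word):
    """CAM-style lookup: the word itself selects its list of storage addresses."""
    return cam_index.get(word, [])

def ram_lookup(ram, word):
    """RAM-style lookup: iterate over every entry and compare it with the word."""
    return [addr for addr, stored in enumerate(ram) if stored == word]

ram = ["0110", "1010", "0110", "1111"]      # address -> stored word
cam_index = {}                              # stored word -> list of addresses
for addr, stored in enumerate(ram):
    cam_index.setdefault(stored, []).append(addr)

print(cam_lookup(cam_index, "0110"))        # [0, 2]
print(ram_lookup(ram, "0110"))              # [0, 2], but only after scanning every entry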
The high-speed search operation makes CAM an attractive choice for applications requiring high-speed search such as local-area networks, database management, pattern recognition, and artificial intelligence [1]. Recent applications include real-time pattern matching in virus-detection and intrusion-detection systems, gene pattern searching in bioinformatics, data compression, and image processing [2].

A. Problem statement
Although CAM technology presents a major advantage over standard RAM, namely a deterministic comparison in constant time, it also has shortcomings. For parallel search operation, CAM needs comparison circuitry in each cell, which dictates that CAM density lags RAM density. A typical TCAM cell has two SRAM cells and a comparison circuitry. A table of size 2^n x w needs 2^(n+1) x w SRAM cells and 2^n x w comparison circuitries, one for each TCAM cell. For large values of n, the TCAM size increases, which results in prohibitive power consumption, size, and cost, thus nullifying its advantage of high-speed lookup. The comparison circuitry in each cell not only makes TCAM expensive but also adds complexity to the TCAM architecture. The extra logic and capacitive loading due to the massive parallelism lengthen the access time of TCAM, which is over 3.3 times longer than the access time of SRAM [3].
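As a worked example (using the 512 x 36 configuration that is implemented later in this paper, i.e., n = 9 and w = 36), the expressions above evaluate to:

SRAM cells: 2^(n+1) x w = 2^10 x 36 = 36,864
Comparison circuitries: 2^n x w = 2^9 x 36 = 18,432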
Furthermore, TCAM is not subjected to the intense commercial competition found in the RAM market [4] and has yet to gain a substantial market share. TCAMs are expensive not only due to their low memory cell density but also due to their insignificant market demand, which means they are not produced in mass to drive their cost down. The cost of TCAM is about 30 times more per bit of storage than that of SRAM [3]. In addition, inherited architectural barriers also limit its total chip capacity. Complex integration of memory and logic also makes TCAM testing very time consuming [2].

CAMs have limited pattern retrieval capacity, and CAM technology does not evolve as fast as RAM technology. RAM technology is driven by many applications, particularly computers and consumer electronic products; hence, its cost per bit continuously decreases. CAM technology, in contrast, is considered specialized, and only a modest increase in bit capacity and a modest decrease in cost may be expected in the future [5]. TCAM does not scale well in terms of clock rate, power consumption, or chip density, whereas SRAM is scalable and less complex. The throughput of classical TCAMs is also limited by their relatively low speed [6].

B. Motivations and contributions

Field-programmable gate arrays (FPGAs) have a wide use in different applications [7] such as image processing [8], [9], networking systems [10], [11], and cryptography computations [12], [13], owing to several benefits such as reconfigurability, massive hardware parallelism, and rapid prototyping capability. SRAM-based FPGAs such as Xilinx Virtex-6 and Virtex-7 [14] provide high clock rates and a large amount of on-chip dual-port memory with configurable word width. The Xilinx Virtex-7 2000T FPGA is ideally suited for application-specific integrated circuit (ASIC) prototyping.

The Virtex-7 2000T provides equivalent capacity and performance to high-density ASICs, reduces board space requirements and complexity, and furthermore reduces system-level power consumption. The current FPGA technology does
not have hard IPs for the classical TCAMs; however, it does for SRAMs. The benefits of SRAM over CAM and the feasibility of FPGA technology have motivated us to pursue innovative designs of TCAM.

The proposed UE-TCAM architecture is built on the succession of our prior work on HP-TCAM [15], Z-TCAM [16], and E-TCAM [17]. The proposed work makes the following key contributions.
•  The architecture of the proposed TCAM is much simpler: it consists primarily of SRAM units with simple additional logic and is implemented on a state-of-the-art Xilinx FPGA.

•  The proposed UE-TCAM brings an enormous reduction in resource utilization. Implementation results illustrate that our UE-TCAM attains 100% reduction in 18K B-RAMs, 74.67% reduction in SRs, and 70.28% reduction in LUTs, compared with the available SRAM-based TCAM.

•  Energy/bit/search is a very useful performance metric for TCAM. Compared with the existing SRAM-based TCAM, the proposed TCAM achieves 58.58% reduction in energy consumption.

•  Latency is another important performance metric. The UE-TCAM also contributes by reducing latency by 60%, compared with the available SRAM-based TCAM.

•  Compared with the state-of-the-art SRAM-based TCAM design, the UE-TCAM also improves speed by 70.85%. Achieving higher throughput with a much simpler architecture is a key strength of the proposed work.
The proposed work may be used in network systems, web-enabled applications, and also in cloud computing. Other applications that can benefit from the proposed TCAM are data compression, image recognition processors, voice recognition processors, or any pattern recognition system in general. We expect that CAM technology will become mainstream for many applications in the near future. Thus, the use of CAM technology paves the way for our proposed work in the emerging applications.

C. Paper organization

The rest of the paper is organized as follows: Section II discusses related work. Section III explains hybrid partitioning, which realizes the architectures of the SRAM-based TCAMs. Section IV presents the architecture of the proposed UE-TCAM. Section V explains UE-TCAM operations. Section VI elaborates the operations of the proposed TCAM with examples. Section VII provides the implementation and performance evaluation of the UE-TCAM. Section VIII concludes the paper and also highlights our future work.
II. RELATED WORK

We surveyed the literature on RAM-based CAMs and, to the best of our knowledge, found very few works on the topic. The RAM-based CAM proposed in [4] uses a hashing technique and thus inherits the inborn disadvantages of hashing: collisions and bucket overflow. The number of stored elements has a great impact on the performance of the RAM-based CAM in [4]; as the number of stored elements increases, the performance of the method gradually degrades. Further, the method emulates binary CAM, not TCAM.
The method in [18] also exploits a hashing technique for TCAM. Being based on hashing, it likewise suffers from collisions and bucket overflow, which need additional area. If the overflow area has many records, then a search operation may not finish until many buckets are searched. Furthermore, when stored keys contain don't-care bits in the bit positions used for hashing, such keys must be duplicated in multiple buckets, which results in large memory; thus, the memory utilization is not efficient.
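To illustrate the duplication problem, the minimal Python sketch below (the hashing scheme and all names are illustrative, not taken from [18] or [4]) indexes buckets by two fixed bit positions of a ternary key; every don't-care bit that falls in a hashed position doubles the number of buckets the key must be written to.

from itertools import product

HASH_POSITIONS = (0, 1)   # illustrative choice: the bucket index is taken from these bit positions

def buckets_for_key(key):
    """Return every bucket index a ternary key ('0'/'1'/'x') must be duplicated into."""
    choices = [('0', '1') if key[p] == 'x' else (key[p],) for p in HASH_POSITIONS]
    return {''.join(bits) for bits in product(*choices)}

print(buckets_for_key('01x0'))   # {'01'}        -> one bucket, no duplication
print(buckets_for_key('x1x0'))   # {'01', '11'}  -> stored twice
print(buckets_for_key('xxx0'))   # four buckets  -> duplication grows as 2^(# of hashed don't-cares)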
A hashing technique also cannot provide deterministic performance due to potential collisions and is inefficient in handling wild-cards [19]. In contrast to the hash-based CAMs, the proposed TCAM provides a deterministic search performance and efficiently utilizes memory. SRAM-based pipelined CAMs also take multiple clock cycles to accomplish a search operation, and their memory utilization is not efficient either [20]. In contrast, our proposed TCAM has a deterministic throughput of a single clock cycle and also provides a better utilization of memory.

The RAM-based CAMs in [5] and [21] also have unavoidable shortcomings. The size of memory in both methods depends on the number of bits (nob) in the TCAM word. In [5], the required memory size would be 2^nob bits arranged in a column, so the size increases exponentially with the number of bits in the TCAM word. For instance, a 36-bit word needs 64 Gb of RAM. Such a huge memory results in prohibitive area, cost, and power consumption; thus, it makes the method practically infeasible for an arbitrarily large bit pattern. In contrast, the proposed design has a suitable partitioning scheme and efficiently supports arbitrarily large words.
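Taking the 2^nob-bit figure above at face value, the requirement for a 36-bit word works out to 2^36 = 68,719,476,736 bits, i.e., 64 Gb (8 GB) of RAM for a single lookup structure.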
In [21], an increase in the number of bits in the CAM word exponentially increases the memory size to a prohibitive limit, as in [5]. Furthermore, the RAM-based CAM in [21] works only on data arranged in ascending order, which is against the norm of real applications where data are totally random. To arrange data in ascending order, the original order of entries needs to be preserved, which is not considered in this method; if it were considered, the memory and power requirements would further increase. In contrast, our proposed TCAM supports an arbitrarily large bit pattern, preserves original addresses, and also provides a suitable partitioning methodology.

The CAM in [22] integrates CAM and RAM to get overall CAM functionality and thus inherits the inborn disadvantages of CAM. This scheme arranges the traditional TCAM table into groups based on some distinguishing bits in the TCAM words, so that each group can have at most one possible match. Since data in real applications are totally random, making such groups would be very time consuming. On the contrary, the proposed method provides a generic TCAM and uses SRAM, not CAM, to emulate the overall TCAM functionality.

The state-of-the-art SRAM-based TCAMs, HP-TCAM [15], Z-TCAM [16], and E-TCAM [17], are recently published. Our proposed UE-TCAM improves on them by lowering memory size, power consumption, and latency and, more importantly, provides higher throughput.
[Fig. 1. Conceptual view of hybrid partitioning (HP). (L: # of layers, N: # of vertical partitions.) The traditional TCAM table is divided into L x N hybrid partitions HP11 through HPLN.]

[Fig. 2. Architecture of UE-TCAM; layer architecture is shown in Fig. 3. The input word of C bits is partitioned into N subwords, each of w bits; the subwords address the L layers, whose potential match addresses feed a CAM priority encoder that outputs the match address. (L: # of layers, sw: subword, w: # of bits in subword, C: # of bits in the input word, PMA: potential match address, MA: match address.)]
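The splitting step shown in Fig. 2, where a C-bit input word is divided into N subwords of w bits each (C = N x w), can be sketched in a few lines of Python; the function name is illustrative and not part of the paper.

def partition_word(word, w):
    """Split a C-bit input word (a '0'/'1' string) into N = C / w subwords of w bits each."""
    assert len(word) % w == 0, "C must be a multiple of the subword width w"
    return [word[i:i + w] for i in range(0, len(word), w)]

# A 6-bit input word split into N = 3 subwords of w = 2 bits:
print(partition_word('011011', 2))   # ['01', '10', '11']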
1) SRAM unit: Each SRAM unit has a size of 2^w words x K bits, where K is the subset of original addresses from the conventional TCAM. The maximum number of possible combinations of w bits is 2^w, where each combination represents a subword; in our proposed TCAM, each subword acts as an address to its corresponding SRAM unit and invokes its corresponding row of K bits. The composition of the SRAM unit in the proposed architecture is shown in Table I, where a 1 shows the presence of a subword at an original address.

V. UE-TCAM OPERATIONS

A. Data mapping operation

The traditional TCAM table is logically partitioned column-wise (vertically) and row-wise (horizontally) into TCAM sub-tables using hybrid partitioning [15]. A partition may contain an x bit, which is first expanded into binary bits (0 and 1). Each subword, acting as an address, is applied to its corresponding SRAM unit, and the K bits are written at that memory location.
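A minimal software model of this mapping step is sketched below. The class and helper names are illustrative (the actual design writes these bit-vectors into FPGA block RAM); the x-bit expansion and the K-bit presence rows follow the description above.

from itertools import product

def expand_x(subword):
    """Expand a ternary subword ('0'/'1'/'x') into all matching binary subwords."""
    choices = [('0', '1') if b == 'x' else (b,) for b in subword]
    return [''.join(bits) for bits in product(*choices)]

class SramUnit:
    """Software model of one SRAM unit: 2^w rows, each row a K-bit presence vector."""
    def __init__(self, w, K):
        self.rows = {format(a, '0{}b'.format(w)): [0] * K for a in range(2 ** w)}

    def map_subword(self, subword, original_address):
        # Set the bit of the original TCAM address in every row the subword addresses.
        for address in expand_x(subword):
            self.rows[address][original_address] = 1

# Map a 2-bit ternary subword column of a partition that covers K = 4 original addresses.
unit = SramUnit(w=2, K=4)
unit.map_subword('0x', original_address=0)   # present at rows '00' and '01'
unit.map_subword('11', original_address=2)
print(unit.rows['01'])                       # [1, 0, 0, 0]
print(unit.rows['11'])                       # [0, 0, 1, 0]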
Thus, in this way, all the memory units are mapped. A subword in a partition may be present at multiple locations, so its original addresses are mapped to the corresponding bits in their respective memory units. Mapped bits are set high, while the remaining bits are set low.

[TABLE II. TRADITIONAL TCAM TABLE WITH HYBRID PARTITIONS.]

[TABLE III. DATA MAPPING EXAMPLE: SRAM UNITS IN LAYER 1 AND LAYER 2 OF UE-TCAM.]

[TABLE IV. SEARCHING IN LAYER 1 AND LAYER 2 IN UE-TCAM: SRAM unit21 and SRAM unit22 are read; K-bit ANDing result: 10; PMAs: PMA1 = 0, PMA2 = 2.]

TABLE V. OVERALL DATA SEARCH OPERATION IN UE-TCAM
Input: Subword1 = 01, Subword2 = 11
Step 1: Read out data from SRAM unit11; read out data from SRAM unit12.
Step 2: K-bit AND operation result = 00.
Step 3: Since the result of the K-bit AND operation is 0, a mismatch has occurred in layer 1.

1: Read all SRAM units concurrently
... layer 2. CPE selects PMA1 = 0 as MA, considering that it has ...
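The search steps summarized in Table V can be modeled in software as follows. This is a minimal sketch with illustrative names and example row contents (chosen only to reproduce the AND result of Table V, not taken from the paper's tables): each layer reads one K-bit row per SRAM unit, ANDs the rows bitwise, and the remaining set bit positions are that layer's potential match addresses.

def layer_search(sram_units, subwords):
    """Search one layer: read one K-bit row per SRAM unit and AND the rows bitwise.

    sram_units: one dict per vertical partition, mapping a binary subword -> K-bit row (list of 0/1).
    subwords:   the binary subwords of the input word, one per SRAM unit.
    Returns the layer's potential match addresses (an empty list means a mismatch).
    """
    result = list(sram_units[0][subwords[0]])                 # step 1: read out the rows
    for unit, sw in zip(sram_units[1:], subwords[1:]):
        result = [a & b for a, b in zip(result, unit[sw])]    # step 2: K-bit AND
    return [addr for addr, bit in enumerate(result) if bit]   # step 3: PMAs

# Layer 1 of the Table V example: Subword1 = '01', Subword2 = '11', K = 2.
layer1_units = [{'01': [1, 0], '11': [0, 1]},   # SRAM unit11 (only the relevant rows shown)
                {'01': [0, 1], '11': [0, 0]}]   # SRAM unit12
print(layer_search(layer1_units, ['01', '11']))  # [] -> AND result 00, mismatch in layer 1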
[Fig. 4. Resource utilization comparison on Xilinx Virtex-6 FPGA: SRs, LUTs, BRAMs (36K), and BRAMs (18K) for HP-TCAM, Z-TCAM, E-TCAM, and UE-TCAM.]

[Fig. 5. EDP comparison of the SRAM-based TCAMs: HP-TCAM, Z-TCAM, E-TCAM, and UE-TCAM.]
This paper presented an efficient SRAM-based TCAM architecture, UE-TCAM. We implemented a sample design of 512 x 36 of it on a Xilinx Virtex-6 FPGA. UE-TCAM consumes fewer memory resources and fewer logic elements on the FPGA, thus creating a much simpler TCAM structure. Compared with the available SRAM-based TCAMs, UE-TCAM shows significant reduction in size, power consumption, and latency and provides higher operating speed. For example, when compared with HP-TCAM [15], UE-TCAM brings 100% reduction in 18K B-RAMs, 74.67% reduction in SRs, 70.28% reduction in LUTs, 75.76% reduction in EDP, and 60% reduction in latency, and improves speed by 70.85%.

We understand that SRAM-based TCAM design is a rich field for research, and further investigation is necessary to find more SRAM-based TCAM approaches. We hope that the area will be further enriched by researchers in industry and academia. Our future work includes configuring the UE-TCAM for a precomparison access mode to further improve power efficiency, and using the proposed TCAM in some applications.

ACKNOWLEDGMENT

This work was partly supported by the Croucher Startup Grant (Grant No. 9500015).

REFERENCES

[1] M. Peng and S. Azgomi, "Content-addressable memory (CAM) and its network applications," in International IC-Taipei proceedings, Altera International Ltd.
[2] N. Mohan, W. Fung, D. Wright, and M. Sachdev, "Design techniques and test methodology for low-power TCAMs," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 14, no. 6, pp. 573-586, 2006.
[13] R. C. C. Cheung, N. Telle, W. Luk, and P. Y. K. Cheung, "Customizable elliptic curve cryptosystems," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 13, no. 9, pp. 1048-1059, Sept 2005.
[14] Xilinx, "Xilinx FPGAs," http://www.xilinx.com.
[15] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned SRAM-based ternary content addressable memory," Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 59, no. 12, pp. 2969-2979, 2012.
[16] Z. Ullah, M. Jaiswal, and R. Cheung, "Z-TCAM: An SRAM-based architecture for TCAM," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 23, no. 2, pp. 402-406, Feb 2015.
[17] --, "E-TCAM: An efficient SRAM-based architecture for TCAM," Circuits, Systems, and Signal Processing, vol. 33, no. 10, pp. 3123-3144, 2014.
[18] S. Cho, J. Martin, R. Xu, M. Hammoud, and R. Melhem, "CA-RAM: A high-performance memory substrate for search-intensive applications," in Performance Analysis of Systems & Software, 2007. ISPASS 2007. IEEE International Symposium on, 2007, pp. 230-241.
[19] W. Jiang, V. K. Prasanna, and N. Yamagaki, "Decision forest: A scalable architecture for flexible flow matching on FPGA," in Proceedings of the 2010 International Conference on Field Programmable Logic and Applications, ser. FPL '10, 2010, pp. 394-399.
[20] W. Jiang and V. K. Prasanna, "Large-scale wire-speed packet classification on FPGAs," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '09, 2009, pp. 219-228.
[21] M. Somasundaram, "Circuits to generate a sequential index for an input number in a pre-defined list of numbers," Patent 7 155 563, December, 2006.
[22] --, "Memory and power efficient mechanism for fast table lookup," Patent 20 060 253 648, November, 2006.
[23] K. Pagiamtzis and A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: a tutorial and survey," Solid-State Circuits, IEEE Journal of, vol. 41, no. 3, pp. 712-727, 2006.
[24] Xilinx, "Xilinx XPower Analyzer," http://www.xilinx.com.