Efficient TCAM Design Based On Multipumping-Enable

The document presents a novel design for efficient Ternary Content-Addressable Memory (TCAM) based on multipumping-enabled multiported SRAM implemented on FPGA. This design addresses the limitations of existing TCAM solutions by improving memory utilization and achieving up to 2.85 times better performance per memory through innovative mapping and clocking techniques. The proposed architecture significantly reduces the complexity and resource usage of traditional TCAM designs, making it more practical for large storage capacities.


Efficient TCAM Design Based on Multipumping-Enabled Multiported SRAM on FPGA

Article in IEEE Access · April 2018


DOI: 10.1109/ACCESS.2018.2822311



Received February 22, 2018, accepted March 28, 2018, date of publication April 3, 2018, date of current version April 25, 2018.
Digital Object Identifier 10.1109/ACCESS.2018.2822311

Efficient TCAM Design Based on Multipumping-Enabled Multiported SRAM on FPGA
INAYAT ULLAH (1), ZAHID ULLAH (2) (Member, IEEE), AND JEONG-A LEE (1) (Senior Member, IEEE)
(1) Department of Computer Engineering, Chosun University, Gwangju 61452, South Korea
(2) Department of Electrical Engineering, CECOS University of IT & Emerging Sciences, Peshawar 25000, Pakistan
Corresponding author: Jeong-A Lee ([email protected])
This work was supported in part by the National Research Foundation of Korea through the Ministry of Science and ICT under Grant
NRF-2016R1A2B4010382 and in part by the Korea Institute of Energy Technology Evaluation and Planning and Ministry of Trade,
Industry and Energy of the Republic of Korea under Grant 20164010201020.

ABSTRACT Ternary content-addressable memory (TCAM)-based search engines play an important role
in networking routers. The search space demands of TCAM applications are constantly rising. However,
existing realizations of TCAM on field-programmable gate arrays (FPGAs) suffer from storage inefficiency.
This paper presents a multipumping-enabled multiported SRAM-based TCAM design on FPGA, to achieve
an efficient utilization of SRAM memory. Existing SRAM-based solutions for TCAM reduce the impact of
the increase in the traditional TCAM pattern width from an exponential growth in memory usage to a linear
one using cascaded block RAMs (BRAMs) on FPGA. However, BRAMs on state-of-the-art FPGAs have a
minimum depth limitation, which limits the storage efficiency for TCAM bits. Our proposed solution avoids
this limitation by mapping the traditional TCAM table divisions to shallow sub-blocks of the configured
BRAMs, thus achieving a memory-efficient TCAM memory design. The proposed solution operates the
configured simple dual-port BRAMs of the design as multiported SRAM using the multipumping technique,
by clocking them with a higher internal clock frequency to access the sub-blocks of the BRAM in one
system cycle. We implemented our proposed design on a Virtex-6 xc6vlx760 FPGA device. Compared with
existing FPGA-based TCAM designs, our proposed method achieves up to 2.85 times better performance
per memory.

INDEX TERMS Block RAM (BRAM), field-programmable gate array (FPGA), memory architecture,
multiported memory, multipumping, SRAM-based TCAM.

I. INTRODUCTION
Ternary content-addressable memory (TCAM) compares an input word with its entire stored data in parallel, and outputs the matched word's address. TCAM stores data in three states: 0, 1, and X (don't care). Traditional TCAMs are built in application-specific integrated circuit (ASIC), and offer high-speed search operations in a deterministic time.

TCAM is widely employed to design high-speed search engines and has applications in networking, artificial intelligence, data compression, radar signal tracking, pattern matching in virus detection, gene pattern searching in bioinformatics, image processing, and to accelerate various database search primitives [1]–[3]. The Internet-of-things and big-data processing devices employ TCAM as a filter when storing signature patterns, and achieve a substantial reduction in energy consumption by reducing wireless data transmissions of invalid data to cloud servers [4], [5].

Field-programmable gate arrays (FPGAs) emulate TCAM using static random-access memory (SRAM), by addressing SRAM with TCAM contents. Each SRAM word corresponds to a specific TCAM pattern, and stores information on its existence for all possible data of the TCAM table. The increase in the number of TCAM pattern bits results in an exponential growth in memory usage. This exponential growth in memory usage has been reduced to linear growth by cascading multiple SRAM blocks in the design of TCAM on FPGA in previous work [6], [7].
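The difference between exponential and linear growth can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: the function names, the TCAM depth of 64 words, and the 9-bit sub-word size are hypothetical choices, and each SRAM word is assumed to hold one match bit per TCAM word.

```python
def bits_single_block(depth: int, width: int) -> int:
    """SRAM bits when all `width` pattern bits address a single SRAM block.

    The block needs 2**width words, each holding one match bit per TCAM word.
    """
    return depth * 2 ** width


def bits_cascaded(depth: int, width: int, addr_bits: int) -> int:
    """SRAM bits when the pattern is split into sub-words of `addr_bits` bits.

    Each sub-word addresses its own block of 2**addr_bits words, so the
    total grows with the number of blocks, that is, linearly in `width`.
    """
    blocks = -(-width // addr_bits)  # ceiling division
    return blocks * depth * 2 ** addr_bits


if __name__ == "__main__":
    for width in (9, 18, 36):
        print(width, bits_single_block(64, width), bits_cascaded(64, width, 9))
```

Doubling the pattern width from 18 to 36 bits doubles the cascaded memory, whereas the single-block memory is multiplied by 2^18.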

2169-3536 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
VOLUME 6, 2018
I. Ullah et al.: Efficient TCAM Design Based on Multipumping-Enabled Multiported SRAM on FPGA

TABLE 1. List of basic notations used.

Contemporary FPGAs implement block-RAM (BRAM) in the silicon substrate, and offer a high speed. For example, the Xilinx Virtex-6 xc6vlx760 FPGA contains 720 BRAMs of size 36 Kb [8], and provides operating frequencies of greater than 500 MHz [9]. Designers utilize these high-speed SRAM blocks to design SRAM-based TCAMs on FPGA.

In existing SRAM-based solutions, the storage capacity of a BRAM for TCAM bits is limited by its higher SRAM/TCAM ratio of $2^9/9$, because of its minimum depth limitation of 512 × 72 when configured in simple dual-port mode on FPGA [8]. For example, the design methodologies proposed in [10], [11], and [12] require a total of 56, 40, and 40 BRAMs of size 36 Kb, respectively, to implement an 18 Kb TCAM.
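The SRAM/TCAM ratio quoted above follows from simple arithmetic on the block geometry. The following sketch is illustrative only; the helper names are hypothetical, and it assumes that the address bits of a block encode pattern bits while each column of a word holds the match bit of one TCAM word.

```python
import math


def tcam_bits_per_block(r_depth: int, r_width: int) -> int:
    """TCAM bits emulated by one SRAM block addressed by the pattern bits.

    log2(r_depth) pattern bits select a word, and each of the r_width
    columns stores the match information of one TCAM word.
    """
    return int(math.log2(r_depth)) * r_width


def sram_bits_per_tcam_bit(r_depth: int) -> float:
    """SRAM bits spent per emulated TCAM bit: r_depth / log2(r_depth)."""
    return r_depth / math.log2(r_depth)


if __name__ == "__main__":
    print(512 * 72)                               # SRAM bits in one block
    print(tcam_bits_per_block(512, 72))           # TCAM bits it can emulate
    print(round(sram_bits_per_tcam_bit(512), 2))  # equals 2**9 / 9
```

For a 512 × 72 block this gives 36864 SRAM bits for 648 TCAM bits, which is the $2^9/9$ ratio (about 56.9 SRAM bits per TCAM bit).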
Excessive usage of BRAMs in the design of TCAM can result in a lack of BRAMs for other parts of the system on FPGA. Furthermore, the limited amount of BRAM resources on FPGA can compel designers to implement TCAMs in distributed RAM using SLICEM, resulting in the consumption of many slices, and a limitation on the maximum clock frequency of the design. This problem becomes more severe for the design of large storage capacity TCAMs. The efficient utilization of SRAM memory is imperative for the design of TCAMs on FPGAs.

The design of memory-efficient TCAMs requires shallow SRAM blocks on FPGAs. Multipumping-based multiported SRAM emulates the sub-blocks of a dual-port SRAM block as multiple shallow SRAM blocks, by operating SRAM with a higher frequency clock, allowing access to its sub-blocks in one system cycle. Researchers have designed efficient multiported memories using BRAMs on FPGA [13]–[16].

Existing FPGA-based TCAM design methodologies offer lower operational frequencies. This is mainly because of the complex routing of wide signals between BRAMs and logic, resulting from excessive usage of BRAMs, and the complex priority encoding units synthesized in logic slices for deeper traditional TCAMs. For example, the FPGA realizations of TCAM using BRAMs in [7] and [17] achieve operational frequencies of 139 MHz and 133 MHz to emulate 150 Kb and 89 Kb TCAMs, respectively. The highest operational frequency achieved in the previous studies [6], [10]–[12] is 202 MHz for the implementation of an 18 Kb TCAM on FPGA.

The demand for efficient utilization of SRAM memory in the design of TCAM and the speed provided by existing FPGA-based TCAM solutions make the use of multipumping-based multiported SRAM more practical for designing TCAM memory on FPGA. Our proposed TCAM design aims to achieve efficient memory utilization with a high throughput.

The contributions of this work are as follows:
• A novel multipumping-enabled multiported SRAM-based TCAM architecture, which achieves efficient memory utilization, is proposed.
• Our proposed approach presents a scalable and modular TCAM design on FPGA.
• The proposed design is more practical for large storage capacities, owing to the reduced routing complexity achieved by the use of fewer BRAMs and the reduced AND operation complexity. The novel optimization technique of AND-accumulating SRAM words in the proposed TCAM memory units divides the overall AND operation complexity of the design.
• The proposed design is implemented on a state-of-the-art FPGA. A detailed comparison of our proposed design with existing methods is performed with respect to the performance per memory. Our proposed design achieves a performance that is up to 2.85× higher per memory.

The remainder of this paper is organized as follows. Section II surveys related work. The proposed design is described in Section III. Section IV details the implementation setup and results of this work. The performance evaluation of the proposed design is detailed in Section V. Section VI concludes this work. Table 1 describes the basic notations used in the paper.

II. RELATED WORK
The CAM design methodologies presented in [18] and [19] are based on the hashing technique, which has the inherent drawback of bucket overflow. Moreover, when implemented in hardware this has an expensive overhead from re-hashing. The CAM designs presented in [20] and [21] suffer from inefficient memory usage. The increase in pattern width results in an exponential growth in memory usage, thus making them infeasible for implementation in hardware. Our proposed solution reduces this growth to linear, as the wide pattern TCAMs are implemented by cascading BRAMs on FPGA.

The SRAM-based TCAMs presented in [10]–[12] store the TCAM presence and address information in separate BRAMs on FPGAs, resulting in an excessive usage of BRAMs. Our proposed design stores the TCAM presence and address information in the same BRAM, thus achieving efficient memory utilization.

Xilinx presented two types of FPGA applications in [22]: a CAM design using BRAM resources and a TCAM design using the shift register (SRL16E). The first application emulates CAM rather than TCAM, and suffers from higher


SRAM memory usage. The second application consumes one 16-bit shift register look-up table (SRL16E) of SLICEM resources on FPGA to emulate every two bits of a TCAM table. Its implementation for large storage capacity designs suffers from routing and timing problems. Our proposed design has a reduced routing complexity for TCAM designs with large storage capacities, because of its lower usage of BRAMs and reduced AND operation complexity.

Recently, binary CAM and TCAM designs built using logic resources (SLICEL) on FPGA were presented in [23] and [24], respectively. In practice, a TCAM implemented using logic resources on FPGA would be of limited storage capacity, owing to the routing congestion and timing challenges. Moreover, the update of data in a TCAM design built using look-up tables (LUTs) is slow compared with SRAM-based TCAMs and requires the hardware overhead of a dynamic partial reconfiguration controller [25].

A hierarchical search scheme on FPGA is presented for SRAM-based CAM in [17], which reduces its average power consumption by stopping subsequent search operations if a match is found in the previous SRAM block. However, in the worst-case scenario all SRAM blocks are searched. Thus, the worst-case power consumption remains high. The FPGA realization of TCAM presented in [26] stores the TCAM word presence and address information separately in Xilinx distributed RAM and BRAM, respectively. This reduces the average power consumption of the design, as the look-up in BRAMs is avoided if a match is not found in the distributed RAM. However, the worst-case power consumption remains high, with a lower overall system throughput.

The FPGA realizations of TCAM presented in [6], [7], and [17] store the presence and address information of TCAM words in the same SRAM block. However, this approach suffers from higher SRAM memory utilization due to the limited TCAM bits storage capacity of BRAMs resulting from the minimum depth limitation on its configuration in FPGAs.

Our proposed TCAM design exploits the efficient utilization of SRAM memory by mapping TCAM divisions to shallow sub-blocks of BRAMs on FPGA. Furthermore, it operates high-speed BRAMs in the design as multipumping-enabled multiported SRAM, maintaining a high system throughput.

III. PROPOSED DESIGN
A. MULTIPUMPING-ENABLED MULTIPORTED SRAM
The multipumping technique multiplies the ports of a dual-ported SRAM block by internally clocking it at an integral multiple of the external system clock [13], [15], [16], [27]. The addresses and data are registered and provided access to the SRAM block in a circular order by using mod P counter bits as shown in Figure 1. Several designs utilize multipumping for the implementation of efficient multiported memory [28]–[30].

FIGURE 1. Multipumping-based multiported memory: the SRAM block is clocked at an integral multiple of P, allowing P accesses during one external clock cycle.

FIGURE 2. (a) A conventional TCAM of 1 × 8; (b) A 16 × 1 SRAM without multipumping emulating 1 × 4 TCAM; (c) A 16 × 1 SRAM with a multipumping factor of P = 2 emulating 1 × 6 TCAM; (d) A 16 × 1 SRAM with a multipumping factor of P = 4 emulating 1 × 8 TCAM.

B. BASIC IDEA
In the SRAM-based implementation of TCAM, the depth of the traditional TCAM determines the width of SRAM memory, and the width of the traditional TCAM is encoded as the address of the SRAM memory. The basic concept of the proposed multipumped SRAM-based TCAM implementation achieving increased memory efficiency is shown in Figure 2. Figure 2(a) shows a 1 × 8 traditional TCAM table, and Figure 2(b) shows the implementation of the four TCAM bits (0*10) by using a 16 × 1 SRAM block. Figure 2(c) shows the implementation of six TCAM bits (100*10) by using a 16 × 1 SRAM block, which has been multipumped two times, each SRAM sub-block of size 8 × 1 emulating three TCAM bits. Figure 2(d) shows the implementation of eight TCAM bits (0*100*10) by using a 16 × 1 SRAM block, which has been multipumped four times, each SRAM sub-block of size 4 × 1 emulating two TCAM bits. Thus, designing TCAM using multipumping-enabled multiported SRAM in Figure 2(c) and (d) achieved a higher SRAM memory efficiency (i.e. fewer SRAM bits are utilized per TCAM


bit) when compared with that of the multipumping-less SRAM-based TCAM design in Figure 2(b). The TCAM bits storage capacity of the SRAM block increases with multipumping.

A multiported SRAM block of size $R_D \times R_W$ with a multipumping factor of $P$ implements a traditional TCAM table of size $P\log_2(R_D/P) \times R_W$, each SRAM sub-block of size $(R_D/P) \times R_W$ emulating $\log_2(R_D/P) \times R_W$ TCAM data, as shown in Figures 3 and 4. Our proposed design achieves increased TCAM bits storage capacity with an increase in the multipumping factor $P$.

FIGURE 3. Proposed partitioning of the traditional TCAM table.

FIGURE 4. Basic architecture of the proposed TCAM memory.
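The mapping described in this subsection can be sketched as a small functional model of a 16 × 1 SRAM. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, `*` denotes a don't-care bit, the ternary patterns are example values, and sub-block p is assumed to hold the p-th group of pattern bits counted from the left.

```python
def tcam_width(r_depth: int, pump: int) -> int:
    """Pattern bits emulated by one SRAM block: P * log2(R_D / P)."""
    return pump * ((r_depth // pump).bit_length() - 1)


def build_sram(pattern: str, r_depth: int, pump: int) -> list:
    """Fill an r_depth x 1 SRAM so that it emulates one ternary pattern."""
    sub_depth = r_depth // pump
    k = sub_depth.bit_length() - 1  # log2(R_D / P) bits per sub-block
    assert len(pattern) == pump * k
    sram = [0] * r_depth
    for p in range(pump):
        sub_pattern = pattern[p * k:(p + 1) * k]
        for addr in range(sub_depth):
            bits = format(addr, f"0{k}b")
            hit = all(c in ("*", b) for c, b in zip(sub_pattern, bits))
            sram[p * sub_depth + addr] = int(hit)
    return sram


def search(sram: list, word: str, pump: int) -> int:
    """Read one bit from each sub-block and AND the results together."""
    sub_depth = len(sram) // pump
    k = sub_depth.bit_length() - 1
    match = 1
    for p in range(pump):
        addr = int(word[p * k:(p + 1) * k], 2)
        match &= sram[p * sub_depth + addr]
    return match


if __name__ == "__main__":
    for pump, pattern in ((1, "0*10"), (2, "100*10"), (4, "0*100*10")):
        sram = build_sram(pattern, 16, pump)
        print(pump, tcam_width(16, pump), sram)
```

The same 16 SRAM bits emulate 4, 6, and 8 TCAM bits for P = 1, 2, and 4, at the cost of P reads per search.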

C. PROPOSED PARTITIONING OF TRADITIONAL TCAM TABLE
We partition the traditional TCAM table of size $D \times W$ into $M \times N$ partitions such that each partition consists of $P$ parts of size $\log_2(R_D/P) \times R_W$, as shown in Figure 3. Our proposed TCAM design uses its configured SRAM blocks of size $R_D \times R_W$ as multiported SRAM, constituting $P$ sub-blocks of size $(R_D/P) \times R_W$, as shown in Figure 4.

Each sub-block of the SRAM stores a division of the traditional TCAM of size $\log_2(R_D/P) \times R_W$. Consequently, the $P$ sub-blocks of the multiported SRAM memory in our proposed design store a traditional TCAM division of size $P\log_2(R_D/P) \times R_W$, as shown in Figures 3 and 4. Similarly, the $M \times N$ TCAM divisions of size $P\log_2(R_D/P) \times R_W$ are mapped to the SRAM blocks of the $M \times N$ TCAM memory units in the proposed design, as shown in Figures 3 and 5.

FIGURE 5. Organization of the proposed TCAM memory units for a large storage capacity: (IW: input word, PE: priority encoder, OPE: overall priority encoder).

D. BASIC ARCHITECTURE OF THE PROPOSED TCAM MEMORY
The basic architecture of our proposed TCAM memory design is shown in Figure 4. It is operated by two fully synchronized clocks, a system clock $clk_S$ and an internal clock $clk_P$, such that $clk_P$ is $P$ times faster than $clk_S$. An incoming TCAM word is registered in a $W$-bit shift register using the system clock $clk_S$. The $\log_2 P$-bit counter generates a sequence of $\log_2 P$-bit numbers in $P$ internal clock cycles. It is initialized to zero upon reset and it rolls over after every $P$ internal clock cycles. The $\log_2 P$ bits from the counter are concatenated with the $\log_2(R_D/P)$ bits from the shift register to make the $\log_2 R_D$-bit address of the SRAM. At the positive edge of the internal clock $clk_P$, the SRAM address is formed such that the $\log_2 P$ bits from the counter constitute its most significant bits and point to the start of the corresponding sub-block in SRAM, and the lower $\log_2(R_D/P)$ bits from the shift register select an SRAM word in the sub-block.

The read SRAM words are AND-accumulated for each cycle in an $R_W$-bit register using $clk_P$. In this way, the look-up is completed for a $W$-bit input word by reading and AND-accumulating SRAM words from each sub-block of the SRAM in $P$ internal clock cycles, or one system cycle. Consequently, the $P$ AND-accumulated SRAM words are produced as the match word using $clk_S$. The timing diagram in Figure 6 elaborates the search operation of the proposed TCAM memory architecture shown in Figure 4 with a multipumping factor of $P = 2$.
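The search sequence above can be traced cycle by cycle with a behavioural model. The sketch below is a software illustration only, not the hardware design: the function names and the three stored rules are hypothetical, each SRAM word is modelled as an integer of $R_W$ bits, and the shift register is assumed to deliver the most significant sub-word first.

```python
def sub_block_bits(r_depth: int, pump: int) -> int:
    """log2(R_D / P): pattern bits handled by one SRAM sub-block."""
    return (r_depth // pump).bit_length() - 1


def write_rule(sram, row, pattern, r_depth, pump):
    """Store one ternary rule ('0', '1', '*') in bit `row` of every sub-block."""
    k = sub_block_bits(r_depth, pump)
    assert len(pattern) == pump * k
    for counter in range(pump):
        sub_pattern = pattern[counter * k:(counter + 1) * k]
        for index in range(1 << k):
            bits = format(index, f"0{k}b")
            if all(c in ("*", b) for c, b in zip(sub_pattern, bits)):
                sram[(counter << k) | index] |= 1 << row


def search(sram, word, r_depth, r_width, pump):
    """Return the match word and a per-cycle trace (counter, address, acc)."""
    k = sub_block_bits(r_depth, pump)
    acc = (1 << r_width) - 1                 # every row is still a candidate
    trace = []
    for counter in range(pump):              # one iteration per clk_P cycle
        shift = (pump - 1 - counter) * k     # assumption: MSB sub-word first
        index = (word >> shift) & ((1 << k) - 1)
        address = (counter << k) | index     # counter bits are the MSBs
        acc &= sram[address]                 # AND-accumulate the read word
        trace.append((counter, address, acc))
    return acc, trace


R_DEPTH, R_WIDTH, PUMP = 16, 4, 2
sram = [0] * R_DEPTH
write_rule(sram, 0, "100*10", R_DEPTH, PUMP)
write_rule(sram, 1, "1*0***", R_DEPTH, PUMP)
write_rule(sram, 2, "000000", R_DEPTH, PUMP)
match_word, trace = search(sram, 0b100110, R_DEPTH, R_WIDTH, PUMP)
print(bin(match_word), trace)
```

Bit i of the returned match word is set when rule i matches the input word; rows that were never written stay at zero and therefore never match.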


FIGURE 6. Timing diagram for the search operation in our proposed TCAM with a multipumping factor P = 2: (IW: input word, RW: SRAM word read, MW: match word).

E. MODULAR ARCHITECTURE
A TCAM design of large storage capacity is implemented as a cascade of $M \times N$ TCAM memory units of the proposed design, as shown in Figure 5. An incoming $W$-bit TCAM word is divided into $N$ sub-words of $P\log_2(R_D/P)$ bits with the bit ranges shown in Figure 5. The resultant sub-words are stored in $N$ shift registers of size $P\log_2(R_D/P)$ bits on $clk_S$. The $\log_2 R_D$-bit indexes from the $N$ shift registers are provided to the corresponding $M$ TCAM memory units of the $N$ columns of the proposed design in parallel using $clk_P$, as shown in Figure 5. All TCAM memory units of the design operate in parallel using $clk_P$. The $R_W$-bit match words from each row of the TCAM memory units are bit-wise ANDed on $clk_S$, and the results are provided to the associated priority encoder (PE) units. The $\log_2 D$-bit match address and the match information from each PE unit are provided to the overall priority encoder unit, which eventually forwards a match address based on the priority. The proposed TCAM design registers an input word and produces a match word as output on $clk_S$.

The update of a TCAM word is performed in each TCAM memory unit of the design in parallel. The worst-case update latency of the proposed design comprises $R_D/P$ system cycles.

F. EFFECT OF MULTIPUMPING SRAM ON THE MEMORY USAGE AND THROUGHPUT
Multipumping results in a useful reduction in SRAM memory usage for the design of TCAM on FPGA. The configured SRAM memory blocks in our proposed design with the multipumping factor of $P$ implement traditional TCAM divisions of size $P\log_2(R_D/P) \times R_W$, as shown in Figure 4. The TCAM bits storage capacity of SRAM blocks in the proposed design increases with an increase in $P$. The upper bound on the multipumping factor $P$ is $R_D/2$, i.e. $R_D/2$ sub-blocks in the SRAM, each consisting of two SRAM words.

Multipumping divides the achievable internal clock frequency of the design by the multipumping factor, to obtain the operating frequency of the overall system [13]–[16]. Although an increase in the multipumping factor $P$ results in a higher memory efficiency for the design of TCAM, only the use of small multipumping factors is practical in order to avoid a significant drop in the operating frequency of the overall system. Overall, the multipumping factor $P$ controls a tradeoff between the SRAM memory efficiency and speed of the proposed design.

IV. IMPLEMENTATION SETUP AND RESULTS
To verify our proposed design we implemented it on a Xilinx Virtex-6 FPGA device (xc6vlx760). The proposed design was implemented using the Xilinx ISE 14.7 design tool, and verified through behavioral and post-route simulations using an ISim simulator.

We implemented our proposed design cases I and II on the Xilinx Virtex-6 FPGA device for 512 × 28 (14 Kb) and 512 × 32 (16 Kb) TCAM tables, with multipumping factors of P = 4 and P = 2, respectively. Our proposed design CASE-III implements a large TCAM table of size 1024 × 140 (140 Kb), with a multipumping factor of P = 4. We have selected small multipumping factors of P = 4, 2, and 4, in our proposed design cases I, II, and III, to avoid lower operating frequencies of the overall system.

Table 2 lists the FPGA resource utilization slice registers (SRs), look-up tables, and BRAMs for the implementation of our proposed design cases I, II, and III. The post place & route results show that the proposed design cases I, II, and III could achieve internal clock frequencies of 475 MHz, 475 MHz, and 349 MHz and multipumping factors of P = 4, 2, and 4, giving the system clock frequencies of 119 MHz, 237 MHz, and 87 MHz, respectively.

TABLE 2. FPGA resource utilization of the proposed design.

V. PERFORMANCE EVALUATION
The performance of our proposed design is evaluated based on its comparison with the existing SRAM-based TCAM solutions on FPGAs.

A. SRAM MEMORY UTILIZATION
SRAM-based TCAM solutions implement a traditional TCAM of depth $D$ and width $W$ by cascading SRAM blocks of size $R_D \times R_W$ on FPGAs. The minimum overall SRAM memory requirement of the existing SRAM-based TCAM solutions on FPGAs can be formulated as (1) shown below:

$$\sum_{M=1}^{D/R_W}\;\sum_{N=1}^{W/\log_2 R_D} (R_D \times R_W) = \left(\frac{D}{R_W}\right)\left(\frac{W}{\log_2 R_D}\right)(R_D \times R_W) = DW\left(\frac{R_D}{\log_2 R_D}\right) \tag{1}$$

The overall memory requirement of the proposed design for the implementation of a D × W size traditional TCAM using


RD × RW size SRAM blocks is devised as (2) shown below:

$$\sum_{M=1}^{D/R_W}\;\sum_{N=1}^{W/(P\log_2(R_D/P))} (R_D \times R_W) = \left(\frac{D}{R_W}\right)\left(\frac{W}{P\log_2(R_D/P)}\right)(R_D \times R_W) = DW\left(\frac{R_D}{P\log_2(R_D/P)}\right) \tag{2}$$

Equation (2) describes that the SRAM memory usage of our proposed design is $\frac{R_D}{P\log_2(R_D/P)}$ times that of the corresponding traditional TCAM table of size $D \times W$.

Our proposed design achieves a considerable reduction in the SRAM memory usage, by a factor of $\frac{1}{P[1-\log_2 P/\log_2 R_D]}$, when compared with that of the existing approaches, as described using (3) as follows:

$$\frac{DW\left(\frac{R_D}{P\log_2(R_D/P)}\right)}{DW\left(\frac{R_D}{\log_2 R_D}\right)} = \frac{\log_2 R_D}{P\log_2(R_D/P)} = \frac{\log_2 R_D}{P[\log_2 R_D - \log_2 P]} = \frac{1}{P[1 - \log_2 P/\log_2 R_D]} \tag{3}$$

The usage of BRAMs in our proposed design is compared with those of previous approaches in Column 5 of Table 3. Our proposed TCAM design CASE-I emulates a 14 Kb traditional TCAM, achieving a lower BRAMs utilization of 8 BRAMs compared with the usage of 56, 40, 40, 32, and 64 BRAMs for previous approaches in [10]–[12], [22], and [26], respectively, for an 18 Kb traditional TCAM emulation. The proposed design CASE-III emulates a large TCAM of size 1024 × 140 using 80 BRAMs. It achieves a lower BRAMs utilization compared with the large TCAM implementations of size 1024 × 150 and 504 × 180 in the previous approaches [7] and [17], using 272 and 140 BRAMs, respectively.

TABLE 3. Performance per memory comparison of the proposed TCAM with previous approaches.

B. THROUGHPUT
The operational speed of our proposed design is compared with those of previous approaches in column 4 of Table 3. Our proposed design cases I and II emulate traditional TCAMs of size 14 Kb and 16 Kb, achieving operating frequencies of 119 MHz and 237 MHz with multipumping factors of P = 4 and 2, respectively. The operating frequency of our proposed design CASE-II is higher than previous works [10]–[12], [22], and [26] for an 18 Kb traditional TCAM emulation.

Our proposed design methodology is more useful for the design of large storage capacity TCAMs. The TCAM memory units of our proposed design AND-accumulate SRAM words from the sub-blocks of the SRAM blocks in each system cycle, reducing the complexity of the AND operation units of the overall architecture, as shown in Figures 4 and 5. This further prevents the AND operation units from limiting the operating frequency of wide pattern TCAM designs on FPGA. Our proposed design uses fewer BRAMs, thus alleviating the overall routing complexity of the design on FPGA. The divided AND operation complexity and reduced routing complexity make our proposed design more practical for large storage capacity TCAMs.

The system frequency of our proposed design CASE-III emulating a large capacity TCAM of 140 Kb is 87 MHz, which is comparable with the maximum achievable frequency of 97 MHz in previous work [7] implementing a large size TCAM of 150 Kb. Meanwhile, the SRAM memory usage of our proposed design CASE-III is 70% lower than that of [7].

Our proposed design provides increased design flexibility in terms of the speed vs memory usage tradeoff. The designer must consider the important design factors such as the required storage capacity, relative availability of BRAMs on the target FPGA, and required throughput for the selection of the multipumping factor in our proposed design.

C. PERFORMANCE PER MEMORY
Considering the time-space tradeoff, we used the performance evaluation metric performance per memory from [31], given by (4).

$$\text{Performance per memory} = \frac{\text{Throughput (Gb/s)}}{\text{Normalized Memory [Memory (Kb)/TCAM Depth]}} \tag{4}$$

Table 3 compares the performance per memory of our design with previous FPGA-based TCAMs. The depth and pattern width of traditional TCAMs implemented in previous studies are listed in the third column. For a fair comparison, the speed results of the compared works with technology differences


are normalized to 40 nm, using (5) from [32]. The speed results in parentheses represent the original data reported in the respective papers.

$$T^{*} = T \times \left(\frac{40\,(\text{nm})}{\text{Technology}\,(\text{nm})}\right) \times \left(\frac{V_{DD}}{1.0}\right) \tag{5}$$

where $T$ represents the original delay time, and $T^{*}$ denotes the normalized delay time for 40 nm CMOS technology with a supply voltage of 1.0 V. The proposed design cases I and II implemented 14 Kb and 16 Kb traditional TCAMs using 288 Kb and 576 Kb SRAM memory with operating frequencies of 119 MHz and 237 MHz, respectively. The proposed design cases I and II achieved a performance per memory of 5.78 ((Gb/s × TCAMDepth)/Kb) and 6.58 ((Gb/s × TCAMDepth)/Kb), respectively.

Table 3 shows that the performance per memory of the proposed design cases I and II are 1.83 times higher than that of UE-TCAM [6], which was the highest among the existing methods. Our proposed design CASE-III emulates a large TCAM of size 1024 × 140, achieving the performance per memory of 4.25 ((Gb/s × TCAMDepth)/Kb), which is 2.85 times higher than for the large TCAM of size 1024 × 150 in the existing study [7].

Our proposed design scales well in terms of the performance when evaluated for the design of a large storage capacity. Table 3 shows that the performance per memory of our proposed design CASE-III is slightly lower than the proposed design CASE-I (with the same multipumping factor of P = 4) while the implemented TCAM size of CASE-III is ten times greater than that of CASE-I.

VI. CONCLUSIONS AND FUTURE WORK
Re-configurable hardware FPGAs emulate TCAM functionality using SRAM memory. Existing SRAM-based solutions of TCAM on FPGAs achieve inefficient memory usage and offer lower operational frequencies. We have presented a memory-efficient design of TCAM, based on multipumping-enabled multiported SRAM, by operating the SRAM blocks in the design at a frequency that is multiple times higher than that of the overall system. This allows reading from its sub-blocks to take place within one system cycle. The FPGA implementation results show that the performance per memory of our proposed design is up to 2.85 times higher than for existing SRAM-based TCAM solutions on FPGA.

[4] L.-Y. Huang et al., "ReRAM-based 4T2R nonvolatile TCAM with 7x NVM-stress reduction, and 4x improvement in speed-wordlength-capacity for normally-off instant-on filter-based search engines used in big-data processing," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2014, pp. 1–2.
[5] M.-F. Chang et al., "A 3T1R nonvolatile TCAM using MLC ReRAM for frequent-off instant-on filters in IoT and big-data processing," IEEE J. Solid-State Circuits, vol. 52, no. 6, pp. 1664–1679, Jun. 2017.
[6] Z. Ullah, M. K. Jaiswal, R. C. C. Cheung, and H. K. H. So, "UE-TCAM: An ultra efficient SRAM-based TCAM," in Proc. IEEE Region 10 Conf. (TENCON), Nov. 2015, pp. 1–6.
[7] W. Jiang, "Scalable ternary content addressable memory implementation using FPGAs," in Proc. 9th ACM/IEEE Symp. Archit. Netw. Commun. Syst., 2013, pp. 71–82.
[8] Virtex-6 FPGA Memory Resources User Guide, Xilinx, San Jose, CA, USA, 2014. [Online]. Available: http://www.xilinx.com
[9] P. Alfke, "Creative uses of block RAM," Xilinx, San Jose, CA, USA, White Paper WP335, 2008.
[10] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned SRAM-based ternary content addressable memory," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 12, pp. 2969–2979, Dec. 2012.
[11] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung, "Z-TCAM: An SRAM-based architecture for TCAM," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 2, pp. 402–406, Feb. 2015.
[12] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung, "E-TCAM: An efficient SRAM-based architecture for TCAM," Circuits, Syst. Signal Process., vol. 33, no. 10, pp. 3123–3144, Oct. 2014.
[13] C. E. LaForest and J. G. Steffan, "Efficient multi-ported memories for FPGAs," in Proc. 18th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 2010, pp. 41–50.
[14] A. Abdelhadi and G. G. F. Lemieux, "Modular switched multiported SRAM-based memories," ACM Trans. Reconfigurable Technol. Syst., vol. 9, no. 3, p. 22, 2016.
[15] H. E. Yantir, S. Bayar, and A. Yurdakul, "Efficient implementations of multi-pumped multi-port register files in FPGAs," in Proc. Euromicro Conf. Digit. Syst. Design (DSD), Sep. 2013, pp. 185–192.
[16] C. E. LaForest. Multi-Ported Memories for FPGAs. Accessed: Nov. 10, 2017. [Online]. Available: http://fpgacpu.ca/multiport/index.html
[17] Z. Qian and M. Margala, "Low power RAM-based hierarchical CAM on FPGA," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2014, pp. 1–4.
[18] P. Mahoney, Y. Savaria, G. Bois, and P. Plante, "Parallel hashing memories: An alternative to content addressable memories," in Proc. 3rd Int. IEEE-NEWCAS Conf., Jun. 2005, pp. 223–226.
[19] S. Cho, J. R. Martin, R. Xu, M. H. Hammoud, and R. Melhem, "CA-RAM: A high-performance memory substrate for search-intensive applications," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Apr. 2007, pp. 230–241.
[20] S. V. Kartalopoulos, "RAM-based associative content-addressable memory device, method of operation thereof and ATM communication switching system employing the same," U.S. Patent 6 097 724, Aug. 1, 2000.
[21] M. Somasundaram, "Circuits to generate a sequential index for an input number in a pre-defined list of numbers," U.S. Patent 7 155 563, Dec. 26, 2006.
[22] K. Locke. (2011). Xilinx Application Note: XAPP1151—Parameterizable Content-Addressable Memory. [Online]. Available: http://www.xilinx.com
[23] Z. Ullah, "LH-CAM: Logic-based higher performance binary CAM architecture on FPGA," IEEE Embedded Syst. Lett., vol. 9, no. 2, pp. 29–32, Jun. 2017.
Our proposed solution is general, and can be applied to [24] M. Irfan and Z. Ullah, ‘‘G-AETCAM: Gate-based area-efficient
many applications. Our future work will include the appli- ternary content-addressable memory on FPGA,’’ IEEE Access, vol. 5,
cation of the proposed design to various applications. pp. 20785–20790, 2017.
[25] A. Kulkarni and D. Stroobandt, ‘‘MiCAP-Pro: A high speed custom
reconfiguration controller for dynamic circuit specialization,’’ Des. Autom.
REFERENCES Embedded Syst., vol. 20, no. 4, pp. 341–359, 2016.
[1] B. Agrawal and T. Sherwood, ‘‘Ternary CAM power and delay model: [26] A. Ahmed, K. Park, and S. Baeg, ‘‘Resource-efficient SRAM-based ternary
Extensions and uses,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst., content addressable memory,’’ IEEE Trans. Very Large Scale Integr. (VLSI)
vol. 16, no. 5, pp. 554–564, May 2008. Syst., vol. 25, no. 4, pp. 1583–1587, Apr. 2017.
[2] M. Imani, A. Rahimi, and T. S. Rosing, ‘‘Resistive configurable associative [27] N. Manjikian, ‘‘Design issues for prototype implementation of a pipelined
memory for approximate computing,’’ in Proc. IEEE Design, Autom. Test superscalar processor in programmable logic,’’ in Proc. IEEE Pacific Rim
Eur. Conf. Exhib. (DATE), Mar. 2016, pp. 1327–1332. Conf. Commun., Comput. Signal Process. (PACRIM), vol. 1. Aug. 2003,
[3] N. Mohan, W. Fung, D. Wright, and M. Sachdev, ‘‘Design techniques and pp. 155–158.
test methodology for low-power TCAMs,’’ IEEE Trans. Very Large Scale [28] H. Yokota, ‘‘Multiport memory system,’’ U.S. Patent 4 930 066,
Integr. (VLSI) Syst., vol. 14, no. 6, pp. 573–586, Jun. 2006. May 29, 1990.


[29] B. A. Chappell, T. I. Chappell, M. K. Ebcioglu, and S. E. Schuster, “Virtual multi-port RAM employing multiple accesses during single machine cycle,” U.S. Patent 5 542 067, Jul. 30, 1996.
[30] G. S. Ditlow et al., “A 4R2W register file for a 2.3 GHz wire-speed POWER processor with double-pumped write operation,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2011, pp. 256–258.
[31] H. Nakahara, T. Sasao, H. Iwamoto, and M. Matsuura, “LUT cascades based on edge-valued multi-valued decision diagrams: Application to packet classification,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 6, no. 1, pp. 73–86, Mar. 2016.
[32] P.-T. Huang and W. Hwang, “A 65 nm 0.165 fJ/Bit/search 256×144 TCAM macro design for IPv6 lookup tables,” IEEE J. Solid-State Circuits, vol. 46, no. 2, pp. 507–519, Feb. 2011.

INAYAT ULLAH received the bachelor’s degree in computer system engineering from the University of Engineering and Technology, Peshawar, Pakistan, in 2007. He is currently pursuing the Ph.D. degree with the College of Electronics and Information Engineering, Chosun University, South Korea. Since 2008 he has been a Faculty Member with the Department of Electrical Engineering, Federal Urdu University of Arts, Science & Technology, Pakistan. His areas of interest are digital design, computer architecture, parallel processing, memory architecture, and re-configurable architectures.

ZAHID ULLAH (M’16) received the B.Sc. degree (Hons.) in computer system engineering from the University of Engineering and Technology, Peshawar, Pakistan in 2006, the M.S. degree in electronic, electrical, control, and instrumentation engineering from Hanyang University, South Korea, in 2010, and the Ph.D. degree in electronic engineering from the City University of Hong Kong, Hong Kong, in 2014. He is currently serving as an Associate Professor with the Department of Electrical Engineering, CECOS University of IT & Emerging Sciences, Peshawar, Pakistan. He has authored prestigious journal and conference papers and holds patents in his name in the field of FPGA-based TCAM. His research interests include low power/high speed CAM design on FPGA, low power/high speed VLSI design, and embedded systems.

JEONG-A LEE (M’84–SM’01) received the B.S. degree (Hons.) in computer engineering from Seoul National University in 1982, the M.S. degree in computer science from Indiana University Bloomington, in 1985, and the Ph.D. degree in computer science from the University of California, Los Angeles in 1990. From 1990 to 1995, she was an Assistant Professor with the Department of Electrical and Computer Engineering, University of Houston. Since 1995 she has been affiliated with Chosun University, South Korea. From 2008 to 2009, she served as a Program Director of ECE division, National Research Foundation of Korea. She has authored and co-authored over 100 reviewed journal and conference papers. Her research activities cover high performance computer architectures, memory architecture, approximate computing, self-aware computing, and reliable computing. She is a member of the National Academy of Engineering in South Korea.

