Efficient TCAM Design Based on Multipumping-Enabled Multiported SRAM on FPGA
All content following this page was uploaded by Dr. Zahid Ullah on 20 July 2018.
ABSTRACT Ternary content-addressable memory (TCAM)-based search engines play an important role
in networking routers. The search space demands of TCAM applications are constantly rising. However,
existing realizations of TCAM on field-programmable gate arrays (FPGAs) suffer from storage inefficiency.
This paper presents a multipumping-enabled multiported SRAM-based TCAM design on FPGA, to achieve
an efficient utilization of SRAM memory. Existing SRAM-based solutions for TCAM reduce the impact of
the increase in the traditional TCAM pattern width from an exponential growth in memory usage to a linear
one using cascaded block RAMs (BRAMs) on FPGA. However, BRAMs on state-of-the-art FPGAs have a
minimum depth limitation, which limits the storage efficiency for TCAM bits. Our proposed solution avoids
this limitation by mapping the traditional TCAM table divisions to shallow sub-blocks of the configured
BRAMs, thus achieving a memory-efficient TCAM memory design. The proposed solution operates the
configured simple dual-port BRAMs of the design as multiported SRAM using the multipumping technique,
by clocking them with a higher internal clock frequency to access the sub-blocks of the BRAM in one
system cycle. We implemented our proposed design on a Virtex-6 xc6vlx760 FPGA device. Compared with
existing FPGA-based TCAM designs, our proposed method achieves up to 2.85 times better performance
per memory.
INDEX TERMS Block RAM (BRAM), field-programmable gate array (FPGA), memory architecture,
multiported memory, multipumping, SRAM-based TCAM.
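The multipumping idea summarized in the abstract can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the architecture of the paper: the class name `MultipumpedBram`, the mapping of one search-key chunk to each sub-block, and the tiny depth of 16 are all hypothetical and chosen for readability. One call to `search` stands for one system cycle, during which the SRAM is read P times (once per internal clock tick) and the words read are AND-accumulated.

```python
from typing import List


class MultipumpedBram:
    """Behavioral sketch: one SRAM of depth R_D viewed as P shallow sub-blocks.

    Assumption (not from the paper): sub-block p stores the rule bit-vectors
    for the p-th chunk of the search key.
    """

    def __init__(self, depth: int, pump_factor: int, num_rules: int) -> None:
        if depth % pump_factor:
            raise ValueError("depth must be divisible by the multipumping factor")
        self.sub_depth = depth // pump_factor
        self.chunk_bits = self.sub_depth.bit_length() - 1
        if self.sub_depth < 2 or (1 << self.chunk_bits) != self.sub_depth:
            raise ValueError("sub-block depth must be a power of two >= 2")
        self.pump_factor = pump_factor
        self.num_rules = num_rules
        # One bit-vector per SRAM word; bit r set means rule r matches there.
        self.words: List[int] = [0] * depth

    def write_rule(self, rule: int, pattern: str) -> None:
        """Store a ternary pattern ('0', '1', 'x') of P * chunk_bits symbols."""
        if len(pattern) != self.pump_factor * self.chunk_bits:
            raise ValueError("pattern length does not fit this block")
        for p in range(self.pump_factor):
            chunk = pattern[p * self.chunk_bits:(p + 1) * self.chunk_bits]
            for value in range(self.sub_depth):
                bits = format(value, f"0{self.chunk_bits}b")
                # A don't-care symbol matches both 0 and 1.
                if all(c in ("x", b) for c, b in zip(chunk, bits)):
                    self.words[p * self.sub_depth + value] |= 1 << rule

    def search(self, key: str) -> int:
        """One system cycle: P internal reads, AND-accumulated into one vector."""
        match = (1 << self.num_rules) - 1
        for p in range(self.pump_factor):  # one iteration per internal clock tick
            chunk = key[p * self.chunk_bits:(p + 1) * self.chunk_bits]
            match &= self.words[p * self.sub_depth + int(chunk, 2)]
        return match


tcam = MultipumpedBram(depth=16, pump_factor=4, num_rules=2)
tcam.write_rule(0, "10xx01x1")
tcam.write_rule(1, "xxxxxxxx")
hit_both = tcam.search("10110111")   # both rules match
hit_one = tcam.search("00110111")    # only the all-don't-care rule matches
```

In this toy configuration each sub-block is 4 words deep and is addressed by a 2-bit chunk of the key, so the single 16-word memory behaves like four shallow memories read within one system cycle.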
2169-3536 © 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. VOLUME 6, 2018
I. Ullah et al.: Efficient TCAM Design Based on Multipumping-Enabled Multiported SRAM on FPGA
TABLE 1. List of basic notations used.
Contemporary FPGAs implement block RAM (BRAM) in the silicon substrate, and offer a high speed. For example, the Xilinx Virtex-6 xc6vlx760 FPGA contains 720 BRAMs of size 36 Kb [8], and provides operating frequencies of greater than 500 MHz [9]. Designers utilize these high-speed SRAM blocks to design SRAM-based TCAMs on FPGA.
In existing SRAM-based solutions, the storage capacity of a BRAM for TCAM bits is limited by its higher SRAM/TCAM ratio of $2^9/9$, because of its minimum depth limitation of 512 × 72 when configured in simple dual-port mode on FPGA [8]. For example, the design methodologies proposed in [10], [11], and [12] require a total of 56, 40, and 40 BRAMs of size 36 Kb, respectively, to implement an 18 Kb TCAM.
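The effect of the minimum depth on storage efficiency can be made concrete with a short calculation. The sketch below assumes the usual SRAM-based TCAM mapping, in which a block of depth 2^k is addressed by a k-bit slice of the search pattern, so that each stored column costs 2^k SRAM bits for k TCAM bits. The function names are hypothetical and the depths other than 512 are shown only for comparison.

```python
def sram_bits_per_tcam_bit(depth: int) -> float:
    """SRAM bits spent per emulated TCAM bit for an SRAM block of this depth.

    Assumes a block of depth 2**k is addressed by a k-bit pattern slice.
    """
    addr_bits = depth.bit_length() - 1
    if depth < 2 or (1 << addr_bits) != depth:
        raise ValueError("depth must be a power of two >= 2")
    return depth / addr_bits


def cascaded_sram_bits(tcam_depth: int, tcam_width: int, block_depth: int) -> float:
    """SRAM bits needed to emulate a D x W TCAM by cascading such blocks.

    Growth is linear in the pattern width W; only the per-bit ratio
    depends on how deep each SRAM block is.
    """
    return tcam_depth * tcam_width * sram_bits_per_tcam_bit(block_depth)


if __name__ == "__main__":
    for depth in (512, 256, 128, 64):
        print(f"depth {depth:4d}: {sram_bits_per_tcam_bit(depth):6.2f} SRAM bits per TCAM bit")
```

A depth of 512 gives the ratio 2^9/9 (about 56.9), while a depth of 128 would give 2^7/7 (about 18.3), which is why shallower blocks store TCAM bits more efficiently.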
Excessive usage of BRAMs in the design of TCAM can result in a lack of BRAMs for other parts of the system on FPGA. Furthermore, the limited amount of BRAM resources on FPGA can compel designers to implement TCAMs in distributed RAM using SLICEM, resulting in the consumption of many slices, and a limitation on the maximum clock frequency of the design. This problem becomes more severe for the design of large storage capacity TCAMs. The efficient utilization of SRAM memory is imperative for the design of TCAMs on FPGAs.
The design of memory-efficient TCAMs requires shallow SRAM blocks on FPGAs. Multipumping-based multiported SRAM emulates the sub-blocks of a dual-port SRAM block as multiple shallow SRAM blocks, by operating the SRAM with a higher frequency clock, allowing access to its sub-blocks in one system cycle. Researchers have designed efficient multiported memories using BRAMs on FPGA [13]–[16].
Existing FPGA-based TCAM design methodologies offer lower operational frequencies. This is mainly because of the complex routing of wide signals between BRAMs and logic, resulting from excessive usage of BRAMs, and the complex priority encoding units synthesized in logic slices for deeper traditional TCAMs. For example, the FPGA realizations of TCAM using BRAMs in [7] and [17] achieve operational frequencies of 139 MHz and 133 MHz to emulate 150 Kb and 89 Kb TCAMs, respectively. The highest operational frequency achieved in the previous studies [6], [10]–[12] is 202 MHz for the implementation of an 18 Kb TCAM on FPGA.
The demand for efficient utilization of SRAM memory in the design of TCAM and the speed provided by existing FPGA-based TCAM solutions make the use of multipumping-based multiported SRAM more practical for designing TCAM memory on FPGA. Our proposed TCAM design aims to achieve efficient memory utilization with a high throughput.
The contributions of this work are as follows:
• A novel multipumping-enabled multiported SRAM-based TCAM architecture, which achieves efficient memory utilization, is proposed.
• Our proposed approach presents a scalable and modular TCAM design on FPGA.
• The proposed design is more practical for large storage capacities, owing to the reduced routing complexity achieved by the use of fewer BRAMs and the reduced AND operation complexity. The novel optimization technique of AND-accumulating SRAM words in the proposed TCAM memory units divides the overall AND operation complexity of the design.
• The proposed design is implemented on a state-of-the-art FPGA. A detailed comparison of our proposed design with existing methods is performed with respect to the performance per memory. Our proposed design achieves a performance that is up to 2.85× higher per memory.
The remainder of this paper is organized as follows. Section II surveys related work. The proposed design is described in Section III. Section IV details the implementation setup and results of this work. The performance evaluation of the proposed design is detailed in Section V. Section VI concludes this work. Table 1 describes the basic notations used in the paper.
II. RELATED WORK
The CAM design methodologies presented in [18] and [19] are based on the hashing technique, which has the inherent drawback of bucket overflow. Moreover, when implemented in hardware, this incurs an expensive overhead from re-hashing. The CAM designs presented in [20] and [21] suffer from inefficient memory usage. The increase in pattern width results in an exponential growth in memory usage, thus making them infeasible for implementation in hardware. Our proposed solution reduces this growth to linear, as the wide pattern TCAMs are implemented by cascading BRAMs on FPGA.
The SRAM-based TCAMs presented in [10]–[12] store the TCAM presence and address information in separate BRAMs on FPGAs, resulting in an excessive usage of BRAMs. Our proposed design stores the TCAM presence and address information in the same BRAM, thus achieving efficient memory utilization.
Xilinx presented two types of FPGA applications in [22]: a CAM design using BRAM resources and a TCAM design using the shift register (SRL16E). The first application emulates CAM rather than TCAM, and suffers from higher
TABLE 3. Performance per memory comparison of the proposed TCAM with previous approaches.
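The performance-per-memory metric behind Table 3 is throughput in Gb/s divided by normalized memory, that is, SRAM kilobits per TCAM word. The sketch below evaluates it for the CASE-III figures quoted in the text (a 1024 × 140 TCAM, 80 BRAMs of 36 Kb, 87 MHz). It assumes that throughput equals the system frequency times the pattern width, which this excerpt does not state explicitly, and the function name is hypothetical.

```python
def performance_per_memory(freq_mhz: float, pattern_width: int,
                           tcam_depth: int, sram_kb: float) -> float:
    """Throughput (Gb/s) divided by normalized memory (SRAM Kb per TCAM word).

    Assumption: one W-bit search completes per system clock cycle, so
    throughput = frequency * pattern width.
    """
    throughput_gbps = freq_mhz * 1e6 * pattern_width / 1e9
    normalized_memory = sram_kb / tcam_depth
    return throughput_gbps / normalized_memory


# CASE-III figures as quoted in the text: 1024 x 140 TCAM, 80 BRAMs of 36 Kb, 87 MHz.
case3 = performance_per_memory(freq_mhz=87, pattern_width=140,
                               tcam_depth=1024, sram_kb=80 * 36)
print(f"CASE-III performance per memory: {case3:.2f}")
```

Under these assumptions the result lands close to, but not exactly on, the value of 4.25 reported in the text. The small gap presumably comes from unit conventions or rounding that are not visible in this excerpt.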
$R_D \times R_W$ size SRAM blocks is devised as (2) shown below:

$$\sum_{M=1}^{D/R_W} \; \sum_{N=1}^{W/(P\log_2(R_D/P))} (R_D \times R_W) = \frac{D}{R_W} \cdot \frac{W}{P\log_2(R_D/P)} \, (R_D \times R_W) = \frac{R_D}{P\log_2(R_D/P)} \, DW \tag{2}$$

Equation (2) describes that the SRAM memory usage of our proposed design is $\frac{R_D}{P\log_2(R_D/P)}$ times that of the corresponding traditional TCAM table of size $D \times W$.
Our proposed design achieves a considerable reduction in the SRAM memory usage by a factor of $\frac{1}{P[1 - \log_2 P/\log_2 R_D]}$, when compared with that of the existing approaches, as described using (3) as follows:

$$\frac{DW \, \frac{R_D}{P\log_2(R_D/P)}}{DW \, \frac{R_D}{\log_2 R_D}} = \frac{\log_2 R_D}{P\log_2(R_D/P)} = \frac{\log_2 R_D}{P[\log_2 R_D - \log_2 P]} = \frac{1}{P[1 - \log_2 P/\log_2 R_D]} \tag{3}$$

The usage of BRAMs in our proposed design is compared with those of previous approaches in Column 5 of Table 3. Our proposed TCAM design CASE-I emulates a 14 Kb traditional TCAM, achieving a lower BRAM utilization of 8 BRAMs compared with the usage of 56, 40, 40, 32, and 64 BRAMs for previous approaches in [10]–[12], [22], and [26], respectively, for an 18 Kb traditional TCAM emulation. The proposed design CASE-III emulates a large TCAM of size 1024 × 140 using 80 BRAMs. It achieves a lower BRAM utilization compared with the large TCAM implementations of size 1024 × 150 and 504 × 180 in the previous approaches [7] and [17], using 272 and 140 BRAMs, respectively.
B. THROUGHPUT
The operational speed of our proposed design is compared with those of previous approaches in column 4 of Table 3. Our proposed design cases I and II emulate traditional TCAMs of size 14 Kb and 16 Kb, achieving operating frequencies of 119 MHz and 237 MHz with multipumping factors of P = 4 and 2, respectively. The operating frequency of our proposed design CASE-II is higher than previous works [10]–[12], [22], and [26] for an 18 Kb traditional TCAM emulation.
Our proposed design methodology is more useful for the design of large storage capacity TCAMs. The TCAM memory units of our proposed design AND-accumulate SRAM words from the sub-blocks of the SRAM blocks in each system cycle, reducing the complexity of the AND operation units of the overall architecture, as shown in Figures 4 and 5. This further prevents the AND operation units from limiting the operating frequency of wide pattern TCAM designs on FPGA. Our proposed design uses fewer BRAMs, thus alleviating the overall routing complexity of the design on FPGA. The divided AND operation complexity and reduced routing complexity make our proposed design more practical for large storage capacity TCAMs.
The system frequency of our proposed design CASE-III emulating a large capacity TCAM of 140 Kb is 87 MHz, which is comparable with the maximum achievable frequency of 97 MHz in previous work [7] implementing a large size TCAM of 150 Kb, while the SRAM memory usage of our proposed design CASE-III is 70% lower than that of [7].
Our proposed design provides increased design flexibility in terms of the speed vs memory usage tradeoff. The designer must consider the important design factors such as the required storage capacity, relative availability of BRAMs on the target FPGA, and required throughput for the selection of the multipumping factor in our proposed design.
C. PERFORMANCE PER MEMORY
Considering the time-space tradeoff, we used the performance evaluation metric performance per memory from [31], given by (4).

$$\frac{\text{Throughput (Gb/s)}}{\text{Normalized Memory [Memory (Kb)/TCAM Depth]}} \tag{4}$$

Table 3 compares the performance per memory of our design with previous FPGA-based TCAMs. The depth and pattern width of traditional TCAMs implemented in previous studies are listed in the third column. For a fair comparison, the speed results of the compared works with technology differences
are normalized to 40 nm, using (5) from [32]. The speed results in parentheses represent the original data reported in the respective papers.

$$T^{*} = T \times \frac{40\,(\mathrm{nm})}{\mathrm{Technology}\,(\mathrm{nm})} \times \frac{V_{DD}}{1.0} \tag{5}$$

where $T$ represents the original delay time, and $T^{*}$ denotes the normalized delay time for 40 nm CMOS technology with a supply voltage of 1.0 V. The proposed design cases I and II implemented 14 Kb and 16 Kb traditional TCAMs using 288 Kb and 576 Kb SRAM memory with operating frequencies of 119 MHz and 237 MHz, respectively. The proposed design cases I and II achieved a performance per memory of 5.78 ((Gb/s × TCAMDepth)/Kb) and 6.58 ((Gb/s × TCAMDepth)/Kb), respectively.
Table 3 shows that the performance per memory of the proposed design cases I and II is 1.83 times higher than that of UE-TCAM [6], which was the highest among the existing methods. Our proposed design CASE-III emulates a large TCAM of size 1024 × 140, achieving a performance per memory of 4.25 ((Gb/s × TCAMDepth)/Kb), which is 2.85 times higher than that of the large TCAM of size 1024 × 150 in the existing study [7].
Our proposed design scales well in terms of the performance when evaluated for the design of a large storage capacity. Table 3 shows that the performance per memory of our proposed design CASE-III is slightly lower than that of the proposed design CASE-I (with the same multipumping factor of P = 4), while the implemented TCAM size of CASE-III is ten times greater than that of CASE-I.
VI. CONCLUSIONS AND FUTURE WORK
Re-configurable hardware FPGAs emulate TCAM functionality using SRAM memory. Existing SRAM-based solutions of TCAM on FPGAs suffer from inefficient memory usage and offer lower operational frequencies. We have presented a memory-efficient design of TCAM, based on multipumping-enabled multiported SRAM, by operating the SRAM blocks in the design at a frequency that is multiple times higher than that of the overall system. This allows reading from its sub-blocks to take place within one system cycle. The FPGA implementation results show that the performance per memory of our proposed design is up to 2.85 times higher than for existing SRAM-based TCAM solutions on FPGA.
Our proposed solution is general, and can be applied to many applications. Our future work will include the application of the proposed design to various applications.
REFERENCES
[1] B. Agrawal and T. Sherwood, "Ternary CAM power and delay model: Extensions and uses," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 5, pp. 554–564, May 2008.
[2] M. Imani, A. Rahimi, and T. S. Rosing, "Resistive configurable associative memory for approximate computing," in Proc. IEEE Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2016, pp. 1327–1332.
[3] N. Mohan, W. Fung, D. Wright, and M. Sachdev, "Design techniques and test methodology for low-power TCAMs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 6, pp. 573–586, Jun. 2006.
[4] L.-Y. Huang et al., "ReRAM-based 4T2R nonvolatile TCAM with 7x NVM-stress reduction, and 4x improvement in speed-wordlength-capacity for normally-off instant-on filter-based search engines used in big-data processing," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2014, pp. 1–2.
[5] M.-F. Chang et al., "A 3T1R nonvolatile TCAM using MLC ReRAM for frequent-off instant-on filters in IoT and big-data processing," IEEE J. Solid-State Circuits, vol. 52, no. 6, pp. 1664–1679, Jun. 2017.
[6] Z. Ullah, M. K. Jaiswal, R. C. C. Cheung, and H. K. H. So, "UE-TCAM: An ultra efficient SRAM-based TCAM," in Proc. IEEE Region 10 Conf. (TENCON), Nov. 2015, pp. 1–6.
[7] W. Jiang, "Scalable ternary content addressable memory implementation using FPGAs," in Proc. 9th ACM/IEEE Symp. Archit. Netw. Commun. Syst., 2013, pp. 71–82.
[8] Virtex-6 FPGA Memory Resources User Guide, Xilinx, San Jose, CA, USA, 2014. [Online]. Available: http://www.xilinx.com
[9] P. Alfke, "Creative uses of block RAM," Xilinx, San Jose, CA, USA, White Paper WP335, 2008.
[10] Z. Ullah, K. Ilgon, and S. Baeg, "Hybrid partitioned SRAM-based ternary content addressable memory," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 12, pp. 2969–2979, Dec. 2012.
[11] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung, "Z-TCAM: An SRAM-based architecture for TCAM," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 2, pp. 402–406, Feb. 2015.
[12] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung, "E-TCAM: An efficient SRAM-based architecture for TCAM," Circuits, Syst. Signal Process., vol. 33, no. 10, pp. 3123–3144, Oct. 2014.
[13] C. E. LaForest and J. G. Steffan, "Efficient multi-ported memories for FPGAs," in Proc. 18th Annu. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 2010, pp. 41–50.
[14] A. Abdelhadi and G. G. F. Lemieux, "Modular switched multiported SRAM-based memories," ACM Trans. Reconfigurable Technol. Syst., vol. 9, no. 3, p. 22, 2016.
[15] H. E. Yantir, S. Bayar, and A. Yurdakul, "Efficient implementations of multi-pumped multi-port register files in FPGAs," in Proc. Euromicro Conf. Digit. Syst. Design (DSD), Sep. 2013, pp. 185–192.
[16] C. E. LaForest. Multi-Ported Memories for FPGAs. Accessed: Nov. 10, 2017. [Online]. Available: http://fpgacpu.ca/multiport/index.html
[17] Z. Qian and M. Margala, "Low power RAM-based hierarchical CAM on FPGA," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Dec. 2014, pp. 1–4.
[18] P. Mahoney, Y. Savaria, G. Bois, and P. Plante, "Parallel hashing memories: An alternative to content addressable memories," in Proc. 3rd Int. IEEE-NEWCAS Conf., Jun. 2005, pp. 223–226.
[19] S. Cho, J. R. Martin, R. Xu, M. H. Hammoud, and R. Melhem, "CA-RAM: A high-performance memory substrate for search-intensive applications," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Apr. 2007, pp. 230–241.
[20] S. V. Kartalopoulos, "RAM-based associative content-addressable memory device, method of operation thereof and ATM communication switching system employing the same," U.S. Patent 6 097 724, Aug. 1, 2000.
[21] M. Somasundaram, "Circuits to generate a sequential index for an input number in a pre-defined list of numbers," U.S. Patent 7 155 563, Dec. 26, 2006.
[22] K. Locke. (2011). Xilinx Application Note XAPP1151: Parameterizable Content-Addressable Memory. [Online]. Available: http://www.xilinx.com
[23] Z. Ullah, "LH-CAM: Logic-based higher performance binary CAM architecture on FPGA," IEEE Embedded Syst. Lett., vol. 9, no. 2, pp. 29–32, Jun. 2017.
[24] M. Irfan and Z. Ullah, "G-AETCAM: Gate-based area-efficient ternary content-addressable memory on FPGA," IEEE Access, vol. 5, pp. 20785–20790, 2017.
[25] A. Kulkarni and D. Stroobandt, "MiCAP-Pro: A high speed custom reconfiguration controller for dynamic circuit specialization," Des. Autom. Embedded Syst., vol. 20, no. 4, pp. 341–359, 2016.
[26] A. Ahmed, K. Park, and S. Baeg, "Resource-efficient SRAM-based ternary content addressable memory," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 4, pp. 1583–1587, Apr. 2017.
[27] N. Manjikian, "Design issues for prototype implementation of a pipelined superscalar processor in programmable logic," in Proc. IEEE Pacific Rim Conf. Commun., Comput. Signal Process. (PACRIM), vol. 1, Aug. 2003, pp. 155–158.
[28] H. Yokota, "Multiport memory system," U.S. Patent 4 930 066, May 29, 1990.
[29] B. A. Chappell, T. I. Chappell, M. K. Ebcioglu, and S. E. Schuster, "Virtual multi-port RAM employing multiple accesses during single machine cycle," U.S. Patent 5 542 067, Jul. 30, 1996.
[30] G. S. Ditlow et al., "A 4R2W register file for a 2.3 GHz wire-speed POWER processor with double-pumped write operation," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2011, pp. 256–258.
[31] H. Nakahara, T. Sasao, H. Iwamoto, and M. Matsuura, "LUT cascades based on edge-valued multi-valued decision diagrams: Application to packet classification," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 6, no. 1, pp. 73–86, Mar. 2016.
[32] P.-T. Huang and W. Hwang, "A 65 nm 0.165 fJ/Bit/search 256×144 TCAM macro design for IPv6 lookup tables," IEEE J. Solid-State Circuits, vol. 46, no. 2, pp. 507–519, Feb. 2011.
ZAHID ULLAH (M'16) received the B.Sc. degree (Hons.) in computer system engineering from the University of Engineering and Technology, Peshawar, Pakistan, in 2006, the M.S. degree in electronic, electrical, control, and instrumentation engineering from Hanyang University, South Korea, in 2010, and the Ph.D. degree in electronic engineering from the City University of Hong Kong, Hong Kong, in 2014. He is currently serving as an Associate Professor with the Department of Electrical Engineering, CECOS University of IT & Emerging Sciences, Peshawar, Pakistan. He has authored prestigious journal and conference papers and holds patents in his name in the field of FPGA-based TCAM. His research interests include low power/high speed CAM design on FPGA, low power/high speed VLSI design, and embedded systems.