This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Virtual-Tile-Based Flip-Flop Alignment Methodology for Clock Network Power Optimization

Taehyun Kwon, Student Member, IEEE, Muhammad Imran, Student Member, IEEE, David Z. Pan, Fellow, IEEE, and Joon-Sung Yang, Senior Member, IEEE

Abstract: The clock network plays the most significant role in power consumption in IC design. Since a clock network normally has a high switching ratio, power optimization of the clock network is one of the best ways to minimize dynamic power and total power in modern IC designs. The clock network is synthesized based on an initial flip-flop placement. The number of clock buffers and their sizes are decided by the initial placement. Moreover, clock wires, which are the major sources of clock power consumption, are also constructed based on the flip-flop placement. As a result, the flip-flop placement determines the quality of the clock network. In this article, we propose a new clock network optimization method to reduce the dynamic power consumption of the clock network. The method first creates virtual tiles over the entire design area and selects the most effective columns to align flip-flops in lines. Once the effective columns are determined, flip-flops are relocated based on the virtual tiles in the columns considering the minimum moving distance. By aligning flip-flops, it is possible to significantly reduce both wire capacitance and wire length. Since it does not change the clock structure, unlike the conventional clock network optimization techniques which use multibit flip-flops or register banks, there is no degradation in timing or other constraints. Experimental results show that the proposed method reduces the wire capacitance, wire length, and via count by up to 23.2%, 10.2%, and 16.4%, respectively, in five industrial intellectual property (IP) designs. The reduction in clock network power is 14.1% on average.

Index Terms: Cell placement, clock network optimization, clock tree synthesis (CTS), flip-flop alignment, flip-flop relocation, multibit flip-flop (MBFF).

Manuscript received June 27, 2019; revised October 16, 2019 and December 7, 2019; accepted December 29, 2019. This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea by the Ministry of Education under Grant NRF-2018R1D1A1B07049842 and Grant 2015R1D1A1A01058856, in part by the Ministry of Trade, Industry & Energy (MOTIE) under Grant 10080594, and in part by the Korea Semiconductor Research Consortium (KSRC) Support Program for the Development of the Future Semiconductor Device. (Corresponding author: Joon-Sung Yang.)
Taehyun Kwon is with the Department of Semiconductor and Display Engineering, Sungkyunkwan University, Seoul, South Korea, and also with the System LSI Division, Samsung Electronics, Seoul, South Korea.
Muhammad Imran is with the Department of Electrical and Computer Engineering, Sungkyunkwan University, Seoul, South Korea.
David Z. Pan is with the Electrical and Computer Engineering Department, The University of Texas at Austin, Austin, TX 78712 USA.
Joon-Sung Yang is with the Department of Systems Semiconductor Engineering, Yonsei University, Seoul, South Korea (e-mail: js.yang@yonsei.ac.kr).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2020.2966912

I. INTRODUCTION

Power consumption in IC designs is increasing significantly due to an increase in design complexity and design area. An increased number of flip-flops and a higher operating frequency contribute to a further increase in power consumption. The performance of CPUs or GPUs operating at a high clock frequency is limited by their high power consumption. A clock network occupies a small portion of the entire design area; however, it consumes up to 40% of the total power of the IC design [1], [2]. Hence, clock network optimization is necessary to efficiently reduce the dynamic and total power of an IC design. Clock power optimization studies have been pursued along two main paths. One approach is to optimize the non-flip-flop elements in a clock network such as clock buffers, wires, vias, or clock network structures. The other is to optimize the clock sink structures (i.e., flip-flops) for power reduction.

Many techniques are under investigation to optimize the number of clock buffers and their sizes for power reduction [3]–[6]. Basically, the number of clock buffers is determined by the clock net wire length. One approach for reducing the number of clock buffers is placing flip-flops close to their driving cells. This approach is referred to as flip-flop clumping, which pulls the flip-flops toward the driving cells. This helps reduce the dynamic power consumption due to the reduced clock buffers and wires at the leaf-level clock network. However, flip-flop clumping causes considerable timing degradation or routing congestion since it moves flip-flops over long distances. The number of clock buffers can also be reduced by utilizing clock network architectures other than a conventional clock tree, such as a clock mesh [7]–[10]. The mesh structure requires a lower number of clock buffers; however, this methodology limits the utility of the skew technique [11], [12], which is a very powerful timing optimization method that is widely used in modern designs. In addition, the long evaluation time required to analyze a clock architecture and design structure has become an issue.

Clock sink optimizations such as using a low-power flip-flop, a multibit flip-flop (MBFF), or register banking are effective approaches to reduce clock power consumption. The use of low-power flip-flops is a good approach for reducing clock power [13]–[15]. However, because low-power flip-flops generally have a long cell delay, their use is highly limited
1063-8210 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: NXP Semiconductors. Downloaded on March 02,2020 at 08:41:53 UTC from IEEE Xplore. Restrictions apply.

to avoid significant timing issues. An MBFF merges multiple single-bit flip-flops into a single flip-flop [16]–[21]. In this case, power consumption can be considerably reduced because the individual single-bit flip-flops are merged into one such unit. After placement, single-bit flip-flops that can be merged together are found and replaced by MBFFs. The other approach for clock sink optimization is to use register banking. While MBFF requires the development of a new MBFF cell library, register banking identifies multiple flip-flops and forms a bank in a similar manner to register files [22], [23]. To introduce MBFF or register banking, the logic signals connected to the flip-flops need to be rerouted. This introduces timing violations or routing issues. To resolve these problems, timing-engineering change order (ECO) or debanking is required, and the overhead is not negligible.

This article introduces a new clock network optimization method which lines up flip-flops to reduce clock network power. Unlike conventional methods which involve the restructuring of flip-flops, like MBFF or register banking, the proposed method performs an integrated clock gating (ICG) cell-based in-line flip-flop relocation to reduce both the wire capacitance and wire length of a clock network without any changes in the clock structure. The number of vias is also reduced because the in-lined flip-flops allow for straight interconnections. The flip-flop relocation is performed over a short distance within a timing error-free region in the same clock network. This helps avoid additional timing violations or other constraint violations such as maximum capacitance and maximum transition. Hence, the proposed method can be effectively applied to timing-critical or turn-around time (TAT) critical designs.

The proposed flip-flop alignment method is a simple yet very effective method for clock power optimization. The significant features of this method are summarized as follows.

1) The proposed method reduces wire capacitance, wire length, and via counts in a clock network by relocating flip-flops in lines. This helps reduce the cell area of clock components such as clock buffers and ICGs because of reduced load capacitance.
2) The proposed method does not cause timing degradation or other constraint violations, such as maximum capacitance violation and maximum transition violation, because flip-flops are relocated within a timing violation-free region. Since the method relocates flip-flops within a short distance, it does not introduce additional routing congestion.
3) The proposed method does not have additional run-time and TAT overhead. Since the method uses precalculated timing information without an additional timing update, the run-time overhead is negligible. Moreover, the method does not require additional cell libraries such as a special library for MBFF.

Unlike the previous clock network optimization methods such as flip-flop merging or flip-flop clumping techniques, the proposed algorithm aligns flip-flops to physically simplify the clock network wiring. This article describes the details of the alignment methodology and evaluates the proposed method. The rest of this article is organized as follows. Section II provides background and preliminaries for this article. Sections III and IV explain the proposed method in detail. Section V presents the experimental results and analysis, and Section VI concludes this article.

II. BACKGROUND AND PRELIMINARIES

The clock network plays a major role in power consumption in IC design. Clock networks account for most of the dynamic power consumption because a clock signal generally has a higher toggling ratio than other data signals. Therefore, it is critical to reduce the clock power by reducing the dynamic power of the clock network. Following [24] and [25], dynamic power can be modeled as

$$P_{\mathrm{dyn}} = \alpha \cdot C_{\mathrm{tot}} \cdot V_{dd}^{2} \cdot f \tag{1}$$

where α, C_tot, V_dd, and f are the switching activity, total capacitance, supply voltage, and operating frequency, respectively. The switching activity is determined by the functionality of the design. Controlling f and V_dd would be good ways to reduce the dynamic power [26]. However, under the same clock specification and clocking method, the capacitance should be decreased to reduce the dynamic power. In a clock network, the capacitance is mainly composed of the pin capacitance of the flip-flops and the wire capacitance between flip-flops. Flip-flop sizing or merging could reduce the pin capacitance. On the other hand, the wire capacitance can be decreased by reducing the wire length with a decreased number of vias. Wire length optimization of the leaf-level clock network is effective in reducing the wire capacitance because 80% of the total wire capacitance of a clock network is accounted for by the wires between the flip-flops and their driving cells [27], [28]. This wire optimization includes reducing the length of the leaf wire and the number of vias.

This section explains the preliminaries (ICG cell, MBFF, and register banking) that are commonly used for clock power reduction.

1) ICG: An ICG cell is an integrated clock gating cell which can enable or disable a clock signal [29], [30]. One or more flip-flops are connected to a single ICG cell and their outputs are controlled by the clock enable signal of the ICG cell. By removing unnecessary clock toggles, the dynamic power of the clock network can be reduced. Most recent IC designs use ICG cells for dynamic power reduction.
2) MBFF: Two or more single-bit flip-flops are merged to form a single MBFF. Fig. 1(a) shows a 2-bit MBFF formed by merging two single-bit flip-flops. Because the MBFF removes an inverter pair in a single-bit flip-flop, the cell area and pin capacitance can be reduced [17]–[19]. As more single-bit flip-flops are merged into a larger MBFF, more congestion in placement or routing is expected, and this would require effective measures to resolve congestion issues. It also requires the development of an MBFF cell library.
3) Register Banking: Register banking groups multiple flip-flops and forms a bank such as 2 × 2, 4 × 4, 4 × 8, and so on. Fig. 1(b) illustrates a 2 × 4 register bank. This structure can considerably reduce the capacitance and length of a wire in a leaf-level clock network by placing


flip-flops in close proximity. The newly created register bank is considered as a macro block, and hence, flip-flops in the banks cannot be sized or moved individually once they are created. This may limit the freedom of design optimization.

Fig. 1. Examples of (a) MBFF and (b) register banking.

The methods described above help reduce clock power; however, they could cause serious challenges, which are discussed as follows.

1) There may be significant changes in the pin capacitance and the transition time of most flip-flops since some flip-flops need to be moved over a long distance and are merged into a single cell or a single macro block. This would introduce many problems related to timing and other constraints involving capacitance and transition.
2) Flip-flops in an MBFF or register bank cannot be separately sized or moved. This limits any further design optimization after merging or banking. For example, if the size of a flip-flop needs to be increased to meet setup timing requirements, all flip-flops in an MBFF also have to be increased in size. This can cause additional timing problems due to the increased load capacitance, with an unnecessary power increase.
3) Once the flip-flops are merged, the merged cells become much larger than a single-bit flip-flop. They can cause partial congestion problems at the points where they are placed. In a high-density design, the problem becomes more serious. In nanometer process technology, routing congestion is a critical factor with respect to physical implementation due to a small design area and high routing utilization.

As described above, flip-flop merging or clumping would cause serious problems. Hence, it is always preferred to reduce wire capacitance and wire length without flip-flop merging and clumping methods for clock network optimization.

III. PROBLEM FORMULATION

In this article, a new clock network optimization method which does not include flip-flop merging and clumping techniques is proposed. The proposed method optimizes the clock network by aligning the flip-flops in lines, which helps reduce wire capacitance and wire length. For the proposed flip-flop alignment, the key is to identify the best locations for flip-flop relocation. If the distance between a flip-flop and its connected cells after relocation is the same as or smaller than its original distance, additional constraint violations such as setup timing violation, maximum capacitance violation, and maximum transition violation are not generated. This investigation formulates the problem using Manhattan distances between a flip-flop and its driving cell/load cell and their timing information.

Fig. 2. Examples of movable region (R_M) of a flip-flop ff with (a) line and (b) tilted rectangular shape.

Fig. 2 shows examples with a flip-flop ff, a driving cell D, and a load cell L. Cell locations in the examples are based on the actual placements for intellectual property 1 (IP1). The basic placement algorithms of Place and Route (P&R) tools try to place cells with the shortest distances, as described in Fig. 2(a). However, if the shortest path between two cells causes any constraint violations like partial cell/channel congestion or design rule checking (DRC) violation, the tool selects an optimal position considering the constraint violations along with the timing slack information. Fig. 2(b) would be a similar case. The Manhattan distances between D and ff and between ff and L are denoted as M(D, ff) and M(ff, L), respectively. If flip-flops are moved to other locations, they may introduce constraint violations. To avoid violations, ff should satisfy the following requirements:

$$
\begin{aligned}
M_{\mathrm{new}}(D, ff) &\le M_{\mathrm{org}}(D, ff) \\
M_{\mathrm{new}}(ff, L) &\le M_{\mathrm{org}}(ff, L)
\end{aligned} \tag{2}
$$

where M_new(D, ff), M_new(ff, L) and M_org(D, ff), M_org(ff, L) are the distances after and before flip-flop relocation, respectively. Equation (2) identifies a movable region R_M where a flip-flop can be relocated without timing and other constraint violations. R_M can be a line or a tilted rectangular shape [31]. R_M for flip-flop ff is described in Fig. 2(a) and (b).

Once the R_M regions of the flip-flops are found, the flip-flops can be relocated within their corresponding R_M for alignment. However, if the movable regions are too narrow, the flip-flop alignment ratio (number of aligned flip-flops/number of total flip-flops) would be limited. To achieve a high alignment ratio, it is preferred to widen R_M. This article exploits the fact that positive timing slacks for flip-flop input and output paths can be used to widen R_M, which helps increase the flip-flop alignment ratio.


To determine the Manhattan distance from the timing information, the following equation is used, which translates the timing slack to a corresponding wire length using the Elmore delay model [32]:

$$l \le \frac{\sqrt{c_0^{2} R^{2} + r_0^{2} C^{2} + 2 r_0 c_0 t_{\max}} - c_0 R - r_0 C}{r_0 c_0} \tag{3}$$

where r_0 is a unit resistance, c_0 is a unit capacitance, R is a driver strength, C is a driving load, and t_max is a positive timing slack between a pin and its corresponding flip-flop. Equation (3) is the positive root of r_0 c_0 l^2 / 2 + (R c_0 + r_0 C) l + RC ≤ t_max, that is, the Elmore delay of a wire of length l with driver R and load C bounded by t_max. The Manhattan distance converted from the positive timing slack is denoted as M(slack, cell), and it gives a flip-flop a wider R_M than before. Hence, M_new(D, ff) and M_new(ff, L) can be updated by including M(slack, cell) as follows:

$$
\begin{aligned}
M_{\mathrm{new}}(D, ff) &\le M_{\mathrm{org}}(D, ff) + M(\mathrm{slack}, D) \\
M_{\mathrm{new}}(ff, L) &\le M_{\mathrm{org}}(ff, L) + M(\mathrm{slack}, L)
\end{aligned} \tag{4}
$$

Fig. 3. Example of extended movable region (R_M) by M(slack, cell).

The movable region newly defined by (4) has a wider range, as shown in Fig. 3. If a flip-flop is relocated within this overlapping region, no timing violation is introduced by the proposed flip-flop alignment algorithm. Even though constraints such as the capacitance and transition of flip-flops can change after alignment, they can be easily fixed by a later optimization process because the differences would be very small. This is discussed further in Section IV.

IV. PROPOSED FLIP-FLOP ALIGNMENT METHOD

Fig. 4 describes the proposed design flow, which is based on a conventional physical implementation flow. The proposed method is applied right after placement and only uses precalculated timing and placement information for flip-flop alignment. The proposed algorithm itself performs no timing update or timing optimization. The remaining steps, including placement legalization, are performed by a P&R tool without any additional steps. Therefore, the method does not include any additional timing optimizations and the run-time overhead is negligible. To perform a flip-flop alignment, the proposed method first creates virtual tiles on the entire design area, and they are used as the base locations where flip-flops can be placed for alignment (see Section IV-A). The proposed flip-flop alignment is performed at an ICG granularity. The movable regions (R_M) of flip-flops belonging to the same ICG are analyzed to find a common region for the flip-flop alignment. If all flip-flops are aligned in a single line, this may introduce long movements of some flip-flops. To minimize the movements, the proposed method divides flip-flops into several groups (see Section IV-B). Then, alignment regions (R_A) are defined considering all R_Ms of the flip-flops, and the most effective column (C_eff) among the columns in R_A is selected for each group (see Section IV-C). Flip-flops in each group are relocated and aligned on C_eff, and cell legalization and incremental optimization are performed to resolve possible issues such as cell overlap after alignment, timing, or DRC violations (see Section IV-D). Sections IV-A–IV-D describe each step of the proposed method in detail.

Fig. 4. Proposed design flow including flip-flop alignment.

A. Virtual Tile Creation

To perform a flip-flop alignment, the virtual tiles are first generated over the entire design area. Virtual tiles are used to guide flip-flop locations for relocation during alignment. The unit tile size can be selected considering the process technology and the flip-flops used in the design.

Although the height of the unit tile is determined by the height of the flip-flops, the width needs to be carefully selected because it determines the wasted area. Fig. 5 illustrates how much wasted area is created with respect to various unit tile widths. Fig. 5(a) shows an initial placement of 12 flip-flops belonging to one ICG cell. Assume that the width of the most often used flip-flop in the design is a μm. In Fig. 5(b), the width of a unit tile is chosen to be wide enough (1.5 × a μm) to include the largest flip-flop (ff C), and each flip-flop is placed within one unit tile. The vertical lines created by virtual tiles are the alignment columns where the flip-flops are aligned. In Fig. 5(b), all flip-flops are aligned in two alignment columns. However, the wasted area, highlighted in gray, is relatively large compared to that from the narrower unit width in Fig. 5(c). With the unit tile width a μm in Fig. 5(c), three alignment columns are available, while there are only two alignment columns in Fig. 5(b).
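The effect of the unit tile width on wasted area can be illustrated with a small calculation. The sketch below assumes that a flip-flop occupies a whole number of unit tiles, so the wasted width is the occupied tile width minus the flip-flop width. This model, the function names, and the mix of flip-flop widths are assumptions for illustration, not details given for Fig. 5.

```python
import math
from typing import Sequence

EPS = 1e-9  # guards against floating-point noise in the width ratio


def tiles_occupied(ff_width: float, tile_width: float) -> int:
    """Number of unit tiles a flip-flop covers (assumed: whole tiles only)."""
    return math.ceil(ff_width / tile_width - EPS)


def wasted_width(ff_width: float, tile_width: float) -> float:
    """Tile width reserved for the flip-flop but not covered by it."""
    return tiles_occupied(ff_width, tile_width) * tile_width - ff_width


def total_wasted(ff_widths: Sequence[float], tile_width: float) -> float:
    """Sum of the wasted widths over a set of flip-flops."""
    return sum(wasted_width(w, tile_width) for w in ff_widths)


if __name__ == "__main__":
    a = 1.0                          # width of the most often used flip-flop
    widths = [a] * 11 + [1.5 * a]    # hypothetical mix: eleven common, one large
    for tile in (1.5 * a, a):
        print(f"tile width {tile:.2f}a: wasted width {total_wasted(widths, tile):.2f}a")
```

Under this hypothetical mix, the wider tile wastes half a flip-flop width for every common flip-flop, while the narrower tile only wastes width under the single large flip-flop, which is the trade-off the comparison of Fig. 5(b) and Fig. 5(c) describes.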


During alignment, ff I has been moved to a lower tile with a small moving distance, and the two alignment columns are used for flip-flop alignment. The benefits of having a narrower unit tile width include reducing the wasted area and allowing more alignment columns, which can be used for other flip-flops belonging to other ICG cells. The unit tile width can be further reduced, and this would help the proposed method achieve less wasted area and provide more alignment chances with a finer alignment resolution. However, if a narrower width is wrongly chosen, this could cause problems. In Fig. 5(d), the width is selected as 0.75 × a μm. In this case, ff H, ff J, ff K, and ff L cannot be located on the second alignment column and, instead, they are placed on the third alignment column. This introduces wasted areas, which could negatively affect the alignment of other flip-flops. Because the flip-flop alignment is performed for one ICG at a time, the wasted area generated could impede the flip-flop alignment conducted later. This would limit the overall alignment ratio (number of aligned flip-flops/number of total flip-flops) and the clock network power optimization.

Fig. 5. Comparisons of wasted area of (a) original design by different unit tile widths of (b) 1.5 × a μm, (c) a μm, and (d) 0.75 × a μm.

Fig. 6 shows virtual tile generation for one of the IPs used for experiments. Virtual tiles are generated on the entire design area with a unit tile width of half of the most often used flip-flop width. However, if the area is occupied by pre-placed fixed cells such as hard macros, the virtual tiles are not created in that area, as shown in Fig. 6(a). In Fig. 6(b), the ICG cell is highlighted in red, while the green boxes indicate flip-flops connected to the ICG cell.

Fig. 6. Examples of virtual tile creation around (a) hard macro and (b) common placement area.

B. Flip-Flop Grouping

During the initial placement by a conventional physical implementation flow, flip-flops are normally positioned near ICG cells by considering design constraints such as setup timing or max transition time. This allows a clock network to have fewer clock buffers at the leaf-level clock network during clock network synthesis and helps the clock power optimization.

The proposed method further reduces clock power consumption by aligning the flip-flops. The proposed flip-flop alignment is performed at the ICG granularity. Because a single ICG cell drives a number of flip-flops, it is not preferred to line up all flip-flops in a single line. If they are aligned in a single line, some flip-flops would have to move over long distances. To avoid this movement, the proposed method divides flip-flops into several groups considering their current placement and timing slack. If possible, the flip-flops in the same group are relocated in a single line.

Flip-flop groups are found based on the movable region (R_M) discussed in Section III, and Algorithm 1 describes a grouping procedure. Let us consider the example in Fig. 7. It shows an initial placement for flip-flops which belong to the same ICG cell. For each flip-flop, ff A ∼ ff F, its R_M is highlighted with a gray box. Based on the R_Ms, the horizontal boundaries (h_f) within which flip-flops can be horizontally moved are found, and they are indicated by an arrow for each flip-flop. The leftmost and rightmost X-coordinates of h_f are denoted as h_f(ff)_min and h_f(ff)_max. The leftmost and rightmost X-coordinates for ff A are given as h_fA_min and h_fA_max.

For grouping, the h_fs of the flip-flops are compared. Overlapping h_fs create flip-flop groups, and the horizontal boundaries of the groups (denoted as h_g) can be found from h_f(ff)_min and h_f(ff)_max of the h_fs. The flip-flops under the same ICG cell are examined to decide whether they can be included in the preexisting groups or not. The algorithm examines h_f of the current flip-flop to determine whether it overlaps with the horizontal boundary of each group (h_g). If the h_f overlaps with one of the h_gs of the preexisting groups, the current flip-flop can be included in the group. h_g(group)_min and h_g(group)_max refer to the leftmost and rightmost X-coordinates of h_g. Once the current flip-flop is included in the group, the leftmost and rightmost X-coordinates of the group are updated if h_f(ff)_min of the flip-flop is larger than h_g(group)_min or h_f(ff)_max is smaller than h_g(group)_max, respectively. If more than one h_g overlaps with h_f of the current flip-flop, the group with more flip-flops is


selected to include the flip-flop. However, if the current flip-flop does not overlap with any h_g, a new group is generated because the current flip-flop cannot be aligned with other flip-flops in the preexisting groups. Once grouping is completed, one or more groups have been created for alignment. The example in Fig. 7 identifies two groups, one with ff A, ff B, ff D, and ff F and the other with ff C and ff E, as shown in Fig. 8. The flip-flops in the same group can be aligned in a single line. The detailed column selection is described in Section IV-C.

Algorithm 1 Algorithm for Flip-Flop Grouping

Fig. 7. Example of movable region (R_M).

Fig. 8. Example of alignment region (R_A).

C. Column Selection

The flip-flop grouping algorithm finds a set of groups, and their leftmost and rightmost X-coordinates can be found by intersecting the h_fs of the flip-flops in the group. The Y-coordinates of the groups are determined by overlapping the smallest and largest Y-coordinates of the R_Ms in the group. This creates an alignment region, R_A, which is composed of unit tiles. The unit tiles in R_A are the candidate locations where the flip-flops can move for alignment. Once the R_As are identified, the alignment columns can be determined based on the unit tiles. Even though the columns are found, some of them may not be available for alignment. If macro blocks are placed in R_A, or some unit tiles have been used for earlier flip-flop alignments (note that one or more unit tiles can be occupied by a flip-flop depending on the unit tile size), there may not be enough unit tiles. This could make the alignment column unavailable for alignment.

To maximize the usefulness of the flip-flop alignment, the most effective column needs to be selected. The proposed method considers the flip-flop moving distances to minimize the design change after flip-flop alignment. If flip-flops are relocated far from their original locations, this may cause a number of concerns. The relocation may introduce partial congestion problems when flip-flops are moved. The other issue could be significant changes in constraints such as transition time or capacitance. Hence, even though the timing constraint is guaranteed in R_A, the moving distance should be minimized. To identify the most efficient alignment column, the proposed method examines the total distance when the flip-flops in R_A are relocated to one column. For each alignment column in R_A, the shortest moving distance for each flip-flop (d_min) is measured and summed to find the total distance (d_total). The alignment column with the smallest d_total is selected as the most effective column (C_eff) for alignment. This column causes fewer design changes than any other candidate column. If multiple columns give the same d_total, the alignment column with the higher alignment ratio is selected because it lines up more flip-flops in a single line, thus helping clock network optimization. In some cases, because the number of virtual tiles in a column is smaller than the number of all flip-flops for relocation, a column may not be available for alignment. In this case, the proposed method first

Authorized licensed use limited to: NXP Semiconductors. Downloaded on March 02,2020 at 08:41:53 UTC from IEEE Xplore. Restrictions apply.
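The column-selection rule of Section IV-C can be illustrated with a short sketch. This is not the paper's Algorithm 2 (its listing did not survive extraction); it is a minimal reading of the prose, and several things are assumptions made only for illustration: the `Column` record, the precomputed `align_ratio` field, the use of Manhattan distance for d_min, and the simplification that two flip-flops may pick the same tile. Columns with too few tiles are skipped here, so the secondary-column fallback described in the text is not covered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) location of a flip-flop


@dataclass
class Column:
    x: float               # X-coordinate of the alignment column
    tile_ys: List[float]   # Y-coordinates of its free virtual tiles
    align_ratio: float     # hypothetical, precomputed alignment ratio


def d_min(ff: Point, col: Column) -> float:
    """Shortest Manhattan move of one flip-flop onto any tile of col (assumed metric)."""
    return abs(ff[0] - col.x) + min(abs(ff[1] - y) for y in col.tile_ys)


def select_column(ffs: List[Point],
                  columns: List[Column]) -> Optional[Tuple[Column, float]]:
    """Return (C_eff, d_total), or None if no column has enough tiles."""
    best = None
    for col in columns:
        if len(col.tile_ys) < len(ffs):
            continue  # too few tiles: the fallback case is not sketched here
        d_total = sum(d_min(ff, col) for ff in ffs)
        key = (d_total, -col.align_ratio)  # smallest d_total, then highest ratio
        if best is None or key < best[0]:
            best = (key, col)
    if best is None:
        return None
    return best[1], best[0][0]
```

With two flip-flops at (0, 0) and (4, 2), columns at x = 1 and x = 3 both give d_total = 4, so the tie is resolved in favor of whichever column carries the higher alignment ratio.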

Algorithm 2 Algorithm for Effective Column Selection

finds an alignment column (C_eff) with the most virtual tiles, and C_eff determines how many flip-flops can be aligned on it. The flip-flops yielding the smallest d_total are aligned on C_eff. The other flip-flops, which are not aligned on C_eff, are relocated on either side of C_eff to a column with a smaller d_total, and the corresponding column becomes a secondary alignment column in R_A. Algorithm 2 explains the procedure for finding the most effective column in R_A. Once C_eff is selected, the optimal locations of the flip-flops are determined considering the minimum moving distances.

D. FF Alignment and Placement Legalization

The flip-flop alignment is performed at the ICG granularity. After finishing each alignment, the algorithm generates commands for flip-flop relocations with a corresponding P&R tool command and updates a script for flip-flop relocations. The used tiles (placement blockages) for the current alignment are removed to avoid their usage for other alignments. Once every alignment is finished, the final script, which includes the relocation commands for the flip-flops, is applied and the flip-flops change their actual physical locations all at once. Then all the remaining virtual tiles are removed for the placement legalization that follows. During flip-flop relocations, the attributes for location fixing of ICG cells are set to 1 in a P&R tool. Therefore, there is no change in ICG cell locations.

Once flip-flops change their locations, there is a possibility that the relocated flip-flops overlap with existing combinational cells that were originally placed on the corresponding virtual tiles. To resolve the cell overlap concern, the proposed method performs a two-step placement legalization. First, it fixes the placement of all flip-flops and then the overlapping combinational cells are relocated to empty locations where the flip-flops were originally placed. Next, it legalizes all cells together, including flip-flops, to remove the remaining cell overlaps such as an overlap between flip-flops and the prefixed cells. Because each flip-flop is moved over the smallest distance (d_min), an overlapping combinational cell is relocated within a small distance. Thus, this does not introduce much change in design constraints including timing. In addition, because this is not the final stage of the physical implementation flow, the few newly generated DRC violations can be fixed during the remaining design steps such as clock network synthesis or routing.

V. EXPERIMENTAL RESULTS

This section evaluates the proposed clock network optimization. Five industrial IPs (IP1 to IP5) in mobile SoCs are evaluated, which are implemented with state-of-the-art 14- and 10-nm process technologies. Both CPU and GPU IPs, which have extremely high operating frequency and cell density, are also considered (IP1 and IP3). The gate-level netlists are generated by Synopsys Design Compiler, and the proposed method is integrated with Synopsys IC Compiler II. For routing layers, six metal layers and seven metal layers are used for the 14- and 10-nm designs, respectively. The numbers of layers for signal routing are 10 and 11 for each process technology. Each IP is implemented with multicorner multimode (MCMM) scenarios. The numbers of scenarios considered are 8, 6, 10, and 8 for IP1, IP2, IP3, and IP4, respectively. Because all IPs are evaluated under complete physical implementation flows including physical verifications such as signoff DRC check or electromigration analysis, the experimental results are provided with high confidence.

A. Flip-Flop Alignment and Routing Result

To illustrate the flip-flop alignment results by the proposed method, two ICG cells and the flip-flops belonging to them are chosen from IP1. In Fig. 9, the cell colored in red is the ICG cell and the flip-flops connected to the ICG cell (referred to as fan-out flip-flops) are highlighted in green. The figures are captured after finishing all physical implementation steps including signal routing and post-routing optimization. The clock net is highlighted in yellow. The left figures in Fig. 9(a) and (b) are the results obtained from conventional design flows without the proposed method. The results from the proposed alignment method are presented in the right figures in Fig. 9(a) and (b). As can be seen, all flip-flops are aligned in three alignment columns. A flip-flop normally has a horizontally long pin shape

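For reading the percentages in Tables I and II, the following definitions are a plausible interpretation rather than something the text spells out as a formula: each improvement is taken as a relative reduction against the original flow, so a negative value (as reported for the MBFF wire capacitance of IP1) indicates an increase. The alignment ratio follows the definition given in Section V-A. The symbols x_orig, x_new, N_aligned, and N_total are introduced here only for illustration.

```latex
\text{reduction}(x) = \frac{x_{\text{orig}} - x_{\text{new}}}{x_{\text{orig}}} \times 100\%,
\qquad
\text{alignment ratio} = \frac{N_{\text{aligned}}}{N_{\text{total}}} \times 100\%
```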

TABLE I
CLOCK NETWORK IMPROVEMENT ON LEAF LEVEL FOR ICG1 AND ICG2

Fig. 9. Flip-flop placement after proposed alignment method for (a) ICG1 and (b) ICG2.

Fig. 10. Flip-flop alignment result by the proposed alignment algorithm.

for a clock pin to have an advantage in clock wiring. This helps the clock pins to be connected to clock wires efficiently by allowing a contact to be placed anywhere on the clock pin. The proposed alignment method takes advantage of this clock pin shape. The flip-flop alignment causes the clock net to consist of straight lines (yellow lines in the figure), and this helps reduce the capacitance of the clock network through reduced clock wire lengths and fewer vias. Although flip-flops slightly change their locations during clock tree synthesis (CTS) or optimization after CTS, clock wires can keep the straight lines because of the horizontally long pin shape. It should be noted that the aligned flip-flops in Fig. 9 cannot always be replaced with a single MBFF because timing-free regions for high-performance (high operating frequency) designs like CPU or GPU are generally not large enough for MBFF merging. Moreover, since actual industrial designs try to use as many high-V_th cells as possible for leakage power reduction, timing-free regions become even smaller. In Fig. 9(a), there are 14 flip-flops clustered and they could possibly be replaced by a 14-bit MBFF. However, it is very difficult to use MBFFs larger than 3-bit due to timing constraints, congestion, and DRC problems [17], [19]. The problem becomes more critical as the design has a higher operating frequency or a higher cell density. The 14 flip-flops vertically placed in Fig. 9(a) could be similar to a 1 × 14 register bank. Unlike flip-flops inside a register bank, which cannot be sized or moved individually, the proposed method allows individual sizing or movement if needed and achieves better optimization of timing and of constraints such as transition, capacitance, or DRC.

Fig. 10 shows only some of the flip-flops in IP1 aligned by the proposed method, because the IP1 design is too large to illustrate all flip-flops. As can be seen, the proposed method aligns most of the flip-flops in the design. The average alignment ratio (number of aligned flip-flops/number of total flip-flops) for IP1 is 91.0%.

B. Clock Wire Capacitance and Length Analysis

To evaluate the clock network optimization by the proposed method, the wire length, capacitance, and via count are given in Table I. Table I shows their values for ICG1 and ICG2 from Fig. 9. l_wire denotes the total wire length between the ICG cell and its fan-out flip-flops. C_wire and C_pin are the sums of wire capacitances and pin capacitances, respectively. V_total reports the number of vias used for leaf-level clock wiring for each ICG cell. For ICG1 and ICG2, l_wire is decreased by approximately 26.04% and 10.88%, respectively, after alignment by the proposed method. A C_wire reduction of 26.05% and 10.63% is achieved by the proposed method, which is one of the most important factors for switching power reduction. In addition, the number of vias is reduced by approximately 24.44% and 5.13% for ICG1 and ICG2. Because via resistance increases significantly as process technology advances, the use of fewer vias is important for power and reliability optimization [33], [34]. C_pin is kept the same because there are the same number of flip-flops before and after flip-flop alignment.

To confirm how far the proposed method relocates flip-flops from their original locations, Fig. 11(a) illustrates the d_center values of 50 randomly chosen flip-flops in IP1. d_center (in μm) denotes the distance between a flip-flop and the center of the fan-out flip-flops corresponding to the same ICG cell. The dotted line is the original d_center by the conventional design flow and d_center by the proposed method is represented with the black line. To evaluate the effectiveness in flip-flop relocation

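The d_center metric discussed around Fig. 11 can be made concrete with a small sketch. The text specifies only that d_center is the distance from a flip-flop to the center of the fan-out flip-flops of the same ICG cell, and that values after relocation are measured from the original center. Taking the center as the coordinate centroid, using Euclidean distance, and the function names below are all assumptions for illustration.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in micrometers


def centroid(points: List[Point]) -> Point:
    """Assumed definition of the 'center' of a fan-out flip-flop group."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def delta_d_center(before: List[Point], after: List[Point]) -> List[float]:
    """Per-flip-flop change in d_center, measured from the ORIGINAL center.

    A negative value means the flip-flop moved toward the center.
    """
    cx, cy = centroid(before)
    return [
        hypot(a[0] - cx, a[1] - cy) - hypot(b[0] - cx, b[1] - cy)
        for b, a in zip(before, after)
    ]
```

For two flip-flops originally at (0, 0) and (4, 0), moving the first to (1, 0) gives changes of -1.0 and 0.0: the first flip-flop moved 1 μm toward the original center at (2, 0) and the second did not move.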

Fig. 11. Distance changes between each flip-flop and the center of flip-flops belonging to the same ICG (Δd_center) after alignment, for (a) 50 flip-flops and (b) all flip-flops in IP1.

Fig. 12. Distribution of moving distance by the proposed alignment method.

by the proposed method, d_center by a state-of-the-art MBFF method [16] is drawn together with a gray line. It should be noted that the d_center values for the MBFF method and the proposed method are measured from the original center of the fan-out flip-flops. As expected, some flip-flops show a large d_center difference between the original method and the MBFF method because they had to move over a long distance due to their large cell size. On the other hand, only small Δd_center (i.e., d_center after alignment − d_center before alignment) differences are found before and after the alignment because the proposed method relocates flip-flops over a short distance. The Δd_center values are measured for all flip-flops in IP1 and plotted in Fig. 11(b). The X-axis is Δd_center and the number of flip-flops with the same Δd_center is denoted on the Y-axis. A negative Δd_center means that the flip-flop is moved toward the center of the flip-flops by the proposed alignment, and vice versa. It is important to note that most of the Δd_center values are very small. Approximately 88% of the flip-flops are found within a Δd_center of 2.5 μm. This clearly shows that the proposed alignment method relocates most flip-flops over a short distance, given that 99% of the flip-flops used in IP1 have a width of 1.5–2.5 μm.

It is important to evaluate the routing and placement overhead of the proposed algorithm. If flip-flops are moved a long distance from their original locations, this may introduce critical problems such as cell overlap or routing congestion after alignment. For designs with a high cell or routing density, the problem would become more serious. Fig. 12 shows the actual moving distances of flip-flops from their original locations, given as absolute values. The X-axis is the moving distance and the Y-axis gives the number of flip-flops that have the same moving distance. Similar to Δd_center, approximately 91% of the total flip-flops in the design have a moving distance of less than 5 μm. Some flip-flops are moved over a distance of more than 5 μm, and this happens due to the placement legalization or timing optimization after alignment. However, unlike other algorithms based on cell merging or cell clumping, the timing or constraint violations caused by long-distance movement can be easily fixed because the flip-flops can be individually sized or repositioned during optimization. In addition, because most flip-flops move over a short distance, partial congestion is not a critical problem.

C. Clock Network Improvement

This section outlines the overall clock network improvement by the proposed alignment method and compares the results with conventional approaches. Table II provides clock network improvements for five industrial IPs. IP1/IP2/IP5 and IP3/IP4 are implemented with 14- and 10-nm process technology, respectively. To compare the results with conventional approaches, an original design flow and a design with the state-of-the-art MBFF method [16] are considered. Register banking is not considered for evaluation because huge timing violations and DRC violations are generated due to the high operating frequency and cell density of the CPU and GPU IPs. For MBFF, all MBFFs under 4-bit are considered for the best performance. As explained in Section III, capacitance is a dominant factor in the dynamic power consumption of a clock network. The third and fourth columns in Table II give the wire capacitance

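As a reminder of why wire capacitance matters so much for clock power, the conventional first-order model of switching power is shown below. This is the textbook expression and is not quoted from Section III; α (switching activity), V_DD (supply voltage), and f (clock frequency) are the usual symbols and are assumptions here, and only the leaf-level capacitances C_wire and C_pin defined for Table I are included.

```latex
P_{\text{sw}} \approx \alpha \,\bigl(C_{\text{wire}} + C_{\text{pin}}\bigr)\, V_{DD}^{2}\, f
\quad\Longrightarrow\quad
\frac{\Delta P_{\text{sw}}}{P_{\text{sw}}} \approx \frac{\Delta C_{\text{wire}}}{C_{\text{wire}} + C_{\text{pin}}}
\quad \text{when } C_{\text{pin}},\ \alpha,\ V_{DD},\ f \text{ are unchanged.}
```

Under this model, a flow that keeps the number of flip-flops (and hence C_pin) unchanged can lower switching power only through C_wire, which is consistent with the emphasis on wire capacitance in the results.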

TABLE II
CLOCK NETWORK IMPROVEMENT FOR FOUR DIFFERENT IPS

Fig. 13. Distribution of wire capacitance change (ΔC_wire) by the proposed alignment method.

Fig. 14. Comparison of clock power.

and pin capacitance values, respectively. The smallest wire capacitance value is obtained with the proposed method because the proposed alignment algorithm aligns the flip-flops. Fig. 13 illustrates the wire capacitance changes (ΔC_wire) in the clock leaf nets for all ICG cells in IP1. A negative value means that the wire capacitance has decreased after alignment. Fig. 13 shows that although C_wire increases in approximately 10% of the cases, most of the increases are small and less than 1 pF. For IP1, the total wire capacitance reduction by the proposed method is 22.66%, while the MBFF method has −34.16% (showing an increased wire capacitance). The reason why the MBFF method results in a higher wire capacitance than the original method is that unexpectedly long clock routes occur, because IP1 is a CPU design with an extremely high operating frequency of 2100 MHz and a cell utilization of over 90%. With tight design constraints (timing, utilization, and so on), it is difficult for MBFF methods to achieve a high MBFF merging ratio as

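The average clock power reduction quoted in Section V-C can be cross-checked against the five per-IP figures given in the same section. The short script below is only an arithmetic check; that the reported average is an unweighted mean is an inference from the numbers agreeing, not something the text states.

```python
# Per-IP clock power reductions (%) for IP1..IP5, as quoted in Section V-C.
clock_power_reduction = [15.18, 10.48, 17.11, 7.67, 20.13]


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


avg = mean(clock_power_reduction)
print(f"average clock power reduction: {avg:.1f}%")  # prints 14.1%
```

The result matches the 14.1% average reported for the proposed method relative to the original flow.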

TABLE III
OVERALL DESIGN IMPROVEMENT FOR FOUR DIFFERENT IPS

reported in Table II. The difficulty of placing MBFFs, caused by their large cell size, limits the wire capacitance reduction. The other IPs also show significant wire capacitance reductions of 14.09%, 16.56%, 20.03%, and 23.20% for IP2, IP3, IP4, and IP5, respectively, with the proposed method. Similar pin capacitance values are found because the original flow and the proposed method do not change the number of flip-flops. However, MBFF merges multiple single-bit flip-flops into a single flip-flop and this helps reduce the pin capacitance. The number of vias on the clock network, shown in the sixth column, is also reduced since the proposed method relocates the flip-flops onto a single line where possible. It should be noted that the area of the clock cells is smaller than in the original and MBFF cases. This is because the clock network can use smaller cells than the conventional methods, owing to the reduced resistance and capacitance of the clock nets. The eighth column in Table II shows the merging ratios for the MBFF method and the alignment ratios for the proposed method. Owing to the tight timing constraints of the industrial IPs, the average MBFF merging ratio is 49.46%. On the other hand, the average alignment ratio of the proposed method is 95.68% due to the simple and efficient relocation algorithm.

The last three columns in Table II show the internal, switching, and total power for each clock network. The leakage power is not shown in the table since it is too small (under 0.1% of total clock power). For all IPs, the proposed method reduces clock power significantly due to the reduction of wire capacitance and wire length. The reduction ratios compared to the original designs are 15.18%, 10.48%, 17.11%, 7.67%, and 20.13% for IP1, IP2, IP3, IP4, and IP5, respectively. They are much larger than those of the MBFF method. Fig. 14 depicts the average reduction ratio of clock power for the MBFF and the proposed methods with the original design as the baseline. The reduction ratios by the proposed method are 14.1% and 3.5% compared to the original method and the MBFF method, respectively, while MBFF achieves a 10.9% clock power reduction compared to the original method. The average total power reduction is 9.0% with a maximum of 15.8%. The operating frequency of high-performance IPs like CPUs is limited by their high power consumption, and the clock power accounts for the largest portion of the total power. Considering this fact, a 14.1% clock power reduction is a significant achievement in real industrial designs. In addition, the total power reduction by the proposed method would be higher when the operating frequency increases.

D. Overall Design Analysis

Timing analysis for the MBFF and proposed methods is shown in Table III. Both worst-negative-slack (WNS) and total-negative-slack (TNS) for setup timing are improved compared to the conventional methods after applying the proposed method. This is because the proposed method helps flip-flops have shorter clock latencies than in the original case due to the reduction in wire length and wire capacitance. This allows the clock latency difference between the launch flip-flop and the capture flip-flop to become smaller because the shortened clock latencies are less affected by the clock uncertainty margin and the on-chip-variation (OCV) margin. The shortened clock latencies also allow the useful skew technique, which varies clock latencies to fix setup timing violations, to be applied more aggressively. For this reason, the proposed method is more effective at optimizing timing and power than the original method. There is no additional overhead on hold violations from the proposed method because the proposed flip-flop alignment is applied before CTS.

Total cell area is also reported in Table III with a low-voltage-threshold (LVT) cell area and a regular-voltage-threshold (RVT) cell area to check the area overhead. In most


cases (IP1, IP2, IP3, and IP5), the LVT area is decreased after applying the proposed method. This is because some LVT cells can be replaced with RVT cells during power optimization due to the clock network quality improvement by the proposed method. The reason why both the LVT cell area and the RVT cell area are slightly increased after applying the proposed method in IP4 is that the proposed method fixed more setup timing violations. In Table III, both WNS and TNS for setup timing of the proposed method are much smaller than those of the original case. In this case, both the LVT cell area and the RVT cell area could increase. This results in less leakage power consumption compared to the original and MBFF methods. The reason for the slight increase in the RVT cell area in IP1 and IP4 is that some LVT cells are changed to RVT cells for power optimization. The last column gives the number of DRC violations. Even though there is a small increase in DRC violations in IP1, it is not critical because the number of remaining DRC violations is negligibly small. IP3 and IP4 show far fewer DRC violations after applying the proposed method. Overall, there is no degradation in timing or physical constraints from the proposed algorithm while there is a significant clock network improvement.

VI. CONCLUSION

In this article, we proposed a novel clock network optimization method to minimize the dynamic power of the clock network. It creates virtual tiles and finds the most effective columns for alignment. Then, flip-flops are relocated to the virtual tiles considering the minimum moving distances. This allows clock network synthesis to create a clock network that is much simpler and more effective by using straight nets and fewer vias. The wire capacitance and wire length are significantly reduced without any degradation in timing or other physical design constraints. Since flip-flops are relocated within very short distances and they are not merged into MBFFs or register banks, the proposed method is more effective than the MBFF or register banking methods in post optimizations after clock network synthesis. In particular, the proposed method can be applied effectively to designs with a high operating frequency or a high cell density. Finally, the MBFF optimization method can be integrated with the proposed alignment method for further improvement of clock network power.

REFERENCES

[1] A. Kapoor et al., “Digital systems power management for high performance mixed signal platforms,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 4, pp. 961–975, Apr. 2014.
[2] D. Duarte, N. Vijaykrishnan, and M. Irwin, “A clock power model to evaluate impact of architectural and technology optimizations—A summary,” IEEE Circuits Syst. Mag., vol. 3, no. 3, pp. 36–39, Jul. 2003.
[3] A. Vittal and M. Marek-Sadowska, “Low-power buffered clock tree design,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 16, no. 9, pp. 965–975, Sep. 1997.
[4] V. Sharma, “Minimum current consumption transition time optimization methodology for low power CTS,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2015, pp. 412–416.
[5] C. Deng, Y. Cai, and Q. Zhou, “Fast synthesis of low power clock trees based on register clustering,” in Proc. 16th Int. Symp. Quality Electron. Design, Mar. 2015, pp. 303–309.
[6] A. Farshidi, L. Rakai, and L. Behjat, “An efficient optimal clock network buffer sizing with slew consideration,” in Proc. IEEE 30th Can. Conf. Electr. Comput. Eng. (CCECE), Apr. 2017, pp. 1–4.
[7] A. Rajaram and D. Z. Pan, “MeshWorks: A comprehensive framework for optimized clock mesh network synthesis,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 12, pp. 1945–1958, Dec. 2010.
[8] G. Wilke and R. Reis, “A new clock mesh buffer sizing methodology for skew and power reduction,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Apr. 2008, pp. 227–232.
[9] M. R. Guthaus, G. Wilke, and R. Reis, “Non-uniform clock mesh optimization with linear programming buffer insertion,” in Proc. Design Autom. Conf., Jun. 2010, pp. 74–79.
[10] W. Liu, G. Chen, Y. Wang, and H. Yang, “Modeling and optimization of low power resonant clock mesh,” in Proc. 20th Asia South Pacific Design Autom. Conf., Jan. 2015, pp. 478–483.
[11] H. Chou, H. Yu, and S. Chang, “Useful-skew clock optimization for multi-power mode designs,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2011, pp. 647–650.
[12] S. Roy, P. M. Mattheakis, L. Masse-Navette, and D. Z. Pan, “Clock tree resynthesis for multi-corner multi-mode timing closure,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 34, no. 4, pp. 589–602, Apr. 2015.
[13] R. Shandilya and R. K. Sharma, “Low power positive-edge triggered d-type flip-flop,” in Proc. Int. Conf. Trends Electron. Informat. (ICEI), May 2017, pp. 1018–1023.
[14] P. Zhao, T. K. Darwish, and M. A. Bayoumi, “High-performance and low-power conditional discharge flip-flop,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 477–484, May 2004.
[15] A. G. M. Strollo, E. Napoli, and D. D. Caro, “New clock-gating techniques for low-power flip-flops,” in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Jul. 2000, pp. 114–119.
[16] T. Lee, D. Z. Pan, and J.-S. Yang, “Clock network optimization with multibit flip-flop generation considering multicorner multimode timing constraint,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 1, pp. 245–256, Jan. 2018.
[17] Y. Shyu, J. Lin, C. Huang, C. Lin, Y. Lin, and S. Chang, “Effective and efficient approach for power reduction by using multi-bit flip-flops,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 4, pp. 624–635, Apr. 2013.
[18] Z. Chen and J. Yan, “Utilization of multi-bit flip-flops for clock power reduction,” in Proc. 19th IEEE Int. Conf. Electron., Circuits, Syst. (ICECS), Dec. 2012, pp. 677–680.
[19] M. P.-H. Lin, C.-C. Hsu, and Y.-T. Chang, “Post-placement power optimization with multi-bit flip-flops,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 12, pp. 1870–1882, Dec. 2011.
[20] S.-H. Wang, Y.-Y. Liang, T.-Y. Kuo, and W.-K. Mak, “Power-driven flip-flop merging and relocation,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 31, no. 2, pp. 180–191, Feb. 2012.
[21] C. Hsu, Y. Chen, and M. P. Lin, “In-placement clock-tree aware multi-bit flip-flop generation for power optimization,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2013, pp. 592–598.
[22] W. Shen, Y. Cai, X. Hong, and J. Hu, “Activity-aware registers placement for low power gated clock tree construction,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Mar. 2007, pp. 383–388.
[23] W. Hou, D. Liu, and P. Ho, “Automatic register banking for low-power clock trees,” in Proc. 10th Int. Symp. Quality Electron. Design, Mar. 2009, pp. 647–652.
[24] A. Tang and N. K. Jha, “GenFin: Genetic algorithm-based multiobjective statistical logic circuit optimization using incremental statistical analysis,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 3, pp. 1126–1139, Mar. 2016.
[25] D. Liu and C. Svensson, “Power consumption estimation in CMOS VLSI chips,” IEEE J. Solid-State Circuits, vol. 29, no. 6, pp. 663–670, Jun. 1994.
[26] A. Bonetti, N. Preyss, A. Teman, and A. Burg, “Automated integration of dual-edge clocking for low-power operation in nanometer nodes,” ACM Trans. Design Autom. Electron. Syst., vol. 22, no. 4, pp. 62:1–62:20, May 2017, doi: 10.1145/3054744.
[27] A. Farshidi, L. Behjat, L. Rakai, and D. Westwick, “A multiobjective cooptimization of buffer and wire sizes in high-performance clock trees,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 64, no. 4, pp. 412–416, Apr. 2017.
[28] S. Pullela, N. Menezes, and L. T. Pillage, “Low power IC clock tree design,” in Proc. IEEE Custom Integr. Circuits Conf., May 1995, pp. 263–266.
[29] Q. Wu, M. Pedram, and X. Wu, “Clock-gating and its application to low power design of sequential circuits,” IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 47, no. 3, pp. 415–420, Mar. 2000.


[30] G. E. Tellez, A. Farrahi, and M. Sarrafzadeh, “Activity-driven clock design for low power circuits,” in Proc. IEEE Int. Conf. Comput.-Aided Design (ICCAD), Nov. 1995, pp. 62–65.
[31] J. Yan and Z. Chen, “Construction of constrained multi-bit flip-flops for clock power reduction,” in Proc. Int. Conf. Green Circuits Syst., Jun. 2010, pp. 675–678.
[32] Z. Chen and J. Yan, “Routability-driven flip-flop merging process for clock power reduction,” in Proc. IEEE Int. Conf. Comput. Design, Oct. 2010, pp. 203–208.
[33] T. Lee and T. Wang, “Congestion-constrained layer assignment for via minimization in global routing,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 9, pp. 1643–1656, Sep. 2008.
[34] W.-H. Liu and Y.-L. Li, “Negotiation-based layer assignment for via count and via overflow minimization,” in Proc. 16th Asia South Pacific Design Autom. Conf. (ASPDAC). Piscataway, NJ, USA: IEEE Press, 2011, pp. 539–544. [Online]. Available: http://dl.acm.org/citation.cfm?id=1950815.1950924

David Z. Pan (Fellow, IEEE) received the B.S. degree from Peking University, Beijing, China, in 1992, and the M.S. and Ph.D. degrees from the University of California at Los Angeles (UCLA), Los Angeles, CA, USA, in 1998 and 2000, respectively.
From 2000 to 2003, he was a Research Staff Member with IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. He is currently an Engineering Foundation Professor with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA. He has published over 350 journal articles and refereed conference articles and is the holder of eight U.S. patents. His research interests include cross-layer nanometer IC design for manufacturability, reliability, security, machine learning and hardware acceleration, and design/CAD for analog/mixed signal designs and emerging technologies.

Taehyun Kwon (Student Member, IEEE) received


the B.S. degree from Kyunghee University, Seoul,
South Korea, in 2005, and the M.S. degree from
Yonsei University, Seoul, in 2007, all in electrical
engineering. He is currently working toward the
Ph.D. degree in semiconductor and display engineer-
ing at Sungkyunkwan University, Seoul.
Since graduation, he has been working for the SoC
Design Team, Samsung Electronics, Suwon, South
Korea. He is a part of the team developing high-
performance CPU and GPU for Samsung mobile
SoC products. His research interests are high-performance processor archi-
tecture, low-power CPU and GPU implementations, high-bandwidth memory,
and reliable architectures for emerging memory technologies.

Muhammad Imran (Student Member, IEEE) received the B.S. degree in electrical engineering from the University of Engineering and Technology, Lahore, Pakistan, in 2012. He is currently working toward the Ph.D. degree in electrical and computer engineering at Sungkyunkwan University, Suwon, South Korea.
He joined Sungkyunkwan University, Suwon, in 2016. His current research interests include reliable architectures for the emerging memory technologies and computationally efficient implementation of deep learning algorithms.
Dr. Imran was a recipient of a scholarship from the Higher Education Commission of Pakistan for M.S. and Ph.D. studies.

Joon-Sung Yang (Senior Member, IEEE) received the B.S. degree from Yonsei University, Seoul, South Korea, in 2003, and the M.S. and Ph.D. degrees from The University of Texas at Austin, Austin, TX, USA, in 2007 and 2009, respectively, all in electrical and computer engineering.
After graduation, he worked at Intel Corporation, Austin, TX, USA, for four years. He was with Sungkyunkwan University. He is currently an Associate Professor with Yonsei University. His research interests are memory architectures and efficient deep learning architecture development.
Dr. Yang was a recipient of the Korea Science and Engineering Foundation (KOSEF) Scholarship in 2005. He received the Best Paper Award at the 2008 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems and at the 2016 IEEE International SoC Design Conference. He was nominated for the Best Paper Award at the 2013 IEEE VLSI Test Symposium.
