Tutorial On VLSI Partitioning
$\sum_{v_i \in V_j} s_i$ to be the size of a partition $V_j$. Each net $e_i$ is attached with a
2 S.-J. CHEN AND C.-K. CHENG
connectivity $c_i \in \mathbb{R}^+$. By default, $c_i = 1$. For a bus of multiple signal lines, we can represent the bus with a net $e_i$ of connectivity $c_i$ equal to the number of lines. We can also assign higher weights to some important nets; this enables us to keep the modules of these nets in the same partition.
In this tutorial, we will assume that circuits are represented as hypergraphs except when stated otherwise; hence, the terms circuit, netlist, and hypergraph are used interchangeably throughout the tutorial.
(ii) Partitions and Cuts The set of hyperedges connecting any two-way partition $(V_1, V_2)$ of two disjoint vertex sets $V_1$ and $V_2$ is denoted by a cut $E(V_1, V_2) = \{e_j \in E \mid 0 < |e_j \cap V_1| \text{ and } 0 < |e_j \cap V_2|\}$, i.e., $e_j \in E(V_1, V_2)$ if there exist some pins of $e_j$ in $V_1$ and some different pins of $e_j$ in $V_2$. We define $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$ to be the cut count of the partition $(V_1, V_2)$.
For a multiway partition $(V_1, V_2, \ldots, V_k)$ where $k > 2$, a cut $E(V_1, V_2, \ldots, V_k) = \{e_j \in E \mid \exists i \text{ s.t. } 0 < |e_j \cap V_i| < |e_j|\}$. For each subset $V_i$, we denote its external cut set $E(V_i) = \{e_j \in E \mid 0 < |e_j \cap V_i| < |e_j|\}$. We denote its adjacent net set to be the nets with some pin contained in $V_i$, i.e., $I(V_i) = \{e_i \mid |e_i \cap V_i| > 0\}$.
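The set definitions above translate almost literally into code. The following is a minimal sketch, assuming a hypergraph stored as a list of pin sets with a parallel list of connectivities; the function names and the small three-net example are illustrative assumptions, not part of the tutorial.

```python
from typing import FrozenSet, List, Sequence, Set

Net = FrozenSet[str]  # a net is the set of modules (pins) it connects


def external_cut_set(nets: Sequence[Net], block: Set[str]) -> List[int]:
    """E(V_i): indices of nets with 0 < |e ∩ V_i| < |e|."""
    return [j for j, e in enumerate(nets) if 0 < len(e & block) < len(e)]


def adjacent_net_set(nets: Sequence[Net], block: Set[str]) -> List[int]:
    """I(V_i): indices of nets with at least one pin in V_i."""
    return [j for j, e in enumerate(nets) if len(e & block) > 0]


def cut_set(nets: Sequence[Net], parts: Sequence[Set[str]]) -> List[int]:
    """E(V_1, ..., V_k): nets that are external to at least one block."""
    return [j for j, e in enumerate(nets)
            if any(0 < len(e & p) < len(e) for p in parts)]


def cut_count(nets: Sequence[Net], conn: Sequence[float],
              parts: Sequence[Set[str]]) -> float:
    """C(V_1, ..., V_k): sum of connectivities c_i over the cut set."""
    return sum(conn[j] for j in cut_set(nets, parts))


# Hypothetical example: three nets over modules a, b, c, d.
nets = [frozenset("ab"), frozenset("bcd"), frozenset("cd")]
conn = [1.0, 2.0, 1.0]
V1, V2 = {"a", "b"}, {"c", "d"}
print(cut_set(nets, [V1, V2]))          # [1]
print(cut_count(nets, conn, [V1, V2]))  # 2.0
```

For a two-way partition that covers every pin, the condition $0 < |e_j \cap V_1| < |e_j|$ is equivalent to having pins on both sides, so the same function serves both definitions.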
(iii) Replication Cuts and Directed Cuts For replication cuts and performance driven partitioning, the direction of the nets makes a difference in the process. We characterize the pins of each net into two types: source and sink. A directed net $e_i$ is denoted by $(a_i, b_i)$ where $a_i \subseteq V$ are the source pins of the net and $b_i \subseteq V$ are the sink pins of the net. We assume that $|a_i \cup b_i| \ge 2$, $|a_i| \ge 1$ and $|b_i| \ge 1$. Usually, each net has one source pin and multiple sink pins. However, some nets may have multiple sources which share the same interconnect line. Furthermore, one pin can be both a source pin and a sink pin of the same net. Therefore, $a_i$ and $b_i$ may have a nonempty intersection.
For two disjoint vertex sets $X$ and $Y$, we shall use $E(X \to Y)$ to denote the directed cut set from $X$ to $Y$. Net set $E(X \to Y)$ contains all the nets $e_i = (a_i, b_i)$ such that $X$ intersects the source pin set $a_i$ and $Y$ intersects the sink pin set $b_i$, i.e., $E(X \to Y) = \{e_i \mid e_i = (a_i, b_i),\ a_i \cap X \ne \emptyset,\ b_i \cap Y \ne \emptyset\}$. We use the function $C(X \to Y)$ to denote the total cut count of the nets in $E(X \to Y)$, i.e., $C(X \to Y) = \sum_{e_i \in E(X \to Y)} c_i$.
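As an illustration of the directed cut definition, here is a short sketch that represents each directed net as a pair of source and sink pin sets. The data layout, function names, and the three-net example are assumptions made for this sketch only.

```python
from typing import List, Sequence, Set, Tuple

DirectedNet = Tuple[Set[str], Set[str]]  # (source pins a_i, sink pins b_i)


def directed_cut(nets: Sequence[DirectedNet], X: Set[str], Y: Set[str]) -> List[int]:
    """E(X -> Y): nets with a source pin in X and a sink pin in Y."""
    return [i for i, (a, b) in enumerate(nets) if a & X and b & Y]


def directed_cut_count(nets: Sequence[DirectedNet], conn: Sequence[float],
                       X: Set[str], Y: Set[str]) -> float:
    """C(X -> Y): total connectivity of the nets in E(X -> Y)."""
    return sum(conn[i] for i in directed_cut(nets, X, Y))


# Hypothetical example with three directed nets.
nets = [({"u"}, {"v", "w"}), ({"v"}, {"u"}), ({"w"}, {"x"})]
conn = [1.0, 3.0, 2.0]
X, Y = {"u", "w"}, {"v", "x"}
print(directed_cut(nets, X, Y))  # [0, 2]
print(directed_cut(nets, Y, X))  # [1]
```

Note that the directed cut is not symmetric: $E(X \to Y)$ and $E(Y \to X)$ generally contain different nets.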
(iv) Performance Driven Partitioning In performance driven partitioning [106], modules are distinguished into two types: combinational elements and globally clocked registers. In the illustrations, we shall use circles to represent the combinational elements and rectangles to represent the registers (Fig. 13). Each module $v_i$ has an associated delay $d_i$.
A path of length $k$ from a module $v_i$ to a module $v_j$ is a sequence $(v_{i_0}, v_{i_1}, \ldots, v_{i_k})$ of modules such that $v_i = v_{i_0}$, $v_j = v_{i_k}$ and for each $l \in \{1, 2, \ldots, k\}$, modules $v_{i_{l-1}}$ and $v_{i_l}$ are a source pin and a sink pin of a net in $E$, respectively.
(v) Clustering Given a hypergraph H(V, E),
highly connected modules in V can be grouped
FIGURE 1 Hypergraph example.
together to form some single supermodules called clusters. After this process, a clustering $\Gamma = \{V_1, V_2, \ldots, V_k\}$ of the original hypergraph $H$ is obtained and a contracted (i.e., coarser) hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$ is induced, where $V_\Gamma = \{v_1^\Gamma, v_2^\Gamma, \ldots, v_k^\Gamma\}$. For every $e_j \in E$, the contracted net $e_j^\Gamma \in E_\Gamma$ if $|e_j^\Gamma| \ge 2$, where $e_j^\Gamma = \{v_i^\Gamma \mid e_j \cap V_i \ne \emptyset\}$, that is, $e_j^\Gamma$ spans the set of clusters containing modules of $e_j$. A contracted hypergraph, of course, can be used to induce another coarser contracted hypergraph based on the same clustering process. On the other hand, a contracted hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$ can be unclustered to return to a finer hypergraph $H(V, E)$.
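The contraction step can be sketched in a few lines. The sketch below assumes nets are pin sets and a clustering is a list of disjoint module sets; clusters are identified by their list index, and nets that fall entirely inside one cluster are dropped, as in the definition above. All names are illustrative.

```python
from typing import FrozenSet, List, Sequence, Set


def contract(nets: Sequence[FrozenSet[str]],
             clusters: Sequence[Set[str]]) -> List[FrozenSet[int]]:
    """Return the contracted nets, each given as a set of cluster indices."""
    cluster_of = {v: i for i, block in enumerate(clusters) for v in block}
    contracted = []
    for e in nets:
        spanned = frozenset(cluster_of[v] for v in e)
        if len(spanned) >= 2:  # nets internal to one cluster disappear
            contracted.append(spanned)
    return contracted


# Hypothetical example: five modules grouped into three clusters.
nets = [frozenset("ab"), frozenset("bcd"), frozenset("de")]
clusters = [{"a", "b"}, {"c", "d"}, {"e"}]
print(contract(nets, clusters))
```

Here the net on modules a and b vanishes, while the other two nets become two-pin nets between clusters 0 and 1 and between clusters 1 and 2. Applying `contract` again to the result, with a clustering of the cluster indices, would give the next coarser level.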
3. PROBLEM FORMULATIONS
In this section, we describe different formulations of the partitioning problems addressed in this tutorial. We will cover two-way partitioning, multiway partitioning, multiple level partitioning, partitioning with replication, and performance driven partitioning.
3.1. Two-way Partitioning or Bipartitioning
We consider several possible variations on the size constraints and cost functions in the formulation. Additionally, in certain formulations, we fix two modules $v_s$ and $v_t$ to be on the opposite sides of the cut as two seeds.
3.1.1. Min-cut Separating Two Modules $v_s$ and $v_t$
Given a hypergraph, we fix two modules denoted as $v_s$ and $v_t$ at two sides. A min-cut is a partition $(V_1, V_2)$, with $v_s \in V_1$ and $v_t \in V_2$, such that the cut count $C(V_1, V_2)$ is minimized, i.e.,
$$\min_{v_s \in V_1,\ v_t \in V_2} C(V_1, V_2) \qquad (1)$$
where $V_1$ and $V_2$ are disjoint and the union of the two sets is equal to $V$.
This partitioning is strongly related to a linear placement problem. In a linear placement, we have $|V|$ equally spaced slots on a straight line (Fig. 2). Modules $v_s$ and $v_t$ are fixed at the two extreme ends, i.e., $v_s$ on the first slot (left end) and $v_t$ on the last slot (right end). The goal is to assign all modules to distinct slots to minimize the total wire length. Let us use $x_i$ to denote the coordinate of module $v_i$ after it is assigned to the slot. The length of a net $e_i$ can be expressed as the difference of the maximum coordinate and the minimum coordinate of the modules in the net, i.e., $\max_{v_j \in e_i} x_j - \min_{v_k \in e_i} x_k$. The total wire length can be expressed as follows.
$$\sum_{e_i \in E} \left( \max_{v_j \in e_i} x_j - \min_{v_j \in e_i} x_j \right) \qquad (2)$$
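Expression (2) is easy to evaluate for a given slot assignment. The sketch below computes it and, for very small circuits only, searches all placements with $v_s$ and $v_t$ pinned to the two ends. The exhaustive search is an illustrative assumption for checking small cases, not an algorithm proposed in this tutorial.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Sequence, Tuple


def total_wire_length(nets: Sequence[FrozenSet[str]], x: Dict[str, int]) -> int:
    """Expression (2): sum over nets of (max coordinate - min coordinate)."""
    return sum(max(x[v] for v in e) - min(x[v] for v in e) for e in nets)


def best_linear_placement(nets: Sequence[FrozenSet[str]],
                          free_modules: Sequence[str],
                          vs: str, vt: str) -> Tuple[int, Dict[str, int]]:
    """Try every ordering of the free modules between vs (slot 0) and vt (last slot)."""
    last = len(free_modules) + 1
    best = None
    for order in permutations(free_modules):
        x = {vs: 0, vt: last}
        x.update({v: i + 1 for i, v in enumerate(order)})
        length = total_wire_length(nets, x)
        if best is None or length < best[0]:
            best = (length, x)
    return best


# Hypothetical chain s - a - b - t made of two-pin nets.
nets = [frozenset({"s", "a"}), frozenset({"a", "b"}), frozenset({"b", "t"})]
length, x = best_linear_placement(nets, ["a", "b"], "s", "t")
print(length, x)  # 3 {'s': 0, 't': 3, 'a': 1, 'b': 2}
```

The search is factorial in the number of free modules, so it is only useful for sanity checks on toy examples.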
The relation between partitioning and place-
ment can be derived under the assumption that all
nets are two pin nets [50].
THEOREM 3.1 Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be a min-cut partition separating modules $v_s$ and $v_t$. Let $v_s$ and $v_t$ be the two modules located at the two extreme ends of a linear placement. Then, there exists an optimal linear placement solution such that all modules in $V_2$ are on the slots to the right of all modules in $V_1$ (Fig. 2).
Thus, we can use the min-cut to partition a linear
FIGURE 2 Suppose partition $(V_1, V_2)$ is a min-cut separating modules $v_s$ and $v_t$. There exists an optimal linear placement in which the modules in $V_2$ are at the right side of the modules in $V_1$.
placement into two smaller problems and still maintain optimality. Conceptually, we can conceive that modules in $V_1$ or $V_2$ have stronger internal connection within the set than mutual connection to the other set. Thus, if the spans of modules in $V_1$ and in $V_2$ are mixed in a linear placement, we can slide all modules in $V_1$ to the left and all modules in $V_2$ to the right to reduce the total wire length. In fact, this is the procedure used to prove the theorem.
The min-cut with no size constraints can be found in polynomial time using classical maximum flow techniques [1]. However, it may happen that the optimal solution separates only $v_s$ or $v_t$ from the rest of the modules, i.e., $V_1 = \{v_s\}$ or $V_2 = \{v_t\}$. This result is very likely to happen because most VLSI basic modules have very small degrees of connecting nets (e.g., the degree of a 3-input NAND gate is 4).
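To make the maximum flow connection concrete, the following sketch finds a min-cut between $v_s$ and $v_t$ for a graph of two-pin nets using the Edmonds-Karp augmenting path method. The choice of Edmonds-Karp, the edge-list input format, and the example graph are assumptions for illustration; reference [1] may describe other maximum flow techniques, and hypergraph nets would first need one of the net models discussed later.

```python
from collections import defaultdict, deque
from typing import List, Set, Tuple


def min_cut(edges: List[Tuple[str, str, int]], s: str, t: str) -> Tuple[int, Set[str]]:
    """Return (cut count, V1) for an undirected graph given as (u, v, c) edges."""
    cap = defaultdict(lambda: defaultdict(int))  # residual capacities
    for u, v, c in edges:
        cap[u][v] += c
        cap[v][u] += c

    def reachable_from_s():
        """Breadth-first search over edges with positive residual capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_s()
        if t not in parent:
            # No augmenting path left: the reachable set is the s-side V1.
            return flow, set(parent)
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical four-module graph with weighted two-pin nets.
edges = [("s", "a", 3), ("s", "b", 2), ("a", "b", 1), ("a", "t", 2), ("b", "t", 3)]
print(min_cut(edges, "s", "t"))  # (5, {'s'})
```

In this small example the min-cut isolates $v_s$ alone, which is exactly the degenerate outcome described above.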
3.1.2. Minimum Cost Ratio Cut
The cost ratio cut formulation supplies a partition different from the min-cut that separates two fixed modules. Thus, if the min-cut cannot provide any nontrivial solution, we may adopt the cost ratio cut to perform another trial.
In cost ratio cut, we fix two modules $v_s$ and $v_t$ at two different sides. Our objective is to find a vertex set $A$ to minimize a cost ratio function:
$$\frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)} \qquad (3)$$
where vertex set $A$ does not contain $v_s$ and $v_t$. Vertex set $A$ is non-empty, i.e., $S(A) > 0$.
Cost ratio cut is also strongly related to a linear placement. Assuming that all nets are two pin nets, we can derive the following theorem [22]:
THEOREM 3.2 Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be an optimal cost ratio cut partition. There exists an optimal linear placement solution such that all modules in $A$ are on the slots to the left of all modules in $V - A - \{v_s\}$.
Conceptually, we can conceive that $C(A, V - A - \{v_s\})$ is the force to pull $A$ to the right and $C(A, \{v_s\})$ is the force to push $A$ to the left. The denominator $S(A)$ is the inertia of the set $A$. A set $A$ with the minimum cost ratio moves with the fastest acceleration toward the left end of the slots.
Example In Figure 3, the circuit contains six modules. The optimum cost ratio cut solution has $A = \{v_1, v_2, v_3\}$. The cost ratio value is
$$\frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)} = \frac{4 - 3}{3} = \frac{1}{3}. \qquad (4)$$
The cost ratio value of any other choice of set $A$ is larger than the value in expression (4).
FIGURE 3 A six module circuit to illustrate the cost ratio cut.
The cost ratio cut solution can be found in polynomial time for the special case of series-parallel graphs [22]. We are unaware of algorithms for general cases. Note that the solution may have $V - A - \{v_s\}$ equal to the set $\{v_t\}$. In such a case, the partitioning result is not useful for decomposing the circuit.
3.1.3. Min-cut with Size Constraints
For min-cut with size constraints, we have lower and upper bounds on the partition size, $S_l$ and $S_u$, where $0 < S_l \le S_u < S(V)$ and $S_l + S_u = S(V)$. The bipartitioning problem is to divide vertex set $V$ into two nonempty partitions $V_1$, $V_2$, where $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$, with the objective of minimizing the cut count $C(V_1, V_2)$, subject to the following size constraints:
$$S_l \le S(V_b) \le S_u \quad \text{for } b = 1, 2 \qquad (5)$$
The min-cut problem with size constraints is NP-complete [43]. However, because of the importance of the problem in many applications, many heuristic algorithms have been developed.
Random Partitioning We use a random partition estimation of min-cut with size constraints to demonstrate that the quality variation of partitioning results can be significant. Let us simplify the case by assigning the modules uniform size, i.e., $s_i = 1$ for all $v_i$ in $V$, and the nets uniform connectivity, i.e., $c_i = 1$ for all $e_i$ in $E$. Let us assume that the modules are partitioned into two sets $V_1$, $V_2$ with equal sizes: $S(V_1) = S(V_2)$. The partition is performed with an independent random process [10] so that each module has a 50% chance to go to either side. For a net $e_i$ of two pins, we can derive that net $e_i$ belongs to the cut set $E(V_1, V_2)$ with probability 0.5 (Fig. 4). Similarly, we can derive that for a net $e_i$ of $k$ pins ($k > 2$), the probability that net $e_i$ belongs to cut set $E(V_1, V_2)$ is $(2^k - 2)/2^k$, since only 2 of the $2^k$ equally likely pin assignments leave the net uncut. This probability is larger than 0.5 and approaches one as $k$ increases. In other words, the expected cut count $C(V_1, V_2)$ is equal to or larger than half the number of nets. For example, a circuit of one million modules usually has a number of nets of the same order, i.e., $|E| = O(|V|) = 1{,}000{,}000$. The expected cut count would be $C(V_1, V_2) \ge 500{,}000$. This number is much worse than the results we can achieve. In practice, the cut counts on circuits of a million modules are usually no more than several thousand [34, 36]. In other words, the probability that a net belongs to a cut set is small, below one percent for a circuit of one million gates.
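The cut probability of a $k$-pin net under this random model can be checked by direct enumeration. The sketch below compares the closed form $(2^k - 2)/2^k$ with a count over all $2^k$ side assignments and sums the probabilities into an expected cut count; the function names and the sample net sizes are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from typing import Iterable


def cut_probability(k: int) -> Fraction:
    """Closed form: all assignments except the two one-sided ones cut the net."""
    return Fraction(2**k - 2, 2**k)


def cut_probability_by_enumeration(k: int) -> Fraction:
    """Count assignments of k pins to sides 1 and 2 that use both sides."""
    assignments = list(product((1, 2), repeat=k))
    cut = sum(1 for a in assignments if len(set(a)) == 2)
    return Fraction(cut, len(assignments))


def expected_cut_count(net_sizes: Iterable[int]) -> Fraction:
    """Expected cut count for unit-connectivity nets with the given pin counts."""
    return sum((cut_probability(k) for k in net_sizes), Fraction(0))


for k in (2, 3, 4):
    assert cut_probability(k) == cut_probability_by_enumeration(k)
print(cut_probability(2), cut_probability(3), cut_probability(4))  # 1/2 3/4 7/8
print(expected_cut_count([2] * 1000))  # 500
```

With only two-pin nets the expected cut count is exactly half the number of nets, which matches the lower bound quoted in the text.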
Suppose the two bounds of partitioned sizes are not equal, $S_l \ne S_u$. Using the proposed random graph model, the expected cut count $C(V_1, V_2)$ is proportional to the product of the two sizes, i.e., $S(V_1) \times S(V_2)$. Consequently, the expected cut count is smallest if the size of one partition approaches the upper bound, $S(V_i) = S_u$, and the size of the other partition approaches the lower bound, $S(V_j) = S_l$. In practice, we do observe this behavior. One partition is fully loaded to its maximum capacity, while the other partition is underutilized with a large capacity left unused. This phenomenon is not desirable for certain applications.
3.1.4. Ratio Cut
Ratio cut formulation integrates the cut count and a partition size balance criterion into a single objective function [87, 109]. Given a partition $(V_1, V_2)$ where $V_1$ and $V_2$ are disjoint and $V_1 \cup V_2 = V$, the objective function is defined as
FIGURE 4 Four possible configurations of net $e_i = \{a, b\}$ in a random placement.
$$\frac{C(V_1, V_2)}{S(V_1) \times S(V_2)} \qquad (6)$$
The numerator of the objective function minimizes the cut count while the denominator avoids uneven partition sizes. Like many other partitioning problems, finding the ratio cut in a general network belongs to the class of NP-complete problems [87].
Example Figure 5 shows a seven module example. The modules are of unit size and the nets are of unit connectivity. Partition $(V_1, V_2)$ has a cost $C(V_1, V_2)/(S(V_1) \times S(V_2)) = 2/(4 \times 3) = 1/6$. Any other partition corresponds to a much larger cost.
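For very small graphs the ratio cut objective (6) can be minimized by exhaustive search, which is a convenient way to build intuition. The sketch below assumes unit module sizes and two-pin nets; the six-module graph (two triangles joined by a single net) is a hypothetical example and is not the circuit of Figure 5.

```python
from fractions import Fraction
from itertools import combinations
from typing import List, Set, Tuple

Edge = Tuple[str, str, int]  # (module, module, connectivity)


def ratio_cut_value(edges: List[Edge], V1: Set[str], V2: Set[str]) -> Fraction:
    """Objective (6) with unit module sizes: C(V1, V2) / (|V1| * |V2|)."""
    cut = sum(c for u, v, c in edges if (u in V1) != (v in V1))
    return Fraction(cut, len(V1) * len(V2))


def min_ratio_cut(vertices: List[str], edges: List[Edge]):
    """Enumerate every nonempty proper subset V1; exponential, toy sizes only."""
    best = None
    for r in range(1, len(vertices)):
        for subset in combinations(vertices, r):
            V1 = set(subset)
            V2 = set(vertices) - V1
            value = ratio_cut_value(edges, V1, V2)
            if best is None or value < best[0]:
                best = (value, V1, V2)
    return best


# Hypothetical graph: triangles {a, b, c} and {d, e, f} joined by net (c, d).
vertices = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1), ("c", "d", 1)]
value, V1, V2 = min_ratio_cut(vertices, edges)
print(value, sorted(V1), sorted(V2))  # 1/9 ['a', 'b', 'c'] ['d', 'e', 'f']
```

Isolating a single module here costs at least $2/(1 \times 5) = 2/5$, so the balanced cut through the one connecting net wins, unlike the plain min-cut objective, which has no size term to discourage lopsided partitions.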
The Clustering Property of the Ratio Cut The clustering property of the ratio cut can be illustrated by a random graph model. Let us assume that the circuit is a uniformly distributed random graph with uniform module sizes, i.e., $s_i = 1$. We construct the nets connecting each pair of modules with identical independent probability $f$. Consider a cut which partitions the circuit into two subsets $V_1$ and $V_2$ with comparable sizes $c \times |V|$ and $(1 - c) \times |V|$ respectively, where $c < 1$. The expected cut count equals the probability $f$ multiplied by the number of possible nets between $V_1$ and $V_2$.
$$\mathrm{Expec}(C(V_1, V_2)) = f \times |V_1| \times |V_2| = c(1 - c)|V|^2 f. \qquad (7)$$
On the other hand, if another cut separates only one module $v_s$ from the rest of the modules, the expected cut count is
$$\mathrm{Expec}(C(\{v_s\}, V - \{v_s\})) = (|V| - 1) f \qquad (8)$$
As $|V|$ approaches infinity, the value of Eq. (7) becomes much larger than that of Eq. (8).
This derivation provides another explanation why the min-cut separating two fixed modules tends to generate very uneven sized subsets. The very uneven sized subsets naturally give the lowest cut value. Therefore, the ratio value $C(V_1, V_2)/(S(V_1) \times S(V_2))$ is proposed to alleviate the hidden size effect. As a consequence, the expected value of this ratio is a constant with respect to different cuts:
$$\mathrm{Expec}\left(\frac{C(V_1, V_2)}{S(V_1) \times S(V_2)}\right) = \frac{f \times |V_1| \times |V_2|}{|V_1| \times |V_2|} = f \qquad (9)$$
Thus, if the nets of the graph are uniformly distributed, all cuts have the same ratio value. In other words, the choice of the cuts and the partition sizes does not make a difference in such a uniformly distributed random graph. In a general circuit, different cuts generate different ratios. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their corresponding ratios defines the sparsest cut since this cut deviates the most from the expectation on a uniformly distributed graph.
3.2. Multi-way Partitioning
For multi-way partitioning, we discuss a k-way partitioning with fixed size constraints and a cluster ratio cut. These two problems are the extensions of the min-cut with fixed size constraints and the ratio cut from two-way to multi-way partitioning, respectively.
3.2.1. K-way Partitioning
For multi-way partitioning, we separate vertex set $V$ into $k$ disjoint subsets where $k > 2$, i.e., $(V_1, V_2, \ldots, V_k)$. There is an upper bound $S_u$ and a lower bound $S_l$ on the size of each subset $V_i$, i.e., $S_l \le S(V_i) \le S_u$.
FIGURE 5 An example of seven modules, where partition $(V_1, V_2)$ is a minimum ratio cut.
There are different ways to formulate the cut cost because of the different criteria used to count the cost of multiple pin nets. In the following we list a few possible objective functions.
(i) Minimize the cut count,
$$C(V_1, V_2, \ldots, V_k) = \sum_{e_i \in E(V_1, V_2, \ldots, V_k)} c_i \qquad (10)$$
(ii) Minimize the sum of cut counts of all vertex sets. Let us denote the cut count of vertex set $V_i$ to be $C(V_i) = \sum_{e_j \in E(V_i)} c_j$. The sum of cut counts of all subsets can be expressed as
$$\sum_{i=1}^{k} C(V_i) = \sum_{i=1}^{k} \sum_{e_j \in E(V_i)} c_j \qquad (11)$$
Thus, a net connecting three subsets costs more than the same net connecting two subsets.
(iii) Minimize the maximum cut count of all subsets, i.e.,
$$\max_{1 \le i \le k} C(V_i) \qquad (12)$$
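The three objectives (10) to (12) differ only in how a multiple pin net is counted, which a small computation makes visible. The sketch below assumes nets as pin sets with integer connectivities; the function name and the three-block example are illustrative assumptions.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple


def kway_costs(nets: Sequence[FrozenSet[str]], conn: Sequence[int],
               parts: Sequence[Set[str]]) -> Tuple[int, int, int]:
    """Return (cut count (10), sum of block cut counts (11), max block cut count (12))."""
    def block_cut(block: Set[str]) -> int:
        # C(V_i): connectivity of nets with some, but not all, pins in the block
        return sum(c for e, c in zip(nets, conn) if 0 < len(e & block) < len(e))

    per_block: List[int] = [block_cut(p) for p in parts]
    cut = sum(c for e, c in zip(nets, conn)
              if any(0 < len(e & p) < len(e) for p in parts))
    return cut, sum(per_block), max(per_block)


# Hypothetical example: net 0 spans two blocks, net 1 spans three, net 2 is internal.
nets = [frozenset({"a", "c"}), frozenset({"b", "d", "e"}), frozenset({"e", "f"})]
conn = [1, 1, 1]
parts = [{"a", "b"}, {"c", "d"}, {"e", "f"}]
print(kway_costs(nets, conn, parts))  # (2, 5, 2)
```

Under objective (10) both cut nets cost 1 each. Under objective (11) the net spanning three blocks contributes 3 and the net spanning two blocks contributes 2, for a total of 5.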
3.2.2. Cluster Ratio Cut
Cluster ratio cut is an extension of ratio cut from two-way partitioning to multiway partitioning. There is no bound on the size of each subset. Furthermore, the number of partitions, $k$, is not fixed, and instead is part of the objective function.
$$R_C = \min_{k > 1} \frac{C(V_1, V_2, \ldots, V_k)}{\sum_{1 \le i \le k-1} \sum_{j > i} S(V_i) \times S(V_j)} \qquad (13)$$
Note that we can rewrite the denominator to reduce the complexity of the derivation.
$$R_C = \min_{k > 1} \frac{C(V_1, V_2, \ldots, V_k)}{(1/2) \sum_{1 \le i \le k} S(V_i) \times [S(V) - S(V_i)]} \qquad (14)$$
If the number of partitions is one, the denominator becomes zero. Thus, $k$ is restricted to be larger than one.
Example Figure 6 shows a fifteen module circuit. The modules are of unit size and the nets are of unit connectivity. The square dot in the figure represents a hypernet. The partition shown by the dashed line is a minimum cluster ratio cut. The cost of the cut is
$$\frac{C(V_1, V_2, \ldots, V_4)}{(1/2) \sum_{1 \le i \le 4} S(V_i) \times [S(V) - S(V_i)]} = \frac{4}{(1/2)[4(15 - 4) + 3(15 - 3) + 4(15 - 4) + 4(15 - 4)]} = \frac{1}{21} \qquad (15)$$
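The arithmetic of the example, and the equivalence of the two denominators in (13) and (14), can be verified with a few lines. This is only a sketch of the cost evaluation for a given partition; the function name is an assumption, and the cut count and block sizes are taken from the example above.

```python
from fractions import Fraction
from typing import Sequence


def cluster_ratio(cut_count: int, sizes: Sequence[int]) -> Fraction:
    """Cluster ratio of one partition, given its cut count and block sizes."""
    total = sum(sizes)
    # Denominator of (13): sum over unordered pairs of blocks.
    pairwise = sum(sizes[i] * sizes[j]
                   for i in range(len(sizes))
                   for j in range(i + 1, len(sizes)))
    # Denominator of (14): half the sum of S(V_i) * (S(V) - S(V_i)).
    halved = Fraction(sum(s * (total - s) for s in sizes), 2)
    assert pairwise == halved  # the two forms count each pair exactly once
    return Fraction(cut_count, pairwise)


print(cluster_ratio(4, [4, 3, 4, 4]))  # 1/21
```

The pairwise denominator is 84 for block sizes 4, 3, 4, 4, so a cut count of 4 gives $4/84 = 1/21$, as in expression (15).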
FIGURE 6 A fifteen module example to demonstrate cluster ratio cut.
The physical intuition of cluster ratio can be explained using a random graph model [10]. Let $G$ be a uniformly distributed random graph. We construct the nets connecting each pair of modules with identical independent probability $f$. Since the nets are uniformly distributed, the probability of finding a subgraph which is significantly denser than the rest of the graph is very small, meaning that there is no distinct cluster structure in $G$. Consider a cut $E(V_1, V_2, \ldots, V_k)$; the expected value of $C(V_1, V_2, \ldots, V_k)$ equals
$$\mathrm{Expec}(C(V_1, V_2, \ldots, V_k)) = f \times \sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i| \times |V_j| \qquad (16)$$
and the expected value of cluster ratio equals
$$\mathrm{Expec}(R_C) = \mathrm{Expec}\left(\frac{C(V_1, V_2, \ldots, V_k)}{\sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i| \times |V_j|}\right) = \frac{f \times \sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i| \times |V_j|}{\sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i| \times |V_j|} = f \qquad (17)$$
Since $f$ is a constant, all cuts have the same expected cluster ratio value. Therefore, if we use cluster ratio as the metric, all cuts would be equally favored, which is consistent with the fact that $G$ has no distinct clusters. However, in a general circuit, different cuts generate different ratio values. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their cluster ratio values defines the cluster structure of the circuit since this cut deviates the most from the cuts of a uniformly distributed graph.
3.3. Multi-level Partitioning
In multi-level partitioning [4, 23, 47, 58, 67, 68, 109, 110], the final result is represented by a tree structure. All the modules are assigned to the leaves of the tree. The tree is directed from the root toward the leaves. The level of a node is defined to be the maximum number of nodes to traverse to reach the leaves. Thus, the leaves are ranked level zero. Each node is one level above the maximum level of its children. When the level of the root is only one, the problem degenerates to two-way or multiway partitioning.
Each net $e_i$ spans a set of leaves. Given a set of leaves, there is a unique lowest common ancestor. The level of the lowest common ancestor is defined to be the level $l(e_i)$ of the net.
The cost of a net $e_i$ is defined to be the product of its connectivity $c_i$ and the weight $w(l(e_i))$ of level $l(e_i)$ for net $e_i$ to communicate, i.e., $c_i \times w(l(e_i))$. The cost of the multi-level partition is the sum of the costs of all nets, i.e., $\sum_{e_i \in E} c_i \times w(l(e_i))$.
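The cost of a multi-level partition can be evaluated directly from the tree. The sketch below assumes the tree is given as a dictionary from each internal node to its children, that each module is mapped to a leaf, and that a weight function $w$ is supplied by the caller; the node names, the example tree, and the choice $w(l) = 2^l$ in the usage lines are assumptions for illustration.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence


def multilevel_cost(children: Dict[str, List[str]], root: str,
                    leaf_of: Dict[str, str],
                    nets: Sequence[FrozenSet[str]], conn: Sequence[int],
                    w: Callable[[int], int]) -> int:
    """Sum of c_i * w(l(e_i)), where l(e_i) is the level of the net's lowest common ancestor."""
    parent = {c: p for p, kids in children.items() for c in kids}
    levels: Dict[str, int] = {}

    def level(node: str) -> int:
        # Leaves are level 0; a node is one above the maximum level of its children.
        kids = children.get(node, [])
        levels[node] = 0 if not kids else 1 + max(level(c) for c in kids)
        return levels[node]

    level(root)

    def chain(node: str) -> List[str]:
        # The node followed by its ancestors up to the root.
        out = [node]
        while out[-1] in parent:
            out.append(parent[out[-1]])
        return out

    total = 0
    for e, c in zip(nets, conn):
        chains = [chain(leaf_of[v]) for v in e]
        common = set(chains[0]).intersection(*chains[1:])
        lca = next(n for n in chains[0] if n in common)  # lowest common node
        total += c * w(levels[lca])
    return total


# Hypothetical tree: root -> n1, n2; n1 -> L1, L2; n2 -> L3.
children = {"root": ["n1", "n2"], "n1": ["L1", "L2"], "n2": ["L3"]}
leaf_of = {"a": "L1", "b": "L1", "c": "L2", "d": "L3"}
nets = [frozenset("ab"), frozenset("ac"), frozenset("cd")]
print(multilevel_cost(children, "root", leaf_of, nets, [1, 1, 1], lambda l: 2 ** l))  # 7
```

The three nets have levels 0, 1 and 2, so with $w(l) = 2^l$ and unit connectivity the cost is $1 + 2 + 4 = 7$.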
3.3.1. J-level K-way Partitioning
When the root of the partitioning tree is at level $j$ and the number of branches of each node is no more than $k$, we call it a $j$-level $k$-way partition. We can set different communication weights for each level. Usually, the function is monotone, i.e., $w(l)$ is larger when level $l$ increases. The vertex set $V_i$ of each leaf $i$ has its size bounded by $S_l \le S(V_i) \le S_u$.
For electronic packaging, the tree is bounded by the number of external connections. We say that a leaf is covered by a node if there is a directed path from the node to the leaf in the tree representation. For each node $n_i$, we define $T_i$ to be the union of the modules in the leaves covered by node $n_i$. Let $E(T_i)$ be the external nets of $T_i$, i.e., $E(T_i) = \{e_j \mid 0 < |e_j \cap T_i| < |e_j|\}$. The cut count of each node should not exceed the capacity of the external connection of the packaging, i.e.,
$$C(T_i) = \sum_{e_j \in E(T_i)} c_j \le \mathrm{Cap}(l(n_i)) \qquad (18)$$
where $\mathrm{Cap}(l(n_i))$ is the capacity of the external connection of level $l(n_i)$.
Example Figure 7 shows an example of a 3-level 5-way partitioning structure. The leaves are at level 0 and the root is at level 3. Each node has at most five children. Net $e_i = \{v_1, v_2, v_3\}$ is covered by node $n_a$ at level $l(n_a) = 2$.
3.3.2. Generic Binary Tree
A generic binary tree structure [110] is proposed to simplify the multi-level partitioning. There is only one constant $S_u$ to set in the binary tree. Thus, it is much easier to make a fair comparison between different algorithms.
In a generic binary tree, each internal node has exactly two children. The weight of each level is defined to be $w(l) = 2^l$. Thus, we have the objective function
$$\min \sum_{e_i \in E} c_i \, 2^{l(e_i)}$$
subject to the constraint on the capacity of the leaves, i.e., $S(V_i) \le S_u$ where $V_i$ is the vertex set of leaf $i$. The level of the root is adjusted according to the minimization of the objective function.
Example Figure 8 illustrates a generic binary tree for partitioning. In this figure, the root is at level three. Each node has at most two children.
3.4. Replication Cut
In the replication cut problem, a subset of the circuit may be replicated to reduce the cut count of a partition [54, 64, 82]. In this section, we use a two-way partition to illustrate the problem. We fix two modules $v_s$ and $v_t$ at two sides of the cut. We use three vertex sets to represent the partition, $V_1$, $V_2$, and $R$, where $V_1$, $V_2$, and $R$ are disjoint and $V_1 \cup V_2 \cup R = V$, $v_s \in V_1$, $v_t \in V_2$. Subsets $V_1$ and $V_2$ are separated by the cut and subset $R$ is to be replicated at both sides (Fig. 9).
Each copy of R needs to collect a complete set of
input signals in order to compute the function
FIGURE 7 An example of a 3-level 5-way partitioning tree structure.
FIGURE 8 An example of a generic binary tree.
properly. Thus, the nets from $V_1$ to $R$ and from $V_2$ to $R$ are duplicated. However, the output signals of $R$ can be obtained from either copy of $R$. For example, nets from the right side $R$ to $V_1$ in Figure 9(b) are not duplicated because $V_1$ gets inputs from the left side $R$. For the same reason, we do not replicate the nets from the left side $R$ to $V_2$.
Given two disjoint sets $V_1$ and $V_2$, let a replication cut $R(V_1, V_2)$ denote the cut set of a partitioning with $R = V - V_1 - V_2$ being duplicated. From Figure 9(b), we can see that $R(V_1, V_2)$ is the union of four directed cuts, that is,
$$R(V_1, V_2) = E(V_1 \to V_2) \cup E(V_2 \to V_1) \cup E(V_1 \to R) \cup E(V_2 \to R).$$
Let $S_l$ and $S_u$ denote the size limits on the two partitioned subsets. We state the Replication Cut Problem as follows:
Given a directed circuit $G$, we want to find a replication cut $R(V_1, V_2)$ with an objective
$$\min C_R(V_1, V_2) = \sum_{e_i \in R(V_1, V_2)} c_i \qquad (19)$$
subject to the size constraints
$$S_l \le S(V_1 \cup R) \le S_u \quad \text{and} \quad S_l \le S(V_2 \cup R) \le S_u,$$
and the feasible condition
$$V_1 \cap V_2 = \emptyset, \quad R = V - V_1 - V_2.$$
Interpretation of the Replication Cut Suppose we rewrite the replication cut in the format:
$$R(V_1, V_2) = E(V_1 \to R) \cup E(V_1 \to V_2) \cup E(V_2 \to V_1) \cup E(V_2 \to R) = E(V_1 \to \bar{V}_1) \cup E(V_2 \to \bar{V}_2)$$
where $\bar{V}_1$ and $\bar{V}_2$ denote the complementary sets of $V_1$ and $V_2$, i.e., $\bar{V}_1 = V - V_1$ and $\bar{V}_2 = V - V_2$. The cut set becomes the union of $E(V_1 \to \bar{V}_1)$ and $E(V_2 \to \bar{V}_2)$. We can interpret the cut set of the replication cut $R(V_1, V_2)$ as two directed cuts on the original circuit $G$ as shown in Figure 10.
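The equivalence between the four-cut form and the two-cut form can be checked mechanically. The sketch below assumes directed nets stored as (source pins, sink pins) pairs and computes the replication cut both ways; the function names and the three-net example are illustrative assumptions.

```python
from typing import List, Sequence, Set, Tuple

DirectedNet = Tuple[Set[str], Set[str]]  # (source pins a_i, sink pins b_i)


def directed_cut(nets: Sequence[DirectedNet], X: Set[str], Y: Set[str]) -> Set[int]:
    """E(X -> Y): nets with a source pin in X and a sink pin in Y."""
    return {i for i, (a, b) in enumerate(nets) if a & X and b & Y}


def replication_cut(nets: Sequence[DirectedNet], V: Set[str],
                    V1: Set[str], V2: Set[str]) -> Set[int]:
    """R(V1, V2) with R = V - V1 - V2 replicated on both sides."""
    R = V - V1 - V2
    four_cuts = (directed_cut(nets, V1, V2) | directed_cut(nets, V2, V1)
                 | directed_cut(nets, V1, R) | directed_cut(nets, V2, R))
    two_cuts = directed_cut(nets, V1, V - V1) | directed_cut(nets, V2, V - V2)
    assert four_cuts == two_cuts  # the rewriting shown above
    return four_cuts


# Hypothetical circuit: s drives r, r drives t, and t feeds back to r.
V = {"s", "r", "t"}
nets: List[DirectedNet] = [({"s"}, {"r"}), ({"r"}, {"t"}), ({"t"}, {"r"})]
print(replication_cut(nets, V, {"s"}, {"t"}))  # {0, 2}
```

The net driven by the replicated module r does not appear in the cut, since each side can take that signal from its own copy of r; only the nets feeding r from either side are counted.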
3.5. Performance Driven Partitioning
The goal of performance driven partitioning is to generate a partition that satisfies some timing constraints. Due to the physical geometric distance and interface technology limitations, inter-partition delay contributes the dominant portion of signal propagation delay. Consequently, instead of minimizing the number of crossing nets as the only objective during partitioning, we should take into account the interpartition delay to satisfy the timing constraints.
Clock period is a major measurement for circuit performance. It is determined by the longest signal propagation delay between registers. Each crossing net is associated with an interpartition delay $c$ determined by VLSI technologies. Given a path $p$ from one register to another register with no intervening registers, let $d_p$ be the sum of combinational block delays and $\delta_p$ be the sum of interpartition delays along path $p$. The longest delay $d_p + \delta_p$ among all paths $p$ should be no larger than the clock period $T$, i.e.:
$$\max_p \, (d_p + \delta_p) \le T. \qquad (20)$$
FIGURE 9 Replication cut problem: (a) the three sets of nodes $V_1$, $R$ and $V_2$; (b) the duplicated circuit with $R$ being replicated.
Now we state the performance-driven partitioning problem as follows:
Given hypergraph $H(V, E)$, clock period $T$, two bounds of sizes $S_l$ and $S_u$, and interpartition delay $c$, find a partition $(V_1, V_2)$ with the minimum cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, and $\max_p \, (d_p + \delta_p) \le T$.
Example In Figure 11, path $p$ starts at register $v_i$ and ends at register $v_j$. The path crosses between the partition $(V_1, V_2)$ three times. Thus, the interpartition delay $\delta_p = 3c$.
Replication can improve the performance of the partitioned results [83]. In Figure 12(a), vertex set $R$ is located on the side of $V_2$. Path $p$ crosses between the partition $(V_1, R \cup V_2)$ three times. By replicating vertex set $R$ (Fig. 12(b)), path $p$ needs to cross the partition only once.
FIGURE 10 An interpretation of the replication cut, $R(V_1, V_2) = E(V_1 \to \bar{V}_1) \cup E(V_2 \to \bar{V}_2)$.
FIGURE 11 An illustration of performance driven partitioning.
3.5.1. Retiming
Retiming shifts the locations of the registers to improve the system performance [76]. It is an effective approach to reduce the clock period. Moreover, the process also reduces the primary input to primary output latency, which is another important measurement for circuit performance. As in [85], we assume that the combinational blocks are fine-grained. A module is called fine-grained if it can be split into several smaller modules. Alternatively, if a module cannot be split, it is called coarse-grained. The interpartition delay $c$ on crossing nets is inherently coarse-grained and cannot be split.
Given a path $p$, we use $r_p$ to denote the number of registers on the path. Let $W(i, j)$ denote the minimum $r_p$ among all possible paths $p$ from $i$ to $j$, i.e.,
$$W(i, j) = \min \{ r_p \mid p \in P_{ij} \},$$
where $P_{ij}$ is the set of all paths from module $v_i$ to $v_j$. We define a path $p$ from $v_i$ to $v_j$ as a W-critical path if $r_p$ equals $W(i, j)$; W-critical path $p$ is also called an IO-W-critical path if modules $v_i$ and $v_j$ are the primary input and output, respectively.
(i) Iteration Bound While retiming can reduce the clock period of a circuit, there is a lower bound imposed by the feedback loops in the hypergraph [92]. Given a loop $l$, let $d_l$, $\delta_l$ and $r_l$ be the sum of combinational block delays, the sum of interpartition delays, and the number of registers in loop $l$, respectively. The delay-to-register ratio of a loop $l$ is equal to $(d_l + \delta_l)/r_l$. The iteration bound is defined as the maximum delay-to-register ratio, i.e.:
$$J(V_1, V_2) = \max \left\{ \frac{d_l + \delta_l}{r_l} \;\middle|\; l \in L \right\}, \qquad (21)$$
where $L$ is the set of all loops. Note that the iteration bound of a given circuit yields a lower bound on the clock period achievable by retiming.
(ii) Latency Bound Let $p$ denote the IO-W-critical path with maximum path delay among all IO-W-critical paths from $v_i$ to $v_j$. Since the number of registers in path $p$ is equal to $W(i, j)$, the IO latency (i.e., $(W(i, j) + 1) \times T$) between $v_i$ and $v_j$ is not less than $d_p + \delta_p$, where $T$ denotes the clock period, and $d_p$ and $\delta_p$ are the sum of combinational block delays and the sum of interpartition delays on path $p$, respectively. Thus, we define the latency bound $M$ as follows [85, 86]:
$$M(V_1, V_2) = \max \{ d_p + \delta_p \mid p \in P_{IOW} \}, \qquad (22)$$
where $P_{IOW}$ is the set of all IO-W-critical paths. The latency bound also imposes a lower bound on the system latency achieved by using retiming. An all-pair shortest-path algorithm can be used to calculate the latency bound.
We have two reasons to use the iteration and
latency bounds. (i) It is faster to calculate these
bounds. (ii) The iteration and latency bounds
stand for the lower bounds of the clock period and
system latency achieved by adopting retiming,
respectively. The partition with lower iteration and
FIGURE 12 Illustration of replication and its effect on partitioning. The figure shows path $p$ (a) before and (b) after vertex set $R$ is replicated.
latency bounds can achieve better clock period and
system latency by using retiming. Therefore, we
want to generate a partition with small iteration
and latency bounds.
Statement of the Problem Now we state the performance-driven partitioning problem as follows:
Given hypergraph $H(V, E)$, two numbers $\tilde{J}$ and $\tilde{M}$, bounds of sizes $S_l$ and $S_u$, and interpartition delay $c$, find a partition $(V_1, V_2)$ with the minimum cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, $J(V_1, V_2) \le \tilde{J}$, and $M(V_1, V_2) \le \tilde{M}$.
Example Figure 13 illustrates the effect of replication on the iteration bound. Let us assume that the interpartition delay is $c = 4$. Before replication, the iteration bound is dominated by loop $l_1$. The bound is equal to
$$\frac{d_{l_1} + \delta_{l_1}}{r_{l_1}} = \frac{8 + 2 \times 4}{4} = 4. \qquad (23)$$
After replication [85], the bound contributed by loop $l_1$ is equal to
$$\frac{d_{l_1} + \delta_{l_1}}{r_{l_1}} = \frac{8}{4} = 2. \qquad (24)$$
The iteration bound now is dominated by the union of loops $l_1$ and $l_2$,
$$\frac{d_{l_1 \cup l_2} + \delta_{l_1 \cup l_2}}{r_{l_1 \cup l_2}} = \frac{18 + 2 \times 4}{8} = 3.25, \qquad (25)$$
which is smaller than the iteration bound before replication.
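The numbers in this example can be reproduced with a small helper that evaluates the delay-to-register ratio of each loop and takes the maximum. The sketch assumes each loop is summarized by its combinational delay, its number of partition crossings, and its register count; this tuple layout and the function names are assumptions, and only the loops quoted in the example are listed.

```python
from fractions import Fraction
from typing import Sequence, Tuple

Loop = Tuple[int, int, int]  # (combinational delay, partition crossings, registers)


def delay_to_register_ratio(loop: Loop, c: int) -> Fraction:
    """(d_l + crossings * c) / r_l for one loop, with interpartition delay c."""
    d, crossings, r = loop
    return Fraction(d + crossings * c, r)


def iteration_bound(loops: Sequence[Loop], c: int) -> Fraction:
    """Maximum delay-to-register ratio over the given loops."""
    return max(delay_to_register_ratio(loop, c) for loop in loops)


c = 4
before = [(8, 2, 4)]            # loop l1 crosses the partition twice
after = [(8, 0, 4), (18, 2, 8)]  # l1 no longer crosses; the union of l1 and l2 does
print(iteration_bound(before, c))        # 4
print(float(iteration_bound(after, c)))  # 3.25
```

The bound drops from 4 to 3.25 because replication removes both crossings from loop $l_1$, leaving the longer combined loop as the critical one.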
3.6. Clustering
Clustering [6] is similar to multiway partitioning in that the process groups modules into $k$ subsets. However, for clustering the number of subsets is usually much greater than for a typical multiway partitioning problem, e.g., $k \ge 10$.
Often, a clustering process is used as part of a divide and conquer approach. Thus, it is important to choose an objective function that fits the target application. If the goal is to reduce problem complexity, we set the objective function to be:
$$\min \sum_{i=1}^{k} \frac{C(V_i)}{C_I(V_i)}, \qquad (26)$$
where the $V_i$'s are disjoint vertex sets and their union is equal to $V$. Function $C(V_i)$ is the external cut count of cluster $V_i$ and $C_I(V_i)$ is the count of nets connecting vertex set $V_i$, i.e., $\sum_{e_i \in I(V_i)} c_i$.
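Objective (26) can be evaluated for a candidate clustering as follows. The sketch assumes nets as pin sets with integer connectivities and assumes every cluster touches at least one net, so that $C_I(V_i) > 0$; the function name and the four-module chain are illustrative assumptions.

```python
from fractions import Fraction
from typing import FrozenSet, Sequence, Set


def clustering_objective(nets: Sequence[FrozenSet[str]], conn: Sequence[int],
                         clusters: Sequence[Set[str]]) -> Fraction:
    """Sum over clusters of C(V_i) / C_I(V_i), as in objective (26)."""
    total = Fraction(0)
    for block in clusters:
        # C(V_i): nets with some, but not all, pins in the cluster
        external = sum(c for e, c in zip(nets, conn) if 0 < len(e & block) < len(e))
        # C_I(V_i): nets with at least one pin in the cluster (assumed nonzero)
        adjacent = sum(c for e, c in zip(nets, conn) if e & block)
        total += Fraction(external, adjacent)
    return total


# Hypothetical chain a - b - c - d of unit-connectivity two-pin nets.
nets = [frozenset("ab"), frozenset("bc"), frozenset("cd")]
conn = [1, 1, 1]
print(clustering_objective(nets, conn, [{"a", "b"}, {"c", "d"}]))    # 1
print(clustering_objective(nets, conn, [{"a"}, {"b", "c"}, {"d"}]))  # 8/3
```

The first clustering keeps two of the three nets internal and scores 1, while the second cuts two nets and scores 8/3, so the objective prefers clusters whose nets are mostly absorbed.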
For performance driven clustering, the objective
function is to minimize the number of cuts
between registers.
4. MULTIPLE PIN NET MODELS
The handling of multiple pin nets strongly depends on the partitioning approach [102]. A proper model is needed to reflect the correct cut count and improve the efficiency. In this section, we first introduce a shift model which is used for iterations of
FIGURE 13 Illustration of replication and its effect on iteration bound.
shifting a module or swapping a pair of modules. We then describe a clique model which is used to replace a multiple pin net. The star and loop models are variations of two pin net models, but with less complexity than the clique model. Finally, a flow model is introduced for network flow approaches.
4.1. Shift Model
The shift model [101] for multiple pin nets is useful when we perturb the partition by shifting one module to a different vertex set or by swapping two modules between different vertex sets. Let us simplify the description by assuming only one module is shifted to a different vertex set. A swap of a pair of modules can be treated as two steps of module shifting.
For each shift, we want to update the cut count. We also want to update the potential change in cost for each module if it were to be shifted, so that we can rank the modules for the next move. Such cost revision can be expensive if the circuit has large nets that contain huge numbers of pins, e.g., hundreds of thousands of pins.
The shift model reduces the complexity of the cost revision by exploiting the property that, for huge nets, most shifts of a pin do not change the cost of the other pins in the net.
Let us simplify the description by considering a two way partitioning. The model can be extended to multiple way partitioning according to the choice of objective functions. Let module $v_j$ be shifted from vertex set $V_1$ to $V_2$. The configuration of the nets $e_i \in E(\{v_j\})$ connecting module $v_j$ is revised. For each net $e_i$, we denote by $k_i$ the number of pins of $e_i$ in $V_1$ and by $|e_i| - k_i$ the number of pins of $e_i$ in $V_2$ (Fig. 14). With respect to net $e_i$, we update the pin numbers $k_i$ and $|e_i| - k_i$ after module $v_j$ is shifted. We also update the cost of modules in nets $e_i$.
1. If the revised $k_i \geq 2$, the potential cost of pins due to net $e_i$ is zero. For the case that $|e_i| - k_i = 1$, we increase the cut count by $c_i$ and set the potential cost of pins in $e_i$. Otherwise, the move has no effect on the cut count and potential cost.
2. If the revised pin count $k_i = 1$, the shift of the last pin of $e_i$ in $V_1$ will decrease the cut count by $c_i$. We then update the potential cost of this last pin.
3. If $k_i = 0$, the cut count reduces by $c_i$. However, the shift of any pin $v_k \in e_i$ from $V_2$ to $V_1$ will increase the cut count. Thus, in this case, we reflect the cost of potential shift on the pins of $e_i$, which takes $O(|e_i|)$ operations.
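The cut count part of the shift model can be sketched in a few lines. This is only an illustration under assumed data structures: `pins_in[e]` holds the pin counts of net `e` on the two sides, `nets_of[v]` lists the nets attached to module `v`, and the function name `shift_module` is hypothetical. The bookkeeping of potential costs described in cases 1 to 3 is omitted.

```python
def shift_module(v, side, nets_of, pins_in, weight, cut):
    """Shift module v to the other side and return the updated cut count.

    side[v] is 0 or 1; pins_in[e] = [pins of e on side 0, pins of e on side 1].
    Only the nets attached to v are visited, and only their pin counts are read.
    """
    src = side[v]
    dst = 1 - src
    for e in nets_of[v]:
        if pins_in[e][dst] == 0:   # net was entirely on src: it becomes cut
            cut += weight[e]
        pins_in[e][src] -= 1
        pins_in[e][dst] += 1
        if pins_in[e][src] == 0:   # last pin left src: net is no longer cut
            cut -= weight[e]
    side[v] = dst
    return cut


# Two nets: e0 = {a, b}, e1 = {a, b, c}; a and b start on side 0, c on side 1.
side = {"a": 0, "b": 0, "c": 1}
nets_of = {"a": ["e0", "e1"], "b": ["e0", "e1"], "c": ["e1"]}
pins_in = {"e0": [2, 0], "e1": [2, 1]}
weight = {"e0": 1, "e1": 1}
cut_after_a = shift_module("a", side, nets_of, pins_in, weight, 1)
cut_after_b = shift_module("b", side, nets_of, pins_in, weight, cut_after_a)
```

Each shift costs time proportional to the number of nets on the module, independent of the net sizes, which is the point of the model for very large nets.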
4.2. Clique of Two Pin Nets
Some researchers use cliques of two pin nets to model multiple pin nets. Given a multiple pin net $e_i$, we construct a clique of $(1/2)|e_i|(|e_i| - 1)$ two pin nets to connect all pairs of pins in the net. The clique model maintains the symmetric relation of the modules of the same net in the sense that the order of the pins in the net has no effect on the cost.
The weight of the two pin nets in the clique model is adjusted by some factor. One approach is to use $2/|e_i|$ to scale down the connectivity. The total weight of all the nets in the clique is $(2/|e_i|)(1/2)|e_i|(|e_i| - 1)c_i = (|e_i| - 1)c_i$. Note that it takes $|e_i| - 1$ two pin nets to form a spanning tree of $|e_i|$ modules.
Other factors have been proposed, such as $1/(|e_i| - 1)$, which is based on a different probability model. However, no factor can exactly reflect the cost of a multiple pin net model.
Complexity of the Clique Model The complexity of the clique model is high. There are $O(|e_i|^2)$ two pin nets in a clique model. Suppose the processing of each two pin net takes constant time. It then takes $O(|e_i|^2)$ operations to process a multiple pin net $e_i$. Therefore, in practice, if the pin number is larger than a threshold, the net is ignored in the process.
FIGURE 14 Multiple pin net model of shifting process.
4.3. Star of Two Pin Nets
A star model introduces less complexity than a clique model. Given a net $e_i$, we create a dummy module $\tilde{v}_i$. The dummy module $\tilde{v}_i$ connects to every pin in $e_i$ with a two pin net. This module maintains the symmetry of the net. However, we need only $|e_i|$ two pin nets.
For the clique and star models, the cost of the partition depends on the number of pins on the two sides of the partition. The cost is higher when the pins are distributed more evenly on the two sides of the cut. Thus, these models discourage even partitioning of the pins in the nets.
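The dependence of the clique and star costs on the pin distribution can be checked numerically. The sketch below assumes the $2/|e_i|$ clique factor from the text; for the star it assumes each two pin net carries weight $c_i$ and that the dummy module sits on the cheaper side. Both function names are illustrative.

```python
def clique_cut_cost(n_pins, left, c=1.0):
    """Cost of a clique with edge weight (2/|e|) * c when `left` pins are on one side."""
    right = n_pins - left
    return (2.0 / n_pins) * c * left * right  # left * right clique edges are cut


def star_cut_cost(n_pins, left, c=1.0):
    """Cost of a star whose dummy module is placed on the side with more pins.

    Assumption: every two pin net of the star has weight c.
    """
    right = n_pins - left
    return c * min(left, right)


# A net with 4 pins contributes 1 to the true cut count whenever it is cut.
uneven = (clique_cut_cost(4, 1), star_cut_cost(4, 1))
even = (clique_cut_cost(4, 2), star_cut_cost(4, 2))
```

Both models charge more for the 2/2 split than for the 1/3 split, although the hypergraph cut count is one in either case.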
4.4. Loop Model of Two Pin Nets
A loop model reflects the exact cut count [22]; however, it is sensitive to the order of the pins. We can derive a heuristic ordering of the pins using a linear placement. Modules are sequenced according to their x coordinates in the placement. We find the partition by collecting the modules according to the sequence.
Following the order of the modules in the x coordinates, we link the modules of a multiple pin net with two pin nets into a loop. We link the pins in a sequence (Fig. 15) alternating on every other module. The loop is formed by the two connections at the two ends.
A factor of (1/2) is assigned to the two pin nets so that the cut count separating modules according to the sequence is one. The model remains correct even if any two consecutive modules in the sequence swap their order.
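The loop construction can be sketched as follows. The reading of "alternating on every other module" used here, namely linking position $i$ to position $i+2$ and closing the loop with one link at each end, is an interpretation of the description, and the function names are hypothetical.

```python
def loop_edges(order):
    """Two pin nets of weight 1/2 linking every other module, closed at both ends."""
    n = len(order)
    if n < 3:
        raise ValueError("this sketch assumes a net with at least three pins")
    edges = [(order[i], order[i + 2]) for i in range(n - 2)]
    edges.append((order[0], order[1]))          # closing link at the left end
    edges.append((order[n - 2], order[n - 1]))  # closing link at the right end
    return [(u, v, 0.5) for u, v in edges]


def cut_cost(edges, left):
    """Total weight of the two pin nets with exactly one end in `left`."""
    return sum(w for u, v, w in edges if (u in left) != (v in left))


order = list("abcdef")
edges = loop_edges(order)
prefix_costs = [cut_cost(edges, set(order[:k])) for k in range(1, len(order))]
swapped_cost = cut_cost(edges, {"a", "c"})  # b and c exchanged, then split after two
```

Every split along the sequence crosses exactly two links of weight 1/2, and the same holds after exchanging two neighbours.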
4.5. Flow Model
For the network flow approach, we consider each net $e_i$ as a pipe. A set of saturated pipes forms a bottleneck of the flow. The union of the saturated pipes becomes the cut of the circuit. In such a model, we set the capacity of the pipe equal to the corresponding connectivity $c_i$ [52].
Let $x_{iu}$ be the amount of flow from pin $v_i$ to net $e_u$ and $x_{uj}$ be the amount of flow from net $e_u$ to pin $v_j$ (Fig. 16). The total flow injected into the net should be smaller than or equal to its capacity, and the incoming flow is equal to the outgoing flow, i.e.,
$$\sum_{v_i \in e_u} x_{iu} \leq c_u, \tag{27}$$
$$\sum_{v_i \in e_u} x_{iu} - \sum_{v_i \in e_u} x_{ui} = 0. \tag{28}$$
5. APPROACHES
In this section we introduce several approaches to partitioning. We first discuss two methods for optimal solutions: a branch and bound method and a dynamic programming algorithm. The branch and bound method is effective in searching exhaustively for the optimal solution for small circuits. The dynamic programming method presented runs in polynomial time and finds an optimal partition for a special class of circuits. We then explain a few heuristic algorithms: group migration, network flow, nonlinear programming, Lagrangian, and clustering methods. The group migration approach is a popular method in practice due to its flexibility and effectiveness. The network flow method gives us a different view of the partitioning problem by transforming the minimization of the cut count into the maximization of the flow via a duality in linear programming. This approach derives excellent results with respect to certain objective functions. The nonlinear programming method provides a global view of the whole problem. The Lagrangian method is a useful approach for performance driven problems. Finally, we depict a clustering method for the partitioning.
FIGURE 15 A loop model of multiple pin net where modules are placed on an x axis.
FIGURE 16 A flow model with respect to net $e_u$.
In most cases, we illustrate the method in question using two-way partitioning as the target problem. However, many methods can be extended to other problems or different objective functions. For example, we can apply group migration to multiway [98, 99] or multiple level partitioning problems [68, 67] with modification to the cost of the moves. Furthermore, some methods may be combined to solve a problem. For example, we can use clustering to reduce the size of an input circuit and then use group migration to find a partition of the reduced circuit with much greater efficiency [24, 59]. In fact, this strategy derives the best results in terms of CPU time and cut count in recent benchmarks [2].
5.1. Branch and Bound Method
The branch and bound method is an exhaustive search technique that may be effectively applied to the min-cut problem with size constraints for small cases. In the branch and bound process, the modules are first ordered in a sequence. For each module, we try placing it on either side of the cut. The process can be represented by a complete binary tree with $|V|$ levels. The root of the tree is the first module in the sequence. The nodes in the kth level of the tree correspond to the kth module in the sequence. The two branches at each node represent the two trials where the kth module is placed on each of the two different sides. A path in the tree from the root to a leaf corresponds to one assignment for the partition.
We use a depth first search approach to traverse the binary tree. We prune the search space according to the size constraint and a partial cut count. In the binary tree, a node at level k along with the path from the root to the node represents a partition assignment of the first k modules. Let $V_1$ and $V_2$ be the two vertex sets of the partition of the first k modules. If $S(V_i) > S_u$ for $i = 1$ or 2, the size constraint is violated, and there is no need to proceed. Thus, we prune the branches below.
We also use a partial cut count to prune the binary tree. The cut of the partial partition is expressed as $E(V_1, V_2) = \{e_i \mid |e_i \cap V_1| > 0 \text{ and } |e_i \cap V_2| > 0\}$. The partial cut count is $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$. If the partial cut count $C(V_1, V_2)$ is larger than the cut count of a known solution, the partition results below this node are going to be worse than the existing solution. We prune the branches of such a node.
Complexity of the Method Suppose the circuit has unit size $s_i = 1$ on each module and the constraint requires an even size $S_l = S_u = |V|/2$ (assuming that $|V|$ is even). Applying Stirling's approximation [63], we have the number of possible partitions:
$$\frac{|V|!}{((|V|/2)!)^2} \approx 2^{|V|} \sqrt{\frac{2}{\pi |V|}}. \tag{29}$$
Although the number of combinations is huge, we have found that the application to small circuits is practical. We improve the efficiency of the pruning by ordering the modules according to their degrees, i.e., the number of nets connecting to the modules, in descending order. With an elegant implementation, we can find optimal solutions when the number of modules is small, e.g., $|V| \leq 60$.
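The search described above can be sketched compactly. This is a minimal illustration for unit-size modules, not a tuned implementation: the partial cut count is recomputed from scratch at each node, and branches are pruned when the partial cut is greater than or equal to the best known cut (slightly stronger than the "larger than" test in the text, since ties cannot improve the result). The function name and data layout are assumptions.

```python
def branch_and_bound(modules, nets, weights, size_upper):
    """Depth first search for a minimum cut two-way partition of unit-size modules."""
    degree = {v: sum(1 for net in nets if v in net) for v in modules}
    order = sorted(modules, key=lambda v: -degree[v])  # high degree first
    best = {"cut": float("inf"), "sides": None}
    sides = {}

    def partial_cut():
        # A net is counted once it has assigned pins on both sides.
        total = 0
        for net, c in zip(nets, weights):
            if len({sides[v] for v in net if v in sides}) == 2:
                total += c
        return total

    def search(k, count):
        if max(count) > size_upper:      # size constraint violated
            return
        cut = partial_cut()
        if cut >= best["cut"]:           # cannot beat the known solution
            return
        if k == len(order):
            best["cut"], best["sides"] = cut, dict(sides)
            return
        v = order[k]
        for s in (0, 1):                 # try both sides for the kth module
            sides[v] = s
            count[s] += 1
            search(k + 1, count)
            count[s] -= 1
            del sides[v]

    search(0, [0, 0])
    return best["cut"], best["sides"]


nets = [("a", "b"), ("c", "d"), ("b", "c")]
cut, sides = branch_and_bound(["a", "b", "c", "d"], nets, [1, 1, 1], size_upper=2)
```

For this chain of four modules with an even size constraint, the search returns the split between the two middle modules.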
5.2. Dynamic Programming for a Serial and Parallel Graph
For the special case where the circuit can be represented by a serial and parallel graph of unit module size, we can find a minimum two way partition $(V_1, V_2)$ with size constraints in polynomial time. In this section, we first describe the serial and parallel graph. We then depict a dynamic programming algorithm that solves the partitioning problem on this class of graphs. We assume that all modules are of unit size, i.e., $s_i = 1$.
A serial and parallel graph can be constructed from smaller serial and parallel graphs by a serial or parallel process. Each serial and parallel graph has a source module $v_s$ and a sink module $v_t$. A graph $G(V, E)$ with two modules, $V = \{v_s, v_t\}$, and one edge, $E = \{e\}$, $e = \{v_s, v_t\}$, is a basic serial and parallel graph. A serial and parallel graph is constructed from the basic graph by a series of serial and parallel processes.
Serial Process Given two serial and parallel graphs, $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, we construct a serial and parallel graph $G(V, E)$ by merging the sink module $v_{t1}$ of $G_1$ and the source module $v_{s2}$ of $G_2$ (Fig. 17(a)). The source module $v_{s1}$ of graph $G_1$ becomes the source module of graph $G$, i.e., $v_s = v_{s1}$. The sink module $v_{t2}$ of graph $G_2$ becomes the sink module of graph $G$, i.e., $v_t = v_{t2}$.
Parallel Process Given two serial and parallel graphs, $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, we construct a serial and parallel graph $G(V, E)$ by merging the source module $v_{s1}$ of $G_1$ and the source module $v_{s2}$ of $G_2$ and by merging the sink module $v_{t1}$ of $G_1$ and the sink module $v_{t2}$ of $G_2$ (Fig. 17(b)). The merged source module and merged sink module become the source module $v_s$ and the sink module $v_t$ of graph $G$, respectively.
Dynamic Programming The dynamic programming algorithm performs a bottom up process according to the construction of the serial and parallel graph. It starts from the basic serial and parallel graph. For each graph $G(V, E)$, we derive two tables.
a(i, j): the minimum cut count with i modules on the left hand side and j modules on the right hand side under the condition that source module $v_s$ is on the left hand side and sink module $v_t$ is on the right hand side.
b(i, j): the minimum cut count with i modules on the left hand side and j modules on the right hand side under the condition that both source module $v_s$ and sink module $v_t$ are on the left hand side.
Let graph $G(V, E)$ be constructed from $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ by one of the serial and parallel processes. Let $a_1, b_1$ be the tables of graph $G_1$ and $a_2, b_2$ be the tables of graph $G_2$. We construct the tables $a, b$ of graph $G(V, E)$ as follows.
Table Formulas for Parallel Process
$$a(i, j) = \min_{k + m = |V_2|} a_1(i + 1 - k,\ j + 1 - m) + a_2(k, m), \quad \forall\, i + j = |V|. \tag{30}$$
$$b(i, j) = \min_{k + m = |V_2|} b_1(i + 2 - k,\ j - m) + b_2(k, m), \quad \forall\, i + j = |V|. \tag{31}$$
For table a(i, j), we try all combinations of tables $a_1$ and $a_2$ with the constraint that the number of modules on the left hand side is i and the number of modules on the right hand side is j. Note that the extra addition of 1 in the index is used to compensate for the merging of the two source modules or the two sink modules. For table b(i, j), we try all combinations of tables $b_1$ and $b_2$ with the same size constraint.
FIGURE 17 Construction of serial and parallel graphs.
Table Formula for Serial Process
$$a(i, j) = \min\Big( \min_{k + m = |V_2|} a_1(i - k,\ j + 1 - m) + b_2(k, m),\ \min_{k + m = |V_2|} b_1(i + 1 - k,\ j - m) + a_2(k, m) \Big), \quad \forall\, i + j = |V|. \tag{32}$$
$$b(i, j) = \min\Big( \min_{k + m = |V_2|} a_1(i - k,\ j + 1 - m) + a_2(m, k),\ \min_{k + m = |V_2|} b_1(i + 1 - k,\ j - m) + b_2(k, m) \Big), \quad \forall\, i + j = |V|. \tag{33}$$
For table a(i, j), we try all combinations of tables $a_1$ and $b_2$ and all combinations of tables $b_1$ and $a_2$. For the combinations of tables $a_1$ and $b_2$, the merged module (obtained by merging $v_{t1}$ and $v_{s2}$) is on the right hand side. For the combinations of tables $b_1$ and $a_2$, the merged module is on the left hand side. For table b(i, j), we try all combinations of tables $a_1$ and $a_2$ and all combinations of tables $b_1$ and $b_2$. For the combinations of tables $a_1$ and $a_2$, the merged module is on the right hand side. In terms of $G_2$, its source module $v_{s2}$ is on the right hand side and its sink module $v_{t2}$ is on the left hand side. Thus, the indices of table $a_2$ are reversed, i.e., $a_2(m, k)$ instead of $a_2(k, m)$. For the combinations of tables $b_1$ and $b_2$, the merged module is on the left hand side.
5.3. Group Migration Algorithms
The group migration algorithm was first proposed by Kernighan and Lin [60] in 1970. Since then, many variations [15, 26, 27, 33, 39, 45, 49, 84, 97–99, 108, 111, 116] have been reported to improve the efficiency and effectiveness of the method. Today, it is still a popular method in practice.
The probability of finding the optimum solution in a single trial drops exponentially as the size of the circuit increases [60]. Using the original version, Kernighan and Lin showed that the probability of obtaining an optimal solution is a function of the problem size, $p(|V|) = 2^{-n/30}$. In other words, if the circuit size is large, then the heuristic Kernighan–Lin algorithm is unlikely to jump out of local minima, and so the optimum solution will not be found. The progress of the method since then has definitely pushed the envelope further.
In this section, we concentrate on two-way min-cut with size constraints. The method is flexible and can be extended to other partitioning problems with modifications of the moves and the cost function.
The algorithm performs a series of passes. At the beginning of a pass, each module is labeled unlocked. Once a module is shifted, it becomes locked in this pass. The group migration algorithm iteratively interchanges a pair of unlocked modules or shifts a single module to a different side with the largest reduction (gain) of the cost function. This continues until all modules are locked. The lowest cost along the whole sequence of moves is recorded. The group migration takes the subsequence that produces the lowest cut count and undoes the moves after the point of the lowest cost. This partitioning result is then used as the initial solution for the next pass. The algorithm terminates when a pass fails to find a result with a cost lower than the cost of the previous pass.
Group Migration Algorithm Input: Hypergraph H(V, E) and an initial partition. Cost function and size constraints.
1. One pass of moves.
1.1. Choose and perform the best move.
1.2. Lock the moved modules.
1.3. Update the gain of unlocked modules.
1.4. Repeat Steps 1.1–1.3 until all modules are locked or no move is feasible.
1.5. Find and execute the best subsequence of the moves. Undo the rest of the sequence.
2. Use the previous result as an initial partition.
3. Repeat the pass (Steps 1 and 2) until there is no more improvement.
Figure 18 illustrates the cost of a sequence of moves. This algorithm escapes from local optima by executing a whole sequence of moves, even when a single move may produce a negative gain.
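The pass structure can be sketched as follows for single-module shifts with unit module sizes. This is a simplified illustration: gains are recomputed by brute force rather than updated incrementally, and keeping a copy of the best partition seen in the pass plays the role of undoing the moves after the lowest-cost point. All function names are hypothetical.

```python
def cut_count(nets, weights, side):
    """Total weight of the nets with pins on both sides."""
    return sum(c for net, c in zip(nets, weights)
               if len({side[v] for v in net}) == 2)


def one_pass(modules, nets, weights, side, size_upper):
    """Steps 1.1 to 1.5: shift and lock modules, then keep the best prefix."""
    side = dict(side)
    locked = set()
    best_cut, best_side = cut_count(nets, weights, side), dict(side)
    while True:
        current = cut_count(nets, weights, side)
        best_move = None
        for v in modules:
            if v in locked:
                continue
            dst = 1 - side[v]
            if sum(1 for u in modules if side[u] == dst) + 1 > size_upper:
                continue                      # shift would violate the size bound
            side[v] = dst                     # tentative shift to measure the gain
            gain = current - cut_count(nets, weights, side)
            side[v] = 1 - dst
            if best_move is None or gain > best_move[0]:
                best_move = (gain, v)
        if best_move is None:                 # all locked or no feasible move
            return best_cut, best_side
        gain, v = best_move
        side[v] = 1 - side[v]                 # perform the move, even if gain < 0
        locked.add(v)
        if current - gain < best_cut:
            best_cut, best_side = current - gain, dict(side)


def group_migration(modules, nets, weights, side, size_upper):
    """Repeat passes until a pass fails to lower the cut count."""
    cut = cut_count(nets, weights, side)
    while True:
        new_cut, new_side = one_pass(modules, nets, weights, side, size_upper)
        if new_cut >= cut:
            return cut, side
        cut, side = new_cut, new_side


nets = [("a", "b"), ("c", "d"), ("b", "c")]
start = {"a": 0, "b": 1, "c": 0, "d": 1}
final_cut, final_side = group_migration(list("abcd"), nets, [1, 1, 1], start, 3)
```

The example starts from a partition that cuts all three nets and ends at a cut count of one, which is the minimum for a connected chain.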
In the following, we discuss variations of several
parts in the process: basic moves (Step 1.1), data
structure, gains (Steps 1.1 and 1.3). At the end of
this subsection, we introduce a net based move and
a simulated annealing approach.
5.3.1. Basic Moves
Basic moves cover the shifting of a single module and the swapping of a pair of modules. A swap can be regarded as two consecutive shifts, but with consideration of the mutual effect between the two shifts.
(i) Module Shifting For each unlocked module, we check its gain: the cost function reduction obtained by shifting the module to a different side, assuming that the rest of the modules are fixed. To select the best module to shift, we order the modules on each side according to their shift gains. If the size constraints are violated after the shift, the move is not feasible. We search for the best feasible module to move [40].
(ii) Pairwise Swapping We exchange two modules in two vertex sets of the partition. Note that the gain of the swap is not equal to the sum of the gains of two shifts. The mutual effect between the two modules needs to be included when we derive the gain. Thus, the best pair may not be the two modules on the top of the two sides. The search of all pairs takes $O(|V_1||V_2|)$ operations. In practice, we order modules according to their shift gain. The search for the best pair is limited to the top k modules on each side, e.g., k=3. Thus, the complexity is actually $O(k^2)$.
Pairwise swapping is a natural choice when the size constraint is tight. When no single shift is feasible, we can use swapping to balance the size of the partition.
5.3.2. Data Structure
The choice of data structure strongly depends on the cost functions, gains, and the characteristics of VLSI circuitry. A sorting structure such as a heap or an AVL tree is a natural choice for finding the top modules. However, when the gain takes only a limited number of distinct values, an array structure can simplify the coding and reduce the complexity.
(i) Heap or AVL Tree We can use a heap or AVL tree to sort the modules according to their shift gain. Each side of the partition keeps a heap. The top of the heap is the module of the maximum gain. Sorting the modules takes $O(|V| \log |V|)$ operations.
(ii) Array (Bucket) of Linked Lists Figure 19 illustrates a bucket list data structure. The gain is transformed to the index of the bucket [40]. Modules of the same gain are stored in the same bucket by a linked list. A bucket is an effective data structure when the objective function is the cut count. The gain of cut count is limited by the maximum degree of the modules, i.e., $\deg_{max} = \max_{v_i \in V} \sum_{e \in E(\{v_i\})} c_e$. Thus, the dimension of the bucket is set to be $2\deg_{max}$.
For VLSI applications, the degree of modules is much smaller than the number of modules. Thus, the dimension of the bucket is small. It is very efficient to search and revise the module order in the bucket structure. In fact, it is proven that, using the bucket structure and cut count as the objective function, each pass takes time linear in the total number of pins [40].
FIGURE 18 Cost of a sequence of moves and subsequence selection.
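A bucket structure of this kind can be sketched as below. The sketch assumes integer gains in the range $-\deg_{max}$ to $\deg_{max}$ and therefore allocates $2\deg_{max} + 1$ buckets so that gain zero has its own slot; it also uses Python sets in place of linked lists. The class and method names are illustrative.

```python
class GainBuckets:
    """Array of buckets indexed by gain, for gains in [-deg_max, deg_max]."""

    def __init__(self, deg_max):
        self.deg_max = deg_max
        self.buckets = [set() for _ in range(2 * deg_max + 1)]
        self.gain_of = {}

    def insert(self, module, gain):
        self.buckets[gain + self.deg_max].add(module)  # gain -> bucket index
        self.gain_of[module] = gain

    def remove(self, module):
        gain = self.gain_of.pop(module)
        self.buckets[gain + self.deg_max].discard(module)

    def update(self, module, new_gain):
        """Move a module to the bucket of its revised gain."""
        self.remove(module)
        self.insert(module, new_gain)

    def pop_max(self):
        """Remove and return (module, gain) for a module of maximum gain."""
        for index in range(len(self.buckets) - 1, -1, -1):
            if self.buckets[index]:
                module = self.buckets[index].pop()
                del self.gain_of[module]
                return module, index - self.deg_max
        return None


buckets = GainBuckets(deg_max=3)
buckets.insert("a", 1)
buckets.insert("b", -2)
buckets.insert("c", 3)
buckets.update("a", -3)
popped = [buckets.pop_max() for _ in range(4)]
```

Insertion, removal, and gain updates take constant time; a production version would also track the highest nonempty bucket instead of scanning from the top on every `pop_max`.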
5.3.3. Gains
In this subsection, we use cut count as the objective function. The extension to other cost functions is possible. However, we may lose efficiency.
(i) Shift Gain We use the shift model for multiple pin nets. Given a module $v_i$, we check the set $E(\{v_i\})$ of nets connecting to this module. The contribution of each net $e \in E(\{v_i\})$ when module $v_i$ is shifted is the gain $g_e(v_i)$ of the net with respect to module $v_i$. The gain $g(v_i)$ of module $v_i$ is the total gain of all its adjacent nets, i.e., $g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i)$.
(ii) Swap Gain The swap gain is the sum of the gains of two modules $v_i$ and $v_j$, deducting the effect on common nets, i.e., $g(v_i) + g(v_j) - \sum_{e \in E(\{v_i\}) \cap E(\{v_j\})} (g_e(v_i) + g_e(v_j))$.
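The shift and swap gains for the cut count objective can be sketched as follows, assuming the two swapped modules lie on different sides. The reading of the swap formula used here (subtracting the shift contributions of the common nets) is a reconstruction; the usage example checks it against a direct recount of the cut. Function names are hypothetical.

```python
def net_gain(net, c, v, side):
    """g_e(v): cut count reduction contributed by net e if only v shifts."""
    same = sum(1 for u in net if side[u] == side[v])
    other = len(net) - same
    if same == 1 and other > 0:
        return c       # v is the last pin on its side: the net becomes uncut
    if other == 0 and len(net) > 1:
        return -c      # the net was uncut and becomes cut
    return 0


def shift_gain(v, nets, weights, side):
    return sum(net_gain(net, c, v, side)
               for net, c in zip(nets, weights) if v in net)


def swap_gain(vi, vj, nets, weights, side):
    """Sum of the two shift gains minus the contributions of common nets."""
    total = shift_gain(vi, nets, weights, side) + shift_gain(vj, nets, weights, side)
    for net, c in zip(nets, weights):
        if vi in net and vj in net:
            total -= net_gain(net, c, vi, side) + net_gain(net, c, vj, side)
    return total


nets = [("a", "b"), ("a", "c"), ("b", "c", "d"), ("a", "d")]
weights = [1, 1, 1, 1]
side = {"a": 0, "b": 1, "c": 0, "d": 1}
predicted = swap_gain("a", "b", nets, weights, side)
```

A net shared by the two swapped modules has a pin on each side both before and after the swap, so its net contribution is zero; that is what the correction term removes.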
(iii) Weights of Multipin Nets The sequence of the moves depends strongly on the gain calculation. For a circuit of 1,000,000 modules, suppose the degree of most modules is less than 100 and each net is of unit weight. We have roughly 1,000,000 modules/200 gain levels = 5,000 modules per gain level. To differentiate these 5,000 modules, we have to adjust the weight of multiple pin nets.
(iii) (a) Levels with Priority The first level gain is identical to the shift gain of cut count. The second level gain is equal to the number of nets that have one more pin on the same side. Thus, the kth level gain is equal to the number of nets that have k more pins on the same side [65]. The pins on the other side will increase by one after the module is shifted. Thus, the negative gain of level k is contributed by the nets with $k - 1$ pins on the other side.
Let us assume that module $v_i$ is in vertex set $V_1$ to simplify the notation. For each net $e_j \in E(\{v_i\})$, we denote by $k_j = |e_j \cap V_1|$ the number of pins in $V_1$. Let us define $E(+, i, k)$ to be the set of nets $e_j \in E(\{v_i\})$ with $k_j = k + 1$ pins in $V_1$ (the extra one is used to count module $v_i$ itself) and nonzero pins in $V_2$, i.e., $|e_j| > k_j$, and $E(-, i, k)$ to be the set of nets $e_j \in E(\{v_i\})$ with no other pins in $V_1$ and $k - 1$ pins in $V_2$, i.e., $|e_j| = k$ and $k_j = 1$. Then, the kth level gain of module $v_i$, $g_i(k)$, is the weight difference of the two sets, $E(+, i, k)$ and $E(-, i, k)$:
$$g_i(k) = \sum_{e \in E(+, i, k)} c_e - \sum_{e \in E(-, i, k)} c_e \tag{34}$$
$$E(+, i, k) = \{ e_j \mid e_j \in E(\{v_i\}),\ k_j = k + 1,\ |e_j| > k_j \} \tag{35}$$
$$E(-, i, k) = \{ e_j \mid e_j \in E(\{v_i\}),\ k_j = 1,\ |e_j| = k \} \tag{36}$$
FIGURE 19 Bucket list.
We compare the modules with a priority on the lower level gain. In other words, we compare the first level first. If the modules are equal at the first level gain, we then compare the second level and so on. In practice, we limit the number of levels by a threshold, e.g., $l \leq 3$.
(iii) (b) Probabilistic Gain In the probabilistic gain model [37], each module $v_i$ is assigned a weight $p(v_i)$. The weight $p(v_i)$ is a function of the gain $g(v_i)$ of module $v_i$ that reflects the belief level (potential) that the shift of module $v_i$ will be executed at the end of the pass. Thus, if module $v_i$ is unlocked,
$$p(v_i) = f(g(v_i)). \tag{37}$$
Otherwise, $p(v_i) = 0$. Figure 20 illustrates function $f$, which increases monotonically. The slope between $g_0$ and $g_{up}$ amplifies the difference of gains. The function is clamped at the two ends, $p_{max}$ and $p_{min}$ ($0 \leq p_{min} < p_{max} \leq 1$), which represent the maximum potential that the module will shift or stay.
For each net $e \in E(\{v_i\})$, its contribution $g_e(v_i)$ to the gain of module $v_i$ is the tendency that the whole net will shift with module $v_i$ to the other side. To simplify the notation, let us assume that module $v_i$ is in $V_1$. Thus, we have the following expression:
$$g_e(v_i) = c_e \left( \prod_{j \neq i,\ v_j \in e \cap V_1} p(v_j) - \prod_{v_j \in e \cap V_2} p(v_j) \right), \tag{38}$$
where $\prod_{v_j \in S} p(v_j) = 1$ if $S$ is an empty set. The first term $\prod_{j \neq i,\ v_j \in e \cap V_1} p(v_j)$ in the parentheses is the potential that all the pins will shift with module $v_i$ to $V_2$. Hence, $c_e \prod_{j \neq i,\ v_j \in e \cap V_1} p(v_j)$ is the expected gain if module $v_i$ is shifted. The second term $\prod_{v_j \in e \cap V_2} p(v_j)$ is the potential that the pins in $V_2$ will shift to $V_1$. Thus, $c_e \prod_{v_j \in e \cap V_2} p(v_j)$ is the expected loss if module $v_i$ is shifted.
The gain of a module $v_i$ is the total gain of the adjacent nets with respect to this module, i.e.,
$$g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i). \tag{39}$$
Net gain $g_e(v_i)$ and module potential $p(v_i)$ are mutually dependent. We derive the values via iterations. Initially, we use the plain shift gain (by cut count) to derive the potential $p(v_i) = f(g(v_i))$. From these initial potentials, we derive the probabilistic net gain. The net gain is then used to derive the module gain. In practice, we stop after a limited number of cycles, e.g., two iterations [37]. Note that there is no guarantee that the iteration will converge.
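The net contribution in Eq. (38) can be sketched as below. The piecewise-linear shape of $f$ between two gain thresholds is an assumption about Figure 20 (the text only states that $f$ is monotone and clamped), and the names `potential`, `prob_net_gain`, `g_lo`, and `g_up` are illustrative.

```python
from math import prod


def potential(gain, g_lo, g_up, p_min, p_max):
    """A monotone f: linear between g_lo and g_up, clamped to [p_min, p_max]."""
    if gain <= g_lo:
        return p_min
    if gain >= g_up:
        return p_max
    return p_min + (p_max - p_min) * (gain - g_lo) / (g_up - g_lo)


def prob_net_gain(net, c, v, side, p):
    """g_e(v) = c_e * (product over same-side pins except v
                       - product over opposite-side pins)."""
    same = [p[u] for u in net if u != v and side[u] == side[v]]
    other = [p[u] for u in net if side[u] != side[v]]
    return c * (prod(same) - prod(other))  # prod of an empty list is 1


side = {"a": 0, "b": 0, "c": 1}
p = {"a": 0.5, "b": 0.8, "c": 0.25}
g = prob_net_gain(("a", "b", "c"), 1.0, "a", side, p)
mid = potential(0.0, -1.0, 1.0, 0.1, 0.9)
```

A locked module would be given potential zero before calling `prob_net_gain`, which removes any expected gain from nets that still have a locked pin on the same side.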
After each move, the associated module potential and probabilistic net gains are updated and the plain cut count is recorded. The exact cut count is used when we select the subsequence of moves to execute.
Benchmarks released by ACM/SIGDA show that the probabilistic gain model produces excellent partitioning results; it outperforms the other gain models by wide margins.
5.3.4. Net-based Move
The net based process [115, 32] is similar to the module based approach except that all operations are based on the concept of the critical and complementary critical sets. The main differences are: (1) Instead of a single module, each move now shifts one critical or complementary critical set, depending on the type of objective function. For convenience, we say a move is initiated by a net $e_u$ if this move is composed of shifting the critical or complementary critical set associated with $e_u$. (2) The locking mechanism is operated on a net, that is, if the critical or complementary critical set of a net has been moved, then all the moves initiated by this net will be prohibited thereafter.
FIGURE 20 Function of probabilistic gain.
Given a net $e_u$ and a vertex set $V_b$, let us define the critical set of net $e_u$ with respect to set $V_b$ as
$$S_{ub} = e_u \cap V_b, \tag{40}$$
and the complementary critical set of $e_u$ with respect to set $V_b$ as
$$S_{u\bar{b}} = e_u \cap \bar{V}_b. \tag{41}$$
For a move associated with a net $e_u$, we can either place the critical set $S_{ub}$ into a partition other than $V_b$, or the complementary critical set $S_{u\bar{b}}$ into the partition $V_b$. The gain of each move is then computed by evaluating the change of the cost due to the move of the critical or complementary critical set.
Usage of Basic Module Moves Although the net-based move model provides a different process to improve the current partition, it is more expensive than the module-based move model because more modules are involved in each move.
We can mimic the net based move by adding weights to the connectivity of desired nets [38]. The basic move is still based on the modules. However, after module $v_i$ is moved, we add more weight to the nets connecting to $v_i$, i.e., $E(\{v_i\})$. These extra weights encourage the adjacent modules to go along with module $v_i$ and thus achieve the effect of a net based move. Empirical study finds improvement in the partitioning results.
5.3.5. Simulated Annealing Approach
For simulated annealing [20, 81, 62, 56], we can adopt the basic moves such as module shifting and pairwise swapping. There is no need for a locking mechanism. To allow a larger search space, we incorporate the size constraints into the objective function, e.g.,
$$C(V_1, V_2) + c\,(S(V_1) - S(V_2))^2, \tag{42}$$
where $c$ is a coefficient. We can adjust it according to the annealing temperature. As the temperature drops, we gradually increase $c$ to enforce the size balance.
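The penalized cost of Eq. (42) is straightforward to code. The acceptance test shown with it is the standard Metropolis rule used in simulated annealing; it is not spelled out in the text and is included here only as an assumed companion. Function names are illustrative.

```python
import math
import random


def annealing_cost(cut, size1, size2, c):
    """Cut count plus a quadratic penalty on the size imbalance, as in Eq. (42)."""
    return cut + c * (size1 - size2) ** 2


def accept(delta, temperature, rng=random):
    """Metropolis rule: accept improvements, accept uphill moves with prob e^(-delta/T)."""
    return delta <= 0 or rng.random() < math.exp(-delta / temperature)


balanced = annealing_cost(cut=3, size1=4, size2=4, c=0.5)
skewed = annealing_cost(cut=3, size1=5, size2=3, c=0.5)
```

With a small coefficient early in the schedule, unbalanced partitions remain reachable; raising the coefficient as the temperature drops makes the imbalance term dominate.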
5.4. Flow Approaches
In this section, we assume that the circuit can be represented by a graph $G(V, E)$ with unit module size, i.e., $s_i = 1$, and that all nets are two pin nets. The flow approach can be extended to multiple pin nets using a flow model.
We first go through maximum flow minimum cut [1, 73] to introduce the duality [30] and the concept of shadow price. The derivation is then extended to a weighted cluster ratio cut and a replication cut. Finally, we introduce heuristic algorithms that accelerate the flow calculation. The flow approach can derive excellent results. Furthermore, exploiting its duality formulation, we can derive a tight bound on the optimal solutions.
5.4.1. Maximum Flow Minimum Cut
In the maximum flow minimum cut formulation, the flow injects into module $v_s$ and drains from module $v_t$. The flow is conserved at all other modules. The capacity of net $e_{ij}$ is equal to its connectivity, $c_{ij}$. We set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$. The notation $x_{ij}$ denotes the amount of flow from module $v_i$ to module $v_j$, and $x_{ji}$ denotes the amount of flow from module $v_j$ to module $v_i$ on net $e_{ij}$. The objective is to maximize the flow injection $f$ into $v_s$.
$$\text{Obj}: \max f \tag{43}$$
subject to the constraints,
$$x_{ij} + x_{ji} \leq c_{ij}, \quad \forall\, 1 \leq i, j \leq |V| \tag{44}$$
$$\sum_{j=1}^{|V|} x_{js} - \sum_{j=1}^{|V|} x_{sj} + f = 0 \tag{45}$$
$$\sum_{j=1}^{|V|} x_{jt} - \sum_{j=1}^{|V|} x_{tj} - f = 0 \tag{46}$$
$$\sum_{j=1}^{|V|} x_{ij} - \sum_{j=1}^{|V|} x_{ji} = 0, \quad \forall\, 1 \leq i \leq |V|,\ i \neq s, t \tag{47}$$
$$x_{ij} \geq 0, \quad \forall\, 1 \leq i, j \leq |V|. \tag{48}$$
To derive the duality, we use shadow prices: a bidirectional distance $d_{ij}$ for each net $e_{ij}$ (Eq. (44)) and a potential $\lambda_i$ for each module $v_i$ (Eqs. (45)–(47)). The dual problem can be expressed as follows [30].
$$\text{Obj}: \min \sum_{e_{ij} \in E} c_{ij} d_{ij} \tag{49}$$
subject to
$$d_{ij} \geq |\lambda_i - \lambda_j|, \quad \forall\, 1 \leq i, j \leq |V|, \tag{50}$$
$$\lambda_t - \lambda_s = 1. \tag{51}$$
Figure 21 illustrates the formulation. As we increase the flow, certain nets are going to saturate, i.e., the two sides of inequality (44) become equal. Once the saturated nets become a bottleneck of the flow, the set of nets forms a cut $E(V_1, V_2)$ with $v_s \in V_1$ and $v_t \in V_2$. In the dual, the potential of modules in $V_2$ increases to one, and the potential of modules in $V_1$ remains zero, i.e., $\lambda_i = 1, \forall v_i \in V_2$ and $\lambda_i = 0, \forall v_i \in V_1$. The distance of nets in the cut is one, while the distance of nets outside the cut is zero, i.e., $d_{ij} = 1, \forall e_{ij} \in E(V_1, V_2)$ and $d_{ij} = 0, \forall e_{ij} \notin E(V_1, V_2)$.
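The primal and dual views can be illustrated on a small graph. The sketch below uses the standard shortest augmenting path (Edmonds-Karp) method, which is one common way to solve the maximum flow problem and is not prescribed by the text; modules are assumed to be numbered from 0, and the function name is hypothetical. After the last search, the modules still reachable from $v_s$ form $V_1$, and the nets leaving $V_1$ are the saturated cut.

```python
from collections import deque


def max_flow_min_cut(n, nets, s, t):
    """nets: list of (i, j, c_ij) undirected two pin nets on modules 0..n-1."""
    cap = [[0] * n for _ in range(n)]
    for i, j, c in nets:
        cap[i][j] += c            # an undirected net can carry flow either way
        cap[j][i] += c
    flow = 0
    while True:
        parent = [-1] * n         # breadth first search in the residual graph
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break                 # no augmenting path: the flow is maximum
        push, v = float("inf"), t
        while v != s:             # bottleneck capacity along the path
            push = min(push, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:             # augment along the path
            cap[parent[v]][v] -= push
            cap[v][parent[v]] += push
            v = parent[v]
        flow += push
    v1 = {v for v in range(n) if parent[v] != -1}   # side containing v_s
    cut = [(i, j, c) for i, j, c in nets if (i in v1) != (j in v1)]
    return flow, v1, cut


nets = [(0, 1, 3), (0, 2, 2), (1, 3, 2), (2, 3, 3), (1, 2, 1)]
flow, v1, cut = max_flow_min_cut(4, nets, s=0, t=3)
```

Setting $\lambda_i = 0$ on `v1`, $\lambda_i = 1$ elsewhere, and $d_{ij} = 1$ on the returned cut nets gives a dual solution whose objective equals the flow value.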
5.4.2. The Weighted Cluster Ratio Metric and a Uniform Multi-commodity Flow Problem
In a uniform multi-commodity flow problem [74, 75], the demand of flow between each pair of modules is equal to an identical value $f$. As we keep increasing $f$, some of the nets become saturated. These saturated nets form a bottleneck of communication and thus prescribe a potential clustering of the communication system [71].
We simplify the notation by assuming a graph model $G(V, E)$. From each module $v_p$, we inject flow $f/2$ to each of the other modules. Summing up the flow in two directions, the flow between each pair of modules is $f$. We define the flow originating from module $v_p$ as commodity $p$. Let $x^{(p)}_{ij}$ be the flow for commodity $p$ on net $e_{ij}$. The objective is to maximize $f$:
$$\text{Obj}: \max f \tag{52}$$
subject to the flow demand from module $v_p$ to the other modules $v_i$,
$$\sum_{j=1}^{|V|} x^{(p)}_{ij} - \sum_{j=1}^{|V|} x^{(p)}_{ji} = \begin{cases} -f/2 & \text{if } i \neq p, \text{ and } 1 \leq i, p \leq |V|, \\ (|V| - 1) f/2 & \text{if } i = p, \text{ and } 1 \leq i, p \leq |V|, \end{cases} \tag{53}$$
and the net capacity constraint,
$$\sum_{p=1}^{|V|} x^{(p)}_{ij} + \sum_{p=1}^{|V|} x^{(p)}_{ji} \leq c_{ij}, \quad 1 \leq i, j \leq |V|. \tag{54}$$
FIGURE 21 Illustration of maximum flow minimum cut formulation.
We transform the above linear programming problem to its dual expression by assigning dual variables $\lambda^{(p)}_i$ to module $v_i$ with respect to commodity $p$ (Eq. (53)) and distance $d_{ij}$ to net $e_{ij}$ (Eq. (54)). Then we have:
$$\text{Obj}: \min \sum_{e_{ij} \in E} c_{ij} d_{ij} \tag{55}$$
subject to
$$d_{ij} \geq \lambda^{(p)}_i - \lambda^{(p)}_j, \quad 1 \leq i, j, p \leq |V| \tag{56}$$
$$\frac{1}{2} \sum_{p=1}^{|V|} \sum_{i=1,\, i \neq p}^{|V|} \left( \lambda^{(p)}_i - \lambda^{(p)}_p \right) \geq 1 \tag{57}$$
The Properties of Shadow Prices The shadow price $d_{ij}$ can be viewed as bidirectional, i.e., $d_{ij} = d_{ji}$. It represents the distance of net $e_{ij}$, which corresponds to the cost to transmit flow through $e_{ij}$. Variable $\lambda^{(p)}_i$ is the potential of module $v_i$ with respect to commodity $p$.
From constraints (56), (57), we can derive two properties for distance function $d_{ij}$ and potential $\lambda^{(p)}_i$ [71].
Property I: Triangular Inequality The distance metric $d_{ij}$ satisfies the triangular inequality:
\[ d_{ij} + d_{jk} \ge d_{ik}, \quad \forall v_i, v_j, v_k \in V \tag{58} \]
Property II: Potential Function The term $\lambda^{(p)}_i - \lambda^{(p)}_p$ in expression (56) is equal to the shortest distance between modules $v_i$ and $v_p$ based on net distances $d_{ij}$. In fact, from the triangular inequality, we obtain $\lambda^{(p)}_i - \lambda^{(p)}_p = d_{ip}$.
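We normalize the objective function (55) with the left hand side terms of inequality (57). The objective function can be expressed as:
\[ \text{Obj}: \min \frac{\sum_{e_{ij} \in E} c_{ij} d_{ij}}{(1/2) \sum_{p=1}^{|V|} \sum_{i=1, i \ne p}^{|V|} \left( \lambda^{(p)}_i - \lambda^{(p)}_p \right)} = \frac{\sum_{e_{ij} \in E} c_{ij} d_{ij}}{(1/2) \sum_{p=1}^{|V|} \sum_{i=1, i \ne p}^{|V|} d_{ip}} \tag{59} \]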
In the solution of linear programming problem (52)-(56), the nets with positive $d_{ij}$ values partition $V$ into vertex sets $V_1, V_2, \ldots, V_k$. More specifically, nets connecting modules in different sets, $V_i$, $V_j$, $i \ne j$, have the same distance $d_{ij}$ values (we use $d_{ij}$ to denote the distance between vertex sets $V_i$ and $V_j$ when this does not cause confusion), while nets connecting only modules in the same subgraph have zero distance, $d_{ij} = 0$ (Fig. 22). We can rewrite the denominator of the objective function and state the problem as follows.
Statement of Weighted Cluster Ratio Cut [103] Find the distance $d_{ij}$ and the number of partitions $k$ with an objective function of weighted cluster ratio:
\[ \min_{d_{ij}, k} W_C(V_1, V_2, \ldots, V_k) = \min_{d_{ij}, k} \frac{\sum_{i=j+1}^{k} \sum_{j=1}^{k-1} d_{ij}\, C(V_i, V_j)}{\sum_{i=j+1}^{k} \sum_{j=1}^{k-1} d_{ij}\, S(V_i) \times S(V_j)} \tag{60} \]
where distance $d_{ij}$ is subject to the property of triangular inequality.
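Expression (60) can be evaluated directly once a clustering and a set of inter-cluster distances are given. The following is a minimal sketch, assuming two-pin nets stored as `(u, v, c)` triples and unit module sizes unless a `size` map is supplied; the function names and data layout are illustrative and are not taken from [103].

```python
from itertools import combinations

def cut_weight(nets, part_a, part_b):
    """C(V_i, V_j): total connectivity of two-pin nets with one pin in each set."""
    return sum(c for (u, v, c) in nets
               if (u in part_a and v in part_b) or (u in part_b and v in part_a))

def weighted_cluster_ratio(nets, clusters, dist, size=None):
    """Evaluate expression (60) for fixed clusters and inter-cluster distances.

    clusters: list of sets of modules; dist[(i, j)]: distance between
    clusters i and j (either index order); size: optional module sizes.
    """
    size = size or {}
    s = [sum(size.get(v, 1) for v in cluster) for cluster in clusters]
    num = den = 0.0
    for j, i in combinations(range(len(clusters)), 2):  # every pair with j < i
        d = dist.get((i, j), dist.get((j, i)))
        num += d * cut_weight(nets, clusters[i], clusters[j])
        den += d * s[i] * s[j]
    return num / den

# Hypothetical four-module chain a-b-c-d split in the middle.
nets = [("a", "b", 2), ("b", "c", 1), ("c", "d", 2)]
clusters = [{"a", "b"}, {"c", "d"}]
ratio = weighted_cluster_ratio(nets, clusters, {(1, 0): 1.0})
```

With a single pair of clusters and unit distance, the value reduces to the ratio cut $C(V_1, V_2) / (S(V_1) \times S(V_2))$, here $1/4$.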
According to the mechanism of the duality, the
objective functions of the primal and dual
formulations are equal when the solution is
optimal [25].
THEOREM 5.1 For feasible solutions, we have the inequality $f \le W_C(V_1, V_2, \ldots, V_k)$. The equality holds when the solution is optimal, i.e., the maximum uniform multicommodity flow equals the
FIGURE 22 Distance between clusters.
minimum weighted cluster ratio of any cut,
\[ \max_{x_{ij}} f = \min_{d_{ij}, k} W_C(V_1, V_2, \ldots, V_k). \]
Expression (60), weighted cluster ratio [103], is similar to cluster ratio with a weighted metric $d_{ij}$. In general, the solution for the minimum weighted cluster ratio does not directly correspond to the partition of optimum cluster ratio. However, if distance $d_{ij}$ is a constant value between all pairs of vertex sets $V_i$ and $V_j$, then the weighted cluster ratio provides the solution for cluster ratio.
When the nets with positive distance $d_{ij}$ form a two-way partition, we can show that the partition defines the ratio cut. When the nets with positive distances form a $k$-way partition with $k \le 4$, we also find that there exists a two-way partition that again defines the ratio cut [28].
THEOREM 5.2 Let net set $D = \{e_{ij} \mid d_{ij} > 0\}$ define a cut that separates the circuit into $k$ disconnected subsets. If $k \le 4$, then there exists a ratio cut that is a subset of $D$.
5.4.3. A Replication Cut for Two-way Partitioning
We adopt the linear programming formulation of the network flow problem [1, 30], where each module is assigned a potential and a cut is represented by the difference of module potentials as shown in Figure 23. With respect to the directed cut $E(V_1 \to \bar{V}_1)$, we use $w_{ij}$ to denote the potential difference between the cut from module $v_i \in V_1$ to module $v_j \notin V_1$. The potential of each module $v_i$ is denoted by $p_i$. For modules $v_i$ in $V_1$, $p_i = 1$, and for modules $v_i$ in $\bar{V}_1$, $p_i = 0$. Thus all nets $e_{ij} \in E(V_1 \to \bar{V}_1)$ have $w_{ij} = 1$. The remaining nets have $w_{ij} = 0$.
With respect to the directed cut $E(V_2 \to \bar{V}_2)$, we use $u_{ji}$ with a reversed subscript $ji$ to denote the potential difference between the cut from module $v_i \in V_2$ to module $v_j \notin V_2$ (Fig. 23). The potential of each module $v_i$ is denoted by $q_i$. For modules $v_i$ in $\bar{V}_2$, $q_i = 1$, and for modules $v_i$ in $V_2$, $q_i = 0$. The potential difference $u_{ji}$ has a reverse direction with net $e_{ij}$ because we set the potential on the $\bar{V}_2$ side high and the potential on the $V_2$ side low. All nets $e_{ij} \in E(V_2 \to \bar{V}_2)$ have $u_{ji} = 1$. The remaining nets have $u_{ji} = 0$.
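Primal Linear Programming Formulation The problem is to minimize the total weight of crossing nets:
\[ \text{Obj}: \min \sum_{e_{ij} \in E} c_{ij} w_{ij} + \sum_{e_{ij} \in E} c_{ji} u_{ij} \tag{61} \]
subject to
\[ w_{ij} - p_i + p_j \ge 0 \quad \forall\, 1 \le i, j \le |V| \tag{62} \]
\[ u_{ij} - q_i + q_j \ge 0 \quad \forall\, 1 \le i, j \le |V| \tag{63} \]
\[ q_i - p_i \ge 0 \quad \forall v_i \in V,\ v_i \ne v_s, v_t \tag{64} \]
\[ p_s = 1 \tag{65} \]
\[ q_s = 1 \tag{66} \]
\[ p_t = 0 \tag{67} \]
\[ q_t = 0 \tag{68} \]
\[ w_{ij}, u_{ij} \ge 0 \quad \forall\, 1 \le i, j \le |V| \tag{69} \]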
To minimize objective function (61), the equality of constraint (62) holds, i.e., $w_{ij} = p_i - p_j$, if $p_i \ge p_j$; otherwise, $w_{ij} = 0$. Similarly, constraint (63) requires $u_{ij} = q_i - q_j$ if $q_i \ge q_j$; otherwise $u_{ij} = 0$. Expression (64) demands that potential $q_i$ be not less than potential $p_i$ for any module $v_i \in V$. Since high
FIGURE 23 p potential and q potential of each module.
potential $p_i$ corresponds to set $V_1$, and high potential $q_i$ corresponds to set $\bar{V}_2$, inequality (64) enforces $V_1$ to be a subset of $\bar{V}_2$. Consequently, the requirement that $V_1 \cap V_2 = \emptyset$ is satisfied.
Constraints (65)-(68) set the potentials of modules $v_s$ and $v_t$. Constraint (69) requires potential differences $w_{ij}$ and $u_{ij}$ to be nonnegative. Figure 23 shows one ideal potential configuration of the solution.
Dual Linear Programming Formulation If we assign dual variables (Lagrangian multipliers) $x_{ij}$ to inequality (62) with respect to each net, $x'_{ij}$ to inequality (63), $\lambda_i$ to inequality (64) with respect to module $v_i$, and $a_s$, $b_s$, $a_t$, $b_t$ to Eqs. (65)-(68), respectively, then we have the dual formulation.
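\[ \text{Obj}: \max\ a_s + b_s \tag{70} \]
subject to
\[ x_{ij} \le c_{ij} \quad \forall\, 1 \le i, j \le |V| \tag{71} \]
\[ x'_{ij} \le c_{ji} \quad \forall\, 1 \le i, j \le |V| \tag{72} \]
\[ \sum_{j=1}^{|V|} (x_{ij} - x_{ji}) + \lambda_i = 0 \quad \forall v_i \in V,\ v_i \ne v_s, v_t \tag{73} \]
\[ \sum_{j=1}^{|V|} (x'_{ij} - x'_{ji}) - \lambda_i = 0 \quad \forall v_i \in V,\ v_i \ne v_s, v_t \tag{74} \]
\[ \sum_{j=1}^{|V|} (x_{sj} - x_{js}) - a_s = 0 \tag{75} \]
\[ \sum_{j=1}^{|V|} (x_{tj} - x_{jt}) + a_t = 0 \tag{76} \]
\[ \sum_{j=1}^{|V|} (x'_{sj} - x'_{js}) - b_s = 0 \tag{77} \]
\[ \sum_{j=1}^{|V|} (x'_{tj} - x'_{jt}) + b_t = 0 \tag{78} \]
\[ \lambda_i, x_{ij}, x'_{ji} \ge 0 \quad \forall\, 1 \le i, j \le |V|,\ v_i \ne v_s, v_t \tag{79} \]
\[ a_s, a_t, b_s, b_t \ \text{unrestricted} \tag{80} \]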
where inequalities (71), (72) are derived with respect to each $w_{ij}$ and $u_{ij}$, respectively. Similarly, Eqs. (73)-(78) are derived with respect to each $p_i$, $q_i$, $p_s$, $p_t$, $q_s$ and $q_t$. The equality of Eqs. (73)-(78) holds because $p_i$, $q_i$, $p_s$, $p_t$, $q_s$ and $q_t$ are not restricted in sign in the primal formulation. Variables $\lambda_i$, $x_{ij}$, and $x'_{ij}$ are nonnegative in Eq. (79) because their corresponding expressions (62)-(64) are inequality constraints.
We can view $G(V, E)$ as a network flow problem and interpret $c_{ij}$ as the flow capacity and $x_{ij}$ as the flow of net $e_{ij}$. Constraint (71) requires that the flow $x_{ij}$ be not larger than the flow capacity $c_{ij}$ on each net $e_{ij}$. In constraint (72), the set of nets is in a reversed direction and flow $x'_{ij}$ is not larger than the capacity $c_{ji}$ of net $e_{ji}$ in $E$. Corresponding to $G(V, E)$, we use $G'(V', E')$ to denote the reversed graph.
Constraint (73) has the total flow $x_{ij}$ injected from module $v_i$ into $G$ be equal to $-\lambda_i$. On the other hand, constraint (74) has the total flow $x'_{ij}$ injected from module $v'_i$ into $G'$ be equal to $\lambda_i$. Combining Eqs. (73) and (74), we have
\[ -\sum_j (x_{ij} - x_{ji}) = \lambda_i = \sum_j (x'_{ij} - x'_{ji}). \tag{81} \]
This means that the amount of flow $\lambda_i$ which emanates from module $v_i$ in $G$ enters its corresponding module $v'_i$ in $G'$.
Constraints (75)-(78) indicate that $a_s$ and $b_s$ are the flow injections to module $v_s$ in $G$ and its reversed circuit $G'$; $a_t$ and $b_t$ are the flow ejections from module $v_t$ in $G$ and its reversed circuit $G'$, respectively. Combining circuits $G$ and $G'$ together, we have the maximum total flow, $a_s + b_s$, be the optimum solution of the minimum replication cut problem.
5.4.4. The Optimum Partition
In this subsection, we describe the construction of
replication graph and take an example to describe
it. We then apply the maximum flow algorithm on the constructed replication graph to derive an optimum replication cut. The optimality of the derived replication cut is proved by using a network flow approach.
Construction of Replication Graph Given a circuit $G(V, E)$ and modules $v_s$ and $v_t$, we construct another circuit $G'(V', E')$ where $|V'| = |V|$ with each module $v'_i$ in $V'$ corresponding to a module $v_i$ in $V$, and $|E'| = |E|$ with each directed net in $E'$ in the reverse direction of the corresponding net $e_{ij}$ in $E$. We create super modules $v^{+}_s$ and $v^{+}_t$ and nets $(v^{+}_s, v_s)$, $(v^{+}_s, v'_s)$, $(v_t, v^{+}_t)$, and $(v'_t, v^{+}_t)$ with infinite capacity as shown in Figure 24. From every module $v_i$ in $V$ except $v_s$ and $v_t$, we add a directed net of infinite capacity to the corresponding module $v'_i$ in $V'$. We refer to the combined circuit as $G^{+}$.
Polynomial-time Algorithm The optimum replication cut problem with respect to module pair $v_s$ and $v_t$ and without size constraints can be solved by a maximum-flow minimum-cut solution of the circuit $G^{+}$ with $v^{+}_s$ as the source and $v^{+}_t$ as the sink of the flow (Fig. 24). Suppose the maximum-flow minimum-cut finds partition $(X, \bar{X})$ of $V$ with $v_s \in X$ and $v_t \in \bar{X}$ and partition $(X', \bar{X}')$ of $V'$ with $v'_s \in X'$ and $v'_t \in \bar{X}'$. Then a replication cut $(V_1, V_2)$ of the original circuit with $V_1 = X$, $V_2 = \{v_i \mid v'_i \in \bar{X}'\}$ and $R = V - V_1 - V_2$ is an optimum solution. Note that $V_2$ is derived from the cut in vertex set $V'$. To simplify the notation, we shall use $(X, \bar{X}')$ to denote the derived replication cut of $G$.
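The construction and the flow step above can be sketched in a few lines. The code below is an illustrative sketch only: it assumes directed two-pin nets given as a `{(u, v): capacity}` dictionary, uses a plain Edmonds-Karp maximum flow (any maximum flow routine would do), and replaces the infinite capacities by a finite value larger than any possible cut. The four-module example circuit is hypothetical and is not the circuit of Figure 25.

```python
from collections import defaultdict, deque

def max_flow(cap, src, dst):
    """Edmonds-Karp on residual capacities cap[u][v]; mutates cap, returns the flow."""
    total = 0
    while True:
        parent, queue = {src: None}, deque([src])
        while queue and dst not in parent:          # BFS for an augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return total
        path, v = [], dst
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][w] for u, w in path)      # bottleneck capacity
        for u, w in path:
            cap[u][w] -= push
            cap[w][u] += push
        total += push

def replication_cut(nets, s, t):
    """Build the replication graph G+ and read (V1, V2, R) off a minimum cut."""
    inf = sum(nets.values()) + 1                    # stands in for infinite capacity
    prime = lambda v: (v, "'")                      # v'_i, the copy of v_i in G'
    modules = {v for net in nets for v in net}
    cap = defaultdict(lambda: defaultdict(int))
    for (u, v), c in nets.items():
        cap[u][v] += c                              # net of G
        cap[prime(v)][prime(u)] += c                # reversed net of G'
    src, dst = "s+", "t+"                           # super modules
    cap[src][s] = cap[src][prime(s)] = inf
    cap[t][dst] = cap[prime(t)][dst] = inf
    for v in modules - {s, t}:
        cap[v][prime(v)] = inf                      # link v_i -> v'_i
    flow = max_flow(cap, src, dst)
    seen, stack = {src}, [src]                      # source side of the min cut
    while stack:
        u = stack.pop()
        for v, c in cap[u].items():
            if c > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    v1 = {v for v in modules if v in seen}          # V1 = X
    v2 = {v for v in modules if prime(v) not in seen}   # V2 from the cut in V'
    return flow, v1, v2, modules - v1 - v2

# Hypothetical circuit: module "a" has cheap inputs and expensive outputs.
nets = {("s", "a"): 1, ("t", "a"): 1, ("a", "s"): 5, ("a", "t"): 5}
flow, v1, v2, r = replication_cut(nets, "s", "t")
```

In this example the cheapest two-way cut without replication costs 6, whereas replicating module `a` gives a replication cut of cost 2 with $V_1 = \{s\}$, $V_2 = \{t\}$ and $R = \{a\}$.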
Example Given a circuit in Figure 25, its replication graph $G^{+}$ is constructed as shown in Figure 26. The maximum-flow minimum-cut of $G^{+}$ derives $(X, \bar{X}) = (\{v_s, v_a\}, \{v_b, v_c, v_t\})$ and $(X', \bar{X}') = (\{v'_s, v'_a, v'_b, v'_c\}, \{v'_t\})$ with a flow amount of 5 (Fig. 26). Thus the sets $V_1 = \{v_s, v_a\}$ and $V_2 = \{v_t\}$ define an optimum replication cut $R(V_1, V_2)$ with $R = \{v_b, v_c\}$ and a cut cost equal to 5 (Fig. 27).
The network flow approach leads to the optimality of the solution as stated in the following theorem.
THEOREM 5.3 The replication cut $R(X, \bar{X}')$ derived from the transformed circuit $G^{+}$ generates the minimum replication cut count $C_R(X, \bar{X}')$ (expression (19)).
5.4.5. Heuristic Flow Algorithms
We introduce the heuristic approaches that accelerate the flow calculation and take advantage of the optimality properties of the flow methods. We first introduce an approach that utilizes the maximum flow minimum cut method for the min cut with
FIGURE 24 The replication graph $G^{+}$.
size constraints. We then explain a shortest path method for multiple commodity flow calculation.
(i) Usage of Maximum Flow Minimum Cut We adopt a heuristic approach [113] to get around the unbalanced partition of the maximum flow and minimum cut method. First, we find two seeds as the source and the sink modules, $v_s$, $v_t$. We then use the maximum flow and minimum cut method to find partition $(V_1, V_2)$ with $v_s \in V_1$ and $v_t \in V_2$. If the size $S(V_1)$ of $V_1$ is larger than the size $S(V_2)$ of $V_2$, we find from $V_1$ a module $v_i$ to merge with $V_2$ and shrink the merged set into a new sink module. Otherwise, we find from $V_2$ a module $v_i$ to merge with $V_1$ and shrink the merged set into a new source module. We repeat the maximum flow minimum cut process on the graph with the new source or sink module until the size of the partition fits the size constraint.
Two Way Partitioning using Maximum Flow Minimum Cut
1. Find two seeds as $v_s$ and $v_t$.
2. Call Maximum Flow Minimum Cut to find partition $(V_1, V_2)$.
3. If $S(V_1) > S(V_2)$, find a seed $v_i \in V_1$, merge $\{v_i\} \cup V_2$ into a new sink module $v_t$.
4. Else find a seed $v_i \in V_2$, merge $\{v_i\} \cup V_1$ into a new source module $v_s$.
5. Repeat Steps 1-4, until $S_l < S(V_1) < S_u$ and $S_l < S(V_2) < S_u$.
We can apply the parametric flow approach recursively to the maximum flow minimum cut problems (Step 2). The total complexity is equivalent to a single maximum flow minimum cut.
The seeds are chosen according to their connectivity to the vertex set on the other side. The result is sensitive to the choice of the seeds. We can make multiple trials and choose the best result. Other methods such as the programming approach can serve as a guideline on the choice of the seeds [79, 80]. The method has been shown to derive excellent results with reasonable running time.
FIGURE 25 A five module circuit to demonstrate the replication cut.
FIGURE 26 The constructed replication graph of the circuit shown in Figure 25.
(ii) Approximation of Multiple Commodity Flow Based on the multicommodity flow formulation [103], we try to solve a multiple way partitioning by deriving approximate multiple commodity flow with a stochastic process [13, 55, 114, 117].
Given a circuit $H(V, E)$, the flow increment $\Delta$, and the distance coefficient $c$, the algorithm starts with procedure Saturate-Network to saturate the circuit with flows. A stochastic flow injection algorithm is adopted to reduce the computational complexity. Then, Select-Cut is activated to select a set of nets by the flow values to constitute a cut. The conversion from weighted ratio cut to cluster ratio cut is performed by a Select-Cut routine which selects the subset of the cut derived from Saturate-Network with a greedy approach.
Multiple Commodity Flow Approximation ($H$, $\Delta$, $c$)
1. Iterate the following procedures until the clustering results are satisfactory:
1.1. Saturate-Network ($H$, $\Delta$, $c$).
1.2. Select-Cut ($H$).
2. Output clustering result.
Procedure Saturate-Network ($H$, $\Delta$, $c$)
1. Set the distance of each net $e$ to be one.
2. While ($H$ is connected) do 2.1 to 2.3.
2.1. Randomly pick two distinct modules $v_s$ and $v_t$.
2.2. Find the shortest path between $v_s$ and $v_t$.
2.3. For each net $e$ on the shortest path, let $f(e)$ and $d_e$ be the flow and distance of net $e$.
2.3.1. If $e$ is not saturated, increase $f(e)$ by $\Delta$ and set $d_e = \exp((c\, f(e))/c_e)$.
2.3.2. If $e$ is saturated, set $d_e$ to be $\infty$.
3. Output $E$ with flow information.
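A compact way to read the procedure is as repeated shortest-path routing with an exponentially growing congestion penalty. The sketch below is an illustration under stated assumptions, not the implementation of [103]: it handles two-pin nets only, treats a net as saturated once its flow reaches its capacity, interprets "H is connected" as "the picked pair is still joined by unsaturated nets", and uses Dijkstra's algorithm for Step 2.2. The names `saturate_network`, `coef` (the distance coefficient) and `seed` are chosen for this example.

```python
import heapq
import math
import random

def saturate_network(nets, delta, coef, seed=0):
    """nets: {(u, v): capacity} for two-pin nets. Returns {(u, v): flow}."""
    rng = random.Random(seed)
    flow = {e: 0.0 for e in nets}
    dist = {e: 1.0 for e in nets}                   # Step 1: unit distances
    adj = {}
    for (u, v) in nets:
        adj.setdefault(u, []).append((v, (u, v)))
        adj.setdefault(v, []).append((u, (u, v)))
    modules = sorted(adj)
    while True:
        s, t = rng.sample(modules, 2)               # Step 2.1: random distinct pair
        best, prev, heap = {s: 0.0}, {}, [(0.0, s)]
        while heap:                                 # Step 2.2: Dijkstra from s
            d, u = heapq.heappop(heap)
            if d > best[u]:
                continue
            for v, e in adj[u]:
                nd = d + dist[e]                    # saturated nets have dist = inf
                if nd < best.get(v, math.inf):
                    best[v], prev[v] = nd, (u, e)
                    heapq.heappush(heap, (nd, v))
        if t not in best:                           # every s-t path is saturated
            return flow
        v = t
        while v != s:                               # Step 2.3: inject delta on the path
            u, e = prev[v]
            flow[e] += delta
            if flow[e] >= nets[e] - 1e-12:
                dist[e] = math.inf                  # Step 2.3.2
            else:
                dist[e] = math.exp(coef * flow[e] / nets[e])   # Step 2.3.1
            v = u

# Hypothetical three-module chain with unit capacities.
nets = {("a", "b"): 1.0, ("b", "c"): 1.0}
flow = saturate_network(nets, delta=0.25, coef=10.0, seed=1)
```

Every iteration that finds a path adds `delta` to at least one unsaturated net, so the loop ends after a bounded number of injections, and at termination at least one net is saturated.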
The initial distance of each net is one since there is no flow being injected (see the distance formulation in Step 2.3.1). Step 2.1 uses a random process with even distribution over all modules to pick two distinct modules, and Steps 2.2-2.3 inject $\Delta$ amount of flow along the shortest path between the modules. In Steps 2.3.1-2.3.2, the distances of the nets whose flow has been increased are recomputed using an exponential function $d_e = \exp((c\, f(e))/c_e)$ to penalize the congested nets, where $d_e$ and $f(e)$ are the distance and flow of net $e$, respectively. Steps 2.1-2.3 are iteratively executed until a pair of modules is chosen where all possible paths between them are saturated by flows. These saturated nets identify a partition of the circuit.
FIGURE 27 The duplicated circuit of the circuit shown in Figure 25.
Figure 28 shows a sample circuit saturated by flows after executing Saturate-Network with $\Delta = 0.01$ and $c = 10$. The flow values are shown by the numbers right beside each net. The dashed lines indicate the cut lines along the set of saturated nets to form the three clusters. These saturated nets define an approximate weighted cluster ratio cut, which is a potential set of nets for a selection of cluster ratio cut.
5.5. Programming Approaches
For programming approaches [7, 18, 35, 41, 46, 44], we adopt two way minimum cut with size constraints as the target problem. We assume that the nets are two pin nets and thus, the circuit can be described as a graph $G(V, E)$. We also assume the modules are of unit size, i.e., $s_i = 1$.
The two way partition $(V_1, V_2)$ is represented by a linear placement with only two slots at coordinates $-1$ and $1$. For an even sized partition, half of the modules are assigned to each slot. Let $x_i$ denote the coordinate of module $v_i$. If $v_i \in V_1$, $x_i = -1$, else $x_i = 1$ for $v_i \in V_2$. The cut count can be expressed as follows.
\[ C(V_1, V_2) = \frac{1}{4} \sum_{e_{ij} \in E} c_{ij} (x_i - x_j)^2 = \frac{1}{4} X^T B X \tag{82} \]
where $X$ is a vector of $x_i$, and $X^T$ is the transpose of vector $X$. Matrix $B$ has its entry $b_{ij} = -c_{ij}$ if $i \ne j$, else $b_{ii} = \sum_{1 \le j \le |V|} c_{ij}$. Suppose we relax the slot constraint by enforcing only the rules of the gravity center and the norm. The constraint of vector $X$ can be expressed as:
\[ \mathbf{1}^T X = 0, \tag{83} \]
\[ X^T X = |V| \tag{84} \]
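Expression (82) is easy to check numerically on a small graph. The snippet below is a sanity-check sketch with a hypothetical four-module circuit; it builds matrix $B$ from the two-pin nets, evaluates $\frac{1}{4} X^T B X$ for a $\pm 1$ assignment, and compares it with the cut count obtained by direct counting. The helper names are chosen for this example.

```python
def build_b(n, nets):
    """Matrix B of expression (82): b_ij = -c_ij for i != j, b_ii = sum_j c_ij."""
    b = [[0.0] * n for _ in range(n)]
    for (i, j), c in nets.items():
        b[i][j] -= c
        b[j][i] -= c
        b[i][i] += c
        b[j][j] += c
    return b

def quadratic_form(b, x):
    """X^T B X for a list-of-lists matrix and a coordinate vector."""
    n = len(x)
    return sum(x[i] * b[i][j] * x[j] for i in range(n) for j in range(n))

def cut_count(nets, x):
    """Total connectivity of nets whose two modules sit in different slots."""
    return sum(c for (i, j), c in nets.items() if x[i] != x[j])

# Hypothetical circuit: modules 0..3 on a ring with weighted two-pin nets.
nets = {(0, 1): 2, (1, 2): 1, (2, 3): 3, (0, 3): 1}
x = [-1, -1, 1, 1]                      # V1 = {v0, v1}, V2 = {v2, v3}
b = build_b(4, nets)
via_matrix = quadratic_form(b, x) / 4
via_counting = cut_count(nets, x)
```

The chosen vector also satisfies the relaxed constraints (83) and (84): its entries sum to zero and $X^T X = 4 = |V|$.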
Matrix $B$ is symmetric and diagonally semidominant. Thus, it is positive semidefinite, i.e., all eigenvalues are nonnegative, and its eigenvectors are orthogonal. Let us order its eigenvalues from
FIGURE 28 The flow and partition generated by Saturate-Network.
small to large, i.e., $\lambda_0 \le \lambda_1 \le \cdots \le \lambda_{|V|-1}$. The smallest eigenvalue is $\lambda_0 = 0$ with its eigenvector $X_0 = \mathbf{1}$. The second eigenvalue $\lambda_1$ is nonnegative with its eigenvector orthogonal to the first eigenvector, i.e., $X_0^T X_1 = \mathbf{1}^T X_1 = 0$. Therefore, the second eigenvector $X_1$ is an optimal solution to objective function (82) with constraints (83) [46]. Since $X^T X = |V|$ (Eq. (84)), the solution is
\[ \frac{1}{4} X_1^T B X_1 = \frac{1}{4} \lambda_1 X_1^T X_1 = \frac{1}{4} \lambda_1 |V|, \tag{85} \]
which is a lower bound of the min-cut problem.
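The lower-bound claim follows from a standard eigenvector expansion. The short derivation below is a sketch that assumes an orthonormal set of eigenvectors $u_0, \ldots, u_{|V|-1}$ of $B$ with $u_0$ parallel to $\mathbf{1}$; the symbols $u_k$ and $\beta_k$ are introduced here for illustration only.

```latex
% Any X satisfying (83) and (84), expanded in the eigenvectors of B:
X = \sum_{k=0}^{|V|-1} \beta_k u_k , \qquad
\beta_0 = 0 \ \ (\text{since } \mathbf{1}^T X = 0), \qquad
\sum_{k=1}^{|V|-1} \beta_k^2 = X^T X = |V| .

% The quadratic form is then bounded from below by the second eigenvalue:
X^T B X = \sum_{k=1}^{|V|-1} \lambda_k \beta_k^2
        \;\ge\; \lambda_1 \sum_{k=1}^{|V|-1} \beta_k^2
        = \lambda_1 |V| .
```

Every even sized $\pm 1$ assignment satisfies (83) and (84), so its cut count $\frac{1}{4} X^T B X$ is at least $\frac{1}{4} \lambda_1 |V|$, with equality in the relaxed problem when $X$ is proportional to the second eigenvector.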
To push for a higher lower bound, we can adjust the diagonal term of matrix $B$ by adding constants $d_i$. Let
\[ \tilde{C}(V_1, V_2) = C(V_1, V_2) + \frac{1}{4} \sum_{1 \le i \le |V|} d_i x_i^2 - \frac{1}{4} \sum_{1 \le i \le |V|} d_i = \frac{1}{4} \left( X^T \tilde{B} X - \sum_{1 \le i \le |V|} d_i \right), \tag{86} \]
where matrix $\tilde{B}$ has its entry $\tilde{b}_{ij} = b_{ij}$ if $i \ne j$, else $\tilde{b}_{ii} = b_{ii} + d_i$. Whether $x_i = -1$ or $x_i = 1$, the last two terms cancel each other. The modification thus does not alter the optimal partition solution.
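The new nonlinear programming problem is to find the assignment of $d_i$ to maximize the objective function [11]:
\[ \frac{1}{4} \left( \tilde{\lambda}_1 |V| - \sum_{1 \le i \le |V|} d_i \right) \tag{87} \]
where $\tilde{\lambda}_1$ is the second smallest eigenvalue of matrix $\tilde{B}$. For every assignment of $d_i$, expression (87) remains a lower bound of the min-cut problem, so its maximum is the tightest bound of this form. It is at least as large as the bound (85), in the sense that the assignment $d_i = 0$, which yields $\lambda_1$, can serve as an initial feasible solution to maximize expression (87).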
Remarks The programming approach finds a global view of the problem [9, 79, 80, 118]. However, the formulation is very restricted. The extension to multiple pin nets and the incorporation of fixed modules will destroy the nice structure based on which we have the eigenvalue and eigenvector as optimal solutions. Therefore, it is difficult to utilize the approach recursively.
For a general case, we can view the problem as nonlinear programming with a Boolean quadratic objective function. Nonlinear programming techniques are adopted to derive the results [16, 107].
5.6. A Lagrange Multiplier Approach for Performance Driven Partitioning
The Lagrange multiplier is a useful tool for performance optimization. In this section, we demonstrate the usage of Lagrange multipliers for performance driven partitioning. The problem is to optimize the performance of a two-way partition $(V_1, V_2)$ with retiming [86].
We first introduce a vector of binary variables to represent a partition. The performance-driven partitioning problem is thus represented by a Boolean quadratic programming formulation with nonlinear constraints. We then absorb the nonlinear constraints into the objective function as a Lagrangian. We use primal and dual subproblems to decompose the Lagrangian and derive the partitions. The Lagrange multiplier is adjusted in each iteration via a subgradient method to monitor the timing criticality and improve the performance.
5.6.1. Programming Formulation with Lagrange Multiplier
We assume that the circuit can be represented by a graph $G(V, E)$ with two pin nets and unit module size. The two-way partition is described by a vector $x = (x_{1,1}, \ldots, x_{1,n}, x_{2,1}, \ldots, x_{2,n})$, where $x_{b,i}$ is 1 if module $v_i$ is assigned to vertex set $V_b$, otherwise $x_{b,i}$ is 0. If modules $v_i$ and $v_j$ are in different vertex sets, the value of the term $x_{1,i} x_{2,j} + x_{2,i} x_{1,j}$ is equal to 1. This contributes one interpartition delay $c$ into the delay of the net $e_{ij}$. Let $g_l(x)$ denote the delay to register ratio of loop $l$. Delay ratio $g_l(x)$ can be written as the following formula:
\[ g_l(x) = \frac{d_l + \sum_{e_{ij} \in l} c\, (x_{1,i} x_{2,j} + x_{2,i} x_{1,j})}{r_l} \tag{88} \]
Given a path $p$, the total delay $h_p(x)$ of $p$ is as follows:
\[ h_p(x) = d_p + \sum_{e_{ij} \in p} c\, (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) \tag{89} \]
To formulate the problem, we use an objective function of cut count:
\[ \min \sum_{e_{ij} \in E} c_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}), \tag{90} \]
subject to the following constraints:
C1 (Size Constraints)
\[ \sum_{i=1}^{|V|} x_{b,i}\, s_i \le S_u \quad \forall\, b \in \{1, 2\}. \tag{91} \]
C2 (Variable Assignment Constraints)
\[ \sum_{b=1}^{2} x_{b,i} = 1 \quad \forall\, v_i \in V. \tag{92} \]
C3 (Iteration Bound Constraints)
\[ g_l(x) \le \tilde{J} \quad \forall \text{ loop } l. \tag{93} \]
C4 (Latency Bound Constraints)
\[ h_p(x) \le \tilde{M} \quad \forall \text{ IO-critical path } p. \tag{94} \]
Actually, we do not need to consider all loops in C3. Because all loops are composed of simple loops, we have the following lemma:
LEMMA 1 Given a number $\tilde{J}$, if $g_l(x)$ is less than or equal to $\tilde{J}$ for any simple loop $l$, then $g_l(x)$ is less than or equal to $\tilde{J}$ for all loops $l$.
Let $c$ and $p$ represent the number of the simple loops and the number of IO-critical paths, respectively. Let $\lambda$ denote the vector $(\lambda_{g_1}, \ldots, \lambda_{g_c}, \lambda_{h_1}, \ldots, \lambda_{h_p})$. Using Lagrangian Relaxation [104], we absorb the constraints (93) and (94) into the objective function (90). The Lagrangian-relaxed problem is as follows.
\[ \max_{\lambda \ge 0} \min_{x} L(x, \lambda) \tag{95} \]
subject to constraints C1 and C2, where
\[ L(x, \lambda) = \sum_{e_{ij} \in E} c_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) + \sum_{\forall \text{ simple loop } l} \lambda_{g_l} \bigl( g_l(x) - \tilde{J} \bigr) + \sum_{\forall \text{ IO-critical path } p} \lambda_{h_p} \bigl( h_p(x) - \tilde{M} \bigr) \tag{96} \]
(i) The Dual Problem Given vector $x$, we can represent (96) as a function of variable $\lambda$, i.e., $L_x(\lambda)$. Thus, the dual problem can be written as:
\[ \max_{\lambda \ge 0} L_x(\lambda) \tag{97} \]
(ii) The Primal Problem Let $F_{ij}$ and $Q_{ij}$ denote the sets of the simple loops and IO-critical paths passing the net $e_{ij}$. The cost $a_{ij}$ of net $e_{ij}$ is composed of connectivity $c_{ij}$ and the penalty of the timing constraints.
\[ a_{ij} = c_{ij} + \sum_{l \in F_{ij}} \frac{c}{r_l} \lambda_{g_l} + \sum_{p \in Q_{ij}} c\, \lambda_{h_p} \tag{98} \]
Given vector $\lambda$, we can represent (96) as a function of vector $x$, i.e., $L_\lambda(x)$. Thus, the primal problem can be rewritten as:
\[ \min L_\lambda(x) = \min \sum_{e_{ij} \in E} a_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) + u \tag{99} \]
subject to C1 and C2, where $u$ represents the constant contributed by $\lambda$.
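Expression (98) amounts to adding, for every net, one penalty term per simple loop and per IO-critical path that passes through it. A minimal sketch, assuming loops and paths are given explicitly as lists of nets; the container layout and the names `net_costs`, `delay`, `lam_g` and `lam_h` are hypothetical choices for this example.

```python
def net_costs(conn, loops, paths, lam_g, lam_h, delay):
    """Evaluate expression (98) for every net.

    conn:  {net: c_ij}                      connectivity of each net
    loops: {loop_id: (nets_on_loop, r_l)}   simple loops with register counts
    paths: {path_id: nets_on_path}          IO-critical paths
    lam_g, lam_h: multipliers per loop / per path
    delay: the interpartition delay (written c in the text)
    """
    a = dict(conn)
    for loop_id, (loop_nets, r_l) in loops.items():
        for net in loop_nets:               # loop_id belongs to F_ij of each net
            a[net] += delay / r_l * lam_g[loop_id]
    for path_id, path_nets in paths.items():
        for net in path_nets:               # path_id belongs to Q_ij of each net
            a[net] += delay * lam_h[path_id]
    return a

# Hypothetical three-net loop with two registers and one critical path.
conn = {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "a"): 1.0}
loops = {"l1": ([("a", "b"), ("b", "c"), ("c", "a")], 2)}
paths = {"p1": [("a", "b")]}
costs = net_costs(conn, loops, paths, {"l1": 0.5}, {"p1": 0.25}, delay=2.0)
```

With all multipliers set to zero the costs fall back to the plain connectivities, so the primal problem (99) starts as an ordinary minimum cut problem.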
5.6.2. Subgradient Method using Cycle Mean Method
We solve the partitioning problem through primal and dual iterations on the Lagrangian. A Quadratic Boolean Programming, QBP, [16] is used to
solve the primal problem and generate a solution $x$ (Step 2).
For the dual problem based on $x$, we select the set of loops and paths that violate the timing constraints as active loops and paths. The nets contained in the active loops or paths are termed active nets.
Active Loops and Paths Given a solution $x$, a loop $l$ is called active if $g_l(x)$ is not less than $\tilde{J}$. A path $p$ is called active if $h_p(x)$ is not less than $\tilde{M}$.
Active Nets Given a net $e$, we define $e$ to be an active net if net $e$ is covered by an active loop or an active path.
We call a minimum cycle mean algorithm [57] and an all-pairs shortest-paths algorithm to mark all the nets on active loops and paths, respectively (Step 3). For every net $e_{ij}$ on active paths, we record $q_{ij}$: the maximum path delay among all paths passing through $e_{ij}$. For every net $e_{ij}$ on active loops, we record $p_{ij}$: the maximum delay-to-register ratio among all loops passing through $e_{ij}$. We then calculate the subgradient on the marked nets and update the constants $a_{ij}$ for the next primal dual iteration (Steps 4-5). We increase the costs of active nets using the subgradient approach [104]. The iteration proceeds until the bounds of all loops and paths are within the given limits.
Algorithm using Lagrange Multiplier Input: Constants $\tilde{J}$, $\tilde{M}$, $c = 1.3$ and an initial partition $\bigl( V^{(0)}_1, V^{(0)}_2 \bigr)$.
1. Initialize $k \leftarrow 1$; $a^{(0)}_{ij} = c_{ij}$.
2. Run QBP [16] to find a partition $\bigl( V^{(k)}_1, V^{(k)}_2 \bigr)$ with an objective to minimize cut count
\[ C\bigl( V^{(k)}_1, V^{(k)}_2 \bigr) = \sum_{e \in E(V^{(k)}_1, V^{(k)}_2)} a^{(k)}_{ij}. \]
3. Calculate the iteration and latency bounds of the partition $\bigl( V^{(k)}_1, V^{(k)}_2 \bigr)$, respectively. Stop if timing constraints are satisfied. Otherwise, revise $p_{ij}$ and $q_{ij}$ for all nets $e_{ij}$.
4. Compute
\[ t^{(k)} = \frac{c \left( C\bigl( V^{(k)}_1, V^{(k)}_2 \bigr) - C\bigl( V^{(0)}_1, V^{(0)}_2 \bigr) \right)}{\sum_{e_{ij} \in E} (p_{ij} - \tilde{J})^2 + \sum_{e_{ij} \in E} (q_{ij} - \tilde{M})^2} \]
5. Revise shadow price $a_{ij}$ for all nets $e_{ij} \in E$:
$a^{(k+1)}_{ij} = a^{(k)}_{ij}$;
if net $e_{ij}$ is in an active loop, then $a^{(k+1)}_{ij} = a^{(k)}_{ij} + t^{(k)} (p_{ij} - \tilde{J})$;
if net $e_{ij}$ is in an active path, then $a^{(k+1)}_{ij} = a^{(k)}_{ij} + t^{(k)} (q_{ij} - \tilde{M})$.
6. While $k \le$ MaxNumIter, set $k \leftarrow k + 1$ and goto 2.
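Steps 4 and 5 can be isolated as a single cost-update routine. The following sketch makes several assumptions that the listing leaves open: the cut-count term in the numerator of Step 4 is passed in as one number `gap`, nets that are not on any active loop or path contribute nothing to the denominator, and a net lying on both an active loop and an active path receives both increments. The function name and argument layout are hypothetical.

```python
def update_costs(a, gap, p, q, j_bound, m_bound, coef=1.3):
    """One pass of Steps 4-5 of the Lagrange multiplier algorithm.

    a:    {net: a_ij} current net costs
    gap:  cut-count term in the numerator of Step 4 (assumed precomputed)
    p:    {net: p_ij} for nets on active loops
    q:    {net: q_ij} for nets on active paths
    coef: the input constant of the algorithm (1.3 in the listing)
    """
    denom = (sum((p[e] - j_bound) ** 2 for e in p)
             + sum((q[e] - m_bound) ** 2 for e in q))
    if denom == 0:                          # no violated loop or path: keep costs
        return dict(a)
    step = coef * gap / denom               # Step 4: step size t
    new_a = dict(a)                         # Step 5: default is the old cost
    for e in p:
        new_a[e] += step * (p[e] - j_bound)
    for e in q:
        new_a[e] += step * (q[e] - m_bound)
    return new_a

# Hypothetical data: one net on an active loop, one on an active path.
a = {"e1": 1.0, "e2": 2.0, "e3": 1.0}
new_a = update_costs(a, gap=2.0, p={"e1": 3.0}, q={"e2": 5.0},
                     j_bound=2.0, m_bound=4.0, coef=1.0)
```

Only the nets recorded in `p` or `q` change cost; all other nets keep their previous value, as in the first line of Step 5.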
5.7. Clustering Heuristics
We first discuss the usage of clustering heuristics. We then discuss top down clustering and bottom up clustering approaches. Finally, we discuss some variations of clustering metrics.
5.7.1. Usage of Clustering Heuristics
The usage of clustering heuristics plays an important role in determining the quality of the final results. In the following, we discuss the issue in different topics. We use a two-way partitioning with size constraints as the target problem.
1. Top Down Clustering versus Bottom Up Clustering: Top down clustering approach provides a global view of the solution. The operations are consistent with the target problem. However, it is more time consuming because the clustering operates on the whole circuit [29]. Bottom up clustering is efficient. However, because the process operates locally, the target solution is sensitive to the clustering heuristics [59].
2. The Level of the Clustering: Suppose we represent the clustering results with a hierarchical tree structure. Let the root correspond to the whole circuit, the leaves correspond to the smallest clusters, and the internal nodes correspond to the intermediate clusters. Hence, the size of the clusters grows with the level of the nodes. Top down clustering creates clusters corresponding to nodes in high levels, while bottom up clustering creates clusters corresponding to nodes in low levels.
For example, in [60], Kernighan and Lin proposed a top down clustering approach, which divides the whole circuit into four clusters only. In [59], Karypis et al. used a bottom up clustering which starts with clusters of two modules or a net. If we continue the application of bottom up clustering on intermediate clusters, the quality of the clusters degenerates as the size of the clusters grows bigger.
3. Iteration of Clustering and Unclustering: We go through the iterations of clustering and unclustering to improve the quality of the results. At each level of the hierarchical tree, we derive an intermediate target solution, e.g., a two-way partition. In unclustering, we go down the level of tree hierarchy to find an expanded circuit with more modules. In clustering, we go up the level of tree hierarchy with a circuit of a smaller number of modules. The previous partitioning result becomes the initial solution of the new partitioning problem. Note that the hierarchical tree is constructed dynamically. For each clustering, the modules can be grouped based on the current partitioning configuration.
4. The Clustering Operations and the Target Solution: The clustering operation has to be consistent with the target solution. For example, suppose the target is finding a two-way min-cut with size constraints. Then, it is natural to cluster modules based on net connectivity because the probability that a net is in an optimal cut set is small (see the subsection of min-cut with size constraints in problem formulations). Moreover, it is important that the clustering follows the current partitioning results, i.e., only modules in the same partition are clustered.
5.7.2. Top Down Clustering Approach for Partitioning
We use an application to two-way cut with size constraints to illustrate the top down clustering approach [24, 29]. The partitioning of huge designs is complicated and the results can be erratic. Our strategy (Fig. 29) is to reduce the circuit complexity by constructing a contracted hypergraph. The clusters for the contracted hypergraph are searched via a recursive top down partitioning method. The number of modules is much reduced after we contract the clusters. Hence, a group
FIGURE 29 Strategy of top down clustering.
migration approach can derive excellent two way cut results on the contracted hypergraph with much efficiency. Furthermore, since the clusters are grouped via a top down partitioning, conceptually a minimum cut on the hypergraph can take advantage of the previous results and generate better solutions.
In this section, we describe a top down clustering algorithm. A ratio cut is adopted to perform the top down clustering process. Other partition approaches can also be used to replace the ratio cut. A group migration method is used to find a minimum cut of the contracted hypergraph with size constraint. Finally, we apply a last run of the group migration algorithm to the original circuit to fine tune the result.
Input a hypergraph $H(V, E)$, an integer $k$ for the number of expected clusters, an integer num_of_reps for repetition, and $S_l$, $S_u$ for the size constraints of two resultant subsets.
1. Initialize $\Gamma = \{V\}$ and $V^{+} = V$.
2. Apply ratio cut [109] to obtain a partition $(A, A')$ of $V^{+} = A \cup A'$.
3. Set $\Gamma = (\Gamma - \{V^{+}\}) \cup \{A, A'\}$. Set $V^{+}$ to be a vertex set in $\Gamma$ such that $S(V^{+}) = \max_{V_i \in \Gamma} S(V_i)$.
4. While $S(V^{+}) > S(V)/k$, repeat Steps 2, 3.
5. Construct a contracted hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$.
6. Apply num_of_reps times of a group migration
algorithm to H
i
((C(V
i
))/(C
I
(V
i
)))=4/124/12=2/3. For this
case, we can nd a better solution of clusters
{a, b, c, d, e, f } and {g, h, i, j, k, l} of which the
cluster cost is equal to zero.
Figure 31 shows another example of twelve modules with connectivities attached to the nets. The connectivity is 1 if not specified. Figure 31(a) shows an optimum cut with cut count 6.6. If a maximum matching [61] criterion is adopted in the bottom up clustering approach, then modules with a net of weight 1.1 between them will be merged. A minimum cut on the merged modules yields a cut count of 18 (Fig. 31(b)). In general, a $2n$ module circuit having a symmetric configuration as in Figure 31 will have a cut count of $n^2/2$ if the maximum matching criterion is applied to perform the clustering, while the optimum solution will have a cut weight of $1.1n$. From this extreme case, we can claim the following theorem:
THEOREM 5.4 There is no constant factor of error bound of the cut count generated by the maximum matching approach, from the cut count of a minimum cut.
Proof As shown in the above example, the factor of error bound is $(n^2/2)/(1.1n) = n/2.2$, which is not a constant. Q.E.D.
(iv) Maximum Pairing The maximum pairing is
FIGURE 30 Clustering of two module circuit.
similar to maximum matching, except that it does
not enforce the matching of all modules. Only the
top q percent of the modules are paired. Thus, we
can avoid the enforced pairing of unrelated
modules.
However, this strategy may cause certain
modules to keep on growing and produce very
uneven cluster results. Thus, we need to choose a
proper cost function that discourages unlimited
growth of the cluster size, e.g., cost function (26).
5.7.4. Variations of Clustering Metric
In order to identify good clusters, we need to look
beyond the direct adjacency between modules. It is
useful if we can also extract the relation between
the neighbors' neighbors, or even several levels of
neighbors' neighbors. The probabilistic gain model
of group migration approach is one good example
of such approach [37, 42].
In this section, we will discuss a few different clustering metrics. For the case of $k$ connectivity, we count the number of $k$-hop paths between two modules. Or, we use an analogy of a resistive network to check the conductance between the modules. Furthermore, we check beyond the hypergraph and use other information such as the module functions, pin locations, and control signals.
(i) kth Connectivity The number of k-hop paths between two modules provides a different aspect of information on the adjacency. Suppose the circuit has only two-pin nets. We can derive the kth connectivity with sparse matrix multiplication. Let $C$ be the connectivity matrix with connectivity $c_{ij}$ as its element at row $i$ column $j$ and at row $j$ column $i$, and with diagonal entries $c_{ii} = 0$. Note that we set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$.
Let $c^{(2)}_{ij}$ be the element of the square of matrix $C$ ($C^2$), and $c^{(k)}_{ij}$ be the element of the kth power of matrix $C$ ($C^k$). Then $c^{(k)}_{ij}$ represents the number of distinct k-hop paths connecting modules $v_i$ and $v_j$ when all connectivities are 1. A path here may revisit modules, and with general connectivities each path is weighted by the product of the connectivities of its nets.
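The kth connectivity can be illustrated with a small sketch. It assumes a circuit with only two-pin nets stored as a dense symmetric list-of-lists matrix; a practical implementation would use sparse matrices as the text suggests, and the function names here are illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def kth_connectivity(c, k):
    """Return C^k, whose (i, j) entry is the kth connectivity of v_i and v_j."""
    result = c
    for _ in range(k - 1):
        result = mat_mul(result, c)
    return result

# Four modules connected in a ring 0-1-2-3-0, all connectivities equal to 1.
ring = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
c2 = kth_connectivity(ring, 2)
print(c2[0][2])  # two 2-hop paths from module 0 to module 2 (via 1 and via 3)
print(c2[0][1])  # no 2-hop path between adjacent modules 0 and 1
```

Modules 0 and 2 share no net, yet their second-order connectivity is 2, which exposes the neighbors' neighbors relation that direct adjacency misses.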
(ii) Conductivity We use a resistive network analogy [21, 93] to derive the relation between modules. Suppose the circuit has only two-pin nets. We replace each net $e_{ij}$ with a resistor of conductance $c_{ij}$. Hence, we can view the whole system as a resistive network and derive the conductance between modules. The system conductance between two modules $v_i$ and $v_j$ reveals the adjacency relation between the two modules.
The network conductance can be derived using circuit analysis. We can also approximate the conductance with a random walk approach. In a random walk model, we start walking from a module $v_i$. At each module $v_k$, the probability to walk via net $e_{kl}$ to module $v_l$ is proportional to the connectivity, i.e., $c_{kl} / \sum_m c_{km}$. We can derive the relation between the random walk and the conductivity [89]:
$$h_{ij} + h_{ji} = \frac{2 \sum_{e \in E} c_e}{o_{ij}}, \qquad (100)$$
where $h_{ij}$ denotes the expected number of hops to walk from module $v_i$ to module $v_j$, and $o_{ij}$ denotes the conductance between $v_i$ and $v_j$.
FIGURE 31 A twelve module example to demonstrate maximum matching.
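As a sanity check on Eq. (100), the following sketch computes both sides exactly on a small network. It assumes two-pin nets with integer connectivities in a dense symmetric matrix and a connected network; the hitting times come from the standard first-step linear equations and the conductance from the grounded node equations. All function names and the example network are illustrative.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[r][n] for r in range(n)]

def hitting_time(c, i, j):
    """Expected hops for a random walk started at i to first reach j (i != j)."""
    others = [k for k in range(len(c)) if k != j]
    degree = [sum(row) for row in c]
    # For k != j: h_k - sum_{l != j} p_kl h_l = 1 with p_kl = c_kl / degree_k.
    a = [[(1 if k == l else 0) - Fraction(c[k][l], degree[k]) for l in others]
         for k in others]
    h = solve(a, [1] * len(others))
    return h[others.index(i)]

def conductance(c, i, j):
    """Effective conductance between i and j (i != j), with node j grounded."""
    others = [k for k in range(len(c)) if k != j]
    # Node equations: unit current injected at i and extracted at j.
    a = [[sum(c[k]) if k == l else -c[k][l] for l in others] for k in others]
    v = solve(a, [1 if k == i else 0 for k in others])
    return 1 / v[others.index(i)]

def total_connectivity(c):
    """Sum of c_e over all nets (each symmetric entry counted once)."""
    return sum(c[i][j] for i in range(len(c)) for j in range(i + 1, len(c)))

# Hypothetical three module network with connectivities 2, 1, and 1.
network = [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
lhs = hitting_time(network, 0, 1) + hitting_time(network, 1, 0)
rhs = 2 * total_connectivity(network) / conductance(network, 0, 1)
print(lhs, rhs)
```

On this network both sides evaluate to the same fraction, which agrees with the relation between the round-trip hop count and the conductance stated above.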
(iii) Similarity of Signatures We can use certain features beyond connectivity for the clustering metric [88, 91]. For example, the index of data bits, sequence of the pins, function of logic, and relation with common control signals can serve as signatures of function blocks in data path designs. All these features form the first level adjacency. We can extend the relation to multiple levels. For example, two modules that connect to a set of modules with strong similarity become similar themselves.
Example As shown in Figure 32, modules A and B are similar in signature because they are of the same OR function, connected to consecutive bit numbers at the same pin location, and controlled by the same control signal at the same pin location.
Modules C and D become similar because module C obtains its signal from A, module D obtains its signal from B, and modules A and B are similar.
6. RESEARCH DIRECTIONS
Partitioning remains an important research problem. Many applications such as floorplanning, engineering change orders, and performance driven emulation demand effective and efficient partitioning solutions.
Recent efforts released benchmarks with reasonable complexity [3]. However, more design cases are still needed to represent the class of huge circuitry with details of functions and timing.
In this section, we touch on a few interesting research problems regarding the correlation between the partition of logic and physical designs, the manipulation of hierarchical tree structure, and the performance driven partitioning.
6.1. Correlation of Hierarchical Partitioning
Structure Between Logic Synthesis and
Physical Layout
It is desirable to correlate the logic hierarchy with the physical design hierarchy. The main reason is the control of timing for huge designs. Currently, the design turnaround takes 2-8 months for ASIC and much longer for custom designs. Throughout the design process, designs keep on changing. We do not want to lose control of timing as the design changes. A tight correlation of logic and physical hierarchies makes timing predictable. Without this kind of mechanism, the timing characteristics of a floorplan may become erratic after iterations of design changes.
6.2. Manipulation of Hierarchical Partitioning
Structure
One main issue in mapping a huge hierarchical circuit is the utilization of the hierarchy to reduce the mapping complexity. We can drastically improve the efficiency of the mapping process if we properly exploit the structure of the design hierarchy. The generic binary tree is a good formulation to start with.
The handling of a hierarchy tree gives rise to many fundamental research problems. For example, finding k shortest paths or exploring the maximum-flow minimum-cut of the whole circuit [51] embedded in a hierarchical tree can be useful for interconnect analysis and optimization. Such research can also benefit many different fields which have to handle huge hierarchical systems.
FIGURE 32 Signature identifies data structure.
6.3. Performance Driven Partitioning
For performance driven partitioning, we need a fast evaluation on the hierarchical tree structure. The analysis needs to be incremental and to incorporate signal integrity.
The network flow method is a potential approach for partitioning with timing constraints. More efforts are needed to improve the speed and derive the desired results.
Acknowledgements
The authors thank the editor for encouraging the preparation of this manuscript. The authors would also like to thank Ted Carson, Lung-Tien Liu, and John Lillis for helpful discussions.
References
[1] Ahuja, R. K., Magnanti, T. L. and Orlin, J. B., Network Flows, Prentice Hall, 1993.
[2] Alpert, C. J., "The ISPD98 circuit benchmark suite", Int. Symp. on Physical Design, pp. 80-85, April, 1998.
[3] Alpert, C. J., Caldwell, A. E., Kahng, A. B. and Markov, I. L., "Partitioning with Terminals: a 'New' Problem and New Benchmarks", Int. Symp. on Physical Design, pp. 151-157, April, 1999.
[4] Alpert, C. J., Huang, J. H. and Kahng, A. B., "Multilevel circuit partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 530-533.
[5] Alpert, C. J. and Kahng, A. B., "Recent directions in netlist partitioning: a survey", Integration: The VLSI J., 19(1), 1-81, August, 1995.
[6] Alpert, C. J. and Kahng, A. B., "A general framework for vertex orderings with applications to circuit clustering", IEEE Trans. VLSI Syst., 4(2), 240-246, June, 1996.
[7] Alpert, C. J. and Yao, S. Z., "Spectral partitioning: the more eigenvectors, the better", In: Proc. ACM/IEEE Design Automation Conf., June, 1995, pp. 195-200.
[8] Bakoglu, H. B., Circuits, Interconnections, and Packaging for VLSI, MA: Addison-Wesley, 1990.
[9] Blanks, J. (1989). "Partitioning by Probability Condensation", ACM/IEEE 26th Design Automation Conf., pp. 758-761.
[10] Bollobas, B. (1985). Random Graphs, Academic Press Inc., pp. 31-53.
[11] Boppana, R. B. (1987). "Eigenvalues and Graph Bisection: An Average Case Analysis", Annual Symp. on Foundations in Computer Science, pp. 280-285.
[12] Breuer, M. A., Design Automation of Digital Systems, Prentice-Hall, NY, 1972.
[13] Bui, T., Chaudhuri, S., Jones, C., Leighton, T. and Sipser, M. (1987). "Graph bisection algorithms with good average case behavior", Combinatorica, 7(2), 171-191.
[14] Bui, T., Heigham, C., Jones, C. and Leighton, T., "Improving the performance of the Kernighan-Lin and simulated annealing graph bisection algorithms", In: Proc. ACM/IEEE Design Automation Conf., June, 1989, pp. 775-778.
[15] Buntine, W. L., Su, L., Newton, A. R. and Mayer, A., "Adaptive methods for netlist partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 356-363.
[16] Burkard, R. E. and Bonniger, T. (1983). "A Heuristic for Quadratic Boolean Programs with Applications to Quadratic Assignment Problems", European Journal of Operational Research, 13, 372-386.
[17] Camposano, R. and Brayton, R. K. (1987). "Partitioning Before Logic Synthesis", Int. Conf. on Computer-Aided Design, pp. 324-326.
[18] Chan, P. K., Schlag, D. F. and Zien, J. Y., "Spectral k-way ratio-cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 13(9), 1088-1096, September, 1994.
[19] Charney, H. R. and Plato, D. L., "Efficient Partitioning of Components", IEEE Design Automation, July, 1968, pp. 16.0-16.21.
[20] Chatterjee, A. C. and Hartley, R., "A new Simultaneous Circuit Partitioning and Chip Placement Approach based on Simulated Annealing", In: Proc. ACM/IEEE Design Automation Conf., June, 1990, pp. 36-39.
[21] Cheng, C. K. and Kuh, E. S., "Module Placement Based on Resistive Network Optimization", IEEE Trans. on Computer-Aided Design, CAD-3, 218-225, July, 1984.
[22] Cheng, C. K., "Linear Placement Algorithms and Applications to VLSI Design", Networks, 17, 439-464, Winter, 1987.
[23] Cheng, C. K. and Hu, T. C., "Ancestor Tree for Arbitrary Multi-Terminal Cut Functions", Proc. Integer Programming/Combinatorial Optimization Conf., Univ. of Waterloo, May, 1990, pp. 115-127.
[24] Cheng, C. K. and Wei, Y. C. (1991). "An Improved Two-Way Partitioning Algorithm with Stable Performance", IEEE Trans. on Computer Aided Design, 10(12), 1502-1511.
[25] Cheng, C. K. (1992). "The Optimal Partitioning of Networks", Networks, 22, 297-315.
[26] Cherng, J. S. and Chen, S. J., "A Stable Partitioning Algorithm for VLSI Circuits", In: Proc. IEEE Custom Integrated Circuits Conf., May, 1996, pp. 9.1.1-9.1.4.
[27] Cherng, J. S., Chen, S. J. and Ho, J. M., "Efficient Bipartitioning Algorithm for Size-Constrained Circuits", IEEE Proceedings-Computers and Digital Techniques, 145(1), 37-45, January, 1998.
[28] Cheng, C. K. and Hu, T. C. (1992). "Maximum Concurrent Flow and Minimum Ratio Cut", Algorithmica, 8, 233-249.
[29] Chou, N. C., Liu, L. T., Cheng, C. K., Dai, W. J. and Lindelof, R., "Local Ratio Cut and Set Covering Partitioning for Huge Logic Emulation Systems", IEEE Trans. Computer-Aided Design, pp. 1085-1092, September, 1995.
[30] Chvatal, V. (1983). Linear Programming, W. H. Freeman and Company.
[31] Cong, J. and Ding, Y., "FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs", IEEE Trans. Computer-Aided Design, January, 1994, 13, 1-12.
[32] Cong, J., Labio, W. and Shivakumar, N., "Multi-way VLSI circuit partitioning based on dual net representation", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1994, pp. 56-62.
[33] Cong, J., Li, H. P., Lim, S. K., Shibuya, T. and Xu, D., "Large scale circuit partitioning with loose/stable net removal and signal flow based clustering", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 441-446.
[34] Donath, W. E. and Hoffman, A. J. (1973). "Lower Bounds for the Partitioning of Graphs", IBM J. Res. Dev., pp. 420-425.
[35] Donath, W. E. and Hoffman, A. J. (1972). "Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices", IBM Technical Disclosure Bulletin 15, pp. 938-944.
[36] Donath, W. E. (1988). "Logic partitioning", In: Physical Design Automation of VLSI Systems, Preas, B. and Lorenzetti, M. (Eds.) Menlo Park, CA: Benjamin/Cummings, pp. 65-86.
[37] Dutt, S. and Deng, W., "A Probability-based Approach to VLSI Circuit Partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 100-105.
[38] Dutt, S. and Deng, W., "VLSI Circuit Partitioning by Cluster-Removal Using Iterative Improvement Techniques", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1996, pp. 194-200.
[39] Enos, M., Hauck, S. and Sarrafzadeh, M., "Evaluation and optimization of Replication Algorithms for logic Bipartitioning", IEEE Trans. on Computer-Aided Design, September, 1999, 18, 1237-48.
[40] Fiduccia, C. M. and Mattheyses, R. M., "A Linear-Time Heuristic for Improving Network Partitions", In: Proc. ACM/IEEE Design Automation Conf., June, 1982, pp. 175-181.
[41] Frankle, J. and Karp, R. M. (1986). "Circuit Placement and Cost Bounds by Eigenvector Decomposition", Proc. Int. Conf. on Computer-Aided Design, pp. 414-417.
[42] Garbers, J., Promel, H. J. and Steger, A. (1990). "Finding clusters in VLSI circuits", In: Proc. IEEE Int. Conf. Computer-Aided Design, pp. 520-523.
[43] Garey, M. R. and Johnson, D. S., Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, San Francisco, CA, 1979.
[44] Hagen, L. and Kahng, A. B., "New spectral methods for ratio cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 11(9), 1074-1085, September, 1992.
[45] Hagen, L. and Kahng, A. B., "Combining problem reduction and adaptive multistart: a new technique for superior iterative partitioning", IEEE Trans. Computer-Aided Design, 16(7), 709-717, July, 1997.
[46] Hall, K. M., "An r-dimensional Quadratic Placement Algorithm", Management Science, 17(3), 219-229, November, 1970.
[47] Hamada, T., Cheng, C. K. and Chau, P., "An Efficient Multi-Level Placement Technique Using Hierarchical Partitioning", IEEE Trans. Circuits and Systems, 39, 432-439, June, 1992.
[48] Hennessy, J. (1983). "Partitioning Programmable Logic Arrays Summary", Int. Conf. on Computer-Aided Design, pp. 180-181.
[49] Hoffmann, A. G., "The Dynamic Locking Heuristic - A New Graph Partitioning Algorithm", In: Proc. IEEE Int. Symp. Circuits and Systems, May, 1994, pp. 173-176.
[50] Adolphson, D. and Hu, T. C., "Optimal Linear Ordering", SIAM J. Appl. Math., 25(3), 403-423, November, 1973.
[51] Hu, T. C., "Decomposition Algorithm", pp. 17-22, In: Combinatorial Algorithms, Addison Wesley, 1982.
[52] Hu, T. C. and Moerder, K., "Multiterminal flows in a hypergraph", In: VLSI Circuit Layout: Theory and Design, Hu, T. C. and Kuh, E. (Eds.) NY: IEEE Press, 1985, pp. 87-93.
[53] Hur, S. W. and Lillis, J. (1999). "Relaxation and Clustering in a Local Search Framework: Application to Linear Placement", Design Automation Conference, pp. 360-366.
[54] Hwang, J. and Gamal, A. E., "Optimal Replication for Min-Cut Partitioning", Proc. IEEE/ACM Intl. Conf. Computer-Aided Design, November, 1992, pp. 432-435.
[55] Iman, S., Pedram, M., Fabian, C. and Cong, J., "Finding uni-directional cuts based on physical partitioning and logic restructuring", In: Proc. ACM/SIGDA Physical Design Workshop, May, 1993, pp. 187-198.
[56] Johnson, D. S., Aragon, C. R., McGeoch, L. A. and Schevon, C. (1989). "Optimization by Simulated Annealing: an Experimental Evaluation, Part I, Graph Partitioning", Operations Research, 37(5), 865-892.
[57] Karp, R. M. (1978). "A Characterization of The Minimum Cycle Mean in A Digraph", Discrete Mathematics, 23, 309-311.
[58] Karypis, G., Aggarwal, R., Kumar, V. and Shekhar, S., "Multilevel Hypergraph Partitioning: Application in VLSI Domain", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 526-529.
[59] Karypis, G., Aggarwal, R., Kumar, V. and Shekhar, S. (1998). "Multilevel Hypergraph Partitioning: Application in VLSI Domain", Manuscript of CS Dept., Univ. of Minnesota, pp. 1-25 (http://www.users.cs.umn.edu/karypis/metis/publications/).
[60] Kernighan, B. W. and Lin, S., "An Efficient Heuristic Procedure for Partitioning Graphs", Bell Syst. Tech. J., 49(2), 291-307, February, 1970.
[61] Khellaf, M., "On The Partitioning of Graphs and Hypergraphs", Ph.D. Dissertation, Indus. Engineering and Operations Research, Univ. of California, Berkeley, 1987.
[62] Kirkpatrick, S., Gelatt, C. and Vecchi, M., "Optimization by Simulated Annealing", Science, 220(4598), 671-680, May, 1983.
[63] Knuth, D. E., The Art of Computer Programming, Addison Wesley, 1997.
[64] Kring, C. and Newton, A. R. (1991). "A Cell-Replicating Approach to Mincut Based Circuit Partitioning", Proc. IEEE Int. Conf. on Computer-Aided Design, pp. 2-5.
[65] Krishnamurthy, B., "An Improved Min-Cut Algorithm for Partitioning VLSI Networks", IEEE Trans. Computers, C-33(5), 438-446, May, 1984.
[66] Krupnova, H., Abbara, A. and Saucier, G. (1997). "A Hierarchy-Driven FPGA Partitioning Method", Design Automation Conf., pp. 522-525.
[67] Kuo, M. T. and Cheng, C. K., "A New Network Flow Approach for Hierarchical Tree Partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 512-517.
[68] Kuo, M. T., Liu, L. T. and Cheng, C. K., "Network Partitioning into Tree Hierarchies", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 477-482.
[69] Kuo, M. T., Liu, L. T. and Cheng, C. K., "Finite State Machine Decomposition for I/O Minimization", In: Proc. IEEE Int. Symp. on Circuits and Systems, May, 1995, pp. 1061-1064.
[70] Kuo, M. T., Wang, Y., Cheng, C. K. and Fujita, M., "BDD-Based Logic Partitioning for Sequential Circuits", In: Proc. ASP/DAC, Chiba, Japan, January, 1997, pp. 607-612.
[71] Lomonosov, M. V. (1985). "Combinatorial Approaches to Multiflow Problems", Discrete Applied Mathematics, 11(1), 1-94.
[72] Landman, B. S. and Russo, R. L., "On a Pin Versus Block Relationship for Partitioning of Logic Graphs", IEEE Trans. on Computers, C-20, 1469-1479, December, 1971.
[73] Lawler, E. L., Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, New York, 1976.
[74] Leighton, T. and Rao, S. (1988). "An Approximate Max-Flow Min-cut Theorem for Uniform Multicommodity Flow Problems with Applications to Approximation Algorithms", IEEE Symp. on Foundations of Computer Science, pp. 422-431.
[75] Leighton, T., Makedon, F., Plotkin, S., Stein, C., Tardos, E. and Tragoudas, S., "Fast Approximation Algorithms for Multicommodity Flow Problems", Tech. report no. STAN-CS-91-1375, Dept. of Computer Science, Stanford University.
[76] Leiserson, C. E. and Saxe, J. B. (1991). "Retiming Synchronous Circuitry", Algorithmica, 6(1), 5-35.
[77] Lengauer, T. and Muller, R. (1988). "Linear Arrangement Problems on Recursively Partitioned Graphs", Zeitschrift fur Operations Research, 32, 213-230.
[78] Lengauer, T., Combinatorial Algorithms for Integrated Circuit Layout, Wiley, 1990.
[79] Li, J., Lillis, J. and Cheng, C. K., "Linear decomposition algorithm for VLSI design applications", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1995, pp. 223-228.
[80] Li, J., Lillis, J., Liu, L. T. and Cheng, C. K., "New Spectral Linear Placement and Clustering Approach", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 88-93.
[81] Liou, H. Y., Lin, T. T., Liu, L. T. and Cheng, C. K., "Circuit Partitioning for Pipelined Pseudo-Exhaustive Testing Using Simulated Annealing", In: Proc. IEEE Custom Integrated Circuits Conf., May, 1994, pp. 417-420.
[82] Liu, L. T., Kuo, M. T., Cheng, C. K. and Hu, T. C., "A Replication Cut for Two-Way Partitioning", IEEE Trans. Computer-Aided Design, May, 1995, pp. 623-630.
[83] Liu, L. T., Kuo, M. T., Cheng, C. K. and Hu, T. C., "Performance-Driven Partitioning Using a Replication Graph Approach", In: Proc. ACM/IEEE Design Automation Conf., June, 1995, pp. 206-210.
[84] Liu, L. T., Kuo, M. T., Huang, S. C. and Cheng, C. K., "A gradient method on the initial partition of Fiduccia-Mattheyses algorithm", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1993, pp. 229-234.
[85] Liu, L. T., Shih, M., Chou, N. C., Cheng, C. K. and Ku, W., "Performance-Driven Partitioning Using Retiming and Replication", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1993, pp. 296-299.
[86] Liu, L. T., Shih, M. and Cheng, C. K., "Data Flow Partitioning for Clock Period and Latency Minimization", In: Proc. ACM/IEEE Design Automation Conf., June, 1994, pp. 658-663.
[87] Matula, D. W. and Shahrokhi, F., "The Maximum Concurrent Flow Problem and Sparsest Cuts", Tech. Report, Southern Methodist Univ., 1986.
[88] McFarland, M. C., S.J., "Computer-aided partitioning of behavioral hardware descriptions", In: Proc. ACM/IEEE Design Automation Conf., June, 1983, pp. 472-478.
[89] Motwani, R. and Raghavan, P. (1995). Randomized Algorithms, Cambridge University Press.
[90] Ng, T. K., Oldfield, J. and Pitchumani, V., "Improvements of a mincut partition algorithm", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1987, pp. 470-473.
[91] Nijssen, R. X. T., Jess, J. A. G. and Eindhoven, T. U., "Two-Dimensional Datapath Regularity Extraction", Physical Design Workshop, April, 1996, pp. 111-117.
[92] Parhi, K. K. and Messerschmitt, D. G. (1991). "Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding", IEEE Trans. on Computers, 40(2), 178-195.
[93] Riess, B. M., Doll, K. and Johannes, F. M., "Partitioning very large circuits using analytical placement techniques", In: Proc. ACM/IEEE Design Automation Conf., June, 1994, pp. 646-651.
[94] Roy, K. and Sechen, C., "A Timing Driven N-Way Chip and Multi-Chip Partitioner", Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 240-247, November, 1993.
[95] Russo, R. L., Oden, P. H. and Wolff, P. K. Sr., "A heuristic procedure for the partitioning and mapping of computer logic graphs", IEEE Trans. on Computers, C-20, 1455-1462, December, 1971.
[96] Saab, Y., "A fast and robust network bisection algorithm", IEEE Trans. Computers, 44(7), 903-913, July, 1995.
[97] Saab, Y. and Rao, V. (1989). "An Evolution-Based Approach to Partitioning ASIC Systems", ACM/IEEE 26th Design Automation Conf., pp. 767-770.
[98] Sanchis, L. A., "Multiple-Way Network Partitioning", IEEE Trans. Computers, 38(1), 62-81, January, 1989.
[99] Sanchis, L. A., "Multiple-Way Network Partitioning with Different Cost Functions", IEEE Trans. on Computers, pp. 1500-1504, December, 1993.
[100] Schuler, D. M. and Ulrich, E. G. (1972). "Clustering and Linear Placement", Proc. 9th Design Automation Workshop, pp. 50-56.
[101] Schweikert, D. G. and Kernighan, B. W. (1972). "A Proper Model for the Partitioning of Electrical Circuits", Proc. 9th Design Automation Workshop, pp. 57-62.
[102] Sechen, C. and Chen, D. (1988). "An Improved Objective Function for Mincut Circuit Partitioning", Proc. Int. Conf. on Computer-Aided Design, pp. 502-505.
[103] Shahrokhi, F. and Matula, D. W., "The Maximum Concurrent Flow Problem", Journal of the ACM, 37(2), 318-334, April, 1990.
[104] Shapiro, J. F. (1979). Mathematical Programming: Structures and Algorithms, Wiley, New York.
[105] Sherwani, N. A. (1999). Algorithms for VLSI Physical Design Automation, 3rd edn., Kluwer Academic.
[106] Shih, M., Kuh, E. S. and Tsay, R.-S. (1992). "Performance-Driven System Partitioning on Multi-Chip Modules", Proc. 29th ACM/IEEE Design Automation Conf., pp. 53-56.
[107] Shih, M. and Kuh, E. S. (1993). "Quadratic Boolean Programming for Performance-Driven System Partitioning", Proc. 30th ACM/IEEE Design Automation Conf., pp. 761-765.
[108] Shin, H. and Kim, C., "A Simple Yet Effective Technique for Partitioning", IEEE Trans. on Very Large Scale Integration Systems, pp. 380-386, September, 1993.
[109] Wei, Y. C. and Cheng, C. K. (1991). "Ratio Cut Partitioning for Hierarchical Designs", IEEE Trans. on Computer-Aided Design, 10(7), 911-921.
[110] Wei, Y. C., Cheng, C. K. and Wurman, Z., "Multiple Level Partitioning: An Application to the Very Large Scale Hardware Simulators", IEEE Journal of Solid State Circuits, 26, 706-716, May, 1991.
[111] Woo, N. S. and Kim, J. (1993). "An Efficient Method of Partitioning Circuits for Multiple-FPGA Implementation", Proc. ACM/IEEE Design Automation Conf., pp. 202-207.
[112] Yang, H. and Wong, D. F. (1994). "Edge-Map: Optimal Performance Driven Technology Mapping for Iterative LUT Based FPGA Designs", Int. Conf. on Computer-Aided Design, pp. 150-155.
[113] Yang, H. and Wong, D. F., "Efficient Network Flow based Min-Cut Balanced Partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1994, pp. 50-55.
[114] Yeh, C. W., "On the Acceleration of Flow-Oriented Circuit Clustering", IEEE Trans. Computer-Aided Design, 14(10), 1305-1308, October, 1995.
[115] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "A general purpose, multiple-way partitioning algorithm", IEEE Trans. Computer-Aided Design, 13(12), 1480-1488, December, 1994.
[116] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "Optimization by iterative improvement: an experimental evaluation on two-way partitioning", IEEE Trans. Computer-Aided Design, 14(2), 145-153, February, 1995.
[117] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "Circuit clustering using a stochastic flow injection method", IEEE Trans. Computer-Aided Design, 14(2), 154-162, February, 1995.
[118] Zien, J. Y., Chan, P. K. and Schlag, M., "Hybrid spectral/iterative partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 436-440.
Authors' Biographies
Sao-Jie Chen has been a member of the faculty in the Department of Electrical Engineering, National Taiwan University since 1982, where he is currently a full professor. During the fall of 1999, he held a visiting appointment at the Department of Computer Science and Engineering, University of California, San Diego. His current research interests include VLSI circuit design, VLSI physical design automation, and object-oriented software engineering. Dr. Chen is a member of the Association for Computing Machinery, the IEEE, and the IEEE Computer Society.
Chung-Kuan Cheng received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, and the Ph.D. degree in electrical engineering and computer sciences from University of California, Berkeley in 1984. From 1984 to 1986 he was a senior CAD engineer at Advanced Micro Devices Inc. In 1986, he joined the University of California, San Diego, where he is a Professor in the Computer Science and Engineering Department and an Adjunct Professor in the Electrical and Computer Engineering Department. He served as a chief scientist at Mentor Graphics in 1999. He has been an associate editor of IEEE Trans. on Computer-Aided Design since 1994. He is a recipient of the best paper award, IEEE Trans. on Computer-Aided Design, 1997, and the NCR excellence in teaching award, School of Engineering, UCSD, 1991. His research interests include network optimization and design automation on microelectronic circuits.