Design and Implementation of
a High-Speed, Low-Power VLSI
Chip for the DCT Transform
Project Advisor:
Professor A. Doboli
Participants:
Tak Yuen Lam,
Background
The DCT has many applications:
- Filtering
- Teleconferencing
- High-definition television (HDTV)
- Speech coding and image coding
- Data compression, and more.
Background
All of these use the DCT algorithm for compression and/or filtering.
The DCT has energy-packing capabilities and approaches the statistically optimal transform in de-correlating a signal.
It has been implemented with discrete components on a chip.
Goal
Implementation of a VLSI chip with:
- high speed
- low power
that computes the 2-D Discrete Cosine Transform (DCT) of an 8 x 8 element matrix.
Goal
Reduce power consumption during computation on the chip:
- Specially designed multiplier with less computation.
- Less switching.
- Simplified equations.
High-speed processing:
- Use pipelining.
- Ignore zeros in the multiplier.
Basic Formula
Forward DCT:
\[
F(u,v) = C(u)\,C(v)\left[\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y)\,\cos\frac{(2x+1)u\pi}{2N}\,\cos\frac{(2y+1)v\pi}{2N}\right]
\]
Inverse DCT:
\[
f(x,y) = \left[\sum_{u=0}^{N-1}\sum_{v=0}^{N-1} C(u)\,C(v)\,F(u,v)\,\cos\frac{(2x+1)u\pi}{2N}\,\cos\frac{(2y+1)v\pi}{2N}\right]
\]
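Basic Formula
\[
C(u) = \sqrt{\frac{1}{N}},\quad C(v) = \sqrt{\frac{1}{N}} \qquad \text{for } u, v = 0
\]
\[
C(u) = \sqrt{\frac{2}{N}},\quad C(v) = \sqrt{\frac{2}{N}} \qquad \text{for } u, v = 1 \text{ through } N-1
\]
N = 4, 8, or 16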
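The forward and inverse formulas can be checked against each other in software. The following is a minimal sketch for N = 8, assuming the scale factors C(0) = sqrt(1/N) and C(k) = sqrt(2/N); the class name, method names, and the sample block are illustrative and do not come from the slides.

```java
// Sketch only: names and the sample data are illustrative assumptions.
final class DctRoundTripSketch {
    static final int N = 8;

    // Scale factor: C(0) = sqrt(1/N), C(k) = sqrt(2/N) for k = 1..N-1.
    static double c(int k) {
        return (k == 0) ? Math.sqrt(1.0 / N) : Math.sqrt(2.0 / N);
    }

    // Cosine term cos((2x+1) * u * pi / (2N)) shared by both transforms.
    static double basis(int x, int u) {
        return Math.cos((2 * x + 1) * u * Math.PI / (2.0 * N));
    }

    // Forward 2-D DCT, evaluated directly from the double sum.
    static double[][] forward(double[][] f) {
        double[][] coeff = new double[N][N];
        for (int u = 0; u < N; u++) {
            for (int v = 0; v < N; v++) {
                double sum = 0.0;
                for (int x = 0; x < N; x++) {
                    for (int y = 0; y < N; y++) {
                        sum += f[x][y] * basis(x, u) * basis(y, v);
                    }
                }
                coeff[u][v] = c(u) * c(v) * sum;
            }
        }
        return coeff;
    }

    // Inverse 2-D DCT: the scale factors move inside the sum over u and v.
    static double[][] inverse(double[][] coeff) {
        double[][] f = new double[N][N];
        for (int x = 0; x < N; x++) {
            for (int y = 0; y < N; y++) {
                double sum = 0.0;
                for (int u = 0; u < N; u++) {
                    for (int v = 0; v < N; v++) {
                        sum += c(u) * c(v) * coeff[u][v] * basis(x, u) * basis(y, v);
                    }
                }
                f[x][y] = sum;
            }
        }
        return f;
    }

    // Arbitrary 8 x 8 test block with values in 0..255.
    static double[][] sampleBlock() {
        double[][] f = new double[N][N];
        for (int x = 0; x < N; x++) {
            for (int y = 0; y < N; y++) {
                f[x][y] = (x * 31 + y * 17) % 256;
            }
        }
        return f;
    }

    // Largest absolute difference between f and inverse(forward(f)).
    static double maxRoundTripError(double[][] f) {
        double[][] back = inverse(forward(f));
        double worst = 0.0;
        for (int x = 0; x < N; x++) {
            for (int y = 0; y < N; y++) {
                worst = Math.max(worst, Math.abs(back[x][y] - f[x][y]));
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("max round-trip error = " + maxRoundTripError(sampleBlock()));
    }
}
```

With these scale factors the transform is orthonormal, so the round-trip error should be at the level of floating-point rounding.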
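1-D DCT Matrix
\[
[Y] = [C] * [X]
\]
\[
\begin{bmatrix} Y(0)\\ Y(1)\\ Y(2)\\ Y(3)\\ Y(4)\\ Y(5)\\ Y(6)\\ Y(7) \end{bmatrix}
= \frac{1}{4}
\begin{bmatrix}
C_4 & C_4 & C_4 & C_4 & C_4 & C_4 & C_4 & C_4\\
C_1 & C_3 & C_5 & C_7 & -C_7 & -C_5 & -C_3 & -C_1\\
C_2 & C_6 & -C_6 & -C_2 & -C_2 & -C_6 & C_6 & C_2\\
C_3 & -C_7 & -C_1 & -C_5 & C_5 & C_1 & C_7 & -C_3\\
C_4 & -C_4 & -C_4 & C_4 & C_4 & -C_4 & -C_4 & C_4\\
C_5 & -C_1 & C_7 & C_3 & -C_3 & -C_7 & C_1 & -C_5\\
C_6 & -C_2 & C_2 & -C_6 & -C_6 & C_2 & -C_2 & C_6\\
C_7 & -C_5 & C_3 & -C_1 & C_1 & -C_3 & C_5 & -C_7
\end{bmatrix}
\begin{bmatrix} X(0)\\ X(1)\\ X(2)\\ X(3)\\ X(4)\\ X(5)\\ X(6)\\ X(7) \end{bmatrix}
\]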
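The sign pattern of the coefficient matrix can be regenerated from the cosine definition. The sketch below assumes the usual shorthand C_k = cos(k*pi/16), which the slides do not state explicitly, and leaves out the leading scale factor; class and method names are illustrative.

```java
// Sketch only: assumes C_k = cos(k*pi/16); names are illustrative.
final class DctMatrixSketch {
    static double ck(int k) {
        return Math.cos(k * Math.PI / 16.0);
    }

    // Unscaled entry for row k, column n: C4 in row 0,
    // cos((2n+1) * k * pi / 16) in every other row.
    static double[][] coefficientMatrix() {
        double[][] m = new double[8][8];
        for (int k = 0; k < 8; k++) {
            for (int n = 0; n < 8; n++) {
                m[k][n] = (k == 0) ? ck(4) : Math.cos((2 * n + 1) * k * Math.PI / 16.0);
            }
        }
        return m;
    }

    // Maps a matrix entry back to its signed symbol, for example "-C5".
    static String label(double value) {
        for (int k = 1; k <= 7; k++) {
            if (Math.abs(Math.abs(value) - ck(k)) < 1e-9) {
                return (value < 0 ? "-" : "+") + "C" + k;
            }
        }
        return "?";
    }

    // Plain matrix-vector product Y = M * X.
    static double[] multiply(double[][] m, double[] x) {
        double[] y = new double[8];
        for (int k = 0; k < 8; k++) {
            for (int n = 0; n < 8; n++) {
                y[k] += m[k][n] * x[n];
            }
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] m = coefficientMatrix();
        for (double[] row : m) {
            StringBuilder line = new StringBuilder();
            for (double v : row) {
                line.append(label(v)).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

Running `main` prints the matrix as signed C_k symbols, one row per output Y(k).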
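Simplification:
\[
\begin{bmatrix} Y(0)\\ Y(4)\\ Y(2)\\ Y(6)\\ Y(1)\\ Y(3)\\ Y(5)\\ Y(7) \end{bmatrix}
= \frac{1}{4}
\begin{bmatrix}
C_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & C_4 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & C_2 & C_6 & 0 & 0 & 0 & 0\\
0 & 0 & C_6 & -C_2 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & C_1 & C_3 & C_5 & C_7\\
0 & 0 & 0 & 0 & C_3 & -C_7 & -C_1 & -C_5\\
0 & 0 & 0 & 0 & C_5 & -C_1 & C_7 & C_3\\
0 & 0 & 0 & 0 & C_7 & -C_5 & C_3 & -C_1
\end{bmatrix}
*
\begin{bmatrix}
X(0)+X(1)+X(2)+X(3)+X(4)+X(5)+X(6)+X(7)\\
X(0)+X(7)+X(3)+X(4)-X(1)-X(2)-X(5)-X(6)\\
X(0)+X(7)-X(3)-X(4)\\
X(1)+X(6)-X(2)-X(5)\\
X(0)-X(7)\\
X(1)-X(6)\\
X(2)-X(5)\\
X(3)-X(4)
\end{bmatrix}
\]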
The following equations are derived from the matrix above:
Y(0) = (1/4) [ C4 (X(0)+X(1)+X(2)+X(3)+X(4)+X(5)+X(6)+X(7)) ]
Y(1) = (1/4) [ C1 (X(0)-X(7)) + C3 (X(1)-X(6)) + C5 (X(2)-X(5)) + C7 (X(3)-X(4)) ]
Y(2) = (1/4) [ C2 (X(0)+X(7)-X(3)-X(4)) + C6 (X(1)+X(6)-X(2)-X(5)) ]
Y(3) = (1/4) [ C3 (X(0)-X(7)) - C7 (X(1)-X(6)) - C1 (X(2)-X(5)) - C5 (X(3)-X(4)) ]
Y(4) = (1/4) [ C4 (X(0)+X(7)+X(3)+X(4)-X(1)-X(2)-X(5)-X(6)) ]
Y(5) = (1/4) [ C5 (X(0)-X(7)) - C1 (X(1)-X(6)) + C7 (X(2)-X(5)) + C3 (X(3)-X(4)) ]
Y(6) = (1/4) [ C6 (X(0)+X(7)-X(3)-X(4)) - C2 (X(1)+X(6)-X(2)-X(5)) ]
Y(7) = (1/4) [ C7 (X(0)-X(7)) - C5 (X(1)-X(6)) + C3 (X(2)-X(5)) - C1 (X(3)-X(4)) ]
Simplified Equations
Y(0) = c4[j+k+l+m]
Y(2) = c2[j-m] + c6[k-l]
Y(4) = c4[j-k-l+m]
Y(6) = c6[j-m] - c2[k-l]
Y(1) = e + f + h + [c1+c3-c5-c7]a
Y(3) = e + g + i + [c1+c3+c5-c7]b
Y(5) = e + g + h + [c1+c3-c5+c7]c
Y(7) = e + f + i + [-c1+c3+c5-c7]d
Simplified Equations
a = x0-x7; b = x1-x6;
c = x2-x5; d = x3-x4
j = x0+x7; k = x1+x6;
l = x2+x5; m = x3+x4
e = c3[a+b+c+d]
f = [c7-c3][a+d]
g = [-c1-c3][b+c]
h = [c5-c3][a+c]
i = [-c5-c3][b+d]
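The shared-term form can be checked numerically against the direct matrix product. The sketch below is one way to do that, assuming C_k = cos(k*pi/16) and omitting the leading scale factor. It uses the sign conventions that reproduce the direct product: f = (c7-c3)(a+d), Y(2) = c2(j-m) + c6(k-l), Y(4) = c4(j-k-l+m), and Y(6) = c6(j-m) - c2(k-l). All Java names are illustrative.

```java
// Sketch only: checks the shared-term equations against the direct product.
final class FastDct1dSketch {
    static double ck(int k) {
        return Math.cos(k * Math.PI / 16.0);
    }

    // Reference: unscaled Y = [C] * X, with C4 in row 0.
    static double[] direct(double[] x) {
        double[] y = new double[8];
        for (int k = 0; k < 8; k++) {
            for (int n = 0; n < 8; n++) {
                double coeff = (k == 0) ? ck(4) : Math.cos((2 * n + 1) * k * Math.PI / 16.0);
                y[k] += coeff * x[n];
            }
        }
        return y;
    }

    // Shared-term form: sums j..m feed the even outputs,
    // differences a..d feed the odd outputs through e, f, g, h, i.
    static double[] fast(double[] x) {
        double c1 = ck(1), c2 = ck(2), c3 = ck(3), c4 = ck(4);
        double c5 = ck(5), c6 = ck(6), c7 = ck(7);

        double a = x[0] - x[7], b = x[1] - x[6], c = x[2] - x[5], d = x[3] - x[4];
        double j = x[0] + x[7], k = x[1] + x[6], l = x[2] + x[5], m = x[3] + x[4];

        double e = c3 * (a + b + c + d);
        double f = (c7 - c3) * (a + d);
        double g = (-c1 - c3) * (b + c);
        double h = (c5 - c3) * (a + c);
        double i = (-c5 - c3) * (b + d);

        double[] y = new double[8];
        y[0] = c4 * (j + k + l + m);
        y[2] = c2 * (j - m) + c6 * (k - l);
        y[4] = c4 * (j - k - l + m);
        y[6] = c6 * (j - m) - c2 * (k - l);
        y[1] = e + f + h + (c1 + c3 - c5 - c7) * a;
        y[3] = e + g + i + (c1 + c3 + c5 - c7) * b;
        y[5] = e + g + h + (c1 + c3 - c5 + c7) * c;
        y[7] = e + f + i + (-c1 + c3 + c5 - c7) * d;
        return y;
    }

    // Largest absolute difference between the two evaluations.
    static double maxDifference(double[] x) {
        double[] p = direct(x);
        double[] q = fast(x);
        double worst = 0.0;
        for (int n = 0; n < 8; n++) {
            worst = Math.max(worst, Math.abs(p[n] - q[n]));
        }
        return worst;
    }

    public static void main(String[] args) {
        double[] x = {48, 1, 67, 57, 178, 48, 23, 54};
        System.out.println("max difference = " + maxDifference(x));
    }
}
```

In hardware the bracketed constant combinations such as (c1+c3-c5-c7) would be precomputed, so each odd output costs one extra multiplication beyond the shared products e, f, g, h, and i.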
1D DCT Flow Chart
[Block diagram: Pixel Memory, Cosine Matrix Memory, Shifter, Multiplier, Register Bank, DCT Coefficients]
2D DCT Flow Chart
[Block diagram: 1-D DCT, then Transpose, then 1-D DCT, with a Control block]
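The 1-D DCT, transpose, 1-D DCT chain can be modelled in software to confirm that it reproduces the 2-D formula. This is a minimal sketch assuming N = 8 and orthonormal scaling; the final transpose only restores the F(u, v) index order for comparison and is not a block in the flow chart. All names are illustrative.

```java
// Sketch only: software model of the row-column decomposition.
final class RowColumnDctSketch {
    static final int N = 8;

    static double scale(int k) {
        return (k == 0) ? Math.sqrt(1.0 / N) : Math.sqrt(2.0 / N);
    }

    // 1-D DCT of one row.
    static double[] dct1d(double[] x) {
        double[] y = new double[N];
        for (int k = 0; k < N; k++) {
            double sum = 0.0;
            for (int n = 0; n < N; n++) {
                sum += x[n] * Math.cos((2 * n + 1) * k * Math.PI / (2.0 * N));
            }
            y[k] = scale(k) * sum;
        }
        return y;
    }

    static double[][] dctRows(double[][] in) {
        double[][] out = new double[N][];
        for (int r = 0; r < N; r++) {
            out[r] = dct1d(in[r]);
        }
        return out;
    }

    static double[][] transpose(double[][] in) {
        double[][] out = new double[N][N];
        for (int r = 0; r < N; r++) {
            for (int c = 0; c < N; c++) {
                out[c][r] = in[r][c];
            }
        }
        return out;
    }

    // 1-D DCT on rows, transpose, 1-D DCT on rows again.
    static double[][] dct2d(double[][] f) {
        double[][] pass1 = dctRows(f);                 // transform along y
        double[][] pass2 = dctRows(transpose(pass1));  // transform along x, indexed [v][u]
        return transpose(pass2);                       // reorder to [u][v]
    }

    // Reference: the 2-D double sum evaluated directly.
    static double[][] dct2dDirect(double[][] f) {
        double[][] out = new double[N][N];
        for (int u = 0; u < N; u++) {
            for (int v = 0; v < N; v++) {
                double sum = 0.0;
                for (int x = 0; x < N; x++) {
                    for (int y = 0; y < N; y++) {
                        sum += f[x][y]
                             * Math.cos((2 * x + 1) * u * Math.PI / (2.0 * N))
                             * Math.cos((2 * y + 1) * v * Math.PI / (2.0 * N));
                    }
                }
                out[u][v] = scale(u) * scale(v) * sum;
            }
        }
        return out;
    }

    static double maxDifference(double[][] f) {
        double[][] p = dct2d(f);
        double[][] q = dct2dDirect(f);
        double worst = 0.0;
        for (int u = 0; u < N; u++) {
            for (int v = 0; v < N; v++) {
                worst = Math.max(worst, Math.abs(p[u][v] - q[u][v]));
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        double[][] f = new double[N][N];
        for (int x = 0; x < N; x++) {
            for (int y = 0; y < N; y++) {
                f[x][y] = (x * 29 + y * 13) % 256;
            }
        }
        System.out.println("max difference = " + maxDifference(f));
    }
}
```

The decomposition replaces one 64-term sum per coefficient with two passes of 8-term sums, which is why a single 1-D unit plus a transpose stage is enough for the 2-D transform.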
1-D DCT Architecture (First Version)
[Figure: datapath in which the input pairs X(0)/X(7), X(1)/X(6), X(2)/X(5), and X(3)/X(4) feed the blocks that produce Y(0), Y(4), Y(2), Y(6) and Y(1), Y(3), Y(5), Y(7)]
1-D DCT Architecture (Final Version)
[Figure: three-state datapath (State 1, State 2, State 3) with adder stages; inputs X0 through X7 taken as the pairs X0/X7, X1/X6, X2/X5, X3/X4; outputs Y0 through Y7]
Transpose Architecture
[Figure: Transpose Component with inputs In 0 through In 7]
Hardware: Full Adder
20 transistors
[Figure: full adder schematic; labeled signals include A, CI, and Co]
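The logical behavior of a full adder can be written in a few lines. This sketch models only the Boolean function (sum and carry out from two operand bits and a carry in); it says nothing about the 20-transistor circuit itself, and the names are illustrative.

```java
// Sketch only: gate-level behavior of a full adder, not the transistor circuit.
final class FullAdderSketch {
    // Returns {sum, carryOut} for single-bit inputs a, b, and carry-in ci.
    static int[] fullAdder(int a, int b, int ci) {
        int sum = a ^ b ^ ci;                  // odd number of ones gives sum = 1
        int co = (a & b) | (ci & (a ^ b));     // carry when at least two inputs are 1
        return new int[] {sum, co};
    }

    public static void main(String[] args) {
        // Print the full truth table.
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                for (int ci = 0; ci <= 1; ci++) {
                    int[] r = fullAdder(a, b, ci);
                    System.out.println(a + " " + b + " " + ci + " -> sum=" + r[0] + " co=" + r[1]);
                }
            }
        }
    }
}
```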
Hardware: Multiplier
4x4 bit array multiplier
[Figure: schematic built from 7408 AND gates, half adders (HA), and full adders (FA); inputs X0 through X3 and Y0 through Y3, outputs Z0 through Z7]
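A 4x4 array multiplier can be modelled bit by bit: AND gates form the partial products and rows of adders accumulate them. The sketch below is a behavioral model under that assumption, not a transcription of the schematic; it uses a full adder with a zero input wherever the hardware would use a half adder, and all names are illustrative.

```java
// Sketch only: bit-level model of a 4x4 unsigned array multiplier.
final class ArrayMultiplierSketch {
    // Returns {sum, carryOut} for single-bit inputs.
    static int[] fullAdder(int a, int b, int ci) {
        return new int[] {a ^ b ^ ci, (a & b) | (ci & (a ^ b))};
    }

    // Multiplies two 4-bit unsigned values, producing the 8-bit result Z7..Z0.
    static int multiply(int x, int y) {
        int[] z = new int[8];                    // running sum bits Z0..Z7
        for (int i = 0; i < 4; i++) {            // one adder row per bit Y_i
            int yi = (y >> i) & 1;
            int carry = 0;
            for (int j = 0; j < 4; j++) {
                int pp = ((x >> j) & 1) & yi;    // AND gate: partial product X_j * Y_i
                int[] r = fullAdder(z[i + j], pp, carry);
                z[i + j] = r[0];
                carry = r[1];
            }
            z[i + 4] = carry;                    // row carry-out lands one bit higher
        }
        int result = 0;
        for (int k = 0; k < 8; k++) {
            result |= z[k] << k;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("13 * 11 = " + multiply(13, 11)); // prints "13 * 11 = 143"
    }
}
```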
Simplified Multiplication
[Figure: example partial-product array; entries marked "ignore" are skipped. Output bits shown: 1 1 0 1 0 0 1 1 1 0 0 0]
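One way to read "ignore zeros in the multiplier" is that a zero multiplier bit contributes an all-zero partial product, so its addition can be skipped. The sketch below illustrates that idea in software for unsigned operands and counts the additions actually performed. It is an interpretation for illustration, not the chip's circuit, and the names are illustrative.

```java
// Sketch only: shift-and-add multiply that skips zero multiplier bits.
final class ZeroSkipMultiplySketch {
    // Returns {product, additionsPerformed} for unsigned operands;
    // 'bits' is the width of the multiplier.
    static long[] multiply(int multiplicand, int multiplier, int bits) {
        long product = 0;
        long additions = 0;
        for (int i = 0; i < bits; i++) {
            if (((multiplier >> i) & 1) == 0) {
                continue;                        // zero bit: partial product is ignored
            }
            product += ((long) multiplicand) << i;
            additions++;
        }
        return new long[] {product, additions};
    }

    public static void main(String[] args) {
        long[] r = multiply(13, 10, 8);          // 10 = 00001010 in binary
        System.out.println("product=" + r[0] + " additions=" + r[1]); // product=130 additions=2
    }
}
```

In this model the number of additions equals the number of one bits in the multiplier, so coefficients with few one bits cost less work.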
Comparison

                          Conventional design   First version   Final version
Number of additions       56                    28              32
Number of multipliers     32                    16              14

Multiplier                Speed of multiplier (ns)   Power (mW)
Array (16x16)             92.6                       43.5
Split Array (16x16)       62.9                       38
Wallace (16x16)           54.5                       32
Modified Booth (16x16)    45.4                       41.3
Our design (12x8)         ~23                        ~22.5
VLSI: Full Adder (from Library)
VLSI: Multiplier
Multiplier Simulation
VLSI: 1 bit Transpose
VLSI: 8x8 Transpose
Transpose Simulation
VLSI: 1D DCT Part One
1 D DCT Part One Simulation
VLSI: 1D DCT Part Two
1 D DCT Part two Simulation
Java simulation
Java code:
// Direct 2-D DCT of the 8 x 8 block f; the rounded coefficients are stored in g.
// f (input) and g (output) are fields of the enclosing class.
public void transform() {
    g = new int[8][8];
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            double ge = 0.0;
            // Accumulate f(x, y) weighted by the two cosine terms.
            for (int x = 0; x < 8; x++) {
                for (int y = 0; y < 8; y++) {
                    double cg1 = (2.0 * x + 1.0) * i * Math.PI / 16.0;
                    double cg2 = (2.0 * y + 1.0) * j * Math.PI / 16.0;
                    ge += f[x][y] * Math.cos(cg1) * Math.cos(cg2);
                }
            }
            // C(i) * C(j) for N = 8: 0.25, with an extra 1/sqrt(2) for each zero index.
            double ci = (i == 0) ? 1.0 / Math.sqrt(2.0) : 1.0;
            double cj = (j == 0) ? 1.0 / Math.sqrt(2.0) : 1.0;
            ge *= ci * cj * 0.25;
            g[i][j] = (int) Math.round(ge);
        }
    }
}
Simulation Result
INPUT MATRIX:
48
1
67
57
178
48
23
54
123
5
89
67
175 234
2
98
56
3
89
23
45
56
32
67
155 43 27
51 93 187
39 52 12
120 87 65
44 45 47
90 128 235
89
67
78
34
23
50
78 190
125 7
0 129
189 1
49 90
167 203
90 89
190 210
Simulation Result
Output matrix after DCT:
127
66
68
54
67
76
3
11
69
64
50
57
16
76
35
75
16
63
41
97
44
83
7
9
14
73
7
65
115
2
37
45
5
24
165
41
6
18
101
63
27
83
9
110
13
6
33
19
23
63
49
30
51
120
19
37
110
663
31
86
17
143