LOW POWER VLSI ARCHITECTURE FOR IMAGE COMPRESSION
USING DISCRETE COSINE TRANSFORM
A THESIS
Submitted by
VIJAYAPRAKASH A M
DOCTOR OF PHILOSOPHY
DECLARATION
I declare that the thesis entitled “LOW POWER VLSI ARCHITECTURE FOR IMAGE COMPRESSION USING DISCRETE COSINE TRANSFORM” is a record of the research work
carried out by me during the period from August 2004 to July 2012 under the
guidance of Dr. K. S. GURUMURTHY, and that it has not formed the basis for the
award of any degree, diploma, associateship or fellowship title in this or any other
university.
BONAFIDE CERTIFICATE
Certified that this thesis titled “LOW POWER VLSI ARCHITECTURE FOR IMAGE COMPRESSION USING DISCRETE COSINE TRANSFORM” is the bonafide work of VIJAYAPRAKASH A M and that, to the best of my
knowledge, the work reported herein does not form part of any other thesis or dissertation.
Dr K.S.GURUMURTHY
Professor
DOS in Electronics and Communication
Engineering
University Visvesvaraya College of Engineering.
Bangalore -560001
ABSTRACT
The number of bits used to represent an image is reduced to meet a bit rate requirement (below or at most
equal to the maximum available bit rate), while the quality of the reconstructed image is kept acceptable for storage and
transmission. Image compression is the process of converting an input data stream into another data
stream that has a smaller size. In image compression we reduce the irrelevant and
redundant image data in order to store or transmit the data in an efficient form. Image
coding algorithms and techniques are developed that optimize the bit rate and the
quality of the image. Image compression has applications in many fields such as
digital video, video conferencing, video over wireless networks and the internet;
compression is possible owing to the high correlation of the image data. For the increasing number of portable
wireless devices, the key design constraint is power dissipation. Limited battery life
is a concern because battery capacity does not grow as fast as the density and the operating frequency of ASICs. The ever
growing circuit densities and operating frequencies of ASICs only result in higher
power dissipation. Since early studies focused only on high throughput DCT
with variable length coders, low-power DCT and low-power variable length
coders have not received much attention. The target of multimedia systems is
moving towards portable applications like laptops, mobile phones and iPods. These
systems demand low power operation and thus require low power
functional units.
This research proposes a low power VLSI architecture for the Discrete Cosine Transform for image compression. This architecture uses row-column decomposition; partial
products and basic common computations are identified and shared to reduce the switching activity, so that
the proposed DCT consumes less power. ASIC implementation of the DCT and IDCT
architecture for two dimensional DCT and of variable length coding for image
compression is carried out. The 2-D DCT calculation is performed by using the 2-D DCT
separability property, such that the whole architecture is divided into two 1-D DCT stages. The advantages of this
processing method are a regular structure, simple control and interconnect, and a good
balance between performance and hardware complexity. Variable length coding maps input source data onto code words, assigning
short code words to symbols of high probability and long code words to those of low probability. Variable length
coding can be successfully used to relax the bit-rate requirements and storage
space for many multimedia applications. For example, a variable length coder
(VLC) is employed in MPEG-2 along with the Discrete Cosine Transform (DCT).
In this work the researcher adopts an ASIC design approach for the image
compression system. First, the system compresses the image using the DCT and
quantization. Next, variable length coding is applied to the compressed image
so that further compression is achieved; finally, the variable length decoder and IDCT are used to reconstruct the image. Image
compression algorithms require high processing power, so in this work the present
researcher concentrates on a low power VLSI design for the image
compression system while at the same time obtaining a good compression ratio.
In this research work, algorithms and architectures have been developed for low power image compression.
These algorithms have been subsequently verified and the corresponding hardware
architectures are proposed so that they are suitable for ASIC implementation.
The DCT and IDCT architecture was first coded in Matlab in order to prove
the concepts and design methodology proposed for the work. After it was
successfully coded and tested, the VLSI design of the architecture was coded in an HDL. The image
compression design was synthesized using RTL Compiler and mapped using 65 nm
node standard cells. Simulation was done using the Modelsim simulator. Detailed
analysis of power and area was done using Design Compiler (DC) from the Synopsys
EDA tool suite. The power consumption of the DCT and IDCT is limited to 0.4350 mW
and 0.5519 mW with cell areas of 34983.35 µm² and 34903.79 µm², respectively.
The variable length encoder is mapped using 90 nm node standard cells. The
physical design of the proposed hardware in this research was done using IC
Compiler.
ACKNOWLEDGEMENT
The joy and satisfaction that accompany the successful completion of
any task would be incomplete without the mention of those who made it possible. I
am grateful that I now have the opportunity to thank all those people who have
Educational and Research Institute, University for his inspiration and support
guidance, motivation, confidence and support for the speedy completion of this
thesis work.
Dr.M.G.R Educational and Research Institute, University who has given constant
colleagues for their constant encouragement and moral support. They have in
some way or the other been responsible for the successful completion of this thesis.
VijayaPrakash A M
TABLE OF CONTENTS
CHAPTER NO    TITLE    PAGE NO
ABSTRACT iv
1 INTRODUCTION
1.8.2 Quantization 18
SUMMARY 29
2 REVIEW OF LITERATURE
2.1 INTRODUCTION 30
3.8.2 Placement 60
3.8.3 Routing 61
SUMMARY 75
SUMMARY 88
SUMMARY 105
SUMMARY 133
SUMMARY 145
APPENDIX 149
REFERENCES
LIST OF PUBLICATIONS
VITAE
LIST OF TABLES
TABLE NO    TITLE    PAGE NO
5.1 Comparison between conventional and proposed RLE 97
7.1 Power and Area Parameters of VLC and VLD blocks 141
LIST OF FIGURES
FIGURE NO    TITLE    PAGE NO
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
Image data compression has been an active research area for image
processing over the last decade and has been used in a variety of applications.
Compression is the process of reducing the size of data by encoding its
information more efficiently. By doing this, the result is a reduction in the number
of bits and bytes used to store the information; in effect, it reduces the bandwidth
required for transmission and the storage requirements [97]. This
research investigates the implementation of an image data compression method
with Low power VLSI hardware that could be used in practical coding systems to
compress Image signals.
DCT based coding and decoding systems play a dominant role in real-time
applications in science and engineering like audio and Images. VLSI DCT
processor chips have become indispensable in real time coding systems because of
their fast processing speed and high reliability. JPEG has defined an international
standard for coding and compression of continuous-tone still images. This
standard is commonly referred to as the JPEG standard [55]. The primary aim of
the JPEG standard is to propose an image compression algorithm that would be
generic, application independent and aid VLSI implementation of data
compression. As the DCT core becomes a critical part in an image compression
system, close studies on its performance and implementation are worthwhile and
important. Application specific requirements are the basic concern in its design. In
the last decade the advancement in data communication techniques was significant,
during the explosive growth of the Internet the demand for using multimedia has
increased. Video and Audio data streams require a huge bandwidth to be
transferred in an uncompressed form. Several ways of compressing multimedia
streams evolved, some of them use the Discrete Cosine Transform (DCT) for
transform coding and its inverse (IDCT) for transform decoding.
Image compression is the process of reducing the amount of information into a smaller data set that can be
used to represent and reproduce the information. Types of image compression
include lossless compression and lossy compression techniques, which are used to
meet the needs of specific applications.
Removing coding and inter-pixel redundancy reduces the size of the data without losing any important information. However, this is not the
case for psycho-visual redundancy. The most obvious way to increase compression
is to reduce the coding redundancy. This refers to the entropy of an image, in
the sense that more data is used than necessary to convey the information. Lossless
redundancy removal compression techniques are classified as entropy coding.
Further compression can be obtained through inter-pixel redundancy removal. Each
pixel is highly related to its neighbors and thus can be differentially encoded
rather than sending the entire value of the pixel. Similarly, adjacent blocks have the
same property, although not to the extent of pixels.
Images contain both low frequency and high frequency components. Low
frequencies correspond to slowly varying color, whereas high frequencies
represent fine detail within the image. Intuitively, low frequencies are more
important to create a good representation of an image. Higher frequencies can
largely be ignored to a certain degree. The human eye is more sensitive to the
luminance (brightness) than to the chrominance (color difference) of an image. Thus,
during compression, chrominance values are less important and quantization can
be used to reduce the amount of psycho-visual redundancy [92, 101, 18].
Luminance data can also be quantized, but less coarsely, to ensure that important data
is not lost.
The need for image compression becomes apparent when the number of bits
per image resulting from typical sampling rates and quantization methods is
computed. For example, the amount of storage required for given images is as follows: (i) a
low resolution, TV quality, color video image of 512×512 pixels/color,
8 bits/pixel and 3 colors consists of approximately 6×10⁶ bits; (ii) a 24×36 mm
negative photograph scanned at 12×10⁻⁶ m (3000×2000 pixels/color, 8
bits/pixel, 3 colors) contains nearly 144×10⁶ bits; (iii) a 14×17 inch radiograph
scanned at 70×10⁻⁶ m (5000×6000 pixels, 12 bits/pixel) contains nearly 360×10⁶
bits. Thus the storage of even a few images could cause a problem. As another
example of the need for image compression, consider the transmission of a low
resolution 512×512 pixel, 8 bits/pixel, 3-color video image over telephone lines. Using
a 9600 baud (bits/sec) modem, the transmission would take approximately 11
minutes for just a single image, which is unacceptable for most applications.
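The 11-minute figure follows directly from the numbers above (a quick check, assuming the modem delivers its nominal bit rate):

\frac{512 \times 512 \times 8 \times 3\ \text{bits}}{9600\ \text{bits/s}} \approx \frac{6.29 \times 10^{6}\ \text{bits}}{9600\ \text{bits/s}} \approx 655\ \text{s} \approx 11\ \text{minutes}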
This refers to the entropy of the image, in the sense that more data is used than
necessary to convey the information. This can be overcome by variable length
coding. Examples of image coding schemes that explore coding redundancy are
Huffman codes and Arithmetic coding technique.
Inter-pixel redundancy results from correlated pixels, in other words, large regions where the pixel values are the same
or almost the same. Examples of compression techniques that exploit inter-pixel
redundancy include Constant Area Coding, Run Length Encoding and many
predictive coding algorithms such as Differential Pulse Code Modulation.
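The sketch below is a minimal, hypothetical Python/NumPy illustration of differential (DPCM-style) encoding of one pixel row; the function names are illustrative and it is not part of the hardware proposed in this thesis.

```python
import numpy as np

def dpcm_encode(row):
    """Differential encoding of one pixel row: keep the first pixel,
    then store only the difference to the previous pixel."""
    row = np.asarray(row, dtype=np.int16)      # widen so negative differences fit
    diffs = np.empty_like(row)
    diffs[0] = row[0]
    diffs[1:] = row[1:] - row[:-1]
    return diffs

def dpcm_decode(diffs):
    """Inverse operation: a running sum restores the original pixels exactly."""
    return np.cumsum(diffs).astype(np.int16)

pixels = [120, 121, 121, 123, 122, 122, 125, 130]    # neighbouring pixels vary slowly
code = dpcm_encode(pixels)                           # -> [120, 1, 0, 2, -1, 0, 3, 5]
assert list(dpcm_decode(code)) == pixels
```

Because the differences are small for typical image data, they can be coded with far fewer bits than the raw pixel values.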
This refers to the fact that the human visual system interprets an image
in such a way that removal of this redundancy produces an image that is nearly
indistinguishable from the original to human viewers. The main way to reduce this type of
redundancy is through quantization. Psycho-visual properties are exploited based on
studies performed on the human visual system. Most of the image coding
algorithms in use today exploit this type of redundancy, such as the Discrete Cosine
Transform (DCT) based algorithm at the heart of the JPEG encoding standard.
The JPEG standard allows for both lossy and lossless encoding of still
images. The algorithm for lossy coding is a discrete cosine transform (DCT)
based coding scheme. This is the baseline of JPEG and is sufficient for many
applications. However, to meet the needs of applications that cannot tolerate loss,
e.g., compression of medical images, a lossless coding scheme is also provided and
is based on a predictive coding scheme. From the algorithmic point of view, JPEG
includes four distinct modes of operation: sequential DCT-based mode,
progressive DCT-based mode, lossless mode, and hierarchical mode.
The sequential DCT based mode of operation comprises the baseline JPEG
algorithm. This technique can produce very good compression ratios while
sacrificing some image quality. The sequential DCT based mode achieves much of its
compression through quantization, which removes entropy from the data set.
Although this baseline algorithm is transform based, it does use some measure of
predictive coding called the differential pulse code modulation (DPCM) [8]. After
each input 8x8 block of pixels is transformed to frequency space using the DCT,
the resulting block contains a single DC component, and 63 AC components. The
DC component is predictively encoded through a difference between the current
DC value and the previous. This mode only uses Huffman coding models, not
arithmetic coding models which are used in JPEG extensions. This mode is the
most basic, but still has a wide acceptance for its high compression ratios, which
can fit many general applications very well.
However, in the progressive mode, the quantized DCT coefficients are first
stored in a buffer before the encoding is performed. The DCT coefficients in the
buffer are then encoded by a multiple scanning process. In each scan, the quantized
DCT coefficients are partially encoded by either spectral selection or successive
approximation. In the method of spectral selection, the quantized DCT coefficients
are divided into multiple spectral bands according to a zigzag order. In each scan, a
specified band is encoded. In the method of successive approximation, a specified
number of most significant bits of the quantized coefficients are first encoded,
followed by the least significant bits in later scans. The difference between
sequential coding and progressive coding is shown in Figure 1.1. In the sequential
coding an image is encoded part-by-part according to the scanning order while in
the progressive coding, the image is encoded by multi scanning process and in
each scan the full image is encoded to a certain quality level.
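To make the zig-zag order mentioned above concrete, the following Python sketch (a generic illustration with hypothetical function names, not the scan hardware discussed later in the thesis) generates the zig-zag index sequence and uses it to flatten a coefficient block:

```python
import numpy as np

def zigzag_order(n=8):
    """Return (row, col) index pairs of an n x n block in zig-zag order:
    anti-diagonals (constant row+col) are traversed in alternating directions."""
    order = []
    for s in range(2 * n - 1):                        # s = row + col
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 else diag[::-1])   # reverse every other diagonal
    return order

def zigzag_scan(block):
    """Flatten a 2-D coefficient block into its 1-D zig-zag sequence."""
    block = np.asarray(block)
    return np.array([block[r, c] for r, c in zigzag_order(block.shape[0])])

demo = np.arange(16).reshape(4, 4)
print(zigzag_scan(demo))    # [ 0  1  4  8  5  2  3  6  9 12 13 10  7 11 14 15]
```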
Figure: DCT-based encoder — the 8×8 input image block is transformed by the DCT and quantized using quantization tables to give the quantized DCT coefficients.
The DCT process is shown in Figure 1.3. The input image is divided into
non-overlapping blocks of 8 x 8 pixels, and input to the baseline encoder. The
pixel values are converted from unsigned integer format to signed integer format,
and DCT computation is performed on each block. DCT transforms the pixel data
into a block of spatial frequencies that are called the DCT coefficients. Since the
pixels in the 8 x 8 neighborhood typically have small variations in gray levels, the
output of the DCT will result in most of the block energy being stored in the lower
spatial frequencies [80, 83, 60]. On the other hand, the higher frequencies will
have values equal to or close to zero and hence, can be ignored during encoding
without significantly affecting the image quality.
The 2-D DCT transforms an 8x8 block of spatial data samples into an 8x8
block of spatial frequency components. The IDCT performs the inverse of DCT,
transforming spatial frequency components back into the spatial domain. Figure
1.4 shows the frequency components represented by each coefficient in the output
matrix. The low frequency coefficients occur in the top left side of the output
matrix, while the remaining
higher frequency coefficients occur in the bottom right side. The DC coefficient at
position (0,0) gives an idea of the average intensity (for luminance blocks) or hue
(for chrominance blocks) of an entire block. Moving horizontally from position
(0,0) to position (0,7), the coefficients give the contributions of increasing vertical
frequency components to the overall 8x8 block. The coefficients from position
(1,0) to position (7,0) have similar meaning for horizontal frequency components.
Moving diagonally through the matrix gives the combined contribution of
horizontal and vertical frequency components. The original block is rebuilt by the
IDCT with these discrete frequency components. High frequency coefficients have
small magnitude for typical image data, which usually does not change
dramatically between neighboring pixels. Additionally, the human eye is not as
sensitive to high frequencies as to low frequencies. It is difficult for the human eye
to discern changes in intensity or colour that occur between successive pixels. The
human eye tends to blur these rapid changes into an average hue and intensity.
However, gradual changes over the 8 pixels in a block are much more discernible
than rapid changes. When the DCT is used for compression purposes, the quantizer
unit attempts to force the insignificant high frequency coefficients to zero while
retaining the important low frequency coefficients.
DCT transforms the information from the time or space domains to the
frequency domain, such that other tools and transmission media can be run or used
more efficiently to reach application goals: compact representation [13], fast
transmission, memory savings, and so on. The JPEG image compression standard
was developed by Joint Photographic Expert Group. The JPEG compression
principle is the use of controllable losses to reach high compression rates. In this
context, the information is transformed to the frequency domain through DCT.
Since neighbor pixels in an image have high likelihood of showing small
variations in color, the DCT output will group the higher amplitudes in the lower
spatial frequencies. Then, the higher spatial frequencies can be discarded,
generating a high compression rate and a small perceptible loss in the image
quality. The JPEG compression is recommended for photographic images, since
drawing images are richer in high frequency areas that are distorted with the
application of the JPEG compression.
The name "JPEG" stands for Joint Photographic Experts Group. JPEG is a
method of lossy compression for digitized photographic images. JPEG can achieve
a good compression with little perceptible loss in image quality. It works with
color and grayscale images and finds applications in satellite imaging, medical imaging, etc.
• Image/Block Preparation
• Quantization
• Entropy Coding
Figure: JPEG compression and decompression models — compression: Original image → DCT → Quantization → Encoder → Compressed image; decompression: Compressed image → Decoder → Dequantization → IDCT → Reconstructed image.
It transforms the input data into a format to reduce inter pixel redundancies
in the input image. Transform coding techniques use a reversible, linear
mathematical transform to map the pixel values onto a set of coefficients, which
are then quantized and encoded. The key factor behind the success of transform-
based coding schemes is that many of the resulting coefficients for most natural
images have small magnitudes and can be quantized without causing significant
distortion in the decoded image. For compression purposes, the greater a transform's
ability to pack information into fewer coefficients, the better the
transform; for that reason, the Discrete Cosine Transform (DCT) and Discrete
Wavelet Transform (DWT) have become the most widely used transform coding
techniques.
In order to make the data suitable for the discrete cosine transform, each pixel value
is level shifted by subtracting 128 from its value. The result is 8-bit pixels
that have the range −128 to 127, making the data symmetric about 0. This is
good for the DCT, as any symmetry that is exposed will lead towards better entropy
compression [22, 90]. Effectively this shifts the DC coefficient to fall more in line
with the value of the AC coefficients. The AC coefficients produced by the DCT
are not affected in any way by this level shifting.
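As a quick illustration of the level-shifting step, here is a Python/SciPy sketch with hypothetical names (used only to check the behaviour; the thesis hardware computes the DCT with the proposed architecture, not SciPy):

```python
import numpy as np
from scipy.fft import dctn

def level_shift_and_dct(block_u8):
    """Subtract 128 from an 8x8 block of unsigned 8-bit pixels,
    then apply the orthonormal 2-D DCT."""
    shifted = block_u8.astype(np.int16) - 128          # range becomes -128 ... 127
    return dctn(shifted, norm='ortho')

block = np.full((8, 8), 130, dtype=np.uint8)            # a flat, mid-grey block
coeffs = level_shift_and_dct(block)
print(round(coeffs[0, 0], 3))                           # DC term: 8 * (130 - 128) = 16.0
print(np.allclose(coeffs.flat[1:], 0))                  # every AC coefficient is zero
```

Only the DC coefficient is moved by the shift; the AC coefficients are unchanged, as noted above.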
1.8.2 Quantization
The human eye responds to the DC coefficient and the lower spatial
frequency coefficients. If the magnitude of a higher frequency coefficient is below
a certain threshold, the eye will not detect it. During quantization, the frequency
coefficients in the transformed matrix whose amplitudes are less than a defined
threshold are therefore set to zero; these coefficients cannot be recovered during
decoding, so quantization is the lossy step of the coding chain.
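A minimal Python sketch of this table-based quantization follows (the table shown is the familiar JPEG example luminance table, reproduced purely for illustration; the tables used in the proposed design may differ):

```python
import numpy as np

# Example JPEG luminance quantization table (illustrative).
Q_LUM = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

def quantize(dct_block, q=Q_LUM):
    """Element-wise division by the table followed by rounding;
    small high-frequency coefficients collapse to zero."""
    return np.round(dct_block / q).astype(np.int32)

def dequantize(q_block, q=Q_LUM):
    """Decoder-side approximation; the rounding loss cannot be recovered."""
    return q_block * q
```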
Run length coding is the first step in entropy coding. The idea is simple: a code
consisting of a run length and a size is assigned to every non-zero value
in the quantized data stream. The run length is a count of the zero values occurring before the
non-zero value. The size is a category given to the non-zero value which
is used to recover the value later. The DC value of the block is omitted in this
process [69, 25]. Additionally, with every non-zero value a magnitude is generated
which determines the number of bits necessary to reconstruct the value; it
indicates the range of values within the size category that the actual value can take. Run length
coding is a basic form of lossless compression; essentially, it is a compact way of
representing the long runs of zeros produced by quantization.
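A minimal Python sketch of the run/size idea described above (illustrative only; it omits the 16-zero ZRL handling and the Huffman coding of the run/size symbols used in a full JPEG coder):

```python
def run_length_encode(ac_coeffs):
    """Encode the AC coefficients of one block as (run, size, value) triples:
    run counts the zeros preceding each non-zero value and size is the number
    of bits needed to represent that value."""
    triples, run = [], 0
    for v in ac_coeffs:
        if v == 0:
            run += 1
        else:
            size = abs(v).bit_length()        # bits needed to reconstruct the value
            triples.append((run, size, v))
            run = 0
    triples.append((0, 0, 0))                 # end-of-block marker
    return triples

# Zig-zag-scanned AC coefficients of a typical quantized block (DC value omitted).
ac = [5, -2, 0, 0, 1, 0, 0, 0, -1] + [0] * 54
print(run_length_encode(ac))
# [(0, 3, 5), (0, 2, -2), (2, 1, 1), (3, 1, -1), (0, 0, 0)]
```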
Figure 1.6 shows the decompression model of the image, in which the
reverse operations of the compression model are performed so that the original image
can be reconstructed. Applications of image compression include the following:
• Surveillance systems
The 1-D discrete cosine transform of a sequence f(x), x = 0, 1, 2, …, N−1, is defined as

C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x)\cos\left[\frac{(2x+1)u\pi}{2N}\right], \quad u = 0, 1, 2, \dots, N-1   …… (1.1)

and the inverse transform is

f(x) = \sum_{u=0}^{N-1} \alpha(u)\, C(u)\cos\left[\frac{(2x+1)u\pi}{2N}\right]   …… (1.2)

for x = 0, 1, 2, …, N−1. In both equations (1.1) and (1.2), α(u) is defined as

\alpha(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases}   …… (1.3)

For u = 0,

C(u=0) = \sqrt{\frac{1}{N}} \sum_{x=0}^{N-1} f(x)   ….. (1.4)
Thus, the first transform coefficient is the average value of the sample
sequence. In literature, this value is referred to as the DC Coefficient. All other
transform coefficients are called the AC Coefficients.
The 2-D DCT is a direct extension of the 1-Dimensional DCT and is given
by
C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)\cos\left[\frac{(2x+1)u\pi}{2N}\right]\cos\left[\frac{(2y+1)v\pi}{2N}\right]   …. (1.5)

f(x,y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\, C(u,v)\cos\left[\frac{(2x+1)u\pi}{2N}\right]\cos\left[\frac{(2y+1)v\pi}{2N}\right]   …. (1.6)
Equation (1.6) defines the inverse transform for x, y = 0, 1, 2, …, N−1. The 2-D basis
functions can be generated by multiplying the horizontally oriented 1-D basis
functions with a vertically oriented set of the same functions.
Different architectures are available to compute the 2-D DCT of an image
matrix; some of these architectures are discussed in the next section.
The implementation of the 2-D DCT directly from the theoretical equation
results in 1024 multiplications and 896 additions. Fast algorithms exploit the
symmetry within the DCT to achieve dramatic computational savings.
Although the direct 2-D approach results in an irregular relationship between the inputs and outputs of the system, the savings in
computational power may be significant with the use of certain 1-D DCT
algorithms. With this direct approach, large chunks of the design cannot be reused
to the same extent as in the conventional row-column decomposition approach.
Thus, the direct approach will lead to more hardware, more complex control, and
much more intensive debugging. Although the direct approach uses 278 fewer
additions than the row-column approach, it has much greater complexity.
Therefore, the number of computations alone could not determine which
implementation would result in the lowest power design.
With an increase in resources, the result of the RAC (ROM and accumulator) can converge more quickly. The
speed of the calculation is increased by replicating the LUT. In a fully parallel
approach, the result of the DA converges at the maximum speed, i.e., at the clock rate. In this
case, the LUT must be replicated as many times as there are input bits.
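The following Python sketch illustrates the distributed-arithmetic idea in its simplest bit-serial, unsigned-input form; real DCT hardware uses offset-binary or two's-complement handling and replicates the LUT for the parallel case, so this is only a conceptual model with hypothetical names:

```python
def da_lut(coeffs):
    """Precompute the DA look-up table: entry j holds the sum of the
    coefficients selected by the bits of j."""
    return [sum(c for k, c in enumerate(coeffs) if (j >> k) & 1)
            for j in range(1 << len(coeffs))]

def da_inner_product(coeffs, samples, nbits=8):
    """Bit-serial inner product sum(c_k * x_k) for unsigned nbits-wide inputs:
    one shifted LUT output is accumulated per bit plane (one per clock in hardware)."""
    lut = da_lut(coeffs)
    acc = 0
    for b in range(nbits):                      # one iteration per bit plane
        addr = 0
        for k, x in enumerate(samples):         # gather bit b of every input sample
            addr |= ((x >> b) & 1) << k
        acc += lut[addr] << b                   # shift-and-accumulate (the RAC operation)
    return acc

coeffs, samples = [3, -1, 4, 2], [17, 5, 200, 33]
assert da_inner_product(coeffs, samples) == sum(c * x for c, x in zip(coeffs, samples))
```

Replicating the LUT so that all bit planes are processed in the same cycle gives the fully parallel case mentioned above.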
D(i,j) = \frac{1}{\sqrt{2N}}\, C(i)\, C(j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} p(x,y)\cos\left[\frac{(2x+1)i\pi}{2N}\right]\cos\left[\frac{(2y+1)j\pi}{2N}\right]   … (1.7)
This property, known as separability, has the principal advantage that D(i, j) can be computed in two steps by successive 1-D operations on the rows and
columns of an image. The arguments presented can be identically applied to the
inverse DCT computation.
Symmetry: Inspection of the row and column operations in the DCT equation reveals that these
operations are functionally identical. Such a transformation is called a symmetric
transformation. A separable and symmetric transform can be expressed in the form

D = T M T′   … (1.8)

where M is the image block, T is the transform matrix and T′ is its transpose.
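To make the separability and symmetry properties concrete, the following NumPy check (illustrative only, not the proposed hardware) builds the orthonormal DCT matrix of equations (1.1)–(1.3), evaluates D = T M T′, recomputes the same result by successive 1-D passes over rows and columns, and compares both against SciPy's reference DCT:

```python
import numpy as np
from scipy.fft import dctn

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix T: T[u, x] = alpha(u) * cos((2x+1) * u * pi / (2n))."""
    x = np.arange(n)
    T = np.cos((2 * x[None, :] + 1) * x[:, None] * np.pi / (2 * n))
    T[0, :] *= np.sqrt(1.0 / n)
    T[1:, :] *= np.sqrt(2.0 / n)
    return T

rng = np.random.default_rng(0)
M = rng.integers(0, 256, size=(8, 8)).astype(float)   # an arbitrary 8x8 image block

T = dct_matrix()
D_matrix = T @ M @ T.T               # separable / symmetric form D = T M T'
D_rowcol = (T @ (T @ M).T).T         # same result via successive 1-D row and column passes
D_ref = dctn(M, norm='ortho')        # SciPy reference

print(np.allclose(D_matrix, D_ref), np.allclose(D_rowcol, D_ref))   # True True
```

The row-column form is exactly the decomposition exploited by the two-stage 1-D DCT architecture proposed later in the thesis.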
This thesis consists of eight chapters, an appendix and references. The
framework of the thesis is as follows. Chapter 1 gives an introduction to
the DCT and VLC and also discusses different DCT architectures. In Chapter 2,
the researcher discusses the literature related to the present research.
Chapter 3 describes the VLSI design flow, the importance of low power and
different techniques for low power VLSI design. Chapter 4 describes the proposed
low power VLSI architecture for the DCT and IDCT. In Chapter 5, the present
researcher discusses the low power architecture for the VLC and VLD.
Chapters 6 and 7 present the results and discussion of the present research
towards achieving good compression with a low power approach. Chapter 8 deals with
the conclusions of the present thesis and also possibilities for future work.
SUMMARY
CHAPTER 2
REVIEW OF LITERATURE
2.1 INTRODUCTION
Image and Video Compression has been a very active field of research and
development for over 20 years and many different systems and algorithms for
compression and decompression have been proposed and developed. In order to
encourage interworking, competition and increased choice, it has been necessary to
define standard methods of compression, encoding and decoding to allow products
from different manufacturers to communicate effectively. This has led to
development of a number of key international standards for image and video
compression, including the JPEG, MPEG and H.26X series of standards.
The Discrete Cosine Transform (DCT) was first proposed by Ahmed et al.
(1974), and it has been more and more important in recent years. DCT has been
widely used in signal processing of image data, especially in coding for
compression, for its near-optimal performance. Because of the widespread use of
DCTs, research into fast algorithms for their implementation has been rather
active.
1. Ricardo Castellanos, Hari Kalva and Ravi Shankar (2009), “Low Power
DCT using Highly Scalable Multipliers”, 16th IEEE International
Conference on Image Processing, pp.1925-1928.
In this paper the authors have implemented a low power DCT using highly
scalable multiplier. Scalable multipliers are used since the width of the operand bit
varies in each stage of DCT implementation. Due to the use of the variable sized
multipliers power saving is obtained. A highly scalable multiplier (HSM) allows
dynamic configuration of multiplier for each stage. The authors have calculated
PSNR and SSIM on a set of images which are JPEG encoded based on the use of
scalable multiplier. The authors conclude that the use of a variable size highly
scalable multiplier (HSM) reduces the power consumption compared to the other
approaches considered.
A low power and high speed DCT for image compression is implemented
on FPGA. The DCT optimization is achieved through the hardware simplification
of the multipliers used to compute the DCT coefficients. In this work the authors
have implemented the DCT with constant multiplier by making use of Canonic
signed digit encoding to perform constant multiplication. The canonic signed digit
representation is the signed data representation containing the fewest number of
nonzero bits. Thus for the constant multipliers, the number of addition and
subtraction will be minimum. A common sub-expression elimination technique has
also been used for further DCT optimization, thus reducing the overall hardware complexity.
In this paper a new modified flow graph algorithm (FGA) for the DCT, based on
the Loeffler algorithm with a hardware implementation of multiplier-less operation, has been
proposed. The proposed FGA uses unsigned constant coefficient multiplication.
The multiplier-less method is widely used for VLSI realization because of
improvements in speed, area overhead and power consumption. The Loeffler fast
algorithm is used to implement the 1-D DCT, and two 1-D DCT steps are used to generate
the 2-D DCT coefficients.
6. Sunil Bhooshan, Shipra Sharma (2009), “An Efficient and Selective Image
Compression Scheme using Huffman and Adaptive Interpolation”, 24th
In this paper authors have made use of lossy and lossless compression
techniques. Different blocks are compressed in one of the ways depending on the
information content in that block. The process of compression begins with passing
the image through a high pass filter, and the image matrix is divided into a number
of non-overlapping sub-blocks. Each of the sub-blocks is checked for the number
of zeros by setting a threshold. If the number of zeros in a particular block is more
than the threshold, it implies that the block contains less information and that
particular block from the original image matrix is taken for lossy compression
using Adaptive Interpolation. On the other hand if the number of zeros is less than
the threshold, then it implies that the information content is more and thus the
corresponding block of the original image matrix is subjected to lossless
compression using Huffman coding. The authors conclude that the computational
complexity of this approach is less and the compression ratios obtained by this
method are also high.
In this paper the authors have proposed data compression for mathematical text
files based on Adaptive Huffman coding. The compression ratio obtained is higher
than that of Adaptive Huffman coding alone. In the method used by the authors, the
encoding process encodes the frequently occurring characters with
shorter bit codes and the infrequently occurring characters with longer bit codes.
The algorithm proceeds as follows: subgroups of 256 characters are made, and these
subgroups are again grouped into three groups of alphabets, numbers and some
operators, and the remaining symbols. The symbols of each group are arranged in
order of their frequency of occurrence.
In this paper the authors have presented the techniques for minimizing the
complexity of multiplication by employing a Differential Pixel Image (DPI). The DPI is
the matrix obtained by taking the difference of the intensities of adjacent pixels in the
input image matrix. The use of DPI instead of the original image matrix results in
significant reduction in the number of operations and hence the power consumed.
The intensity of the pixel in the DPI is obtained as fd (x,y)= [f(x,y) – f(x,y-1)],
where f(x,y) is the intensity at (x,y) in the original image. Also the intensity of the
first pixel of every sub-block of the DPI will be the same as the original matrix. In
this work the DCT coefficient matrix is represented using canonic signed digits.
The authors have also used a common sub-expression elimination method in which
multiple occurrences of identical bit patterns are identified in the DCT matrix, so that
the corresponding partial results can be shared, thereby reducing the resources required.
In this paper the authors have compared the performance of 4x4 DCT with
8x8 DCT, since small size DCT is suitable for mobile applications using low
power devices as fast computation speed is required for real time applications. The
authors have compared the performance of 4x4 transforms with the conventional
8x8 DCT in floating point. Firstly, the authors have compared the conventional
4x4 DCT in floating point with conventional 8x8 DCT in floating point. Next, the
4x4 integer transform is compared with the conventional 8x8 DCT in floating
point. The comparison was done on computation time of the transform and inverse
transform and objective quality, based on the calculation of PSNR between input
and reconstructed image. The authors have concluded that the integer transform
approximation of the DCT will reduce the computational time considerably.
In this paper the authors have designed an architecture for the DCT based on the
Loeffler scheme. The 1-D DCT is calculated for only the first two terms and the
remaining six terms are taken as zero. The architecture takes in 8 pixels as input every
clock cycle and generates only 2 outputs against the 8 outputs of the traditional
Loeffler DCT. Thus the architecture needs only 4 multipliers and 14 adders. The
adders used are carry select adders and the multipliers used are high performance
multipliers.
This paper explores a new approach to obtain the DCT using flow graph
algorithm. All of the multiplications are merged in the quantization block. To
avoid the reduction in operating frequency caused by the division in the
quantization process, all of the elements in the quantization matrix are represented
by their nearest powers of 2. The authors have shown that this method outperforms the
other approaches considered in terms of operating frequency and resource requirements.
14. Chi-Chia Sun, Benjamin Heyne, Juergen Goetze (2006), “A Low Power
and High Quality CORDIC based Loeffler DCT”, IEEE conference.
15. Hai Huang, Tze-Yun Sung, Yaw-Shih Shieh (2010), “A Novel VLSI
Linear Array for 2-D DCT/IDCT”, IEEE 3rd International Congress on Image and
Signal Processing, pp.3680-3690.
This paper proposes efficient 1-D DCT and IDCT architectures using a
sub-band decomposition algorithm. The orthonormal property of the DCT/IDCT
transformation matrices is fully used to simplify the hardware complexity. The
resulting DCT and IDCT architectures have low hardware complexity, are fully
pipelined, and are scalable for variable length 2-D DCT/IDCT computation.
The proposed architecture requires 3 multipliers and 21 adders. In addition, the
proposed architecture is highly regular, scalable, and flexible.
17. Tze-Yun Sung, Yaw-Shih Shieh, Chun-Wang Yu and Hsi-Chin Hsin (2006),
“High Efficiency and Low Power Architectures for 2-D DCT and IDCT
Based on CORDIC Rotation”, Seventh International Conference on Parallel and
Distributed Computing, Applications and Technologies (PDCAT), pp.191-196.
Multiplication is the key operation for both DCT and IDCT. In the
CORDIC based processor, multipliers can be replaced by simple shifters and
adders. The double rotation CORDIC algorithm has even better latency compared to
the conventional CORDIC based algorithm. Hardware implementation of the 8-point 2-D
DCT requires two SRAM banks (128 words), two 8-point DCT/IDCT processors,
two multiplexers and a control unit. By taking into account the symmetry
properties of the fast DCT/IDCT algorithm, high efficiency architectures with a
parallel-pipelined structure have been proposed to implement the DCT and IDCT
processors. In the constituent 1-D DCT/IDCT processors, the double rotation
CORDIC algorithm with rotation mode in the circular co-ordinate system has been
utilised for the arithmetic unit for both DCT/IDCT i.e. multiplication computation.
Thus they are very much suited to VLSI implementation with design tradeoffs.
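For readers unfamiliar with CORDIC, the short Python sketch below shows the generic rotation-mode iteration in which multiplications are replaced by shifts and adds (a floating-point model of the general idea only, not the double-rotation architecture of the reviewed paper):

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate the vector (x, y) by 'angle' radians using the CORDIC iteration:
    each step needs only shifts by 2**-i and additions; the constant gain is
    compensated by the factor K at the end."""
    K = 1.0
    for i in range(iterations):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))         # accumulated gain correction
    for i in range(iterations):
        d = 1.0 if angle >= 0 else -1.0               # rotate towards the remaining angle
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        angle -= d * math.atan(2.0 ** (-i))
    return x * K, y * K

# Rotating (1, 0) by 30 degrees approaches (cos 30, sin 30) = (0.866, 0.500).
print(cordic_rotate(1.0, 0.0, math.radians(30)))
```

In hardware, the factors 2**-i become simple wire shifts, which is why CORDIC-based DCT processors can dispense with multipliers.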
18. Gopal Lakhani (2004), “Optimal Huffman Coding of DCT Blocks”, IEEE
transactions on circuits and systems for video technology, Vol.14, issue.4. pp 522-
527.
This paper presents an effective method to convert any floating-point 1-D DCT into
an approximate multiplier-less version with shift and add operations. It converts
AAN's fast DCT algorithm to its multiplier-less version. Experimental results
show that AAN's fast DCT algorithm, approximated by the proposed method
using an optimized configuration, can be used to reconstruct images with high
visual quality in terms of peak signal to noise ratio (PSNR). The constant
coefficients are represented in MSD form. All the butterflies present in AAN's
algorithm are converted to lifting structures before the proposed method is applied.
20. Kamrul Hasan Talukder and Koichi Harada (2007), “Discrete Wavelet
Transform for Image Compression and A Model of Parallel Image Compression
Scheme for Formal Verification”, Proceedings of the World Congress on
Engineering.
The use of discrete wavelet for image compression and a model of the
scheme of verification of parallelizing the compression have been presented in this
paper. We know that wavelet transform exploits both the spatial and frequency
correlation of data by dilations (or contractions) and translations of mother wavelet
on the input data. It supports the multi resolution analysis of data i.e. it can be
applied to different scales according to the details required, which allows
progressive transmission and zooming of the image without the need of extra
storage. The characteristics of the DWT are therefore well suited to image compression
and include the ability to take account of the Human Visual System's (HVS)
characteristics, very good energy compaction capability, robustness under
transmission, a high compression ratio, etc. The implementation of a wavelet
compression scheme is very similar to that of a subband coding scheme: the signal is
decomposed using filter banks. The output of the filter banks is down-sampled,
quantized, and encoded. The decoder decodes the coded representation, up-
samples and recomposes the signal. A model for parallelizing the compression
technique has also been proposed here.
21. Abdullah Al Muhit, Md. Shabiul Islam and Masuri Othman (2004), “VLSI
Implementation of Discrete Wavelet Transform (DWT) for Image Compression”,
2nd International conference on Autonomous Robots and Agents, New Zealand.
December pp.391-395.
The authors have presented a graph-based R-D optimal algorithm for JPEG
run-length coding. It finds the optimal run size pairs in the R-D sense among all
possible candidates. Based on this algorithm, they have proposed an iterative
algorithm to optimize run-length coding, Huffman coding and quantization table
jointly. The proposed iterative joint optimization algorithm results in up to 30% bit
rate compression improvement for the test images, compared to baseline JPEG.
The algorithms are not only computationally efficient but completely compatible
with existing JPEG and MPEG decoders. They can be applied to the application
areas such as web image acceleration, digital camera image compression, MPEG
frame optimization and transcoding, etc.
This paper suggests a new image compression scheme, using the discrete
wavelet transformation (DWT), which is based on attempting to preserve the
texturally important image characteristics. The main point of the proposed
methodology is that the image is divided into regions of textural significance,
employing textural descriptors as criteria and fuzzy clustering methodologies.
These textural descriptors include co-occurrence matrices based measures and
coherence analysis derived features. More specifically, the DWT is applied
separately to each region in which the original image is partitioned and, depending
on how it has been texturally clustered, the relative number of wavelet
coefficients to keep is then determined. Therefore, different compression ratios are
applied to the image regions. The reconstruction process of the original image
involves the linear combination of its corresponding reconstructed regions.
24. Muhammad Bilal Akhtar, Adil Masoud Qureshi, Qamar-Ul-Islam (2011),
“Optimized Run Length Coding for JPEG Image Compression Used in Space
Research Program of IST”, IEEE International Conference on Computer
Networks and Information Technology (ICCNIT), pp.81-85.
In this paper the authors have proposed a new scheme for run length coding
to minimize the error during transmission. The optimized run length coding
uses a (RUN, LEVELS) pair only when a pattern of consecutive zeros occurs at
the input of the encoder. The non-zero digits are encoded as their respective values
in the LEVELS parameter. The RUN parameter is eliminated from the final encoded
message for non-zero digits.
In this paper the author has given a brief description of the JPEG image
compression standard, which uses both lossy and lossless compression methods.
For JPEG images the lossy compression is based on DCT followed by
quantization. The lossless method is based on entropy coding which is a
completely reversible process. The procedures of run length coding are also given
by the author.
27. Jason McNeely and Magdi Bayoumi (2007), “Low Power Look-Up Tables
for Huffman Decoding”, IEEE International Conference on Image Processing,
pp.465-468.
those situations was found to be 78%. The work also shows the effect of varying
the table size and varying the probability distributions of a table on power, area
and delay.
28. Bao Ergude, Li Weisheng, Fan Dongrui, Ma Xiaoyu (2008), “A Study and
Implementation of the Huffman Algorithm Based on Condensed Huffman Table”,
IEEE International Conference on Computer Science and Software Engineering,
pp.42-45.
They have used the property of canonical Huffman tree to study and
implement a new Huffman algorithm based on condensed Huffman table, which
greatly reduces the expense of Huffman table and increases the compression ratio.
The binary sequence in the paper requires only a small space and under some
special circumstances, the level without leaf is marked to be 1, which can further
reduce the required size.
30. Reza Hashemian (2003), “Direct Huffman Coding and Decoding Using the
Table of Code-Lengths”, IEEE International Conference on Information
Technology: Coding and Computing, pp.237-241.
31. Jia-Yu Lin, Ying Liu, and Ke-Chu Yi(2004), “Balance of 0,1 Bits for
Huffman and Reversible Variable-Length Coding”, IEEE Journal on
Communications, pp. 359-361.
32. Da An, Xin Tong, Bingqiang Zhu and Yun He (2009), “A Novel Fast DCT
Coefficient Scan Architecture”, IEEE Picture Coding Symposium I, Beijing
100084, China, pp.1-4.
A novel, fast and configurable architecture for zigzag scan and optional
scans in multiple video coding standards, including H.261, MPEG-1,2,4,
H.264/AVC, and AVS is proposed. Arbitrary scan patterns could be supported by
configuring the ROM data, and the architecture can largely reduce the processing
cycles. The experimental results show the proposed architecture is able to reduce
up to 80% of total scanning cycles on average.
33. Pablo Montero, Javier Taibo Gulias, Samuel Rivas (2010), “Parallel
Zigzag Scanning and Huffman Coding for a GPU-Based MPEG-2 Encoder”, IEEE
International Symposium on multimedia pp.97-104.
This work describes three approaches to compute the zigzag scan, run-
level, and Huffman codes in a GPU based MPEG-2 encoder. The most efficient
method exploits the parallel configuration used for DCT computation and
quantization in the GPU using the same threads to perform the last encoding steps:
zigzag scan and Huffman coding. In the experimental results, the optimized
version averaged a 10% reduction of the compression time, including the
transference to the CPU.
34. Pei-Yin Chen, Yi-Ming Lin and Min-Yi Cho (2008), “An
Efficient Design of Variable Length Decoder for MPEG-1/2/4”, IEEE
Transactions on Multimedia, Vol.16, Issue.9, pp.1307-1315.
35. Basant K. Mohanty and Pramod K. Meher (2010), “Parallel and Pipelined
Architectures for High Throughput Computation of Multilevel 3-D DWT”.
The computation is divided into distinct stages, and all three stages are implemented in parallel by a processing
unit consisting of an array of processing modules. The throughput rate of the
proposed structure can easily be scaled without increasing the on-chip storage and
frame memory by using a larger number of processing modules, and it provides a
greater advantage over the existing designs for higher frame rates and larger input
block sizes. The fully parallel implementation of the proposed scalable structure provides
the best performance.
36. Anirban Das, Anindya Hazra and Swapna Banerjee (2010), “An Efficient
Architecture for 3-D Discrete Wavelet Transform (DWT)”.
38. B. Das and Swapna Banerjee (2002), “Low power architecture of running
3-D wavelet transform for medical imaging application”.
In this paper a real-time 3-D DWT algorithm and its architectural realization
are proposed. A reduced buffer requirement and low wait time are the salient features which
make it fit for bidirectional video conferencing and, in particular, real-time
biomedical applications. Reduced hardware complexity and 100% hardware
utilization are ensured in this design. The architecture is implemented in 0.25 µm
BiCMOS technology.
39. B.Das and Swapna Banerjee (2003), “A Memory Efficient 3-D DWT
Architecture”.
This paper proposes a memory efficient real-time 3-D DWT algorithm and
its architectural implementation. Parallelism, being an added advantage for fast
processing, has been used with three pipelined stages in this architecture. The
architecture proposed here is memory efficient and has a high throughput rate of 1
result per clock, with a low latency period. Daubechies wavelet filters are used
for coefficient mapping, exploiting the correlation between the low pass and high pass filters.
The 3-D DWT has been implemented for the 8-tap Daubechies filter; however, the
algorithm can be extended to any number of frames at the cost of wait time. The
architecture requires a simple, regular data-flow pattern, so the control circuitry
overhead is reduced, making the circuit efficient for high speed, low power
applications. An optimization between parallelism and pipelining has been
used to make the circuit applicable to low power domains.
42. Erdal Oruklu, Sonali Maharishi and Jafar Saniie (2007), “Analysis of
Ultrasonic 3-D Image Compression Using Non-Uniform, Separable Wavelet
Transforms”.
CHAPTER 3
3.1 INTRODUCTION
The previous chapter presented the literature review relevant to the present
research; the present researcher went through the different approaches presented
by different authors to implement image compression with a low power
VLSI design approach. In this chapter, the author discusses the VLSI design
flow together with the concepts of verification, i.e., how to adopt linting and
code coverage, synthesis of the design with low power concepts, and finally the
physical design. The author also discusses the need for low power in VLSI design
and the different low power techniques applied in VLSI chip design.
• A steady increase in the size and hence the functionality of the ICs.
• A steady increase in the variety and size of software tools for VLSI design.
Figure: Overall VLSI design flow — Idea → Design Description → Synthesis → Simulation → Physical Design.
Figure 3.2 describes in detail how the design is carried out in different
stages. The process of transforming the idea into a detailed circuit description in
terms of the elementary circuit components constitutes design description. The
final circuit of such an IC can have up to a billion such components; it is arrived at
in a step-by-step manner.
The first step in evolving the design description is to describe the circuit in
terms of its behavior. The description looks like a program in a high level language
The circuit at the gate level in terms of the gates and flip flops can be
redundant in nature. The same can be minimized with the help of minimization
tools. The minimized logical design is converted to a circuit in terms of the switch
level cells from standard libraries provided by the foundries. In the proposed work,
the designer has used both 90 nm and 65 nm standard cells. The cell based design
generated by the tool is the last step in the logical design process; it forms the input
to the first level of physical design.
The design descriptions are tested for their functionality at every level like
behavioral, data flow, and gate. One has to check here whether all the functions are
carried out as expected and rectify them. All such activities are carried out by the
Verification tool. The tool also has an editor to carry out any corrections to the
source code. Simulation involves testing the design for all its functions, with functional
verification carried out at each stage of the flow.
During this stage, all the subsystems needed to meet the required specifications are
captured in a block diagram. These block diagrams are at the system level and are
modelled at a higher level of abstraction to obtain sustainable performance; they are
observed at the transaction level.
All the subsystems and sub-blocks of the design are coded in an HDL,
either Verilog or VHDL. These subsystems may also be obtained in the
form of Intellectual Property (IP), since many optimized and efficient
RTL codes are available from previous designs within the company or from
IP vendors. All of these are put together and a verification environment is created to verify the
total functionality. Verification is performed with many techniques and depends
on the design stage.
Verification is carried out with simulators and also with emulation as well
as FPGA prototyping. Initial verification is carried out using industry standard
simulators, predominantly event based simulators. The verification is done in
different stages, as shown in Figure 3.3. Simulation is software based, is less
expensive and can be done with a quick setup time. Emulation is hardware based
verification; it can be carried out when the full RTL code/design is available. This
is very expensive and also time consuming for the initial setup. The FPGA based approach
is efficient, but again it can be carried out only when the full RTL code/design is
available, and additional resources and skills are required.
55
Figure 3.3 RTL Verification flow with Linting and Code coverage
There are many standards and lint check rules available. Some example rule
categories are:
1. Coding style,
3. Design style,
With the availability of design at the gate (switch) level, the logical design
is complete. The corresponding circuit hardware realization is carried out by a
synthesis tool. The synthesis flow with the low power approach is shown in Figure 3.4.
There are two common approaches used in the synthesis of VLSI systems,
and they are as follows. In the first approach, the circuit is realized through an FPGA. The gate level
design description is the starting point for the synthesis. The FPGA vendors
provide an interface to the synthesis tool. Through the interface the gate level
design is realized as a final circuit. With many synthesis tools, one can directly use
the design description at the data flow level itself to realize the final circuit through
an FPGA. The FPGA route is attractive for limited volume production or a fast
development cycle.
In the second approach, the circuit is realized as an ASIC. A typical ASIC vendor will have a
standard library for a particular technology. The standard library contains
basic components like elementary gates and flip-flops. Eventually the circuit is to
be realized by selecting such components and interconnecting them conforming to
the required design. This constitutes the physical design. Being an elaborate and
costly process, a physical design may call for an intermediate functional
verification through the FPGA route. The circuit realized through the FPGA is
tested as a prototype. It provides another opportunity for testing the design closer
to the final circuit. In the present researcher's work, synthesis was done
using the ASIC approach by selecting the appropriate standard cells.
1. Netlist
2. Delay file
Reports:
1. Timing
2. Area
3. Gate
4. Gated clock
5. Power
6. Generated clock
Output files such as the technology specific netlist and the SDC file form the input for
the physical design.
A fully tested and error-free design at the switch level can be the starting
point for a physical design. The Hierarchy of the physical design is shown in the
figure 3.5. The physical design is to be realized as the final circuit using (typically)
a million components in the foundry’s library. The step-by-step activities in the
process are described briefly as follows; the hierarchy is shown in Figure 3.5 and the
complete flow is shown in Figure 3.6.
3.8.2 Placement
The selected components from the ASIC library are placed in position on
the “Silicon floor”. It is done with each of the blocks used in the design. During
the placement, rows are cut, blockages are created where the tool is prevented from
placing the cells, and then the physical placement of the cells is performed based
on the timing/area requirements. The power-grid is built to meet the power targets
of the Chip.
3.8.3 Routing
This stage involves checking the design for all manufacturability and
fabrication requirements.
1. DRC
2. ERC
3. LVS
The ERC (Electrical Rule Check) is performed to confirm that the design
meets the ERC requirements, i.e., to check for any open, short-circuited or floating nets
in the layout.
One of the important stages in physical design is the LVS (Layout vs.
Schematic) check. This part of the verification compares the routed netlist
against the synthesized netlist and confirms that the two match.
For effective timing closure, separate Static Timing Analysis (STA) runs need to
be performed at every stage to verify the timing and signal integrity of the chip. STA is
important because signal integrity effects can cause cross-talk delay and cross-talk
noise, which affect the functionality and timing of the design.
Finally, performance analysis is carried out; this constitutes the final stage called
“verification.” One may have to go through the placement and routing activity
once again to improve performance.
Historically, VLSI designers have used circuit speed as the
“performance” metric. As design sizes became large, the demands on performance
and silicon area increased, and with them the power consumption. As the number of
integrated devices increased, the demand for power also increased exponentially. In fact,
power considerations have long been the ultimate design criterion in special portable
applications such as mobile phones, music systems like MP3 and MP4 players,
wristwatches and pacemakers. The objective in these applications is minimum power
for maximum battery lifetime. In almost all recent applications, power dissipation is
becoming an important constraint in large scale integrated design. These low power
requirements extend to other battery powered systems such as laptops, notebooks,
digital readers, digital cameras, electronic organizers, etc.
Power reduction techniques can be applied at the following levels of abstraction:
1. System,
2. Architectural,
3. Gate,
4. Circuit and
5. Technology level.
At the system level, inactive modules may be turned off to save power. At
the architectural level, parallel hardware may be used to reduce global interconnect
and allow a reduction in supply voltage without degrading system throughput.
Clock gating is commonly used at the gate level. Many design techniques can be
used at the circuit level to reduce both dynamic and static power. For a design
specification, designers have many choices to make at different levels of
abstraction. Based on particular design requirement and constraints (such as
power, performance, cost), the designer can select a particular algorithm,
architecture and determine various parameters such as supply voltage and clock
frequency. This multi-dimensional design space offers a wide range of possible
trade-offs. The most effective design decisions derive from choosing and
optimizing architectures and algorithms at those levels.
CMOS devices have very low static power consumption, which is the result
of leakage current. This power consumption occurs when all inputs are held at
some valid logic level and the circuit is not in charging states. But, when switching
at a high frequency, dynamic power consumption can contribute significantly to
overall power consumption. Charging and discharging a capacitive output load
further increases this dynamic power consumption.
Technology scaling has brought lower supply voltages, which reduce the power
consumption of the individual transistors. However, the exponential increase in
operating frequencies results in a steady increase of the total power consumption.
Typically, all low-voltage devices have a CMOS inverter in the input and
output stage. Therefore, for a clear understanding of static power consumption,
refer to the CMOS inverter modes shown in Figure 3.8.
As shown in Figure 3.8, if the input is at logic 0, the n-MOS device is OFF,
and the p-MOS device is ON. The output voltage is VCC, or logic 1. Similarly,
when the input is at logic 1, the associated n-MOS device is biased ON and the p-
MOS device is OFF. The output voltage is GND, or logic 0. Note that one of the
transistors is always OFF when the gate is in either of these logic states. Since no
current flows into the gate terminal, and there is no DC current path from VCC to
GND, the resultant quiescent (steady-state) current is zero; hence, static power
consumption is zero. However, there is a small amount of static power
consumption due to reverse-bias leakage between diffused regions and the
substrate. This leakage current leads to the static power dissipation.
The larger this dynamic switching current is, the faster you can charge and discharge
capacitive loads, and the better your circuit will perform.
C1 and C2 are capacitances associated with the overlap of the gate area and
the source and channel regions of the P and N-channel transistors, respectively. C3
is due to the overlap of the gate and drain (output), and is known as the Miller
capacitance. C4 and C5 are capacitances of the parasitic diodes from the output to
VCC and ground, respectively. Thus the total internal capacitance seen by inverter
1 driving inverter 2 is given by the equation 3.2.
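Equation 3.2 is not reproduced in this copy; to first order (an assumption based on the capacitances just listed, not necessarily the exact expression intended in the source), the internal node capacitance can be taken as the sum of the individual contributions:

C_{int} \approx C_1 + C_2 + C_3 + C_4 + C_5   … (3.2)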
For fast input rise and fall times (shorter than 50 ns), the resulting power
consumption is frequency dependent. This is due to the fact that the more often a
device is switched, the more often the input is situated between logic levels,
causing both transistors to be partially turned on. Since this power consumption is
proportional to input frequency and specific to a given device in any application,
as is CL, it can be combined with CL. The resulting term is called “CPD”, the no-
load power dissipation capacitance.
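This frequency-dependent behaviour is usually captured by the standard first-order CMOS power expression; the form below is the commonly quoted approximation (it is not reproduced from the thesis), with $f$ the switching frequency:

$$P_{total} \;\approx\; (C_{PD} + C_{L})\,V_{CC}^{2}\,f \;+\; V_{CC}\,I_{CC(\text{static})}$$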
From all of the above discussion, it is quite clear that significant work lies
ahead for reducing and managing power dissipation in CMOS devices. With the
market desire to shrink size and put systems on chips, advances in circuit design
and materials are clearly required for managing this situation, since neither speed
nor size is helping power reduction.
In Clock Gating method of power reduction, clocks are turned off when
they are not required. Modern design tools support automatic clock gating. These
tools identify circuits where clock gating can be inserted without changing the
functionality. Figure 3.12 shows the different clock gating scheme to reduce the
power consumption in VLSI chip design. This clock gating method also results in area saving, since a single clock gating cell takes the place of multiple multiplexers.
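As an illustration of the technique, a minimal latch-based clock gating cell can be sketched as follows; the module and signal names are illustrative and are not taken from the thesis design.

// Minimal latch-based clock gating cell (illustrative sketch only).
// The enable is latched while the clock is low so that glitches on `en`
// cannot produce runt pulses on the gated clock.
module clock_gate (
  input  clk,      // free-running clock
  input  en,       // functional enable
  output gclk      // gated clock driving the otherwise idle block
);
  reg en_latched;
  // transparent-low latch on the enable
  always @(clk or en)
    if (!clk)
      en_latched <= en;
  assign gclk = clk & en_latched;
endmodule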
Proponents of asynchronous logic have pointed out that, because their systems do not have a clock, they save the considerable power that a clock tree requires. However,
asynchronous logic design suffers from the drawback of generating the completion
signals. This requirement means that additional logic must be used at each register transfer and, in some cases, a double-rail implementation, which can increase the
amount of logic and wiring. Other drawbacks include testing difficulty and an
absence of design tools. Further, the asynchronous designer works at a
disadvantage because today’s design tools are geared for synchronous design.
Ultimately, asynchronous design does not offer sufficient advantages to merit a
wholesale switch from synchronous designs. However, asynchronous techniques
can play an important role in globally asynchronous, locally synchronous systems.
Such systems reduce clock power and help with the growing problem of clock
skew across a large chip, while allowing the use of conventional design techniques
for most of the chip.
The different multi-voltage strategies for low power used in VLSI chip design include:
Static Voltage Scaling (SVS): a different, fixed supply voltage for different blocks of the system.
The architect or systems designer can do little to limit leakage except shut
down the memory. This is only practical if the memory will remain unused for a
long time.
SUMMARY
In this chapter, the VLSI design flow for the chip design is discussed in
detail. This describes how the digital front end design with the RTL code is
performed. The simulation of the design with verification using linting and code
coverage is also discussed. The VLSI design flow describes how to generate the
gate level netlist from the HDL code by performing the synthesis of the design.
The physical design describes the floor planning, placement and routing of the
design on the silicon core. This chapter also describes why low power is required in any chip design, the different sources of power consumption, and the different methodologies to reduce power consumption in VLSI chip design.
CHAPTER 4
4.1 INTRODUCTION
In the previous chapter, the basic concepts of the VLSI design flow, the different low power techniques in VLSI design and the low power techniques adopted in the present work were described at length.
Compression reduces the volume of data to be transmitted (text, fax and images), the bandwidth required for transmission, and the much-needed storage requirements of speech, audio and video. In a digital image, neighboring samples on a scanning line are normally similar; this is spatial redundancy. Spatial frequency is the rate of change of magnitude as one traverses the image matrix. Useful image content changes relatively slowly across the image, and much of the information in an image is repeated; hence the spatial redundancy. The human eye is less sensitive to the higher spatial frequency components than to the lower frequency components.
The DCT is widely used in image and video compression standards such as JPEG, MPEG and digital television. The DCT and IDCT also have applications in such wide ranging areas as filtering, speech coding and pattern recognition [56].
The Discrete Cosine Transform is the most complex operation that needs to
be performed in the baseline JPEG process. This subsection starts with an
introduction to our chosen DCT architecture, followed by a detailed mathematical
explanation of the principles involved. Our implementation of the Discrete Cosine
Transform stage is based on a vector processing architecture. Our choice of this
particular architecture was due to multiple reasons. The design uses a concurrent
architecture that incorporates distributed arithmetic [94] and a memory oriented
structure to achieve high speed, low power, high accuracy, and efficient hardware
realization of the 2-D DCT.
We first compute the 1-D DCT (8 x 1 DCT) of each column of the input
data matrix X to yield CTX. After appropriate rounding or truncation, the
transpose of the resulting matrix, CTX, is stored in an intermediate memory. We
then compute another 1-D DCT (8 x 1 DCT) of each row of CTX to yield the
desired 2-D DCT as defined in equation 4.1. The block diagram of the proposed
design is shown in Figure 4.1.
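Restating the separability property in matrix form (a standard identity, with C denoting the 1-D DCT basis matrix used above; equation 4.1 itself is not reproduced here):

$$Y \;=\; C^{T} X\, C \;=\; \Bigl(C^{T}\bigl(C^{T}X\bigr)^{T}\Bigr)^{T}$$

so that the 2-D DCT is obtained as eight column-wise 1-D DCTs followed, after transposition, by eight more 1-D DCTs.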
The block diagram presented below describes the top level of the design. The DCT core architecture is based on two 1-D DCT units connected through a transpose matrix RAM. The transposition RAM is double buffered; that is, while the 2nd DCT stage reads data out of transposition memory one, the 1st DCT stage can populate the 2nd transposition memory with new data. This enables a dual-stage global pipeline where every stage consists of a 1-D DCT and a transposition memory. The 1-D DCT units are not internally pipelined; they use parallel distributed arithmetic with butterfly computation to compute the DCT values. Because of the parallel DA, they need a considerable amount of ROM memory to compute one DCT value in a single clock cycle. A design based on distributed arithmetic does not use any multipliers for computing the MAC (multiply and accumulate); instead it stores precomputed MAC results in ROM memory and fetches them as needed.
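To illustrate the ROM-based multiply-accumulate principle, a bit-serial distributed-arithmetic MAC is sketched below. The thesis core uses parallel DA with several ROMs so that one DCT value is produced per clock; the coefficient values (64, 83, 36, 18), widths and names here are assumptions chosen only for illustration.

// Bit-serial distributed-arithmetic MAC (illustrative sketch only).
module da_mac (
  input             clk,
  input             rst,            // pulse high before each new set of samples
  input      [7:0]  x0, x1, x2, x3, // four unsigned 8-bit input samples
  output reg [15:0] y               // y = 64*x0 + 83*x1 + 36*x2 + 18*x3 after 8 clocks
);
  // ROM holding the 16 precomputed partial sums of the four fixed coefficients
  reg [7:0] rom [0:15];
  integer i;
  initial
    for (i = 0; i < 16; i = i + 1)
      rom[i] = (i[0] ? 8'd64 : 8'd0) + (i[1] ? 8'd83 : 8'd0) +
               (i[2] ? 8'd36 : 8'd0) + (i[3] ? 8'd18 : 8'd0);

  reg [2:0] bitcnt;
  always @(posedge clk) begin
    if (rst) begin
      y      <= 16'd0;
      bitcnt <= 3'd0;
    end else begin
      // one bit-slice of the four inputs addresses the ROM; the partial sum is
      // weighted by 2^bitcnt and accumulated; after 8 clocks y holds the result
      y      <= y + (rom[{x3[bitcnt], x2[bitcnt], x1[bitcnt], x0[bitcnt]}] << bitcnt);
      bitcnt <= bitcnt + 3'd1;
    end
  end
endmodule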
(Block diagram of the DCT core: two 1-D DCT stages with their ROM memories, connected through the transposition RAMs; inputs clk, rst and xin(7:0); output rdy_out.)
$$Z(k) \;=\; \frac{c(k)}{2}\sum_{n=0}^{7} x(n)\,\cos\frac{(2n+1)k\pi}{16}, \qquad k = 0,1,2,\ldots,7 \qquad (4.2)$$

where $c(k) = 1/\sqrt{2}$ for $k = 0$ and $c(k) = 1$ otherwise.

$$z = T \cdot x^{T} \qquad (4.3)$$

$$T = \frac{1}{2}\begin{bmatrix}
d &  d &  d &  d &  d &  d &  d &  d\\
a &  c &  e &  g & -g & -e & -c & -a\\
b &  f & -f & -b & -b & -f &  f &  b\\
c & -g & -a & -e &  e &  a &  g & -c\\
d & -d & -d &  d &  d & -d & -d &  d\\
e & -a &  g &  c & -c & -g &  a & -e\\
f & -b &  b & -f & -f &  b & -b &  f\\
g & -e &  c & -a &  a & -c &  e & -g
\end{bmatrix} \qquad (4.4)$$

where $c_k = \cos(k\pi/16)$, $a = c_1$, $b = c_2$, $c = c_3$, $d = c_4$, $e = c_5$, $f = c_6$, $g = c_7$.

$$\begin{bmatrix} Z(0)\\ Z(2)\\ Z(4)\\ Z(6) \end{bmatrix}
= \frac{1}{2}\begin{bmatrix} d & d & d & d\\ b & f & -f & -b\\ d & -d & -d & d\\ f & -b & b & -f \end{bmatrix}
\begin{bmatrix} x(0)+x(7)\\ x(1)+x(6)\\ x(2)+x(5)\\ x(3)+x(4) \end{bmatrix},
\qquad
\begin{bmatrix} Z(1)\\ Z(3)\\ Z(5)\\ Z(7) \end{bmatrix}
= \frac{1}{2}\begin{bmatrix} a & c & e & g\\ c & -g & -a & -e\\ e & -a & g & c\\ g & -e & c & -a \end{bmatrix}
\begin{bmatrix} x(0)-x(7)\\ x(1)-x(6)\\ x(2)-x(5)\\ x(3)-x(4) \end{bmatrix} \qquad (4.5)$$
The odd and even components of the DCT can be easily rearranged into the matrix form of equation 4.5. The 8 × 8 DCT matrix multiplication can
be expressed as additions of vector scaling operations, which allows us to apply
our proposed low complexity design technique for DCT implementation.
(Figure 4.3: 1-D DCT datapath — the input Xin[7:0] enters a shift register, the registered pixels are paired in four add/sub blocks controlled by a toggle flip-flop, the add/sub outputs drive the ROM-based multipliers, and a final adder produces the outputs Zk0–Zk7.)
The One Dimensional DCT implementation for the input pixels is shown in figure 4.3. A 1-D DCT is performed on the input pixels first. Its output, an intermediate value, is stored in a RAM. The 2nd 1-D DCT operation is done on this stored value to give the final 2-D DCT output dct_2d. The inputs are 8 bits wide and the 2-D DCT outputs are 9 bits wide. In the 1-Dimensional section the input signals are taken one pixel at a time in the order x00 to x07, x10 to x17 and so on up to x77. These inputs are fed into an 8-bit shift register. The outputs of the 8-bit shift registers are registered by div8clk, which is the clock signal divided by 8. This enables us to register 8 pixels (one row) at a time. The pixels are paired up in an adder/subtractor block in the order xk0, xk7 : xk1, xk6 : xk2, xk5 : xk3, xk4. The adder/subtractors are tied to the clock [20, 8, 49]. For every clock, the adder/subtractor module alternately chooses addition and subtraction; this selection is done by using the toggle flip-flop, as sketched below. The output of the add/sub is fed into a multiplier whose other input is connected to stored values in registers which act as memory. The outputs of the 4 multipliers are added at every clock in the final adder. The output of the adder, Zk (0-7), is the 1-D DCT values given out in the order in which the inputs were read in.
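The toggle-controlled add/sub pair can be sketched as follows; the port names and widths are illustrative assumptions, not the thesis RTL.

// Toggle-flip-flop controlled adder/subtractor pair (illustrative sketch).
module addsub_toggle (
  input                   clk,
  input                   rst,
  input  signed [8:0]     a,   // e.g. xk0
  input  signed [8:0]     b,   // e.g. xk7
  output reg signed [9:0] y    // alternately a+b and a-b on successive clocks
);
  reg sel;                     // toggle flip-flop selecting add or subtract
  always @(posedge clk) begin
    if (rst) begin
      sel <= 1'b0;
      y   <= 10'sd0;
    end else begin
      y   <= sel ? (a - b) : (a + b);
      sel <= ~sel;             // alternate every clock
    end
  end
endmodule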
The outputs of the adder are stored in RAMs. When the WR signal is high, a write operation is performed to the corresponding RAM address. Otherwise, the contents of the RAM address are read. The period of the address signals is 64 times that of the
input clocks. Two RAMs are used so that data write can be continuous. The 1st
valid input for the RAM1 is available at the 15th clock. So the RAM1 enable is
active after 15 clocks. After this the write operation continues for 64 clocks. At the
65th clock, since z_out is continuous, we get the next valid z_out_00. These 2nd
sets of valid 1D-DCT coefficients are written into RAM2 which is enabled at
15+64 clocks. So at 65th clock, RAM1 goes into read mode for the next 64 clocks
and RAM2 is in write mode. After this, for every 64 clocks, the read and write roles switch between the two RAMs and the 1-D DCT section.
After the RAM is enabled, data is written into the RAM1 for 64 clock
cycles. Data is written into each consecutive location. After 64 locations are
written into, RAM1 goes into read mode and RAM2 goes into write mode. The
cycle then repeats. For either RAM, data is written into each consecutive location.
However, data is read in a different order. If data is assumed to be written in each
row at a time, in an 8x8 matrix, data is read in each column at a time. When RAM1
is full, the 2nd 1-D calculations can start.
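The row-wise write / column-wise read behaviour described above can be sketched as a simple address generator; the module and signal names are illustrative assumptions, not the thesis RTL.

// Address generation for the transposition RAMs (illustrative sketch).
// Data is written row-wise into consecutive locations and read back
// column-wise, which for an 8x8 block is a swap of the two 3-bit halves of
// the 6-bit address. The MSB of a 7-bit counter selects which RAM is being
// written while the other is being read.
module transpose_addr (
  input        clk,
  input        rst,
  output [5:0] wr_addr,   // row-major write address 0..63
  output [5:0] rd_addr,   // column-major read address
  output       ram_sel    // 0: write RAM1 / read RAM2, 1: the opposite
);
  reg [6:0] cnt;          // counts 0..127, one full ping-pong period
  always @(posedge clk)
    if (rst) cnt <= 7'd0;
    else     cnt <= cnt + 7'd1;

  assign wr_addr = cnt[5:0];              // 0, 1, 2, ..., 63 (row by row)
  assign rd_addr = {cnt[2:0], cnt[5:3]};  // 0, 8, 16, ..., 57, 1, 9, ... (column by column)
  assign ram_sel = cnt[6];                // toggles every 64 clocks
endmodule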
(Figure: second 1-D DCT stage — the intermediate values Zk[7:0] pass through a shift register, are paired (Zk0/Zk7, Zk1/Zk6, Zk2/Zk5, Zk3/Zk4) in add/sub blocks controlled by a toggle flip-flop, multiplied against ROM coefficients and summed in a final adder to give the 2-D DCT output Dct_2d[11:0].)
A 1-D IDCT is shown in figure 4.5 and is applied to the input DCT values. The output of this, called the intermediate value, is stored in a RAM. The 2nd 1-D IDCT operation is done on this stored value to give the final 2-D IDCT output idct_2d. The inputs are 12 bits wide and the 2-D IDCT outputs are 8 bits
wide. In the 1st 1-D section, the input signals are taken one pixel at a time in the
order of x00 to x07, x10 to x17 and so on up to x77. These inputs are fed into an 8-bit shift register. The outputs of the 8-bit shift registers are registered at every 8th clock [14, 93]. This enables us to register 8 pixels (one row) at a time. The
pixels are fed into a multiplier whose other input is connected to stored values in
registers which act as memory. The outputs of the 8 multipliers are added at every
clock in the final adder. The output of the adder z_out is the 1-D IDCT values
given out in the order in which the inputs were read in.
The 1-D IDCT values are first calculated and stored in a RAM. The second
1-D implementation is the same as the 1st 1-D implementation with the inputs now
coming from RAM. Also, the inputs are read in one column at a time in the order
z00 to z70, z10 to z70 up to z77. The outputs from the adder in the 2nd section are
the 2D-IDCT coefficients. The 2-Dimensional IDCT block diagram is shown in
figure 4.6.
The outputs z_out of the adder are stored in RAMs. Two RAMs are used so
that data write can be continuous. The 1st valid input for RAM1 is available at the 12th clock, so the RAM1 enable is active after 11 clocks. After this the write operation continues for 64 clocks. At the 65th clock, since z_out is continuous, we get the next valid z_out_00. This 2nd set of valid 1-D IDCT coefficients is written
into RAM2 which is enabled at 12+64 clocks. So at 65th clock, RAM1 goes into
read mode for the next 64 clocks and RAM2 is in write mode. After this for every
64 clocks, the read and write switches between the 2 RAMS. 2nd 1-D IDCT
section starts when RAM1 is full. The second 1D implementation is the same as
the 1st 1-D implementation with the inputs now coming from either RAM1 or
RAM2. Also, the inputs are read in one column at a time in the order z00 to z70,
z10 to z70 up to z77. The outputs from the adder in the 2nd section are the 2-D
IDCT coefficients.
(Figure 4.7: IDCT core block diagram — inputs clk, rst, rdy_in and dct_2d(11:0); output idct_2d(7:0).)
The schematic for the IDCT implementation is given in Figure 4.5 and the block diagram in figure 4.7. The 1-D IDCT values are first calculated and stored in a RAM. The 2nd 1-D IDCT is done on the values stored in the RAM. For each 1-D
implementation, input data are loaded into the first input of a multiplier. The
constant coefficient multiplication values are stored in a ROM and fed into the 2nd
input of the multiplier. At each clock 8 input values are multiplied with 8 constant
coefficients. The output of the eight multipliers are then added together to give the
1D coefficient values which are stored in the RAM. The values stored in the
intermediate RAM are read out one column at a time (i.e., every 8th value is read
out every clock) and this is the input for the 2nd IDCT stage. The rdy_in signal is used as a handshake signal between the DCT and the IDCT functions [14]. When the signal is high, it indicates to the IDCT that there is valid input to the core.
SUMMARY
CHAPTER 5
5.1 INTRODUCTION
Several designs have been proposed for the variable length coding method in the past decades, and almost all of them concentrated on a high performance variable length coding process, i.e., their main concern was to make variable length coding faster. These designs paid little attention to power consumption, their only intention being high performance. But with the increase in demand for portable multimedia systems like mobiles, iPods etc., the demand for low power multimedia systems is increasing, making designers concentrate on low power approaches along with high performance.
In this thesis some of the low power approaches have been adopted
efficiently to reduce the power consumption in variable length coding process. The
proposed architecture consists of three major blocks, namely, zigzag scanning
block, run length coding block and Huffman coding block [42]. The low power
approaches have been employed in these blocks individually to make the whole
variable length coding process to consume less power.
The zigzag scanning requires that all the 64 DCT coefficients are available
before starting, so we have to wait for all 64 DCT coefficients to arrive and then
start scanning. To overcome this problem, two separate RAMs have been used in the present design to make the process faster: one RAM stores the incoming DCT coefficients while the other is used for zigzag scanning of the 64 DCT coefficients stored earlier. The two RAMs alternate between these roles. When we use this parallel approach there are two possibilities: either we can make the process faster at the same operating frequency, or we can decrease the operating frequency while still maintaining the same speed as earlier. Reducing the operating frequency is one of several approaches to achieve low power, since the power dissipation is directly proportional to the operating frequency.
The zigzag scanning block scans the quantized DCT coefficients in such a
way that all the low frequency components are scanned first and then the high
frequency components. The high frequency components, being quantized to zero, are all accumulated at the end, so that effective compression can be done in the run length coding block.
The Huffman coding is done by making use of a lookup table. The lookup
table is formed by arranging the different run-length combinations in the order of
their probabilities of occurrence with the corresponding variable length Huffman
codes. This approach of designing Huffman coder not only simplifies the design
but also results in less power consumption. Since we are using a lookup table approach, only the part of the encoder corresponding to the current run-length combination will be active, and the other, inactive parts of the encoder will not be
consuming any power. So turning off the inactive components of a circuit is the
low power approach adopted while designing Huffman coder.
Variable Length Encoding (VLE) is the final lossless stage of the image
compression unit. VLE is done to further compress the quantized image. VLE consists of the following three stages:
1. Zigzag Scanning
2. Run Length Encoding
3. Huffman Encoding
The figure 5.1 shows the block diagram of the Variable Length Encoder
(VLE). In the following sections we will discuss the detailed working of each sub-block.
(Figure 5.2: zigzag scanning block — inputs zigzag_en_in, clk and rst; outputs zigzag_en_out and rdy_out.)
The figure 5.2 describes the block diagram of the zigzag scanner. The
quantized DCT coefficients obtained after applying the Discrete Cosine
Transformation to 8×8 block of pixels are fed as input to the Variable Length
Encoder (VLE). These quantized DCT coefficients will have non-zero low
frequency components in the top left corner of the 8×8 block and higher frequency
components in the remaining places [71, 62]. The higher frequency components
are approximated to zero after quantization. The low frequency DCT coefficients
are more important than higher frequency DCT coefficients. Even if we ignore
some of the higher frequency coefficients, we can successfully reconstruct the
image from the low frequency coefficients [73]. The Zigzag Scanner block
exploits this property.
(Figure 5.3: the 8×8 block of DCT coefficient positions 0–63, with horizontal frequency increasing from left to right and vertical frequency increasing from top to bottom.)
In zigzag scanning, the quantized DCT coefficients are read out in a zigzag
order, as shown in the figure 5.3. By arranging the coefficients in this manner,
RLE and Huffman coding can be done to further compress the data. The scan puts
the high-frequency components together. These components are usually zeroes.
The following figure 5.4 explains the zigzag scanning process with a typical
quantization matrix.
31  0  1  0  0  0  0  0
 1  2  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  2  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
(horizontal frequency increases from left to right, vertical frequency from top to bottom)

31 0 1 0 0 0 0 0 1 2 0 0 … 0 0
(coefficient positions 0, 1, 2, 3, …, 63)
Since the zigzag scanning requires that all the 64 DCT coefficients are
available before scanning, we need to store the serially incoming DCT coefficients
in a temporary memory. For each of the 64 DCT coefficients obtained and for each
8×8 block of pixels we have to repeat this procedure. So at a time either scanning
is performed or storing of incoming DCT coefficients is done [71, 62]. This will
slow down the scanning process. So in order to overcome this problem and to
make scanning faster, the present researcher proposes a new architecture in this
work for zigzag scanner and it is shown in the following figure 5.5.
In the proposed architecture, two RAM memories are used in the zigzag
scanner and they are zigzag register1 and zigzag register2. One of the two RAM
memories will be busy in storing the serially incoming DCT coefficients while
scanning is performed from the other RAM memory. Two 2:1 Multiplexers are
used in this architecture. One Multiplexer (left side) will be used to switch the
input alternatively between one of the two register sets and the other Multiplexer
will be used to connect the output alternatively from one of the two register sets.
Either zigzag scanning or storing of incoming DCT coefficients is done on a given zigzag register set, but not both simultaneously. So while scanning is performed on one register set, the other register set will be used to store the incoming DCT
coefficients. So, except for the first 64 clock cycles, i.e., until the 64 DCT coefficients of the first 8×8 block become available [28], zigzag scanning and storing of the serially incoming DCT coefficients are performed simultaneously. So by using two RAM memories we will be able to scan one DCT coefficient in each clock cycle except
for the first 64 clock cycles. A counter is used to count the number of clocks, counting up to 64. This counter generates a switch-memory signal to the multiplexers every 64 clock cycles. The counter also generates the ready-out signal after the first 64 clock cycles to indicate to the next block that the zigzag block is ready to output values. This signal acts as the ready-in signal to the following run-length block, which starts to operate upon receiving it. A control skeleton for this scheme is sketched below.
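This is only a sketch of the control logic under stated assumptions: the zigzag address table is assumed to be loaded from an external file, and all names and widths are illustrative rather than the thesis RTL.

// Control skeleton for the double-buffered zigzag scanner (illustrative).
module zigzag_ctrl (
  input        clk,
  input        rst,
  output [5:0] wr_addr,      // sequential address for the RAM being filled
  output [5:0] rd_addr,      // zigzag-ordered address for the RAM being scanned
  output reg   switch_mem,   // toggles the two 2:1 multiplexers every 64 clocks
  output reg   rdy_out       // asserted once the first 64 coefficients are stored
);
  reg [5:0] count;                // counts the 64 coefficients of one block
  reg [5:0] zigzag_rom [0:63];    // standard zigzag order 0, 1, 8, 16, 9, 2, 3, 10, ...
  initial $readmemh("zigzag_order.hex", zigzag_rom);  // table assumed external

  always @(posedge clk) begin
    if (rst) begin
      count      <= 6'd0;
      switch_mem <= 1'b0;
      rdy_out    <= 1'b0;
    end else begin
      count <= count + 6'd1;
      if (count == 6'd63) begin
        switch_mem <= ~switch_mem;  // swap the roles of the two zigzag RAMs
        rdy_out    <= 1'b1;         // downstream RLE may start after the first block
      end
    end
  end

  assign wr_addr = count;              // incoming coefficients stored in order
  assign rd_addr = zigzag_rom[count];  // stored block read out in zigzag order
endmodule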
(Figure: run length encoder block — inputs rle_in, rdy_in, clk and rst; outputs rle_out and rdy_out.)
For defining an 8 x 8 block without RLE, 64 coefficients are used [69, 26]. Since many of the quantized coefficients in the 8 x 8 block are zero, run-length encoding can compress the data further. Coding can be terminated when there are no more non-zero coefficients in the zigzag sequence; the "end-of-block" code terminates the coding.
(Figure: internal structure of the run length encoder — the input RLE_IN is compared against zero; a zero counter is incremented for each zero coefficient and cleared when a non-zero value arrives, at which point the run count and the level are loaded to form the {run, level} output RLE_OUT, qualified by RDY_OUT.)
The following Table 5.1 illustrates the difference between the conventional
run-length encoding and the proposed method of run-length encoding.
The zigzag-scanned input sequence 31 0 1 0 2 1 0 0 0 0 0 2 … 0 0 (at positions 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, …, 62, 63) is encoded as follows:

Conventional RLE    Proposed RLE
31                  31
1,0                 1,1
1,1                 1,2
1,0                 0,1
1,2                 5,2
1,1                 EOB
5,0
1,1
EOB
The above example clearly shows that the proposed RLE design in the
present work yields better compression than the conventional RLE.
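The proposed zero-run counting behaviour compared above can be sketched in Verilog as follows; the port names, widths and the end-of-block encoding ({run = 0, level = 0}) are assumptions for illustration, not the thesis RTL.

// Sketch of a zero-run/level encoder; one quantized coefficient is accepted
// per clock while rdy_in is high. EOB is assumed to be the pair {0, 0}.
module rle_enc (
  input                    clk,
  input                    rst,
  input                    rdy_in,   // zigzag scanner has a valid coefficient
  input  signed [11:0]     rle_in,   // zigzag-ordered quantized DCT coefficient
  output reg               rdy_out,  // a {run, level} pair is valid this cycle
  output reg [5:0]         run,      // number of zeros preceding the level
  output reg signed [11:0] level     // the non-zero coefficient (0 for EOB)
);
  reg [5:0] zero_cnt;                // zeros seen since the last non-zero value
  reg [5:0] pos;                     // position inside the 64-coefficient block

  always @(posedge clk) begin
    rdy_out <= 1'b0;
    if (rst) begin
      zero_cnt <= 6'd0;
      pos      <= 6'd0;
    end else if (rdy_in) begin
      pos <= pos + 6'd1;             // wraps back to 0 after position 63
      if (rle_in != 0) begin
        run      <= zero_cnt;        // emit the pair {zero run, non-zero level}
        level    <= rle_in;
        rdy_out  <= 1'b1;
        zero_cnt <= 6'd0;
      end else if (pos == 6'd63) begin
        run      <= 6'd0;            // trailing zeros: emit end-of-block
        level    <= 12'sd0;
        rdy_out  <= 1'b1;
        zero_cnt <= 6'd0;
      end else begin
        zero_cnt <= zero_cnt + 6'd1;
      end
    end
  end
endmodule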
(Figure: Huffman encoding block — inputs Huffman_en_in, rdy_in and clk; output Huffman_en_out.)
This approach of designing the Huffman encoder not only simplifies the design but also results in less power consumption. Since we are using a lookup table approach, only the part of the encoder corresponding to the current run-length combination will be active, and the other parts of the encoder will not be consuming any power. Turning off the inactive components of the circuit in the Huffman encoder thus results in less power consumption.
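A look-up-table style encoder of this kind might be sketched as below; the run/level-to-codeword mapping shown is purely illustrative and is not the VLC table used in the thesis, and only a few entries are listed.

// Illustrative look-up-table Huffman encoder; unlisted run/level
// combinations fall through to a default "escape" code.
module huffman_enc_lut (
  input                clk,
  input                rst,
  input                rdy_in,
  input  [5:0]         run,
  input  signed [11:0] level,
  output reg [15:0]    code,      // left-aligned variable length codeword
  output reg [4:0]     code_len,  // number of valid bits in `code`
  output reg           rdy_out
);
  always @(posedge clk) begin
    rdy_out <= 1'b0;
    if (!rst && rdy_in) begin
      rdy_out <= 1'b1;
      case ({run, level})                  // only one LUT entry is active at a time
        {6'd1, 12'sd1}: begin code <= 16'b1110_0000_0000_0000;  code_len <= 5'd4;  end
        {6'd0, 12'sd1}: begin code <= 16'b10_0000_0000_0000_00; code_len <= 5'd2;  end
        {6'd0, 12'sd2}: begin code <= 16'b110_0000_0000_0000_0; code_len <= 5'd3;  end
        {6'd0, 12'sd0}: begin code <= 16'b0;                    code_len <= 5'd2;  end // EOB
        default:        begin code <= 16'hFFFF;                 code_len <= 5'd16; end // escape
      endcase
    end
  end
endmodule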
The above figure 5.10 shows the interconnection of zigzag scanning, run-
length encoding and Huffman encoding blocks in the variable length encoder
[85, 9]. The clock and reset signals are shared among all three blocks in the VLE. The
output of the zigzag scanner is connected to the run-length encoder and also a
ready out signal is connected from zigzag scanner to the run-length encoder as
ready in signal, to indicate the RLE when to start the encoding process [72]. The
output of the run-length encoder is given as input to the Huffman encoder block, and the ready-out signal from the RLE is also connected to the Huffman encoder to initiate the encoding process. The interconnection acts as a variable length encoder, taking
quantized DCT coefficients as input and processing it to give variable length codes
as output.
The variable length decoder is the first block on the decoder side. It
decodes the variable length encoder output to yield the quantized DCT coefficients. The basic block diagram of the Variable Length Decoder is shown in figure 5.11. The variable length decoder consists of three major blocks, namely,
1. Huffman Decoding,
2. Run Length Decoding and
3. Zigzag Inverse Scanning.
(Figure 5.11: compressed data passes through Huffman decoding, run length decoding and zigzag inverse scanning to yield the quantized DCT coefficients.)
(Figure 5.12: Huffman decoder block — inputs Huffman_in, rdy_in, clk and rst; outputs Huffman_out and rdy_out.)
The Huffman decoder forms the front end part of the variable length
decoder. The block diagram of the Huffman decoder is shown in figure 5.12. The internal architecture of the Huffman decoder is the same as that of the Huffman encoder [78]. The same VLC Huffman coding table which was used in the
Huffman encoder is also used in the Huffman decoder. The input encoded data is
taken and a search is done for the corresponding run/value combination in the VLC
table. Once the corresponding run/value combination is found, it is sent as output and the Huffman decoder starts decoding the next incoming input.
The VLC Huffman coding table which we are using in both the Huffman
encoder and the Huffman decoder, reduces the complexity of the Huffman decoder
[39, 14]. It not only reduces the complexity but also reduces the dynamic power in the Huffman decoder, since only a part of the circuit is active at a time.
(Figure 5.13: FIFO block — inputs Data_In, clk, rst, FInN and FOutN; outputs F_Data and F_EmptyN.)
The First-In First-Out (FIFO) buffer also forms part of the decoder and is shown in figure 5.13. The FIFO is used between the Huffman decoder and the run-length decoder to match the operating speeds of the two blocks. The Huffman decoder sends a decoded output to the run-length decoder in the form of a run/value combination. The run-length decoder takes this as input and starts decoding. Since the run in the run/value combination represents the number of zeros between consecutive non-zero coefficients, a zero ‘0’ is sent as output for the next ‘run’ number of clock cycles. Until then the run-length decoder cannot accept another run/value combination, whereas the Huffman decoder decodes one input into one run/value combination in every clock cycle. So the Huffman decoder cannot be connected directly to the run-length decoder; otherwise the run-length decoder cannot decode correctly. To match the speed between the Huffman decoder and the run-length decoder, the FIFO is used. The output of the Huffman decoder is stored in the FIFO, and the run-length decoder takes one decoded output of the Huffman decoder from the FIFO when it has finished decoding its present input. After the run-length decoder finishes decoding the present input, it sends a signal to the FIFO to feed it a new input. This signal is sent to the FOutN pin, which is the read-out pin of the FIFO. The FInN pin is used to write into the FIFO; the Huffman decoder generates this signal when it has to write a new input into the FIFO. So the FIFO acts as a synchronizing device between the Huffman decoder and the run-length decoder.
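A small synchronous FIFO with the handshake described above might look as follows. The port names follow figure 5.13, while the active-low read/write strobes, the depth and the width are assumptions made for illustration.

// Small synchronous FIFO sketch (pointer widths assume DEPTH = 16).
module fifo #(parameter WIDTH = 18, DEPTH = 16) (
  input              clk,
  input              rst,
  input  [WIDTH-1:0] Data_In,
  input              FInN,      // active-low write strobe (from the Huffman decoder)
  input              FOutN,     // active-low read strobe (from the run-length decoder)
  output [WIDTH-1:0] F_Data,
  output             F_EmptyN   // high when at least one word is stored
);
  reg [WIDTH-1:0] mem [0:DEPTH-1];
  reg [3:0] wr_ptr, rd_ptr;
  reg [4:0] count;               // number of words currently stored

  assign F_Data   = mem[rd_ptr];
  assign F_EmptyN = (count != 0);

  always @(posedge clk) begin
    if (rst) begin
      wr_ptr <= 0; rd_ptr <= 0; count <= 0;
    end else begin
      if (!FInN && count < DEPTH) begin   // write when not full
        mem[wr_ptr] <= Data_In;
        wr_ptr      <= wr_ptr + 4'd1;
      end
      if (!FOutN && count != 0)           // read when not empty
        rd_ptr <= rd_ptr + 4'd1;
      case ({(!FInN && count < DEPTH), (!FOutN && count != 0)})
        2'b10:   count <= count + 5'd1;
        2'b01:   count <= count - 5'd1;
        default: ;                        // both or neither: count unchanged
      endcase
    end
  end
endmodule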
(Figure 5.14: run length decoder block — inputs rle_in, rdy_in, clk and rst; outputs rle_out and rdy_out.)
The Run-length decoder forms the middle part of the variable length
decoder. The block diagram of the Run Length Decoder is shown in figure 5.14. It takes the decoded output from the Huffman decoder through the FIFO. When the Huffman
decoder decodes one input and stores the decoded output onto the FIFO, then the
FIFO becomes non-empty (the condition when at least one element is stored on the
FIFO). The FIFO then generates the signal F_EmptyN. This signal is used as
rdy_in signal to the run-length decoder. So when Huffman decoder decodes one
input and stores it onto the FIFO, then a ready signal is generated to the run-length
decoder to initiate the decoding process. The run-length decoder takes the input in
the form of a run/value combination [74], then separates the run and value parts. The run here represents the number of zeros to output before sending out the non-zero level ‘value’ in the run/value combination. So, for example, if {5,2} is input to the
run-length decoder then it sends 5 zeroes (i.e., 5 ‘0’) before transmitting the non-
zero level '2’ to the output. Once the run-length decoder sends out a non-zero level,
then it means that it is finished with the decoding of the present run/value
combination and is ready for the next one. For this it generates the rdy_out signal to the FIFO, to indicate that it has finished decoding the present input and is ready to decode the next run/value combination. This
rdy_out is connected to the FOutN pin of the FIFO, which is the read-out pin of the
FIFO. Upon receiving this signal the FIFO sends out a new run/value combination
to the run-length decoder, to initiate run-length decoding process for the new
run/value combination. An example for run-length decoding process is shown
below,
Input to the RLD: the run/value pairs produced by the proposed RLE (31; 1,1; 1,2; 0,1; 5,2; EOB).
Output from the RLD: 31 0 1 0 2 1 0 0 0 0 0 2 … 0 0, corresponding to positions 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, …, 62, 63 of the 8×8 block.
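The decoding behaviour illustrated above can be sketched as follows; the names, widths and handshake polarity are illustrative assumptions (the rdy_out strobe would be inverted if driven onto the active-low FOutN pin).

// Run-length decoder sketch: for an input pair {run, value} it emits `run`
// zeros followed by the non-zero level, then requests the next pair.
module rle_dec (
  input                    clk,
  input                    rst,
  input                    rdy_in,    // F_EmptyN from the FIFO: a pair is waiting
  input  [5:0]             run,
  input  signed [11:0]     value,
  output reg signed [11:0] rle_out,   // reconstructed zigzag-ordered coefficient
  output reg               out_valid, // rle_out carries a coefficient this cycle
  output reg               rdy_out    // read request back to the FIFO
);
  reg               busy;
  reg [5:0]         zeros_left;
  reg signed [11:0] level_reg;

  always @(posedge clk) begin
    rdy_out   <= 1'b0;
    out_valid <= 1'b0;
    if (rst) begin
      busy <= 1'b0;
    end else if (!busy) begin
      if (rdy_in) begin                // latch a new {run, value} pair
        zeros_left <= run;
        level_reg  <= value;
        busy       <= 1'b1;
      end
    end else begin
      out_valid <= 1'b1;
      if (zeros_left != 0) begin       // emit one of the leading zeros
        rle_out    <= 12'sd0;
        zeros_left <= zeros_left - 6'd1;
      end else begin                   // emit the level and fetch the next pair
        rle_out <= level_reg;
        busy    <= 1'b0;
        rdy_out <= 1'b1;
      end
    end
  end
endmodule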
(Figure 5.15: zigzag inverse scanner block — inputs zigzag_in, clk and rst; output zigzag_out.)
Input to the zigzag inverse scanner: 31 0 1 0 2 1 0 0 0 0 0 2 … 0 0 (at positions 0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, …, 62, 63).
Output from the zigzag inverse scanner: 31 0 1 0 0 0 0 0 1 2 0 0 … 0 0 (at positions 0, 1, 2, 3, …, 63).
The zigzag inverse scanner, shown in figure 5.15, forms the last stage in the variable length decoder. Its working and architecture are similar to those of the zigzag scanner, except that the scanning order is different. The zigzag inverse scanner gets its input from the run-length decoder and starts storing it in one of the two RAMs until it receives all 64 coefficients. Once it receives all 64 coefficients, it starts inverse scanning to decode back the original DCT coefficients. Meanwhile the incoming DCT coefficients are stored in the other RAM. Once scanning from one RAM is finished, it starts scanning from the other RAM and, meanwhile, the incoming DCT coefficients get stored in the first RAM. This process is repeated until all the DCT coefficients are scanned. There will be a delay of 64 clock cycles before the output appears; after that, one element is scanned in every clock cycle. The above example illustrates the working of the zigzag inverse scanner.
SUMMARY
This chapter describes the algorithm and architecture details of variable length coding, which maps input source data onto code words of variable length and is an efficient method to minimize the average code length. Compression is achieved by
assigning short code words to input symbols of high probability and long code
words to those of low probability. Variable length coding can be successfully used
to relax the bit-rate requirements and storage spaces for many multimedia
compression systems. For example, a variable length coder (VLC) employed in
MPEG-2 along with the discrete cosine transform (DCT), results in very good
compression efficiency. To reconstruct the image before dequantization, a Variable Length Decoder is designed with a low power approach.
Since early studies focused only on high throughput variable length coders, low-power variable length coders have not received much attention.
This trend is rapidly changing as the target of multimedia systems is moving
towards portable applications like laptops, mobiles and iPods etc. These systems
highly demand low-power operations, and, thus require low power functional
units.
CHAPTER 6
6.1 INTRODUCTION
The image data is divided up into 8x8 blocks of pixels. The DCT is applied to each 8x8 block of the image and converts the spatial image representation into a frequency map: the low-order or "DC" term represents the average value in the block, while successive higher-order ("AC") terms represent the strength of more and more rapid changes across the width or height of the block. The highest AC term represents the strength of a cosine wave alternating from maximum to minimum at adjacent pixels.
The DCT calculation is fairly complex; in fact, this is the most costly step in JPEG compression. The point of doing it is that we have now separated out the
high- and low-frequency information present in the image. We can discard high-
frequency data easily without losing low-frequency information. The DCT step
itself is lossless except for round-off errors. To discard an appropriate amount of
information, the compressor divides each DCT output value by a "quantization
coefficient" and rounds the result to an integer. The larger the quantization
coefficient, the more data is lost, because the actual DCT value is represented less
and less accurately. Each of the 64 positions of the DCT output block has its own
quantization coefficient, with the higher-order terms being quantized more heavily
than the low-order terms (that is, the higher-order terms have larger quantization
coefficients). Furthermore, separate quantization tables are employed for
luminance and chrominance data, with the chrominance data being quantized more
heavily than the luminance data. This allows JPEG to exploit further the eye's
differing sensitivity to luminance and chrominance.
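In the usual JPEG formulation this step can be written as below, where D is the DCT output block and Q the quantization table (this restates the paragraph above rather than reproducing a formula from the thesis):

$$C(i,j) \;=\; \operatorname{round}\!\left(\frac{D(i,j)}{Q(i,j)}\right), \qquad i, j = 0,1,\ldots,7$$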
In this thesis the low power design employs parallel processing units that enable power savings through a reduction in clock speed. This architecture saves power by disabling units that are not in use; when the units are in a standby mode, they consume a minimum amount of power. A common low power design technique is to reorder the input data so that a minimum number of transitions occur on the input data lines. The DCT architecture based on the proposed low complexity vector scalar design technique is effectively used to remove redundant computations, thus reducing the computational complexity of the DCT operations. This DCT architecture shows lower power consumption, which is mainly due to the efficient computation sharing and the advantage of using carry save adders as final adders.
• Each pixel value in the 2-D matrix is quantized using 8 bits which produces a
value in the range of 0 to 255 for the intensity/luminance values and the range
of -128 to + 127 for the chrominance values. All values are shifted to the range
of -128 to + 127 before computing DCT.
• All 64 values in the input matrix contribute to each entry in the transformed
matrix.
• The value in the location F [0,0] of the transformed matrix is called the DC
coefficient and is the average of all 64 values in the matrix.
• The other 63 values are called the AC coefficients and have a frequency
coefficient associated with them.
Because the DCT is designed to work on pixel values ranging from -128 to
127, the original block is “leveled off” by subtracting 128 from each entry. This
results in the following matrix. It is now ready to perform the Discrete Cosine
Transform, which is accomplished by matrix multiplication.
D = T M T’ ….. (6.1)
40 33 33 22 26 40 36 26
43 26 33 22 29 43 22 36
43 40 19 36 36 33 15 26
M= 36 43 26 33 29 29 13 4
33 33 29 26 15 33 26 4
36 33 33 26 22 29 26 12
33 46 29 26 33 12 12 4
26 33 29 22 12 4 8 0
This block matrix now consists of 64 DCT coefficients, c (i, j), where i and
j range from 0 to 7. The top-left coefficient, c (0, 0) correlates to the low
frequencies of the original image block. As we move away from c(0,0) in all
directions, the DCT coefficients correlate to higher and higher frequencies of the
image block, where c(7,7) corresponds to the highest frequency. Higher frequencies are mainly represented by lower numbers and lower frequencies by higher numbers [56,58,59]. It is important to note that the human eye is most sensitive to the lower frequencies.
6.2.2 Quantization
Quantization is performed using Q50, the JPEG standard quantization matrix. With a quality level of 50, this matrix renders both high compression and excellent decompressed image quality.
16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
Q50 = 14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99
Recall that the coefficients situated near the upper-left corner correspond to the lower frequencies of the image block, to which the human eye is most sensitive. In addition, the zeros represent the less important, higher frequencies that
have been discarded, giving rise to the lossy part of compression. As mentioned
earlier, only the remaining nonzero coefficients will be used to reconstruct the
image.
The IDCT is next applied to this matrix and the result is rounded to the nearest integer. Finally, 128 is added to each element of that result, giving us the decompressed JPEG version N of our original 8x8 image block M. When the output of the IDCT and the input image matrices are compared, the result is remarkable, considering that nearly 70% of the DCT coefficients were discarded prior to image block decompression/reconstruction. Given that similar results will occur with the rest of the blocks that constitute the entire image, it should be no surprise that the JPEG image is scarcely distinguishable from the original. Remember, there are 256 possible shades of gray in a black-and-white picture, and a difference of, say, 10 is barely noticeable to the human eye [15]. The DCT takes advantage of redundancies in the data by grouping pixels with similar frequencies together; moreover, when the resolution of the image is very high, there is very little change between the original and the decompressed image even after substantial compression and decompression. Thus, we can also conclude that, at the same compression ratio, the difference between the original and the decompressed image decreases as the image resolution increases.
The procedure of the steps followed for image compression using Matlab is
given in the form of flow chart and it is shown in the Figure 6.1.
To find the DCT of the given image, the image is read in the form of frames; each frame consists of 8x8 pixels, represented as an 8x8 image matrix. The coefficients of the image matrix vary from 0 to 255; each coefficient is shifted into the range of -128 to +127 by subtracting 128 from it. The proposed DCT algorithm is applied to the individual rows and columns to find the DCT. Quantization is then performed using the Q50 quantization matrix to achieve the compression, and the compressed image is displayed. The Matlab simulation results are compared with the Verilog HDL simulation results. Results obtained after
performing DCT on the original images are shown in Figure 6.2. The figure 6.3
shows original image and the image obtained after applying DCT and also shows
the reconstructed image by applying IDCT.
The above results show that, with the proposed low power DCT and IDCT algorithms and architecture, it is possible to reconstruct the original image without losing the original image quality.
Figure 6.3 shows how the complete image is reconstructed with the proposed DCT and IDCT algorithms using Matlab. Figures 6.4 and 6.5 show one more typical image before and after applying the proposed DCT and IDCT algorithms.
The main objective of the researcher is to achieve the low power with the
DCT and IDCT core design. The proposed DCT and IDCT architectural blocks of
the design were written in the form of RTL code using Verilog HDL. For
functional verification of the design test-benches are written and simulation is done
using Cadence IUS simulator and Modelsim simulator. Behavioral simulation was
also done using Matlab from MathWorks. Matlab was used to generate the input image matrix for the DCT core. Matlab reads both the input and the output files, calculates the DCT/IDCT coefficients of the input and compares them to the output of the core.
Simulation of the DCT core was done using the Cadence Incisive Unified Simulator (IUS) and the Modelsim simulator. The DCT core was simulated by applying the 8x8 image matrices taken from Matlab. A Verilog test bench was written to simulate the DCT core; a minimal sketch of such a test bench is given below. When the DCT computation is over, the rdy_out signal goes high. After 92 clock cycles the 2-D DCT output for all 64 values is continuous. Table 6.1 shows the signal description of the DCT core.
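The sketch below assumes that the DCT core exposes the signals named in the text (clk, rst, xin[7:0], dct_2d and rdy_out) and accepts one pixel per clock; the real test bench reads Matlab-generated image vectors rather than the placeholder block used here.

// Minimal test bench sketch for the DCT core (assumed port list).
`timescale 1ns/1ps
module dct_core_tb;
  reg         clk = 0;
  reg         rst = 1;
  reg  [7:0]  xin = 0;
  wire [11:0] dct_2d;
  wire        rdy_out;
  integer     i;

  dct_core dut (.clk(clk), .rst(rst), .xin(xin),
                .dct_2d(dct_2d), .rdy_out(rdy_out));

  always #5 clk = ~clk;              // 100 MHz clock

  initial begin
    repeat (4) @(posedge clk);
    rst <= 0;
    for (i = 0; i < 64; i = i + 1) begin
      @(posedge clk);
      xin <= i[7:0];                 // placeholder 8x8 block (values 0..63)
    end
    wait (rdy_out);                  // 2-D DCT output stream begins
    repeat (64) @(posedge clk)
      $display("dct_2d = %d", $signed(dct_2d));
    $finish;
  end
endmodule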
An 8x8 image matrix is given to Matlab, and the DCT block performs the 2-D DCT. The 2-D DCT output obtained from Matlab is given in figure 6.6. The same image input is given to the test bench of the Verilog code to obtain the 2-D DCT. The output from the Verilog HDL simulation of the DCT core is shown in figure 6.7. The resultant DCT matrices obtained in the two cases are compared; only a few coefficients are different, which does not affect the output image much. The IDCT core outputs are also compared with the Matlab results and the HDL simulation results, and they are comparable; hence, with the proposed IDCT architecture it is possible to reconstruct the image without much degradation.
Figure 6.6 shows the Matlab result of the 2-D DCT for an 8x8 image matrix obtained using the proposed low power architecture, and figure 6.7 shows the Modelsim simulation result of the 2-D DCT with the same architecture. The Matlab output for the 8x8 image matrix is compared with the Verilog HDL simulation results: the 64 values of the 8x8 image matrix are forced as input to the DCT core using the test bench and the 64 DCT coefficients are observed at the output. The results obtained in Matlab and in the HDL simulator are the same. Hence the proposed architecture for the hardware implementation of the low power DCT core works efficiently.
The IDCT core was simulated with its input driven by the dct_2d signal from the DCT core. The signal description of the IDCT core is shown in Table 6.2. The output obtained from the IDCT is given to Matlab to reconstruct the image. A Verilog test bench was written to simulate the IDCT core. When the IDCT computation starts, the rdy_in signal is made high. After an initial latency of 84 clock cycles the 2-D IDCT output for all 64 values is continuous. The simulation was carried out using the Cadence IUS simulator and the Modelsim simulator.
The simulation steps for the IDCT were exactly like those followed for the DCT core. Matlab was used for generating the test vectors and for analysis of the output. The core was used to obtain the IDCT coefficients, which were compared with the Matlab results. The 8x8 IDCT blocks were taken and processed, and the whole image was handled using block processing in Matlab. Figure 6.8 and figure 6.9 show the HDL simulation and the Matlab simulation of the IDCT, respectively.
The different features of both the DCT and IDCT cores were analysed using the RTL Compiler from Cadence. The design was synthesized and the core was mapped onto the tcbn65lphvtwcl_ccs standard cell library. Features such as the area, power and number of cells required for the design are tabulated in table 6.3. The layout, placement and routing of each design were done using IC Compiler.
The layouts of both the DCT and IDCT cores show that the design can be comfortably placed in the physical area using the place and route tool without any congestion. The zoomed version of the design shows how the design is placed using the standard cell approach.
The starting point for this proposed research was an FPGA implementation of image compression using DCT. This unit gives the details of that implementation. The FPGA implementation of image compression using Discrete
Cosine Transformation and Quantization was done with the different architecture
and this design was implemented on to the FPGA device 2vp30ff896-7 and the
simulation of the design was done using Modelsim simulator. The different stages
of the process for image compression using DCTQ are shown in different figures.
The figure 6.14 shows the original image which had to be compressed using
DCTQ and reconstruct the same using IDCT.
The image that was generated in Matlab along with the pixel values and the
image matrix is given in the figure 6.15.
In the matrix representation of the gray image of figure 6.15, the maximum value shows pure white and the minimum value shows pure black.
The matrix C is the DCT matrix obtained from the design using Matlab. As we can notice, the DCT matrix values lie between -1 and +1. But to implement the design on the FPGA device, each of the DCT coefficients is multiplied by 128 and level shifted by 128.
Figure 6.16 DCT coefficients before quantization
6.9 RECONSTRUCTION OF IMAGE USING IDCT
The following figure 6.19 shows the comparison of the original and final images using DCT and IDCT respectively.
The synthesis results and the device utilization summary of DCTQ for the
targeted device are presented below.
=======================================
Macro Statistics
# Multipliers : 16
16x9-bit multiplier : 8
9x9-bit multiplier : 8
# Adders/Subtractors : 30
16-bit adder : 7
26-bit adder : 7
9-bit subtractor : 16
# Registers : 48
16-bit register : 16
26-bit register : 8
9-bit register : 24
SUMMARY
The simulation and synthesis of the Discrete Cosine Transform and Inverse
Discrete Cosine Transforms were described in the present chapter. The proposed
architecture of the DCT and IDCT were first implemented in Matlab in order to
estimate the quality of the reconstructed image and the compression that can be
achieved. In addition Matlab output serves as a reference for verifying the Verilog
output. The core Modules of the DCT and IDCT were realized using Verilog for
ASIC implementation. The quality of the reconstructed pictures using Matlab and
Verilog were compared. The synthesis of the proposed architecture was done using the ASIC synthesis tool. The synthesis results show that the design achieves low power in both the DCT and IDCT modules. The design was implemented using 65nm standard cells with a low power approach. The physical design and place and route results for the DCT and IDCT were also presented.
CHAPTER 7
7.1 INTRODUCTION
The test input sequence applied to the zigzag scanning block, with coefficient positions 0 to 63, is: 31 0 1 0 0 0 0 0 1 2 0 0 … 0 0.
The following figure 7.1 shows the simulation waveform of the zigzag
scanning block for the above test sequence. The main purpose of the zigzag
scanning is to scan the low frequency components before the high frequency
components [28]. The simulation results show that the low frequency components are scanned before the high frequency components, so that even if the high frequency components are later neglected it is still possible to reconstruct the image.
Figure 7.1 Waveform obtained after simulating the Zigzag Scanning block
The simulation of the run-length encoding block is done using the output sequence obtained in the zigzag scanning process, which appears as the input of the run-length encoder: 31 0 1 0 2 1 0 0 0 0 0 2 … 0 0.
The following figure 7.2 shows the simulation waveform of the proposed Run-Length Encoding block for the above scanned input sequence. Normally, in a typical quantized DCT matrix, zeros occur far more often than repeated non-zero coefficients. So in the present research the run-length encoder [68, 25] exploits this property and counts the number of zeros between non-zero DCT coefficients, unlike the conventional run-length encoder, in which the number of occurrences of repeated symbols is counted.
The results show that 31 appears first, then one zero before 1, and again one more zero before 2, and the process repeats. The proposed RLE architecture requires fewer digits to represent the same data, and this yields better compression than the conventional RLE.
The output of the run-length encoder is used as the test sequence for the Huffman encoder and appears as the run/level pairs 31; 1,1; 1,2; 0,1; 5,2; EOB.
The following figure 7.3 shows the simulation waveform of the Huffman
encoding block for the above test input sequence. In the proposed look-up table approach for the Huffman encoding block, the received run/value combination is searched for in the look-up table; when the run/value combination is found, its corresponding Huffman code is generated [22, 29, 63]. The waveform shows that 1,1 is coded as 0003, 1,2 is coded as 0004, and so on.
The simulation result of the Variable Length Encoder block shows the further compression of the image after the lossy compression using DCT: using the Variable Length Encoder, further compression is achieved with a lossless approach. After the synthesis of the VLC design, the physical implementation of the Variable Length Encoder is done using the Synopsys IC Compiler EDA tool. Figures 7.4 and 7.5 show the layout diagrams of the VLC with good placement and routing. This layout of the Variable Length Encoder shows that the proposed design is easily implementable with the available standard cells.
The compressed data from the variable length encoder is fed to the
Huffman decoding block as an input. The output obtained from the Huffman encoding block is used as the test input to the Huffman decoding block [84, 88].
The following figure 7.6 shows the simulation waveform for the Huffman
decoding block. The waveform shows 0003 decoded as 1,1 and 0004 is decoded as
1,2 and so on.
The output of the run-length decoder is given as input to the zigzag inverse
scanner, which outputs the quantized DCT coefficients. The original quantized coefficients and the inverse scanner outputs match. Hence it is possible to obtain the original pixel values by performing the IDCT, and the reconstruction of the image can then be achieved easily.
VLC and VLD blocks have been designed using 90 nm Standard Cells with
0.7 Volts global operating voltage. The power and area requirements of all the
different blocks used in the VLC and VLD designs have been obtained by the tool.
They are tabulated in table 7.1. They are also shown in the form of bar chart in the
figure 7.10.
Table 7.1 Features of the VLC and VLD design blocks: design, number of cells, power (mW) and area (µm²)
Table 7.2 shows the comparison of the results of the Huffman decoder in [40] with the proposed architecture, and the comparison is represented as a graph in figure 7.11.
Proposed 45
Figure 7.11 Bar chart showing the power comparison of Huffman decoders
The power consumed by the proposed design is compared with [59]. The compared results for Run Length Encoding and Huffman coding are shown in Table 7.3. The same comparison is shown in the form of a bar chart in figure 7.12.
Table 7.3 Power Comparison for Run Length and Huffman Encoders
Proposed 65.9
The percentage of power savings from the proposed design is calculated and tabulated in table 7.4.
Each of the comparison results shows that the proposed designs, mapped to 90nm standard cell libraries with low power techniques, achieve low power consumption without compromising the performance of the design.
SUMMARY
The simulation and synthesis of the Variable Length Encoder and Variable
Length Decoder were described in the present chapter. The proposed architecture
of the VLC and VLD modules was designed using Verilog code and simulated using the Modelsim simulator. The synthesis of the design was done using the RTL Compiler EDA tool and implemented using 90nm standard cells. The physical design of both the VLC and VLD was done using the IC Compiler back-end VLSI tool. The power analysis of each design was done and the results are tabulated. The tabulated results show that the present research work is able to achieve low power with the proposed architecture. Finally, the compression ratio was computed; this ratio shows good compression without compromising the performance.
CHAPTER 8
8.1 CONCLUSIONS
In this research work, algorithms and architecture have been developed for
Discrete Cosine Transform, Quantization, Variable Length Encoding and
Decoding for image compression with an emphasis on low power consumption.
These algorithms have been subsequently verified and the corresponding hardware
architectures are proposed so that they are suitable for ASIC implementation. Six novel ideas have been presented in this thesis in order to speed up the low power image compression implementation on ASIC. The present work also throws open a number of directions that may be taken up by researchers in the near future.
The design of the proposed work is modular and scalable and can therefore be upgraded to accommodate more compatible standards without an appreciable increase in hardware. Future work may also include making the processing of images faster by including parallel architectures. Some applications, such as medical image processing, require maintaining high image quality, where quality is preferred over compression; in those cases lossless variable length coding can play a vital role. As a part of future enhancement, an ultra low power architecture suitable for portable systems can be implemented for both the DCT and the IDCT by redesigning the multipliers and adder circuits, which consume the major share of power in the design presented. The power requirement can also be scaled down drastically by reducing the picture size as well as the operating frequency and voltage.
APPENDIX
Getting started
Pre Setup
If you are using Mac OS X or Windows, please refer to the links shown at the end for the software requirements to connect to the UNIX server.
For Windows:
• PuTTY
To connect with PuTTY, simply type in the domain name under the Session section, make sure that the connection type is SSH with port 22, and click Open.
You will then be prompted to input your user name and password, which should be given to you by your system admin.
With the SSH Tectia Client, click on Quick Connect and you will be prompted for the domain name and user name. Make sure the port number is 22.
• Cygwin
With username referring to your user name, hit Enter and you will be prompted for your password:
Once you have connected to the server, check whether you have the file ius55.csh. If you have it, all is good; if not, report to your system admin. Then run the following commands:
csh
source ius55.csh
Now press the Enter key. You will see the confirmation in the terminal, and that's it: the Verilog environment is set up successfully.
There are many editors that one can choose from. Here I opted for the vi editor, i.e., the gvim editor. To write a Verilog source file using gvim, type in the following command:
gvim test1.v
The file extension must be .v, otherwise the compiler will not recognize the file. After editing, the Verilog code appears in the editor.
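As a purely illustrative example, a test1.v file of this kind could contain a small design together with a test bench module named test1_tb, matching the worklib.test1_tb snapshot invoked later in this appendix:

// Illustrative contents of test1.v: a 4-bit counter plus its test bench.
module test1 (
  input            clk,
  input            reset,
  output reg [3:0] count
);
  always @(posedge clk or posedge reset)
    if (reset) count <= 4'd0;
    else       count <= count + 4'd1;
endmodule

module test1_tb;
  reg        clk = 0;
  reg        reset = 1;
  wire [3:0] count;

  test1 dut (.clk(clk), .reset(reset), .count(count));

  always #5 clk = ~clk;

  initial begin
    #20 reset = 0;                      // release reset after two clock periods
    #200 $display("final count = %d", count);
    $finish;
  end
endmodule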
This is the easy part. To just compile your code do the following:
ncverilog -c test1.v
To compile and simulate in a single step, run:
ncverilog test1.v
After running the compiler and simulator, you should notice that in your current directory a folder called INCA_libs will be created, which holds snapshots of the simulation. To invoke the snapshot, simply type the following for the current program:
ncsim worklib.test1_tb:v
As you can notice from the argument, test1_tb is your test bench module. If all goes well you should see the following:
Furthermore, you will notice that in your current directory ncsim and ncverilog have written logs of the past activities.
WAVEFORM VIEWING:
bsub simvision
You can also access the other tools in the Simvision analysis environment through
menu choices and toolbar buttons, as follows
Next, the design browser lets you move through the design hierarchy to view objects. You can use the design browser to select the signals that you want to display in the waveform window.
In order to display the selected signals in the waveform window, select the required signals (e.g., load, clk, reset, etc.), then click on the waveform button to display these selected signals in the waveform window.
13. Byoung-2 Kim, Sotirios G. Ziavras (2009), “Low Power Multiplierless DCT for Image/Video Coders”, IEEE 13th International Symposium on Consumer Electronics, pp.133-136.
14. Chi-Chia Sun, Benjamin Heyne, Juergen Goetze (2006), “A Low Power and High Quality CORDIC Based Loeffler DCT”, IEEE conference.
16. Chi-Chia Sun, Philipp Donner and Jurgen Gotze (2009), “Low-
Complexity Multi-Purpose IP Core for Quantized Discrete Cosine and
Integer Transform”, IEEE International Symposium on Circuits and
Systems, pp.3014-3017.
20. Da An, Xin Tong, Bingqiang Zhu and Yun He(2009), “A Novel Fast
DCT Coefficient Scan Architecture”, IEEE Picture Coding Symposium I
Beijing 100084, China, pp.1-4.
21. David A. Maluf, Peter B. Tran, David Tran (2008), “Effective Data
Representation and Compression in Ground Data Systems”, International
Conference on Aerospace, pp 1-7.
22. Ding Xing hao, Qian Kun, Xiao Quan, Liao Ying hao, Guo Dong hui,
Wang Shou jue (2009),“Low Bit Rate compression of Facial Images
Based on Adaptive Over-complete Sparse Representation” , IEEE 2nd
International congress on Image and Signal Processing, pp.1-3.
23. Dr. Muhammad Younus Javed and Abid Nadeem (2000), “Data
Compression Through Adaptive Huffman Coding Scheme”, IEEE
Proceedings on TENCON ,Vol.2. pp.187-190.
24. Emy Ramola, J. Samuel Manoharan (2011), “An Area Efficient VLSI
Realization of Discrete Wavelet Transform for Multi resolution
Analysis”, IEEE International Conference on Electronics Computer
Technology (ICECT) pp.377-381.
25. En-hui Yang, and Longji Wang (2009), “Joint Optimization of Run-
Length Coding, Huffman Coding, and Quantization Table With Complete
Baseline JPEG Decoder Compatibility”, IEEE Transactions on Image
processing, Vol.8, Issue 1, pp.63-74.
29. Gopal Lakhani (2004), “Optimal Huffman Coding of DCT Blocks”, IEEE
transactions on circuits and systems for video technology, Vol.14,issue.4.
pp 522-527.
31. Hai Huang, Tze-Yun Sung, Yaw-shih Shieh (2010), “A Novel VLSI
Linear array for 2-D DCT/IDCT”, IEEE 3rd International Congress on
Image and Signal Processing, pp.3680-3690.
32. Hassan Shojania and Subramania Sudarsanan (2005), “A High
Performance CABAC Encoder”, Proceedings of the 3rd International
Conference, pp.315-318.
37. Jack Venbrux, Pen-Shu and Muye (1992), “A VLSI Chip Set for High-
Speed Lossless Data Compression”, IEEE Transactions on Circuits and
Systems for Video Technology, Vol. 2, Issue.4, pp. 381-391.
38. Jaehwan Jeon, Jinhee Lee, and Joonki Paik (2011), “Robust Focus
Measure for Unsupervised Auto-Focusing Based on Optimum Discrete
Cosine Transform Coefficients”, IEEE Transactions on Consumer
Electronics, Vol. 57, No. 1, pp.1-5.
39. Jason McNeely and Magdi Bayoumi (2007), “Low Power Look-Up
Tables for Huffman Decoding”, IEEE International Conference on Image
Processing , pp.465-468.
40. Jason McNeely, Yasser Ismail, Magdy A. Bayoumi and Peiyi Zaho
(2008).” Power Analysis of the Huffman Decoding Tree”, 15th IEEE
International Conference on Image Processing, pp.1416-1419.
41. Jer Min Jou and Pei-Yin Chen (1999), “A Fast and Efficient Lossless
Data-Compression Method”, IEEE Transactions on communications,
Vol.47.Issue.9, pp.1278-1283.
42. Jia-Yu Lin, Ying Liu, and Ke-Chu Yi(2004), “ Balance of 0,1 Bits for
Huffman and Reversible Variable-Length Coding”, IEEE Journal on
Communications, pp. 359-361.
43. Jin Li Weiwei Chen Moncef Gabbouj Jarmo Takala Hexin Chen (2011)
“Prediction of Discrete Cosine Transformed Coefficients in Resized Pixel
Blocks”, IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) pp.1045-1048.
44. Jin Li, Moncef Gabbouj, Jarmo Takala and Hexin Chen (2009), “Direct
3-D DCT-to-DCT Resizing Algorithm for Video Coding”, Proceedings of
the 6th International Symposium on Image and Signal Processing and
Analysis, pp.105-110.
47. Kamrul Hasan Talukder and Koichi Harada (2007), “Discrete Wavelet
Transform for Image Compression and A Model of Parallel Image
Compression Scheme for Formal Verification”, Proceedings of the World
Congress on Engineering.
48. Koen Denecker, Jeroen Van Overloop and Ignace Lemahieu (1997), “An
Experimental Comparison of Several Lossless Image Coders for Medical
Images”, IEEE International Conference on Data Compression.
50. L.Y.Liu, J.F.Wang, R.J. Wang, J.Y. Lee (1995), “Design and Hardware
Architectures for Dynamic Huffman Coding”, IEEE Proceedings on
Computers and Digital Techniques. Vol.142, Issue.6, pp 411-418.
52. Li Wenna, Goa Yang, Yi Yufeng, Goa Liqun (2011) “Medical image
coding based on wavelet transform and distributed arithmetic coding”,
IEEE International Conference on Chinese Control and Decision
Conference (CCDC), pp.4159-4162.
53. Liang-Wie, Liang-Ying Liu, Jhing-Fa Wang and Jau-Yien Lee (1993)
“Dynamic Mapping Technique for Adaptive Huffman Code” IEEE
International Journal on Computer, Communication, Control and Power
Engineering,Vol.3,pp 653-656.
54. Lili Liu, Hexin Chen, Aijun Sang, Haojing Bao (2011), “Four-
dimensional Vector Matrix DCT Integer Transform codec based on
multi-dimensional vector matrix theory”, IEEE fourth International
Conference on Intelligent Computation Technology and Automation
(ICICTA), pp.552-555.
55. Lin Ma, Songnan Li, Fan Zhang and King Ngi Ngan (2011), “Reduced
Reference Image Quality Assessment using Reorganized DCT Based
Image Representation”, IEEE Transactions on Multimedia, Vol. 13, NO.
4, pp. 824-829.
60. M.R.M. Rizk (2007) “Low Power Small Area High Performance 2D-
DCT Architecture”, 2nd International Design and Test Workshop (IDT),
pp.120-125-777.
61. Majdi elhaji, Abdlekrim Zitouni, Samy meftali, Jean-luc Dekey ser and
rached tourki (2011), “A Low power and highly parallel implementation
of the H.264 8*8 transform and quantization”, Proceedings of 10th IEEE
International Symposium on Signal Processing and Information
Technology (ISSPIT), pp.528-531.
62. Marcelo J. Weinberger, Gadiel Seroussi, Guillermo Sapiro (1996),
“LOCO-I A Low Complexity, Context-Based, Lossless Image
Compression Algorithm”, IEEE International Conference on Data
Compression, pp.140-149.
63. Md. Al Mamun, Xiuping Jia and Michael Ryan (2009), “Adaptive data
compression for Efficient Sequential Transmission and Change Updating
of Remote Sensing Images”, IEEE International Symposium Geoscience
and Remote sensing, IGARSS, pp.498-501.
66. Muhammed Yusuf Khan, Ekram Khan and M. Salim Beg (2008),
“Performance Evaluation of 4×4 DCT Algorithms for Low Power
Wireless Applications”, International Conference on Emerging trends in
Engineering and Technology. pp.1284-1286.
68. Munish Jindal, RSV Prasad and K. Ramkishor (2003), “Fast Video
Coding at Low Bit-Rates Mobile Devices”, International Conference on
Information, Communication and Signal Processing Vol.1, pp 483- 487.
72. Paul G. Howard and Jeffrey Scott Vitter (1991), “Analysis of Arithmetic
Coding for Data Compression”, IEEE International Conference on Data
Compression, pp.3-12.
73. Paulo Roberto Rosa Lopes Nunes (2006), “Segmented Optimal Linear
Prediction applied to Lossless Image Coding”, IEEE International
Symposium on Telecommunications, pp.524-528.
74. Pei-Yin Chen, Member, Yi-Ming Lin, and Min-Yi Cho (2008), “An
Efficient Design of Variable Length Decoder for MPEG-1/2/4”, IEEE
International Transactions on multimedia, Vol.16, Issue 9, pp.1307-
1315.
75. Peng Wu, Chuangbai Xiao, Shoudao Wang, Mu Ling (2009), “An
Efficient Method for early Detecting All-Zero Quantized DCT
Coefficients for H.264/AVC”, IEEE International Conference on
Systems, Man and Cybernetics San Antonio, USA, pp. 3797-3800.
76. Piyush Kumar Shukla, Pradeep Rusiya, Deepak Agrawal, Lata Chhablani,
Balwant Singh (2009.), “Multiple Subgroup Data Compression
Technique Based On Huffman Coding”. First International Conference
on Computational Intelligence, Communication Systems and Networks
(CICSYN), pp. 397-402.
79. Reza Hashemian (2003),”Direct Huffman Coding and Decoding using the
Table of Code-Lengths”, IEEE International conference on Information
Technology, Coding, Computers and Communication, pp.237-241.
80. Ricardo Castellanos, Hari Kalva and Ravi Shankar (2009), “Low Power
DCT using Highly Scalable Multipliers”, 16th IEEE International
Conference on Image Processing, pp.1925-1928.
83. S.Vijay, D. Anchit (2009), “Low Power Implementation of DCT for On-
Board Satellite Image Processing Systems”, 52nd IEEE International
Symposium on Circuits and Systems, pp.774-777.
85. Stephen Molloy and Rajeev Jain (1997), “Low Power VLSI Architectures
for Variable-Length Encoding and Decoding”, Proceedings of the 40th
International Midwest Symposium on Circuits and Systems, pp.997-
1000.
89. Sunil Bhoosan, Shipra Sharma(2009), “An Effective and Selective Image
Compression Scheme Using Huffman and Adaptive Interpolation”, 24th
IEEE International Conference on Image and Vision Computing New
Zealand, pp.197-202.
90. Sunil Bhooshan, Shipra Sharma (2009), “An Efficient and Selective
Image Compression Scheme using Huffman and Adaptive Interpolation”,
24th International Conference Image and Vision Computing New Zealand
(IVCNZ ), pp.1-3.
94. Vijay Kumar Sharma, K. K. Mahapatra and Umesh C. Pati (2011) “An
Efficient Distributed Arithmetic based VLSI Architecture for DCT”,
Proceedings of IEEE International Conference on Devices and
Communications, pp.1-5.
95. Vijay Kumar Sharma,U.C. Pati and K.K. Mahapatra (2010), “An Study of
Removal of Subjective Redundancy in JPEG for Low Cost, Low Power,
Computation efficient Circuit Design and High Compression Image”
Proceedings of IEEE International Conference on Power, Control and
Embedded systems (ICPCES), pp.1-6.
97. Wei-Yeo Chiu, Yu-Ming Lee and Yinyi Lin (2010), “Advanced Zero-
block Mode Decision Algorithm for H.264/AVC Video Coding”,
Proceedings of IEEE International Conference (TENCON), pp.687-690.
98. Wenna Li, Zhaohua Cui (2010), “Low Bit Rate Image Coding Based on
Wavelet Transform and Color Correlative Coding”, International
Conference on Computer Design and Applications (ICCDA) , pp.479-
482.
99. Y. M. Lin and P. Y. Chen (2006) ,“A Low-Cost VLSI Implementation for
VLC” IEEE International Conference on Industrial Electronics and
Applications (ICIEA) ,pp.1-4.
100. Y.P Lee, Chen (1997) “A cost effective architecture for 8x8 two-
dimensional DCT/IDCT using direct method”, IEEE Transactions on
circuit and system for video technology vol 7. Issue.9. pp. 459–467.
103. Yongli Zhu, Zhengya Xu (2006) “Adaptive Context Based Coding for
Lossless Color Image Compression” IMACS Multiconference on
Computational Engineering in Systems Applications (CESA), Beijing,
China, pp.1310-1314.
104. Yushi Chen, Yuhhang Zhang, Ye Zhang, Zhixin Zhou (2011) “Fast
Vector Quantization Algorithm for Hyperspectral Image Compression”,
IEEE International Conference on Data Compression. pp.450.
LIST OF PUBLICATIONS
NATIONAL CONFERENCES
INTERNATIONAL CONFERENCES
2. Vijaya Prakash A M, Anoop R. Katti and Shakeeb Ahabed Pasha B K (2011), "Novel VLSI Architecture for Real Time Blind Source Separation", IEEE International Conference on ARTCOM-2011, Reva College of Engineering, Bangalore, India.
INTERNATIONAL JOURNALS
Educational Qualification:
Pursuing PhD ("Low Power VLSI Architecture for Image Compression Using Discrete Cosine Transform") at Dr. M.G.R. University, Chennai.
Software skills:
EDA tools:
PERSONAL DETAILS:
I declare that the above particulars are true to the best of my knowledge and belief.
(VIJAYAPRAKASH A M)