A Smart Camera for Multimodal Human Computer Interaction
Yu Shi, Parnesh Raniga, Ismail Mohamed

Yu Shi is with National ICT Australia, Australian Technology Park, Eveleigh, NSW 1430, Australia (e-mail: [email protected]).
Parnesh Raniga is with the University of Sydney, Darlington, NSW 2006, Australia (e-mail: [email protected]).
Ismail Mohamed is with the University of Queensland, Brisbane, QLD 4072, Australia (e-mail: [email protected]).

Abstract — A smart camera is an embedded vision system which, in addition to image capture, performs image analysis and pattern recognition to provide as output a high-level understanding of the imaged scene. Smart cameras are essential components of active and automated control systems for many applications, such as surveillance, machine vision, and interactive visualization systems. The heart of a smart camera is the intelligent image processing algorithms that turn raw data into knowledge. The design of a smart camera is challenging because, on one hand, video processing has an insatiable demand for performance and power, and on the other hand, embedded systems place considerable constraints on the design. In this paper we first present an overview of smart camera technologies and the process of designing smart cameras as embedded systems. We then present the design and implementation of a smart camera, called GestureCam, which can recognize simple hand and head gestures. The camera uses a CMOS image sensor as its capture front-end, and the image processing and gesture recognition are built entirely on a single FPGA device. Experimental results have shown it to be robust, with enough performance to meet real-time constraints. We plan to use the GestureCam to build the next generation of natural multimodal human computer interfaces.

Index Terms — Human Computer Interface, Computer Vision, Smart Cameras, Embedded System.

I. INTRODUCTION

A smart camera is a vision system whose primary function is to produce a high-level understanding of the imaged scene and generate application-specific data to be used in an autonomous and intelligent system. The idea of smart cameras is to convert data to knowledge by processing information where it becomes available, and to transmit only results that are at a higher level of abstraction. A smart camera is 'smart' because it performs application-specific information processing (ASIP), the goal of which is usually not to provide better quality images for human viewing but to understand and describe what is happening in the images for the purpose of better decision-making in an automated control system. For example, a motion-triggered surveillance camera captures video of a scene, detects motion in the region of interest, and raises an alarm when the detected motion satisfies certain criteria. In this case, the ASIP is motion detection and alarm generation.

The advent of smart cameras can be traced back to the early 1990s, when PCs became popular and video frame grabbers became available. Frame grabbers allowed CCD (Charge-Coupled Device) cameras with analogue output to be connected to computers and digitized for versatile processing. This marked the beginning of smart camera systems, with the camera performing image capture and the computer carrying out intelligent processing tasks such as motion detection and shape recognition. The first applications were in the areas of industrial machine vision and surveillance.

The real interest in and growth of smart cameras started in the late 1990s and early 2000s, spurred by factors such as technological advancements in chip manufacturing, embedded system design, and the coming-of-age of CMOS (Complementary Metal Oxide Semiconductor) image sensors. Socio-economic factors such as Moore's law and society's increasing concerns over security due to the impact of terrorism also played important roles. Advanced smart camera systems often integrate the latest technologies in image sensors, optics, imaging systems, embedded systems, computer vision, video analysis, communication, and networking.

One of the application areas for smart cameras is human computer interfaces (HCI). The camera-based optical mouse is a good example of a smart camera for HCI. Multimodal human computer interaction is an emerging technology in HCI which aims to enable a computer to understand a user's input made through his or her speech and gestures. Allowing a user to interact with a computer in a way similar to human-to-human communication is the holy grail of HCI researchers and engineers. Gesture recognition is an important part of this exciting technology.

Gesture recognition is not a trivial task. The computer vision-based approach to gesture recognition usually involves general purpose cameras connected to a general purpose PC for video digitization and processing. However, PCs are far from ideal for high data rate image processing tasks such as those involved in gesture recognition, due to real-time and low latency requirements. This is especially true when high resolution and high frame rate cameras are used.



In this paper we present the design and implementation of a smart camera, called GestureCam, which can recognize simple head and hand gestures. The GestureCam is built from a CMOS image sensor for image capture and a Xilinx Virtex-II Pro FPGA for gesture tracking and recognition. The GestureCam can have many applications in Multimodal Human Computer Interaction (MHCI).

In the remainder of this paper, Section II discusses some related work; Section III presents an overview of the smart camera design process; Section IV reviews vision-based gesture recognition; Section V presents the design and implementation of the GestureCam (work in progress); Section VI explores some possible applications of the GestureCam in MHCI.

II. RELATED WORK

There has been some research work in recent years on building smart cameras that can recognize gestures. Wolf et al. in [1] described a VLSI-based smart camera for gesture recognition. They used a commercial general purpose camera that provides analogue output to a VLSI video processing board inserted into a PC. Bonato et al. in [2] presented the design of an FPGA-based smart camera that can perform gesture recognition in real-time for a mobile robot application. They used a CMOS camera as the capture device, which provides a processed digital image output to their FPGA. Wilson et al. in [3] designed a system allowing a user to control Windows applications using gestures. Their system uses a pair of general purpose cameras.

Our GestureCam is a smart camera built from scratch; that is, it is not based on a commercial camera which provides processed analogue or digital outputs, as is the case in [1, 2, 3]. Rather, the image capture part of the GestureCam is custom built so that we can apply our own color and image pre-processing algorithms to the raw video output (Bayer pattern) of the image sensor. This gives us the opportunity to have low-noise, better quality data going into gesture recognition.

III. DESIGN OF SMART CAMERAS AS EMBEDDED SYSTEMS

Figure 1 shows the design process for smart cameras as embedded systems, which we created and followed in the design and development of the GestureCam. As shown in Figure 1, the process can be iterative.

[Fig. 1. Design process for smart cameras as embedded systems: application requirements, architecture design, proof of concept, algorithm conversion, integration and debugging, test and evaluation; if the requirements are not met, the process iterates.]

The application requirements specification stage is of paramount importance. Correct specifications can shorten the design and development cycle, provide clear targets for algorithm and hardware performance, and reduce total cost.

The system architecture design stage decides on software and hardware architectures, based on performance, time-to-delivery and cost criteria. Algorithmic design and timing design suitable to the targeted hardware platform also need to be defined. The mapping between algorithm requirements and hardware resources is an important issue. For the hardware architecture, a heterogeneous, multiple-processor architecture can be ideal for smart camera development. For example, such an architecture may consist of an FPGA or a DSP as a data processor to tackle image segmentation and feature extraction, and a high-performance DSP or media processor to tackle math-intensive tasks such as statistical pattern classification. This kind of system allows better exploitation of pipelining and parallel processing, which are essential to achieve high frame rates and low latency.

The proof-of-concept stage may use a PC platform for research and algorithm development. Usually a COTS (Commercial Off-The-Shelf) general purpose camera is used at this stage. Hardware components specified by the architecture design need to be acquired, integrated and tested. However, this is not needed if, during the architecture design stage, a third party camera development platform or hardware accelerator unit for video processing is identified as an appropriate hardware platform.
The algorithm conversion stage is necessary because algorithm development for embedded systems is quite different from that for PC-based platforms. Basically, it can be a lot more demanding and challenging, especially if FPGA or ASIC processors are targeted. Usually, when designing applications for an ASIC or FPGA, one has to understand the chip architecture so that algorithms can be executed efficiently and effectively. Nowadays behavioral or algorithmic synthesizers do exist to help designers forget about the device architecture and focus on functionality, but they come at the cost of efficiency in terms of chip area (gate count) and power consumption. Therefore, it is always important to gain an intimate knowledge of the device architecture of whichever of the ASIC, FPGA or DSP is targeted. This intimate knowledge can also help in designing parallel and pipelined processing, which can be very important and effective video processing techniques. Converting floating-point arithmetic to fixed-point, eliminating divisions as much as possible (by using hardware multipliers and look-up tables, for example), and taking low power and low complexity requirements into account are other design considerations for algorithm conversion.
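
To illustrate what such a conversion looks like in practice, the minimal sketch below shows a Q8.8 fixed-point multiply of the kind a floating-point gain operation is typically reduced to; the Q8.8 format and the helper names are our illustrative choices, not something prescribed by the design process itself.

#include <cstdint>

// Minimal Q8.8 fixed-point sketch (illustrative format choice).
// A value x is stored as the 16-bit integer x * 256.
using q8_8 = int16_t;

constexpr q8_8 to_fixed(float x)  { return static_cast<q8_8>(x * 256.0f); }
constexpr float to_float(q8_8 x)  { return static_cast<float>(x) / 256.0f; }

// Fixed-point multiply: widen to 32 bits, multiply, then drop the extra
// 8 fractional bits. This maps directly onto an FPGA hardware multiplier.
constexpr q8_8 fx_mul(q8_8 a, q8_8 b)
{
    return static_cast<q8_8>((static_cast<int32_t>(a) * b) >> 8);
}

int main()
{
    const q8_8 gain  = to_fixed(1.5f);                        // 0x0180
    const q8_8 pixel = to_fixed(42.0f);
    return to_float(fx_mul(gain, pixel)) == 63.0f ? 0 : 1;    // 1.5 * 42 = 63
}
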
The embedded system integration stage results in a prototype smart camera using an embedded hardware platform running embedded versions of the algorithms. This can be a time-consuming process and sometimes requires adjustments to be made to the algorithms, software and hardware architecture.

The field test stage provides a precious opportunity to test camera performance in a realistic environment, identify potential problems and possible improvements, and benchmark camera performance against the initial application requirement specifications.

IV. VISION-BASED GESTURE RECOGNITION

The goal of gesture recognition (GR) in an HCI context is for a computer to interpret or recognize gestures made by the user, usually by moving his or her hand(s) and head. Vision-based GR, or simply GR hereafter, aims to use cameras and computer vision and image processing techniques to detect, track and recognize gestures.

A very common approach to GR is to use a general purpose video camera as the image capture device and link it to a PC through either a direct digital interface (e.g. CameraLink, FireWire), a frame grabber, or a dedicated PCI-based processing board. All processing tasks are performed by either the PC or the dedicated processing board. Figure 2 shows the main steps typically involved in vision-based gesture recognition. The image pre-processing and object segmentation stages localize and segment objects of interest, such as the head, face and hands. The feature extraction stage is crucial to GR, as it uses a small set of salient parameters to represent each gesture and provides good distinguishability between gestures. Transforming images and objects into a feature space representation significantly reduces the amount of data to be processed for classification. The gesture classification stage is essentially a pattern recognition task which compares incoming feature vectors against those from a database of predefined gesture representations. The gesture database is built through training and modelling. HMMs (Hidden Markov Models) are a widely used technique for classification.

[Fig. 2. Typical gesture recognition process: camera output, image pre-processing, object segmentation, feature extraction, and gesture classification against a gesture database, producing recognized gestures.]

V. GESTURECAM DESIGN AND DEVELOPMENT

A. Motivation – Why FPGA-Based GestureCam?

The idea of GestureCam is to design and develop a stand-alone, FPGA-based smart camera that can perform all the tasks of gesture recognition. In other words, it is an embedded device with raw video in and recognized gestures out. The advantages of the GestureCam compared to the common approach of using a general purpose camera and a PC to perform GR include:
• An FPGA is a far better computing platform than a PC for performing data-intensive video processing tasks with real-time requirements;
• There is no need for high bandwidth between the camera and the PC, because the output from the camera can be as simple as merely an index of a gesture in a pre-defined gesture database;
• GestureCam can greatly simplify the design of an HCI application which includes a GR component, because the designer can focus on how to make use of the gesture input and does not have to worry about allocating computing and bandwidth resources to GR;
• GestureCam is more compact and easier to deploy than a combination of a general purpose video camera and a PC;
• Using an FPGA as the smart camera platform can have advantages over a DSP-based platform in terms of performance, cost and conversion to an ASIC-based mass market product.

B. Camera Design

The design and development of GestureCam followed the process described in Section III.

1) Application Requirements

The first version of the GestureCam aims to meet the following design requirements:
• Desktop HCI environment, with a single computer user sitting in front of a computer monitor and the GestureCam placed above or to the side of the monitor, capturing the whole view of the user's head and a raised hand;
• Real-time recognition of a simple head and hand gesture set, for example, head moving to the left and right, raised hand with palm open moving up and down.

2) System Architecture

The GestureCam development system consists of mainly three parts: an image capture unit (ICU), an FPGA-based GR unit (GRU), and a host and display unit (HDU), as shown in Figure 3. Strictly speaking, GestureCam itself only includes the ICU and the GRU. The HDU is mainly for FPGA and application development, and for debugging.

[Fig. 3. GestureCam development system components: ICU (optics and image sensor for capture), GRU (FPGA-based image processing and gesture recognition engine), and HDU (host PC and display for I/O and debugging).]

The ICU includes a small in-house built PCB on which sits a megapixel CMOS color image sensor, the OV9620 from OmniVision. The PCB fits into a dummy camera casing which provides easy connection to a 2/3" format video lens from Computar. The OV9620 provides full SXGA (1280x1024) resolution Bayer pattern video output at 15 frames per second, and VGA (640x480) resolution at 30 frames per second.

A Xilinx Virtex-II Pro FPGA development kit from Memec has been chosen to form the GRU. This kit is a powerful yet flexible development platform for imaging applications. It includes a Virtex-II Pro 2VP30 with on-chip embedded PowerPC cores, over 2 Mb of on-chip RAM, and Ethernet and RS232 ports. It also provides a Memec Design P160 expansion module and a P160 daughter card.

Two programming environments for the development kit are available: the Xilinx ISE environment facilitates HDL (Hardware Description Language) design for processing modules that need low-level customization to maximize performance, while the Xilinx EDK environment allows running complex algorithms in C on the PowerPC.

For the first version of the GestureCam, we decided to work on an image of half-VGA resolution, that is, 320x240 pixels, so that the RAM on the FPGA chip is big enough for all frame and line buffering requirements and there is no need to use off-chip SDRAM. Later on, the design can be scaled up to accommodate the full capture resolution.

The connection between the ICU and the GRU is through one of the two connectors of the P160 daughter card, which sits on the P160 expansion module on the Memec development board. The connector connects sensor data and control pins to the FPGA's available user pins.

The HDU consists of a host PC with two LCD monitors. One monitor is for code development and the user interface for camera configuration; the other is for real-time video display of results from different stages of processing, for debugging purposes. An in-house built VGA display converter based on the CH8398 video DAC from Chrontel converts digital RGB video output to the analog VGA display standard for display on the monitor. The RGB data available from FPGA user pins goes into the VGA display module via the other connector of the P160 daughter card. The recognized gesture output and gesture trajectory data can be transferred to the host PC via an Ethernet connection. Sensor and camera control data can be passed to the GRU and ICU via an RS232 port.

Figure 4 shows the hardware set-up of the GestureCam development platform. The image sensor board under the optics is shown near the upper-left corner. The VGA display DAC module is shown near the upper-right corner.

[Fig. 4. Photo of the GestureCam development platform.]

3) Algorithms Design

For GestureCam, robust head/hand segmentation helps achieve robust feature extraction, which is crucial to gesture classification. In the context of a single user and desktop use scenario, where the user is close to the camera and his or her head and hand have relatively large sizes in the captured images, we decided to employ skin-color detection and classification to achieve the user's head/hand segmentation.
The image pre-processing stage consists mainly of color interpolation. As the CMOS image sensor in GestureCam provides only Bayer pattern video output, each pixel has only one color, R, G or B, of one byte depth. Because the skin-color detection algorithm requires each pixel to have all three RGB color components, a color interpolation operation needs to be carried out to produce the two missing colors for each pixel. For GestureCam the image sensor operates in VGA mode and only a half-VGA resolution is used for skin color detection and subsequent processing. This sub-sampling operation and the color interpolation operation take place in the same process, with the help of a line buffer. This is illustrated in Figure 5. If we denote the four pixels of the Bayer pattern VGA image shown in Figure 5(a) as P(m,i), P(m,i+1), P(m+1,i) and P(m+1,i+1), and the pixel of the sub-sampled half-VGA image shown in Figure 5(b) as P'(m,i), then the R, G and B components of the pixel P'(m,i) are:

P'green(m,i) = (P(m,i) + P(m+1,i+1)) / 2;
P'blue(m,i) = P(m,i+1);
P'red(m,i) = P(m+1,i).

[Fig. 5. Sub-sampling and color interpolation: (a) a 2x2 block of the Bayer pattern VGA image; (b) the corresponding pixel P'(m,i) of the sub-sampled half-VGA image.]
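
In software form, the combined sub-sampling and interpolation amounts to the following per-block computation, a sketch that follows the formulas above (green samples at P(m,i) and P(m+1,i+1), blue at P(m,i+1), red at P(m+1,i)); on the FPGA the same arithmetic is fed from a line buffer rather than a full frame in memory.

#include <cstdint>
#include <vector>

// One RGB output pixel per 2x2 Bayer block: the two green samples on
// the diagonal are averaged, blue and red are copied from their single
// samples, per the formulas above.
struct Rgb { uint8_t r, g, b; };

std::vector<Rgb> subsample_vga(const std::vector<uint8_t>& bayer,
                               int width, int height)    // e.g. 640, 480
{
    std::vector<Rgb> half((width / 2) * (height / 2));
    for (int m = 0; m < height; m += 2) {
        for (int i = 0; i < width; i += 2) {
            const int g0 = bayer[m * width + i];            // P(m,i)
            const int b  = bayer[m * width + i + 1];        // P(m,i+1)
            const int r  = bayer[(m + 1) * width + i];      // P(m+1,i)
            const int g1 = bayer[(m + 1) * width + i + 1];  // P(m+1,i+1)
            half[(m / 2) * (width / 2) + i / 2] = Rgb{
                static_cast<uint8_t>(r),
                static_cast<uint8_t>((g0 + g1) / 2),
                static_cast<uint8_t>(b) };
        }
    }
    return half;
}
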
The object (head/hand) segmentation stage includes three algorithms, in the order of skin-color classification, filtering and contour tracing. The purpose of the filtering is to eliminate small patches of skin-like objects in the background, and the purpose of the contour tracing is to produce segmented skin color blobs, which can be approximated by ellipses.

For skin-color classification, we use Gomez and Morales' [4] standard rule-based skin algorithm, which determines whether a pixel is skin or not based on the RGB color of the pixel. If the color lies within a predefined skin color region, then it is classified as skin. The normalized RGB color space is used in this algorithm. This color space uses the standard RGB color space but normalizes the colors to make them more independent of the intensity of light in the scene.
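
A minimal software sketch of such a rule-based test is shown below; the exact decision region is defined in [4], so the numeric bounds here are illustrative placeholders only.

#include <cstdint>

// Rule-based skin test in normalized RGB, in the spirit of [4]. The
// normalization makes the test largely independent of scene brightness.
// The region bounds below are illustrative, not the published constants.
bool is_skin(uint8_t R, uint8_t G, uint8_t B)
{
    const float sum = static_cast<float>(R) + G + B;
    if (sum == 0.0f) return false;
    const float r = R / sum;               // normalized red
    const float g = G / sum;               // normalized green
    // Accept pixels inside a predefined region of the (r, g) plane.
    return r > 0.40f && r < 0.60f &&
           g > 0.28f && g < 0.36f &&
           r > g;
}
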
The filtering operation is done by applying a simple 25x25 low pass (rectangular, or averaging) filter to each pixel. The result is that all significant skin color objects are converted into grayscale blobs, with the largest blob containing the brightest pixel, i.e. the highest grayscale value.

The contour tracing algorithm is based on the "External Boundary Tracing Algorithm" (EBTA) proposed by Rhee et al. [5]. Before applying the EBTA algorithm, we start by finding an edge of the object under detection. This can be done simply by finding the brightest pixel, then hopping towards the closest edge in the positive x direction. Hopping stops when the current pixel is below a predetermined threshold. From this edge pixel, EBTA can be applied. In short, the direction that the tracing algorithm follows is determined by the current pixel's neighbors.
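
The seed-finding step can be sketched as follows; this illustrates only the procedure just described, not the EBTA tracer itself.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Seed search before boundary tracing: locate the brightest pixel of
// the filtered image, then hop in the +x direction until the intensity
// drops below the threshold; that position seeds the tracer. EBTA [5]
// then follows the contour by inspecting the current pixel's
// neighbours (not shown).
std::pair<int, int> find_edge_seed(const std::vector<uint8_t>& img,
                                   int width, uint8_t threshold)
{
    std::size_t best = 0;                          // brightest pixel
    for (std::size_t k = 1; k < img.size(); ++k)
        if (img[k] > img[best]) best = k;

    int x = static_cast<int>(best % width);
    const int y = static_cast<int>(best / width);
    while (x + 1 < width && img[y * width + x] >= threshold)
        ++x;                                       // hop towards the edge
    return { x, y };
}
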
For feature extraction, we calculate image moments [6] of the head and hand blobs (or ellipses) up to the second order. These can and will be calculated as part of the boundary extraction algorithm, as this speeds the process up and makes it more efficient. To track head movement, the angle between the long axis of the head ellipse and the Y axis, called the 'head angle', is calculated for each frame. To track hand movement, the centre of mass of the hand ellipse, called the 'hand centre', is calculated for each frame.
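
In software form, the moment computation and the derived orientation look roughly like the sketch below. It uses the standard central-moment orientation formula theta = 0.5 * atan2(2*mu11, mu20 - mu02), measured from the x axis, and reports its complement as the head angle; the sketch runs over a full binary mask, whereas the camera accumulates the same sums during boundary extraction.

#include <cmath>
#include <cstdint>
#include <vector>

// Blob centre and orientation from moments up to the second order.
struct BlobFeatures { double cx, cy, head_angle; };

BlobFeatures blob_features(const std::vector<uint8_t>& mask,
                           int width, int height)
{
    double m00 = 0, m10 = 0, m01 = 0, m20 = 0, m02 = 0, m11 = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (mask[y * width + x]) {
                m00 += 1.0;
                m10 += x;               m01 += y;
                m20 += double(x) * x;   m02 += double(y) * y;
                m11 += double(x) * y;
            }
    const double cx = m10 / m00, cy = m01 / m00;   // assumes a non-empty blob
    const double mu20 = m20 / m00 - cx * cx;       // central moments
    const double mu02 = m02 / m00 - cy * cy;
    const double mu11 = m11 / m00 - cx * cy;
    // Major-axis direction relative to the x axis; the head angle is
    // its complement, measured against the (vertical) y axis.
    const double theta_x = 0.5 * std::atan2(2.0 * mu11, mu20 - mu02);
    const double half_pi = std::asin(1.0);
    return { cx, cy, half_pi - theta_x };
}
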
For gesture classification, an HMM-based classifier will be implemented, similar to [2], for more complicated gestures such as drawing gestures. But for simple deictic gestures we use a set of trajectory- and time-based rules to classify head and hand gestures. For example, when the head starts moving and the 'head angle' increases toward the X axis within half a second or so, a head movement to the user's right-hand side is detected and classified. Similarly, when the hand starts moving and, within half a second or so, the Y value of the 'hand centre' increases significantly while its X value does not change significantly, a hand movement 'up' is detected and classified.
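
The 'hand up' rule, for instance, can be sketched as follows; the paper's exact displacement thresholds are not listed, so the numbers below are illustrative placeholders.

#include <cmath>
#include <string>

// Trajectory-and-time rule for the 'hand up' gesture described above:
// within roughly half a second the hand centre must rise significantly
// while barely moving sideways.
struct HandSample { double x, y, t; };   // hand centre (pixels), time (s)

std::string classify_hand(const HandSample& start, const HandSample& now)
{
    const double dt = now.t - start.t;
    const double dx = std::fabs(now.x - start.x);
    const double dy = now.y - start.y;   // Y increasing = hand moving up,
                                         // per the rule above
    if (dt <= 0.5 && dy > 40.0 && dx < 15.0)
        return "hand up";
    return "none";
}
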
Figure 6 shows all the functional modules required by the GestureCam and performed by the FPGA, which are currently being implemented. The 'camera control' module receives parameters from the host PC to adjust the behavior and performance of the image sensor and the other modules.

[Fig. 6. GestureCam functional modules as performed by the FPGA (Virtex-II Pro 2VP30): from the sensor, through color interpolation, skin detection, filtering, contour tracing, feature extraction and classification, to the PC via Ethernet; intermediate results go to the display unit via a VGA driver; camera control data arrives from the PC via RS232 and is passed to the sensor.]
4) Algorithms Development and Proof-of-Concept

Before implementation on the FPGA using VHDL, the processing flow shown in Figure 6 was first implemented on a Windows PC platform using Microsoft Visual C++ and the Intel IPP (Intel Performance Primitives) library. A Logitech webcam was used during this stage. The various algorithms were written in C++. The purpose of this stage was to develop and test a few different algorithms for the filtering and segmentation, to select those suitable for hardware implementation. However, some of these had to be changed when hardware implementation began, as the limitations and advantages of the underlying FPGA platform became apparent.

5) Algorithms Conversion and Implementation on FPGA

The color interpolation operation is straightforward and is implemented with very small pixel and line buffers. The skin color classification requires divisions, which cannot be easily performed by an FPGA. Therefore, we used the on-board multiplier primitives in conjunction with a lookup table of inverse values of the denominator to approximate division. That is, the numerator is multiplied by the inverse of the denominator. This results in fixed-point division. The skin color detection is carried out on-the-fly without consuming frame buffers.
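
A software model of this reciprocal-table scheme is sketched below; the table depth (sized for denominators up to R+G+B = 765) and the Q0.16 reciprocal precision are illustrative choices, not the exact parameters of our implementation.

#include <array>
#include <cstdint>

// Division via lookup table and multiplication: the numerator is
// multiplied by a precomputed fixed-point reciprocal of the
// denominator, so the hardware needs only a multiplier and a small ROM.
class FixedDiv {
public:
    FixedDiv() {
        recip_[0] = 0;                           // guard against /0
        for (uint32_t d = 1; d < recip_.size(); ++d)
            recip_[d] = (1u << 16) / d;          // 1/d in Q0.16
    }
    // Returns num/den scaled by 256 (Q8.8), e.g. a normalized color
    // component r = R/(R+G+B) as a value in [0, 256].
    uint32_t div_q8(uint32_t num, uint32_t den) const {
        return (num * recip_[den]) >> 8;         // multiply by 1/den
    }
private:
    std::array<uint32_t, 766> recip_{};          // 766 = 3*255 + 1
};
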
The 25x25 low pass filter is basically an FIR filter with a rectangular window in two dimensions. This means that there is a delay as samples are collected, and buffering is necessary. Averaging is done in two stages: first horizontally, then vertically. The horizontal sum is calculated using a FIFO buffer. The vertical sum is calculated on the fly. All the operations performed in hardware are capable of running, and do run, in parallel with each other as long as data is available. The above stages are connected in a pipeline, with signals synchronizing the output and input between the various stages.
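
The two-stage running-sum scheme can be modelled in software as follows. This is a sketch: in hardware the horizontal window lives in a FIFO, the vertical sums are accumulated on the fly from line buffers, and all stages run concurrently; here, border pixels are left unfiltered for brevity.

#include <cstdint>
#include <vector>

// Two-stage 25x25 averaging filter: a horizontal running sum per row,
// then a vertical running sum per column, dividing by 625 at the end.
void box_filter_25(const std::vector<uint8_t>& in, std::vector<uint8_t>& out,
                   int width, int height)
{
    constexpr int K = 25, R = K / 2;                 // window and radius
    std::vector<uint32_t> h(in.size(), 0);           // horizontal sums
    out.assign(in.size(), 0);

    for (int y = 0; y < height; ++y) {               // stage 1: rows
        uint32_t sum = 0;
        for (int x = 0; x < K; ++x) sum += in[y * width + x];
        for (int x = R; x + R < width; ++x) {
            h[y * width + x] = sum;
            if (x + R + 1 < width) {                 // slide the window
                sum += in[y * width + x + R + 1];
                sum -= in[y * width + x - R];
            }
        }
    }
    for (int x = R; x + R < width; ++x) {            // stage 2: columns
        uint32_t sum = 0;
        for (int y = 0; y < K; ++y) sum += h[y * width + x];
        for (int y = R; y + R < height; ++y) {
            out[y * width + x] = static_cast<uint8_t>(sum / (K * K));
            if (y + R + 1 < height) {
                sum += h[(y + R + 1) * width + x];
                sum -= h[(y - R) * width + x];
            }
        }
    }
}
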
The contour tracing, image moments and classification can be implemented either in VHDL or in C using the embedded PowerPC programming environment. However, the implementations for the two environments are very different. For example, for contour tracing, in VHDL a finite state machine architecture is required, while in the PowerPC environment a simple C program could be used.

Figure 7 shows screen captures of intermediate results. The picture on the left is the image after color interpolation; the middle one is after skin color classification; the right one is after low pass filtering, with noise greatly reduced.

[Fig. 7. Screenshots of images after color interpolation (left), skin classification (middle) and filtering (right).]

VI. APPLICATIONS OF THE GESTURECAM IN MHCI

GestureCam can significantly simplify the system-level design of an MHCI application that includes a gesture component as user input, because by itself the GestureCam can provide user gesture analysis and recognition without external resources. Some applications which we are working on are:
• A gesture-enabled web browser: an extension to the Firefox web browser which allows a user to control web navigation by making head and hand gestures;
• Gesture-enabled graph visualization, which allows a user to manipulate a large graph using hand gestures; and
• Gesture-enabled interactive map control, which allows a user to manipulate a map using hand gestures to obtain information about location-related services.

REFERENCES

[1] W. Wolf, B. Ozer, and T. Lv, "Smart Cameras as Embedded Systems," IEEE Computer, 35(9):48-53, Sep. 2002.
[2] V. Bonato, A. Sanches, M. Fernandes, J. Cardoso, E. Simoes, and E. Marques, "A Real Time Gesture Recognition System for Mobile Robots," in Int'l Conf. on Informatics in Control, Automation, and Robotics, Setúbal, Portugal, Aug. 25-28, 2004, pp. 207-214.
[3] A. Wilson and N. Oliver, "GWindows: Robust Stereo Vision for Gesture-Based Control of Windows," in Int'l Conf. on Multimodal Interaction, Vancouver, British Columbia, Canada, Nov. 5-7, 2003.
[4] G. Gomez and E. Morales, "Automatic feature construction and a simple rule induction algorithm for skin detection," in Proc. ICML Workshop on Machine Learning in Computer Vision, 2002, pp. 31-38.
[5] P. Rhee and C. La, "Boundary extraction of moving objects from image sequence," in Proc. IEEE TENCON 99, vol. 1, Sept. 15-17, 1999.
[6] W. Freeman, D. Anderson, P. Beardsley, C. Dodge, M. Roth, C. Weissman, and W. S. Yerazunis, "Computer Vision for Interactive Computer Graphics," IEEE Computer Graphics and Applications, May/June 1998, pp. 42-53.

Yu Shi (M'98) is a Senior Researcher with National ICT Australia in Sydney, Australia. He obtained his PhD in signal processing and biomedical engineering in 1992 in Toulouse, France. He also completed post-doctoral research at Oxford Brookes University in England in the late 1990s. His main research interests are in embedded vision systems, FPGA-based design and applications, multimodal user interfaces and web services.

Parnesh Raniga is a graduate of Software Engineering from the University of Sydney, Australia. He is currently pursuing a PhD at the university in conjunction with the CSIRO. Parnesh's research interests include image processing, computer graphics, computer architecture and FPGAs. He also enjoys cricket, cycling and flying, and is currently pursuing his private pilot's licence.

Ismail Mohamed is an Electrical Engineering student at the University of Queensland, Australia, currently in his honours year. He has received the Dean's Commendation for High Achievement in every semester of study since he began in 2003. Ismail's main areas of interest are FPGAs, signal and image processing, and digital and analog electronics. He enjoys playing soccer and cricket.
