Computer Vision Application
Object detection is a computer technology related to computer
vision and image processing that deals with detecting instances of
semantic objects of a certain class (such as humans, buildings, or
cars) in digital images and videos.[1] Well-researched domains of
object detection include face detection and pedestrian detection.
Object detection has applications in many areas of computer
vision, including image retrieval and video surveillance.
Due to object detection’s close relationship with video analysis and
image understanding, it has attracted much research attention in
recent years. Traditional object detection methods are built on
handcrafted features and shallow trainable architectures. Their
performance easily stagnates, even when complex ensembles are built
that combine multiple low-level image features with high-level
context from object detectors and scene classifiers. With the rapid
development of deep learning, more powerful tools that can learn
semantic, high-level, deeper features have been introduced to
address the problems of traditional architectures. These
models differ in network architecture, training strategy,
and optimization function. In this paper, we provide a review of
deep learning-based object detection frameworks. Our review begins
with a brief introduction to the history of deep learning and its
representative tool, the Convolutional Neural Network (CNN).
Then we focus on typical generic object detection architectures
along with some modifications and useful tricks to improve detection
performance further. As distinct specific detection tasks exhibit
different characteristics, we also briefly survey several specific tasks,
including salient object detection, face detection and pedestrian
detection. Experimental analyses are also provided to compare
various methods and draw some meaningful conclusions. Finally,
several promising directions and tasks are provided to serve as
guidelines for future work in both object detection and relevant
neural network based learning systems.
Object detection aims to determine where
objects are located in a given image (object localization) and which
category each object belongs to (object classification). Accordingly, the
pipeline of traditional object detection models can be divided
into three stages: informative region selection, feature extraction,
and classification.
Informative region selection. Because different objects
may appear at any position in the image and have different aspect
ratios or sizes, it is natural to scan the whole image with a
multi-scale sliding window. Although this exhaustive strategy can
find all possible object positions, its shortcomings are
also obvious. Because of the large number of candidate windows, it is
computationally expensive and produces many redundant
windows. However, if only a fixed number of sliding window
templates is applied, unsatisfactory regions may be produced.
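The exhaustive scan described above can be illustrated with a minimal sketch. The function name `sliding_windows`, the 64x128 base window, the three scales, and the stride of 8 pixels are all illustrative assumptions, not values taken from the surveyed methods.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_windows(
    image_w: int,
    image_h: int,
    base_size: Tuple[int, int] = (64, 128),  # assumed template size
    scales: Tuple[float, ...] = (1.0, 1.5, 2.0),  # assumed scale set
    stride: int = 8,  # assumed step in pixels
) -> Iterator[Box]:
    """Yield every window of every scale that fits inside the image."""
    base_w, base_h = base_size
    for scale in scales:
        win_w = int(round(base_w * scale))
        win_h = int(round(base_h * scale))
        if win_w > image_w or win_h > image_h:
            continue  # this scale does not fit in the image at all
        for y in range(0, image_h - win_h + 1, stride):
            for x in range(0, image_w - win_w + 1, stride):
                yield (x, y, win_w, win_h)


if __name__ == "__main__":
    candidates = list(sliding_windows(640, 480))
    print(f"{len(candidates)} candidate windows for a 640x480 image")
```

Even with a single aspect ratio and only three scales, a 640x480 image yields several thousand candidate windows, each of which must be passed through feature extraction and classification. Neighbouring windows overlap heavily, which is the redundancy noted above.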
Feature extraction. To recognize different objects, we need to
extract visual features that provide a semantic and robust
representation. SIFT [19], HOG [20] and Haar-like [21] features are
representative examples, because these features
can produce representations associated with complex cells in the human
brain [19]. However, due to the diversity of appearances,
illumination conditions and backgrounds, it is difficult to manually
design a robust feature descriptor that perfectly describes all kinds of
objects.
Classification. In addition, a classifier is needed to distinguish a
target object from all other categories and to make the
representations more hierarchical, semantic and informative for
visual recognition. Usually, the Support Vector Machine (SVM)
[22], AdaBoost [23] and Deformable Part-based Model (DPM) [24]
are good choices. Among these classifiers, the DPM is a flexible
model that combines object parts with a deformation cost to handle
severe deformations. In DPM, carefully designed low-level features
and kinematically inspired part decompositions are combined with
the aid of a graphical model. Discriminative learning of
graphical models then allows high-precision part-based
models to be built for a variety of object classes.
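The feature extraction and classification stages described above can be sketched as follows. This is a rough illustration under stated assumptions: `orientation_histogram` is a hypothetical, heavily simplified stand-in for a handcrafted descriptor in the spirit of HOG [20] (a single histogram over the whole patch, with no cells or block normalization), and `linear_score` uses hand-set weights where a real system would learn them with a classifier such as an SVM [22].

```python
import math
from typing import List, Sequence


def orientation_histogram(
    patch: Sequence[Sequence[float]], bins: int = 9
) -> List[float]:
    """Simplified HOG-style descriptor (an assumption, not the method of [20]):
    one unsigned gradient-orientation histogram for the whole patch,
    weighted by gradient magnitude and L2-normalized."""
    height, width = len(patch), len(patch[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]  # central differences
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
            hist[int(angle / bin_width) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0  # avoid division by 0
    return [v / norm for v in hist]


def linear_score(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Linear decision function w.x + b; a positive score means 'target'."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Two toy 4x4 patches: a vertical edge and a horizontal edge.
vertical_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
horizontal_edge = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]

# Hand-set weights (illustrative only) that respond to horizontal gradients.
weights = [1.0] + [0.0] * 8
bias = -0.5

if __name__ == "__main__":
    for name, patch in [("vertical", vertical_edge),
                        ("horizontal", horizontal_edge)]:
        score = linear_score(orientation_histogram(patch), weights, bias)
        print(name, "edge ->", "target" if score > 0 else "background")
```

In a full traditional pipeline, each candidate region would be described in this way and scored by the classifier. The sketch also hints at the limitation noted above: a descriptor and weights tuned for one kind of appearance do not transfer to objects that look different.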