
A Fast and Accurate Unconstrained Face Detector

The document presents a novel method for unconstrained face detection that addresses challenges such as pose variations and occlusions using a new feature called Normalized Pixel Difference (NPD). This feature is scale invariant and allows for efficient detection through a single cascade classifier, achieving high speeds and improved accuracy over existing methods. Experimental results demonstrate the effectiveness of the proposed approach on multiple public datasets, outperforming state-of-the-art techniques in detecting faces in complex scenarios.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/264558859

A Fast and Accurate Unconstrained Face Detector

Article in IEEE Transactions on Pattern Analysis and Machine Intelligence · August 2014
DOI: 10.1109/TPAMI.2015.2448075 · Source: arXiv


All content following this page was uploaded by Stan Z Li on 25 August 2014.


A Fast and Accurate Unconstrained Face Detector
Shengcai Liao, Member, IEEE, Anil K. Jain, Fellow, IEEE, and Stan Z. Li, Fellow, IEEE

Abstract—We propose a method to address challenges in unconstrained face detection, such as arbitrary pose variations and
occlusions. First, a new image feature called Normalized Pixel Difference (NPD) is proposed. The NPD feature is computed as the
difference to sum ratio between two pixel values, inspired by the Weber Fraction in experimental psychology. The new feature is
scale invariant, bounded, and is able to reconstruct the original image. Second, we learn the optimal subset of NPD features and
their combinations via regression trees, so that complex face manifolds can be partitioned by the learned rules. This way, only
a single cascade classifier is needed to handle unconstrained face detection. Furthermore, we show that the NPD features can
be efficiently obtained from a look up table, and the detection template can be easily scaled, making the proposed face detector
very fast (about 178 FPS for 640x480 resolution videos and 30 FPS for 1920x1080 resolution videos on a desktop PC, about
6 times faster than OpenCV). Experimental results on three public face datasets (FDDB, GENKI, and CMU-MIT) show that the
proposed method outperforms the state-of-the-art methods in detecting unconstrained faces with arbitrary pose variations and
occlusions in cluttered scenes.

arXiv:1408.1656v1 [cs.CV] 6 Aug 2014

Index Terms—Unconstrained face detection, normalized pixel difference, regression tree, AdaBoost, cascade classifier

• Shengcai Liao and Stan Z. Li are with the National Laboratory of Pattern Recognition and the Center for Biometrics and Security Research, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. E-mail: {scliao,szli}@nlpr.ia.ac.cn
• Anil K. Jain is with the Dept. of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA. He is also affiliated with the Dept. of Brain & Cognitive Engineering, Korea University, Anamdong, Seongbukgu, Seoul 136-713, Republic of Korea. E-mail: [email protected]

1 INTRODUCTION

The objective of face detection is to find and locate faces in an image. It is the first step in automatic face recognition applications. Face detection has been well studied for frontal and near frontal faces. The Viola and Jones’ face detector [1] is the most well known face detection algorithm, which is based on Haar-like features and cascade AdaBoost [2] classifier. However, in unconstrained scenes such as faces in a crowd, state-of-the-art face detectors fail to perform well due to large pose variations, illumination variations, occlusions, expression variations, out-of-focus blur, and low image resolution. For example, the Viola-Jones face detector fails to detect most of the face images in the Face Detection Data set and Benchmark (FDDB) database [3] (examples shown in Fig. 1) due to the difficulties mentioned above. In this paper, we refer to face detection with arbitrary facial variations as the unconstrained face detection problem. We are interested in face detection in unconstrained scenarios such as video surveillance or images captured by hand-held devices.

Fig. 1. Face images annotated (red ellipses) in the FDDB database [3].

Numerous face detection methods have been developed following Viola and Jones’ work [1], mainly focusing on extracting different types of features and developing different cascade structures. A variety of complex features [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] have been proposed to replace the Haar-like features used in [1]. While these methods can improve the face detection performance to some extent, they generate a very large number (hundreds of thousands) of features and the resulting systems take too much time to train. Another development in face detection has been to learn different cascade structures for multiview face detection, such as parallel cascade [14], pyramid architecture [15], and Width-First-Search (WFS) tree [16]. All these methods need to learn one cascade classifier for each specific facial view (or view range). In unconstrained scenarios, however, it is not easy to define all possible views of a face, and the computational cost increases with an increasing number of classifiers in complex cascade
structure. Moreover, these approaches require manual labeling of face pose in each training image.

While some of the available methods [14], [15], [16] can handle multiview faces, they are not able to simultaneously consider other challenges such as occlusion. In fact, since these methods require partitioning multiview data into known poses, occlusion is not easy to handle in this way. On the other hand, while several studies addressed face detection under occlusion [17], [18], [19], [20], [21], they constrained themselves to detect only frontal faces under occlusion. As discussed in [22], a robust face detection algorithm should be effective under arbitrary variations in pose and occlusion, which remains an unresolved challenging problem.

In this paper, we are interested in developing effective features and robust classifiers for unconstrained face detection with arbitrary facial variation. First, we propose a simple pixel-level feature, called the Normalized Pixel Difference (NPD). An NPD is computed as the ratio of the difference between any two pixel intensity values to the sum of their values, in the same form as the Weber Fraction in experimental psychology [23]. The NPD feature has several desirable properties, such as scale invariance, boundedness, and ability to reconstruct the original image. We further show that NPD features can be obtained from a look up table, and the resulting face detection template can be easily scaled for multiscale face detection.

Secondly, we develop a method to construct a single cascade AdaBoost classifier that can effectively deal with complex face manifolds and handle arbitrary pose and occlusion conditions. While the individual NPD feature may have “weak” discriminative ability, our work indicates that a subset of NPD features can be optimally learned and combined to construct more discriminative features in a regression tree. In this way, different types of faces can be automatically divided into different leaves of a regression tree, and the complex face manifold in high dimensional space can be partitioned in the learning process. This is a “divide and conquer” strategy to tackle unconstrained face detection in a single classifier, without pre-labeling of views in the training set of face images. The resulting face detector is robust to variations in pose, occlusion, and illumination, as well as to blur and low image resolution.

The novelty of this work is summarized as follows:
• A new type of feature, called NPD is proposed, which is efficient to compute and has several desirable properties, including scale invariance, boundedness, and enabling reconstruction of the original image.
• A subset of NPD features is automatically learned and combined in regression trees to boost their discriminability. In this way, only a single cascade AdaBoost classifier is needed to handle unconstrained faces with occlusions and arbitrary viewpoints, without pose labeling or clustering in the training stage.

The advantages of the proposed approach include:
• The NPD feature evaluation is extremely fast, requiring a single memory access using a look up table.
• Multiscale face detection can be easily achieved by applying pre-scaled detection templates.
• The unconstrained face detector does not depend on pose specific cascade structure design; pose labeling or clustering in the training stage is also not required.
• The face detector is able to handle illumination variations, pose variations, occlusions, out-of-focus blur, and low resolution face images in unconstrained scenarios.

The remainder of this paper is organized as follows. In Section 2 we review the related work. In Section 3 we introduce the NPD feature space. The proposed NPD based face detection method is presented in Section 4. Experimental results are provided in Section 5. Finally, we summarize the contributions in Section 6.

2 RELATED WORK

As indicated in a survey of face detection methods [24], the most popular face detection methods are appearance based, which use local feature representation and classifier learning. Viola and Jones’ face detector [1] was the first one to apply rectangular Haar-like features in a cascaded AdaBoost classifier for real-time face detection. Many approaches have been proposed around the Viola-Jones detector to advance the state of the art in face detection. Lienhart and Maydt [4] proposed an extended set of Haar-like features, where 45° rotated rectangular features were introduced. Li et al. [5] proposed another extension of Haar-like features, where the rectangles can be spatially set apart with a flexible distance. A similar feature, called the diagonal filter was also proposed by Jones and Viola [6]. Various other local texture features have been introduced for face detection, such as the modified census transform [7], local binary pattern (LBP) [8], MB-LBP [11], LBP histogram [10], and the locally assembled binary feature [12]. These features have been shown to be robust to illumination variations. Mita et al. [9] proposed the joint Haar-like features to capture the co-occurrence of effective Haar-like features. Huang et al. [16] proposed a sparse feature set in a granular space, where granules were represented by rectangles, and each individual sparse feature was learned as a combination of granules. A problem with the approaches in [9] and [16] is that the joint feature space is very large, making the optimal combination a difficult task.

While more sophisticated features may provide better discrimination power than Haar-like features for the face detection task, they generally increase the
computational cost. In contrast, ordinal relationships among image regions are simple yet effective image features [25], [26], [27], [28], [29], [30]. Sinha [25] studied several robust ordinal relationships in face images and developed a face detection method accordingly. Liao et al. [28] further showed that ordinal features can be effectively learned by AdaBoost classifier for face recognition. Sadr et al. [26] showed that pixelwise ordinal features (ordinal relationship between any two pixels) can faithfully encode image structures. Baluja et al. [27] showed that simple pixelwise ordinal features are good enough for discriminating between five facial orientations, a relatively simpler task than face detection. Wang et al. [30] applied the random forest classifier together with pixelwise ordinal features for facial landmark localization. Abramson and Steux [29] proposed a pixel control point based feature for face detection, where each feature is associated with two sets of pixel locations (control points). However, it is not easy to learn control point based features because of the huge number of control point combinations.

Besides different feature representations, some researchers have also tried different AdaBoost algorithms and weak classifiers. For weak classifiers utilized in boosting, Lienhart et al. [31] and Brubaker et al. [32] have shown that classification and regression trees (CART) [33] work better than simple decision stumps. In this paper, we show that the optimal ordinal features and their combinations can be learned by integrating the proposed NPD features in a regression tree. In this way, unconstrained face variations can be automatically partitioned into different leaves of the learned regression tree.

Given that the original Viola-Jones face detector has limitations for multiview face detection [24], various cascade structures have been proposed to tackle multiview face detection [6], [14], [15], [16]. Jones and Viola [6] extended their face detector by training one face detector for each specific pose. To avoid evaluating all face detectors on each scanning subwindow, they developed a pose estimation step (similar to Rowley et al. [34]) before face detection, and then only the face detector trained on that estimated pose was applied. In this two-stage detection structure, if the pose estimation is not reliable, the face is not likely to be detected in the second stage. Wu et al. [14] proposed a parallel cascade structure for multiview face detection, where all face detectors tuned to different views have to be evaluated for each scanning window; they did use the first few cascade layers of all face detectors to estimate the pose for speedup. Li and Zhang [15] proposed a coarse-to-fine pyramid architecture for multiview face detection, where the entire range of face poses was divided into increasingly smaller subranges, resulting in a more efficient detection structure. Huang et al. proposed a WFS tree based multiview face detection approach, which also works in a coarse-to-fine manner. They proposed the Vector Boost algorithm for multiclass learning, which is well suited for multiview pose estimation. However, all these methods need to learn a cascade classifier for each specific view (or view range) of a face, requiring an input face image to go through different branches of the detection structure. Hence, their computational cost generally increases with the number of classifiers in complex cascade structures. Moreover, these approaches require manual labeling of the face pose in each training image.

Instead of designing a detection structure, Lin and Liu [19] proposed to learn the multiview face detector as a single cascade classifier. They derived a multiclass boosting algorithm, called MBHBoost by sharing features among different classes. This is a simpler approach to multiview face detection than designing complex cascade structures. Nevertheless, it still requires manual labeling of poses. In uncontrolled environments, however, it is not easy to define specific views of a face by discretizing the pose space, because a face could be in arbitrary pose simultaneously in yaw (out-of-plane), roll (in-plane), and pitch (up-and-down) angles. To avoid manual labeling, Seemann et al. [35] suggested learning viewpoint clusters automatically for object detection. However, for human faces, Kim and Cipolla [36] showed that clustering by traditional techniques like K-Means does not result in categorized poses. They hence proposed a multi-classifier boosting (MCBoost) for human perceptual clustering of object images, which showed promise for clustering face poses. However, the clusters are not always related to pose variations; in addition to different pose clusters, they also obtained clusters with various illumination variations.

Face detection in presence of occlusion is also an important issue in unconstrained face detection, but it has received less attention compared to multiview face detection. This is probably because, compared to pose variations, it is more difficult to categorize arbitrary occlusions into predefined classes. Hotta [17] proposed a local kernel based SVM method for face detection, which was better than global kernel based SVM in detecting occluded frontal faces. Lin et al. [18] considered 8 kinds of manually defined facial occlusions by training 8 additional cascade classifiers besides the standard face detector. Lin and Liu [19] further proposed the MBHBoost algorithm to handle faces with one of 12 in-plane rotations or one of 8 types of occlusions, with each kind of rotation and occlusion treated as a different class. Chen et al. [20] proposed a modified Viola-Jones face detector, where the trained detector was divided into sub-classifiers related to several predefined local patches, and the outputs of sub-classifiers were fused. Goldmann et al. [21] proposed a component-based approach for face detection, where the two eyes, nose, and mouth were detected separately, and further connected in a topology graph. However, none of the above methods considered face detection with both occlusions and pose variations simultaneously in unconstrained scenarios. As discussed in [22], a robust face detector should be effective under arbitrary variations in pose and occlusion, which has not yet been solved.

Recently, unconstrained face detection has gained attention. Jain and Learned-Miller [3] developed the FDDB database and benchmark for the development of unconstrained face detection algorithms. This database contains images collected from the Internet, and presents challenging scenarios for face detection. Subburaman and Marcel [37] proposed a fast bounding box estimation technique for face detection, where the bounding box is predicted by small patch based local search. Jain and Learned-Miller [38] proposed an online domain adaption approach to improve the performance of the Viola-Jones face detector on the FDDB database. Li et al. [13] proposed the use of SURF feature [39] in an AdaBoost cascade, and area under the curve (AUC) criterion to speed up the face detector training. Zhu and Ramanan [40] proposed to jointly detect face, estimate pose, and localize face landmarks in the wild. Shen et al. [41] proposed an exemplar-based face detection approach, which retrieves images from a large annotated face dataset; facial landmark locations are inferred from the annotations. Li et al. [42] proposed a probabilistic elastic part (PEP) model to adapt any pre-trained face detector to a specific image collection like FDDB. This method extracts the PEP representation for each candidate face detected by a general face detector, and trains a classifier with the top positive and negative samples. Despite the availability of these methods for unconstrained face detection, the detection accuracy is still not satisfactory, especially when the detector is required to have low false alarms.

3 NORMALIZED PIXEL DIFFERENCE FEATURE SPACE

The Normalized Pixel Difference (NPD) feature between two pixels in an image is defined as

$$f(x, y) = \frac{x - y}{x + y}, \tag{1}$$

where $x, y \geq 0$ are intensity values of the two pixels,¹ and $f(0, 0)$ is defined as 0 when $x = y = 0$.

1. For ease of representation, sometimes we also denote x and y as pixels instead of pixel values. We use subscripts to differentiate between pixel and pixel values only when pixel locations are under discussion.

The NPD feature measures the relative difference between two pixel values. The sign of $f(x, y)$ indicates the ordinal relationship between the two pixels $x$ and $y$, and the magnitude of $f(x, y)$ measures the relative difference (as a percentage of the joint intensity $x + y$) between $x$ and $y$. Note that the definition $f(0, 0) \triangleq 0$ is reasonable because, in this case, there is no difference between the two pixels $x$ and $y$. Compared to the absolute difference $|x - y|$, NPD is invariant to scale change of the pixel intensities.

Weber, a pioneer in experimental psychology, stated that the just-noticeable difference in the magnitude change of a stimulus is proportional to the magnitude of the stimulus, rather than its absolute value [23]. This is known as the Weber’s Law. In other words, the human perception of difference in stimulus is often measured as a fraction of the original stimulus, that is, in a form $\Delta I / I$, which is called the Weber Fraction. Chen et al. [43] proposed a local image descriptor, called Weber’s Law Descriptor for face recognition, which was computed from Weber Fractions of pixels in a 3 × 3 window. The proposed feature in Eq. (1) has also been used in other fields such as remote sensing, where the Normalized Difference Vegetation Index (NDVI) [44] is defined as the difference to sum ratio between the visible red and the near infrared spectra to estimate the green vegetation coverage.

The NPD feature has a number of desirable properties. First, the NPD feature is antisymmetric, so either $f(x, y)$ or $f(y, x)$ is adequate for feature representation, resulting in a reduced feature space. Therefore, in an $s \times s$ image patch (vectorized as $p \times 1$, where $p = s \cdot s$), NPD feature $f(x_i, x_j)$ for pixel pairs $1 \leq i < j \leq p$ is computed, resulting in $d = p(p - 1)/2$ features. For example, in a 20 × 20 face template, there are (20 × 20) × (20 × 20 − 1)/2 = 79,800 NPD features in total. We call the resulting feature space the NPD feature space, denoted as $\Omega_{npd}$ ($\in \mathbb{R}^d$).

Second, the sign of $f(x, y)$ is an indicator of the ordinal relationship between $x$ and $y$. Ordinal relationship has been shown to be an effective encoding for object detection and recognition [25], [26], [28] because ordinal relationship encodes the intrinsic structure of an object image and it is invariant under various illumination changes [25]. However, simply using the sign to encode the ordinal relationship is likely to be sensitive to noise when $x$ and $y$ have similar values. In the next section we will show how to learn robust ordinal relationships with NPD features.

Third, the NPD feature is scale invariant, which is expected to be robust against illumination changes. This is important for image representation, since illumination change is always a troublesome issue for both object detection and recognition.

Fourth, as shown in Appendix A, the NPD feature $f(x, y)$ is bounded in $[-1, 1]$. The bounded property makes the NPD feature amenable to histogram binning or threshold learning in tree-based classifiers [1]. Fig. 2 shows that $f(x, y)$ is a bounded function and it defines a nonlinear surface.

Theorem 1 (Reconstruction): Given the NPD feature vector $\mathbf{f} = (f(x_1, x_2), f(x_1, x_3), \ldots, f(x_{p-1}, x_p))^T \in \Omega_{npd}$, the original image intensity values $I = (x_1, x_2, \ldots, x_p)^T$ can be reconstructed up to a scale factor.
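The definition in Eq. (1) and the properties listed above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names (`npd`, `npd_feature_vector`, `reconstruct_up_to_scale`) are hypothetical, and the reconstruction shown is simply Eq. (1) solved for $x_j$ under the added assumption that all pixels are strictly positive, which is not necessarily the procedure of Appendix B.

```python
import numpy as np

def npd(x, y):
    """Normalized Pixel Difference of Eq. (1), with f(0, 0) defined as 0."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    total = x + y
    # Where x + y == 0 the numerator is also 0, so dividing by 1 returns 0.
    return (x - y) / np.where(total == 0, 1.0, total)

def npd_feature_vector(patch):
    """All d = p(p - 1)/2 features f(x_i, x_j), i < j, of a vectorized patch."""
    v = np.asarray(patch, dtype=np.float64).ravel()
    i, j = np.triu_indices(v.size, k=1)  # pairs (i, j) with i < j, row by row
    return npd(v[i], v[j])

def reconstruct_up_to_scale(features, p):
    """Recover intensities relative to x_1 = 1 from f(x_1, x_j), j = 2..p.

    Solving f = (x_1 - x_j) / (x_1 + x_j) for x_j gives
    x_j = x_1 * (1 - f) / (1 + f). Assumes x_1 > 0 (so f != -1).
    """
    f = np.asarray(features, dtype=np.float64)[: p - 1]
    return np.concatenate(([1.0], (1.0 - f) / (1.0 + f)))

# Illustrative 20 x 20 patch with strictly positive intensities.
rng = np.random.default_rng(0)
patch = rng.integers(1, 256, size=(20, 20))
feats = npd_feature_vector(patch)
relative = reconstruct_up_to_scale(feats, patch.size)
```

For the 20 × 20 patch, `feats` has 79,800 entries, all within $[-1, 1]$; multiplying `patch` by a positive constant leaves `feats` unchanged, and `relative * patch.ravel()[0]` matches the original intensities.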
Fig. 2. A plot of the NPD function $f(x, y)$.

The proof of Theorem 1 is shown in Appendix B, which also gives a linear-time approach to reconstruct the original image up to a scale factor. Theorem 1 states that each point in the feature space $\Omega_{npd}$ corresponds to a group of intensity-scaled images in the original pixel intensity space. In contrast, the scale invariance property says that all intensity-scaled images are “compressed” to a point in the bounded feature space $\Omega_{npd}$. Therefore, $\Omega_{npd}$ is a feature space which is invariant to scale variations, but it carries all the necessary information from the original space.

4 NPD FOR FACE DETECTION

4.1 Learning Object Structures

Ordinal relationship [25] is a well-known simple and basic concept: it compares the brightness of any two image regions, and encodes the result with 1 (brighter) or 0 (darker) accordingly. Sinha [25] showed that ordinal features can represent the intrinsic structure of objects such as a human face, and they are insensitive to illumination changes. Instead of encoding ordinal relationship between two image regions, in this paper, we learn robust ordinal relationships between pairs of pixels via the NPD feature. For a face pattern which is well structured, automatically learned combinations of ordinal features may represent a face better than manual configurations. Therefore, we propose to learn a combination of simple ordinal features by boosted regression trees [33]. By providing a training set of face and nonface images, a weak classifier is learned by a regression tree. At each node, the tree checks the optimal ordinal feature value, and then passes the input data to the next branch accordingly. See Fig. 3. Regression tree is also well suited for face detection with arbitrary pose variations, since similar views can be clustered in the same leaf node of the tree.

Fig. 3. Learning and combining ordinal features in a regression tree. Left: four pixelwise ordinal features are automatically selected in the learning process. Right: the four features are optimally combined in a regression tree for face/nonface prediction.

Ordinal relationship can always be generated by the default threshold of 0, but it will be sensitive to noise especially when the two pixels to be compared have similar values. In this paper, we learn robust ordinal relationships and their combinations by learning regression trees with NPD features. In this way, regression trees not only learn optimal NPD features at each branch node, but also learn optimal thresholds for splitting. Generally, one of the following two cases is learned for each NPD feature at a branch node:

$$f(x, y) = \frac{x - y}{x + y} < \theta_1 < 0, \tag{2}$$

$$f(x, y) = \frac{x - y}{x + y} \geq \theta_2 > 0, \tag{3}$$

where $\theta_1$ and $\theta_2$ are the thresholds. Eq. (2) applies if the object pixel $x$ is notably darker than pixel $y$, while Eq. (3) covers the case when pixel $x$ is notably brighter than pixel $y$. The learned thresholds allow the ordinal encodings in the learned regression trees to represent the intrinsic object structure. To learn such regression trees, we use the CART algorithm [33] with the NPD features.

4.2 Face Detector

Given that the proposed NPD features contain redundant information, we also apply the AdaBoost algorithm to select the most discriminative features and construct strong classifiers [1]. We adopt the Gentle AdaBoost algorithm [2] to learn the NPD feature based regression trees.

As in [1], a cascade classifier is further learned for rapid face detection. We only learn one single cascade classifier for unconstrained face detection robust to occlusions and pose variations. This implementation has the advantage that there is no need to label the pose of each face image manually or cluster the poses before training the detector. In the learning process, the algorithm automatically divides the whole face manifold into several sub-manifolds by regression trees.

Below is a summary of how the proposed method handles the unconstrained face detection problem.
• Pose. Pose variations are handled by learning NPD features in boosted regression trees, where different views can be automatically partitioned into different leaves of the regression trees.
• Occlusion. In contrast to Haar-like features that are sensitive to occlusions because of large support [18], NPD features are computed by only two pixel values, making them robust to occlusion.
• Illumination. Since NPD features are scale invariant, they are robust to illumination changes.
• Blur or low image resolution. Because the NPD features involve only two pixel values, they do not require rich texture information on the face. This makes NPD features effective in handling blurred or low resolution face images.
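The branch tests of Eqs. (2) and (3) and the single-cascade structure described above can be illustrated with a small evaluation sketch. Everything named here (`Node`, `tree_score`, `cascade_accepts`, the toy thresholds and leaf scores) is hypothetical and chosen for illustration. The tabulated NPD values assume 8-bit grayscale input, and the stage test (sum of tree outputs compared with a per-stage threshold, Viola-Jones style) is an assumption rather than a detail confirmed by the text.

```python
import numpy as np

# 256 x 256 table of NPD values for 8-bit intensities; entry [0, 0] is 0.
_levels = np.arange(256, dtype=np.float64)
_sum = _levels[:, None] + _levels[None, :]
NPD_LUT = (_levels[:, None] - _levels[None, :]) / np.where(_sum == 0, 1.0, _sum)

class Node:
    """Branch node: compares f(pixels[i], pixels[j]) with a learned threshold."""
    def __init__(self, i, j, theta, left, right):
        self.i, self.j, self.theta = i, j, theta
        self.left, self.right = left, right  # each a Node or a float leaf score

def tree_score(tree, pixels):
    """Walk one regression tree over a flattened uint8 patch."""
    node = tree
    while isinstance(node, Node):
        f = NPD_LUT[pixels[node.i], pixels[node.j]]  # one table lookup per test
        node = node.left if f < node.theta else node.right
    return node

def cascade_accepts(stages, pixels):
    """stages: list of (trees, stage_threshold); reject at the first failed stage."""
    for trees, stage_threshold in stages:
        if sum(tree_score(t, pixels) for t in trees) < stage_threshold:
            return False
    return True

# Toy tree: a "notably darker" test (theta < 0) followed by a
# "notably brighter" test (theta > 0). All values are made up.
toy_tree = Node(0, 1, -0.2, left=1.0,
                right=Node(2, 3, 0.3, left=-1.0, right=0.5))
dark_pair = np.array([10, 200, 120, 40], dtype=np.uint8)
flat_pair = np.array([100, 100, 50, 50], dtype=np.uint8)
toy_stages = [([toy_tree], 0.0)]
```

With these toy values, `dark_pair` reaches the leaf scored 1.0 and passes the single stage, while `flat_pair` has NPD values of 0 at both nodes, ends in the leaf scored -1.0, and is rejected. Using a nonzero threshold at each node is what keeps near-equal pixel pairs from being read as a meaningful ordinal relationship.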
Fig. 4. Example face (left) and nonface (right) images
from [13] for face detector training.
4.3 Speed Up
To further speed up the proposed NPD face detector,
we develop the following two techniques. First, for
8-bit gray images, we build a 256 × 256 look up
table to store pre-computed NPD features. This way,
computing f (x, y) in Eq. 1 only requires one memory
access from the look up table. Tree 1 Tree 2 Tree 3
Second, the learned face detection template (e.g. Fig. 5. The learned NPD features by boosting regres-
20 × 20 used in this paper) can be easily scaled to sion trees in the first stage of the cascade.
enable multiscale face detection. So, we pre-compute
multiscale detection templates and apply them to
detect faces at various scales. This way, iterative re- features in the three regression trees are distributed
scaling of images for multiscale detection is avoided.

5 EXPERIMENTS

We evaluate the performance of the NPD face detector on three public-domain databases, FDDB [3], GENKI [45], and CMU-MIT [34]. We also provide an analysis of the proposed method, report the face detection speed, and report unconstrained face detection performance under illumination variations, pose variations, occlusion, and blur.

5.1 Implementation of NPD Face Detector

A subset of the training data² in [13] was used to train our detector, including 12,102 face images and 12,315 nonface images (some private face images and the Corel5k nonface images were not available, so they could not be used). Fig. 4 shows some example face and nonface images from this training dataset. The detection template is 20 × 20 pixels. The detector cascade contains 15 stages, and for each stage, the target false accept rate was 0.5, with a detection rate of 0.999. For the depth of the regression tree, we set a constraint that each leaf node must contain at least (1/16)th of the total number of training samples. Under this constraint, the tree depth is at most 5, and in the test phase at most 4 NPD features need to be computed for each regression tree. The first five stages of our detector include 3, 4, 6, 7, and 9 weak classifiers, respectively. Fig. 5 shows the NPD features learned in the three regression trees in the first stage. It can be observed that most of the learned features are around eyes, eyebrows, and nose. In addition, the [...] in different parts of the facial region. This is because, in the boosting scheme, all samples are reweighted when a weak classifier is learned, so that the next weak classifier can focus on the training samples that cannot be correctly classified in the current step. The face shown in Fig. 5 is a frontal face, but it should be kept in mind that the face can have arbitrary pose variations, and some learned features may be effective only for a specific pose.

In the test stage, a scale factor of 1.2 was set for multiscale detection. A postprocessing method similar to the OpenCV face detection module was implemented, which merges nearby detections by the disjoint set algorithm. For each detected face, we summarized the scores of AdaBoost classifiers in all stages of the cascade to be the final score; this score was used to generate the Receiver Operating Characteristic (ROC). We used three public face databases, FDDB [3], GENKI [45], and CMU-MIT [34], to evaluate our face detection algorithm.

5.2 Evaluation on FDDB Database

The FDDB dataset [3] covers challenging scenarios for face detection. Images in FDDB come from the Faces in the Wild dataset [46], which is a large collection of Internet images collected from Yahoo News. It contains 2,845 images with a total of 5,171 faces, with a wide range of challenging scenarios including arbitrary pose, occlusions, different lightings, expressions, low resolutions, and out-of-focus faces. All faces in the database have been annotated with elliptical regions. Fig. 1 shows some examples of the annotated faces from the FDDB database.
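To make the cascade and multiscale settings of Section 5.1 concrete, the following minimal Python sketch works through the arithmetic. It assumes that the cascade stages act independently, so that per-stage rates multiply, and that template side lengths are rounded to the nearest pixel; both are simplifying assumptions, and the function names `cascade_rates` and `template_sizes` are hypothetical, not part of the authors' implementation.

```python
def cascade_rates(num_stages, stage_far, stage_dr):
    """Overall false accept rate and detection rate of a cascade.

    Assumes (for illustration only) that stages act independently,
    so the per-stage rates simply multiply.
    """
    return stage_far ** num_stages, stage_dr ** num_stages


def template_sizes(base, scale, max_size):
    """Side lengths base * scale**k, rounded, not exceeding max_size.

    The rounding rule is an assumption of this sketch.
    """
    sizes = []
    k = 0
    while round(base * scale ** k) <= max_size:
        sizes.append(round(base * scale ** k))
        k += 1
    return sizes


# Settings quoted in Section 5.1: 15 stages, per-stage false accept
# rate 0.5, per-stage detection rate 0.999, 20 x 20 template,
# scale factor 1.2. The upper bound of 100 pixels is arbitrary.
overall_far, overall_dr = cascade_rates(15, 0.5, 0.999)
print(f"overall false accept rate: {overall_far:.2e}")
print(f"overall detection rate:    {overall_dr:.4f}")
print("template sizes:", template_sizes(20, 1.2, 100))
```

Under these assumptions, the 15-stage cascade targets an overall false accept rate of 0.5^15 ≈ 3.05 × 10^-5 per scanned subwindow while retaining about 0.999^15 ≈ 0.985 of the faces.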
2. https://sites.google.com/site/leeplus/publications/facedetectionusingsurfcascade

For benchmark evaluation, Jain and Learned-Miller [3] provided evaluation code for the comparison of different face detection algorithms. There are two metrics for performance evaluation based on ROC: the discrete score metric and the continuous score metric, which correspond to coarse match (similar to previous evaluations in the face detection literature) and precise match, respectively, between the detection and the ground truth. Two experimental setups are proposed in [3]. The first experiment (EXP-1) requires 10-fold cross-validation, while the second experiment (EXP-2) allows unrestricted training, which means that images outside FDDB can be used for face detector training.

We followed both experimental protocols. For EXP-1, we trained 10 face detectors, with the same settings described in Section 5.1, and tested them separately using 10-fold cross-validation. On average, we used about 4,500 face images annotated in FDDB to train a single face detector. Fig. 6 shows some face images that were cropped from the FDDB database for training our face detectors. Since FDDB does not provide a set of nonface images, we replaced all annotated face regions with black patches in the FDDB images and then used the resulting images to bootstrap nonface samples. Fig. 7 illustrates such modified images.

Fig. 6. Face images cropped from the FDDB database [3].

Fig. 7. Modified images from the FDDB database [3] for bootstrapping nonface samples.

For EXP-2, we used the detector trained with data outside FDDB, as described in the previous subsection. For evaluation, this detector was applied on each subset of the FDDB database separately, and an average performance is reported.

We compared our method with state-of-the-art results reported on the FDDB website³. The ROC curves of various algorithms are depicted in Fig. 8 for the discrete score metric and in Fig. 9 for the continuous score metric. Note that all the baseline results are for EXP-2, because we did not find any result following the EXP-1 protocol. In both Figs. 8 and 9, the curve labels in the legend are sorted in descending order of the detection rates at zero false positives (FP=0). Note also that, on average, FP=285 generally means one false detection per image for the FDDB experiments. Therefore, the useful FPs are in the range [0,500]; we show the X axis in logarithmic scale to emphasize the performance at low FPs. Among the baseline methods, “Olaworks Inc.” and “Illuxtech Inc.” are two commercial detectors. Their methods, as well as the method of “Shenzhen University”, have not been published. “SURF Cascade” is the SURF descriptor based cascade method proposed by Li et al. in [13], which is the best published result at low false positives to date. The method of Zhu-Ramanan [40] was evaluated by the FDDB team, and the result, reported on the FDDB website, is now the state of the art among published methods. For the proposed NPD face detector, besides scaling the detection template in a nearest neighbor fashion, we also tried building the image pyramid representation with the default imresize function in MATLAB, and applied the 20 × 20 detection template. Since this function uses the bicubic interpolation method with antialiasing, we call the resulting detector “Smooth NPD”.

3. http://vis-www.cs.umass.edu/fddb/results.html

Fig. 8. ROC curves for face detection on the FDDB database [3] with the discrete score metric. (Axes: True Positive Rate versus False Positives, 1 to 500, logarithmic scale. Legend, detection rate at FP=0: 50% Smooth NPD, EXP-1; 47% Smooth NPD, EXP-2; 45% NPD, EXP-2; 38% NPD, EXP-1; 27% Zhu-Ramanan; 25% Shenzhen University; 25% Illuxtech Inc.; 5% VJGPR; 3% Mikolajczyk et al.; 3% Olaworks Inc.; 3% SURF Cascade; 1% Viola-Jones; 0% XZJY; 0% Koestinger et al.; 0% Segui et al.; n/a PEP; n/a Subburaman-Marcel.)

Fig. 9. ROC curves for face detection on the FDDB database [3] with the continuous score metric. (Axes: True Positive Rate versus False Positives, 1 to 500, logarithmic scale. Legend, detection rate at FP=0: 34% Smooth NPD, EXP-1; 31% Smooth NPD, EXP-2; 30% NPD, EXP-2; 26% NPD, EXP-1; 21% Zhu-Ramanan; 17% Shenzhen University; 16% Illuxtech Inc.; 5% Olaworks Inc.; 3% VJGPR; 2% Mikolajczyk et al.; 2% SURF Cascade; 1% Viola-Jones; 0% XZJY; 0% Koestinger et al.; 0% Segui et al.; n/a PEP; n/a Subburaman-Marcel.)

From the discrete score metric results shown in Fig. 8, it can be observed that the proposed method

outperforms all the baseline methods except Olaworks Inc. However, the proposed NPD detector is much better than Olaworks’ detector when FP < 10. In fact, when FP=0 (shown in the legend), the proposed NPD detector detects 45% of the annotated FDDB faces in the coarse sense (50% overlap with ground truth), while the detection rates of all baseline detectors are below 30%. Note that with a subset of the training set that was previously used for SURF Cascade [13], NPD for EXP-2 shows much better performance than SURF Cascade. Further, the Smooth NPD is slightly better than NPD, but with an additional cost of smoothing computation. It is also observed that the results of the NPD detectors trained for EXP-1 and EXP-2 are comparable, though the training data size for EXP-2 is several times larger than that for EXP-1. This result indicates that FDDB contains representative images for unconstrained face detection. However, it is not easy to handle all this data in training a single detector (recall the large variations in face appearance in Fig. 6). Note that the generic NPD features are learned in regression trees to divide and conquer the complex face manifolds.

Fig. 10. Detected faces in the FDDB database [3] by the proposed NPD method. Green boxes are detections by the NPD detector, while red ellipses are ground truth annotations.

Similar observations can be found in Fig. 9 for the continuous score metric, except that Zhu-Ramanan is slightly better than the proposed method when FP > 5,

and “Smooth NPD, EXP-1” outperforms Olaworks Inc. Table 1 shows a comparison of detection rates for EXP-2 on the FDDB database at FP=0, 10, and 100. It is promising that at low false positives, the proposed method is either much better than the baseline methods, or comparable to the best performers.

TABLE 1
Comparison of detection rates (%) with both discrete and continuous metrics for EXP-2 on the FDDB database [3]*

                                    Discrete Metric              Continuous Metric
Method                          FP = 0   FP = 10  FP = 100    FP = 0   FP = 10  FP = 100
Smooth NPD                       47.23     70.41     75.38     31.26     46.78     50.60
NPD                              45.32     67.47     73.72     29.99     44.95     49.63
Zhu-Ramanan [40]                 27.38     63.88     73.08     21.25     48.62     55.40
Shenzhen University              24.87     59.06     72.50     16.51     39.12     48.05
Illuxtech Inc.                   24.56     60.55     68.86     16.50     40.82     47.01
VJGPR [38]                        4.58     15.76     51.00      2.95     10.20     33.16
Mikolajczyk et al. [47] [3]       3.25     10.23     33.28      2.10      6.61     21.67
Olaworks Inc.                     2.94     67.84     82.58      4.79     45.18     53.34
SURF Cascade [13]                 2.59     48.27     70.01      1.60     30.21     44.36
Viola-Jones [1] [3]               1.39     10.02     32.64      0.90      6.48     21.26
XZJY [41]                         0.31      7.91     67.51      0.19      4.99     43.40
Koestinger et al. [48]            0.19     21.47     57.03      0.14     15.38     40.55
Segui et al. [49]                 0.00     15.08     67.94      0.00      9.78     43.76
PEP [42]                           n/a      8.43     73.35       n/a      5.38     47.30
Subburaman-Marcel [37]             n/a      0.54     17.25       n/a      0.36     11.27

* Red numbers represent the best results, while blue numbers are the second best results. Results for Mikolajczyk et al. [47] and Viola-Jones [1] are reported in [3]. Results for Zhu-Ramanan [40] are evaluated by the FDDB team and reported on their website.

Fig. 10 shows some examples of detected faces in the FDDB database by the proposed NPD method. Rotated, occluded, and out-of-focus faces can be successfully detected by the proposed method as shown in Fig. 10. Some occluded faces (e.g. 4th row and 2nd column in Fig. 10) and blurred faces (e.g. top-right image in Fig. 10) that are not annotated in the ground truth can still be detected by the proposed method. However, there are a number of faces that cannot be detected by the proposed method, especially in very crowded scenes (see the 1st image and the 3rd image in row 1, and the bottom-right image in Fig. 10).

5.3 Evaluation on GENKI Database

The GENKI database [45] was collected by the Machine Perception Laboratory, University of California, San Diego. We evaluated on the SZSL subset of the current release of the GENKI database, GENKI-R2009a, which contains 3,500 images collected from the Internet. These images include a wide range of backgrounds, illumination conditions, geographical locations, personal identity, and ethnicity. Some examples of face images from the GENKI database are shown in Fig. 12, with labeled detections by the proposed NPD method. Most images in the GENKI dataset contain only one face. In that sense, the GENKI dataset is not as challenging as the FDDB dataset. Some of the images in the GENKI-SZSL dataset contain faces that are not labeled; therefore, they are not suitable for the face detection evaluation task. After removing such unlabeled images, we are left with 3,270 images for face detection evaluation. For performance evaluation, it is not fair to apply the learned detector described in Section 5.1, because the training data used for that detector contained face images from the GENKI database⁴. Therefore, we used the NPD face detector trained on the first fold of the FDDB 10-fold cross-validation to evaluate the GENKI database. We also evaluated the Viola-Jones face detector implemented in OpenCV 2.4, and a commercial face detector, PittPatt [50]. We again used the benchmark evaluation code from [3] for performance evaluation, but slightly modified the code to allow ground truth annotations as rectangles. The ROC curves of the three methods are shown in Fig. 11 for both the discrete and continuous score metrics. The results show that the proposed NPD face detector performs much better than both the Viola-Jones and PittPatt face detectors.

Fig. 11. ROC curves for face detection on the GENKI-SZSL dataset [45] with (a) discrete and (b) continuous score metrics.

4. This training data is provided by the authors of [13]. We cannot remove the GENKI face images from this training data, because we have access to only the raw face images in binary format; we do not know the corresponding filenames and sources.

5.4 Evaluation on CMU-MIT Database

The CMU-MIT face dataset [34] is one of the early benchmarks for face detection. The CMU-MIT frontal face data set contains 130 gray-scale images with a total of 511 faces, most of which are not occluded. We applied the same NPD detector described in Subsection 5.1 on this database. We also used the modified benchmark evaluation code from [3] with the discrete score metric for performance evaluation. Fig. 13 shows the ROC curves for the proposed NPD face detector, the Soft cascade method [51], the SURF cascade method [13], and the Viola-Jones detector [1]. The results show that, compared to the Viola-Jones frontal face detector, the NPD detector performs better when the number of false positives is small (FP < 50), while it is slightly worse than Viola-Jones at higher FPs. Compared to the SURF cascade detector, the NPD detector is better when FP < 3, but the SURF cascade method outperforms NPD at higher FPs. Note that

the SURF cascade method uses a face template of size 40 × 40 pixels, which is four times larger than our face detection template (20 × 20 pixels). Generally, a larger face template contains more features for face description, but is computationally more expensive and may have a limitation in detecting blurred faces. In addition, the proposed NPD method is not as good as the Soft cascade, the state-of-the-art method on the CMU-MIT dataset. Still, the proposed NPD method can detect about 80% of the frontal faces without any false positives, which is promising since we did not train a frontal face detector. Some of the detected faces in the CMU-MIT dataset by the proposed NPD method are shown in Fig. 14.

Fig. 12. Detected faces in the GENKI-SZSL dataset [45] by the proposed NPD method.

Fig. 13. ROC curves for face detection on the CMU-MIT dataset [34].

Fig. 14. Detected faces in the CMU-MIT dataset [34] by the proposed NPD method.

5.5 Analysis of the Proposed Face Detector

Since the proposed face detector is a combination of regression trees and the NPD features, it is instructive to determine the contribution of each of these two components. In the following, we trained all compared face detectors on the same training set and cascade training settings described in Section 5.1.

First, we trained a face detector based on the NPD features, but with the stump classifier [1], a basic tree classifier with only one splitting node. As shown in Table 2, the stump classifier based detector contains 1,597 weak classifiers. In contrast, the regression tree based detector contains 176 weak classifiers, indicating that combining NPD features in a regression tree is much more effective in constructing a weak classifier for AdaBoost learning. Furthermore, in cascade processing, each scanning subwindow needs to evaluate 36.5 NPD features, on average, for the stump classifier based detector. On the other hand, for the regression tree based detector, only 34.4 NPD features, on average, need to be evaluated, which means that using regression tree does not increase the average computation cost. The face detectors based on the stump classifier and the regression tree were tested on the FDDB database. The ROC curves of these two detectors are shown in Fig. 15 for both the discrete score metric and continuous score metric. As illustrated, using regression trees instead of stump classifier improves the face detection performance by about 2% – 10% for discrete metric and 1% – 7%

for continuous metric. The improvement is larger at smaller false positives.

Fig. 15. Comparison of NPD face detectors based on stumps and regression trees on the FDDB database [3] with (a) discrete and (b) continuous score metrics. (Axes: True Positive Rate versus False Positives, 1 to 500. Curves: NPD-Tree, NPD-Stump.)

Fig. 16. Comparison of different features in regression tree based face detector on the FDDB database [3] with (a) discrete and (b) continuous score metrics. (Axes: True Positive Rate versus False Positives, 1 to 500. Curves: NPD2, NPD, Haar, LBP, POF.)

TABLE 2
Comparison of detector complexity.

                       Haar     LBP     POF   NPD-stump   NPD-tree
#weak classifiers       150     108     276       1,597        176
#features learned     1,763   1,269   3,082       1,597      2,035
#feature evaluations   33.9    30.4    44.3        36.5       34.4

Next, we fixed the regression tree based weak learner, but tried three other local features, namely Haar-like features [1], LBP [52], and pixelwise ordinal feature (POF) [30]. Since LBP is a discrete label, we treated it as a categorical variable in the regression tree learning; that is, for branching at each tree node, the algorithm finds the optimal criterion that splits the discrete LBP codes into two groups. Using the same training set as in Section 5.1, we trained the three detectors using Haar, LBP, and POF, respectively. The model complexity of these detectors is summarized in Table 2. It can be observed that the NPD model appears to be more efficient than the POF model, though it requires slightly more feature evaluations than the Haar and LBP models. However, it should be noted that the computation of Haar-like features requires computing integral images, while for LBP, each feature needs to compare 8 pairs of pixels and convert the resulting binary string to the corresponding decimal number. In contrast, using look up tables as aforementioned, computing the NPD feature requires only one memory access.

The four detectors with different local features were tested on the FDDB database, and the corresponding ROC curves are shown in Fig. 16 for both the discrete and continuous score metrics. The NPD detector performs better than the Haar, LBP, and POF detectors with the same regression tree based weak learners. The performance improvements due to NPD features over Haar, LBP, and POF features are about 6%, 10%, and 6%, respectively, for discrete metric, and about 4%, 6%, and 4%, respectively, for continuous metric, at FP=1. NPD is better than POF, because with NPD features the regression tree learns optimal thresholds to form more robust ordinal rules. NPD performs better than Haar and LBP, especially at low false positives, indicating that combining optimal pixel-level features in regression trees provides better discrimination between faces and nonfaces. On the other hand, one can also observe that except at low false positives, NPD performs about the same or just slightly better than Haar-like features and LBP.

We also tried a variation of NPD, defined as $f(x, y) = \frac{x - y}{\sqrt{x^2 + y^2}}$. This is denoted as NPD2. With the same setting as NPD, we trained another detector based on NPD2. The testing results on FDDB are also shown in Fig. 16; the performances of NPD and NPD2 are about the same, with NPD2 being slightly better. However, considering that NPD is simpler than NPD2, we still suggest the formulation of Eq. (1).

5.6 Evaluation Under Specific Detection Challenge

In the following, we evaluate how the proposed NPD face detector performs under illumination variation, pose variation, occlusion, and blur (or low resolution). Note that these four challenges are often encountered simultaneously in an image. In our selection of the four subsets, one per specific challenge, we focused on the main source of variation in each image. For each challenge, we selected 100 images from the FDDB database [3] (examples are shown in Fig. 17), and ran the NPD detector described in Subsection 5.1 on each subset separately. Fig. 18 shows that the NPD face detector performs the best on the illumination subset. This is not surprising since the proposed NPD features are robust against illumination variations. Further, the NPD method performs better for face images with pose variation than with occlusion or blur. These results indicate that occlusion and blur are the two major challenges for unconstrained face detection, which have not been well addressed in the literature.

The NPD face detector is also compared with the Viola-Jones face detector implemented in OpenCV 2.4, and the commercial face detector PittPatt on the four subsets of FDDB discussed above. The resulting ROC curves with the discrete score metric are shown in Fig. 19. These plots show that the proposed NPD

face detector outperforms both the Viola-Jones and the PittPatt face detectors on all four subsets. The reasons for the superior performance of the proposed method under illumination variations, pose variations, occlusions, and blur were discussed in Subsection 4.2.

Fig. 17. Example images and annotated faces for four subsets extracted from the FDDB database [3]: (a) illumination, (b) pose, (c) occlusion, (d) blur.

Fig. 18. ROC curves of the proposed NPD face detector on the four subsets extracted from the FDDB database [3] with (a) discrete and (b) continuous score metrics.

Fig. 19. ROC curves for face detection on four subsets from the FDDB database [3] with the discrete score metric: (a) occlusion, (b) pose, (c) illumination, (d) blur.

5.7 Detection Speed

For handheld devices like mobile phones, the available resources for computation and memory are rather limited. Therefore, a face detector’s complexity and detection speed are very important for embedded systems. In this subsection, we report the detection speed of the proposed NPD face detector, compared with the Viola-Jones⁵ face detector in OpenCV 2.4, which is known to be optimized for speed. The proposed NPD face detector is implemented in C++; the size of the model trained in Section 5.1 is 41KB. Two platforms were selected for this evaluation: (i) a normal desktop PC with the Intel Core i5-2400 @3.1GHz CPU (4 cores, 4 threads), and (ii) a netbook with the Intel Atom N450 @1.6GHz processor (1 core, 2 threads), to simulate low-end devices. For face detection evaluation, a video clip of the movie “Jobs” was used. This video clip shows a busy campus, with each frame containing from one to tens of faces. The length of the video clip is about 2 minutes, containing 3,950 frames in total. The original resolution is 1280 × 720. To test the detection speed at various resolutions, the original video clip was cropped and resized to 1920 × 1080, 800 × 600, and 640 × 480. In this evaluation, the minimal face size to detect was set to 40 × 40 pixels, and the scaling factor was 1.2. Multithreading was enabled in both the NPD and OpenCV detectors for parallel computation.

5. We have tested four models of the Viola-Jones face detector provided in OpenCV 2.4, and found that the “haarcascade_frontalface_alt” model is the fastest, which was selected here for comparison.

The test results (measured in terms of Frames Per Second, FPS) are shown in Table 3. Note that we only calculated the face detection time, regardless of the video decoding time. The detection speed of the SURF cascade [13], a fast face detection algorithm, is also compared in Table 3. The detection speed of the SURF cascade algorithm is taken directly from [13], since we do not have access to the code. The detection parameters in [13] are the same as in our algorithm, except that the authors of [13] used the Intel Core-i7 CPU for the desktop computer. From Table 3 it can be observed that the NPD detector is much faster than both the OpenCV and SURF cascade detectors. On the Atom N450 processor, the detection speed of the NPD detector is about 9 times that of the OpenCV detector; on the i5 processor the speed of the NPD detector is about 7 times the speed of the OpenCV detector.

Table 3 shows that the NPD detector can run in

real-time (29.6 FPS) on the i5 desktop PC for processing 1920 × 1080 high definition videos. For the standard VGA (640 × 480) videos, the NPD detector on the i5 processor can detect faces at an even faster speed (177.6 FPS). On the low-end Atom platform, the NPD detector can still run in near real-time (19.4 FPS) for processing VGA videos. The reasons for the high processing speed of NPD are twofold. First, the NPD feature is simple, involving only two pixels. Further, with the look up table technique, the evaluation of each NPD feature requires only one memory access. Second, the NPD feature can be easily scaled to various sizes of detection templates. Therefore, pre-calculating and storing multiscale templates can speed up detection because rescaling the input image is avoided.

TABLE 3
Comparison of face detection speed (as FPS).

CPU                                     Resolution       NPD   OpenCV   SURF [13]*
Atom N450 @1.6GHz (1 core, 2 threads)   640 × 480       19.4      2.1          5.8
                                        800 × 600       12.1      1.3            -
                                        1280 × 720       6.8      0.7            -
                                        1920 × 1080      3.0      0.3            -
i5-2400 @3.1GHz (4 cores, 4 threads)    640 × 480      177.6     24.4         71.3
                                        800 × 600      112.6     16.2            -
                                        1280 × 720      63.3      8.9            -
                                        1920 × 1080     29.6      3.6            -

* “-” means data is not available for the SURF detector [13]. This is because we do not have access to the code, and [13] only reports detection speed at resolution 640 × 480 or lower.

6 SUMMARY AND FUTURE WORK

We have proposed a fast and accurate method for face detection in cluttered scenes. The method is based on the normalized pixel difference (NPD) feature in conjunction with boosted regression trees. An analysis of the NPD feature shows that it possesses properties of scale invariance, boundedness, and reconstruction ability. We have developed a method for learning the optimal set of NPD features and their combinations. As a result, a single cascade AdaBoost classifier is able to achieve promising results for face detection with large pose variations and occlusions. Evaluations on three public domain databases, namely FDDB, GENKI, and CMU-MIT, show that the proposed method outperforms state-of-the-art methods for unconstrained face detection. The proposed NPD face detector can process 1920 × 1080 video frames in real time, which is about 6 times faster than the Viola-Jones face detector implemented in OpenCV 2.4. The reported results also show that occlusions and blur are two big challenges for face detection. Our future work will use the NPD feature and the classifier learning method for other applications such as face attribute classification (e.g. pose estimation, age estimation, and gender classification) and pedestrian detection.

REFERENCES

[1] P. Viola and M. Jones, “Rapid object detection using a boosted
cascade of simple features,” in IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, 2001.
[2] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic
regression: a statistical view of boosting,” The Annals of Statistics,
vol. 28, no. 2, pp. 337–374, April 2000.
[3] V. Jain and E. Learned-Miller, “FDDB: A benchmark for
face detection in unconstrained settings,” University of
Massachusetts, Amherst, Tech. Rep. UM-CS-2010-009, 2010.
[4] R. Lienhart and J. Maydt, “An extended set of Haar-like
features for rapid object detection,” in Proceedings of the IEEE
International Conference on Image Processing, 2002.
[5] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum,
“Statistical learning of multi-view face detection,” in Proceedings
of the 7th European Conference on Computer Vision, 2002.
[6] M. Jones and P. Viola, “Fast multi-view face detection,”
Mitsubishi Electric Research Lab TR-2003-96, 2003.
[7] B. Froba and A. Ernst, “Face detection with the modified
census transform,” in Proceedings of the 6th IEEE International
Conference on Automatic Face and Gesture Recognition, 2004.
[8] H. Jin, Q. Liu, H. Lu, and X. Tong, “Face detection using
improved LBP under bayesian framework,” in Proceedings of
the 3rd International Conference on Image and Graphics, 2004.
[9] T. Mita, T. Kaneko, and O. Hori, “Joint Haar-like features for
face detection,” in Proceedings of the 10th IEEE International
Conference on Computer Vision, vol. 2, 2005, pp. 1619–1626.
[10] H. Zhang, W. Gao, X. Chen, and D. Zhao, “Object detection
using spatial histogram features,” Image and Vision Computing,
vol. 24, no. 4, pp. 327–341, 2006.
[11] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, “Face
detection based on multi-block LBP representation,” in Proceedings
of the IAPR/IEEE International Conference on Biometrics, 2007.
[12] S. Yan, S. Shan, X. Chen, and W. Gao, “Locally assembled
binary (LAB) feature with feature-centric cascade for fast and
accurate face detection,” in Proceedings of IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2008.
[13] J. Li, T. Wang, and Y. Zhang, “Face detection using SURF
cascade,” in ICCV BeFIT workshop, 2011.
[14] B. Wu, H. Ai, C. Huang, and S. Lao, “Fast rotation invariant
multi-view face detection based on real AdaBoost,” in IEEE
Conference on Automatic Face and Gesture Recognition, 2004.
[15] S. Li and Z. Zhang, “Floatboost learning and statistical face
detection,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 26, no. 9, pp. 1112–1123, 2004.
[16] C. Huang, H. Ai, Y. Li, and S. Lao, “High-performance
rotation invariant multiview face detection,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 29, no. 4, pp.
671–686, 2007.
[17] K. Hotta, “A robust face detector under partial occlusion,” in
International Conference on Image Processing, 2004.
[18] Y. Lin, T. Liu, and C. Fuh, “Fast object detection with
occlusions,” in Proceedings of the European Conference on Computer
Vision, 2004, pp. 402–413.
[19] Y. Lin and T. Liu, “Robust face detection with multi-class
boosting,” in Proceedings of IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, 2005.
[20] J. Chen, S. Shan, S. Yang, X. Chen, and W. Gao, “Modification
of the adaboost-based detector for partially occluded faces,”
in 18th International Conference on Pattern Recognition, 2006.
[21] L. Goldmann, U. Monich, and T. Sikora, “Components and
their topology for robust face detection in the presence of
partial occlusions,” IEEE Transactions on Information Forensics
and Security, vol. 2, no. 3, pp. 559–569, 2007.
[22] M. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in
images: A survey,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 24, no. 1, pp. 34–58, 2002.
[23] E. H. Weber, “Tastsinn und gemeingefühl,” in Handwörterbuch
der Physiologie, R. Wagner, Ed. Brunswick: Vieweg, 1846, pp.
481–588.
[24] C. Zhang and Z. Zhang, “A survey of recent advances in face
detection,” Microsoft Research, Tech. Rep. MSR-TR-2010-66,
June 2010.
[25] P. Sinha, “Qualitative representations for recognition,” in
Biologically Motivated Computer Vision Workshop, 2002.

[26] J. Sadr, S. Mukherjee, K. Thoresz, and P. Sinha, “Toward the
fidelity of local ordinal encoding,” in Proceedings of the Annual
Conference on Neural Information Processing Systems, 2001.
[27] S. Baluja, M. Sahami, and H. Rowley, “Efficient face orientation
discrimination,” in International Conference on Image Processing,
vol. 1, 2004, pp. 589–592.
[28] S. Liao, Z. Lei, X. Zhu, Z. Sun, S. Z. Li, and T. Tan, “Face
recognition using ordinal features,” in Proceedings of the 1st
IAPR International Conference on Biometrics, Hong Kong, 2006.
[51] L. Bourdev and J. Brandt, “Robust object detection via soft
cascade,” in IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, vol. 2, 2005, pp. 236–243.
[52] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution
gray-scale and rotation invariant texture classification with
local binary patterns,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[29] Y. Abramson, B. Steux, and H. Ghorayeb, “Yet even faster
(YEF) real-time object detection,” International Journal of Intel-
ligent Systems Technologies and Applications, vol. 2, no. 2, pp.
102–112, 2007.
[30] L. Wang, L. Ding, X. Ding, and C. Fang, “2D face fitting-
assisted 3D face reconstruction for pose-robust face recogni-
tion,” Soft Computing-A Fusion of Foundations, Methodologies and
Applications, vol. 15, no. 3, pp. 417–428, 2011.
[31] R. Lienhart, A. Kuranov, and V. Pisarevsky, “Empirical analysis
of detection cascades of boosted classifiers for rapid object
detection,” MRL, Intel Labs, Tech. Rep., May 2002.
[32] S. Brubaker, J. Wu, J. Sun, M. Mullin, and J. Rehg, “On the
design of cascades of boosted ensembles for face detection,”
Georgia Institute of Technology, Tech. Rep. GIT-GVU-05-28,
2005.
[33] L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classifica-
tion and Regression Trees. Chapman & Hall/CRC, 1984.
[34] H. Rowley, S. Baluja, and T. Kanade, “Rotation invariant
neural network-based face detection,” in IEEE International
Conference on Computer Vision and Pattern Recognition, 1998.
[35] E. Seemann, B. Leibe, and B. Schiele, “Multi-aspect detection
of articulated objects,” in Proceedings of IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2006.
[36] T. Kim and R. Cipolla, “MCBoost: Multiple classifier boosting
for perceptual co-clustering of images and visual features,”
Proceedings of Neural Information Processing Systems, 2008.
[37] V. B. Subburaman and S. Marcel, “Fast bounding box estima-
tion based face detection,” in ECCV Workshop on Face Detection:
Where we are and what next, 2010.
[38] V. Jain and E. Learned-Miller, “Online domain adaptation of
a pre-trained cascade of classifiers,” in IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2011.
[39] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up
robust features (SURF),” Computer Vision and Image Understand-
ing, vol. 110, no. 3, pp. 346–359, 2008.
[40] X. Zhu and D. Ramanan, “Face detection, pose estimation,
and landmark localization in the wild,” in IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2012.
[41] X. Shen, Z. Lin, J. Brandt, and Y. Wu, “Detecting and aligning
faces by image retrieval,” in IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2013.
[42] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang, “Probabilistic
elastic part model for unsupervised face detector adaptation,”
in IEEE International Conference on Computer Vision, 2013.
[43] J. Chen, S. Shan, C. He, G. Zhao, M. Pietikäinen, X. Chen,
and W. Gao, “WLD: A robust local image descriptor,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 32,
no. 9, pp. 1705–1720, Sept. 2010.
[44] F. Kriegler, W. Malila, R. Nalepka, and W. Richardson, “Pre-
processing transformations and their effects on multispectral
recognition,” in Proceedings of the Sixth International Symposium
on Remote Sensing of Environment, 1969, pp. 97–131.
[45] http://mplab.ucsd.edu, “The MPLab GENKI Database,
GENKI-SZSL Subset.”
[46] T. L. Berg, A. C. Berg, J. Edwards, and D. Forsyth, “Whos in
the picture,” Advances in neural information processing systems,
vol. 17, pp. 137–144, 2004.
[47] K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human detec-
tion based on a probabilistic assembly of robust part detec-
tors,” in European Conference on Computer Vision (ECCV), 2004.
[48] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Robust
face detection by simple means,” in DAGM Computer Vision in
Applications Workshop, 2012.
[49] S. Seguı́, M. Drozdzal, P. Radeva, and J. Vitrià, “An integrated
approach to contextual face detection.” in International Confer-
ence on Pattern Recognition Applications and Methods, 2012.
[50] PittPatt Software Developer Kit, Pittsburgh Pattern Recogni-
tion, Inc., http://www.pittpatt.com.
1

APPENDIX A
BOUNDEDNESS OF NPD

Lemma 1 (Boundedness): $\forall x, y \ge 0$, the NPD feature $f(x, y)$ is bounded in $[-1, 1]$. In addition, $f(x, y) = 1$ if and only if $x > 0$ and $y = 0$; and $f(x, y) = -1$ if and only if $x = 0$ and $y > 0$.

Proof: From the definition of NPD we know that $x \ge 0$, $y \ge 0$, and $f(0, 0) = 0 \in [-1, 1]$. When either $x$ or $y$ is nonzero, for example, $y \ge 0$ but $x > 0$, Eq. (1) can be reformulated as

$$
f(x, y) = \frac{x - y}{x + y} = \frac{2x}{x + y} - 1 = \frac{2}{1 + \frac{y}{x}} - 1 \le 1. \tag{a}
$$

The inequality in Eq. (a) holds because $y \ge 0$, and equality holds if and only if $x > 0$ and $y = 0$. Similarly, when $x \ge 0$ but $y > 0$, Eq. (1) can be reformulated as

$$
f(x, y) = \frac{x - y}{x + y} = 1 - \frac{2y}{x + y} = 1 - \frac{2}{\frac{x}{y} + 1} \ge -1. \tag{b}
$$

The inequality in Eq. (b) holds because $x \ge 0$, and equality holds if and only if $x = 0$ and $y > 0$. □

APPENDIX B
PROOF OF THEOREM 1

Denote $f_{ij} = f(x_i, x_j)$. From Eq. (1) we have

$$
f_{ij}(x_i + x_j) = x_i - x_j. \tag{c}
$$

Equivalently,

$$
(f_{ij} - 1)x_i + (f_{ij} + 1)x_j = 0. \tag{d}
$$

Therefore, we have the following set of linear equations

$$
\mathbf{F}\mathbf{x} = \mathbf{0}, \tag{e}
$$

where

$$
\mathbf{F} = \begin{bmatrix}
f_{12} - 1 & f_{12} + 1 & 0 & \cdots & 0 \\
f_{13} - 1 & 0 & f_{13} + 1 & \cdots & 0 \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
f_{1p} - 1 & 0 & 0 & \cdots & f_{1p} + 1 \\
0 & f_{23} - 1 & f_{23} + 1 & \cdots & 0 \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
0 & 0 & 0 & \cdots & f_{p-1,p} + 1
\end{bmatrix} \tag{f}
$$

is a sparse $d \times p$ matrix with each row containing at most two nonzero entries. Furthermore, from the formulation of $\mathbf{F}$ we know that each row of $\mathbf{F}$ contains at least one nonzero entry, because $(f_{ij} - 1) \ne (f_{ij} + 1)$ always holds for all $i$ and $j$, so the two entries cannot both be zero. Without loss of generality, let's assume $f_{12} + 1 \ne 0$. Then it follows that $f_{1j} + 1 \ne 0$, $\forall j$: if there existed a $j$ such that $f_{1j} + 1 = 0$, then from Lemma 1 we would have $x_1 = 0$, which would further lead to $f_{12} + 1 = 0$, violating the assumption that $f_{12} + 1 \ne 0$. Therefore, the first $p - 1$ rows in the matrix $\mathbf{F}$ are linearly independent of each other, since among them the row containing $f_{1j} + 1$ is the only one with a nonzero entry in column $j$.

We will further prove that $\mathrm{rank}(\mathbf{F}) = p - 1$. In fact, any row of the matrix $\mathbf{F}$ can be linearly expressed by the first $p - 1$ rows. To show this, let's denote the row containing $f_{ij} - 1$ and $f_{ij} + 1$ by $r_{ij}$. We will show that

$$
r_{ij} = \frac{f_{ij} - 1}{f_{1i} + 1}\, r_{1i} + \frac{f_{ij} + 1}{f_{1j} + 1}\, r_{1j} \tag{g}
$$

holds for all $i > 1$ and $j > i$. In fact, it is easy to verify that the above equation holds for all columns of $r_{ij}$, $r_{1i}$, and $r_{1j}$ after the first column. So, we only need to show that, for the first column, we have

$$
\frac{(f_{1i} - 1)(f_{ij} - 1)}{f_{1i} + 1} + \frac{(f_{1j} - 1)(f_{ij} + 1)}{f_{1j} + 1} = 0, \tag{h}
$$

which is equivalent to

$$
f_{1i} f_{1j} f_{ij} - f_{1i} + f_{1j} - f_{ij} = 0. \tag{i}
$$

This can be verified by substituting each feature with its definition in Eq. (1).

Given that $\mathrm{rank}(\mathbf{F}) = p - 1$, we know that the nullspace of $\mathbf{F}$ is one-dimensional, that is, it is spanned by a single nonzero vector, which is a solution to Eq. (e). Furthermore, from Lemma 1 we can infer that $(f_{ij} - 1)(f_{ij} + 1) \le 0$, hence Eq. (d) implies that $x_i x_j \ge 0$, $\forall i, j$. Consequently, Eq. (e) always has a nonnegative solution $\hat{\mathbf{x}}$, and all solutions to Eq. (e) must be of the form $c\hat{\mathbf{x}}$, where $c$ is a scale factor. □

Given this proof, we make four observations below:
• For a solution, $c$ can be any real value, but to satisfy the constraint that all pixel intensity values are nonnegative, $c$ should be positive.
• The solution to Eq. (e) spans a one-dimensional subspace (the nullspace).
• A specific solution can be obtained by assigning $x_1 = 1$ and solving for the other variables from the first $p - 1$ rows of Eq. (e) in linear time.
• When the original image is $\mathbf{x} = \mathbf{0}$, it can also be reconstructed by $c\hat{\mathbf{x}}$ where $\hat{x}_i = 1$, $\forall i$, and $c = 0$. However, in this case a solution with $c > 0$ is not generally regarded as a scaled version of the original image $\mathbf{x} = \mathbf{0}$.
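The two appendices can be checked numerically on a small example. The following is a minimal sketch, not the authors' implementation: the function names `npd` and `reconstruct` and the pixel values are hypothetical, and the reference pixel $x_1$ is assumed to be positive. It uses exact rational arithmetic to confirm the bound of Lemma 1 and identity (i), and to recover an image up to scale from $f_{12}, \ldots, f_{1p}$ by setting $x_1 = 1$ in Eq. (d), which gives $x_j = (1 - f_{1j})/(1 + f_{1j})$.

```python
from fractions import Fraction
from itertools import combinations


def npd(x, y):
    """NPD feature of Eq. (1): (x - y) / (x + y), with f(0, 0) defined as 0."""
    if x == 0 and y == 0:
        return Fraction(0)
    return Fraction(x - y, x + y)


def reconstruct(f_first_row):
    """Recover an image up to scale from f_12, ..., f_1p (assumes x_1 > 0).

    Setting x_1 = 1 in Eq. (d) gives x_j = (1 - f_1j) / (1 + f_1j).
    """
    x_hat = [Fraction(1)]
    for f1j in f_first_row:
        if f1j == -1:
            raise ValueError("f_1j = -1 means x_1 = 0; pick another reference pixel")
        x_hat.append((1 - f1j) / (1 + f1j))
    return x_hat


pixels = [40, 10, 0, 255, 128]  # hypothetical intensities, with x_1 > 0

# Lemma 1: every NPD value lies in [-1, 1].
features = {(i, j): npd(pixels[i], pixels[j])
            for i, j in combinations(range(len(pixels)), 2)}
assert all(-1 <= f <= 1 for f in features.values())

# Identity (i) for every pair 1 < i < j (indices are 0-based here).
for i, j in combinations(range(1, len(pixels)), 2):
    a, b, c = features[(0, i)], features[(0, j)], features[(i, j)]
    assert a * b * c - a + b - c == 0

# Theorem 1: the first p - 1 features determine the image up to scale c = x_1.
x_hat = reconstruct([features[(0, j)] for j in range(1, len(pixels))])
scale = pixels[0]
assert [scale * v for v in x_hat] == pixels
```

Exact fractions are used here only so that the equalities can be tested without floating-point tolerance; with floating-point features the same reconstruction applies, but comparisons would need a tolerance.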
