A Fast and Accurate Unconstrained Face Detector
Article in IEEE Transactions on Pattern Analysis and Machine Intelligence · August 2014
DOI: 10.1109/TPAMI.2015.2448075 · Source: arXiv
3 authors:
Stan Z Li
Westlake University
All content following this page was uploaded by Stan Z Li on 25 August 2014.
Abstract—We propose a method to address challenges in unconstrained face detection, such as arbitrary pose variations and
occlusions. First, a new image feature called Normalized Pixel Difference (NPD) is proposed. The NPD feature is computed as the
difference to sum ratio between two pixel values, inspired by the Weber Fraction in experimental psychology. The new feature is
scale invariant, bounded, and is able to reconstruct the original image. Second, we learn the optimal subset of NPD features and
their combinations via regression trees, so that complex face manifolds can be partitioned by the learned rules. This way, only
a single cascade classifier is needed to handle unconstrained face detection. Furthermore, we show that the NPD features can
be efficiently obtained from a look up table, and the detection template can be easily scaled, making the proposed face detector
very fast (about 178 FPS for 640x480 resolution videos and 30 FPS for 1920x1080 resolution videos on a desktop PC, about
6 times faster than OpenCV). Experimental results on three public face datasets (FDDB, GENKI, and CMU-MIT) show that the
proposed method outperforms the state-of-the-art methods in detecting unconstrained faces with arbitrary pose variations and
occlusions in cluttered scenes.
arXiv:1408.1656v1 [cs.CV] 6 Aug 2014
Index Terms—Unconstrained face detection, normalized pixel difference, regression tree, AdaBoost, cascade classifier
• Shengcai Liao and Stan Z. Li are with the National Laboratory of Pattern Recognition and the Center for Biometrics and Security Research, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. E-mail: {scliao,szli}@nlpr.ia.ac.cn
• Anil K. Jain is with the Dept. of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA. He is also affiliated with the Dept. of Brain & Cognitive Engineering, Korea University, Anamdong, Seongbukgu, Seoul 136-713, Republic of Korea. E-mail: [email protected]
1 INTRODUCTION
The objective of face detection is to find and locate faces in an image. It is the first step in automatic face recognition applications. Face detection has been well studied for frontal and near frontal faces. The Viola and Jones’ face detector [1] is the most well known face detection algorithm, which is based on Haar-like features and cascade AdaBoost [2] classifier. However, in unconstrained scenes such as faces in a crowd, state-of-the-art face detectors fail to perform well due to large pose variations, illumination variations, occlusions, expression variations, out-of-focus blur, and low image resolution. For example, the Viola-Jones face detector fails to detect most of the face images in the Face Detection Data set and Benchmark (FDDB) database [3] (examples shown in Fig. 1) due to the difficulties mentioned above. In this paper, we refer to face detection with arbitrary facial variations as the unconstrained face detection problem. We are interested in face detection in unconstrained scenarios such as video surveillance or images captured by hand-held devices.
Fig. 1. Face images annotated (red ellipses) in the FDDB database [3].
Numerous face detection methods have been developed following Viola and Jones’ work [1], mainly focusing on extracting different types of features and developing different cascade structures. A variety of complex features [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] have been proposed to replace the Haar-like features used in [1]. While these methods can improve the face detection performance to some extent, they generate a very large number (hundreds of thousands) of features and the resulting systems take too much time to train. Another development in face detection has been to learn different cascade structures for multiview face detection, such as parallel cascade [14], pyramid architecture [15], and Width-First-Search (WFS) tree [16]. All these methods need to learn one cascade classifier for each specific facial view (or view range). In unconstrained scenarios, however, it is not easy to define all possible views of a face, and the computational cost increases with an increasing number of classifiers in complex cascade
structure. Moreover, these approaches require manual labeling of face pose in each training image.
While some of the available methods [14], [15], [16] can handle multiview faces, they are not able to simultaneously consider other challenges such as occlusion. In fact, since these methods require partitioning multiview data into known poses, occlusion is not easy to handle in this way. On the other hand, while several studies addressed face detection under occlusion [17], [18], [19], [20], [21], they constrained themselves to detect only frontal faces under occlusion. As discussed in [22], a robust face detection algorithm should be effective under arbitrary variations in pose and occlusion, which remains an unresolved challenging problem.
In this paper, we are interested in developing effective features and robust classifiers for unconstrained face detection with arbitrary facial variation. First, we propose a simple pixel-level feature, called the Normalized Pixel Difference (NPD). An NPD is computed as the ratio of the difference between any two pixel intensity values to the sum of their values, in the same form as the Weber Fraction in experimental psychology [23]. The NPD feature has several desirable properties, such as scale invariance, boundedness, and ability to reconstruct the original image. We further show that NPD features can be obtained from a look up table, and the resulting face detection template can be easily scaled for multiscale face detection.
Second, we develop a method to construct a single cascade AdaBoost classifier that can effectively deal with complex face manifolds and handle arbitrary pose and occlusion conditions. While the individual NPD feature may have “weak” discriminative ability, our work indicates that a subset of NPD features can be optimally learned and combined to construct more discriminative features in a regression tree. In this way, different types of faces can be automatically divided into different leaves of a regression tree, and the complex face manifold in high dimensional space can be partitioned in the learning process. This is a “divide and conquer” strategy to tackle unconstrained face detection in a single classifier, without pre-labeling of views in the training set of face images. The resulting face detector is robust to variations in pose, occlusion, and illumination, as well as to blur and low image resolution.
The novelty of this work is summarized as follows:
• A new type of feature, called NPD, is proposed, which is efficient to compute and has several desirable properties, including scale invariance, boundedness, and enabling reconstruction of the original image.
• A subset of NPD features is automatically learned and combined in regression trees to boost their discriminability. In this way, only a single cascade AdaBoost classifier is needed to handle unconstrained faces with occlusions and arbitrary viewpoints, without pose labeling or clustering in the training stage.
The advantages of the proposed approach include:
• The NPD feature evaluation is extremely fast, requiring a single memory access using a look up table.
• Multiscale face detection can be easily achieved by applying pre-scaled detection templates.
• The unconstrained face detector does not depend on pose specific cascade structure design; pose labeling or clustering in the training stage is also not required.
• The face detector is able to handle illumination variations, pose variations, occlusions, out-of-focus blur, and low resolution face images in unconstrained scenarios.
The remainder of this paper is organized as follows. In Section 2 we review the related work. In Section 3 we introduce the NPD feature space. The proposed NPD based face detection method is presented in Section 4. Experimental results are provided in Section 5. Finally, we summarize the contributions in Section 6.
2 RELATED WORK
As indicated in a survey of face detection methods [24], the most popular face detection methods are appearance based, which use local feature representation and classifier learning. Viola and Jones’ face detector [1] was the first one to apply rectangular Haar-like features in a cascaded AdaBoost classifier for real-time face detection. Many approaches have been proposed around the Viola-Jones detector to advance the state of the art in face detection. Lienhart and Maydt [4] proposed an extended set of Haar-like features, where 45° rotated rectangular features were introduced. Li et al. [5] proposed another extension of Haar-like features, where the rectangles can be spatially set apart with a flexible distance. A similar feature, called the diagonal filter, was also proposed by Jones and Viola [6]. Various other local texture features have been introduced for face detection, such as the modified census transform [7], local binary pattern (LBP) [8], MB-LBP [11], LBP histogram [10], and the locally assembled binary feature [12]. These features have been shown to be robust to illumination variations. Mita et al. [9] proposed the joint Haar-like features to capture the co-occurrence of effective Haar-like features. Huang et al. [16] proposed a sparse feature set in a granular space, where granules were represented by rectangles, and each individual sparse feature was learned as a combination of granules. A problem with the approaches in [9] and [16] is that the joint feature space is very large, making the optimal combination a difficult task.
While more sophisticated features may provide better discrimination power than Haar-like features for the face detection task, they generally increase the
computational cost. In contrast, ordinal relationships among image regions are simple yet effective image features [25], [26], [27], [28], [29], [30]. Sinha [25] studied several robust ordinal relationships in face images and developed a face detection method accordingly. Liao et al. [28] further showed that ordinal features can be effectively learned by AdaBoost classifier for face recognition. Sadr et al. [26] showed that pixelwise ordinal features (ordinal relationship between any two pixels) can faithfully encode image structures. Baluja et al. [27] showed that simple pixelwise ordinal features are good enough for discriminating between five facial orientations, a relatively simpler task than face detection. Wang et al. [30] applied the random forest classifier together with pixelwise ordinal features for facial landmark localization. Abramson and Steux [29] proposed a pixel control point based feature for face detection, where each feature is associated with two sets of pixel locations (control points). However, it is not easy to learn control point based features because of the huge number of control point combinations.
Besides different feature representations, some researchers have also tried different AdaBoost algorithms and weak classifiers. For weak classifiers utilized in boosting, Lienhart et al. [31] and Brubaker et al. [32] have shown that classification and regression trees (CART) [33] work better than simple decision stumps. In this paper, we show that the optimal ordinal features and their combinations can be learned by integrating the proposed NPD features in a regression tree. In this way, unconstrained face variations can be automatically partitioned into different leaves of the learned regression tree.
Given that the original Viola-Jones face detector has limitations for multiview face detection [24], various cascade structures have been proposed to tackle multiview face detection [6], [14], [15], [16]. Jones and Viola [6] extended their face detector by training one face detector for each specific pose. To avoid evaluating all face detectors on each scanning subwindow, they developed a pose estimation step (similar to Rowley et al. [34]) before face detection, and then only the face detector trained on that estimated pose was applied. In this two-stage detection structure, if the pose estimation is not reliable, the face is not likely to be detected in the second stage. Wu et al. [14] proposed a parallel cascade structure for multiview face detection, where all face detectors tuned to different views have to be evaluated for each scanning window; they did use the first few cascade layers of all face detectors to estimate the pose for speedup. Li and Zhang [15] proposed a coarse-to-fine pyramid architecture for multiview face detection, where the entire range of face poses was divided into increasingly smaller subranges, resulting in a more efficient detection structure. Huang et al. proposed a WFS tree based multiview face detection approach, which also works in a coarse-to-fine manner. They proposed the Vector Boost algorithm for multiclass learning, which is well suited for multiview pose estimation. However, all these methods need to learn a cascade classifier for each specific view (or view range) of a face, requiring an input face image to go through different branches of the detection structure. Hence, their computational cost generally increases with the number of classifiers in complex cascade structures. Moreover, these approaches require manual labeling of the face pose in each training image.
Instead of designing a detection structure, Lin and Liu [19] proposed to learn the multiview face detector as a single cascade classifier. They derived a multiclass boosting algorithm, called MBHBoost, by sharing features among different classes. This is a simpler approach to multiview face detection than designing complex cascade structures. Nevertheless, it still requires manual labeling of poses. In uncontrolled environments, however, it is not easy to define specific views of a face by discretizing the pose space, because a face could be in arbitrary pose simultaneously in yaw (out-of-plane), roll (in-plane), and pitch (up-and-down) angles. To avoid manual labeling, Seemann et al. [35] suggested learning viewpoint clusters automatically for object detection. However, for human faces, Kim and Cipolla [36] showed that clustering by traditional techniques like K-Means does not result in categorized poses. They hence proposed a multi-classifier boosting (MCBoost) for human perceptual clustering of object images, which showed promise for clustering face poses. However, the clusters are not always related to pose variations; in addition to different pose clusters, they also obtained clusters with various illumination variations.
Face detection in presence of occlusion is also an important issue in unconstrained face detection, but it has received less attention compared to multiview face detection. This is probably because, compared to pose variations, it is more difficult to categorize arbitrary occlusions into predefined classes. Hotta [17] proposed a local kernel based SVM method for face detection, which was better than global kernel based SVM in detecting occluded frontal faces. Lin et al. [18] considered 8 kinds of manually defined facial occlusions by training 8 additional cascade classifiers besides the standard face detector. Lin and Liu [19] further proposed the MBHBoost algorithm to handle faces with one of 12 in-plane rotations or one of 8 types of occlusions, with each kind of rotation and occlusion treated as a different class. Chen et al. [20] proposed a modified Viola-Jones face detector, where the trained detector was divided into sub-classifiers related to several predefined local patches, and the outputs of sub-classifiers were fused. Goldmann et al. [21] proposed a component-based approach for face detection, where the two eyes, nose, and mouth were detected separately, and further connected in a topology graph. However, none of the above methods considered face detection with both occlusions and pose variations simultaneously in unconstrained scenarios. As discussed in [22], a robust face detector should be effective under arbitrary variations in pose and occlusion, which has not yet been solved.
Recently, unconstrained face detection has gained attention. Jain and Learned-Miller [3] developed the FDDB database and benchmark for the development of unconstrained face detection algorithms. This database contains images collected from the Internet, and presents challenging scenarios for face detection. Subburaman and Marcel [37] proposed a fast bounding box estimation technique for face detection, where the bounding box is predicted by small patch based local search. Jain and Learned-Miller [38] proposed an online domain adaption approach to improve the performance of the Viola-Jones face detector on the FDDB database. Li et al. [13] proposed the use of SURF feature [39] in an AdaBoost cascade, and area under the curve (AUC) criterion to speed up the face detector training. Zhu and Ramanan [40] proposed to jointly detect face, estimate pose, and localize face landmarks in the wild. Shen et al. [41] proposed an exemplar-based face detection approach, which retrieves images from a large annotated face dataset; facial landmark locations are inferred from the annotations. Li et al. [42] proposed a probabilistic elastic part (PEP) model to adapt any pre-trained face detector to a specific image collection like FDDB. This method extracts the PEP representation for each candidate face detected by a general face detector, and trains a classifier with the top positive and negative samples. Despite the availability of these methods for unconstrained face detection, the detection accuracy is still not satisfactory, especially when the detector is required to have low false alarms.
3 NORMALIZED PIXEL DIFFERENCE FEATURE SPACE
The Normalized Pixel Difference (NPD) feature between two pixels in an image is defined as
$$ f(x, y) = \frac{x - y}{x + y}, \tag{1} $$
where $x, y \ge 0$ are intensity values of the two pixels¹, and $f(0, 0)$ is defined as 0 when $x = y = 0$.
1. For ease of representation, sometimes we also denote x and y as pixels instead of pixel values. We use subscripts to differentiate between pixel and pixel values only when pixel locations are under discussion.
The NPD feature measures the relative difference between two pixel values. The sign of $f(x, y)$ indicates the ordinal relationship between the two pixels x and y, and the magnitude of $f(x, y)$ measures the relative difference (as a percentage of the joint intensity $x + y$) between x and y. Note that the definition $f(0, 0) = 0$ is reasonable because, in this case, there is no difference between the two pixels x and y. Compared to the absolute difference $|x - y|$, NPD is invariant to scale change of the pixel intensities.
Weber, a pioneer in experimental psychology, stated that the just-noticeable difference in the magnitude change of a stimulus is proportional to the magnitude of the stimulus, rather than its absolute value [23]. This is known as the Weber’s Law. In other words, the human perception of difference in stimulus is often measured as a fraction of the original stimulus, that is, in a form $\Delta I / I$, which is called the Weber Fraction. Chen et al. [43] proposed a local image descriptor, called Weber’s Law Descriptor for face recognition, which was computed from Weber Fractions of pixels in a 3 × 3 window. The proposed feature in Eq. (1) has also been used in other fields such as remote sensing, where the Normalized Difference Vegetation Index (NDVI) [44] is defined as the difference to sum ratio between the visible red and the near infrared spectra to estimate the green vegetation coverage.
The NPD feature has a number of desirable properties. First, the NPD feature is antisymmetric, so either $f(x, y)$ or $f(y, x)$ is adequate for feature representation, resulting in a reduced feature space. Therefore, in an $s \times s$ image patch (vectorized as $p \times 1$, where $p = s \cdot s$), NPD feature $f(x_i, x_j)$ for pixel pairs $1 \le i < j \le p$ is computed, resulting in $d = p(p - 1)/2$ features. For example, in a 20 × 20 face template, there are (20 × 20) × (20 × 20 − 1)/2 = 79,800 NPD features in total. We call the resulting feature space the NPD feature space, denoted as $\Omega_{npd}$ ($\in \mathbb{R}^d$).
Second, the sign of $f(x, y)$ is an indicator of the ordinal relationship between x and y. Ordinal relationship has been shown to be an effective encoding for object detection and recognition [25], [26], [28] because ordinal relationship encodes the intrinsic structure of an object image and it is invariant under various illumination changes [25]. However, simply using the sign to encode the ordinal relationship is likely to be sensitive to noise when x and y have similar values. In the next section we will show how to learn robust ordinal relationships with NPD features.
Third, the NPD feature is scale invariant, which is expected to be robust against illumination changes. This is important for image representation, since illumination change is always a troublesome issue for both object detection and recognition.
Fourth, as shown in Appendix A, the NPD feature $f(x, y)$ is bounded in $[-1, 1]$. The bounded property makes the NPD feature amenable to histogram binning or threshold learning in tree-based classifiers [1]. Fig. 2 shows that $f(x, y)$ is a bounded function and it defines a nonlinear surface.
Theorem 1 (Reconstruction): Given the NPD feature vector $\mathbf{f} = (f(x_1, x_2), f(x_1, x_3), \ldots, f(x_{p-1}, x_p))^T \in \Omega_{npd}$, the original image intensity values $I = (x_1, x_2, \ldots, x_p)^T$ can be reconstructed up to a scale factor.
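As an illustration of Eq. (1) and the properties listed in this section, the following minimal Python sketch computes the NPD feature, precomputes a look up table, and counts the features in a square template. The function names and the assumption of 8-bit intensities (a 256 × 256 table) are illustrative choices, not details of the implementation described in this paper.

```python
def npd(x, y):
    """Eq. (1): (x - y) / (x + y), with f(0, 0) defined as 0."""
    if x < 0 or y < 0:
        raise ValueError("pixel intensities must be non-negative")
    total = x + y
    return 0.0 if total == 0 else (x - y) / total


def build_npd_table(levels=256):
    """Precompute f(x, y) for every pair of integer intensities.

    Assumption: pixels are quantized to `levels` values (8-bit by default).
    """
    return [[npd(x, y) for y in range(levels)] for x in range(levels)]


def num_npd_features(side):
    """Number of pixel pairs i < j in a side x side template: p(p - 1)/2."""
    p = side * side
    return p * (p - 1) // 2


NPD_TABLE = build_npd_table()


def npd_lookup(patch, i, j):
    """Feature f(x_i, x_j) for a vectorized patch of 8-bit intensities."""
    return NPD_TABLE[patch[i]][patch[j]]
```

With such a table, evaluating a feature for a patch reduces to indexing by the two intensities, and `num_npd_features(20)` gives the 79,800 features counted for a 20 × 20 template. The nested lists are used here for readability only; a flat array would be the natural choice if a single memory access per feature is required.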
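Theorem 1 can be illustrated with a short sketch. Solving Eq. (1) for y gives y = x(1 − f)/(1 + f), so the p − 1 features f(x_1, x_j) are enough to recover every intensity relative to x_1 in linear time. This is a simple derivation under the assumption x_1 > 0; it is not necessarily the procedure given in Appendix B, and the function names are illustrative.

```python
def npd(x, y):
    """Eq. (1): (x - y) / (x + y), with f(0, 0) defined as 0."""
    total = x + y
    return 0.0 if total == 0 else (x - y) / total


def reconstruct_up_to_scale(first_row):
    """Recover intensities relative to x_1 from f(x_1, x_j), j = 2..p.

    first_row[k] holds f(x_1, x_{k+2}). The result sets x_1 to 1, so the
    original image equals the returned list times the unknown scale x_1.
    Assumption: x_1 > 0, which rules out f = -1 for any pair.
    """
    ratios = [1.0]
    for f in first_row:
        if f <= -1.0:
            raise ValueError("this sketch requires x_1 > 0")
        # From f = (x_1 - x_j) / (x_1 + x_j): x_j / x_1 = (1 - f) / (1 + f).
        ratios.append((1.0 - f) / (1.0 + f))
    return ratios


image = [40, 10, 0, 120]
features = [npd(image[0], value) for value in image[1:]]
relative = reconstruct_up_to_scale(features)
restored = [image[0] * r for r in relative]  # rescale by the known x_1
```

Multiplying every intensity by a constant leaves `features` unchanged, which is why the scale factor cannot be recovered from the NPD features alone.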
The proof of Theorem 1 is shown in Appendix B, which also gives a linear-time approach to reconstruct the original image up to a scale factor. Theorem 1 states that each point in the feature space $\Omega_{npd}$ corresponds to a group of intensity-scaled images in the original pixel intensity space. In contrast, the scale invariance property says that all intensity-scaled images are “compressed” to a point in the bounded feature space $\Omega_{npd}$. Therefore, $\Omega_{npd}$ is a feature space which is invariant to scale variations, but it carries all the necessary information from the original space.
4 NPD FOR FACE DETECTION
4.1 Learning Object Structures
Ordinal relationship [25] is a well-known simple and basic concept: it compares the brightness of any two image regions, and encodes the result with 1 (brighter) or 0 (darker) accordingly. Sinha [25] showed that ordinal features can represent the intrinsic structure of objects such as a human face, and they are insensitive to illumination changes. Instead of encoding ordinal relationship between two image regions, in this paper, we learn robust ordinal relationships between pairs of pixels via the NPD feature. For a face pattern which is well structured, automatically learned combinations of ordinal features may represent a face better than manual configurations. Therefore, we propose to learn a combination of simple ordinal features by boosted regression trees [33]. By providing a training set of face and nonface images, a weak classifier is learned by a regression tree. At each node, the tree checks the optimal ordinal feature value, and then passes the input data to the next branch accordingly. See Fig. 3. Regression tree is also well suited for face detection with arbitrary pose variations, since similar views can be clustered in the same leaf node of the tree.
Ordinal relationship can always be generated by the default threshold of 0, but it will be sensitive to noise especially when the two pixels to be compared have similar values. In this paper, we learn robust ordinal relationships and their combinations by learning regression trees with NPD features. In this way, regression trees not only learn optimal NPD features at each branch node, but also learn optimal thresholds for splitting. Generally, one of the following two cases is learned for each NPD feature at a branch node:
$$ f(x, y) = \frac{x - y}{x + y} < \theta_1 < 0, \tag{2} $$
$$ f(x, y) = \frac{x - y}{x + y} \ge \theta_2 > 0, \tag{3} $$
where $\theta_1$ and $\theta_2$ are the thresholds. Eq. (2) applies if the object pixel x is notably darker than pixel y, while Eq. (3) covers the case when pixel x is notably brighter than pixel y. The learned thresholds allow the ordinal encodings in the learned regression trees to represent the intrinsic object structure. To learn such regression trees, we use the CART algorithm [33] with the NPD features.
4.2 Face Detector
Given that the proposed NPD features contain redundant information, we also apply the AdaBoost algorithm to select the most discriminative features and construct strong classifiers [1]. We adopt the Gentle AdaBoost algorithm [2] to learn the NPD feature based regression trees.
As in [1], a cascade classifier is further learned for rapid face detection. We only learn one single cascade classifier for unconstrained face detection robust to occlusions and pose variations. This implementation has the advantage that there is no need to label the pose of each face image manually or cluster the poses before training the detector. In the learning process, the algorithm automatically divides the whole face manifold into several sub-manifolds by regression trees.
Below is a summary of how the proposed method handles the unconstrained face detection problem.
• Pose. Pose variations are handled by learning NPD features in boosted regression trees, where different views can be automatically partitioned into different leaves of the regression trees.
• Occlusion. In contrast to Haar-like features that are sensitive to occlusions because of large support [18], NPD features are computed by only
samples. Fig. 7 illustrates such modified images.
For EXP-2, we used the detector trained with data outside FDDB, as described in the previous subsection. For evaluation, this detector was applied on each subset of the FDDB database separately, and an
sults reported on the FDDB website³. The ROC curves of various algorithms are depicted in Fig. 8 for the discrete score metric and in Fig. 9 for the continuous score metric. Note that all the baseline results are for
3. http://vis-www.cs.umass.edu/fddb/results.html
[Figure: ROC plot; x-axis: False Positives (1 to 500), y-axis: True Positive Rate. Legend: 34%, Smooth NPD, EXP-1; 31%, Smooth NPD, EXP-2; 30%, NPD, EXP-2; 26%, NPD, EXP-1; 21%, Zhu-Ramanan [42]; 16%, Illuxtech Inc.; 5%, Olaworks Inc.; 3%, VJGPR [40]; 2%, Mikolajczyk et al. [49][3]; 0%, Segui et al. [51]; n/a, PEP [44]; n/a, Subburaman-Marcel [39]]
Fig. 9. ROC curves for face detection on the FDDB database [3] with the continuous score metric.
Fig. 10. Detected faces in the FDDB database [3] by the proposed NPD method. Green boxes are detections by
the NPD detector, while red ellipses are ground truth annotations.
outperforms all the baseline methods except Olaworks Inc. However, the proposed NPD detector is much better than Olaworks’ detector when FP < 10. In fact, when FP=0 (shown in the legend), the proposed NPD detector detects 45% of the annotated FDDB faces in coarse sense (50% overlap with ground truth), while the detection rates of all baseline detectors are below 30%. Note that with a sub training set that was previously used for SURF Cascade [13], NPD for EXP-2 shows much better performance than SURF Cascade. Further, the Smooth NPD is slightly better than NPD, but with an additional cost of smoothing computation. It is also observed that the results of the NPD detectors trained for EXP-1 and EXP-2 are comparable, though the training data size for EXP-2 is several times larger than that for EXP-1. This result indicates that FDDB contains representative images for unconstrained face detection. However, it is not easy to handle all this data in training a single detector (recall the large variations in face appearance in Fig. 6). Note that the generic NPD features are learned in regression trees to divide and conquer the complex face manifolds.
Similar observations can be found in Fig. 9 for the continuous score metric, except that Zhu-Ramanan is slightly better than the proposed method when FP > 5,
TABLE 1
Comparison of detection rates (%) with both discrete
and continuous metrics for EXP-2 on the FDDB
database [3]*
Fig. 12. Detected faces in the GENKI-SZSL dataset [45] by the proposed NPD method.
Fig. 14. Detected faces in the CMU-MIT dataset [34] by the proposed NPD method.
TABLE 2
Comparison of detector complexity.
                        Haar     LBP      POF      NPD-stump   NPD-tree
#weak classifiers       150      108      276      1,597       176
#features learned       1,763    1,269    3,082    1,597       2,035
#feature evaluations    33.9     30.4     44.3     36.5        34.4
for continuous metric. The improvement is larger at smaller false positives.
Next, we fixed the regression tree based weak learner, but tried three other local features, namely Haar-like features [1], LBP [52], and pixelwise ordinal feature (POF) [30]. Since LBP is a discrete label, we treated it as a categorical variable in the regression tree learning, that is, for branching at each tree node, the algorithm finds the optimal criterion that splits the discrete LBP codes into two groups. Using the same training set as in Section 5.1, we trained the three detectors using Haar, LBP, and POF, respectively. The model complexity of these detectors is summarized in Table 2. It can be observed that the NPD model appears to be more efficient than the POF model, though it requires slightly more feature evaluations than the Haar and LBP models. However, it should be noted that the computation of Haar-like features requires computing integral images, while for LBP, each feature needs to compare 8 pairs of pixels and convert the resulting binary string to the corresponding decimal number. In contrast, using look up tables as aforementioned, computing the NPD feature requires only one memory access.
The four detectors with different local features were tested on the FDDB database, and the corresponding ROC curves are shown in Fig. 16 for both the discrete and continuous score metrics. The NPD detector performs better than the Haar, LBP, and POF detectors with the same regression tree based weak learners. The performance improvements due to NPD features over Haar, LBP, and POF features are about 6%, 10%, and 6%, respectively, for discrete metric, and about 4%, 6%, and 4%, respectively, for continuous metric, at FP=1. NPD is better than POF, because with NPD features the regression tree learns optimal thresholds to form more robust ordinal rules. NPD performs better than Haar and LBP, especially at low false positives, indicating that combining optimal pixel-level features in regression trees provides better discrimination between faces and nonfaces. On the other hand, one can also observe that except at low false positives, NPD performs about the same or just slightly better than Haar-like features and LBP.
We also tried a variation of NPD, defined as $f(x, y) = \frac{x - y}{\sqrt{x^2 + y^2}}$. This is denoted as NPD2. With the same setting as NPD, we trained another detector based on NPD2. The testing results on FDDB are also shown in Fig. 16; the performances of NPD and NPD2 are about the same, with NPD2 being slightly better. However, considering that NPD is simpler than NPD2, we still suggest the formulation of Eq. (1).
5.6 Evaluation Under Specific Detection Challenge
In the following, we evaluate how the proposed NPD face detector performs under illumination variation, pose variation, occlusion, and blur (or low resolution). Note that these four challenges are often encountered simultaneously in an image. In our selection of the four subsets, one per specific challenge, we focused on the main source of variation in each image. For each challenge, we selected 100 images from the FDDB database [3] (examples are shown in Fig. 17), and ran the NPD detector described in Subsection 5.1 on each subset separately. Fig. 18 shows that the NPD face detector performs the best on the illumination subset. This is not surprising since the proposed NPD features are robust against illumination variations. Further, the NPD method performs better for face images with pose variation than with occlusion or blur. These results indicate that occlusion and blur are the two major challenges for unconstrained face detection, which have not been well addressed in the literature.
The NPD face detector is also compared with the Viola-Jones face detector implemented in OpenCV 2.4, and the commercial face detector PittPatt on the four subsets of FDDB discussed above. The resulting ROC curves with the discrete score metric are shown in Fig. 19. These plots show that the proposed NPD
TABLE 3
Comparison of face detection speed (as FPS).

CPU                                      Resolution     NPD      OpenCV   SURF [13]*
Atom N450 @1.6GHz (1 core, 2 threads)    640 × 480      19.4     2.1      5.8
Atom N450 @1.6GHz (1 core, 2 threads)    800 × 600      12.1     1.3      -
Atom N450 @1.6GHz (1 core, 2 threads)    1280 × 720     6.8      0.7      -
Atom N450 @1.6GHz (1 core, 2 threads)    1920 × 1080    3.0      0.3      -
i5-2400 @3.1GHz (4 cores, 4 threads)     640 × 480      177.6    24.4     71.3
i5-2400 @3.1GHz (4 cores, 4 threads)     800 × 600      112.6    16.2     -
i5-2400 @3.1GHz (4 cores, 4 threads)     1280 × 720     63.3     8.9      -
i5-2400 @3.1GHz (4 cores, 4 threads)     1920 × 1080    29.6     3.6      -

* “-” means data is not available for the SURF detector [13]. This is because we do not have access to the code, and [13] only reports detection speed at resolution 640×480 or lower.

real-time (29.6 FPS) on i5 desktop PC for processing 1920 × 1080 high definition videos. For the standard VGA (640 × 480) videos, the NPD detector on the i5 processor can detect faces at an even faster speed (177.6 FPS). On the low-end Atom platform, the NPD detector can still run in near real-time (19.4 FPS) for processing VGA videos. The reasons for the high processing speed of NPD are twofold. First, the NPD feature is simple, involving only two pixels. Further, with the look up table technique, the evaluation of each NPD feature requires only one memory access. Second, the NPD feature can be easily scaled to various sizes of detection templates. Therefore, pre-calculating and storing multiscale templates can speed up detection because rescaling the input image is avoided.

6 SUMMARY AND FUTURE WORK
We have proposed a fast and accurate method for face detection in cluttered scenes. The method is based on the normalized pixel difference (NPD) feature in conjunction with boosted regression trees. An analysis of the NPD feature shows that it possesses the properties of scale invariance, boundedness, and reconstruction ability. We have developed a method for learning the optimal set of NPD features and their combinations. As a result, a single cascade AdaBoost classifier is able to achieve promising results for face detection with large pose variations and occlusions. Evaluations on three public domain databases, namely FDDB, GENKI, and CMU-MIT, show that the proposed method outperforms state-of-the-art methods for unconstrained face detection. The proposed NPD face detector can process 1920 × 1080 video frames in real time, which is about 6 times faster than the Viola-Jones face detector implemented in OpenCV 2.4. The reported results also show that occlusions and blur are two big challenges for face detection. Our future work will use the NPD feature and the classifier learning method for other applications such as face attribute classification (e.g. pose estimation, age estimation, and gender classification) and pedestrian detection.

REFERENCES
[1] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001.
[2] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: a statistical view of boosting,” The Annals of Statistics, vol. 28, no. 2, pp. 337–374, April 2000.
[3] V. Jain and E. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” University of Massachusetts, Amherst, Tech. Rep. UM-CS-2010-009, 2010.
[4] R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object detection,” in Proceedings of the IEEE International Conference on Image Processing, 2002.
[5] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum, “Statistical learning of multi-view face detection,” in Proceedings of the 7th European Conference on Computer Vision, 2002.
[6] M. Jones and P. Viola, “Fast multi-view face detection,” Mitsubishi Electric Research Lab TR-2003-96, 2003.
[7] B. Froba and A. Ernst, “Face detection with the modified census transform,” in Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, 2004.
[8] H. Jin, Q. Liu, H. Lu, and X. Tong, “Face detection using improved LBP under bayesian framework,” in Proceedings of the 3rd International Conference on Image and Graphics, 2004.
[9] T. Mita, T. Kaneko, and O. Hori, “Joint Haar-like features for face detection,” in Proceedings of the 10th IEEE International Conference on Computer Vision, vol. 2, 2005, pp. 1619–1626.
[10] H. Zhang, W. Gao, X. Chen, and D. Zhao, “Object detection using spatial histogram features,” Image and Vision Computing, vol. 24, no. 4, pp. 327–341, 2006.
[11] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, “Face detection based on multi-block LBP representation,” in Proceedings of the IAPR/IEEE International Conference on Biometrics, 2007.
[12] S. Yan, S. Shan, X. Chen, and W. Gao, “Locally assembled binary (LAB) feature with feature-centric cascade for fast and accurate face detection,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[13] J. Li, T. Wang, and Y. Zhang, “Face detection using SURF cascade,” in ICCV BeFIT workshop, 2011.
[14] B. Wu, H. Ai, C. Huang, and S. Lao, “Fast rotation invariant multi-view face detection based on real AdaBoost,” in IEEE Conference on Automatic Face and Gesture Recognition, 2004.
[15] S. Li and Z. Zhang, “Floatboost learning and statistical face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1112–1123, 2004.
[16] C. Huang, H. Ai, Y. Li, and S. Lao, “High-performance rotation invariant multiview face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 4, pp. 671–686, 2007.
[17] K. Hotta, “A robust face detector under partial occlusion,” in International Conference on Image Processing, 2004.
[18] Y. Lin, T. Liu, and C. Fuh, “Fast object detection with occlusions,” in Proceedings of the European Conference on Computer Vision, 2004, pp. 402–413.
[19] Y. Lin and T. Liu, “Robust face detection with multi-class boosting,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[20] J. Chen, S. Shan, S. Yang, X. Chen, and W. Gao, “Modification of the adaboost-based detector for partially occluded faces,” in 18th International Conference on Pattern Recognition, 2006.
[21] L. Goldmann, U. Monich, and T. Sikora, “Components and their topology for robust face detection in the presence of partial occlusions,” IEEE Transactions on Information Forensics and Security, vol. 2, no. 3, pp. 559–569, 2007.
[22] M. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in images: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, 2002.
[23] E. H. Weber, “Tastsinn und gemeingefühl,” in Handwörterbuch der Physiologie, R. Wagner, Ed. Brunswick: Vieweg, 1846, pp. 481–588.
[24] C. Zhang and Z. Zhang, “A survey of recent advances in face detection,” Microsoft Research, Tech. Rep. MSR-TR-2010-66, June 2010.
[25] P. Sinha, “Qualitative representations for recognition,” in Biologically Motivated Computer Vision Workshop, 2002.
[26] J. Sadr, S. Mukherjee, K. Thoresz, and P. Sinha, “Toward the fidelity of local ordinal encoding,” in Proceedings of the Annual Conference on Neural Information Processing Systems, 2001.
[27] S. Baluja, M. Sahami, and H. Rowley, “Efficient face orientation discrimination,” in International Conference on Image Processing, vol. 1, 2004, pp. 589–592.
[28] S. Liao, Z. Lei, X. Zhu, Z. Sun, S. Z. Li, and T. Tan, “Face recognition using ordinal features,” in Proceedings of the 1st IAPR International Conference on Biometrics, Hong Kong, 2006.
[29] Y. Abramson, B. Steux, and H. Ghorayeb, “Yet even faster (YEF) real-time object detection,” International Journal of Intelligent Systems Technologies and Applications, vol. 2, no. 2, pp. 102–112, 2007.
[30] L. Wang, L. Ding, X. Ding, and C. Fang, “2D face fitting-assisted 3D face reconstruction for pose-robust face recognition,” Soft Computing-A Fusion of Foundations, Methodologies and Applications, vol. 15, no. 3, pp. 417–428, 2011.
[31] R. Lienhart, A. Kuranov, and V. Pisarevsky, “Empirical analysis of detection cascades of boosted classifiers for rapid object detection,” MRL, Intel Labs, Tech. Rep., May 2002.
[32] S. Brubaker, J. Wu, J. Sun, M. Mullin, and J. Rehg, “On the design of cascades of boosted ensembles for face detection,” Georgia Institute of Technology, Tech. Rep. GIT-GVU-05-28, 2005.
[33] L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees. Chapman & Hall/CRC, 1984.
[34] H. Rowley, S. Baluja, and T. Kanade, “Rotation invariant neural network-based face detection,” in IEEE International Conference on Computer Vision and Pattern Recognition, 1998.
[35] E. Seemann, B. Leibe, and B. Schiele, “Multi-aspect detection of articulated objects,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.
[36] T. Kim and R. Cipolla, “MCBoost: Multiple classifier boosting for perceptual co-clustering of images and visual features,” Proceedings of Neural Information Processing Systems, 2008.
[37] V. B. Subburaman and S. Marcel, “Fast bounding box estimation based face detection,” in ECCV Workshop on Face Detection: Where we are and what next, 2010.
[38] V. Jain and E. Learned-Miller, “Online domain adaptation of a pre-trained cascade of classifiers,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011.
[39] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[40] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[41] X. Shen, Z. Lin, J. Brandt, and Y. Wu, “Detecting and aligning faces by image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[42] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang, “Probabilistic elastic part model for unsupervised face detector adaptation,” in IEEE International Conference on Computer Vision, 2013.
[43] J. Chen, S. Shan, C. He, G. Zhao, M. Pietikäinen, X. Chen, and W. Gao, “WLD: A robust local image descriptor,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1705–1720, Sept. 2010.
[44] F. Kriegler, W. Malila, R. Nalepka, and W. Richardson, “Preprocessing transformations and their effects on multispectral recognition,” in Proceedings of the Sixth International Symposium on Remote Sensing of Environment, 1969, pp. 97–131.
[45] http://mplab.ucsd.edu, “The MPLab GENKI Database, GENKI-SZSL Subset.”
[46] T. L. Berg, A. C. Berg, J. Edwards, and D. Forsyth, “Who’s in the picture,” Advances in Neural Information Processing Systems, vol. 17, pp. 137–144, 2004.
[47] K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human detection based on a probabilistic assembly of robust part detectors,” in European Conference on Computer Vision (ECCV), 2004.
[48] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Robust face detection by simple means,” in DAGM Computer Vision in Applications Workshop, 2012.
[49] S. Seguí, M. Drozdzal, P. Radeva, and J. Vitrià, “An integrated approach to contextual face detection,” in International Conference on Pattern Recognition Applications and Methods, 2012.
[50] PittPatt Software Developer Kit, Pittsburgh Pattern Recognition, Inc., http://www.pittpatt.com.
[51] L. Bourdev and J. Brandt, “Robust object detection via soft cascade,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 236–243.
[52] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
APPENDIX A
BOUNDEDNESS OF NPD
Lemma 1 (Boundedness): For all $x, y \ge 0$, the NPD feature $f(x, y)$ is well bounded in $[-1, 1]$. In addition, $f(x, y) = 1$ if and only if $x > 0$ and $y = 0$; and $f(x, y) = -1$ if and only if $x = 0$ and $y > 0$.
Proof: From the definition of NPD we know that $x \ge 0$, $y \ge 0$, and $f(0, 0) = 0 \in [-1, 1]$. When either $x$ or $y$ is nonzero, for example, $y \ge 0$ but $x > 0$, Eq. (1) can be reformulated as
$$f(x, y) = \frac{x - y}{x + y} = \frac{2x}{x + y} - 1 = \frac{2}{1 + \frac{y}{x}} - 1 \le 1. \tag{a}$$
The inequality in Eq. (a) holds because $y \ge 0$, and equality holds if and only if $x > 0$ and $y = 0$. Similarly, when $x \ge 0$ but $y > 0$, Eq. (1) can be reformulated as
$$f(x, y) = \frac{x - y}{x + y} = 1 - \frac{2y}{x + y} = 1 - \frac{2}{\frac{x}{y} + 1} \ge -1. \tag{b}$$
The inequality in Eq. (b) holds because $x \ge 0$, and equality holds if and only if $x = 0$ and $y > 0$.

APPENDIX B
PROOF OF THEOREM 1
Denote $f_{ij} = f(x_i, x_j)$. From Eq. (1) we have
$$f_{ij}(x_i + x_j) = x_i - x_j. \tag{c}$$
Equivalently,
$$(f_{ij} - 1)x_i + (f_{ij} + 1)x_j = 0. \tag{d}$$
Therefore, we have the following set of linear equations
$$Fx = 0, \tag{e}$$
where
$$F = \begin{bmatrix}
f_{12} - 1 & f_{12} + 1 & 0 & \cdots & 0 \\
f_{13} - 1 & 0 & f_{13} + 1 & \cdots & 0 \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
f_{1p} - 1 & 0 & 0 & \cdots & f_{1p} + 1 \\
0 & f_{23} - 1 & f_{23} + 1 & \cdots & 0 \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
0 & 0 & 0 & \cdots & f_{p-1,p} + 1
\end{bmatrix} \tag{f}$$
is a sparse $d \times p$ matrix with each row containing at most two nonzero entries. Furthermore, from the formulation of $F$ we know that each row of $F$ contains at least one nonzero entry, because $(f_{ij} - 1) \ne (f_{ij} + 1)$ always holds for all $i$ and $j$. Without loss of generality, let's assume $f_{12} + 1 \ne 0$. Then it follows that $f_{1j} + 1 \ne 0$, $\forall j$. Because if $\exists j$ such that $f_{1j} + 1 = 0$, then from Lemma 1 we know that $x_1 = 0$. This will further lead to $f_{12} + 1 = 0$, which violates the assumption that $f_{12} + 1 \ne 0$. Therefore, the first $p - 1$ rows in the matrix $F$ are linearly independent of each other.
We will further prove that $\mathrm{rank}(F) = p - 1$. In fact, any row of the matrix $F$ can be linearly expressed by the first $p - 1$ rows. To show this, let's denote the row containing $f_{ij} - 1$ and $f_{ij} + 1$ by $r_{ij}$. We will show that
$$r_{ij} = \frac{f_{ij} - 1}{f_{1i} + 1}\, r_{1i} + \frac{f_{ij} + 1}{f_{1j} + 1}\, r_{1j} \tag{g}$$
holds for all $i > 1$ and $j > i$. In fact, it is easy to verify that the above equation holds for all columns of $r_{ij}$, $r_{1i}$, and $r_{1j}$ after the first column. So, we only need to show that, for the first column, we have
$$\frac{(f_{1i} - 1)(f_{ij} - 1)}{f_{1i} + 1} + \frac{(f_{1j} - 1)(f_{ij} + 1)}{f_{1j} + 1} = 0, \tag{h}$$
which is equivalent to
$$f_{1i} f_{1j} f_{ij} - f_{1i} + f_{1j} - f_{ij} = 0. \tag{i}$$
This can be verified by substituting each feature with its definition in Eq. (1).
Given that $\mathrm{rank}(F) = p - 1$, we know that the nullspace of $F$ is one-dimensional, that is, it is spanned by a single nonzero vector, which is a solution to Eq. (e). Furthermore, from Lemma 1 we can infer that $(f_{ij} - 1)(f_{ij} + 1) \le 0$, hence Eq. (d) tells that $x_i x_j \ge 0$, $\forall i, j$. Consequently, Eq. (e) always has a nonnegative solution $\hat{x}$, and all solutions to Eq. (e) must be $c\hat{x}$, where $c$ is a scale factor.
Given this proof, we make four observations below:
• For a solution, $c$ can be any real value, but to satisfy the constraint that all pixel intensity values are nonnegative, $c$ should be positive.
• The solution to Eq. (e) spans a one-dimensional subspace (the nullspace).
• A specific solution can be obtained by assigning $x_1 = 1$ and solving for the other variables from the first $p - 1$ rows of Eq. (e) in linear time.
• When the original image is $x = 0$, it can also be reconstructed by $c\hat{x}$ where $\hat{x}_i = 1$, $\forall i$, and $c = 0$. However, in this case a solution with $c > 0$ is not generally regarded as a scaled version of the original image $x = 0$.
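
The linear-time reconstruction described in Appendix B can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `npd` and `reconstruct` and the four-pixel toy image are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the result can be compared without rounding error. Setting i = 1 in Eq. (d) gives x_j = x_1 (1 − f_1j) / (1 + f_1j), which is what the loop evaluates with x_1 = 1.

```python
from fractions import Fraction


def npd(x, y):
    """NPD feature of Eq. (1), with f(0, 0) defined as 0 (integer inputs assumed)."""
    if x == 0 and y == 0:
        return Fraction(0)
    return Fraction(x - y, x + y)


def reconstruct(first_row_features):
    """Recover x_hat (normalized so that x_1 = 1) from f_12, ..., f_1p.

    Uses Eq. (d) with i = 1: (f_1j - 1) * x_1 + (f_1j + 1) * x_j = 0.
    Assumes f_1j != -1 for all j, i.e. the reference pixel x_1 is nonzero.
    """
    x_hat = [Fraction(1)]
    for f1j in first_row_features:
        if f1j == -1:
            raise ValueError("f_1j = -1 implies x_1 = 0; choose another reference pixel")
        x_hat.append((1 - f1j) / (1 + f1j))
    return x_hat


# Hypothetical toy "image" of p = 4 nonnegative pixel intensities.
pixels = [2, 4, 6, 1]

# Only the p - 1 features involving the first pixel are needed.
features = [npd(pixels[0], xj) for xj in pixels[1:]]
x_hat = reconstruct(features)

# The original image is c * x_hat with scale factor c = x_1.
scale = pixels[0]
recovered = [scale * v for v in x_hat]

# Lemma 1: every NPD value lies in [-1, 1].
bounded = all(-1 <= npd(a, b) <= 1 for a in range(8) for b in range(8))
```

Here `recovered` equals the original `pixels`, and `x_hat` is the same image up to the scale factor c = 2. The sketch visits each of the p − 1 features once, in line with the linear-time observation above.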