A Fast Face Detection Method Via Convolutional Neural Network

b Beijing Key Laboratory of Digital Media, and School of Computer Science and Engineering, Beihang
Abstract
Current face or object detection methods via convolutional neural networks (such as
OverFeat, R-CNN and DenseNet) explicitly extract multi-scale features based on an
image pyramid. However, such a strategy increases the computational burden of face
detection. In this paper, we propose a fast face detection method based on discrim-
inative complete features (DCFs) extracted by an elaborately designed convolutional
neural network, where face detection is directly performed on the complete feature
maps. DCFs exhibit scale invariance, which enables face detection with high speed and
promising performance. Therefore, the proposed method does not need to extract
multi-scale features on an image pyramid, as the conventional methods do, which
greatly improves its efficiency for face detection. Experimental results on several
popular face detection datasets show the efficiency and effectiveness of the proposed method.
Keywords: Fast face detection, Convolutional neural network, Discriminative
complete feature maps
1. Introduction
In Marr’s theory of vision [1], a representational framework for vision was
proposed, which consists of three levels: the primal sketch, the 1.5-D sketch and the 3-D model
representation. How to extract a proper image representation corresponding to these three
levels is a critical problem in computer vision. So far, there have been several milestone rep-
resentations during the development of computer vision. For example, SIFT [2] and
HOG [3] features exhibit the desirable property of local invariance. Canny fea-
tures [4] can capture the 1.5-D sketch, which represents low-level image structure.
The deconvolutional network [5] tries to characterize both mid-level and high-level
∗ Corresponding author.
Email addresses: [email protected] (Guanjun Guo), [email protected]
(Hanzi Wang), [email protected] (Yan Yan), [email protected] (Jin Zheng),
[email protected] (Bo Li)
Figure 1: The visualization of DCFs and general CNN features by using the t-SNE [24] visualization algo-
rithm. (a) A test image. (b) The corresponding DCFs of the test image. (c) The features obtained from a set
of the sliding windows. (d) Resizing the obtained features into the size of the fully-connected layer of CNN.
(e) The visualization of the distribution of DCFs (shown in the brown and cyan dots) and general features
of CNN (shown in the blue and yellow dots). (f) and (g) The general features extracted based on an image
pyramid via CNN.
2. Related Work
As discussed in the literature [29, 30], CNN plays a role similar to the ventral pathway, where the hierar-
chical features of CNN correspond to the neuronal representations from the visual area
1 (V1) to the visual area 4 (V4). CNN works well for many different vision tasks by
deeply learning the features of data. For instance, Farabet et al. [31] show the impres-
sive performance of CNN on scene labeling. Krizhevsky et al. [32] and He et al. [33] use
CNN to classify the ImageNet dataset [8] and achieve accuracies of 83% and 94.3%,
respectively. R-CNN [6] improves mean average precision (mAP) by more than 30%
relative to the previous best result on the VOC 2012 dataset [34], and it achieves a mAP
of 53.3%.
Although CNN has shown good performance for object detection, its computational
efficiency is not high. OverFeat [18] obtains multi-scale dense features by using the
sliding window strategy, and it achieves a mAP of 24.3% in the ILSVRC2013 competi-
tion [8]. However, the detection process of OverFeat is very time-consuming, because
it requires running the classifier and regressor networks across all the possible locations
and scales. To improve the efficiency, R-CNN first generates many class-independent
proposal windows, and then extracts features on the proposal windows with the trained
CNN on a multi-scale image pyramid. Afterwards, a score is assigned to each proposal
window by applying a linear SVM to the features. R-CNN takes about 15 seconds to
run a detector on a 500 × 375 image. Another recently proposed method, called SPP-
net [7], removes the constraint of the fixed-sized input in the conventional CNN by
replacing the pooling layer with a spatial pyramid pooling layer. Instead of extracting
features from image regions, SPP-net directly obtains features from the feature maps
before the fully-connected layer. However, SPP-net cannot deal with flexible CNN
architectures for different vision tasks. To reduce the computational complexity of the
CNN-based object detection methods and design flexible CNN architectures, Giusti et
al. [27] fragment the extended maps generated from each max-pooling layer. When a
sliding-window moves in a test image, the corresponding features are obtained by look-
ing up all the fragments in each layer for classification. This object detection method
is theoretically faster than the other object detection methods in which features are ob-
tained by using a CNN model on each patch in an image. However, considering the
image size and the number of layers, this method is still not computationally efficient.
Based on the above reviews, we can see that most current object detection methods
based on deep learning still consume much time during the object detection process.
Several object detection methods based on CNN have been applied to the task of face de-
tection. For example, Lin et al. [35] propose the MLetNet method, where the number
of the output nodes of the LeNet CNN structure [13] is modified for face detection, to
detect the masked faces of possible terrorists. FaceCraft [36] jointly trains a region pro-
posal network (RPN) [19] and a fast R-CNN model [37] to improve the detection rate
of the fast R-CNN method for face detection. However, as mentioned above, fast R-
CNN is time-consuming since it extracts features on an image pyramid. MTCNN [38]
utilizes a cascaded multi-task framework, which exploits the inherent correlation be-
tween the tasks of detection and alignment, to boost the performance of both face
detection and alignment. Since MTCNN uses multiple CNN networks to detect faces,
it is still not computationally efficient. Several other face detection methods also apply
fast R-CNN or faster R-CNN to the task of face detection, such as [21, 20]. However,
these two methods are not effective at detecting small-sized faces since the object pro-
Figure 2: The proposed DCFs-based method for object detection. Different fragments are created corre-
sponding to different offsets in a max-pooling layer, which preserves all the possible discriminative features.
LCN is the abbreviation of Local Contrast Normalization.
posal generation methods are designed for generic objects instead of faces. Recently,
SAFD [39] uses a scale proposal network to predict the possible scales of the faces in a
test image, which can effectively boost the performance of CNN based face detectors.
Next, we propose a fast face detection method by directly using the sliding window
strategy on discriminative complete feature maps.
3. The Proposed Method

In this section, we describe the details of the proposed fast face detection method
via CNN. In Section 3.1, we give an overview of the proposed method. In Section 3.2,
we present the details of the proposed DCFs. In Section 3.3, a fast face detection method
is described by performing direct classification on DCFs. In Section 3.4, an ensemble
model is used to further reduce the classification error on DCFs.
Figure 3: Illustration of the fragments of the extended feature maps based on different offsets of the max-
pooling kernel. (A), (B), (C), (D) respectively denote different fragments of the extended maps after max-
pooling corresponding to different offsets in (E). (E) denotes the input feature map for a max-pooling layer.
For the kernel size of 2 × 2, A, B, C, D in (E) denote different offsets respectively.
With the LCN layer, local competition among the features in a feature map is enforced. It has been
empirically shown that LCN layers can reduce error rates and make supervised learning considerably faster [41].
Different fragments of the extended feature maps are created corresponding to different offsets
in each max-pooling layer, and each fragment is independent of the other fragments at the same layer. Based on
the fragments in the extended feature maps, we can extract DCFs and directly perform
classification on DCFs. The definition of each layer, which is used to extract DCFs, is
given next.
Given an input image x, the trained filter bank k and bias b, the output of the
convolution layer can be written as:
$$x^{l}_{oj} = f\Big(\sum_{i \in M_j} x^{l-1}_{oi} * k^{l}_{ij} + b^{l}_{j}\Big), \qquad (1)$$
where o denotes the index of the fragments; l denotes the current layer; i and j denote the
indexes of the input and output feature maps, respectively; $M_j$ represents a selection
of the input feature maps, which are the output feature maps from the previous layer; $*$
denotes the convolution operator; f denotes the activation function, where $f(x) =
\max(0, x)$ is the ReLU [25] activation function. The output of ReLU is 0 if the input
is less than 0, and the input itself otherwise. Thus, ReLU encourages the output features to
be sparse.
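To make Eq. (1) concrete, the following minimal NumPy sketch (our illustration, not the authors' released code) computes the output of one convolution layer for a single fragment; it uses "valid" cross-correlation, as is common in CNN implementations, and all function and variable names are hypothetical.

```python
import numpy as np

def conv_layer(x_prev, kernels, biases):
    """Eq. (1): x^l_{oj} = f(sum_i x^{l-1}_{oi} * k^l_{ij} + b^l_j) with f = ReLU.

    x_prev : (C_in, H, W) feature maps of one fragment from layer l-1
    kernels: (C_in, C_out, kh, kw) trained filter bank k^l
    biases : (C_out,) trained biases b^l
    Returns (C_out, H-kh+1, W-kw+1) feature maps of layer l.
    """
    c_in, h, w = x_prev.shape
    _, c_out, kh, kw = kernels.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for j in range(c_out):                         # output map index j
        acc = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(c_in):                      # sum over the selected input maps M_j
            for u in range(h - kh + 1):
                for v in range(w - kw + 1):
                    acc[u, v] += np.sum(x_prev[i, u:u + kh, v:v + kw] * kernels[i, j])
        out[j] = np.maximum(acc + biases[j], 0)    # ReLU keeps the output features sparse
    return out
```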
For the max-pooling layer, we have

$$x^{l}_{oj}(m, n) = \max\Big(x^{l-1}_{oj}\big(p : p + s - 1,\; q : q + s - 1\big)\Big), \qquad (2)$$

where m and n denote the row and column indexes of a feature map in the
current layer, respectively; s denotes the kernel size of the max-pooling layer; $p =
s \times (m - 1) + \kappa + 1$ and $q = s \times (n - 1) + \kappa + 1$, where $\kappa$ ($0 \le \kappa < s$) denotes
the offset; p and q denote the starting positions of the row and column of each tiled
max-pooling kernel, respectively; the colon is used to pick out the selected rows or
columns of the feature maps, and here $o \le s^2$. The max operation has the property of
local invariance [42].
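To illustrate Eq. (2) and how the fragments arise, the sketch below (our own; the names and the 0-based indexing are assumptions) pools one input map once per 2-D offset, so that each offset yields an independent fragment of the extended feature maps; for s = 2 this gives the four fragments A, B, C and D of Fig. 3.

```python
import numpy as np

def maxpool_fragments(x, s=2):
    """Eq. (2): for every offset (kr, kc) with 0 <= kr, kc < s, take the max over
    each tiled s x s window starting at row p = kr + s*m and column q = kc + s*n
    (0-based; the paper uses 1-based indices).

    x: (H, W) input feature map of the max-pooling layer.
    Returns a list of s*s pooled maps, one per offset (the fragments).
    """
    h, w = x.shape
    fragments = []
    for kr in range(s):              # row offset
        for kc in range(s):          # column offset
            hh = (h - kr) // s       # number of complete windows along rows
            ww = (w - kc) // s       # number of complete windows along columns
            pooled = np.empty((hh, ww))
            for m in range(hh):
                for n in range(ww):
                    p, q = kr + s * m, kc + s * n
                    pooled[m, n] = x[p:p + s, q:q + s].max()
            fragments.append(pooled)
    return fragments
```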
Inspired by computational neuroscience, the local contrast normalization (LCN)
layer [40] mimics the cells in V1 to enforce local competition in the feature maps. In
order to extract features from an input image with an arbitrary size, we set the local
contrast normalization layer as:
$$\hat{x}^{l}_{oj}(m, n) = \frac{x^{l}_{oj}(m, n)}{\Big(\kappa + \alpha \sum_{i=\max(0,\, j - \frac{r}{2})}^{\min(N-1,\, j + \frac{r}{2})} x^{l}_{oi}(m, n)^{2}\Big)^{\beta}}, \qquad (3)$$
where r denotes the number of the adjacent feature maps (usually set to a positive
integer in the training process); N denotes the total number of the feature maps at
the current layer; $\kappa$, $\alpha$ and $\beta$ are hyperparameters which are set to appropriate
floating-point values in the training process. Eqs. (1), (2) and (3) can be used to extract
the DCF features, where all the discriminative information for face detection is kept.
Thus, we refer to the features obtained from the feature maps before the fully-connected
layer as the discriminative complete features (DCFs). As a result, direct classification
on the feature maps before the fully-connected layer can improve the efficiency of the
detection procedure.
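A direct transcription of Eq. (3) could look as follows; this is a sketch under the assumption that the normalization runs across the N feature maps of the current layer at every spatial position, with illustrative hyperparameter values (Table 2 lists scale = 0.001 and pow = 0.75 for the normalization layers; the value of kappa here is a placeholder).

```python
import numpy as np

def lcn(x, r=5, kappa=1.0, alpha=1e-3, beta=0.75):
    """Eq. (3): normalize each activation by its r neighbouring feature maps.

    x: (N, H, W) feature maps x^l_o of one fragment at layer l.
    Returns normalized feature maps of the same shape.
    """
    n_maps = x.shape[0]
    out = np.empty_like(x, dtype=float)
    for j in range(n_maps):
        lo = max(0, j - r // 2)                 # max(0, j - r/2)
        hi = min(n_maps - 1, j + r // 2)        # min(N - 1, j + r/2)
        denom = (kappa + alpha * np.sum(x[lo:hi + 1] ** 2, axis=0)) ** beta
        out[j] = x[j] / denom
    return out
```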
3.3. Speedup for Face Detection Based on DCFs
Based on the above-mentioned three layers (i.e., the convolution layer, the max-pooling
layer and the local contrast normalization layer), the input images are mapped into a feature
space, where the features for face detection are linearly separable. Therefore, we can
directly perform classification on DCFs extracted from each patch corresponding to
a sliding window by using the weight vector of the fully-connected layer. Instead
of using the sliding window technique on the input images, direct classification on
DCFs can significantly improve the efficiency for face detection. As a matter of fact,
the weight vector of the fully-connected layer can be reshaped into a kernel matrix
so that the dot product between DCFs and the weight vector is transformed into the
convolution between DCFs and the kernel matrix [43]. In addition, the convolution can
be implemented by using the Sparse Fast Fourier Transform algorithm [44], which also
improves the speed for face detection since the DCFs of an image are sparse.
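The reshaping trick can be sketched as follows (our illustration with hypothetical shapes; the layout assumed for the weight vector must match the one used during training, and the sparse FFT acceleration [44] is omitted here for simplicity):

```python
import numpy as np

def classify_dcfs(dcfs, fc_weights, fc_bias, win_h, win_w):
    """Slide the fully-connected 'face' weight vector over the DCF maps.

    dcfs      : (C, H, W) discriminative complete feature maps.
    fc_weights: (C * win_h * win_w,) weight vector of the fully-connected layer.
    fc_bias   : scalar bias of the fully-connected layer.
    Returns a (H - win_h + 1, W - win_w + 1) response map of face scores.
    """
    c, h, w = dcfs.shape
    kernel = fc_weights.reshape(c, win_h, win_w)   # weight vector reshaped into a kernel matrix
    resp = np.empty((h - win_h + 1, w - win_w + 1))
    for m in range(resp.shape[0]):
        for n in range(resp.shape[1]):
            window = dcfs[:, m:m + win_h, n:n + win_w]
            resp[m, n] = np.sum(window * kernel) + fc_bias  # dot product == convolution response
    return resp
```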
To show the efficiency of the proposed method for face detection, the theoretical
analysis on the computational complexity of the floating point operations (FLOPS)
required by the proposed method and a couple of other methods (including the patch-
based method [6] and the image-based method [27] ) is presented next. The patch-
based method applies a trained CNN model to each overlapping patch in the test im-
age. The image-based method performs the convolution only once for the test image.
The main difference between the image-based method and the proposed DCFs-based
method is that the image-based method applies the sliding window technique on the
original test image, while the proposed DCFs-based method directly uses the sliding
window technique on the feature maps. Since the size of the feature maps is far smaller
than that of the original image, the proposed method is much more efficient. In addi-
tion, the convolution operation in CNN is sped up by using the Sparse Fast Fourier
Transform algorithm in the proposed face detection method. The FLOPS required by
the patch-based method ($FLOPS^{patch}_{l}$), the image-based method ($FLOPS^{Image}_{l}$) and
the proposed DCFs-based method ($FLOPS^{DCF}_{l}$) are given in detail next. Here, we
only analyze the computational complexity of the FLOPS required by the three meth-
ods on the convolutional layer and the fully-connected layer, because these two layers
are the most time-consuming.
For the convolutional layer, the FLOPS required by the patch-based method is writ-
ten as:
$$FLOPS^{patch}_{l} = 2A\,|P_{l-1}|\,|P_l|\,w_l^{2} s_l^{2}, \qquad (4)$$
where A, $w_l$ and $s_l$ denote the number of the pixels of an input image, the number of the
pixels of the feature map in the lth layer, and the kernel size at the lth layer, respec-
tively; $|P_{l-1}|$ and $|P_l|$ denote the numbers of the feature maps in the previous layer
and in the current layer, respectively. The patch-based method is quite time-consuming
since the number of convolution operations on each feature map is equal to the number
of pixels of an input image, which causes a large number of redundant computations.
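For reference, Eq. (4) can be evaluated directly, as in the sketch below; the example values are purely illustrative (they are our assumptions and are not intended to reproduce Table 3).

```python
def flops_patch_conv(A, P_prev, P_cur, w_l, s_l):
    """Eq. (4): FLOPS of the patch-based method at a convolutional layer.

    A      : number of pixels of the input image (one patch per position)
    P_prev : |P_{l-1}|, number of feature maps in the previous layer
    P_cur  : |P_l|, number of feature maps in the current layer
    w_l    : taken here as the side length of the feature map at layer l,
             so that w_l**2 gives its pixel count (our reading of Eq. (4))
    s_l    : kernel side length at layer l, so s_l**2 weights per kernel
    """
    return 2 * A * P_prev * P_cur * (w_l ** 2) * (s_l ** 2)

# Hypothetical setting: an 800x600 image, 1 input map, 16 output maps,
# a 28x28 per-patch feature map and a 5x5 kernel.
print(flops_patch_conv(800 * 600, 1, 16, 28, 5))
```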
In contrast, the FLOPS required by the image-based method is written as:
where $A_l$ denotes the number of the pixels in the feature map; $F_l$ denotes the number
of the fragments in the current layer. As the size of the feature maps decreases, the FLOPS
Figure 4: Face detection on the DCFs feature maps. (a) The DCFs obtained by the proposed method. (b)
The response map corresponding to the DCFs at the last layer. (c) The resized feature maps obtained by the
nearest neighbor interpolation method. (d) The output response map of the resized DCFs. (e) The resized
response map (the same size as (b)) for further processing by using non-maximum suppression.
where $A^{*}_{l}$ denotes the number of non-zero pixels in the feature map.
For the fully-connected layer, the FLOPS required by the patch-based method, the
image-based method and the DCFs-based method are respectively written as Eq. (7),
Eq. (8) and Eq. (9).
$$FLOPS^{patch}_{l} = 2A\,|P_{l-1}|\,|P_l|\,s_l. \qquad (7)$$
The estimated scale of a face candidate is obtained by using the non-maximum suppression
technique, and the response maps of all fragments are merged into one response map
using non-maximum suppression. Finally, by backprojecting the estimated scale and
the position of the centroids of the connected regions in the response map onto the
original image, the locations of the faces can be detected with the estimated scale in
the original image.
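A simplified version of this post-processing step is sketched below (our own code, not the authors'); here non-maximum suppression is approximated by thresholding the merged response map and taking the centroid of each connected region, and the fragments' response maps are assumed to be aligned on the same grid.

```python
import numpy as np
from scipy import ndimage

def detect_faces(fragment_maps, stride, threshold=0.5):
    """Merge the fragments' response maps and backproject detections.

    fragment_maps: list of (H, W) response maps, one per fragment (assumed aligned).
    stride       : overall subsampling factor between the input image and the DCFs.
    Returns a list of (row, col) face centers in original-image coordinates.
    """
    merged = np.maximum.reduce(fragment_maps)   # keep the strongest response per location
    mask = merged > threshold                   # suppress weak responses
    labels, n = ndimage.label(mask)             # connected regions of high response
    centers = ndimage.center_of_mass(merged, labels, range(1, n + 1))
    return [(int(round(r * stride)), int(round(c * stride))) for r, c in centers]
```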
$$v = \left\lceil \frac{\ln(H[w])}{\ln(R[w])} \right\rceil, \qquad (10)$$
4. Experiments
In this section, we present a detailed evaluation of the performance of the proposed method
on face detection. Section 4.1 gives the parameter settings used in training the
CNN model. Section 4.2 visualizes the trained CNN network. Section 4.3 presents the
comparison of efficiency and performance between the proposed method and several
state-of-the-art face detection methods on two public face datasets.
Table 1: The CNN structure for face detection.
Layer Type Input #Channels #Filters Filter Size Activation
Layer1 conv Data 3 16 5×5 ReLU
Layer2 pool Layer1 16 - - -
Layer3 cmrnorm Layer2 16 - - -
Layer4 conv Layer3 16 16 5×5 ReLU
Layer5 cmrnorm Layer4 16 - - -
Layer6 pool Layer5 16 - - -
Layer7 FC Layer6 - - - -
Layer8 softmax Layer7 - - - -
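For reference, a rough PyTorch transcription of the structure in Table 1 might look as follows; this is only a sketch (the paper's implementation is in Matlab and C/C++, cf. Table 4), and details such as the pooling kernel size, the 3 × 32 × 32 input patch size, the number of output classes (face / non-face) and the normalization parameters (taken from Table 2) are our assumptions.

```python
import torch.nn as nn

# Sketch of the Table 1 structure: conv -> pool -> cmrnorm -> conv -> cmrnorm
# -> pool -> fully-connected -> softmax, for an assumed 3x32x32 input patch.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),                      # Layer1: conv, 16 filters of 5x5
    nn.ReLU(inplace=True),                                 #         ReLU activation
    nn.MaxPool2d(kernel_size=2),                           # Layer2: pool (kernel size assumed 2)
    nn.LocalResponseNorm(size=5, alpha=1e-3, beta=0.75),   # Layer3: cmrnorm (Table 2: scale, pow)
    nn.Conv2d(16, 16, kernel_size=5),                      # Layer4: conv, 16 filters of 5x5
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-3, beta=0.75),   # Layer5: cmrnorm
    nn.MaxPool2d(kernel_size=2),                           # Layer6: pool
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 2),                              # Layer7: FC (5x5 maps for 32x32 input)
    nn.Softmax(dim=1),                                     # Layer8: softmax over face / non-face
)
```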
Figure 5: The trained filters of the first convolutional layer in the CNN model used by the proposed method
for face detection.
Table 2: The hyperparameters used for training the CNN model of the proposed method.
LayerType epsW epsB momW momB wc
Layer1 0.0030 0.0040 0.9000 0.9000 0.0000
Layer4 0.0001 0.0002 0.9000 0.9000 0.0000
Layer7 0.0002 0.0003 0.9000 0.9000 0.0100
LayerType scale pow - - -
Layer3 0.0010 0.7500 - - -
Layer5 0.0010 0.7500 - - -
Figure 6: One fragment of DCFs at the sixth layer. The number of the feature maps in one fragment is equal
to the number of channels at the sixth layer in Table 1.
Figure 7: The visualization of the separability of the features extracted by CNN with different architectures.
(a) General features of 32 × 32 images extracted by CNN (the blue and red points denote the face features
and the nonface features, respectively). (b) The resized DCFs together with the general features (also
see Fig. 1 for details). (c) The features of 32 × 32 images extracted by CNN with the activation function of
sigmoid (the red and blue points denote the face features and the nonface features, respectively).
The features extracted by a CNN with the sigmoid activation function are also evaluated. All the CNN models use the same structure as
shown in Table 1. Using the t-SNE [24] visualization algorithm, 5,000 randomly se-
lected resized DCFs and general features are shown in Fig. 7. From the figure, we can
see that the resized DCFs are still robust for linear classification since the DCFs are
sparse. However, the dense features extracted by the traditional CNN are not robust for
linear classification.
Table 3: The FLOPS [×10^10] required by the three methods at three layers with a given parameter setting.

              Layer 1       Layer 4        Layer 7        Total
A             800×600       800×600        800×600        -
|P_{l-1}|     1             16             16             -
|P_l|         16            16             16             -
A_l           800×600       400×300        200×150        -
s_l           25            25             1024           -
F_l           1             4              16             -
FLOPS^Patch   34.56×5       157.2864×5     0.98304×5      964.1472
FLOPS^Image   0.002×5       0.6144×5       0.98304×5      7.9972
FLOPS^DCF     0.002         0.135168       0.00096×5      0.141968
Table 4: The CPU and GPU time (in seconds) used by the seven CNN based face detection methods and the
proposed face detection method on a 800 × 600 test image. Note that YOLO is implemented in C/C++, and
the other competing methods are implemented in Matlab and C/C++.
Method         CPU time [s]   GPU time [s]   Total time [s]
Patch-based    2.3            25.1           28.1
Image-based    0              2.7            2.7
SPP-net        2.3            0.3            2.6
Fast R-CNN     2.7            10.1           12.8
Faster R-CNN   0.3            0.4            0.7
DeepIR         0.3            0.4            0.7
YOLO           0              0.04           0.04
DCF-based      0.1            0.09           0.19
more computation. Moreover, the object proposal generation methods employed in the
three competing methods consume more CPU time (about 2 seconds) than the proposed
method (about 0.1 seconds). In terms of GPU time, the running time of the proposed
method is 0.09 seconds, which is faster than that of the other six competing methods (about 3.3
to 278.9 times faster) except for YOLO. The YOLO method, which is implemented in
C/C++, runs faster than the proposed face detection method, which is implemented in
Matlab and C/C++. However, the proposed face detection method obtains a higher true
positive rate than YOLO (see Fig. 9).
Next, we compare the performance obtained by different methods for face detec-
tion on the FDDB and AFW datasets. On the FDDB dataset, we use the criterion given
in [53, 55], which measures the degree of match between a detected and a manually
annotated face region. The cutoff ratio of the intersection to the union of
the two regions is set to 0.5 in our case. We compare the proposed method with the
following detection methods: (1) ACF-multiscale [56], (2) HeadHunter [57], (3) CBSR
[58], (4) Boosted Exemplar [59], (5) SURF frontal/multiview [60], (6) SURF Gentle-
Boost [60], (7) Face++ [61], (8) PEP-Adapt [62], (9) Viola-Jones [63] and (10) the
image-based method [27]. We select these methods since most of these methods are
the state-of-the-art face detection methods based on hand-crafted features, and their
results have been reported in the FDDB website [53]. In addition, we apply the image-
based method to face detection as a comparison. The methods in [64, 39], which also
use CNN for the task of face detection, are not compared since their source codes are
not publicly available. The comparison of the discrete ROC curves obtained by the
competing methods is shown in Fig. 8(a), from which we can see that the proposed
method achieves better performance than most of the competing methods except for
HeadHunter. Since the proposed method uses CNN to learn effective features of faces,
it obtains better performance than most of the other competing methods. Although
the image-based method also learns features for faces by using CNN, it obtains worse
performance than the other competing methods except for Viola-Jones. The main reason
is that the features obtained by the image-based method are less discriminative and not
robust to scale variations. Moreover, the image-based method does not use an extra
CNN model to eliminate false detections.
On the AFW dataset, we report the Precision-Recall curves for evaluation. Using
the same settings as in [58], we compare the proposed method with the following eleven
methods: (1) DPM [54], (2) CBSR [58], (3) TSM [54], (4) Kalal [65], (5) Face++ [61],
(6) Google Picasa, (7) Face.com, (8) multiHOG [54], (9) Viola-Jones [63], (10) Shen-
Retrieval [66], and (11) Image-based method [27]. We select these methods since
they are state-of-the-art face detection methods, which have been tested on the AFW
dataset. The comparison of the Precision-Recall curves is shown in Fig. 8(b). The
average precision obtained by the proposed method on the AFW dataset is 91.6%,
which is worse than CBSR, Google Picasa and Face.com due to the lack of a large
number of training faces in the wild. However, the proposed method shows excellent
classification performance when the recall ratio is high. The recall ratio obtained by the
proposed method is higher than that obtained by DPM, Kalal, multiHOG, Viola-Jones
and the image-based method when the precision ratio is greater than 0.7.
Several object detection methods based on CNN (such as R-CNN, faster R-CNN,
Figure 8: The comparison of the discrete ROC and PR curves for face detection obtained by the competing
methods on the FDDB and AFW datasets. (a) The results of ROC curves obtained by the competing methods
on the FDDB dataset. (b) The results of PR curves obtained by the competing methods on the AFW dataset.
Figure 9: The comparison of the discrete ROC curves obtained by the six face detection methods based on
CNN on the FDDB dataset.
YOLO) can be used for face detection, and they can achieve competitive performance
on the FDDB dataset. To show the effectiveness of the proposed method, we im-
plement and evaluate several state-of-the-art object detection methods on the FDDB
dataset. We compare the proposed method against the following five face detection
methods based on CNN: (1) R-CNN [20], (2) fast R-CNN [21], (3) faster R-CNN [20],
(4) DeepIR [22] and (5) YOLO [23]. The five competing methods perform face de-
tection by using CNN classifiers on the face proposals generated by different object
proposal generation methods. The YOLO method applies a single neural network to
a whole image. The network divides the image into regions and predicts bounding
boxes and probabilities of each region being an object. Then, these bounding boxes are
weighted by the predicted probabilities. Fig. 9 shows the ROC curves obtained by the
six competing methods on the FDDB dataset. As can be seen, the proposed method ob-
tains the best performance. The reason is that the proposed method can obtain effective
features when a sliding window strategy is applied on DCFs. The R-CNN, fast R-CNN
and faster R-CNN methods obtain worse performance than the other competing meth-
ods since the performance of the three methods for face detection is greatly affected
by the quality of object proposals generated by the object proposal generation meth-
ods. More specifically, both R-CNN and fast R-CNN use the Selective Search object
proposal generation method (SS) to generate object proposals, and faster R-CNN uses
the Region Proposal Network (RPN) to generate object proposals. However, both SS
and RPN are not effective at generating object proposals for faces, especially for small-
sized faces. DeepIR improves RPN by re-training the RPN model on the face dataset,
and it increases the discriminative ability of the classifier by using the hard negative
mining strategy. Thus, DeepIR obtains better performance than R-CNN, fast R-CNN
Figure 10: The comparison results of face detection. Red bounding boxes (left) denote the result obtained
by the proposed method. Blue bounding boxes (right) denote the result obtained by Face++.
and faster R-CNN. However, DeepIR obtains worse performance than the proposed
method since the latter directly uses a sliding window strategy on the DCFs, which can
achieve higher face recall than the object proposal generation method used in DeepIR.
Some detection results are given in Fig. 10. We can see that the proposed method
can find more correct faces with occlusions and different poses, where false detections
can be eliminated by using an extra CNN model during the classification process.
5. Conclusion
In this paper, we propose a fast face detection method based on DCFs extracted by
an elaborately designed CNN. Compared with the state-of-the-art face detection meth-
ods using CNN, the proposed method performs direct classification on DCFs, which
can significantly improve the efficiency during the face detection process. Moreover,
the proposed method can effectively detect small-sized faces by using the sliding win-
dow strategy on DCFs. In contrast, current state-of-the-art face detection methods
based on CNN have difficulty detecting small-sized faces since the performance of these
methods is greatly affected by the quality of object proposals generated by the ob-
ject proposal generation methods. Experimental results have shown that the proposed
DCFs-based face detection method can achieve promising performance on several pop-
ular face detection datasets.
Acknowledgments
This work was supported by the National Natural Science Foundation of China un-
der Grants U1605252, 61472334, 61571379 and 61370124, and by the Natural Science
Foundation of Fujian Province of China under Grant 2017J01127.
References
[4] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal.
Mach. Intell. 8 (6) (1986) 679–698.
[5] M. Zeiler, G. Taylor, R. Fergus, Adaptive deconvolutional networks for mid and
high level feature learning, in: Proc. IEEE Int. Conf. Comput. Vis., 2011, pp.
2018–2025.
[6] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate
object detection and semantic segmentation, in: Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., 2014, pp. 580–587.
[7] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional
networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9)
(2015) 1904–1916.
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, L. Fei-Fei, Imagenet large
scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[9] Q.-Q. Tao, S. Zhan, X.-H. Li, T. Kurihara, Robust face detection using local cnn
and svm based on kernel combination, Neurocomputing 211 (2016) 98 – 105.
[10] F. H. C. Tivive, A. Bouzerdoum, A hierarchical learning network for face detec-
tion with in-plane rotation, Neurocomputing 71 (16) (2008) 3253 – 3263.
[11] K. Lu, X. An, J. Li, H. He, Efficient deep network for vision-based object detec-
tion in robotic applications, Neurocomputing 245 (2017) 31 – 45.
[12] G. Guo, H. Wang, C. Shen, Y. Yan, H.-Y. Liao, Automatic image cropping for vi-
sual aesthetic enhancement using deep neural networks and cascaded regression,
IEEE Transactions on Multimedia 99 (2018) 1–13.
[13] Y. LeCun, L. Bottou, G. Orr, K.-R. Müller, Efficient backprop, in: Proc. Adv.
Neural Inf. Process. Syst., 1998, pp. 9–50.
[14] J. Uijlings, K. van de Sande, T. Gevers, A. Smeulders, Selective search for object
recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.
[15] C. L. Zitnick, P. Dollár, Edge Boxes: Locating Object Proposals from Edges, in:
Proc. Eur. Comput. Vis. Conf., 2014, pp. 391–405.
[16] G. Guo, H. Wang, W. L. Zhao, Y. Yan, X. Li, Object discovery via cohesion
measurement, IEEE Trans. on Cybernetics 1 (99) (2017) 1–14.
[17] F. N. Iandola, M. W. Moskewicz, S. Karayev, R. B. Girshick, T. Darrell,
K. Keutzer, Densenet: Implementing efficient convnet descriptor pyramids,
CoRR.
[18] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat:
Integrated recognition, localization and detection using convolutional networks,
International Conference on Learning Representations.
[19] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object de-
tection with region proposal networks, in: Proc. Adv. Neural Inf. Process. Syst.,
2015, pp. 91–99.
[20] H. Jiang, E. G. Learned-Miller, Face detection with the faster R-CNN, CoRR
abs/1606.03473.
[21] D. Triantafyllidou, A. Tefas, A fast deep convolutional neural network for face
detection in big visual data, in: Advances in Big Data: Proc. of the INNS Conf.
on Big Data, 2017, pp. 61–70.
[22] X. Sun, P. Wu, S. C. H. Hoi, Face detection using deep learning: An improved
faster RCNN approach, CoRR abs/1701.08289.
[23] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified,
real-time object detection, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
2016, pp. 779–788.
[24] L. van der Maaten, G. Hinton, Visualizing high-dimensional data using t-sne,
Journal of Machine Learning Research 9 (2008) 2579–2605.
[25] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in:
International Conference on Machine Learning, 2010, pp. 807–814.
[26] R. C. Gonzalez, R. E. Woods, Digital Image Processing (3rd Edition), Prentice-
Hall, Inc., 2006.
[27] A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, J. Schmidhuber, Fast
image scanning with deep max-pooling convolutional neural networks, in: IEEE
Int. Conf. on Image Processing, 2013, pp. 4034–4038.
[28] J. J. DiCarlo, D. Zoccolan, N. C. Rust, How does the brain solve visual object
recognition?, Neuron 73 (2012) 415–34.
[29] U. Güçlü, M. A. J. van Gerven, Deep Neural Networks Reveal a Gradient in the
Complexity of Neural Representations across the Ventral Stream, The Journal of
Neuroscience 35 (27) (2015) 10005–10014.
[30] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action
recognition in videos, in: Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 568–
576.
[31] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for
scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1915–1929.
[32] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep con-
volutional neural networks, in: Proc. Adv. Neural Inf. Process. Syst., 2012, pp.
1097–1105.
[33] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition,
in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[34] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman, The
pascal visual object classes (voc) challenge, Int. J. Comput. Vis. 88 (2) (2010)
303–338.
[35] S. Lin, L. Cai, X. Lin, R. Ji, Masked face detection via a modified lenet, Neuro-
computing 218 (2016) 197 – 202.
[36] H. Qin, J. Yan, X. Li, X. Hu, Joint training of cascaded cnn for face detection, in:
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[37] R. Girshick, Fast r-cnn, in: Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–
1448.
[38] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using mul-
titask cascaded convolutional networks, IEEE Signal Processing Letters 23 (10)
(2016) 1499–1503.
[39] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, X. Hu, Scale-aware face detection, in: Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[40] N. Pinto, D. D. Cox, J. J. Dicarlo, Why is real-world visual object recognition
hard, PLoS Computational Biology 4 (1) (2008) 415–434.
[41] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-stage
architecture for object recognition?, in: Proc. IEEE Int. Conf. Comput. Vis., 2009,
pp. 2146–2153.
[42] D. Scherer, A. Muller, S. Behnke, Evaluation of pooling operations in convolu-
tional architectures for object recognition, in: Proc. of International Conference
on Artificial Neural Networks: Part III, 2010, pp. 92–101.
[43] E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for semantic seg-
mentation, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 640–651.
[44] B. Ghazi, H. Hassanieh, P. Indyk, D. Katabi, E. Price, L. Shi, Sample-optimal
average-case sparse fourier transform in two dimensions, Annual Allerton Con-
ference on Communication, Control, and Computing (2013) 1258–1265.
[45] S. V. Kozyrev, Classification by ensembles of neural networks, P-Adic Numbers,
Ultrametric Analysis, and Applications 4 (1) (2012) 27–33.
[46] L. G. Valiant, A theory of the learnable, Communications of the ACM 27 (11)
(1984) 1134–1142.
[47] R. Herbrich, T. Graepel, A pac-bayesian margin bound for linear classifiers: Why
svms work, in: Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 224–230.
[48] R. B. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutional net-
works for accurate object detection and segmentation, IEEE Trans. Pattern Anal.
Mach. Intell. 38 (1) (2016) 142–158.
[49] H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual
tracking, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4293–
4302.
[50] Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point
detection, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3476–
3483.
[51] W. Gao, B. Cao, S. Shan, X. Chen, D. Zhou, X. Zhang, D. Zhao, The cas-peal
large-scale chinese face database and baseline evaluations, IEEE Trans. Sys. Man
Cyber. Part A 38 (1) (2008) 149–161.
[52] A. Vedaldi, K. Lenc, Matconvnet – convolutional neural networks for matlab, in:
Proceeding of the ACM Int. Conf. on Multimedia, 2015, pp. 689–692.
[53] V. Jain, E. Learned-Miller, Fddb: A benchmark for face detection in uncon-
strained settings, Tech. Rep. UM-CS-2010-009, University of Massachusetts,
Amherst (2010).
[54] D. Ramanan, Face detection, pose estimation, and landmark localization in the
wild, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2879–2886.
[55] S. Zhan, Q.-Q. Tao, X.-H. Li, Face detection using representation learning, Neu-
rocomputing 187 (2016) 19 – 26.
[56] B. Yang, J. Yan, Z. Lei, S. Z. Li, Aggregate channel features for multi-view face
detection, in: Proceedings of International Joint Conference on Biometrics, 2014,
pp. 1–8.
[57] M. Mathias, R. Benenson, M. Pedersoli, L. Van Gool, Face detection without bells and
whistles, in: Proc. Eur. Conf. Comput. Vis., 2014, pp. 720–735.
[58] J. Yan, Z. Lei, L. Wen, S. Z. Li, The fastest deformable part model for object
detection, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2497–
2504.
[59] H. Li, Z. Lin, J. Brandt, X. Shen, G. Hua, Efficient boosted exemplar-based face
detection, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1843–
1850.
[60] J. Li, Y. Zhang, Learning surf cascade for fast and accurate object detection, in:
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3468–3475.
[61] M. Inc., Face++ research toolkit, www.faceplusplus.com (2013).
[62] H. Li, G. Hua, Z. Lin, J. Brandt, J. Yang, Probabilistic elastic part model for
unsupervised face detector adaptation, in: Proc. IEEE Int. Conf. Comput. Vis.,
2013, pp. 793–800.
[63] P. Viola, M. Jones, Robust real-time object detection, Int. J. Comput. Vis. 57 (2)
(2001) 137–154.
[64] H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade
for face detection, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp.
5325–5334.
[65] Z. Kalal, K. Mikolajczyk, J. Matas, Face-tld: Tracking-learning-detection applied
to faces, in: IEEE International Conference on Image Processing, 2010, pp. 3789–
3792.
[66] X. Shen, Z. Lin, J. Brandt, Y. Wu, Detecting and aligning faces by image retrieval,
in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3460–3467.