Signal Processing: Qiang Zhang, Li Zhuo, Jiafeng Li, Jing Zhang, Hui Zhang, Xiaoguang Li
Signal Processing
journal homepage: www.elsevier.com/locate/sigpro
Article info

Article history: Received 30 July 2017; Revised 6 December 2017; Accepted 16 January 2018

Keywords: Multiple-Layer Feature Representations; Lightweight convolutional neural network; Vehicle color recognition; Spatial Pyramid Matching

Abstract

In this paper, a vehicle color recognition method using lightweight convolutional neural network (CNN) is proposed. Firstly, a lightweight CNN network architecture is specifically designed for the recognition task, which contains five layers, i.e. three convolutional layers, a global pooling layer and a fully connected layer. Different from the existing CNN based methods that only use the features output from the final layer for recognition, in this paper, the feature maps of intermediate convolutional layers are all applied for recognition based on the fact that these convolutional features can provide hierarchical representations of the images. Spatial Pyramid Matching (SPM) strategy is adopted to divide the feature map, and each SPM sub-region is encoded to generate a feature representation vector. These feature representation vectors of convolutional layers and the output feature vector of the global pooling layer are normalized and cascaded as a whole feature vector, which is finally utilized to train Support Vector Machine classifier to obtain the recognition model. The experimental results show that, compared with the state-of-art methods, the proposed method can obtain more than 0.7% higher recognition accuracy, up to 95.41%, while the dimensionality of the feature vector is only 18% and the memory footprint is only 0.5%.
© 2018 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.sigpro.2018.01.021
0165-1684/© 2018 Elsevier B.V. All rights reserved.
Q. Zhang et al. / Signal Processing 147 (2018) 146–153 147
histogram feature vector can obtain the highest recognition accuracy.

Apart from the above vehicle color recognition methods based on global feature extraction, Hu et al. [6] provided an alternative way of vehicle color recognition: a color reflection model is constructed to remove the non-vehicle color areas in the image, such as background, vehicle wheels and windows, and features are then extracted directly from the major vehicle color regions to train the SVM classifier.

The methods mentioned above basically adopt handcrafted features. Therefore, they tend to achieve higher execution speed but weaker generalization capability and lower recognition accuracy. In addition, the design of handcrafted features requires professional knowledge. An appropriate feature usually takes massive experience and time to validate its effectiveness for a certain task. Therefore, it is difficult to manually design proper features for new data and tasks.

The second stage is based on deep learning. The research of this stage mainly focuses on two aspects. The first is to make use of a deep neural network to obtain the feature representation of the images and then apply it to vehicle color recognition in combination with a traditional classifier. Hu et al. [7] proposed a vehicle color recognition method in which the deep features of AlexNet [8] and kernel SVM are used, combined with an SPM learning strategy. The other is to use an end-to-end deep neural network structure, which merges the feature extraction and the classifier into a unified framework through joint optimization. Rachmadi et al. [9] designed a parallel CNN network to achieve end-to-end vehicle color recognition, which learns the recognition model from big data by using two convolutional networks and integrates the two parts by a fully connected layer. The research results indicate that for vehicle color recognition, compared to the traditional recognition methods of handcrafted features + classifier, the feature representations learnt by deep learning have strong generalization capability and the recognition performance can be improved noticeably.

The representative deep network structures used in vehicle color recognition mainly include AlexNet [8], GoogleNet [13], VGGNet [14], and so on, which were originally designed for complex classification tasks characterized by a massive amount of data. Therefore, these network structures usually possess a large number of parameters and are prone to over-fitting. In addition, they require large computational and storage resources. Recently, however, some researchers have found that for some specific tasks or applications, a lightweight network structure can also achieve state-of-the-art results [10].

Deep neural networks demonstrate strong learning ability and highly efficient feature extraction capability, and can extract information from low-level raw data up to high-level abstract semantic concepts. This hierarchical feature representation gives them prominent advantages when extracting global features and contextual semantic information of the images. However, the existing methods commonly make use of the output features of the last layer for recognition while neglecting the feature information of the previous layers. Actually, the features of lower layers contain considerable information about the images, which may improve the recognition performance. However, if all the features from the intermediate layers are employed, the extremely high dimension of the feature vectors makes training the classification model too difficult or even causes it to fail. To address this problem, a trade-off solution is proposed in this paper. Firstly, a lightweight CNN network architecture is designed for the vehicle color recognition task, which contains five layers, i.e. three convolutional layers, a global pooling layer and a fully connected layer. The feature maps of the convolutional layers are used for feature representation of the image. Specifically, the SPM strategy is adopted to divide the feature maps of the convolutional layers into four sub-regions. Then, each sub-region is encoded to obtain a feature representation vector of the layer. Next, these feature representation vectors and the output feature vector of the final global pooling layer are normalized and cascaded as a whole feature vector to represent the content of the image. Finally, the linear SVM is used as the classifier for vehicle color recognition. The experimental results demonstrate that the features from the intermediate layers can help to improve the recognition accuracy. Compared with the state-of-the-art methods, the proposed method achieves a recognition accuracy that is more than 0.7% higher, up to 95.41%, with a lower-dimensional feature vector and a smaller CNN model.

The rest of this paper is organized as follows. Section 2 describes the main ideas and details of the proposed method. Experimental results and analysis are presented in Section 3. Finally, the conclusions are drawn in Section 4.

Table 1
Comparison results of the proposed network structure and AlexNet structure.

Layer     AlexNet          Proposed network
Input     227 × 227 × 3    227 × 227 × 3
Layer 1   conv, 96         conv, 48
Layer 2   conv, 256        conv, 128
Layer 3   conv, 384        conv, 192
Layer 4   conv, 384        –
Layer 5   conv, 256        –
Layer 6   fc, 4096         GAP, 192
Layer 7   fc, 4096         –
Output    fc, 8            fc, 8
Memory    227.6M           1.1M

2. The proposed method

Convolutional neural networks have become the dominant machine learning approach for visual recognition tasks. To fulfill complex tasks, which usually need to identify hundreds or even thousands of categories, a CNN model usually has a large number of parameters. When used for small and medium size datasets, over-fitting often occurs. In vehicle color recognition, the color categories and the size of the datasets are both limited. For example, in [6], the vehicle colors comprise red, yellow, blue and green, four categories in total. In [5], the vehicle colors comprise white, black, gray, red, blue, green and yellow, seven categories in total. In this paper, the public vehicle color dataset in [1,7,9] is adopted, which includes eight color categories to recognize. In order to achieve a good tradeoff between performance and computational complexity, a lightweight CNN network architecture is designed to extract the features for vehicle color recognition, which includes three convolutional layers, a global pooling layer and a fully connected layer, five layers in total.

The framework of the proposed method is depicted in Fig. 1. Considering that the feature maps of the convolutional layers contain rich information about the vehicle images, all feature representations of the convolutional layers are employed and combined with the output feature vector of the global pooling layer to form a whole vector to represent the content of the images. The linear SVM classifier is used to train the classification model. In the following, the details of the proposed method are described.

2.1. Lightweight convolutional neural network structure design

The lightweight convolutional neural network architecture designed in this paper is shown with the black dotted line in Fig. 1. The number of neurons at the three convolutional layers is 48, 128,
and 192 respectively. Table 1 compares the proposed network with the classical AlexNet [8] network structure.

It can be seen from Table 1 that the lightweight design of the proposed CNN network structure is reflected in three aspects. Firstly, the structure reduces the number of convolutional layers and corresponding activation layers, which requires fewer non-linear operations. This decreases the computational complexity and improves the learning speed of the network to some degree. Meanwhile, it reduces the number of convolutional kernels in the convolutional layers, which reduces the number of parameters to be optimized in the learning process, thus increasing the learning speed of the network and reducing the risk of over-fitting. Finally, the lightweight CNN network adopts a global average pooling layer as the link between the third convolutional layer and the final fully connected layer, which avoids concentrating the majority of the network parameters in the fully connected layer. In these ways, the network structure is greatly simplified. The memory footprint of the lightweight network proposed in this paper is only 1.1 M, far smaller than the 227.6 M of AlexNet. The design of each component of the lightweight CNN network structure is detailed as follows.

2.1.1. Convolutional layer

As an indispensable part of a CNN, the convolutional layer serves to enhance the original features of signals and reduce noise. Multiple image features can be extracted by diverse convolutional kernels. In the proposed CNN network structure, the convolutional layer performs a discrete convolution operation on the input signal. For an input image I, if K is the convolutional kernel at the current layer, the output image H obtained after the discrete convolution operation can be expressed as:

$$H[x, y] = \sum_{m=0}^{w-1} \sum_{n=0}^{h-1} I[m, n]\, K[x - m,\, y - n] \tag{1}$$

where x and y are the coordinates of input image, m and n are the coordinates of convolutional kernel respectively. The kernel parameters K are obtained through iterative updates during the training process.

In addition, since the hierarchical non-linear operation of a CNN is implemented by the activation function, the selection of the activation function has a large influence on the final recognition performance. Common activation functions in deep learning include Sigmoid, tanh and ReLU (Rectified Linear Units). In this paper, the ReLU function is adopted as the activation function.

2.1.2. Normalized layer

The network model parameters are updated constantly during the training process, which often leads to a shift in the input data distribution of each subsequent layer. Meanwhile, the learning process requires the operation on each layer to comply with the input data distribution. Therefore, to overcome the influence of the input data distribution on recognition performance, the data are normalized prior to each convolutional layer in this paper. Firstly, the average of images is subtracted from the input image; then, before the convolution operation on the second and third layers, the Local Response Normalization (LRN) operation is performed on the input data. The normalization operation makes the data distribution more reasonable and easier to distinguish. Moreover, the network convergence rate and recognition accuracy can be improved during the training process. Denote by $a^{i}_{x,y}$ the activity of a neuron computed by applying kernel i at position (x, y); the normalization result $b^{i}_{x,y}$ is given by Eq. (2):

$$b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta} \tag{2}$$

where the sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer. The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set. In this paper, they are set as k = 2, n = 5, α = 10^{-4}, and β = 0.75 respectively.

2.1.3. Pooling layer

The common pooling operations include average pooling, maximum pooling and random pooling, etc. In the proposed CNN network structure, maximum pooling operation is carried out on the
output feature maps of the first and third convolutional layers. The pooling operation can reduce the dimensionality of the feature map, accelerate the convergence rate, lower the computational complexity and provide a certain degree of rotation invariance.

Table 2
The size and dimension of the features at different layers.

Layer       Kernel   Stride   Feature map size   Dimension
Conv1       7 × 7    4        56 × 56 × 48       150,528
Pooling 1   3 × 3    2        28 × 28 × 48       37,632
Conv2       3 × 3    2        14 × 14 × 128      25,088
Conv3       3 × 3    1        14 × 14 × 192      37,632
Pooling 2   3 × 3    3        6 × 6 × 192        6912
GAP         6 × 6    1        1 × 1 × 192        192

2.1.4. Fully connected layer and loss function

The convolutional layer, pooling layer and activation function serve to map the raw data onto the feature space in the hidden
layer while the fully connected layer maps the “distributed feature
representation” learned onto the marked sample space. In other
words, the label information of the categories from the target loss
function will be transmitted to the previous convolutional layers
through fully connected layer. Therefore, a fully connected layer is
usually set up at the last layer of CNN to realize mapping between
label information and the feature space.
On the other hand, traditional CNN networks generally contain multiple fully connected layers with a large number of parameters, which easily leads to over-fitting (for instance, AlexNet contains three fully connected layers). To reduce the number of parameters of the fully connected layer, a global average pooling (GAP) layer [11] is placed between the third convolutional layer and the final fully connected layer in this paper. Global average pooling is carried out on the feature map of the third convolutional layer so that each feature map outputs only one average value. This is equivalent to a dimensionality reduction operation, which speeds up the parameter learning of the fully connected layer, sharply decreases the number of network parameters and reduces the risk of over-fitting.
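The effect of global average pooling on the size of the final fully connected layer can be illustrated with a small NumPy sketch. It assumes the 6 × 6 × 192 input size listed for the GAP layer in Table 2 and the 8 output classes stated in the text; the random input, the helper name `global_average_pool` and the inclusion of bias terms in the parameter counts are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def global_average_pool(feature_maps: np.ndarray) -> np.ndarray:
    """Reduce an (H, W, C) tensor to a length-C vector of per-channel means."""
    return feature_maps.mean(axis=(0, 1))


# Hypothetical activations with the GAP input size listed in Table 2.
rng = np.random.default_rng(0)
pooled_maps = rng.standard_normal((6, 6, 192))

gap_vector = global_average_pool(pooled_maps)  # shape (192,)

# Parameters (weights + biases) of an 8-way fully connected layer
# fed by the GAP output versus by the flattened 6 x 6 x 192 tensor.
num_classes = 8
fc_params_with_gap = gap_vector.size * num_classes + num_classes
fc_params_without_gap = pooled_maps.size * num_classes + num_classes

print(gap_vector.shape, fc_params_with_gap, fc_params_without_gap)
```

Under these assumptions the classifier layer needs 1,544 parameters after GAP, against 55,304 if the same tensor were flattened and connected directly.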
Fully connected layer serves as a classifier in CNN while the
selection of loss function directly determines the performance of
classification model learned by the network. The common loss
functions include Euclidean loss used in real value regression,
Triplet loss used in human facial identification and Softmax loss
used in single-label recognition. Considering that vehicle color
recognition belongs to single-label sample categorization, the Softmax function is adopted as the loss function in this paper:

$$p\left(y^{(i)} = j \mid x^{(i)}; \theta\right) = \frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}} \tag{3}$$

where p is the probability of input sample $x^{(i)}$ belonging to the jth category, and θ is the parameter of the network layer.

Overall, our proposed CNN structure contains 5 layers. The first layer uses 48 kernels of size 7 × 7 × 3; the second and third layers use 128 kernels of size 3 × 3 × 48 and 192 kernels of size 3 × 3 × 128, respectively. The pooling operation after the first convolutional layer uses a size of 3 × 3 with a stride of 2 pixels, and the one after the third convolutional layer uses a size of 3 × 3 with a stride of 3 pixels. At the global average pooling layer, the average pooling kernel size is set as 6 × 6 × 192. The number of neurons of the fully connected layer is 8, equal to the number of vehicle color categories to be recognized. The input of the network is a 3-channel color image whose resolution is 227 × 227 × 3. In our method, the proposed CNN architecture is trained using the examples to extract the features. For each input image, the outputs of the functional layers, such as the convolutional layers and pooling layers, are regarded as features and used for vehicle color recognition. The size and dimension of the features at different layers are shown in Table 2.

2.1.5. Visualization of convolutional kernels and feature maps

To better understand how the proposed CNN network extracts the color information, the convolutional kernels of the first convolutional layer are visualized in Fig. 2. As seen in Fig. 2, the first convolutional layer can extract rich color features of the input image. All vehicle color variations in the dataset are present in the kernels. In other words, the proposed lightweight CNN structure can effectively extract rich color feature information for vehicle color recognition.

Fig. 2. Visualization of 48 convolutional kernels of the first convolutional layer. The convolutional kernels of size 7 × 7 × 3 are learned on the 227 × 227 × 3 input images.

One of the main challenges in vehicle color recognition is to reduce the interference of non-vehicle color regions in the image, such as background, vehicle windows and wheels. In the traditional methods based on handcrafted features, the interference reduction methods mainly fall into two categories: the first is to divide the image into sub-regions and determine the weights of each sub-region by training classifiers in order to reduce the weights of non-vehicle color regions [1]; the second is to construct a mathematical model to measure the correlation between the vehicle color and non-vehicle color regions and then remove the non-vehicle color regions [6] according to the measurement results.

To verify the feature representation capability of the network proposed in this paper, the feature maps of the convolutional layers are visualized. Firstly, the average feature map of each convolutional layer is calculated according to Eq. (4), and then visualized as a heat map. The visualization result is shown in Fig. 4, where three average convolutional feature maps are listed.

$$F_i = \frac{1}{c_i} \sum_{j=1}^{c_i} f_i^{j} \tag{4}$$

where $f_i^{j}$ is the jth feature map of the ith convolutional layer, $c_i$ is the number of convolutional kernels of the ith layer, $F_i$ is the average feature map of the ith layer and its size is consistent with that of a single feature map at the ith layer. For instance, the size of feature maps at
Table 3
The number of images for each color.

Color    Black  Blue  Cyan  Gray  Green  Red   White  Yellow  Total
Number   3442   1086  281   3046  482    1941  4742   581     15,601

Table 4
Comparison of the recognition accuracy using different deep features.

Feature          Black   Blue    Cyan    Gray    Green   Red     White   Yellow  Average
Conv1            0.9402  0.8656  0.8500  0.7242  0.7635  0.9763  0.8996  0.9241  0.8679
Pooling 1        0.9518  0.9116  0.9214  0.7597  0.7884  0.9711  0.8979  0.9103  0.8890
Conv2            0.9582  0.930   0.9643  0.8398  0.8340  0.9845  0.9350  0.9690  0.9269
Conv3            0.9698  0.9687  1       0.8411  0.7925  0.9887  0.9494  0.9655  0.9345
Pooling 2        0.9669  0.9761  0.9929  0.8575  0.8257  0.9918  0.9464  0.9689  0.9408
GAP              0.9797  0.9705  0.9929  0.8582  0.8423  0.9918  0.9688  0.9793  0.9479
Multiple Layers  0.9756  0.9834  0.9857  0.8884  0.8672  0.9938  0.9629  0.9759  0.9541
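As a quick consistency check on Table 4, the short Python sketch below recomputes the last value of each row as the unweighted mean of the eight per-class accuracies. The reading of the columns (eight colors in the order of Table 3, followed by an average) is an assumption inferred from the numbers, since the extracted table carries no column header; the variable names are illustrative.

```python
# Rows of Table 4: eight per-class accuracies, then the reported last column.
# Assumption: the eight values follow the color order of Table 3 and the
# last column is their unweighted mean.
TABLE4 = {
    "Conv1": ([0.9402, 0.8656, 0.8500, 0.7242, 0.7635, 0.9763, 0.8996, 0.9241], 0.8679),
    "Pooling 1": ([0.9518, 0.9116, 0.9214, 0.7597, 0.7884, 0.9711, 0.8979, 0.9103], 0.8890),
    "Conv2": ([0.9582, 0.930, 0.9643, 0.8398, 0.8340, 0.9845, 0.9350, 0.9690], 0.9269),
    "Conv3": ([0.9698, 0.9687, 1.0, 0.8411, 0.7925, 0.9887, 0.9494, 0.9655], 0.9345),
    "Pooling 2": ([0.9669, 0.9761, 0.9929, 0.8575, 0.8257, 0.9918, 0.9464, 0.9689], 0.9408),
    "GAP": ([0.9797, 0.9705, 0.9929, 0.8582, 0.8423, 0.9918, 0.9688, 0.9793], 0.9479),
    "Multiple Layers": ([0.9756, 0.9834, 0.9857, 0.8884, 0.8672, 0.9938, 0.9629, 0.9759], 0.9541),
}


def mean_accuracy(per_class):
    """Unweighted mean over the color classes."""
    return sum(per_class) / len(per_class)


recomputed = {name: mean_accuracy(values) for name, (values, _) in TABLE4.items()}

# Largest disagreement between the recomputed mean and the reported value.
max_gap = max(abs(recomputed[name] - reported) for name, (_, reported) in TABLE4.items())

# Gain of the multi-layer feature over the best single layer (GAP).
gain_over_gap = TABLE4["Multiple Layers"][1] - TABLE4["GAP"][1]

for name, (_, reported) in TABLE4.items():
    print(f"{name:16s} recomputed={recomputed[name]:.4f} reported={reported:.4f}")
print(f"multi-layer gain over GAP: {gain_over_gap:.4f}")
```

Every recomputed mean agrees with the reported value to within rounding, which suggests the last column is a class-balanced average rather than an accuracy weighted by the per-class image counts of Table 3.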
in this paper, which mainly consists of deep feature extraction and an SVM classifier. The proposed CNN architecture is used to extract the deep features, and the SPM strategy and encoding processing are used to obtain a feature representation vector of each layer. Specifically, all feature representations of the convolutional layers are employed and combined with the output feature vector of the global pooling layer to form a whole vector to represent the content of the images. The recognition accuracies using the features from different layers are shown in Table 4.

It can be seen from Table 4 that the shallower the layer, the weaker the feature representation and discrimination capability and the lower the recognition accuracy. Compared with the method using only the features of the final single layer, the combined features from multiple layers effectively improve the recognition accuracy, by 0.62% (0.9541 versus 0.9479 for the GAP layer alone). This is because the features from the intermediate layers provide supplemental information to the final global features, so that better recognition performance can be obtained.

The best single-layer recognition performance in the table is obtained when the output features of the global average pooling layer are used to train the SVM classifier. This result indicates that the global average pooling operation not only decreases the network complexity and reduces the dimension of the features but also effectively maintains the feature representation capability.

3.4. Comparison of recognition performance of different methods

To verify the effectiveness of the proposed method, we performed a comparison experiment with the state-of-the-art vehicle color recognition methods, including:

(1) The method proposed in [1], where the color histogram is combined with the Feature Context strategy to construct a Bag of Words model. The vehicle color recognition is performed based on linear SVM and the 12,288-dimensional feature vector.
(2) The method proposed in [9], where a parallel CNN network is proposed to achieve end-to-end vehicle color recognition. Its parallel structure is embodied in adopting two convolutional networks for learning. The two channels of output features are fed into a fully connected layer for data integration, and the total dimension of the feature vector is 4096.
(3) The method proposed in [7], which combines deep features and kernel SVM to train the classification model. The SPM strategy is used to improve the representation capability of the feature vector. The dimension of the feature vector is 11,520.

In our proposed method, linear SVM is chosen as the classifier, and the dimension of the feature vector is 2032. Table 5 shows the experimental comparison results on the Vehicle Color dataset using the four methods.

It can be seen from Table 5 that, compared with the other three methods, the proposed method achieves the highest recognition accuracy with the lowest feature dimension. In particular, compared with the state-of-the-art method in [7], the proposed method obtains more than 0.7% better recognition accuracy, while the dimension of the features is only 18% and the memory footprint of the proposed CNN network is only 0.5%.

Furthermore, from Tables 4 and 5 it can also be seen that, when only the features of the GAP layer are used for color recognition, the method still achieves a slightly higher recognition accuracy than the method in [7], while the feature dimension of the GAP layer is only 192, far lower than the 11,520-dimensional deep feature adopted by the comparison method. It can be concluded that the designed lightweight CNN network can efficiently extract the features of the vehicle colors.
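The multi-layer feature construction used in these experiments (SPM division of the convolutional feature maps, per-region encoding, normalization and concatenation with the GAP output) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the text specifies four sub-regions per feature map but not the encoding or the normalization, so average pooling per region and L2 normalization are assumed here, and the whole map is added as a fifth region solely because 5 × (48 + 128 + 192) + 192 = 2032 matches the reported feature dimension. The function names and the random stand-in activations are hypothetical.

```python
import numpy as np


def l2_normalize(vector: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a vector to (approximately) unit Euclidean length."""
    return vector / (np.linalg.norm(vector) + eps)


def spm_encode(feature_maps: np.ndarray) -> np.ndarray:
    """Encode an (H, W, C) tensor as channel means over 5 regions.

    Assumed layout: the whole map plus its 2 x 2 quadrants.
    """
    height, width, _ = feature_maps.shape
    mid_h, mid_w = height // 2, width // 2
    regions = [
        feature_maps,                    # whole map (assumed pyramid level 0)
        feature_maps[:mid_h, :mid_w],    # top-left quadrant
        feature_maps[:mid_h, mid_w:],    # top-right quadrant
        feature_maps[mid_h:, :mid_w],    # bottom-left quadrant
        feature_maps[mid_h:, mid_w:],    # bottom-right quadrant
    ]
    return np.concatenate([region.mean(axis=(0, 1)) for region in regions])


def multilayer_feature(conv_maps, gap_vector: np.ndarray) -> np.ndarray:
    """Normalize each layer's encoding and concatenate with the GAP vector."""
    parts = [l2_normalize(spm_encode(maps)) for maps in conv_maps]
    parts.append(l2_normalize(gap_vector))
    return np.concatenate(parts)


# Random stand-ins with the Conv1, Conv2 and Conv3 sizes listed in Table 2.
rng = np.random.default_rng(0)
conv_maps = [
    rng.random((56, 56, 48)),
    rng.random((14, 14, 128)),
    rng.random((14, 14, 192)),
]
gap_vector = rng.random(192)

feature = multilayer_feature(conv_maps, gap_vector)
print(feature.shape)
```

The resulting vector would then be passed to a linear SVM; the classifier is omitted here to keep the sketch free of external dependencies.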
Table 5
Comparison results of recognition performance using different methods.

Method                 Black   Blue    Cyan    Gray    Green   Red     White   Yellow  Average
BoW + FC [1]           0.9713  0.9451  0.9787  0.8461  0.7834  0.9876  0.9414  0.9457  0.9249
Parallel CNN [9]       0.9738  0.9410  0.9645  0.8608  0.8257  0.9897  0.9666  0.9794  0.9447
SPM + kernel SVM [7]   0.9796  0.9642  0.9886  0.8686  0.8406  0.9926  0.9619  0.9787  0.9469
Proposed Method        0.9756  0.9834  0.9857  0.8884  0.8672  0.9938  0.9629  0.9759  0.9541
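The ratios quoted in the comparison with [7] can be reproduced from numbers given in the paper, as the following Python sketch shows. The parameter count is a back-of-the-envelope estimate under assumptions that the paper does not state: 32-bit floating-point weights, one bias per kernel or output neuron, no additional stored parameters, and "M" in Table 1 read as megabytes.

```python
# Kernel shapes and counts from Section 2.1.4; weights plus one bias per kernel.
conv1_params = 7 * 7 * 3 * 48 + 48
conv2_params = 3 * 3 * 48 * 128 + 128
conv3_params = 3 * 3 * 128 * 192 + 192
fc_params = 192 * 8 + 8  # GAP output (192) to 8 color classes

total_params = conv1_params + conv2_params + conv3_params + fc_params

FLOAT32_BYTES = 4  # assumption: parameters stored as 32-bit floats
model_megabytes = total_params * FLOAT32_BYTES / 1e6

# Ratios relative to the method in [7] (11,520-dim feature, AlexNet backbone).
dim_ratio = 2032 / 11520
mem_ratio = 1.1 / 227.6
accuracy_gain = 0.9541 - 0.9469

print(f"parameters: {total_params:,} (~{model_megabytes:.2f} MB as float32)")
print(f"feature dimension ratio: {dim_ratio:.1%}")
print(f"memory ratio: {mem_ratio:.2%}")
print(f"accuracy gain over [7]: {accuracy_gain:.4f}")
```

Under these assumptions the network has 285,448 parameters, roughly 1.14 MB, which is consistent with the 1.1 M entry in Table 1, and the two ratios round to the 18% and 0.5% quoted in the text.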
In summary, the reasons why the vehicle color recognition method in this paper achieves fast and accurate recognition are as follows.

(1) According to the requirements of the vehicle color recognition task, a lightweight convolutional neural network is specially designed, greatly simplifying the network structure and improving the recognition speed.
(2) The proposed method makes full use of the features of the intermediate layers, which provide supplemental information to the final global features to describe the characteristics of the vehicle images more efficiently, thus improving the recognition accuracy.
(3) The SPM strategy is adopted to represent the feature maps of the convolutional layers compactly, which embeds the spatial information into the representation vectors to improve their descriptive and discriminative capability. Through encoding of the SPM division, the feature representation capability is enhanced while the dimension does not impose a heavy burden on the computational complexity.

4. Conclusion

In this paper, a vehicle color recognition method based on a lightweight convolutional neural network is proposed. Compared to the conventional methods, the proposed method has two merits: the first is a lightweight convolutional neural network designed specifically for the vehicle color recognition task, which reduces the number of network parameters and lowers the demand for computational and storage resources during network training; the second is the application of the SPM strategy to embed spatial information into the convolutional feature maps, combining the features from low to high layers to improve the deep feature representation capability.

In this paper, the encoding method of the SPM division remains somewhat coarse. In future work, a new encoding method will be designed to further improve the feature representation capability and recognition performance.

Acknowledgments

No. 61370189, No. 61471013, and No. 61602018), the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT&TCD20150311), the Science and Technology Development Program of Beijing Education Committee (No. KM201510005004).

References

[1] P. Chen, X. Bai, W. Liu, Vehicle color recognition on urban road by feature context, Intell. Transp. Syst. IEEE Trans. 15 (5) (2014) 2340–2346.
[2] G. Qiu, K.M. Lam, Spectrally layered color indexing, Image Video Retrieval (2002) 100–107.
[3] J. Huang, S.R. Kumar, M. Mitra, et al., Image indexing using color correlograms, in: Computer Vision and Pattern Recognition, 1997. Proceedings, 1997 IEEE Computer Society Conference on, IEEE, 1997, pp. 762–768.
[4] N. Baek, S.M. Park, K.J. Kim, et al., Vehicle color classification based on the support vector machine method, in: Advanced Intelligent Computing Theories and Applications. With Aspects of Contemporary Intelligent Computing Techniques, 2007, pp. 1133–1139.
[5] E. Dule, M. Gokmen, M.S. Beratoglu, A convenient feature vector construction for vehicle color recognition, in: Proc. 11th WSEAS International Conference on Neural Networks, Evolutionary Computing and Fuzzy Systems, 2010, pp. 250–255.
[6] W. Hu, J. Yang, L. Bai, et al., A new approach for vehicle color recognition based on specular-free image, in: Sixth International Conference on Machine Vision (ICMV 13), International Society for Optics and Photonics, 2013, 90671Q-90671Q.
[7] C. Hu, X. Bai, L. Qi, et al., Vehicle color recognition with spatial pyramid deep learning, Intell. Transp. Syst. IEEE Trans. 16 (5) (2015) 2925–2934.
[8] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] R.F. Rachmadi, I. Purnama, Vehicle color recognition using convolutional neural network, arXiv:1510.07391, 2015.
[10] J.T. Springenberg, A. Dosovitskiy, T. Brox, et al., Striving for simplicity: the all convolutional net, arXiv:1412.6806, 2014.
[11] M. Lin, Q. Chen, S. Yan, Network in network, arXiv:1312.4400, 2013.
[12] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2, IEEE, 2006, pp. 2169–2178.
[13] C. Szegedy, W. Liu, Y. Jia, et al., Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[14] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.
[15] Y. Jia, E. Shelhamer, J. Donahue, et al., Caffe: convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.