
Signal Processing 147 (2018) 146–153

Contents lists available at ScienceDirect

Signal Processing
journal homepage: www.elsevier.com/locate/sigpro

Vehicle color recognition using Multiple-Layer Feature Representations of lightweight convolutional neural network
Qiang Zhang a,c, Li Zhuo a,b,c,∗, Jiafeng Li a,c, Jing Zhang a,c, Hui Zhang a,c, Xiaoguang Li a,c
a Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing, China
b Collaborative Innovation Center of Electric Vehicles in Beijing, Beijing, China
c College of Microelectronics, Faculty of Information Technology, Beijing University of Technology, Beijing, China

Article info

Article history:
Received 30 July 2017
Revised 6 December 2017
Accepted 16 January 2018

Keywords:
Multiple-Layer Feature Representations
Lightweight convolutional neural network
Vehicle color recognition
Spatial Pyramid Matching

Abstract

In this paper, a vehicle color recognition method using a lightweight convolutional neural network (CNN) is proposed. Firstly, a lightweight CNN network architecture is specifically designed for the recognition task, which contains five layers, i.e. three convolutional layers, a global pooling layer and a fully connected layer. Different from the existing CNN-based methods that only use the features output from the final layer for recognition, in this paper the feature maps of the intermediate convolutional layers are all applied for recognition, based on the fact that these convolutional features can provide hierarchical representations of the images. The Spatial Pyramid Matching (SPM) strategy is adopted to divide the feature map, and each SPM sub-region is encoded to generate a feature representation vector. These feature representation vectors of the convolutional layers and the output feature vector of the global pooling layer are normalized and cascaded as a whole feature vector, which is finally utilized to train a Support Vector Machine classifier to obtain the recognition model. The experimental results show that, compared with the state-of-the-art methods, the proposed method obtains more than 0.7% higher recognition accuracy, up to 95.41%, while the dimensionality of its feature vector is only 18% and its memory footprint only 0.5% of theirs.
© 2018 Elsevier B.V. All rights reserved.

∗ Corresponding author.
E-mail addresses: [email protected] (L. Zhuo), [email protected] (J. Zhang), [email protected] (H. Zhang), [email protected] (X. Li).

https://doi.org/10.1016/j.sigpro.2018.01.021
0165-1684/© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Vehicle recognition plays an important part in applications such as Intelligent Transportation Systems and criminal investigation. Its task is to recognize the type, color and license plate of the targeted vehicle. Color is one of the basic attributes of a vehicle; therefore, color recognition plays a significant role in vehicle recognition. However, due to the complex impacts of illumination, weather conditions, noise and image capture quality, vehicle color recognition is a challenging task. Firstly, illumination, noise and special weather conditions lead to obvious changes in the visual appearance of vehicle colors. For instance, vehicle colors vary greatly under different illuminations, which makes vehicle recognition very difficult. Secondly, because camera positions and parameters differ, there are significant differences in viewing angles and focal distances. In addition, the size and area of vehicles in the image change greatly, which poses another great challenge to vehicle color recognition. Finally, the interference of complex scenes and non-vehicle color components further escalates the difficulty of vehicle color recognition.

In recent years, many researchers have proposed various solutions for vehicle color recognition. In general, the research on vehicle color recognition can be divided into two stages:

The first stage is based on handcrafted features combined with classifiers [1,4–6]. It focuses on manually designing various features of vehicle colors, such as color histogram [1], color moment [2] and color correlogram [3]; these features are then used to train classifiers such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN). Chen et al. [1] extracted color histogram features and introduced spatial information using Spatial Pyramid Matching (SPM) and Feature Context (FC) to establish a Bag of Words (BoW) model, then combined it with a linear SVM to solve the vehicle color recognition problem. Baek et al. [4] also extracted the color histogram of the H and S components in HSV color space to construct bi-dimensional feature vectors and combined them with SVM classifiers for vehicle color recognition. Another approach was proposed by Dule et al. [5], which extracted different color features from various color spaces and tested the vehicle color recognition performance using different classifiers, including KNN, ANN and SVM. The experimental result indicates that the ANN classifier combined with an 8-bin color

histogram feature vector can obtain the highest recognition accuracy.

Apart from the above vehicle color recognition methods based on global feature extraction, Hu et al. [6] provided an alternative way of vehicle color recognition by constructing a color reflection model to remove the non-vehicle color areas in the image, such as background, vehicle wheels and windows, and then directly extracting features from the major vehicle color regions to train the SVM classifier.

The methods mentioned above basically adopt handcrafted features. Therefore, they tend to achieve higher execution speed but weaker generalization capability and lower recognition accuracy. In addition, the design of handcrafted features requires professional knowledge. An appropriate feature usually takes massive experience and time to validate its effectiveness for a certain task. Therefore, it is difficult to manually design proper features for new data and tasks.

The second stage is based on deep learning. The research of this stage mainly focuses on two aspects. The first is to use a deep neural network to obtain the feature representation of the images and then apply it to vehicle color recognition in combination with a traditional classifier. Hu et al. [7] proposed a vehicle color recognition method in which the deep features of AlexNet [8] and a kernel SVM are used, combined with an SPM learning strategy. The other is to use an end-to-end deep neural network structure, which merges the feature extraction and the classifier into a unified framework through joint optimization. Rachmadi et al. [9] designed a parallel CNN network to achieve end-to-end vehicle color recognition, which learns the recognition model from big data by using two convolutional networks and integrates the two parts by a fully connected layer. The research results indicate that, for vehicle color recognition, compared to the traditional recognition methods of handcrafted features + classifier, the feature representations learnt by deep learning have strong generalization capability and the recognition performance can be improved obviously.

The representative deep network structures used in vehicle color recognition mainly include AlexNet [8], GoogLeNet [13], VGG-Net [14], and so on, which were originally designed for complex classification tasks characterized by a massive amount of data. Therefore, these network structures usually possess a large number of parameters and are easily subject to over-fitting. In addition, they require large computational and storage resources. Recently, however, some researchers have found that for some specific tasks or applications, a lightweight network structure can also achieve state-of-the-art results [10].

Deep neural networks demonstrate strong learning ability and highly efficient feature extraction capability, and can extract information from low-level raw data to high-level abstract semantic concepts. This hierarchical feature representation gives them prominent advantages when extracting global features and context semantic information of the images. However, the existing methods commonly use the output features of the last layer for recognition while neglecting the feature information of the previous layers. Actually, the features of the lower layers contain considerable information about the images, which may improve the recognition performance. However, if all the features from the intermediate layers are employed, the extremely high dimension of the feature vectors makes training the classification model too difficult or even causes it to fail. To address this problem, a trade-off solution is proposed in this paper. Firstly, a lightweight CNN network architecture is designed for the vehicle color recognition task, which contains five layers, i.e. three convolutional layers, a global pooling layer and a fully connected layer. The feature maps of the convolutional layers are used for feature representation of the image. Specifically, the SPM strategy is adopted to divide the feature maps of the convolutional layers into four sub-regions. Then, each sub-region is encoded to obtain a feature representation vector of the layer. Next, these feature representation vectors and the output feature vector of the final global pooling layer are normalized and cascaded as a whole feature vector to represent the content of the image. Finally, a linear SVM is used as the classifier for vehicle color recognition. The experimental results demonstrate that the features from the intermediate layers can help to improve the recognition accuracy. Compared with the state-of-the-art methods, the proposed method achieves more than 0.7% higher recognition accuracy, up to 95.41%, with a lower-dimensional feature vector and a smaller CNN model memory size.

The rest of this paper is organized as follows. Section 2 describes the main ideas and details of the proposed method. Experimental results and analysis are presented in Section 3. Finally, the conclusions are drawn in Section 4.

Table 1
Comparison results of the proposed network structure and AlexNet structure.

Layer     AlexNet          Proposed network
Input     227 × 227 × 3    227 × 227 × 3
Layer 1   conv, 96         conv, 48
Layer 2   conv, 256        conv, 128
Layer 3   conv, 384        conv, 192
Layer 4   conv, 384        –
Layer 5   conv, 256        –
Layer 6   fc, 4096         GAP, 192
Layer 7   fc, 4096         –
Output    fc, 8            fc, 8
Memory    227.6M           1.1M

2. The proposed method

Convolutional neural networks have become the dominant machine learning approach for visual recognition tasks. To fulfill complex tasks, which usually need to identify hundreds or even thousands of categories, a CNN model usually has a large number of parameters. When used for small and medium size datasets, over-fitting often occurs. In vehicle color recognition, the color categories and the size of the datasets are both limited. For example, in [6], vehicle color contains red, yellow, blue and green, four categories in total. In [5], the vehicle colors contain white, black, gray, red, blue, green, and yellow, seven categories in total. In this paper, the public vehicle color dataset in [1,7,9] is adopted, which includes eight color categories to recognize. In order to achieve a good tradeoff between the performance and the computational complexity, a lightweight CNN network architecture is designed to extract the features for vehicle color recognition, which includes three convolutional layers, a global pooling layer and a fully connected layer, five layers in total.

The framework of the proposed method is depicted in Fig. 1. Considering that the feature maps of the convolutional layers contain rich information about the vehicle images, all feature representations of the convolutional layers are employed and combined with the output feature vector of the global pooling layer to form a whole vector to represent the content of the images. A linear SVM classifier is used to train the classification model. In the following, the details of the proposed method are described.

2.1. Lightweight convolutional neural network structure design

The lightweight convolutional neural network architecture designed in this paper is shown with the black dotted line in Fig. 1. The number of neurons at the three convolutional layers is 48, 128,
and 192 respectively. Table 1 compares the proposed network with the classical AlexNet [8] network structure.

Fig. 1. Framework of the proposed vehicle color recognition method.

It can be seen from Table 1 that the lightweight design of the proposed CNN network structure is reflected in three aspects. Firstly, the structure reduces the number of convolutional layers and corresponding activation layers, which requires fewer non-linear operations. This decreases the computational complexity and improves the learning speed of the network to some degree. Meanwhile, it reduces the number of convolutional kernels in the convolutional layers, which reduces the number of parameters to be optimized in the learning process, thus increasing the learning speed of the network and reducing the risk of over-fitting. Finally, the lightweight CNN network adopts a global average pooling layer as the link layer between the third convolutional layer and the final fully connected layer, which avoids concentrating the majority of the network parameters in the fully connected layer. In these ways, the network structure can be greatly simplified. The memory footprint of the lightweight network proposed in this paper is only 1.1 M, far smaller than the 227.6 M of AlexNet. The design of each component of the lightweight CNN network structure is detailed as follows.

2.1.1. Convolutional layer

As an indispensable part of a CNN, the convolutional layer serves to enhance the original features of signals and reduce the noise. Multiple image features can be extracted by diverse convolutional kernels. For the proposed CNN network structure in this paper, the convolutional layer performs a discrete convolution operation on the input signal. For an input image I, if K is the convolutional kernel at the current layer, the output image H obtained after the discrete convolution operation can be expressed as:

$$H[x, y] = \sum_{m=0}^{w-1} \sum_{n=0}^{h-1} I[m, n]\, K[x - m, y - n] \tag{1}$$

where x and y are the coordinates of the input image, and m and n are the coordinates of the convolutional kernel, respectively. The kernel parameters K are obtained through iterative updates during the training process.

In addition, since the hierarchical non-linear operation of a CNN is implemented by the activation function, the selection of the activation function has a huge influence on the final recognition performance. The common activation functions in deep learning include Sigmoid, tanh and ReLU (Rectified Linear Units). In this paper, the ReLU function is adopted as the activation function.

2.1.2. Normalized layer

The network model parameters are updated constantly during the training process, which often leads to a shift in the input data distribution of each subsequent layer. Meanwhile, the learning process requires each layer to adapt to its input data distribution. Therefore, to overcome the influence of the input data distribution on recognition performance, the data are normalized prior to each convolutional layer in this paper. Firstly, the average image is subtracted from the input image; then, before the convolution operation of the second and third layers, the Local Response Normalization (LRN) operation is performed on the input data. The normalization operation makes the data distribution more reasonable and easier to distinguish. Moreover, the network convergence rate and recognition accuracy can be improved during the training process. Denote by $a^i_{x,y}$ the activity of a neuron computed by applying kernel i at position (x, y); the normalization result $b^i_{x,y}$ is given by Eq. (2):

$$b^i_{x,y} = a^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^j_{x,y}\big)^2 \Big)^{\beta} \tag{2}$$

where the sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer. The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set. In this paper, they are set as k = 2, n = 5, α = 10^{-4}, and β = 0.75 respectively.

2.1.3. Pooling layer

The common pooling operations include average pooling, maximum pooling and stochastic pooling. In the proposed CNN network structure, the maximum pooling operation is carried out on the
output feature maps of the first and third convolutional layers. The pooling operation can reduce the dimensionality of the feature map, accelerate the convergence rate, lower the computational complexity and provide a certain rotation invariance.

Table 2
The size and dimension of the features at different layers.

Layer      Kernel   Stride   Feature map size   Dimension
Conv1      7 × 7    4        56 × 56 × 48       150,528
Pooling 1  3 × 3    2        28 × 28 × 48       37,632
Conv2      3 × 3    2        14 × 14 × 128      25,088
Conv3      3 × 3    1        14 × 14 × 192      37,632
Pooling 2  3 × 3    3        6 × 6 × 192        6912
GAP        6 × 6    1        1 × 1 × 192        192

2.1.4. Fully connected layer and loss function

The convolutional layer, pooling layer and activation function serve to map the raw data onto the feature space in the hidden
layer, while the fully connected layer maps the learned “distributed feature representation” onto the labeled sample space. In other words, the label information of the categories from the target loss function is transmitted to the previous convolutional layers through the fully connected layer. Therefore, a fully connected layer is usually set up as the last layer of a CNN to realize the mapping between label information and the feature space.

On the other hand, a traditional CNN network generally has multiple fully connected layers with too many parameters, easily leading to over-fitting (for instance, AlexNet contains three fully connected layers). To reduce the number of parameters of the fully connected layer, this paper places a global average pooling (GAP) layer [11] between the third convolutional layer and the final fully connected layer. The global average pooling operation is carried out on the feature maps of the third convolutional layer so that each feature map outputs only an average value, which is equivalent to a dimensionality reduction operation. This speeds up the parameter learning process of the fully connected layer, and thus sharply decreases the quantity of network parameters and reduces the risk of over-fitting.
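The effect of the GAP layer on the parameter count can be illustrated with a small sketch. It is not the authors' Caffe implementation: the input values are hypothetical, the kernel shapes are the ones the paper lists for its three convolutional layers, and the one-bias-per-kernel and 32-bit-float choices are assumptions made only to obtain a rough size estimate.

```python
def global_average_pool(feature_maps):
    """Reduce each 2-D feature map to its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Hypothetical input: 192 maps of size 6 x 6; map c is filled with the value c.
maps = [[[float(c)] * 6 for _ in range(6)] for c in range(192)]
gap_vector = global_average_pool(maps)

num_classes = 8
fc_weights_with_gap = len(gap_vector) * num_classes    # 192 inputs
fc_weights_without_gap = 6 * 6 * 192 * num_classes     # 6912 inputs

# (kernel_h, kernel_w, in_channels, out_channels) as listed in Section 2.1.4.
conv_kernels = [(7, 7, 3, 48), (3, 3, 48, 128), (3, 3, 128, 192)]
conv_params = sum(kh * kw * cin * cout + cout for kh, kw, cin, cout in conv_kernels)
fc_params = fc_weights_with_gap + num_classes  # assumption: one bias per output
total_params = conv_params + fc_params
approx_megabytes = total_params * 4 / 1e6  # assumption: 32-bit floats
```

Under these assumptions the fully connected layer needs 1536 weights instead of 55,296, and the whole network has roughly 0.29 million parameters, or about 1.1 MB, which is in line with the memory figure quoted in Table 1.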
The fully connected layer serves as a classifier in a CNN, while the selection of the loss function directly determines the performance of the classification model learned by the network. The common loss functions include the Euclidean loss used in real-value regression, the Triplet loss used in face identification and the Softmax loss used in single-label recognition. Considering that vehicle color recognition is a single-label classification task, the Softmax function is adopted as the loss function in this paper:

$$p\big(y^{(i)} = j \mid x^{(i)}; \theta\big) = \frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}} \tag{3}$$

where p is the probability of the input sample $x^{(i)}$ belonging to the jth category, and θ is the parameter of the network layer.

Overall, our proposed CNN structure contains 5 layers. The first layer uses 48 kernels of size 7 × 7 × 3; the second and third layers use 128 kernels of size 3 × 3 × 48 and 192 kernels of size 3 × 3 × 128, respectively. The pooling operations after the first and third convolutional layers use a 3 × 3 window with strides of 2 and 3 pixels, respectively. At the global average pooling layer, the average pooling kernel size is set as 6 × 6 × 192. The number of neurons of the fully connected layer is 8, equal to the number of vehicle color categories to be recognized. The input of the network is a 3-channel color image whose resolution is 227 × 227 × 3. In our method, the proposed CNN architecture is trained using the examples to extract the features. For each input image, the outputs of each functional layer, such as the convolutional layers and pooling layers, are regarded as features and used for vehicle color recognition. The size and dimension of the features at different layers are shown in Table 2.

2.1.5. Visualization of convolutional kernels and feature maps

To better understand how the proposed CNN network extracts the color information, the convolutional kernels of the first convolutional layer are visualized in Fig. 2. As seen in Fig. 2, the first convolutional layer can extract rich color features of the input image. All vehicle color variations in the dataset are present in the kernels. In other words, the proposed lightweight CNN structure can effectively extract rich color feature information for vehicle color recognition.

Fig. 2. Visualization of 48 convolutional kernels of the first convolutional layer. The convolutional kernels of size 7 × 7 × 3 are learned on the 227 × 227 × 3 input images.

One of the main challenges in vehicle color recognition is to reduce the interference of non-vehicle color regions in the image, such as background, vehicle windows and wheels. In the traditional methods based on handcrafted features, the interference reduction methods fall into two categories: the first is to divide the image into sub-regions and determine the weight of each sub-region by training classifiers in order to reduce the weights of non-vehicle color regions [1]; the second is to construct a mathematical model to measure the correlation between the vehicle color and non-vehicle color regions, and then remove the non-vehicle color regions [6] according to the measurement results.

To verify the feature representation capability of the network proposed in this paper, the feature maps of the convolutional layers are visualized. Firstly, the average feature map of each convolutional layer is calculated according to Eq. (4), and then visualized as a heat-map. The visualization result is shown in Fig. 3, where three average convolutional feature maps are listed.

$$F_i = \frac{1}{c_i} \sum_{j=1}^{c_i} f_i^{j} \tag{4}$$

where $f_i^{j}$ is the jth feature map of the ith convolutional layer, $c_i$ is the number of convolutional kernels of the ith layer, and $F_i$ is the average feature map of the ith layer, whose size is consistent with that of a single feature map at the ith layer. For instance, the size of the feature maps at
the third convolutional layer is 14 × 14 × 192, so the size of the average feature map is 14 × 14, as shown in Fig. 3(d).

Fig. 3. Visualization of three convolutional average features.

It can be found in Fig. 3 that larger feature values in the heat-map (yellow regions in Fig. 3(b), (c) and (d)) correspond to the main vehicle color regions, while smaller feature values (dark regions) correspond to non-vehicle color regions, such as background and windows, and to color regions distorted by illumination. This indicates that the extracted color features are robust to the interference.

The average feature map of the second pooling layer is visualized in Fig. 4, with the heat-map value of each point marked. It can be seen that the features obtained from the proposed CNN network adaptively determine the corresponding weights according to the vehicle color contribution of different regions, which demonstrates their excellent representation capability.
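The average feature map of Eq. (4) is an element-wise mean over the channels of a layer. The following is a minimal sketch in plain Python, assuming the feature maps are available as a list of equally sized 2-D lists; the function name and the input values are hypothetical and chosen only for illustration.

```python
def average_feature_map(feature_maps):
    """Element-wise mean of c equally sized 2-D maps, as in Eq. (4)."""
    count = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(fmap[y][x] for fmap in feature_maps) / count for x in range(width)]
        for y in range(height)
    ]

# Hypothetical activations: 192 maps of size 14 x 14; map j is filled with j.
maps = [[[float(j)] * 14 for _ in range(14)] for j in range(192)]
avg = average_feature_map(maps)
```

For a 14 × 14 × 192 input the result is a single 14 × 14 map, which is the array that would then be rendered as a heat-map.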

Fig. 4. Visualization of the average feature map of the second pooling layer.

2.2. Convolutional layer feature representation based on SPM

As described above, deep neural networks have strong learning capability and highly efficient feature representation capability, and can extract information from low to high levels. If all the features from the intermediate layers are utilized, the feature representation capability can be enhanced to a certain extent, but this also results in a high dimension of the feature vectors. Therefore, the existing methods commonly adopt the features output from the last layer of the deep network for recognition, but neglect the features output from the intermediate layers. To make full use of these features, this paper proposes a feature representation method for the convolutional layers based on the SPM strategy. SPM is a classical method that embeds spatial information into the feature vector [12]. It aims to extract features from different sub-regions and aggregate the features of all the regions together to describe an image. In order to reduce the dimensionality of the features, the proposed method divides the feature map into four sub-regions using SPM, and then encodes the sub-regions to obtain the feature representation vector of the convolutional layer.
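To make the dimensionality argument concrete, the sketch below compares the size of the flattened convolutional feature maps with the size after region-wise encoding. The shapes are taken from Table 2. The factor of five values per channel is an assumption inferred from the per-layer dimensions the paper reports for the combined vector (48 × 5, 128 × 5 and 192 × 5); the paper itself only states four sub-regions.

```python
# Feature-map shapes (height, width, channels) taken from Table 2.
conv_shapes = {"conv1": (56, 56, 48), "conv2": (14, 14, 128), "conv3": (14, 14, 192)}
gap_dim = 192
regions_per_map = 5  # assumption: inferred from the 48 x 5, 128 x 5, 192 x 5 dimensions

# Size of the feature vector if every activation were kept.
raw_dim = sum(h * w * c for h, w, c in conv_shapes.values()) + gap_dim

# Size after each channel is reduced to one value per region.
encoded_dim = sum(regions_per_map * c for _, _, c in conv_shapes.values()) + gap_dim

reduction = raw_dim / encoded_dim
```

With these numbers the raw representation has 213,440 entries against 2032 after encoding, a reduction of about two orders of magnitude.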
Table 3
The number of images for each color.

Color    Black  Blue  Cyan  Gray  Green  Red   White  Yellow  Total
Number   3442   1086  281   3046  482    1941  4742   581     15,601

Fig. 5. Some examples of the Vehicle Color dataset.

The proposed method divides the CNN convolutional feature map (as shown in Fig. 1) using SPM, whose advantage is that only one deep feature extraction process is required, avoiding multiple feature extractions. At the same time, the spatial information is embedded into the representation vector, thus improving its description and discrimination capability.

In this paper, we use Eq. (5) to encode each SPM division and obtain a compact feature representation of the feature map:

$$P_i = \frac{\sum f_i^{k}}{w \times h}, \quad k \in \{1, 2, \ldots, c_i\} \tag{5}$$

where $f_i^{k}$ is the kth feature map of the ith layer, w × h is the size of the current feature map, and $c_i$ is the number of neurons of the ith convolutional layer. $P_i$ represents the output of the encoding and has dimensions 1 × 1 × $c_i$. For example, the feature representation vector of the third convolutional layer is a 1 × 1 × 192 dimensional vector.
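The encoding of Eq. (5) amounts to averaging each channel over a spatial region. The sketch below is one possible reading, not the authors' code: it assumes the regions are the whole map plus its four quadrants, a layout chosen so that the output length matches the 192 × 5 = 960 dimensions the paper reports for the third convolutional layer. All function names and input values are hypothetical.

```python
def encode_region(fmap, top, bottom, left, right):
    """Average one 2-D map over rows [top, bottom) and columns [left, right)."""
    values = [fmap[y][x] for y in range(top, bottom) for x in range(left, right)]
    return sum(values) / len(values)

def spm_encode(feature_maps):
    """Encode c maps of size h x w into a vector of 5 * c region averages.

    Assumption: the five regions are the whole map plus its four quadrants.
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    mid_y, mid_x = h // 2, w // 2
    regions = [
        (0, h, 0, w),            # whole map
        (0, mid_y, 0, mid_x),    # top-left quadrant
        (0, mid_y, mid_x, w),    # top-right quadrant
        (mid_y, h, 0, mid_x),    # bottom-left quadrant
        (mid_y, h, mid_x, w),    # bottom-right quadrant
    ]
    # Region-major order: all channels of region 0, then all channels of region 1, ...
    return [encode_region(fmap, *region) for region in regions for fmap in feature_maps]

# Hypothetical third-layer output: 192 maps of 14 x 14; map k is filled with k.
conv3_maps = [[[float(k)] * 14 for _ in range(14)] for k in range(192)]
conv3_vector = spm_encode(conv3_maps)
```

Because the regions are cut from a feature map that has already been computed, the network runs only once per image, which is the advantage the text attributes to applying SPM at the feature-map level.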
After L2-normalization, the feature representation vectors of each convolutional layer and the output feature vector of the global pooling layer are cascaded as a whole vector to represent the content of the image. The L2 normalization operation can be expressed as:

$$x'_j = x_j \Big/ \sqrt{\sum_{j=1}^{c_i} x_j^2} \tag{6}$$

where $x_j$ is the jth element of $P_i$, and $x'_j$ represents its value after the L2 normalization operation.

We use $v_{GAP}$ to represent the normalized feature vector of the global pooling layer. Similarly, $v_{conv1}$, $v_{conv2}$ and $v_{conv3}$ represent the normalized features of the three convolutional layers. The feature cascade process can be expressed as:

$$V = [v_{conv1}, v_{conv2}, v_{conv3}, v_{GAP}] \tag{7}$$

where the dimensions of $v_{conv1}$, $v_{conv2}$, $v_{conv3}$ and $v_{GAP}$ are 240 (48 × 5), 640 (128 × 5), 960 (192 × 5), and 192 respectively. Thus, the dimension of the combined feature V is 2032. Finally, a linear SVM is trained as the classifier using the combined feature vector.

3. Experimental results and analysis

To verify the effectiveness of the vehicle color recognition method proposed in this paper, we conducted experiments on the public Vehicle Color dataset [1] provided by Chen et al. The experimental results are compared with the state-of-the-art vehicle color recognition methods. The experimental platform is as follows: 3.3-GHz 4-core CPU, 8 GB RAM, Tesla K20C GPU, and Ubuntu 64-bit operating system. Next, each experiment is introduced in detail.

3.1. Vehicle color dataset

In this paper, a public Vehicle Color dataset is adopted for experimental verification and comparison [1,7]. This dataset contains 15,601 vehicle images in eight color categories: black, blue, cyan, gray, green, red, white and yellow. Table 3 shows the number of images for each color: cyan vehicles account for the smallest proportion with 281 images, and white vehicles for the largest with 4742. All the images in the dataset are front views (or with a slight angle change) captured by a road monitoring system. Besides, each image contains only one vehicle. The dataset covers many environmental changes, such as weather conditions and illumination. Some examples of the Vehicle Color dataset are shown in Fig. 5. There are multiple vehicle types in the dataset, such as trucks, cars and buses, which poses a great challenge to vehicle color recognition. In the experiments, to compare the proposed method with the other methods, the same setting is adopted, that is, the dataset is divided into training data and testing data with a ratio of 1:1 [1,7,9].

3.2. Performance of lightweight convolutional neural network

To verify the recognition performance of the lightweight convolutional neural network proposed in this paper, a comparison experiment is carried out on this network and several other commonly used CNN networks, including AlexNet [8], GoogLeNet [13] and VGG-Net [14], whose depths are 8, 22 and 16 respectively. The CNN networks are all built on the Caffe (Convolutional Architecture for Fast Feature Embedding) platform [15]. In the training process, all networks use the same random parameter initialization and are trained iteratively with the stochastic gradient descent algorithm. The trained model is used for vehicle color recognition. Fig. 6 shows the experimental comparison results. It should be noted that, for a fair comparison, AlexNet, GoogLeNet and VGG16-Net do not use any pre-training strategy, but only the end-to-end training mode.

It can be seen from Fig. 6(a) that, compared to the existing deep CNN networks, the proposed lightweight CNN network obtains even higher recognition accuracy, reaching 94.73%, while its depth is only 5, much shallower and lighter than the other three CNN networks. In addition, it can be seen from Fig. 6(b) that the proposed network has a faster convergence rate and shorter computational time during vehicle color recognition. This is because the proposed lightweight CNN network has a simpler structure, fewer parameters and much lower complexity.

3.3. Impacts of deep features at different layers on recognition results

To verify the impact of different features on vehicle color recognition accuracy, the framework of “deep features + SVM” is adopted
Fig. 6. Comparison results of different networks for vehicle color recognition.

Table 4
Comparison of the recognition accuracy using different deep features.

Layer            Black   Blue    Cyan    Gray    Green   Red     White   Yellow  AP
Conv1            0.9402  0.8656  0.8500  0.7242  0.7635  0.9763  0.8996  0.9241  0.8679
Pooling 1        0.9518  0.9116  0.9214  0.7597  0.7884  0.9711  0.8979  0.9103  0.8890
Conv2            0.9582  0.930   0.9643  0.8398  0.8340  0.9845  0.9350  0.9690  0.9269
Conv3            0.9698  0.9687  1       0.8411  0.7925  0.9887  0.9494  0.9655  0.9345
Pooling 2        0.9669  0.9761  0.9929  0.8575  0.8257  0.9918  0.9464  0.9689  0.9408
GAP              0.9797  0.9705  0.9929  0.8582  0.8423  0.9918  0.9688  0.9793  0.9479
Multiple Layers  0.9756  0.9834  0.9857  0.8884  0.8672  0.9938  0.9629  0.9759  0.9541

in this paper, which mainly consists of deep feature extraction and an SVM classifier. The proposed CNN architecture is used to extract the deep features, and the SPM strategy and encoding processing are used to obtain a feature representation vector of each layer. Specifically, all feature representations of the convolutional layers are employed and combined with the output feature vector of the global pooling layer to form a single vector that represents the content of the images. The recognition accuracy obtained with the features from different layers is shown in Table 4.

It can be seen from Table 4 that the shallower the layer, the weaker the feature representation and distinction capability and the lower the recognition accuracy. Compared with the method using only the features of the final single layer, the combined features from multiple layers improve the recognition accuracy by 0.62 percentage points (AP of 0.9541 versus 0.9479 for the GAP layer). This is because the features from the intermediate layers provide supplemental information to the final global features, so better recognition performance can be obtained.

Among the single-layer features in the table, the best recognition performance is obtained when the output features of the global average pooling (GAP) layer are used to train the SVM classifier. This result indicates that the global average pooling operation not only decreases the network complexity and reduces the dimension of the features but also effectively maintains the feature representation capability.

3.4. Comparison of recognition performance of different methods

To verify the effectiveness of the proposed method, we performed a comparison experiment with the state-of-the-art vehicle color recognition methods, including:

(1) The method proposed in [1], where the color histogram is combined with the Feature Context strategy to construct a Bag of Words model. The vehicle color recognition is performed based on a linear SVM and the 12,288-dimensional feature vector.
(2) The method proposed in [9], where a parallel CNN network is proposed to achieve end-to-end vehicle color recognition. Its parallel structure is embodied in adopting two convolutional networks for learning. The two channels of output features are fed into a fully connected layer for data integration, and the total dimension of the feature vector is 4096.
(3) The method proposed in [7], which combines deep features and a kernel SVM to train the classification model. The SPM strategy is used to improve the representation capability of the feature vector. The dimension of the feature vector is 11,520.

In our proposed method, a linear SVM is chosen as the classifier, and the dimension of the feature vector is 2032. Table 5 shows the experimental comparison results on the Vehicle Color dataset for the four methods.

It can be seen from Table 5 that, compared with the other three methods, the proposed method achieves the highest recognition accuracy with the lowest feature dimension. In particular, compared with the state-of-the-art method in [7], the proposed method obtains more than 0.7% higher recognition accuracy, while the dimension of the features is only 18% and the memory footprint of the proposed CNN network is only 0.5% of those of the method in [7].

Furthermore, from Tables 4 and 5 it can also be seen that, when only the features of the GAP layer are used for color recognition, the method still achieves slightly higher recognition accuracy than the method in [7], while the feature dimension of the GAP layer is only 192, far lower than the 11,520-dimensional deep feature adopted by the comparison method. It can be concluded that the designed lightweight CNN network can efficiently extract the features of the vehicle colors.
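The multi-layer representation used in the “deep features + SVM” framework can be illustrated with a minimal sketch. It is not the authors' implementation: the max pooling inside each pyramid cell, the pyramid levels (1, 2), and all function names are assumptions made for illustration, standing in for the SPM encoding, and the resulting vector would then be passed to a linear SVM (not shown).

```python
from typing import List, Sequence

# A feature map is indexed as [channel][row][col].
FeatureMap = List[List[List[float]]]


def global_average_pool(fmap: FeatureMap) -> List[float]:
    """Average each channel over all spatial positions (one value per channel)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def spatial_pyramid_pool(fmap: FeatureMap, levels: Sequence[int] = (1, 2)) -> List[float]:
    """Pool each channel over an n x n grid for every pyramid level n.

    Max pooling and the default levels are illustrative assumptions.
    Each map must be at least n x n for the largest level n.
    """
    out: List[float] = []
    for ch in fmap:
        h, w = len(ch), len(ch[0])
        for n in levels:
            for i in range(n):
                r0, r1 = i * h // n, (i + 1) * h // n
                for j in range(n):
                    c0, c1 = j * w // n, (j + 1) * w // n
                    out.append(max(ch[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return out


def multi_layer_descriptor(conv_maps: Sequence[FeatureMap], final_map: FeatureMap,
                           levels: Sequence[int] = (1, 2)) -> List[float]:
    """Concatenate pyramid-pooled convolutional features with the GAP vector."""
    vec: List[float] = []
    for fmap in conv_maps:
        vec.extend(spatial_pyramid_pool(fmap, levels))
    vec.extend(global_average_pool(final_map))
    return vec


# Toy input: one convolutional layer with 2 channels of 4 x 4, final layer with 3 channels of 2 x 2.
toy_conv = [[[float(r * 4 + c) for c in range(4)] for r in range(4)] for _ in range(2)]
toy_final = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(3)]
descriptor = multi_layer_descriptor([toy_conv], toy_final)
print(len(descriptor))  # 2 channels * (1 + 4) cells + 3 GAP values = 13
```

The GAP vector has one entry per channel regardless of the map size, which is why the final-layer feature stays compact, while the pyramid cells keep a coarse record of where each response occurred.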

Table 5
Comparison results of recognition performance using different methods.

Method                 Black   Blue    Cyan    Gray    Green   Red     White   Yellow  AP
BoW + FC [1]           0.9713  0.9451  0.9787  0.8461  0.7834  0.9876  0.9414  0.9457  0.9249
Parallel CNN [9]       0.9738  0.9410  0.9645  0.8608  0.8257  0.9897  0.9666  0.9794  0.9447
SPM + kernel SVM [7]   0.9796  0.9642  0.9886  0.8686  0.8406  0.9926  0.9619  0.9787  0.9469
Proposed Method        0.9756  0.9834  0.9857  0.8884  0.8672  0.9938  0.9629  0.9759  0.9541
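The margins quoted in Section 3.4 can be re-derived from the tabulated figures with a short calculation. The sketch below copies its numbers from Tables 4 and 5 and from the stated feature dimensions. Treating AP as the unweighted mean of the eight per-class accuracies is an assumption; it reproduces the AP of the proposed method, but this section does not define AP explicitly. All variable names are illustrative.

```python
# Figures copied from Tables 4 and 5 and from the text of Section 3.4.
CLASSES = ["Black", "Blue", "Cyan", "Gray", "Green", "Red", "White", "Yellow"]
PROPOSED = [0.9756, 0.9834, 0.9857, 0.8884, 0.8672, 0.9938, 0.9629, 0.9759]
SPM_KERNEL_SVM = [0.9796, 0.9642, 0.9886, 0.8686, 0.8406, 0.9926, 0.9619, 0.9787]
AP = {"proposed": 0.9541, "spm_kernel_svm": 0.9469, "gap_only": 0.9479}
DIM = {"proposed": 2032, "spm_kernel_svm": 11520, "gap_only": 192}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


def points(a, b):
    """Difference between two accuracies, in percentage points."""
    return round((a - b) * 100, 2)


# Assumption: AP is the unweighted mean over the eight color classes.
class_mean = round(mean(PROPOSED), 4)

gain_vs_ref7 = points(AP["proposed"], AP["spm_kernel_svm"])   # multi-layer vs. method in [7]
gain_vs_gap = points(AP["proposed"], AP["gap_only"])          # multi-layer vs. GAP layer only
gap_vs_ref7 = points(AP["gap_only"], AP["spm_kernel_svm"])    # GAP layer only vs. method in [7]

# Feature dimensions relative to the 11,520-dimensional vector of [7], in percent.
dim_ratio = round(100 * DIM["proposed"] / DIM["spm_kernel_svm"], 1)
gap_dim_ratio = round(100 * DIM["gap_only"] / DIM["spm_kernel_svm"], 1)

# Per-class change of the proposed method relative to the method in [7].
per_class_delta = {
    name: round(p - q, 4) for name, p, q in zip(CLASSES, PROPOSED, SPM_KERNEL_SVM)
}

print(class_mean, gain_vs_ref7, gain_vs_gap, gap_vs_ref7)
print(dim_ratio, gap_dim_ratio)
print(per_class_delta)
```

Under these figures the gain over [7] is not uniform across classes: it is concentrated in green, gray and blue, while black, cyan and yellow are marginally lower.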

In summary, the reasons why the vehicle color recognition method in this paper can obtain quick and precise recognition performance are as follows.

(1) According to the requirements of the vehicle color recognition task, a lightweight convolutional neural network is specially designed, greatly simplifying the network structure and improving the recognition speed.
(2) The proposed method makes full use of the features of the intermediate layers, which provide supplemental information to the final global features and describe the characteristics of the vehicle images more efficiently, thus improving the recognition accuracy.
(3) The SPM strategy is adopted to represent the feature maps of the convolutional layers compactly, which embeds the spatial information into the representation vectors to improve their descriptive and discrimination capability. Through the encoding of the SPM division, the feature representation capability is enhanced while the dimension does not place a heavy burden on the computational complexity.

4. Conclusion

In this paper, a vehicle color recognition method based on a lightweight convolutional neural network is proposed. Compared to the conventional methods, the proposed method has two merits. First, a special lightweight convolutional neural network is designed for the vehicle color recognition task, reducing the number of network parameters and lowering the demand for computational and storage resources during network training. Second, the SPM strategy is applied to embed spatial information into the convolutional feature maps, and the features from low to high layers are combined to improve the deep feature representation capability.

In this paper, the encoding method of the SPM division remains somewhat coarse. In future work, a new encoding method will be designed to further improve the feature representation capability and recognition performance.

Acknowledgments

The work in this paper is supported by the National Natural Science Foundation of China (No. 61531006, No. 61372149, No. 61370189, No. 61471013, and No. 61602018), the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT&TCD20150311), and the Science and Technology Development Program of Beijing Education Committee (No. KM201510005004).

References

[1] P. Chen, X. Bai, W. Liu, Vehicle color recognition on urban road by feature context, Intell. Transp. Syst. IEEE Trans. 15 (5) (2014) 2340–2346.
[2] G. Qiu, K.M. Lam, Spectrally layered color indexing, Image Video Retrieval (2002) 100–107.
[3] J. Huang, S.R. Kumar, M. Mitra, et al., Image indexing using color correlograms, in: Computer Vision and Pattern Recognition, 1997. Proceedings, 1997 IEEE Computer Society Conference on, IEEE, 1997, pp. 762–768.
[4] N. Baek, S.M. Park, K.J. Kim, et al., Vehicle color classification based on the support vector machine method, in: Advanced Intelligent Computing Theories and Applications. With Aspects of Contemporary Intelligent Computing Techniques, 2007, pp. 1133–1139.
[5] E. Dule, M. Gokmen, M.S. Beratoglu, A convenient feature vector construction for vehicle color recognition, in: Proc. 11th WSEAS International Conference on Neural Networks, Evolutionary Computing and Fuzzy systems, 2010, pp. 250–255.
[6] W. Hu, J. Yang, L. Bai, et al., A new approach for vehicle color recognition based on specular-free image, Sixth International Conference on Machine Vision (ICMV 13), International Society for Optics and Photonics, 2013, 90671Q-90671Q.
[7] C. Hu, X. Bai, L. Qi, et al., Vehicle color recognition with spatial pyramid deep learning, Intell. Transp. Syst. IEEE Trans. 16 (5) (2015) 2925–2934.
[8] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] R.F. Rachmadi, I. Purnama, Vehicle color recognition using convolutional neural network, arXiv:1510.07391, 2015.
[10] J.T. Springenberg, A. Dosovitskiy, T. Brox, et al., Striving for simplicity: the all convolutional net, arXiv:1412.6806, 2014.
[11] M. Lin, Q. Chen, S. Yan, Network in network, arXiv:1312.4400, 2013.
[12] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2, IEEE, 2006, pp. 2169–2178.
[13] C. Szegedy, W. Liu, Y. Jia, et al., Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[14] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.
[15] Y. Jia, E. Shelhamer, J. Donahue, et al., Caffe: convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.
