Improved Deep Network for Self-Driving Scene Classification
Abstract— A self-driving car is a hot research topic in the field of the intelligent transportation system, which can greatly alleviate traffic jams and improve travel efficiency. Scene classification is one of the key technologies of self-driving cars, which can provide the basis for decision-making in self-driving cars. In recent years, deep learning-based solutions have achieved good results in the problem of scene classification. However, some problems should be further studied in the scene classification methods, such as how to deal with the similarities among different categories and the differences among the same category. To deal with these problems, an improved deep network-based scene classification method is proposed in this article. In the proposed method, an improved faster region with convolutional neural network features (RCNN) network is used to extract the features of representative objects in the scene to obtain local features, where a new residual attention block is added to the Faster RCNN network to highlight local semantics related to driving scenarios. In addition, an improved Inception module is used to extract global features, where a mixed Leaky ReLU and ELU function is presented to reduce the possible redundancy of the convolution kernel and enhance the robustness. Then, the local features and the global features are fused to realize the scene classification. Finally, a private dataset is built from the public datasets for the specialized application of scene classification in the self-driving field, and the proposed method is tested on the proposed dataset. The experimental results show that the accuracy of the proposed method can reach 94.76%, which is higher than that of the state-of-the-art methods.

Index Terms— Deep network, faster region with convolutional neural network features (RCNN), feature fusion, scene classification, self-driving car.

I. INTRODUCTION

To alleviate traffic congestion, frequent traffic accidents, and other problems, the intelligent transportation system comes into being [1]–[3], which combines various advanced information technologies with the whole ground traffic management system to achieve efficient, convenient, and safe traffic control [4], [5].

The self-driving car is a vehicle that perceives the environment and runs with little or no manual input [6], [7], which is an indispensable part of the intelligent transportation system. Because the driverless system relies on the results of environment perception to make driving behavior decisions, environment perception has become the research hot spot in the self-driving car field [8]. The specific tasks of environment perception in the field of self-driving include scene classification, obstacle detection, lane recognition, and so on [9], [10]. Scene classification is one of the most important and challenging tasks in the self-driving car field because the traffic environment is complicated and volatile, and the categories are various [11].

The scene classification of the self-driving car means that the information of the road and its surroundings is obtained by the onboard camera, radar, or other sensors, and then the state of the current position is recognized by the corresponding processing methods [12], [13]. To achieve a higher level of intelligent driving, the self-driving car needs to understand the high-level semantic information of its location to make the decision of driving strategy and path planning. For example, the car should slow down near the school, pay attention to the use of the anti-skid mode/function in rainy and snowy weather, keep driving at high speed on the highway, and so on.
Fig. 2. Total structure of the proposed deep network for the scene classification of self-driving cars.
Fig. 3. Representative objects defined for the scene classification of self-driving cars. (a) Zebra crossings and pedestrians in the crosswalk. (b) Gas tanks in
the gas station. (c) Parking cars and lines in the parking lot. (d) Houses in street. (e) Isolation belts in the highway.
common sense and are used as labels for network training. In this study, a total of seven representative objects of the usual traffic scenes are defined, which are shown in Fig. 3. The representative objects are zebra crossing, pedestrian, gas tank, parking car, parking line, house, and isolation belt. The scenes defined in this study are crosswalk, gas station, parking lot, street, and highway. In the proposed method, the predefined representative objects are automatically detected by the improved Faster RCNN. There is no need to artificially decide what objects should be detected during the scene classification process. The output of the improved Inception_V1 network is the global features of the whole image. The feature fusion module fuses local features and global features. These network structures are described in detail as follows.

A. Improved Faster RCNN Network for Local Feature Extraction

1) Structure of the Improved Faster RCNN Network: The local feature extraction is based on the Faster RCNN [27]. The main reason for using the Faster RCNN is that the performances of the Faster RCNN series are significantly better than other networks (see [28] for details). In this study, the structure of the improved Faster RCNN is shown in Fig. 4, where the VGG16 Net is used as the underlying framework to get the feature map of the whole image, which consists of 13 convolution layers and four pooling layers, and the activation function is ReLU. The residual attention module combines the top–down attention map with the input bottom–up convolution features, to obtain the feature map and output to the next layer. The details about the residual attention module will be introduced in Section II-A2.
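For illustration, a minimal PyTorch sketch of such a VGG16 backbone is given below; the use of torchvision and the dummy input size are assumptions made for this sketch and are not taken from the article.

```python
import torch
import torchvision

# Minimal sketch of the VGG16 backbone used to produce the shared feature map.
# Dropping the final max-pooling layer of torchvision's VGG16 leaves exactly
# 13 convolution layers (each followed by ReLU) and 4 pooling layers, matching
# the description above. Layer indices and the input size are illustrative.
vgg16 = torchvision.models.vgg16(weights=None)               # no pretrained download
backbone = torch.nn.Sequential(*list(vgg16.features)[:-1])   # remove the last MaxPool2d

image = torch.randn(1, 3, 600, 800)    # dummy RGB input (the size is an assumption)
feature_map = backbone(image)          # stride-16 feature map with 512 channels
print(feature_map.shape)               # -> torch.Size([1, 512, 37, 50])
```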
In the proposed deep network, the region proposal network (RPN) [29] is used to generate region proposals, where one branch is used to judge whether the anchor belongs to the foreground or background, and the other branch is used to form the bounding box coordinates. The RPN generates nine anchors for each pixel in the feature map and sorts the anchors from large to small according to the score of the input foreground. Then, the first 12 000 anchors are selected. Finally, only 2000 anchors are reserved based on the non-maximum suppression (NMS) algorithm (see [30] for details), which are input into the next layer.
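The NMS step referenced above ([30]) is the standard greedy algorithm; a plain NumPy sketch is given below for reference. The 0.7 IoU threshold commonly used for RPN proposals is an assumption, not a value from the article.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.7):
    """Greedy non-maximum suppression.

    boxes  : (N, 4) array of [x1, y1, x2, y2]
    scores : (N,) foreground scores from the RPN
    Returns the indices of the kept boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # sort anchors from large to small score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best remaining box with all the others
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```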
The region-of-interest (ROI) pooling network [31] maps the input ROIs to the last layer of the VGG16 network to get a proposal feature map with a fixed size (that is, 300 ∗ 7 ∗ 7 ∗ 512 in this study). Finally, this fixed-size feature map is fully connected by the prediction network, and the Softmax is used to classify the specific categories. At the same time, the smooth L1 loss function is used to complete the bounding box regression operation to obtain the accurate positions of the objects (see [32] for details).
In this article, the loss function of the improved Faster RCNN has four parts: the RPN classification loss, the RPN location regression loss, the classification loss, and the location regression loss of the prediction network. The loss function of the RPN network L_RPN is defined as follows:

$L_{\mathrm{RPN}}(\{p_i\},\{t_i\}) = \frac{1}{N_{\mathrm{cls}}}\sum_i L_{\mathrm{cls}}(p_i, p_i^{*}) + \lambda\,\frac{1}{N_{\mathrm{reg}}}\sum_i p_i^{*}\, L_{\mathrm{reg}}(t_i, t_i^{*})$  (1)

where N_cls is the number of the anchors in the minibatch; N_reg is the number of the anchor locations; t_i = {t_x, t_y, t_w, t_h} is the predicted coordinates of the bounding box for the i-th anchor; t_i^* = {t_x^*, t_y^*, t_w^*, t_h^*} is the real coordinates of the bounding box of the objects; and λ is a coefficient to balance the classification loss and location regression loss, which is an insensitive parameter and is set as 1 in this study [33], [34]. p_i represents the probability that the i-th anchor is predicted as the target, and p_i^* is the ground truth, namely,

$p_i^{*} = \begin{cases} 0, & \text{negative label} \\ 1, & \text{positive label} \end{cases}$  (2)

The RPN classification loss L_cls is a cross-entropy, which is denoted by

$L_{\mathrm{cls}}(p_i, p_i^{*}) = -\log\left[\,p_i^{*}\, p_i + (1-p_i^{*})(1-p_i)\,\right]$  (3)

The RPN location regression loss L_reg is denoted by

$L_{\mathrm{reg}} = R(t_i - t_i^{*})$  (4)

where R(·) is the smooth L1 loss function, which is defined as

$R(x) = \begin{cases} 0.5x^{2}, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$  (5)

The loss function of the prediction network is the same as that of the RPN network, where the classification loss function uses the cross-entropy and the regression loss function uses the smooth L1 loss function.
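For reference, (1)–(5) can be written compactly as the following sketch; the tensor shapes, the numerical safeguard, and the simplification N_reg = N_cls are assumptions for illustration, while λ = 1 follows the text.

```python
import torch

def smooth_l1(x: torch.Tensor) -> torch.Tensor:
    """R(x) of (5): 0.5 * x^2 if |x| < 1, |x| - 0.5 otherwise (elementwise)."""
    abs_x = x.abs()
    return torch.where(abs_x < 1, 0.5 * x ** 2, abs_x - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    """L_RPN of (1).

    p         : (N,) predicted foreground probabilities p_i
    p_star    : (N,) ground-truth labels p_i* in {0, 1}, see (2)
    t, t_star : (N, 4) predicted / ground-truth box coordinates t_i, t_i*
    """
    eps = 1e-7                                  # numerical safeguard (not in the paper)
    # L_cls of (3), the cross-entropy written exactly as in the text.
    l_cls = -torch.log(p_star * p + (1 - p_star) * (1 - p) + eps)
    # L_reg of (4): smooth L1 on t_i - t_i*, weighted by p_i* so only positive anchors count.
    l_reg = smooth_l1(t - t_star).sum(dim=1)
    n_cls = p.numel()                           # anchors in the minibatch
    n_reg = p.numel()                           # anchor locations (simplified to the same count here)
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

# Example with random anchors; shapes and values are purely illustrative.
p = torch.rand(256)
p_star = (torch.rand(256) > 0.5).float()
t, t_star = torch.randn(256, 4), torch.randn(256, 4)
print(rpn_loss(p, p_star, t, t_star))
```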
2) Residual Attention Module: The existing methods of image scene classification mainly focus on multilayer CNNs, but the large amount of redundant information contained in images is not conducive to scene classification. These CNN-based methods do not clearly distinguish between key information and redundant information; thus, the efficiency and accuracy of image scene classification are affected, and the ability to extract features is limited. The spatial attention mechanism is widely used in visual tasks [35], which can adaptively learn to focus on more prominent regional feature maps in the scene and input these feature maps into the subsequent bottom–up feature extraction process. However, the spatial attention mechanism will lose the previous feature maps, so the bottom–up process is interrupted by the attention model.

To deal with the problem introduced above, a single-layer spatial attention model with a residual connection is proposed to integrate the attention map and the convolution feature map. In the proposed residual attention module (see Fig. 5), the input feature map is first normalized by batch normalization and operated by a single 1 ∗ 1 convolution layer. The 1 ∗ 1 convolution layer can be used in general to change the filter space dimensionality (see [36] and [37] for details). The purpose of the 1 ∗ 1 convolution layer used here is to reduce the number of channels and improve the calculation performance. After the convolution layer, based on the spatial attention module, the attention mask is generated, and different weights are given to different regions of the feature map to get a new feature map. Then, the feature value of any point (i, j) in the feature map processed by the spatial attention mechanism is

$F_{\mathrm{output}}(i,j) = F_{\mathrm{input}}(i,j) \otimes a_{ij}$  (6)

where F_output and F_input are the output and input feature values of the point (i, j) through the spatial attention module, respectively; a_ij is the attention weight of the point (i, j); and ⊗ represents the dot product operation.

To avoid the disappearance of the feature value before the attention module, a residual connection is introduced; then, (6) is modified as

$F_{\mathrm{output}}(i,j) = F_{\mathrm{input}}(i,j) \otimes a_{ij} + F_{\mathrm{input}}(i,j)$  (7)

Remark 1: Based on the proposed residual attention module, the feature map with spatial attention and convolution is combined with the input feature map. Thus, the bottom–up feature extraction process will not be interrupted, and the top information of the image will also be taken into account.
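A minimal sketch of the residual attention block of (6) and (7) is given below. The channel sizes, the sigmoid used to produce the attention mask, and the module name are assumptions for illustration; only the batch normalization, the 1 ∗ 1 convolution, the spatial mask, and the residual combination follow the description above.

```python
import torch
import torch.nn as nn

class ResidualSpatialAttention(nn.Module):
    """Sketch of the residual attention module: F_out = F_in * a + F_in, see (7)."""

    def __init__(self, channels: int = 512, reduced: int = 64):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)             # batch normalization of the input
        self.reduce = nn.Conv2d(channels, reduced, 1)  # 1x1 conv reducing the channel number
        self.mask = nn.Conv2d(reduced, 1, 1)           # one attention value a_ij per location
        self.sigmoid = nn.Sigmoid()                    # keeps the weights in (0, 1); an assumption

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        x = self.reduce(self.bn(f_in))
        a = self.sigmoid(self.mask(x))                 # attention map of shape (B, 1, H, W)
        return f_in * a + f_in                         # elementwise weighting plus residual connection

feature_map = torch.randn(1, 512, 37, 50)
out = ResidualSpatialAttention()(feature_map)          # same shape as the input
```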
3) Local Feature Extraction: The detailed process of the local feature extraction based on the proposed Faster RCNN is introduced as follows. First, based on the pretrained local network, 300 region proposals are generated, and a further NMS algorithm is applied to them (the threshold for the NMS is 0.3 in this study). Then, the target bounding box with confidence greater than 0.5 is detected by the crop layer, which is converted to the position coordinates [y1, x1, y2, x2] in the original feature map (the feature map after the attention module). The parts in the bounding box are extracted from the original feature map and resized uniformly by the following pooling layer to obtain N local feature maps with the same size (it is 7 ∗ 7 ∗ 512 in this article). Finally, the N local feature maps are fused by an elementwise addition operation (see [38] and [39] for details), which is denoted as follows:

$Z_{\mathrm{add}} = \sum_{i=1}^{N} X_i$  (8)

where Z_add represents the fusion feature tensor of single channels and X_i represents the single-channel feature tensor of the i-th target region (the total number of channels is 512 in this study).

Finally, the local fusion feature vector Z is operated by a flat layer and two fully connected layers, namely,

$X_p = \varphi_p(X), \quad X_f = \varphi_f(X_p)$  (9)

where φ_p represents the flat operation; X_p denotes the output of the flat layer, which is a flat tensor; φ_f represents a two-layer fully connected operation; and X_f is the final local feature, with the size of 1 ∗ 5 in this study.

The pseudocode of the local feature extraction process based on the proposed network is shown in Fig. 6.
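The aggregation of (8) and (9) can be sketched as follows; the 7 ∗ 7 ∗ 512 region features and the 1 ∗ 5 output follow the text, while the number of regions and the hidden width of the first fully connected layer are assumptions.

```python
import torch
import torch.nn as nn

# N cropped-and-resized region features X_i, each 512 x 7 x 7 (see the text above).
N = 12                                  # number of detected boxes; the value is illustrative
region_features = torch.randn(N, 512, 7, 7)

# Eq. (8): elementwise addition over the N regions -> a single 512 x 7 x 7 tensor.
z_add = region_features.sum(dim=0, keepdim=True)

# Eq. (9): flatten, then two fully connected layers down to the 1 x 5 local feature.
flatten = nn.Flatten()                  # phi_p, the flat operation
fc = nn.Sequential(                     # phi_f, a two-layer fully connected operation
    nn.Linear(512 * 7 * 7, 256),        # hidden size 256 is an assumption
    nn.ReLU(),
    nn.Linear(256, 5),
)
x_f = fc(flatten(z_add))                # final local feature, shape (1, 5)
```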
B. Improved Inception_V1 Network for Global Feature Extraction

In this article, a global feature extraction network is proposed based on the Inception network [40]. The main reason for using the Inception network is that it can extract more information from input images at different scales and, by using a global average pooling layer instead of a fully connected layer, can greatly reduce the number of parameters while increasing the depth of the network; it also has a good performance in the classification problem. In this study, the Inception_V1 network is improved for the global feature extraction, and its structure is shown in Fig. 7.

As shown in Fig. 7, the Inception_V1 network has nine Inception blocks in total, and each Inception block has four branches. The first branch performs a 1 ∗ 1 convolution on the input, which can perform cross-channel feature transformation to improve the expression ability of the network; the second branch first uses a 1 ∗ 1 convolution and then performs a 3 ∗ 3 convolution; the third branch uses a 1 ∗ 1 convolution and then performs a 5 ∗ 5 convolution; and the fourth branch uses a 1 ∗ 1 convolution directly after a 3 ∗ 3 max-pooling. Each Inception block uses an aggregation operation to combine these four branches. Besides the Inception blocks, there are three convolution layers and two max-pooling layers after the input layer. In addition, there are an average pooling layer and a fully connected layer before the output layer in the Inception_V1 network.

As we know, the activation function is a key part of the deep network, which is used to realize the nonlinear mapping.
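The four-branch block described above can be sketched as follows; the branch layout follows the description, while the per-branch channel counts (which vary from block to block in Inception_V1) are taken from the first GoogLeNet block only as an example.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Sketch of one four-branch Inception block as described in Section II-B."""

    def __init__(self, in_ch, c1, c2_red, c2, c3_red, c3, c4):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)          # 1x1 conv
        self.branch2 = nn.Sequential(                               # 1x1 conv then 3x3 conv
            nn.Conv2d(in_ch, c2_red, kernel_size=1),
            nn.Conv2d(c2_red, c2, kernel_size=3, padding=1),
        )
        self.branch3 = nn.Sequential(                               # 1x1 conv then 5x5 conv
            nn.Conv2d(in_ch, c3_red, kernel_size=1),
            nn.Conv2d(c3_red, c3, kernel_size=5, padding=2),
        )
        self.branch4 = nn.Sequential(                               # 3x3 max-pool then 1x1 conv
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, c4, kernel_size=1),
        )

    def forward(self, x):
        # Aggregation: the four branch outputs are concatenated along the channel axis.
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)], dim=1
        )

block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)   # channel split of the first GoogLeNet block
y = block(torch.randn(1, 192, 28, 28))                  # -> (1, 256, 28, 28)
```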
Fig. 8. Some images with different scenes in the special dataset built for scene classification of self-driving cars in this article.
1) The scene image to be classified is input into the proposed deep network.
2) Generate 300 region proposals for the image by the pretrained improved Faster RCNN. Then, the local features of the image are extracted by performing some subsequent operations on these 300 region proposals, such as NMS, flattening, and the two-layer fully connected operation [see (8) and (9)].
3) Meanwhile, the global features of the image are extracted by the improved Inception_V1 network.
4) Then, the local features and global features are fused by (12) to obtain the fused feature vector.
5) Send the fused feature vector to a fully connected layer, and the size of the feature vector is changed to 1 ∗ 5.
6) Conduct scene classification through the Softmax classifier, and finally, output the scene category that the image belongs to.
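A skeleton of steps 1)–6) is sketched below. Since (12) is not reproduced in this excerpt, a simple concatenation is assumed for the fusion step, and the two branch modules are placeholders rather than the networks of Section II.

```python
import torch
import torch.nn as nn

def classify_scene(image, local_branch, global_branch, fusion_fc, class_names):
    """Skeleton of the classification procedure in steps 1)-6).

    local_branch  : improved Faster RCNN branch returning the 1 x 5 local feature
    global_branch : improved Inception_V1 branch returning the global feature
    fusion_fc     : fully connected layer mapping the fused feature to 1 x 5
    (All three are placeholders for the modules described in Section II.)
    """
    local_feat = local_branch(image)                     # steps 1)-2): proposals -> NMS -> crop -> (8), (9)
    global_feat = global_branch(image)                   # step 3): global features of the whole image
    fused = torch.cat([local_feat, global_feat], dim=1)  # step 4): fusion; (12) is assumed to be concat here
    logits = fusion_fc(fused)                            # step 5): size changed to 1 x 5
    probs = torch.softmax(logits, dim=1)                 # step 6): Softmax classifier
    return class_names[int(probs.argmax(dim=1))]

scenes = ["crosswalk", "gas station", "parking lot", "street", "highway"]
# Dummy branches whose output sizes follow the text (local 1x5; a 1x5 global feature is assumed):
dummy_local = lambda img: torch.randn(1, 5)
dummy_global = lambda img: torch.randn(1, 5)
print(classify_scene(torch.randn(1, 3, 600, 800), dummy_local, dummy_global,
                     nn.Linear(10, 5), scenes))
```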
III. EXPERIMENTS

A. Datasets

There are many public datasets widely used for image classification training and testing, such as KITTI and UMC [42], [43]. However, these datasets are not set up especially for the scene classification of self-driving cars. Thus, the accuracy and efficiency will be low if these public datasets are used directly for scene classification in self-driving cars. To deal with this problem, a special dataset is established to train and verify the performance of the deep network in the scene classification of self-driving cars. There are five categories of the scene in the proposed dataset, namely, crosswalk, gas station, parking lot, highway, and street. Each category has 15 000 pictures that are selected from the public datasets KITTI [42] and Place365 [26].

The proposed dataset contains various traffic scenes at different locations, time periods, and various light and weather conditions, such as day and night, and cloudy and sunny days. Because the scenes are to be classified by a self-driving car, the images should be obtained by the cameras mounted on the car. Thus, when selecting the images, their availability should be fully considered, including the shooting angle, shooting distance, and representative objects in the images. The workload to establish this special dataset is enormous, which is to ensure the performance of the deep network based on this dataset. Some images with different scenes of the self-driving cars are shown in Fig. 8.

Remark 3: In this study, the special dataset is set up by selecting pictures with clear and distinct representative objects of scene images from the public dataset because the self-driving car will only be in one scene at the same time in most cases. Other complex situations will be further studied in the future.
Fig. 10. Curve of the loss function and accuracy of the proposed deep
network after 20 000 iterations on the training set.
TABLE I
PARAMETERS OF THE PROPOSED DEEP NETWORK AND THE EXPERIMENTAL ENVIRONMENT
Fig. 11. Curve of the loss function and accuracy of the proposed deep
network after 20 000 iterations on the validation set.
TABLE II
EXPERIMENTAL RESULTS OF SCENE CLASSIFICATION BASED ON DIFFERENT DEEP NETWORKS
C. Comparison Experiments
To show the efficiency of the proposed method, some
comparison experiments are conducted, where some other
state-of-the-art deep learning-based methods are tested on the
same dataset used in Section III-B. These state-of-the-art deep
learning-based methods include MobileNet (with 14 Conv,
13 DW, one Pool, one FC, and one Softmax layers) [45],
ResNet101 (with 101 layers) [46], AlexNet (with eight lay-
ers) [47], EfficientNet (with 16 MBConv, two Conv, one Pool,
and one FC layers) [48], and Inception_V1 (with 22 layers;
see Fig. 7 for details) [49]. The main reason that these deep
learning methods are selected for comparative experiments is
that these methods are classic deep learning models in scene
classification with good performance. To test these methods
under different situations, the dataset is divided into three
parts, namely, sunny day, rainy day, and night. The comparison
results are listed in Table II. Some scene classification results
based on these methods are shown in Fig. 13.
The results in Table II show that the total accuracy of the proposed method can reach 94.76%, which is 4.67% higher (relative value) than that of the general Inception_V1, the method that obtains the second-best result in this scene classification experiment. In the experiment, all the state-of-the-art deep learning-based methods achieve a relatively high scene classification accuracy on sunny days. From the results in Table II, we can also see that the classification accuracies of all these methods decrease noticeably on rainy days and at night because feature extraction is difficult for images captured in rain and at night. However, the proposed deep network maintains a higher accuracy on sunny days, on rainy days, and at night because it uses both the local features and the global features to realize the scene classification (see the scene classification results in Fig. 13 for details). The standard deviation of the accuracies across the different situations is also the smallest for the proposed method among these methods, which further shows that the proposed deep network has good performance in various situations.
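As a small worked example of the evaluation protocol above (per-condition accuracy, the standard deviation across conditions, and the relative improvement), the following snippet uses placeholder numbers rather than the values of Table II.

```python
import statistics

# Hypothetical per-condition accuracies (percent) for one method; NOT the Table II values.
accuracy = {"sunny day": 96.1, "rainy day": 93.8, "night": 94.4}

total = statistics.mean(accuracy.values())       # mean accuracy over the three splits
spread = statistics.stdev(accuracy.values())     # std of the accuracies across the situations

baseline_total = 90.5                            # placeholder second-best total accuracy
relative_gain = (total - baseline_total) / baseline_total * 100  # relative improvement in percent
print(f"total={total:.2f}%  std={spread:.2f}  relative gain={relative_gain:.2f}%")
```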
IV. DISCUSSION

The total performance of the proposed method has been proven on the special dataset for self-driving cars by some experiments in Section III. In this section, some additional comparison experiments are conducted to discuss the performance of the key parts of the proposed network, including the local feature network based on the Faster RCNN and the global feature network based on Inception_V1. Based on these experiments, not only the reasons why these deep networks are used in the proposed model but also the ablation analyses (including the attention module in the general Faster RCNN and the mixed activation function with ELU and Leaky ReLU in the Inception_V1) are given. In addition, the effectiveness of the presented special dataset for the scene classification of self-driving cars and the generalization performance of the proposed model are discussed by the comparison experiments conducted on a public dataset and some real-world traffic videos.

Fig. 13. Some scene classification results based on the six deep learning methods.
TABLE III
EXPERIMENTAL RESULTS OF LOCAL FEATURE EXTRACTION
A. About the Local Feature Extraction Network

First, the performance of the improved Faster RCNN for local feature extraction is discussed by a comparison experiment with the general Faster RCNN and YOLOv5 (the latest version of the YOLO object detection algorithm [50]). In the proposed deep network, the task of the Faster RCNN is to detect the representative objects of the scene images, including the zebra crossings and pedestrians in crosswalks and the fuel tanks of gas stations (see Fig. 3). Thus, before the experiment, each image in the dataset is manually annotated. Then, the improved Faster RCNN and the other two networks are used to detect all these representative objects in the input images. The comparison experiment results are listed in Table III, and some experimental results are shown in Fig. 14.

The results in Table III and Fig. 14 show that the two Faster RCNN-based methods have better performance in the detection task of representative objects than YOLOv5, which is the main reason why the proposed deep network uses the Faster RCNN for the local feature extraction. Compared with the general Faster RCNN, the improved Faster RCNN increases the mAP by 1.31% and reduces the standard deviation by 27.27% (relative values). On a rainy day, the accuracy of the improved Faster RCNN increases by 2.58% relative to the general Faster RCNN, where the effect is more obvious. The experimental results show that the introduction of a residual connection module based on spatial attention in the Faster RCNN can help the network to extract the local features more efficiently.

The results in Fig. 14 also show that the improved Faster RCNN in this article can finish the detection task of the representative objects.
TABLE IV
COMPARISON EXPERIMENT RESULTS BASED ON DIFFERENT GLOBAL FEATURE NETWORKS
Fig. 16. Some images in the traffic videos obtained from an automobile data recorder.
This means that there is a dataset offset between the public dataset and the special dataset used to train the deep network model, which will seriously affect the performance of scene classification based on deep learning methods. There are lots of images in the public dataset that do not consider the shooting angle and distance, especially the existence of representative objects (such as parking lots without parking lines), which will reduce the scene classification accuracy. On the other hand, the result of the proposed method on the public dataset has the highest accuracy among all the deep networks, which shows that the proposed method has good robustness.

In the experiment on the real-world traffic videos, there are some scenes that are common but difficult to recognize, including dimly lit underground parking lots and complicated internal roads (see Fig. 16). The proposed method can also get the best result, and its accuracy reaches 83.46%, which further shows that the proposed method has a better generalization ability and can satisfy the requirement of the scene classification task for self-driving cars (see Table V).

Remark 4: The generalization of the trained model for a new test distribution is a very important problem, and there are many related works on this problem [53]–[55].

V. CONCLUSION

The scene classification for self-driving cars based on the deep network is studied, and an improved integrated deep network is presented in this article. In the proposed deep network, the Inception network and the Faster RCNN network are used to extract global features and local features, respectively, which are two of the main CNN models for visual computing with excellent performance. In the proposed deep network, these two networks are improved to increase accuracy and computing efficiency. To further improve the efficiency of the deep learning-based method in self-driving cars, a special dataset is set up based on some public datasets. In addition, various comparison experiments are conducted, and the results show that the proposed deep network has a better performance than the state-of-the-art deep networks in the scene classification task for self-driving cars. However, there are some limitations of the proposed method, including the division problem of the scene categories and the classification problem for some heterogeneous scenes, such as roadside parking lots and gas stations along the street. These problems should be further studied.

In future work, the special dataset for self-driving should be further studied, including the situations with heterogeneous road agents, to make it more suitable for the deep learning-based methods of scene classification in the self-driving car field. On the other hand, how to perform a more nuanced division of the scene categories for self-driving cars is a subject worthy of study. In addition, other deep network models (such as VGG and AlexNet) will be further studied to check their performance for different tasks in the self-driving car field, including lane recognition and obstacle detection.
REFERENCES

[1] M. Huang, X. Yan, Z. Bai, H. Zhang, and Z. Xu, “Key technologies of intelligent transportation based on image recognition and optimization control,” Int. J. Pattern Recognit. Artif. Intell., vol. 34, no. 10, Sep. 2020, Art. no. 2054024.
[2] A. Mammeri, T. Zuo, and A. Boukerche, “Extending the detection range of vision-based vehicular instrumentation,” IEEE Trans. Instrum. Meas., vol. 65, no. 4, pp. 856–873, Apr. 2016.
[3] M. Karaduman and H. Eren, “Smart driving in smart city,” in Proc. 5th Int. Istanbul Smart Grid Cities Congr. Fair (ICSG), İstanbul, Turkey, Apr. 2017, pp. 115–119.
[4] F. Duarte, “Self-driving cars: A city perspective,” Sci. Robot., vol. 4, no. 28, Mar. 2019, Art. no. eaav9843, doi: 10.1126/scirobotics.aav9843.
[5] N. A. Greenblatt, “Self-driving cars and the law,” IEEE Spectr., vol. 53, no. 2, pp. 46–51, Feb. 2016.
[6] J. Ni, Y. Chen, Y. Chen, J. Zhu, D. Ali, and W. Cao, “A survey on theories and applications for self-driving cars based on deep learning methods,” Appl. Sci., vol. 10, no. 8, p. 2749, Apr. 2020.
[7] R. Hussain and S. Zeadally, “Autonomous cars: Research results, issues, and future challenges,” IEEE Commun. Surveys Tuts., vol. 21, no. 2, pp. 1275–1313, 2nd Quart., 2019.
[8] L. Jones, “Driverless cars: When and where?” Eng. Technol., vol. 12, no. 2, pp. 36–40, 2017.
[9] Q. Xu, M. Wang, Z. Du, and Y. Zhang, “A positioning algorithm of autonomous car based on map-matching and environmental perception,” in Proc. 33rd Chin. Control Conf. (CCC), Nanjing, China, Jul. 2014, pp. 707–712.
[10] Q. Zou, H. Jiang, Q. Dai, Y. Yue, L. Chen, and Q. Wang, “Robust lane detection from continuous driving scenes using deep neural networks,” IEEE Trans. Veh. Technol., vol. 69, no. 1, pp. 41–54, Jan. 2020.
[11] Y. Parmar, S. Natarajan, and G. Sobha, “DeepRange: Deep-learning-based object detection and ranging in autonomous driving,” IET Intell. Transp. Syst., vol. 13, no. 8, pp. 1256–1264, Aug. 2019.
[12] J.-R. Xue, J.-W. Fang, and P. Zhang, “A survey of scene understanding by event reasoning in autonomous driving,” Int. J. Autom. Comput., vol. 15, no. 3, pp. 249–266, 2018.
[13] Y. Yang, F. Chen, F. Wu, D. Zeng, Y.-M. Ji, and X.-Y. Jing, “Multi-view semantic learning network for point cloud based 3D object detection,” Neurocomputing, vol. 397, pp. 477–485, Jul. 2020.
[14] H. Xu and G. Srivastava, “Automatic recognition algorithm of traffic signs based on convolution neural network,” Multimedia Tools Appl., vol. 79, nos. 17–18, pp. 11551–11565, May 2020.
[15] C. Shen et al., “Multi-receptive field graph convolutional neural networks for pedestrian detection,” IET Intell. Transp. Syst., vol. 13, no. 9, pp. 1319–1328, Sep. 2019.
[16] A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H.-J. Zhang, “Image classification for content-based indexing,” IEEE Trans. Image Process., vol. 10, no. 1, pp. 117–130, Jan. 2001.
[17] M. V. Latte, S. Shidnal, B. S. Anami, and V. B. Kuligod, “A combined color and texture features based methodology for recognition of crop field image,” Int. J. Signal Process., Image Process. Pattern Recognit., vol. 8, no. 2, pp. 287–302, Feb. 2015.
[18] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. Van Gool, “Modeling scenes with local descriptors and latent aspects,” in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 1, Beijing, China, Oct. 2005, pp. 883–890.
[19] G. Guo and N. Zhang, “A survey on deep learning based face recognition,” Comput. Vis. Image Understand., vol. 189, Dec. 2019, Art. no. 102805.
[20] S. Ren, K. Sun, C. Tan, and F. Dong, “A two-stage deep learning method for robust shape reconstruction with electrical impedance tomography,” IEEE Trans. Instrum. Meas., vol. 69, no. 7, pp. 4887–4897, Jul. 2020.
[21] Z. Wang, K. Liu, J. Li, Y. Zhu, and Y. Zhang, “Various frameworks and libraries of machine learning and deep learning: A survey,” Arch. Comput. Methods Eng., pp. 1–24, Feb. 2019, doi: 10.1007/s11831-018-09312-w.
[22] E. Mutabazi, J. Ni, G. Tang, and W. Cao, “A review on medical textual question answering systems based on deep learning approaches,” Appl. Sci., vol. 11, no. 12, p. 5456, Jun. 2021.
[23] L. Chen, W. Zhan, W. Tian, Y. He, and Q. Zou, “Deep integration: A multi-label architecture for road scene recognition,” IEEE Trans. Image Process., vol. 28, no. 10, pp. 4883–4898, Oct. 2019.
[24] K. Zheng and H. A. H. Naji, “Road scene segmentation based on deep learning,” IEEE Access, vol. 8, pp. 140964–140971, 2020.
[25] P. Tang, H. Wang, and S. Kwong, “G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition,” Neurocomputing, vol. 225, pp. 188–197, Feb. 2017.
[26] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, “Knowledge guided disambiguation for large-scale scene classification with multi-resolution CNNs,” IEEE Trans. Image Process., vol. 26, no. 4, pp. 2055–2068, Apr. 2017.
[27] C. Dai et al., “Video scene segmentation using tensor-train faster-RCNN for multimedia IoT systems,” IEEE Internet Things J., vol. 8, no. 12, pp. 9697–9705, Jun. 2021.
[28] Z. Liu, Y. Lyu, L. Wang, and Z. Han, “Detection approach based on an improved faster RCNN for brace sleeve screws in high-speed railways,” IEEE Trans. Instrum. Meas., vol. 69, no. 7, pp. 4395–4403, Jul. 2020.
[29] C. Peng, K. Zhao, and B. C. Lovell, “Faster ILOD: Incremental learning for object detectors based on faster RCNN,” Pattern Recognit. Lett., vol. 140, pp. 109–115, Dec. 2020.
[30] G. Han, J.-P. Su, and C.-W. Zhang, “A method based on multi-convolution layers joint and generative adversarial networks for vehicle detection,” KSII Trans. Internet Inf. Syst., vol. 13, no. 4, pp. 1795–1811, 2019.
[31] Y. Tian et al., “Lane marking detection via deep convolutional neural network,” Neurocomputing, vol. 280, pp. 46–55, Mar. 2018.
[32] Z.-H. Feng, J. Kittler, M. Awais, and X.-J. Wu, “Rectified wing loss for efficient and robust facial landmark localisation with convolutional neural networks,” Int. J. Comput. Vis., vol. 128, nos. 8–9, pp. 2126–2145, 2020.
[33] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[34] N. Yao, G. Shan, and X. Zhu, “Substation object detection based on enhance RCNN model,” in Proc. 6th Asia Conf. Power Electr. Eng. (ACPEE), Chongqing, China, Apr. 2021, pp. 463–469.
[35] M. Guo, D. Xue, P. Li, and H. Xu, “Vehicle pedestrian detection method based on spatial pyramid pooling and attention mechanism,” Information, vol. 11, no. 12, pp. 1–15, 2020.
[36] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1–9.
[37] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 1800–1807.
[38] C. Zou and M. Wei, “Cluster-based deep convolutional networks for spectral reconstruction from RGB images,” Neurocomputing, vol. 464, pp. 342–351, Nov. 2021.
[39] Y. Ge, Z. Yang, Z. Huang, and F. Ye, “A multi-level feature fusion method based on pooling and similarity for HRRS image retrieval,” Remote Sens. Lett., vol. 12, no. 11, pp. 1090–1099, Nov. 2021.
[40] I. Delibasoglu and M. Cetin, “Improved U-Nets with inception blocks for building detection,” Proc. SPIE, vol. 14, no. 4, Nov. 2020, Art. no. 044512.
[41] X. Li, L. Yu, D. Chang, Z. Ma, and J. Cao, “Dual cross-entropy loss for small-sample fine-grained vehicle classification,” IEEE Trans. Veh. Technol., vol. 68, no. 5, pp. 4204–4212, May 2019.
[42] R. McCall et al., “A taxonomy of autonomous vehicle handover situations,” Transp. Res. A, Policy Pract., vol. 124, pp. 507–522, Jun. 2019.
[43] L. Zhang, L. Li, X. Pan, Z. Cao, Q. Chen, and H. Yang, “Multi-level ensemble network for scene recognition,” Multimedia Tools Appl., vol. 78, no. 19, pp. 28209–28230, Oct. 2019.
[44] M. Liu and D. Grana, “Accelerating geostatistical seismic inversion using TensorFlow: A heterogeneous distributed deep learning framework,” Comput. Geosci., vol. 124, pp. 37–45, Mar. 2019.
[45] W. Wang, Y. Hu, T. Zou, H. Liu, J. Wang, and X. Wang, “A new image classification approach via improved MobileNet models with local receptive field expansion in shallow layers,” Comput. Intell. Neurosci., vol. 2020, pp. 1–10, Aug. 2020.
[46] S. Liu, G. Tian, and Y. Xu, “A novel scene classification model combining ResNet based transfer learning and data augmentation with a filter,” Neurocomputing, vol. 338, pp. 191–206, Apr. 2019.
[47] K. M. Hosny, M. A. Kassem, and M. M. Fouad, “Classification of skin lesions into seven classes using transfer learning with AlexNet,” J. Digit. Imag., vol. 33, no. 5, pp. 1325–1334, 2020.
[48] H. Alhichri, A. S. Alswayed, Y. Bazi, N. Ammour, and N. A. Alajlan, “Classification of remote sensing images using EfficientNet-B3 CNN model with attention,” IEEE Access, vol. 9, pp. 14078–14094, 2021.
[49] R. K. Mohapatra, K. Shaswat, and S. Kedia, “Offline handwritten signature verification using CNN inspired by inception V1 architecture,” in Proc. 5th Int. Conf. Image Inf. Process. (ICIIP), Solan, India, Nov. 2019, pp. 263–267.
[50] M. Kasper-Eulaers, N. Hahn, S. Berger, T. Sebulonsen, Ø. Myrland, and P. E. Kummervold, “Short communication: Detecting heavy goods vehicles in rest areas in winter conditions using YOLOv5,” Algorithms, vol. 14, no. 4, p. 114, Mar. 2021.
[51] C. Wang et al., “Pulmonary image classification based on inception-v3 transfer learning model,” IEEE Access, vol. 7, pp. 146533–146541, 2019.
[52] F. Yu et al., “BDD100K: A diverse driving dataset for heterogeneous multitask learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 2633–2642.
[53] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain adaptive faster R-CNN for object detection in the wild,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 3339–3348.
[54] C. Chen, Z. Zheng, X. Ding, Y. Huang, and Q. Dou, “Harmonizing transferability and discriminability for adapting object detectors,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 8866–8875.
[55] Y. Song, Z. Liu, J. Wang, R. Tang, G. Duan, and J. Tan, “Multiscale adversarial and weighted gradient domain adaptive network for data scarcity surface defect detection,” IEEE Trans. Instrum. Meas., vol. 70, pp. 1–10, 2021.

Kang Shen received the B.S. degree from Hohai University, Changzhou, China, in 2020, where she is currently pursuing the M.S. degree with the Department of Detection Technology and Automatic Equipment, College of Internet of Things Engineering. Her research interests include self-driving, robot control, and machine learning.

Yinan Chen received the B.S. degree from Hohai University, Changzhou, China, in 2018, where she is currently pursuing the M.S. degree in communication and information systems with the College of Internet of Things Engineering. Her research interests include deep learning and image processing.

Weidong Cao (Member, IEEE) received the Ph.D. degree in mechanical engineering from Chongqing University, Chongqing, China, in 2018. He is currently a Lecturer with the College of Internet of Things Engineering, Hohai University, Changzhou, China. His research interests include swarm intelligence optimization algorithms and data-driven modeling.