Article
DetectFormer: Category-Assisted Transformer for Traffic Scene
Object Detection
Tianjiao Liang 1,2 , Hong Bao 1,2 , Weiguo Pan 1,2, * , Xinyue Fan 1,2 and Han Li 1,2
1 Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China;
[email protected] (T.L.); [email protected] (H.B.); [email protected] (X.F.);
[email protected] (H.L.)
2 College of Robotics, Beijing Union University, Beijing 100101, China
* Correspondence: [email protected]
Abstract: Object detection plays a vital role in autonomous driving systems, and the accurate detection of surrounding objects is essential for the safe driving of vehicles. This paper proposes a category-assisted transformer object detector called DetectFormer for autonomous driving. The proposed object detector achieves better accuracy than the baseline. Specifically, the ClassDecoder is assisted by proposal categories and by global information from the Global Extract Encoder (GEE) to improve category sensitivity and detection performance. This fits the distribution of object categories in specific scene backgrounds and exploits the connection between objects and the image context. Data augmentation is used to improve robustness, and an attention mechanism is added to the backbone network to extract channel-wise spatial features and direction information. Benchmark experiments reveal that the proposed method achieves higher real-time detection performance in traffic scenes than RetinaNet and FCOS, reaching a detection performance of 97.6% AP50 and 91.4% AP75 on the BCTSDB dataset.
2. Related Work
2.1. Object Detection
Traditional object detection uses HOG [4] or DPM [5] to extract image features, which are then fed into a classifier such as an SVM [6]. Chen et al. [7] used an SVM for traffic light detection. In recent years, deep learning-based object detection algorithms have achieved better accuracy than traditional methods and have become a research hotspot. Generally, there are two types of object detection based on deep convolutional networks: (1) multi-stage detection, such as the R-CNN series [8–10] and Cascade R-CNN [11]; (2) one-stage detection, also known as dense detection, which can be divided into anchor-based methods (for example, the You Only Look Once series [12–14] and RetinaNet [15]) and anchor-free methods (for example, FCOS [16], CenterNet [17], and CornerNet [18]). Multi-stage detection methods extract features of the foreground area using region proposal algorithms from preset dense candidates in the first stage, and the bounding boxes of objects are regressed in the subsequent stages. The limitation of this structure is that it reduces detection speed and cannot satisfy the real-time requirements of autonomous driving tasks. Single-stage detection methods, unlike multi-stage methods, directly detect objects and regress the bounding boxes, which avoids repeated computation of the feature map and obtains the anchor boxes directly
on the feature map. He et al. [19] proposed a detection method using CapsNet [20] based on visual inspection of traffic scenes. Li et al. [21] proposed an improved Faster R-CNN for multi-object detection in complex traffic environments. Lian et al. [22] proposed attention fusion for small traffic object detection. Liang et al. [23] proposed a lightweight anchor-free detector for traffic scene object detection. However, these models cannot capture global information because they are limited by the size of the receptive field. The above-mentioned approaches obtain only local information when extracting image features and enlarge the receptive field by increasing the size of the convolution kernel or stacking more convolution layers. In recent years, transformers have been introduced into computer vision as new attention-based building blocks; they have achieved superior performance because they can obtain the global information of the image without increasing the receptive field.
3. Proposed Method
The overall pipeline of the proposed method is shown in Figure 1. The main contributions of the proposed method are the following three parts: (1) an attention mechanism in the backbone network based on position information; (2) a Global Extract Encoder (GEE) that enhances the model's global perception ability; and (3) a novel learnable object relationship module called ClassDecoder. In addition, efficient data augmentation is used to improve the robustness of the model.
The encoder block contains two modules. The first module is the multi-head self-attention layer, and the second one is the feedforward network (FFN). Residual connections (⊕) are used between each sub-layer.
Figure 1. The overall architecture of the proposed method. The architecture can be divided into three parts: backbone, encoder, and decoder. The backbone network is used to extract image features, the encoder is used to enhance the model's global perception ability, and the decoder is used to detect the objects in traffic scenes.
We split the feature maps into patches and collapse the spatial dimensions of $f$ from $\mathbb{R}^{C \times H \times W}$ to a one-dimensional sequence $\mathbb{R}^{C \times HW}$. Because self-attention is permutation invariant, a fixed position embedding is added to the feature sequence $f' \in \mathbb{R}^{C \times HW}$ before it is fed into the GEE. Multi-head self-attention $H$ is applied to obtain information from different subspaces and positions:

$$W_i^{f(j)} = w^{(j)} f', \quad j = 1, 2, 3, \qquad (1)$$

$$h_i = \mathrm{Softmax}\left(\frac{W_i^{f(i)} \big(W_i^{f(j)}\big)^{T}}{\sqrt{HW/n}}\right) W_i^{f(k)}, \quad i \neq j \neq k, \qquad (2)$$

where $n$ is the number of attention heads. The scaling factor $\sqrt{HW/n}$ is consistent with the input dimensions, and the model can obtain long-distance regional relationships and global information rather than only local information when extracting object features.
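To make this step concrete, the sketch below flattens a backbone feature map to a sequence, adds a fixed position embedding, and applies one self-attention + feed-forward block with residual connections, mirroring the two sub-layers described above. The module name, head count, FFN width, LayerNorm placement, and the use of torch.nn.MultiheadAttention (instead of the explicit per-head projections of Equations (1) and (2)) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GEEBlockSketch(nn.Module):
    """One encoder block of the kind described above: multi-head self-attention
    followed by a feed-forward network, each wrapped in a residual connection."""

    def __init__(self, channels: int = 256, num_heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(channels, ffn_dim),
            nn.ReLU(inplace=True),
            nn.Linear(ffn_dim, channels),
        )
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # x, pos: (B, HW, C). The fixed position embedding is added to queries and
        # keys because self-attention is permutation invariant.
        q = k = x + pos
        attn_out, _ = self.attn(q, k, x)
        x = self.norm1(x + attn_out)      # residual connection around self-attention
        x = self.norm2(x + self.ffn(x))   # residual connection around the FFN
        return x


def flatten_feature_map(f: torch.Tensor) -> torch.Tensor:
    """Collapse the spatial dimensions: (B, C, H, W) -> (B, HW, C)."""
    return f.flatten(2).transpose(1, 2)
```

For example, a feature map of shape (2, 256, 25, 25) becomes a sequence of 625 tokens with 256 channels via flatten_feature_map, and each GEEBlockSketch pass enriches every token with global context from all spatial positions.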
Figure 3. Structure of ClassDecoder. Proposal categories learn the relationship between different categories and classify objects based on global information.
The ClassDecoder block requires two inputs: the feature sequence $G$ and the proposal categories $P$. The proposed ClassDecoder detects different categories of objects, using the proposal categories to predict the confidence vector of each category; the depth $n$ of the ClassDecoder represents the number of categories. Then, the convolution operation is used to generate the global descriptor of each vector. Finally, the softmax function is used to output the prediction result of the category:

$$f_p = \mathrm{Softmax}\left(\frac{G P^{T}}{\sqrt{d_k}}\right) P, \qquad (5)$$

$$C = \mathrm{Softmax}\big(\mathrm{FFN}(\varphi(f_p))\big), \qquad (6)$$

where $\varphi(\cdot)$ is the convolution operation that generates the global descriptor of each vector and $C$ is the predicted category confidence.
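A minimal PyTorch sketch of this idea is shown below. Following the Query role described in the Discussion, the learnable proposal-category vectors are used as queries over the global feature sequence G; since Equation (5) writes the product as GP^T, the exact query/key arrangement, together with the MLP head standing in for the convolution of Equation (6), should be read as our illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class ClassDecoderSketch(nn.Module):
    """Learnable proposal-category vectors attend to the global feature sequence G
    (cf. Eq. (5)) and are mapped to per-category confidences (cf. Eq. (6))."""

    def __init__(self, num_classes: int, dim: int = 256):
        super().__init__()
        # One learnable "proposal category" vector per object category (depth n).
        self.proposal_categories = nn.Parameter(torch.randn(num_classes, dim))
        self.scale = dim ** -0.5
        # Stand-in for the head that turns each per-category feature into a confidence.
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, 1))

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        # g: (B, HW, dim) global feature sequence produced by the GEE.
        p = self.proposal_categories                                       # (num_classes, dim)
        attn = torch.softmax(p @ g.transpose(1, 2) * self.scale, dim=-1)   # (B, num_classes, HW)
        f_p = attn @ g                                                     # per-category features
        logits = self.head(f_p).squeeze(-1)                                # (B, num_classes)
        return torch.softmax(logits, dim=-1)                               # per-category confidence
```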
$z_c^{g}$, $z_c^{h}(h)$, and $z_c^{w}(w)$ are defined as follows:

$$z_c^{g} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j), \qquad (7)$$

$$z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad (8)$$

$$z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w), \qquad (9)$$

$$P = \varphi\left(F\left[z_c^{g}, z_c^{h}(h), z_c^{w}(w)\right]\right), \qquad (10)$$

where $x_c$ is the input feature from the previous layer associated with channel $c$, $\varphi(\cdot)$ is the convolution operation, and $F[\cdot]$ is the concatenation operation. The different components of $P$ are then passed through their respective convolution layers $\varphi(\cdot)$, and the normalization is activated by the sigmoid function $\sigma(\cdot)$; the final output $y_c$ is the product of the original feature map and the information weights:

$$f^{w} = \sigma\big(\varphi^{w}(P^{w})\big), \qquad (11)$$
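The following PyTorch sketch shows one way to realize Equations (7)–(11), in the spirit of coordinate attention [36]: global, H-direction, and W-direction descriptors are pooled, jointly encoded by a 1×1 convolution, split back into channel and direction weights, and multiplied onto the input features. The shared encoder, the explicit channel-weight branch, the reduction ratio, and the BatchNorm are our assumptions; the paper's Equation (11) only writes out the W-direction branch.

```python
import torch
import torch.nn as nn


class BackboneAttentionSketch(nn.Module):
    """Sketch of Eqs. (7)-(11): a global descriptor plus H- and W-direction
    descriptors are concatenated and encoded, split back into per-channel and
    per-direction weights, and multiplied onto the input feature map."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.encode = nn.Sequential(                              # phi(F[.]) of Eq. (10)
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.conv_g = nn.Conv2d(mid, channels, kernel_size=1)     # channel weight branch
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)     # phi^h, H-direction branch
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)     # phi^w of Eq. (11)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_g = x.mean(dim=(2, 3), keepdim=True)             # Eq. (7): (B, C, 1, 1)
        z_h = x.mean(dim=3, keepdim=True)                  # Eq. (8): (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).transpose(2, 3)  # Eq. (9): (B, C, W, 1)
        p = self.encode(torch.cat([z_g, z_h, z_w], dim=2))        # Eq. (10)
        p_g, p_h, p_w = torch.split(p, [1, h, w], dim=2)
        a_g = torch.sigmoid(self.conv_g(p_g))                     # (B, C, 1, 1)
        a_h = torch.sigmoid(self.conv_h(p_h))                     # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(p_w.transpose(2, 3)))     # (B, C, 1, W)
        return x * a_g * a_h * a_w                                # reweighted features y_c
```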
Figure 4. The attention mechanism in the backbone network. We propose the global encoding for channel-wise spatial information and extract X and Y direction information for the location attention features.
3.4. Data Augmentation
Traffic scene object detection is usually affected by light, weather, and other factors. Data-driven deep neural networks require a large number of labeled images to train the model, and most traffic scene datasets cannot cover all complex environmental conditions. In this paper, we use three types of data augmentation methods: global pixel level, spatial level, and object level, as shown in Figure 5. Specifically, we use Brightness Contrast, Blur, and Channel Dropout for illumination transformation; Rain, Sun Flare, and Cutout [37] for spatial-level data augmentation; and Mixup and CutMix [38] for object-level augmentation. The data augmented by these methods can simulate complex traffic scenarios, which improves the detection robustness of the model.
Figure 5. Efficient data augmentation for traffic scene images. Different augmentation methods are used to simulate the complex environment.
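As an illustration, the following sketch builds such a per-image pipeline, assuming the Albumentations library (whose transform names match those listed above); the probabilities and magnitudes are placeholders, and Mixup/CutMix [38], which mix pairs of images at batch time, sit outside a per-image pipeline.

```python
import albumentations as A

# Illustrative per-image augmentation pipeline; transform choices mirror the
# names in the text, while probabilities/magnitudes are placeholders.
train_transform = A.Compose(
    [
        # Global pixel level: illumination transformation
        A.RandomBrightnessContrast(p=0.5),
        A.Blur(blur_limit=3, p=0.3),
        A.ChannelDropout(p=0.1),
        # Spatial level: simulated weather and Cutout-style occlusion [37]
        A.OneOf([A.RandomRain(p=1.0), A.RandomSunFlare(p=1.0)], p=0.3),
        A.CoarseDropout(p=0.3),
    ],
    # Keep bounding boxes consistent with the augmented image.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

# Usage: out = train_transform(image=img, bboxes=boxes, labels=class_ids)
# gives out["image"], out["bboxes"], out["labels"] for training.
```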
4.2. Datasets
Detection performance in traffic scenes is evaluated on the BCTSDB [39] and KITTI [40] datasets, and the COCO [41] dataset is used to evaluate the generalization ability of the model. The KITTI dataset contains 7481 training images and 7518 test images, totaling 80,256 labeled objects in three categories (vehicle, pedestrian, and cyclist). The BCTSDB dataset contains 15,690 traffic sign images, including 25,243 labeled traffic signs. The COCO dataset includes 80 object categories and more than 220 K labeled images.
4.4. Performances
We first evaluate the effectiveness of the different proposed units. The ClassDecoder
head, Global Extract Encoder, Attention, Anchor-free head, and Data augmentation are
gradually added to the RetinaNet baseline on the COCO and BCTSDB datasets to test the generalization ability of the proposed method and its detection ability in traffic scenes, as shown in Tables 1 and 2, respectively.
Table 1. Ablation results on the COCO dataset.

| Methods | Parameters (M) | FLOPs (G) | AP (%) | AP50 (%) | AP75 (%) |
|---|---|---|---|---|---|
| RetinaNet baseline | 37.74 | 95.66 | 32.5 | 50.9 | 34.8 |
| +ClassDecoder | 35.03 (−2.71) | 70.30 (−25.36) | 34.6 (+2.1) | 53.5 (+2.6) | 36.1 (+1.3) |
| +Global Extract Encoder | 36.95 (+1.92) | 90.45 (+20.15) | 36.2 (+1.6) | 55.7 (+2.2) | 37.8 (+1.7) |
| +Attention | 37.45 (+0.5) | 90.65 (+0.2) | 38.3 (+2.1) | 58.3 (+2.6) | 39.3 (+1.5) |
| +Anchor-free | 37.31 (−0.14) | 89.95 (−0.7) | 38.9 (+0.6) | 59.1 (+0.8) | 39.6 (+0.3) |
| +Data augmentation | 37.31 (+0) | 89.95 (+0) | 41.3 (+2.4) | 61.8 (+2.7) | 41.5 (+1.9) |
Table 2. Ablation results on the BCTSDB dataset.

| Methods | Parameters (M) | FLOPs (G) | AP (%) | AP50 (%) | AP75 (%) |
|---|---|---|---|---|---|
| RetinaNet baseline | 37.74 | 95.66 | 59.7 | 89.4 | 71.2 |
| +ClassDecoder | 35.03 (−2.71) | 70.30 (−25.36) | 61.6 (+3.7) | 91.8 (+2.4) | 75.8 (+4.6) |
| +Global Extract Encoder | 36.95 (+1.92) | 90.45 (+20.15) | 63.4 (+3.4) | 93.9 (+2.1) | 80.6 (+4.8) |
| +Attention | 37.45 (+0.5) | 90.65 (+0.2) | 65.2 (+3.1) | 95.1 (+1.2) | 84.2 (+3.6) |
| +Anchor-free | 37.31 (−0.14) | 89.95 (−0.7) | 65.8 (+2.1) | 95.7 (+0.6) | 87.4 (+3.2) |
| +Data augmentation | 37.31 (+0) | 89.95 (+0) | 76.1 (+4.1) | 97.6 (+1.9) | 91.4 (+4.0) |
Figure 6. The loss curves for different initialization methods.
Table 5 presents the classification performance of the baseline methods and of the proposed method on the BCTSDB dataset. RetinaNet and FCOS were used as the anchor-based and anchor-free baselines, respectively. The experimental results reveal that DetectFormer helps improve the classification ability of the model. Remarkably, DetectFormer also reduces the computation and parameter count of the detection network.
Table 5. Classification performance of different methods on the BCTSDB dataset.

| Model | Backbone | Head | Params. (M) | FLOPs (G) | Top-1 Acc. (%) | Top-5 Acc. (%) |
|---|---|---|---|---|---|---|
| RetinaNet [15] | ResNet50 | Anchor-based | 37.74 | 95.66 | 96.8 | 98.9 |
| FCOS [16] | ResNet50 | Anchor-free | 31.84 | 78.67 | 98.2 | 99.1 |
| Ours | ResNet50 | Anchor-free | 37.31 | 89.95 | 98.7 | 99.5 |
The convergence curves of DetectFormer and other SOTA (state-of-the-art) methods, including RetinaNet, DETR, Faster R-CNN, FCOS, and YOLOv5, are shown in Figure 7, which illustrates that DetectFormer achieves better performance with efficient training and accurate detection. The vertical axis is the detection accuracy.
Figure 7. The detection results with different methods on the BCTSDB dataset. Our model can achieve higher detection accuracy in shorter training epochs. In particular, DETR requires more than 200 training epochs for high precision detection.
Table 6 shows the detection results on the BCTSDB dataset produced by multi-stage methods (e.g., Faster R-CNN, Cascade R-CNN) and single-stage methods, including anchor-based methods (e.g., YOLOv3, RetinaNet) and the anchor-free method FCOS. DetectFormer shows high detection accuracy and more competitive performance: the AP, AP50, and AP75 are 76.1%, 97.6%, and 84.3%, respectively. DetectFormer can fit the distribution of object categories and boost detection confidence in the field of autonomous driving better than other networks.
The proposed method was also evaluated on the KITTI dataset. As shown in Table 7,
compared with other methods, DetectFormer shows better detection results.
Figure 8. Precision curves of the proposed method and RetinaNet. Our model has high detection accuracy even with low confidence.
Figure 9. Confusion matrix of the proposed method and RetinaNet. The darker the block, the larger the value it represents. Compared with RetinaNet, the proposed method can obtain more category information and help to classify the objects. (a) Confusion matrix of RetinaNet. (b) Confusion matrix of DetectFormer.
The detection results on the KITTI and BCTSDB datasets are shown in Figures 10 and 11, respectively. The results demonstrate the proposed method's effectiveness in traffic scenarios. Three types of traffic signs on the BCTSDB dataset (warning, prohibitory, and mandatory) and three types of traffic objects on the KITTI dataset (car, pedestrian, and cyclist) were detected. The detection results do not include other types of traffic objects, such as a motorcycle in Figure 10, but the proposed model can detect those kinds of objects.
Figure 10. Detection results on the KITTI dataset. Our method can detect different objects in traffic scenes accurately, and even identify overlapping objects and dense objects.
Figure 11. Detection results on the BCTSDB dataset. Our method can detect traffic signs at different scales with high precision.
5. Discussion
Why can ClassDecoder improve the classification ability of models? In this paper, we propose ClassDecoder to improve the classification ability, which is designed based on the transformer architecture without any convolution operations. The model interacts with different background feature maps in scaled dot-product attention and multi-head attention by using proposal categories, and learns the implicit relationship between the background and the category by using the key-value pair idea in the Transformer. The number of
proposal categories is equal to the number of object categories, and the parameters of
proposal categories are learnable. The input of the ClassDecoder is the feature maps and the proposal categories, and the output is the predicted category of the current bounding box.
The output dimensions are the same as those of the proposal categories, and the proposal
categories are associated with the output in the role of Query (Query-Key-Value relationship
in transformer architecture). It can be understood that the proposal categories are vectors
that can be learned, and their quantity represents the confidence vectors corresponding to
different categories of the current bounding box. Then, the model converts the confidence
vector into category confidence through a feed-forward network. The category with the
highest confidence is the category of the predicted bounding box.
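A minimal sketch of this final decision step, under assumed dimensions and an assumed feed-forward layout, is given below.

```python
import torch
import torch.nn as nn

# The confidence vectors (one learnable-query output per category for the current
# bounding box) are converted into scalar confidences by a feed-forward network,
# and the highest-scoring category is taken as the prediction. Dimensions, the
# FFN layout, and the class order are illustrative assumptions.
num_classes, dim = 3, 256
ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

confidence_vectors = torch.randn(num_classes, dim)             # ClassDecoder output for one box
scores = torch.softmax(ffn(confidence_vectors).squeeze(-1), dim=0)
predicted_class = scores.argmax().item()                       # e.g., 0 = warning, 1 = prohibitory, 2 = mandatory
```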
6. Conclusions
This paper proposes a novel object detector called DetectFormer, which is assisted by
a transformer to learn the relationship between objects in traffic scenes. By introducing the
GEE and ClassDecoder, this study focused on fitting the distribution of object categories
to specific scene backgrounds and implicitly learning the object category relationships
to improve the sensitivity of the model to the categories. The results obtained by experi-
ments on the KITTI and BCTSDB datasets reveal that the proposed method can improve
the classification ability and achieve outstanding performance in complex traffic scenes.
The AP50 and AP75 of the proposed method are 97.6% and 91.4% on BCTSDB, and the
average accuracies of car, pedestrian, and cyclist are 86.6%, 79.5%, and 81.7% on KITTI,
respectively, which indicates that the proposed method achieves better results compared to
other methods. The proposed method improves detection accuracy, but it still faces many challenges when applied to natural traffic scenarios: the experiments in this paper were conducted on public datasets, whereas real traffic scenes involve complex lighting and weather conditions. Our future work will focus on object detection in open environments and on deploying the model on vehicles.
Author Contributions: Conceptualization, T.L. and W.P.; methodology, H.B.; software, T.L. and W.P.;
validation, X.F. and H.L.; writing—original draft preparation, T.L.; writing—review and editing, T.L.
and W.P. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (Nos. 61802019, 61932012, 61871039, 61906017 and 62006020), the Beijing Municipal Education Commission Science and Technology Program (Nos. KM201911417003, KM201911417009 and KM201911417001), the Beijing Union University Research and Innovation Projects for Postgraduates (No. YZ2020K001), and the Premium Funding Project for Academic Human Resources Development in Beijing Union University under Grant BPHR2020DZ02.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
APM AP for objects of medium scales (32² < area < 96²)
APS AP for objects of small scales (area < 32²)
BCTSDB BUU Chinese Traffic Sign Detection Benchmark
CNN Convolutional Neural Network
FLOPs Floating-point operations
FPN Feature pyramid network
FPS Frames Per Second
GEE Global Extract Encoder
HOG Histogram of Oriented Gradients
IoU Intersection over union
LSTM Long Short-Term Memory
NMS Non-Maximum Suppression
RNN Recurrent Neural Network
SOTA State-of-the-art
SSD Single Shot MultiBox Detector
SVM Support Vector Machine
VRAM Video random access memory
YOLO You Only Look Once
References
1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you
need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA,
4–9 December 2017; Curran Associates Inc.: Long Beach, CA, USA, 2017; pp. 6000–6010.
2. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
3. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In
Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer
International Publishing: Cham, Switzerland, 2020; Volume 12346, pp. 213–229, ISBN 978-3-030-58451-1.
4. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: San Diego, CA,
USA, 2005; Volume 1, pp. 886–893.
5. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models.
IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [CrossRef]
6. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13,
18–28. [CrossRef]
7. Chen, Z.; Shi, Q.; Huang, X. Automatic detection of traffic lights using support vector machine. In Proceedings of the 2015 IEEE
Intelligent Vehicles Symposium (IV), Seoul, Korea, 28 June–1 July 2015; IEEE: Seoul, Korea, 2015; pp. 37–40.
8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
IEEE: Columbus, OH, USA, 2014; pp. 580–587.
9. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile,
7–13 December 2015; IEEE: Santiago, Chile, 2015; pp. 1440–1448.
10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef]
11. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Salt Lake City, UT,
USA, 2018; pp. 6154–6162.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las
Vegas, NV, USA, 2016; pp. 779–788.
13. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 6517–6525.
14. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
15. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell.
2020, 42, 318–327. [CrossRef]
16. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A Simple and Strong Anchor-free Object Detector. IEEE Trans. Pattern Anal. Mach. Intell.
2020, 44, 1922–1933. [CrossRef]
17. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
18. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656. [CrossRef]
19. He, S.; Chen, L.; Zhang, S.; Guo, Z.; Sun, P.; Liu, H.; Liu, H. Automatic Recognition of Traffic Signs Based on Visual Inspection.
IEEE Access 2021, 9, 43253–43261. [CrossRef]
20. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9
December 2017; pp. 3859–3869.
21. Li, C.; Qu, Z.; Wang, S.; Liu, L. A method of cross-layer fusion multi-object detection and recognition based on improved faster
R-CNN model in complex traffic environment. Pattern Recognit. Lett. 2021, 145, 127–134. [CrossRef]
22. Lian, J.; Yin, Y.; Li, L.; Wang, Z.; Zhou, Y. Small Object Detection in Traffic Scenes Based on Attention Feature Fusion. Sensors 2021,
21, 3031. [CrossRef]
23. Liang, T.; Bao, H.; Pan, W.; Pan, F. ALODAD: An Anchor-Free Lightweight Object Detector for Autonomous Driving. IEEE Access
2022, 10, 40701–40714. [CrossRef]
24. Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Studies in Computational
Intelligence; Springer: Berlin/Heidelberg, Germany, 2012; Volume 385, pp. 37–45, ISBN 978-3-642-24796-5.
25. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations
using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [CrossRef]
26. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv 2014, arXiv:1409.2329. [CrossRef]
27. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The
Sparsely-Gated Mixture-of-Experts Layer. arXiv 2017, arXiv:1701.06538. [CrossRef]
28. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s
Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144.
[CrossRef]
29. Wang, H.; Wu, Z.; Liu, Z.; Cai, H.; Zhu, L.; Gan, C.; Han, S. HAT: Hardware-Aware Transformers for Efficient Natural Language
Processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020;
Association for Computational Linguistics: Online, 2020; pp. 7675–7688.
30. Floridi, L.; Chiriatti, M. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds Mach. 2020, 30, 681–694. [CrossRef]
31. Dong, L.; Xu, S.; Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB,
Canada, 5–20 April 2018; IEEE: Calgary, AB, Canada, 2018; pp. 5884–5888.
32. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [CrossRef]
33. Yan, H.; Ma, X.; Pu, Z. Learning Dynamic and Hierarchical Traffic Spatiotemporal Features With Transformer. IEEE Trans. Intell.
Transport. Syst. 2021, 1–14. [CrossRef]
34. Cai, L.; Janowicz, K.; Mai, G.; Yan, B.; Zhu, R. Traffic transformer: Capturing the continuity and periodicity of time series for
traffic forecasting. Trans. GIS 2020, 24, 736–755. [CrossRef]
35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari,
V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham,
Switzerland, 2018; Volume 11211, pp. 3–19, ISBN 978-3-030-01233-5.
36. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN,
USA, 2021; pp. 13708–13717.
37. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552.
38. Yun, S.; Han, D.; Chun, S.; Oh, S.J.; Yoo, Y.; Choe, J. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable
Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2
November 2019; IEEE: Seoul, Korea, 2019; pp. 6022–6031.
39. Liang, T.; Bao, H.; Pan, W.; Pan, F. Traffic Sign Detection via Improved Sparse R-CNN for Autonomous Vehicles. J. Adv. Transp.
2022, 2022, 3825532. [CrossRef]
40. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
[CrossRef]
41. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science;
Springer International Publishing: Cham, Switzerland, 2014; Volume 8693, pp. 740–755, ISBN 978-3-319-10601-4.
42. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection
Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
43. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101.
44. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
45. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; JMLR Workshop and Conference
Proceedings; pp. 249–256.
46. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July
2017; IEEE: Honolulu, HI, USA, 2017; pp. 936–944.
47. Koonce, B. MobileNetV3. In Convolutional Neural Networks with Swift for Tensorflow; Apress: Berkeley, CA, USA, 2021; pp. 125–144,
ISBN 978-1-4842-6167-5.
48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016;
pp. 770–778.
49. Wang, X.; Yang, M.; Zhu, S.; Lin, Y. Regionlets for Generic Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37,
2071–2084. [CrossRef]
50. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D Object Detection for Autonomous Driving. In
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June
2016; IEEE: Las Vegas, NV, USA, 2016; pp. 2147–2156.
51. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection.
In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer
International Publishing: Cham, Switzerland, 2016; Volume 9908, pp. 354–370, ISBN 978-3-319-46492-3.
52. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer
Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International
Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37, ISBN 978-3-319-46447-3.
53. Yi, J.; Wu, P.; Metaxas, D.N. ASSD: Attentive single shot multibox detector. Comput. Vis. Image Underst. 2019, 189, 102827.
[CrossRef]
54. Liu, S.; Huang, D.; Wang, Y. Receptive Field Block Net for Accurate and Fast Object Detection. In Computer Vision—ECCV 2018;
Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing:
Cham, Switzerland, 2018; Volume 11215, pp. 404–419, ISBN 978-3-030-01251-9.