
sensors

Article
DetectFormer: Category-Assisted Transformer for Traffic Scene
Object Detection
Tianjiao Liang 1,2 , Hong Bao 1,2 , Weiguo Pan 1,2, * , Xinyue Fan 1,2 and Han Li 1,2

1 Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China;
[email protected] (T.L.); [email protected] (H.B.); [email protected] (X.F.);
[email protected] (H.L.)
2 College of Robotics, Beijing Union University, Beijing 100101, China
* Correspondence: [email protected]

Abstract: Object detection plays a vital role in autonomous driving systems, and the accurate detec-
tion of surrounding objects can ensure the safe driving of vehicles. This paper proposes a category-
assisted transformer object detector called DetectFormer for autonomous driving. The proposed
object detector can achieve better accuracy compared with the baseline. Specifically, ClassDecoder
is assisted by proposal categories and global information from the Global Extract Encoder (GEE)
to improve the category sensitivity and detection performance. This fits the distribution of object
categories in specific scene backgrounds and the connection between objects and the image context.
Data augmentation is used to improve robustness, and an attention mechanism is added to the backbone network to extract channel-wise spatial features and direction information. The results obtained by benchmark experiments reveal that the proposed method can achieve higher real-time detection
performance in traffic scenes compared with RetinaNet and FCOS. The proposed method achieved a
detection performance of 97.6% and 91.4% in AP50 and AP75 on the BCTSDB dataset, respectively.

Keywords: autonomous driving; deep learning; object detection; transformer

Citation: Liang, T.; Bao, H.; Pan, W.; Fan, X.; Li, H. DetectFormer: Category-Assisted Transformer for Traffic Scene Object Detection. Sensors 2022, 22, 4833. https://doi.org/10.3390/s22134833

Academic Editor: Giovanni Pau

Received: 30 May 2022
Accepted: 24 June 2022
Published: 26 June 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
Vision-based object detection in traffic scenes plays a crucial role in autonomous driving systems. With the rapid development of autonomous driving, the performance of object detection has made significant progress. Traffic objects (e.g., traffic signs, vehicles, and pedestrians) can be detected automatically by extracting their features. The result of perceiving the traffic scenario can ensure the safety of the autonomous vehicle. This kind of method can be divided into anchor-based and anchor-free.

Deep-learning-based object detection can be divided into single-stage and multi-stage object detection. The multi-stage algorithms extract the region of interest first, and then the location of the object is determined in these candidate areas. The single-stage algorithms output the location and category with dense bounding boxes directly on the original image. These detection algorithms classify each anchor box or key point and detect different categories independently, while ignoring the relationships between categories. There exist specific relationships between different objects, such as the probability, location, and scale of different objects in a particular environment, which are essential for object detection and can improve object detection accuracy.

This relationship between categories exists in many cases in traffic scenarios. For example, pedestrians appearing in highway scenes and vehicles appearing on the pedestrian path are low-probability events, which indicates the connection between object categories and scenarios. Secondly, the signs “Passing” and “No Passing” should not appear in the same scene, which indicates the connection between different object categories. There exist specific implicit relationships between object categories and the background of traffic scenes. Existing object detection methods do not consider this relationship in scenes, and their classification subnetwork is trained to independently classify different objects as individuals without the objects knowing each other, which results in the model underperforming in terms of fitting the distribution of objects and the scene background. Additionally, the model does not thoroughly learn the features required by the detection task and will cause a gap in the classification confidence between categories, which influences the detection performance.
Based on the above-mentioned assumptions, this paper proposes a category-assisted
transformer object detector to learn the relationships between different objects called
DetectFormer, based on the single-stage method. The motivation of this study was to allow
the classification subnetwork to better fit the distribution of object categories with specific
scene backgrounds and ensure that the network model is more focused on this relationship.
Transformer [1] is widely used in natural language processing, machine translation,
and computer vision because of its ability to perceive global information. Specifically, the
vision transformer (ViT) [2] and DETR [3] have been proposed and applied to computer
vision. Previous studies have used transformers to capture global feature information
and reallocate network attention to features, which is called self-attention. In this study,
DetectFormer was built based on the transformer concept. Still, the inputs and structure of
the multi-head attention mechanism are different because the purpose of DetectFormer is
to improve the detection accuracy with the assistance of category information.
The contributions of this study are as follows:
(1) The Global Extract Encoder (GEE) is proposed to extract the global information of
the image features output by the backbone network, enhancing the model’s global
perception ability.
(2) A novel category-assisted transformer called ClassDecoder is proposed. It can learn
the object category relationships and improve the model’s sensitivity by implicitly
learning the relationships between objects.
(3) The attention mechanism is added to the backbone network to capture cross-channel,
direction-aware and position-sensitive information during feature extraction.
(4) Efficient data augmentation methods are proposed to enhance the diversity of the
dataset and improve the robustness of model detection.
The rest of this paper is organized as follows. In Section 2, we introduce object
detection algorithms and transformer structure. Details of the proposed DetectFormer
are presented in Section 3. In Section 4, the model’s implementation is discussed, and the
model is compared with previous methods. The conclusions and direction of future work
are discussed in Section 5.

2. Related Work
2.1. Object Detection
Traditional object detection uses HOG [4] or DPM [5] to extract the image features, and then feeds them into a classifier such as SVM [6]. Chen et al. [7] used SVM for traffic light detection. In recent years, deep-learning-based object detection algorithms have
achieved better performance in terms of accuracy compared with traditional methods and
have become a research hotspot. Generally, there are two types of object detection based
on deep convolutional networks: (1) multi-stage detection, such as R-CNN series [8–10],
and Cascade R-CNN [11]; (2) one-stage detection, which is also known as the dense
detector and can be divided into anchor-based methods (for example, the You Only Look
Once series [12–14] and RetinaNet [15]) and anchor-free methods (for example, FCOS [16],
CenterNet [17], and CornerNet [18]). Multi-stage detection methods extract features of the
foreground area using region proposal algorithms from preset dense candidates in the first
stage. The bounding boxes of objects are regressed in the subsequent steps. The limitation
of this structure is that it reduces the detection speed and cannot satisfy the real-time
requirements of autonomous driving tasks. Single-stage detection methods directly detect
the object and regress the bounding boxes, different from multi-stage methods, which avoids repeated calculation of the feature map and obtains the anchor boxes directly
on the feature map. He et al. [19] proposed a detection method using CapsNet [20] based
on visual inspection of traffic scenes. Li et al. [21] proposed improved Faster R-CNN for
multi-object detection in complex traffic environments. Lian et al. [22] proposed attention
fusion for small traffic object detection. Liang et al. [23] proposed a light-weight anchor-free
detector for traffic scene object detection. However, their models cannot capture global
information limited by the size of the receptive field. The above-mentioned approaches
obtain local information when extracting image features, and enlarge the receptive field by
increasing the size of the convolution kernel or stacking the number of convolution layers.
In recent years, transformers have been introduced as new attention-based building blocks applied to computer vision; they have achieved superior performance because they can
obtain the global information of the image without increasing the receptive field.

2.2. Transformers Structure


The transformer is an encoder–decoder architecture introduced by Vaswani et al. [1]. It was first used in machine translation and has better performance than LSTM [24], GRU [25], and RNN-based models [26] (MoE [27], GNMT [28]) in translation tasks. The transformer extracts features by aggregating global information, making it well suited for long-sequence prediction and other information-heavy tasks, and it outperforms other RNN-based models in natural language processing [29,30], speech processing [31], and transfer learning [32].
It is comparable to the performance of CNN in computer vision as a new framework.
Alexey et al. [2] proposed a vision transformer, which applied a transformer to computer
vision and image classification tasks. Nicolas et al. [3] proposed DETR, which applied a
transformer to object detection task. Yan et al. use a transformer to predict long-term traffic
flow [33]. Cai et al. [34] use a transformer to capture the spatial dependency for continuity
and periodicity time series.
Although the transformer structure shows strong performance, the training based on
the transformer takes a long time and requires large datasets and ideal pre-
training. This paper proposes a learnable object relationship module based on a transformer
with self-attention, and a single-stage detector was designed to complete the task of traffic
scene object detection. Compared with other methods, the proposed method achieves
better detection performance in a shorter training time.
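As a generic illustration of the self-attention operation these works rely on (our own sketch, not code from any of the cited papers), scaled dot-product attention over a token sequence can be written in a few lines:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Generic scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise similarity of all tokens
    return torch.softmax(scores, dim=-1) @ v        # every output mixes all inputs (global context)

tokens = torch.randn(1, 400, 256)                   # e.g., a flattened 20x20 feature map
out = scaled_dot_product_attention(tokens, tokens, tokens)   # self-attention: q = k = v
```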

3. Proposed Method
The overall pipeline of our proposed method is shown in Figure 1. The main contri-
butions of the proposed method are the following three parts: (1) attention mechanism
in backbone network based on position information; (2) the Global Extract Encoder can
enhance the model’s global perception ability; (3) a novel learnable object relationship
module called ClassDecoder. Finally, efficient data augmentation was used to improve the
robustness of the model.

3.1. Global Extract Encoder


The convolutional neural network is usually affected by the kernel size, network depth, and other factors, causing the receptive field to fall short of covering the whole area of the image, which makes it challenging to learn the relationship between long-distance regions or pixels. When extracting the features of the object, the network cannot obtain global information.
Inspired by the transformer architecture and the vision transformer, this study designed the Global Extract Encoder (GEE) to enhance the model's global perception ability. As shown in Figure 1, the GEE accepts the image features $f \in \mathbb{R}^{C \times H \times W}$ extracted from the backbone network, performs global information perception on $f$, and sends $f_{out}$ to the following Decoder for object detection. The typical values used in this study are $C = 2048$ and $H, W = \frac{H_I}{32}, \frac{W_I}{32}$, where $H_I, W_I$ are the height and width of the original image $x_{in} \in \mathbb{R}^{3 \times H_I \times W_I}$. The structure of GEE is shown in Figure 2 and consists of two primary modules. The first module is the multi-head self-attention layer, and the second one is the feedforward network (FFN). Residual connections are used between each sub-layer.
Figure 1. The overall architecture of the proposed method. The architecture can be divided into three parts: backbone, encoder, and decoder. The backbone network is used to extract image features, the encoder is used to enhance the model's global perception ability, and the decoder is used to detect the objects in traffic scenes.

Figure 2. Structure of Global Extract Encoder. The multi-head self-attention learns the global information from the feature maps, and the feedforward network enables the Global Extract Encoder to acquire the ability of nonlinear fitting.
We split the feature maps into patches and collapsed the spatial dimensions of $f$ from $\mathbb{R}^{C \times H \times W}$ to a one-dimensional sequence $\mathbb{R}^{C \times HW}$. Then, a fixed position embedding is added to the feature sequence $f' \in \mathbb{R}^{C \times HW}$ owing to permutation invariance, and the result is fed into the GEE. Information from different subspaces and positions is obtained by the multi-head self-attention $\mathcal{H}$:

$$W_i^{f(j)} = w^{(j)} f', \quad j = 1, 2, 3, \tag{1}$$

$$h_i = \mathrm{Softmax}\!\left(\frac{W_i^{f(i)} \big(W_i^{f(j)}\big)^{T}}{\sqrt{HW/n}}\right) W_i^{f(k)}, \quad i \neq j \neq k, \tag{2}$$

$$\mathcal{H} = \mathrm{Concat}(h_1, h_2, \ldots, h_n)\, w^{(\mathcal{H})}, \tag{3}$$

where the projection matrices $w^{(j)} \in \mathbb{R}^{c \times HW}$, $j = 1, 2, 3$, $w^{(\mathcal{H})} \in \mathbb{R}^{nHW \times c}$, and $n$ denotes the number of heads. The feedforward network (FFN) gives the GEE the ability of nonlinear fitting. After global feature extraction, $f'$ expands the spatial dimension back into $C \times H \times W$. Thus, the dimensions of the GEE module output $f_{out} \in \mathbb{R}^{C \times H \times W}$ are consistent with the input dimensions, and the model can obtain long-distance regional relationships and global information rather than only local information when extracting object features.
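To make the GEE design concrete, the following is a minimal PyTorch-style sketch of a GEE-like block under our own assumptions: the module name, the use of nn.MultiheadAttention, the LayerNorm placement, and the fixed position-embedding size are illustrative choices rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class GlobalExtractEncoder(nn.Module):
    """Sketch of a GEE-style block: multi-head self-attention + FFN with
    residual connections over flattened C x H x W backbone features."""
    def __init__(self, channels=2048, num_heads=8, ffn_dim=1024, max_tokens=400):
        super().__init__()
        # fixed (non-learnable) position embedding added to the token sequence
        self.pos_embed = nn.Parameter(torch.randn(1, max_tokens, channels), requires_grad=False)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(channels, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, channels))
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, f):                       # f: (B, C, H, W)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)   # (B, HW, C): collapse spatial dims
        tokens = tokens + self.pos_embed[:, : h * w]
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm1(tokens + attn_out)          # residual connection
        tokens = self.norm2(tokens + self.ffn(tokens))  # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # back to (B, C, H, W)

# usage: a 640x640 input downsampled by 32 gives a 20x20 feature map
gee = GlobalExtractEncoder(channels=2048)
f_out = gee(torch.randn(2, 2048, 20, 20))   # -> torch.Size([2, 2048, 20, 20])
```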

3.2. Class Decoder


To learn the object category relationships and improve the model’s sensitivity to the
categories by implicitly learning the relationships between objects, a novel learnable object
relationship module called ClassDecoder is proposed. The structure of ClassDecoder is
shown in Figure 3 and is similar to the transformer architecture. However, this study
disregarded the self-attention mechanism, the core of transformer blocks, and designed
a module from the perspective of object categories to implicitly learn the relationship
between categories, including the foreground and background. Here, 1 × 1 convolution
was used to reduce the channel dimension of the global feature map $f_{out}$ from C to a smaller dimension m, and the spatial dimensions were collapsed to create a new feature sequence $G \in \mathbb{R}^{m \times HW}$:

$$G = F(\varphi(f_{out})), \tag{4}$$

where $\varphi(\cdot)$ denotes the 1 × 1 convolutional operation that reduces the channel dimension of $f_{out}$, and $F(\cdot)$ denotes the collapse operator, which transforms two-dimensional feature matrices into feature sequences.

Figure 3. Structure of ClassDecoder. Proposal categories learn the relationship between different categories and classify objects based on Global Information.

The ClassDecoder block requires two inputs: the feature sequence G and the proposal categories P. The proposed ClassDecoder is designed to detect different categories of objects, using proposal categories to predict the confidence vector of each category; the depth n of ClassDecoder represents the number of categories. Then, the convolution operation is used to generate the global descriptor of each vector. Finally, the softmax function is used to output the prediction result of the category:

$$f_p = \mathrm{Softmax}\left(\frac{G P^{T}}{\sqrt{d_k}}\right) P, \tag{5}$$

$$y_{class} = \mathrm{Softmax}\big(\sigma(\varphi(f_p))\big), \tag{6}$$
where $G \in \mathbb{R}^{n \times d_k}$ is the global information, $P \in \mathbb{R}^{m \times d_v}$ denotes the proposal categories, and m is the same as the first dimension of G. In this study, the dimensions of $d_k$ and $d_v$ were
set to be the same and equal to the feature channels H × W; P denotes various learnable
sequences that are referred to as proposal categories and are independently decoded into
class labels, resulting in n final class predictions, where n denotes the total number of
dataset categories in anchor-free methods and is the product of the number of categories
and number of anchor boxes in anchor-based methods.
There are many ways to initialize the proposal categories. Transformer architecture
does not contain any inductive bias; this study attempted to feed prior knowledge into
ClassDecoder, and proposal categories were initialized as follows. A 1 × 1 convolution was
used to reduce the dimension of g and reduce the original m dimension to the n dimension
(generally, n ≪ m), where n represents the total number of categories in the dataset of the
detection task based on the anchor-free method. ClassDecoder globally reasons about all
categories simultaneously using the pair-wise relationships between objects while learning
the relationship between categories, including the foreground and background.
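As a rough illustration of Equations (4)–(6), the sketch below implements a simplified ClassDecoder-style head in PyTorch: learnable proposal-category embeddings query the collapsed feature sequence with scaled dot-product attention and are mapped to per-category confidences. To keep the sketch short, the proposal embeddings here live in the reduced m-dimensional space rather than the HW-dimensional space described above; all names and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ClassDecoder(nn.Module):
    """Sketch of a ClassDecoder-style head: n learnable proposal-category
    embeddings query the global feature sequence (scaled dot-product, as in
    Eq. (5)) and each is mapped to one class confidence (as in Eq. (6))."""
    def __init__(self, in_channels=2048, m_dim=256, num_classes=3):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, m_dim, kernel_size=1)      # 1x1 conv: C -> m (Eq. (4))
        self.proposals = nn.Parameter(torch.randn(num_classes, m_dim))  # learnable proposal categories P
        self.head = nn.Linear(m_dim, 1)                                 # global descriptor per category

    def forward(self, f_out):                       # f_out: (B, C, H, W) from the GEE
        g = self.reduce(f_out).flatten(2).transpose(1, 2)   # (B, HW, m): feature sequence G
        d_k = g.shape[-1]
        # scaled dot-product between proposals and G (Eq. (5)-style)
        attn = torch.softmax(self.proposals @ g.transpose(1, 2) / d_k ** 0.5, dim=-1)  # (B, n, HW)
        f_p = attn @ g                              # (B, n, m): one aggregated vector per category
        logits = self.head(torch.sigmoid(f_p)).squeeze(-1)   # (B, n) (Eq. (6)-style)
        return torch.softmax(logits, dim=-1)        # per-category confidences

decoder = ClassDecoder(in_channels=2048, m_dim=256, num_classes=3)
scores = decoder(torch.randn(2, 2048, 20, 20))      # -> torch.Size([2, 3])
```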

3.3. Attention Mechanism in the Backbone Network


The attention mechanisms in computer vision can enhance the objects in the feature maps. CBAM [35] attempts to utilize position information by reducing the channel dimension of the input tensor and using convolution to compute spatial attention. Different from CBAM, our proposed method adds a location attention feature to build direction-aware information, which helps the network locate objects more accurately by capturing precise location information in two different spatial directions. A global encoding for channel-wise spatial information is added based on Coordinate Attention [36]. Specifically, the features $x_c(i, j)$ are aggregated along the W and H spatial directions to obtain feature maps of perception in two directions. These two features, $z_c^h(h)$ and $z_c^w(w)$, allow the attention module to obtain long-term dependencies along different spatial directions. The concatenate operation F is performed together with the channel descriptor $z_c^g$ that carries global spatial information. Then, the convolution function $\varphi$ is used to transform them and obtain the output $\mathcal{P}$, as shown in Figure 4.

$z_c^g$, $z_c^h(h)$, and $z_c^w(w)$ are defined as follows:

$$z_c^g = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j), \tag{7}$$

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \tag{8}$$

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w), \tag{9}$$

$$\mathcal{P} = \varphi\big(F\big[z_c^g, z_c^h(h), z_c^w(w)\big]\big), \tag{10}$$

where $x_c$ is the input from the features extracted from the previous layer associated with the c-th channel, $\varphi(\cdot)$ is the convolutional operation, and $F[\cdot]$ is the concatenate operation. After the output of the different information $\mathcal{P}$ passes through the respective convolution layers $\varphi(\cdot)$, normalization is activated by the sigmoid activation function $\sigma(\cdot)$. The final output $y_c$ is the multiplication of the original feature map and the information weights:

$$f^w = \sigma(\varphi_w(\mathcal{P}^w)), \tag{11}$$
$$f^h = \sigma(\varphi_h(\mathcal{P}^h)), \tag{12}$$

$$f^g = \sigma(\varphi_g(\mathcal{P}^g)), \tag{13}$$

$$y_c(i, j) = x_c(i, j) \times f_c^w(j) \times f_c^h(i) \times f_c^g(i, j). \tag{14}$$

The proposed attention mechanism in the backbone could be applied to different kinds of networks. As shown in the following experimental part, the improved attention mechanism can be plugged into lightweight backbone networks and improve the network detection capability.

Figure 4. The attention mechanism in backbone network. We propose the global encoding for channel-wise spatial information and extract X and Y direction information for the location attention features.

3.4. Data Augmentation
Traffic scene object detection is usually affected by light, weather, and other factors. The data-driven deep neural networks require a large number of labeled images to train the model. Most traffic scene datasets cannot cover all complex environmental conditions. In this paper, we use three types of data augmentation methods: global pixel level, spatial level, and object level, as shown in Figure 5. Specifically, we use Brightness Contrast, Blur, and Channel Dropout for illumination transformation; we use Rain, Sun Flare, and Cutout [37] for the spatial-level data augmentation, and Mixup and CutMix [38] for the object-level augmentation. The data augmented by these methods can simulate complex traffic scenarios, which can improve the detection robustness of the model.

Figure 5. Efficient data augmentation for traffic scene images. Different augmentation methods are used to simulate the complex environment.
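The pixel- and spatial-level augmentations listed above can be composed with an off-the-shelf library. The snippet below is a sketch assuming the albumentations library; the specific transform parameters and probabilities are our own choices rather than the paper's configuration, and object-level mixing (Mixup/CutMix) is only indicated by a comment because it is usually applied at the batch level.

```python
import numpy as np
import albumentations as A

# Pixel-level (illumination) and spatial-level transforms for traffic-scene images;
# bounding boxes in pascal_voc format are transformed together with the image.
train_transform = A.Compose(
    [
        A.RandomBrightnessContrast(p=0.5),    # illumination change
        A.Blur(blur_limit=5, p=0.3),          # blur
        A.ChannelDropout(p=0.2),              # drop one color channel
        A.RandomRain(p=0.2),                  # simulated rain
        A.RandomSunFlare(p=0.2),              # simulated sun flare
        A.CoarseDropout(p=0.3),               # Cutout-style occlusion
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

# Object-level mixing (Mixup / CutMix) is usually applied when batching,
# e.g., by blending two augmented samples and merging their box lists.
image = np.zeros((640, 640, 3), dtype=np.uint8)   # dummy image
boxes, labels = [(100, 120, 180, 200)], [1]       # one dummy box
augmented = train_transform(image=image, bboxes=boxes, labels=labels)
```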

4. Experiments and Results


4.1. Evaluation Metrics
The average precision (AP) metrics were used to evaluate the detection performance,
including AP at different IoU thresholds (AP, AP50 , AP75 ) and AP for different scale objects
(APS , APM , APL ), which consider both recall and precision. The top-n accuracy was used
to evaluate the classification ability of different methods. Top-n represents the truth value
of the object in the first n confidence results of the model. We also use parameters and
FLOPs (floating-point operations per second) to measure the volume and computation of
different models.
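As a reminder of how the IoU-thresholded metrics relate to box overlap, the small helper below (our own illustrative code, not tied to any particular evaluation toolkit) computes the IoU of two boxes; a detection is counted as correct for AP50 or AP75 when its IoU with a matched ground-truth box exceeds 0.5 or 0.75, respectively.

```python
def box_iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

iou = box_iou((10, 10, 110, 110), (20, 20, 120, 120))   # about 0.68
print(iou >= 0.5, iou >= 0.75)   # matches at the 0.5 threshold but not at 0.75
```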

4.2. Datasets
Detection performance in traffic scenes is evaluated using the BCTSDB [39], KITTI [40],
and COCO [41] datasets. The KITTI dataset contains
7481 training images and 7518 test images, totaling 80,256 labeled objects with three
categories (e.g., vehicle, pedestrian, and cyclist). The BCTSDB dataset contains 15,690 traffic
sign images, including 25,243 labeled traffic signs. The COCO dataset is used to test the
generalization ability of the model including 80 object categories and more than 220 K
labeled images.

4.3. Implementation and Training Details


The network structure was constructed with PyTorch, and the default hyperparameters were the same as those of MMDetection [42] unless otherwise stated. Two NVIDIA TITAN
V graphics cards with 24 GB VRAM were used to train the model. The linear warm-up
policy was used to start the training, where the warm-up ratio was set to 0.1. The optimizer
of DetectFormer is AdamW [43]; the initial learning rate is set to $10^{-4}$, and the weight decay is set to $10^{-4}$. The backbone network is established using pre-trained weights from
ImageNet [44], and other layers used Xavier [45] for parameter initialization except for
the proposal categories. The input images are scaled to a full scale of 640 × 640, while
maintaining the aspect ratio.
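A hedged sketch of the optimization setup described above in plain PyTorch is given below; the warm-up length and the constant schedule after warm-up are illustrative assumptions, since the paper otherwise relies on MMDetection defaults.

```python
import torch

model = torch.nn.Conv2d(3, 8, 3)   # stand-in for the detector network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

warmup_iters, warmup_ratio = 500, 0.1   # assumed warm-up length; ratio from the paper

def lr_lambda(it):
    # linear warm-up from warmup_ratio * lr to the full lr, then keep it constant
    if it < warmup_iters:
        return warmup_ratio + (1.0 - warmup_ratio) * it / warmup_iters
    return 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```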

4.4. Performances
We first evaluate the effectiveness of the different proposed units. The ClassDecoder
head, Global Extract Encoder, Attention, Anchor-free head, and Data augmentation are
gradually added to the RetinaNet baseline on the COCO and BCTSDB datasets to test the
generalization ability of the proposed method and the detection ability in the traffic scene,
as shown in Tables 1 and 2, respectively.

Table 1. The ablation study on the COCO dataset.

Methods Parameters (M) FLOPs (G) AP (%) AP50 (%) AP75 (%)
RetinaNet baseline 37.74 95.66 32.5 50.9 34.8
+ClassDecoder 35.03 (−2.71) 70.30 (−25.36) 34.6 (+2.1) 53.5 (+2.6) 36.1 (+1.3)
+Global Extract Encoder 36.95 (+1.92) 90.45 (+20.15) 36.2 (+1.6) 55.7 (+2.2) 37.8 (+1.7)
+Attention 37.45 (+0.5) 90.65 (+0.2) 38.3 (+2.1) 58.3 (+2.6) 39.3 (+1.5)
+Anchor-free 37.31 (−0.14) 89.95 (−0.7) 38.9 (+0.6) 59.1 (+0.8) 39.6 (+0.3)
+Data augmentation 37.31 (+0) 89.95 (+0) 41.3 (+2.4) 61.8 (+2.7) 41.5 (+1.9)

Table 2. The ablation study on the BCTSDB dataset.

Methods Parameters (M) FLOPs (G) AP (%) AP50 (%) AP75 (%)
RetinaNet baseline 37.74 95.66 59.7 89.4 71.2
+ClassDecoder 35.03 (−2.71) 70.30 (−25.36) 61.6 (+3.7) 91.8 (+2.4) 75.8 (+4.6)
+Global Extract Encoder 36.95 (+1.92) 90.45 (+20.15) 63.4 (+3.4) 93.9 (+2.1) 80.6 (+4.8)
+Attention 37.45 (+0.5) 90.65 (+0.2) 65.2 (+3.1) 95.1 (+1.2) 84.2 (+3.6)
+Anchor-free 37.31 (−0.14) 89.95 (−0.7) 65.8 (+2.1) 95.7 (+0.6) 87.4 (+3.2)
+Data augmentation 37.31 (+0) 89.95 (+0) 76.1 (+4.1) 97.6 (+1.9) 91.4 (+4.0)

We further compare the different performances of anchor-based and anchor-free methods on the KITTI dataset. As shown in Table 3, the detection performance of an anchor-
free detector with Feature Pyramid Network (FPN) [46] is better than the anchor-based
detector. FPN plays a crucial role in improving detection accuracy based on the anchor-
free method.

Table 3. Comparison of anchor-based and anchor-free methods on the KITTI dataset.

Methods Detector Car (%) Pedestrian (%) Cyclist (%)
DetectFormer Anchor-based 83.24 70.11 73.54
DetectFormer Anchor-free 69.45 61.15 62.24
DetectFormer Anchor-free w/. FPN 86.59 79.45 81.71

For the initialization method of proposal categories, we compare different methods, as shown in Figure 6. The experiment shows that the orthogonalized initial parameter method performs better than the random initialization method in the early stage of training. The advantage becomes less obvious as the training continues.
The efficiency of attention and the detection results of DetectFormer with backbone networks of different parameter counts, from the light-weight backbone network (MobileNetv3 [47]) to the high-performance backbone network (ResNet101 [48]), are shown in Table 4, which shows that inserting the attention mechanism into the backbone network can improve the detection performance of the model; our method is especially competitive in lightweight networks.
Figure 6. The loss curves for different initialization methods.

Table 4. The performance of attentional mechanism in different backbone networks on BCTSDB dataset.

Backbone Params. FLOPs Head Attention AP (%)
MobileNetv3 × 1.0 5.4 M 220 M RetinaNet w/o. 51.2
ResNet50 25 M 3.8 G RetinaNet w/o. 59.7
ResNet101 46.3 M 7.6 G RetinaNet w/o. 64.8
MobileNetv3 × 1.0 5.9 M 231 M RetinaNet w/. 54.1
ResNet50 25 M 3.8 G RetinaNet w/. 62.5
ResNet101 46.3 M 7.6 G RetinaNet w/. 66.3
Table 5 presents the classification performance of baseline methods and that of the
proposed method on the BCTSDB dataset. Anchor-based and anchor-free methods were
used to compare RetinaNet and FCOS, respectively. The experimental results reveal that
DetectFormer is helpful in improving the classification ability of the model. Remarkably, De-
tectFormer can reduce the computation and parameter number of the detection networks.

Table 5. Classification results with other methods on the BCTSDB dataset.

Model Backbone Head Params. (M) FLOPs (G) Top-1 Acc. (%) Top-5 Acc. (%)
RetinaNet [15] ResNet50 Anchor-based 37.74 95.66 96.8 98.9
FCOS [16] ResNet50 Anchor-free 31.84 78.67 98.2 99.1
Ours. ResNet50 Anchor-free 37.31 89.95 98.7 99.5

The convergence curves of DetectFormer and other SOTA (state-of-the-art) methods, including RetinaNet, DETR, Faster R-CNN, FCOS, and YOLOv5, are shown in Figure 7, which illustrates that DetectFormer achieves better performance with efficient training and accurate detection. The vertical axis is the detection accuracy.

Figure 7. The detection results with different methods on the BCTSDB dataset. Our model can achieve higher detection accuracy in shorter training epochs. In particular, DETR requires more than 200 training epochs for high precision detection.

Table 6 shows the detection results on the BCTSDB dataset produced by multi-stage methods (e.g., Faster R-CNN, Cascade R-CNN) and single-stage methods, including anchor-based methods (e.g., YOLOv3, RetinaNet) and the anchor-free method FCOS. DetectFormer shows high detection accuracy and more competitive performance. The AP, AP50, and AP75 are 76.1%, 97.6%, and 91.4%, respectively. DetectFormer can suit the distribution of object categories and boost detection confidence in the field of autonomous driving better than other networks.

Table 6. Comparison of results with other methods on the BCTSDB dataset.

Model Backbone Head AP AP50 AP75 APS APM APL FPS


Faster R-CNN [10] ResNet50 Anchor-based 70.2 94.7 86.0 65.3 76.5 84.5 28
Cascade R-CNN [11] ResNet50 Anchor-based 75.8 96.7 92.5 72.9 79.3 89.2 23
YOLOv3 [14] Darknet53 Anchor-based 59.5 92.7 70.4 54.2 70.1 83.8 56
RetinaNet [15] ResNet50 Anchor-based 59.7 89.4 71.2 47.2 72.5 83.3 52
FCOS [16] ResNet50 Anchor-free 68.6 95.8 83.9 62.7 75.7 83.9 61
Ours. ResNet50 Anchor-free 76.1 97.6 91.4 63.1 77.4 84.5 60

The proposed method was also evaluated on the KITTI dataset. As shown in Table 7,
compared with other methods, DetectFormer shows better detection results.

Table 7. Comparison results for detection methods on the KITTI dataset.

Methods Car Easy (%) Car Moderate (%) Car Hard (%) Pedestrian Easy (%) Pedestrian Moderate (%) Pedestrian Hard (%) Cyclist Easy (%) Cyclist Moderate (%) Cyclist Hard (%) Time (ms)
Regionlets [49] 84.75 76.45 59.70 73.14 61.15 55.21 70.41 58.72 51.83 -
Faster R-CNN [10] 87.97 79.11 70.62 78.97 65.24 60.09 71.40 61.86 53.97 142
Mono3D [50] 84.52 89.37 79.15 80.30 67.29 62.23 77.19 65.15 57.88 -
MS-CNN [51] 93.87 88.68 76.11 85.71 74.89 68.99 84.88 75.30 65.27 -
SSD [52] 87.34 87.74 77.27 50.38 48.41 43.46 48.25 52.31 52.13 30
ASSD [53] 89.28 89.95 82.11 69.07 62.49 60.18 75.23 76.16 72.83 30
RFBNet [54] 87.31 87.27 84.44 66.16 61.77 58.04 74.89 72.05 71.01 23
Ours. 90.48 88.03 81.25 83.32 79.35 75.67 85.04 82.33 77.76 22
Figures 8 and 9 show that DetectFormer can improve the model's sensitivity to categories by implicitly learning the relationships between objects.

Figure 8. Precision curves of the proposed method and RetinaNet. Our model has high detection accuracy even with low confidence.

Figure 9. Confusion matrix of the proposed method and RetinaNet. The darker the block, the larger the value it represents. Compared with RetinaNet, the proposed method can obtain more category information and help to classify the objects. (a) Confusion matrix of RetinaNet. (b) Confusion matrix of DetectFormer.

The detection results are shown in Figures 10 and 11 on the KITTI and BCTSDB datasets, respectively. The results demonstrate the proposed method's effectiveness in traffic scenarios. Three types of traffic signs on the BCTSDB dataset, including warning, prohibitory, and mandatory, and three types of traffic objects on the KITTI dataset, including car, pedestrian, and cyclist, were detected. The detection result does not include other types of traffic objects such as a motorcycle in Figure 10, but the proposed model can detect those kinds of objects.

Figure 10. Detection results on KITTI dataset. Our method can detect different objects in traffic scenes accurately, and even identify overlapping objects and dense objects.

Figure 11. Detection results on BCTSDB dataset. Our method can detect traffic signs at different scales with high precision.
5. Discussion
Why can ClassDecoder improve the classification ability of models? In this paper, we propose ClassDecoder to improve the classification ability, which is designed based on the
transformer architecture without any convolution operations. The model interacts with
different background feature maps in scaled dot-product attention and multi-head attention
by using proposal categories, and learns the implicit relationship between the background
and the category by using the key-value pair idea in the Transformer. The number of
proposal categories is equal to the number of object categories, and the parameters of
proposal categories are learnable. The inputs of ClassDecoder are the feature maps and the proposal categories, and the output is the predicted category of the current bounding box.
The output dimensions are the same as those of the proposal categories, and the proposal
categories are associated with the output in the role of Query (Query-Key-Value relationship
in transformer architecture). It can be understood that the proposal categories are vectors
that can be learned, and their quantity represents the confidence vectors corresponding to
different categories of the current bounding box. Then, the model converts the confidence
vector into category confidence through feed-forward network. The category with the
highest confidence is the category of the predicted bounding box.

6. Conclusions
This paper proposes a novel object detector called DetectFormer, which is assisted by
a transformer to learn the relationship between objects in traffic scenes. By introducing the
GEE and ClassDecoder, this study focused on fitting the distribution of object categories
to specific scene backgrounds and implicitly learning the object category relationships
to improve the sensitivity of the model to the categories. The results obtained by experi-
ments on the KITTI and BCTSDB datasets reveal that the proposed method can improve
the classification ability and achieve outstanding performance in complex traffic scenes.
The AP50 and AP75 of the proposed method are 97.6% and 91.4% on BCTSDB, and the
average accuracies of car, pedestrian, and cyclist are 86.6%, 79.5%, and 81.7% on KITTI,
respectively, which indicates that the proposed method achieves better results compared to
other methods. The proposed method improved detection accuracy, but it still encountered
many challenges when applied to natural traffic scenarios. The experiments in this paper were trained on public datasets, while real traffic scenes face challenges with complex lighting and weather factors. Our future work is focused on object detection in an open environment
and the deployment of models to vehicles.

Author Contributions: Conceptualization, T.L. and W.P.; methodology, H.B.; software, T.L. and W.P.;
validation, X.F. and H.L.; writing—original draft preparation, T.L.; writing—review and editing, T.L.
and W.P. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (Nos.
61802019, 61932012, 61871039, 61906017 and 62006020) and the Beijing Municipal Education Commis-
sion Science and Technology Program (Nos. KM201911417003, KM201911417009 and KM201911417001).
Beijing Union University Research and Innovation Projects for Postgraduates (No.YZ2020K001). By
the Premium Funding Project for Academic Human Resources Development in Beijing Union Uni-
versity under Grant BPHR2020DZ02.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations

AP Averaged AP at IoUs from 0.5 to 0.95 with an interval of 0.05


AP50 AP at IoU threshold 0.5
AP75 AP at IoU threshold 0.75
APL AP for objects of large scales (area > 96²)
APM AP for objects of medium scales (32² < area < 96²)
APS AP for objects of small scales (area < 32²)
BCTSDB BUU Chinese Traffic Sign Detection Benchmark
CNN Convolutional Neural Network
FLOPs Floating-point operations per second
FPN Feature pyramid network
FPS Frames Per Second
GEE Global Extract Encoder
HOG Histogram of Oriented Gradients
IoU Intersection over union
LSTM Long Short-Term Memory
NMS Non-Maximum Suppression
RNN Recurrent Neural Network
SOTA State-of-the-art
SSD Single Shot MultiBox Detector
SVM Support Vector Machine
VRAM Video random access memory
YOLO You Only Look Once

References
1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you
need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA,
4–9 December 2017; Curran Associates Inc.: Long Beach, CA, USA, 2017; pp. 6000–6010.
2. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
3. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In
Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer
International Publishing: Cham, Switzerland, 2020; Volume 12346, pp. 213–229, ISBN 978-3-030-58451-1.
4. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: San Diego, CA,
USA, 2005; Volume 1, pp. 886–893.
5. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models.
IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [CrossRef]
6. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13,
18–28. [CrossRef]
7. Chen, Z.; Shi, Q.; Huang, X. Automatic detection of traffic lights using support vector machine. In Proceedings of the 2015 IEEE
Intelligent Vehicles Symposium (IV), Seoul, Korea, 28 June–1 July 2015; IEEE: Seoul, Korea, 2015; pp. 37–40.
8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
IEEE: Columbus, OH, USA, 2014; pp. 580–587.
9. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile,
7–13 December 2015; IEEE: Santiago, Chile, 2015; pp. 1440–1448.
10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef]
11. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Salt Lake City, UT,
USA, 2018; pp. 6154–6162.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las
Vegas, NV, USA, 2016; pp. 779–788.
13. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 6517–6525.
14. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
15. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell.
2020, 42, 318–327. [CrossRef]
16. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A Simple and Strong Anchor-free Object Detector. IEEE Trans. Pattern Anal. Mach. Intell.
2020, 44, 1922–1933. [CrossRef]
17. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
18. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656. [CrossRef]
19. He, S.; Chen, L.; Zhang, S.; Guo, Z.; Sun, P.; Liu, H.; Liu, H. Automatic Recognition of Traffic Signs Based on Visual Inspection.
IEEE Access 2021, 9, 43253–43261. [CrossRef]
20. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9
December 2017; pp. 3859–3869.
21. Li, C.; Qu, Z.; Wang, S.; Liu, L. A method of cross-layer fusion multi-object detection and recognition based on improved faster
R-CNN model in complex traffic environment. Pattern Recognit. Lett. 2021, 145, 127–134. [CrossRef]
22. Lian, J.; Yin, Y.; Li, L.; Wang, Z.; Zhou, Y. Small Object Detection in Traffic Scenes Based on Attention Feature Fusion. Sensors 2021,
21, 3031. [CrossRef]
23. Liang, T.; Bao, H.; Pan, W.; Pan, F. ALODAD: An Anchor-Free Lightweight Object Detector for Autonomous Driving. IEEE Access
2022, 10, 40701–40714. [CrossRef]
24. Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Studies in Computational
Intelligence; Springer: Berlin/Heidelberg, Germany, 2012; Volume 385, pp. 37–45, ISBN 978-3-642-24796-5.
25. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations
using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [CrossRef]
26. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv 2014, arXiv:1409.2329. [CrossRef]
27. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The
Sparsely-Gated Mixture-of-Experts Layer. arXiv 2017, arXiv:1701.06538. [CrossRef]
28. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s
Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144.
[CrossRef]
29. Wang, H.; Wu, Z.; Liu, Z.; Cai, H.; Zhu, L.; Gan, C.; Han, S. HAT: Hardware-Aware Transformers for Efficient Natural Language
Processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020;
Association for Computational Linguistics: Online, 2020; pp. 7675–7688.
30. Floridi, L.; Chiriatti, M. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds Mach. 2020, 30, 681–694. [CrossRef]
31. Dong, L.; Xu, S.; Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB,
Canada, 5–20 April 2018; IEEE: Calgary, AB, Canada, 2018; pp. 5884–5888.
32. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understand-
ing. arXiv 2016, arXiv:1810.04805. [CrossRef]
33. Yan, H.; Ma, X.; Pu, Z. Learning Dynamic and Hierarchical Traffic Spatiotemporal Features With Transformer. IEEE Trans. Intell.
Transport. Syst. 2021, 1–14. [CrossRef]
34. Cai, L.; Janowicz, K.; Mai, G.; Yan, B.; Zhu, R. Traffic transformer: Capturing the continuity and periodicity of time series for
traffic forecasting. Trans. GIS 2020, 24, 736–755. [CrossRef]
35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari,
V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham,
Switzerland, 2018; Volume 11211, pp. 3–19, ISBN 978-3-030-01233-5.
36. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN,
USA, 2021; pp. 13708–13717.
37. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552.
38. Yun, S.; Han, D.; Chun, S.; Oh, S.J.; Yoo, Y.; Choe, J. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable
Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2
November 2019; IEEE: Seoul, Korea, 2019; pp. 6022–6031.
39. Liang, T.; Bao, H.; Pan, W.; Pan, F. Traffic Sign Detection via Improved Sparse R-CNN for Autonomous Vehicles. J. Adv. Transp.
2022, 2022, 3825532. [CrossRef]
40. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
[CrossRef]
41. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in
Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science;
Springer International Publishing: Cham, Switzerland, 2014; Volume 8693, pp. 740–755, ISBN 978-3-319-10601-4.
42. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection
Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
43. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101.
44. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
45. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; JMLR Workshop and Conference
Proceedings; pp. 249–256.
46. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July
2017; IEEE: Honolulu, HI, USA, 2017; pp. 936–944.
47. Koonce, B. MobileNetV3. In Convolutional Neural Networks with Swift for Tensorflow; Apress: Berkeley, CA, USA, 2021; pp. 125–144,
ISBN 978-1-4842-6167-5.
48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016;
pp. 770–778.
49. Wang, X.; Yang, M.; Zhu, S.; Lin, Y. Regionlets for Generic Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37,
2071–2084. [CrossRef]
50. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D Object Detection for Autonomous Driving. In
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June
2016; IEEE: Las Vegas, NV, USA, 2016; pp. 2147–2156.
51. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection.
In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer
International Publishing: Cham, Switzerland, 2016; Volume 9908, pp. 354–370, ISBN 978-3-319-46492-3.
52. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer
Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International
Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37, ISBN 978-3-319-46447-3.
53. Yi, J.; Wu, P.; Metaxas, D.N. ASSD: Attentive single shot multibox detector. Comput. Vis. Image Underst. 2019, 189, 102827.
[CrossRef]
54. Liu, S.; Huang, D.; Wang, Y. Receptive Field Block Net for Accurate and Fast Object Detection. In Computer Vision—ECCV 2018;
Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing:
Cham, Switzerland, 2018; Volume 11215, pp. 404–419, ISBN 978-3-030-01251-9.
