An Improved Lightweight Yolo-Fastest V2 for Engineering Vehicle Recognition Fusing Location Enhancement and Adaptive Label Assignment
Abstract— Engineering vehicle recognition based on video surveillance is one of the key technologies to assist illegal land use monitoring. At present, engineering vehicle recognition mainly adopts traditional deep learning models with a large number of floating-point operations, so it cannot run in real-time on edge devices with limited computing power and storage. In addition, some lightweight models suffer from inaccurate bounding box locating, a low recognition rate, and unreasonable selection of positive training samples for small objects. To solve these problems, this paper proposes an improved lightweight Yolo-Fastest V2 for engineering vehicle recognition fusing location enhancement and adaptive label assignment. The location-enhanced Feature Pyramid Network (FPN) structure combines deep and shallow feature maps to accurately localize bounding boxes. The grouping k-means clustering strategy and adaptive label assignment algorithm select an appropriate anchor for each object based on its shape and Intersection over Union (IoU). The study was conducted on a Raspberry Pi 4B 2018 using two datasets and different models. Experiments show that our method achieves the optimal combination of speed and accuracy. Specifically, mAP50 is increased by 7.02% at a speed of 11.24 FPS on the engineering vehicle data obtained by video surveillance in a rural area of China.

Index Terms—lightweight model, feature fusion, k-means clustering, adaptive label assignment.

This research was supported by the project of major scientific and technological achievements transformation in Hebei Province of China: 22287401Z, NSFC: 41971352, Alibaba Cloud Computing Co. Ltd. funding: 17289315, and Guangxi JMRH Funding: 202203. (Corresponding author: Dayu Cheng.)
Hairong Zhang is with the School of Environment and Spatial Informatics, China University of Mining and Technology, Xuzhou 221116, China, and also with the Key Laboratory of Natural Resources Monitoring in Tropical and Subtropical Area of South China, Ministry of Natural Resources, Guangzhou 510500, China (e-mail: [email protected]).
Dongsheng Xu and Geng Xu are with the School of Environment and Spatial Informatics, China University of Mining and Technology, Xuzhou 221116, China (e-mail: [email protected]; [email protected]).
Dayu Cheng is with the School of Mining and Geomatics Engineering, Hebei University of Engineering, Handan 056038, China (e-mail: [email protected]).
Xiaoliang Meng is with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China, and also with the Key Laboratory of Natural Resources Monitoring in Tropical and Subtropical Area of South China, Ministry of Natural Resources, Guangzhou 510500, China (e-mail: [email protected]).
Wei Liu is with the School of Geography, Geomatics and Planning, Jiangsu Normal University, Xuzhou 221116, China (e-mail: [email protected]).
Teng Wang is with the Surveying and Mapping Institute, Lands and Resource Department of Guangdong Province, Guangzhou 510500, China (e-mail: [email protected]).

I. INTRODUCTION

… frequently, especially the illegal construction on cultivated land, forest land, and rivers. In addition, coupled with the serious national conditions of land pollution and soil erosion, we must cherish land resources and strengthen the fight against illegal land use. Engineering vehicles (such as trucks, excavators, etc.) are important tools of illegal land use. Therefore, identifying engineering vehicles from video is a useful way to monitor illegal land use. The traditional method identifies engineering vehicles manually, by staff watching surveillance video [5], a time-consuming and labor-intensive process. With the rapid advancement of deep learning, automatic image recognition based on convolutional neural networks (CNN) has become widespread in traffic management [6]-[9], crime prevention [10]-[12], anomaly detection [13]-[16], fire warning [17], human detection [18]-[22], and agricultural management [23], among other fields. This also enables the identification of engineering vehicles from video in real-time. However, the real-time video captured by the online camera must be uploaded to a cloud or local server on which the deep learning recognition model has been deployed [24]-[25]. This process is affected by network bandwidth, preventing data transmission in real-time. Not only are a significant number of operations concentrated on the server, but the server's performance requirements are also relatively high. Consequently, in the process of video recognition, deploying computing power at the front end is a crucial way to alleviate the pressure of data transmission
[28]-[29], thereby enhancing real-time recognition and relieving the burden on the central server. However, because current mainstream deep learning object recognition models require large numbers of floating-point operations (FLOPs) [30]-[32], real-time recognition cannot be achieved on edge devices with limited computing and storage resources. Therefore, developing a lightweight model for object recognition in land and resource management has become a popular research direction [33]-[35].

Currently, some researchers are investigating lightweight object recognition models. Nikouei et al. [36] developed a lightweight human detection model, L-CNN, using depth-wise separable convolutions and only 23 layers; the edge device Raspberry Pi 3 Model B achieves an average recognition speed of 1.79 FPS on 224×224 video images. Even though this model has a small number of parameters, the computation required is still substantial, and real-time detection cannot be achieved on devices with limited computing resources. Steinmann et al. [37] replaced the SSD detector's backbone with the lightweight backbone network PELEE and combined it with a filter pruning strategy to reduce the number of parameters. On a Jetson AGX, the recognition speed is 24 times faster than the original model, but the recognition accuracy suffers. Based on YOLO-Lite as the backbone network, Zhao et al. [38] added a residual block, balanced the high- and low-resolution sub-networks, and proposed the lightweight Mixed YOLOv3-Lite model. On Jetson AGX Xavier equipment, the speed of recognizing 224×224 video images can reach 43 FPS, but the number of parameters is substantial, reaching 20.5 MB. Balamuralidhar et al. [28] introduced the MultEYE network in the UAV traffic detection system MultEYE. The network implements pruning to reduce the number of CSP Bottleneck and convolution blocks and adds a space-to-depth layer to decrease the spatial size and increase the number of channels. The speed of recognizing 512×320 video images on an NVIDIA Xavier NX can reach 29 FPS. RangiLyu [39] proposed the anchor-free NanoDet model, whose head, neck, and backbone are all lightweight; it can recognize 320×320 data on a mobile ARM CPU at a speed of 97 FPS. However, the anchor-free model has significant issues with positive/negative sample imbalance and semantic ambiguity, leading to missed objects. Amudhan et al. [40] proposed a lightweight model, PmA, that detects small objects in remote sensing images effectively and quickly by extracting more feature information from shallow features and transferring shallow features to deep features for detection. Compared to YOLOv3-tiny and YOLOv4-tiny, its detection speed on a Jetson Nano is 32% faster. Liu et al. [41] addressed the issues of limited UAV computing resources and very small objects by enhancing the network structure and the target prediction box selection algorithm of YOLOv4, improving recognition accuracy and achieving 23.8 FPS on an NVIDIA Jetson TX2 and 9.6 FPS on a Raspberry Pi 4B. Dog-qiuqiu [42] proposed the Yolo-FastestV2 model for real-time detection on mobile devices. The model replaces the backbone network of YOLOv5 with ShuffleNet V2 while simultaneously lightening the FPN structure; the parameter size of the model is just 237.55 KB. On a Mate 30 Kirin 990 CPU, the recognition speed on 352×352 images can exceed 300 FPS, and real-time detection can also be achieved on embedded devices with limited computing resources. However, Yolo-FastestV2 lacks rich feature fusion, which prevents it from making full use of the rich location information in shallow feature maps and from selecting suitable positive training samples effectively. This results in insufficient extraction of location and semantic features of small objects, which leads to inaccurate positioning of the bounding boxes (bboxes) of small objects and a low recognition rate.

Yolo-FastestV2 is a widely used lightweight object recognition model. This paper proposes a method fusing a location-enhanced FPN and adaptive label assignment to address the issues encountered by Yolo-FastestV2. The method improves the FPN structure by fusing feature maps of different scales to ensure that the object's location and feature information are fully extracted. A grouping k-means clustering strategy and an adaptive label assignment algorithm are proposed to assign appropriate anchor boxes to each object, combating the issues of low recognition rate and unreasonable selection of positive training samples. The algorithm assigns each object in the image an anchor with a similar shape and the largest IoU; in other words, a suitable positive training sample is selected to facilitate the regression of the bbox.

In summary, the article's contributions are as follows:
1) Feature fusion is added to the original model's FPN. The shallow feature map, rich in location information, is integrated with the deep feature map channels, rich in semantic information, to enhance the localization ability of the model;
2) A threshold α and a grouping k-means clustering strategy are introduced. By adjusting the α value, multi-scale anchors are generated for small objects, so small-scale anchor boxes can be set reasonably to make small bboxes easier to learn;
3) An adaptive label assignment algorithm based on shape and IoU is proposed. On top of shape label assignment, an IoU label assignment algorithm is introduced. The shape threshold and IoU threshold are computed dynamically from the mean and variance of the IoU and of the shape ratio. Under the dual constraints of shape and IoU, an appropriate anchor box is assigned to each object.

II. PROPOSED METHOD

As shown in Fig. 1, the model consists of three components: the backbone network ShuffleNet V2, the FPN, and the prediction head. This paper improves the Yolo-FastestV2 model in three aspects: the FPN structure, the anchor box settings, and the label assignment.
[Fig. 1 diagram: ShuffleNet V2 backbone (ShuffleV2Block stages of 48, 96, and 192 channels) feeding the 72-channel Location Enhanced Light-FPN and the Adaptive Label Assignment prediction heads.]
Fig. 1. Overall framework of the improved Yolo-Fastest V2
In recent years, ShuffleNet V2 has been an excellent network for lightweight feature extraction. It consists of three stacked ShuffleV2Block convolution blocks, and the entire network ensures optimal feature extraction performance by minimizing memory access. The location-enhanced FPN structure (red dotted line in Fig. 1) integrates the location information of the shallow features into the deep features to improve the model's ability to locate the bbox of a small object. In the label assignment algorithm of the prediction head (blue dotted line), the grouping k-means clustering strategy generates reasonable anchors for small objects, while the adaptive label assignment algorithm matches an appropriate anchor to each object. The details are as follows.

A. Location-enhanced FPN

In Fig. 2(a), the original FPN structure predicts small objects using the feature map obtained from the third ShuffleV2Block convolution block of the ShuffleNet V2 backbone, in conjunction with a 1×1 convolution. The feature map obtained by the 1×1 convolution is then up-sampled and fused by concatenation with the feature map obtained from the second ShuffleV2Block convolution block. Finally, a 1×1 convolution is performed to predict objects whose scale is greater than that of small objects. Thus, the FPN utilizes only two layers of shallow feature maps, lacks rich object location information, and cannot fully extract semantic information. This results in inaccurate center coordinate positioning when the model predicts the bbox of a small object.

This paper proposes a location-enhanced FPN structure (Fig. 2(b)), whose primary contribution to feature fusion is introducing the 1/8-scale shallow feature map obtained from the first ShuffleV2Block convolution block into the deep feature maps.

For medium- and large-scale objects, the feature map from the first ShuffleV2Block convolution block is first down-sampled by a factor of 4 to obtain C1. Second, C11 is obtained by concatenation feature fusion of C1 and the feature map from the third ShuffleV2Block convolution block. C11 is subjected to a 1×1 convolution to predict objects of medium and large scale.

For small-scale objects, the result of the C11 convolution is first up-sampled by a factor of 2 to obtain C13. Second, C2 is obtained by down-sampling the feature map from the first ShuffleV2Block convolution block by a factor of 2. The feature maps of C2 and the second ShuffleV2Block convolution block are fused by concatenation to obtain C21, which incorporates deep semantic information; C22 is obtained after a 1×1 convolution. A 1×1 convolution is used after each concatenation feature fusion operation because, during the whole training process, the convolution can extract the effective feature information of the previous feature map while weakening noise information. Finally, C22 and C13 are fused by addition to predict small objects. Through addition feature fusion, shallow and deep features are fully fused, resulting in a fused feature map with rich object location information that improves the original model's ability to locate bboxes. This structure lets the prediction of both large and small objects benefit from the fusion of shallow features with rich location information, thereby improving the locating accuracy.

Fig. 2. (a) FPN; (b) Location-enhanced FPN
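To make the fusion path concrete, the following is a minimal PyTorch sketch of the location-enhanced FPN described above. The variable names (C1, C11, C13, C2, C22) mirror the text; the channel widths (48/96/192 from the backbone stages, 72 inside the FPN) are read off Fig. 1, and average pooling is assumed for the down-sampling operators that Fig. 2(b) marks only as "Pool".

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationEnhancedFPN(nn.Module):
    """Sketch of the location-enhanced FPN fusion (assumed details noted above)."""
    def __init__(self, c_s8=48, c_s16=96, c_s32=192, c_out=72):
        super().__init__()
        # 1x1 conv + BN + ReLU after each concatenation, as in Fig. 2(b)
        self.conv11 = nn.Sequential(
            nn.Conv2d(c_s8 + c_s32, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU())
        self.conv22 = nn.Sequential(
            nn.Conv2d(c_s8 + c_s16, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU())

    def forward(self, p1, p2, p3):
        # p1: 1/8 scale (ShuffleV2Block 1), p2: 1/16 (block 2), p3: 1/32 (block 3)
        c1 = F.avg_pool2d(p1, kernel_size=4)           # 1/8 -> 1/32 (down-sample x4)
        c11 = self.conv11(torch.cat([c1, p3], dim=1))  # concat + 1x1 conv
        c13 = F.interpolate(c11, scale_factor=2.0)     # 1/32 -> 1/16 (up-sample x2)
        c2 = F.avg_pool2d(p1, kernel_size=2)           # 1/8 -> 1/16 (down-sample x2)
        c22 = self.conv22(torch.cat([c2, p2], dim=1))  # concat + 1x1 conv
        small = c22 + c13                              # addition fusion for small objects
        return c11, small

The two returned maps feed the medium/large-object head and the small-object head, respectively.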
B. Grouping K-means Clustering Strategy

For anchor-based object recognition, appropriate anchors must be set before label assignment. However, object sizes in an image vary widely. If we directly apply k-means clustering to the bboxes of the training set, the bboxes of some very small objects are easily clustered into a wide range of anchors (Fig. 3(a)). This is due to the …

Fig. 3. (a) Small objects are aggregated as a whole into a class of anchors; (b) Smaller objects do not match the anchors obtained by clustering

Fig. 4. Ratio of width to height
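The remainder of this subsection is not recoverable from this copy, so the following Python sketch shows only one plausible reading of the strategy as summarized in contribution 2): ground-truth boxes are first split into scale groups by the threshold α and then clustered per group with the 1−IoU distance commonly used for YOLO anchor clustering. The grouping criterion (square root of box area) and the per-group cluster counts are assumptions, not the authors' exact procedure.

import numpy as np

def iou_wh(wh, centroids):
    """IoU between one (w, h) box and an array of centroid (w, h) boxes,
    with all boxes anchored at the origin (standard for anchor clustering)."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100):
    """Plain k-means on (w, h) pairs using 1 - IoU as the distance.
    boxes: (N, 2) float array of ground-truth widths and heights, N >= k."""
    rng = np.random.default_rng(0)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.array([np.argmin(1 - iou_wh(b, centroids)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids

def grouped_anchors(boxes, alpha, k_small=3, k_large=3):
    """Split GT boxes by the scale threshold alpha (assumed criterion:
    sqrt of box area), then cluster each group separately so that small
    objects receive dedicated small-scale anchors."""
    scale = np.sqrt(boxes[:, 0] * boxes[:, 1])
    small, large = boxes[scale < alpha], boxes[scale >= alpha]
    return kmeans_anchors(small, k_small), kmeans_anchors(large, k_large)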
C. Adaptive Label Assignment

After acquiring appropriate anchors, shape label assignment is applied (Fig. 5): each object is matched with an appropriate anchor, which is then assigned cls and reg labels. However, the original shape label assignment method causes two issues: (1) a fixed shape threshold cannot guarantee that each object matches an anchor; in other words, the selection of positive training samples is unreasonable. For example, objects larger than the shape …
In (1)-(5), $H_{gt}^{i}$ represents the height of the $i$-th object GT box; $H_{anchor}^{j}$ represents the height of the $j$-th anchor; $W_{gt}^{i}$ denotes the width of the $i$-th object GT box; $W_{anchor}^{j}$ represents the width of the $j$-th anchor; $dW^{ij}$ denotes the width ratio between the $j$-th anchor and the $i$-th object GT box; $dH^{ij}$ represents the corresponding height ratio; $m$ represents the number of anchors; $r_{ij}$ denotes the maximum of the width and height ratios between the $j$-th anchor and the $i$-th object GT box; $\mu_{i}$ represents the mean of $r$ for the $i$-th object GT box; $\sigma_{i}$ denotes the variance of $r$ for the $i$-th object GT box; and $\alpha_{shape}^{i}$ represents the shape threshold of the $i$-th GT box.

Fig. 5. Shape Assignment (Yolo-Fastest V2)
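Equations (1)-(5) themselves are lost to the page layout in this copy. From the variable definitions above, and mirroring the mean-plus-deviation construction that the text states explicitly for the IoU threshold, one consistent reconstruction — an assumption, not necessarily the authors' exact formulation; the symmetric max-ratio form in (1)-(2) is borrowed from common YOLO-style shape matching — is:

\[ dW^{ij} = \max\left( \frac{W_{gt}^{i}}{W_{anchor}^{j}},\ \frac{W_{anchor}^{j}}{W_{gt}^{i}} \right) \tag{1} \]
\[ dH^{ij} = \max\left( \frac{H_{gt}^{i}}{H_{anchor}^{j}},\ \frac{H_{anchor}^{j}}{H_{gt}^{i}} \right) \tag{2} \]
\[ r_{ij} = \max\left( dW^{ij},\ dH^{ij} \right) \tag{3} \]
\[ \mu_{i} = \frac{1}{m}\sum_{j=1}^{m} r_{ij}, \qquad \sigma_{i} = \frac{1}{m}\sum_{j=1}^{m}\left( r_{ij} - \mu_{i} \right)^{2} \tag{4} \]
\[ \alpha_{shape}^{i} = \mu_{i} + \sigma_{i} \tag{5} \]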
In terms of IoU, the mean and standard deviation of the IoU between each object's GT box and the anchors at the same level are determined. The sum of the mean and standard deviation is used as the lower-bound constraint on each object GT box's IoU. This constraint is combined with the result of the shape similarity constraint to determine the positive-sample anchors participating in training. If the anchors assigned based on shape similarity do not meet the adaptive IoU threshold, they are included in the label assignment of the following level. This ensures that each object has a corresponding anchor. The lower-bound threshold is calculated as follows:

\[ \mu_{IoU}^{i} = \frac{1}{m}\sum_{j=1}^{m} IoU_{ij} \tag{6} \]
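Since the equations following (6) are also lost in this copy, the Python sketch below shows how the two adaptive thresholds could combine for a single GT box, following the text's wording (mean plus variance for the shape ratio, mean plus standard deviation for the IoU); the function and variable names are illustrative, not the authors' exact algorithm.

import numpy as np

def adaptive_assign(r, iou):
    """Sketch of the dual shape/IoU constraint for one GT box.
    r:   (m,) ratios r_ij of the GT box against each candidate anchor
    iou: (m,) IoU of the GT box with each candidate anchor
    Returns a boolean mask of anchors kept as positive training samples."""
    shape_thr = r.mean() + r.var()    # adaptive shape threshold, cf. alpha_shape
    iou_thr = iou.mean() + iou.std()  # adaptive IoU lower bound, cf. Eq. (6)
    pos = (r <= shape_thr) & (iou >= iou_thr)
    # If nothing survives, the text defers the GT box to the next level's
    # label assignment so that every object keeps a positive sample.
    return pos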
Fig. 6. (a) GT is too small; (b) GT is too large

Fig. 7. Adaptive shape similarity threshold constraint

III. EXPERIMENTAL RESULTS AND ANALYSIS

This paper uses engineering vehicle images collected by surveillance monitors as the data source. First, the model is trained. Then, ablation experiments are conducted to determine the effects of the location-enhanced FPN, the grouping k-means clustering strategy, and the adaptive label assignment on the model. Finally, the model is compared with other mainstream lightweight object detection models to validate the method's advancement.

A. Data Introduction and Experimental Environment

There are 679 images of 1080×1920 pixels in the dataset of this experiment. The dataset is divided into a training set and a testing set at a ratio of 8:2, giving 543 training images and 136 testing images. The recognized object types include car, truck, agricultural car, and excavator; in the training set, the numbers of vehicles are 587, 389, 161, and 131, respectively, and in the testing set, 166, 110, 36, and 37, respectively. During the training phase, transfer learning and data augmentation (e.g., rotations, gamut transformation, etc.) are used to train the model.

Description of the Experimental Environment. The model was trained on an NVIDIA GeForce RTX 2080 Ti GPU with 8 GB of memory (TABLE I). As shown in Fig. 8, a …

TABLE I
Model Training Environment
Software and Hardware    Parameter
Operating System         Windows 10
GPU                      NVIDIA GeForce RTX 2080 Ti 8GB
cuDNN                    V7.6.5 for CUDA 10.2
IDE                      PyCharm
Programming Language     Python
Frame                    PyTorch

TABLE II
Embedded Device Operating Environment
Software and Hardware    Parameter
Device                   Raspberry Pi 4B 2018
Operating System         Linux 32bit
CPU                      arm v7l
Programming Language     C++
Frame                    NCNN
NCNN: a high-performance neural network inference computing framework produced by Tencent.

Model Training. Due to privacy restrictions on surveillance video, only a limited number of images could be obtained. Therefore, this paper divides the data only into a train set and a test set, without a val set, to ensure a sufficient number of training images. The improved model is pre-trained on the MS COCO 2017 train set to obtain pre-trained weights; the other methods employ official COCO pre-trained weights. On the premise that no model freezes the parameters of any layer, the pre-trained model is fine-tuned on this paper's train set using transfer learning. The experiment adopts data augmentation based on image processing because it can …
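As a concrete illustration of the split and augmentation described above, the following is a minimal Python sketch; the dataset path, random seed, and augmentation magnitudes are assumptions, and in a real detection pipeline the box coordinates must be transformed together with the rotated image.

import glob
import random
from torchvision import transforms

# 8:2 split of the 679 surveillance images (no val set, as noted above).
paths = sorted(glob.glob("engineering_vehicles/*.jpg"))  # hypothetical location
random.seed(0)
random.shuffle(paths)
n_train = int(0.8 * len(paths))
train_paths, test_paths = paths[:n_train], paths[n_train:]

# Rotation and gamut (color) transformation, as named in the text;
# the magnitudes below are illustrative, not the authors' settings.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.5, hue=0.05),
])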
C. Compare with State-of-the-Art Models

To evaluate the impact of the proposed method on the model's ability to extract features, six current state-of-the-art lightweight detectors are compared with the model enhanced by the proposed method. Among the six methods, Yolo-Fastest and Yolo-FastestV2 are lightweight variants of YOLOv3 and YOLOv5, respectively; the others are official lightweight models.

Compared with the original Yolo-FastestV2, the proposed method enhances the model's ability to extract features. Fig. 11 demonstrates that when the input image size exceeds 352, the mAP50 curve of Yolo-FastestV2 flattens; at this point, its feature extraction capability reaches a bottleneck. With the proposed improvements, the mAP50 curve still maintains an increasing trend when the input image size is greater than 352. When the input image size is less than or equal to 352, the improved model has a higher mAP50 than models at the same parameter level. When the input image size exceeds 352, the improved model exhibits a slightly lower mAP50 growth trend than NanoDet-m, YOLOv3-Tiny, and Yolo-Fastest; due to its small model size, the extraction of object feature information is insufficient.

TABLE VI compares the models quantitatively in terms of FLOPs, parameter quantities, recognition speed, and mAP50. The proposed method increases the number of parameters by 105.4K but does not increase FLOPs. Specifically, compared with YOLOv3-Tiny, YOLOX-Nano, NanoDet-m, Yolo-Fastest, and Yolo-FastestV2, the proposed method increases mAP50 by 9.42%, 9.55%, 3.09%, 11.42%, and 7.02%, respectively, and is 7.35, 8.33, 7.94, 4.92, and 1.28 FPS faster in recognition speed. The improved model is 17.38% lower in mAP50 than YOLOv5-n but 8.28 FPS faster. Therefore, in the trade-off between speed and accuracy, the performance of the enhanced model is optimal. In TABLE IV, the experiment uses the UA-DETRAC vehicle benchmark dataset, with the classes car, bus, van, and others. The numbers of cars, buses, vans, and others in the training set are 503853, 57051, 33651, and 3726, respectively, and in the test set 548555, 38519, 71785, and 16915, respectively. The proposed method improves mAP50 on car, bus, van, and others by 0.40%, 0.10%, 1.10%, and 6.25%, respectively, and by 1.97% overall. The experimental results show that this method can reasonably select positive training samples and fully learn vehicle features under video surveillance equipment.

To visually demonstrate the effect of vehicle detection on the Raspberry Pi, the experimental comparison results for each model are presented in Fig. 12. The comparison of Fig. 12(f) and Fig. 12(g) shows that the proposed method produces fewer misidentifications and more accurate bbox positioning. This demonstrates that our method fully learns the location and semantic features of small objects under video surveillance. In Fig. 12(a)-(f), the other detectors exhibit missed detections, whereas Fig. 12(g) shows that the improved Yolo-Fastest V2 misses fewer objects. This demonstrates that our method chooses better positive training samples, transmitting an effective gradient to the model during back-propagation so that the effective features of the object are fully learned. The recognition results of the improved Yolo-Fastest V2 in other scenes are shown in Fig. 13.

Fig. 10. Recognition speed comparisons of different sizes on the validation dataset

Fig. 11. mAP50 comparisons of different sizes on the validation dataset
TABLE III
Validation of the proposed method on Yolo-FastestV2 (monitoring dataset)
Method Size FLOPs mAP50
Yolo-Fastest V2 352*352 0.11G 42.90
Yolo-Fastest V2+LE-FPN 352*352 0.11G 43.91(+1.01)
Yolo-Fastest V2+GKC 352*352 0.11G 44.94(+2.04)
Yolo-Fastest V2+ALA 352*352 0.11G 44.43(+1.53)
Yolo-Fastest V2 +LE-FPN+GKC 352*352 0.11G 47.26(+4.36)
*Yolo-Fastest V2 352*352 0.11G 49.92(+7.02)
GKC: Grouping K-means Clustering; LE-FPN: Location-enhanced FPN;
ALA: Adaptive Label Assignment; *Yolo-Fastest V2: the improved Yolo-Fastest V2
TABLE IV
Validation of the proposed method on Yolo-Fastest V2 (UA-DETRAC)
Method Size All car bus van others
TABLE V
Validation of the proposed method on yolov5-n (Pascal VOC dataset)
Method Size FLOPs mAP50
yolov5-n 352*352 0.63G 74.00
yolov5-n+LE-FPN 352*352 0.63G 74.10(+0.10)
yolov5-n+GKC 352*352 0.63G 74.50(+0.50)
yolov5-n+ALA 352*352 0.63G 74.90(+0.90)
yolov5-n+LE-FPN+GKC 352*352 0.63G 74.80(+0.80)
*yolov5-n 352*352 0.63G 75.10(+1.10)
LE-FPN: Location-enhanced FPN, *yolov5-n: the improved yolov5-n
TABLE VI
Model Comparison
Model Size FLOPs Parameters Speed (NCNN) mAP50
YOLOV3-Tiny 352*352 1.97G 8.68M 324.08ms/3.89FPS 40.50
YOLOX-Nano 352*352 2.08G 2.24M 343.92ms/2.91FPS 40.37
YOLOV5-n 352*352 4.20G 1.27M 337.77ms/2.96FPS 67.30
NanoDet-m 352*352 0.41G 937.23K 303.40ms/3.30FPS 46.83
Yolo-Fastest 352*352 0.16G 291.42K 158.22ms/6.32FPS 38.50
Yolo-Fastestv2 352*352 0.11G 237.55K 100.41ms/9.96FPS 42.90
*Yolo-Fastestv2 352*352 0.11G 342.95K 88.95ms/11.24FPS 49.92
REFERENCES

[4] X. Zhang, J. Song, Y. Wang, W. Deng, and Y. Liu, "Effects of land use on slope runoff and soil loss in the Loess Plateau of China: A meta-analysis," Science of The Total Environment, vol. 755, 142418, Feb. 2021.
[5] H. Zhang, Z. Wang, B. Yang, J. Chai, and C. Wei, "Spatial–temporal characteristics of illegal land use and its driving factors in China from 2004 to 2017," International Journal of Environmental Research and Public Health, vol. 18, no. 3, 1336, Feb. 2021.
[6] X. Xiang, N. Lv, X. Guo, S. Wang, and A. El Saddik, "Engineering vehicles detection based on modified Faster R-CNN for power grid surveillance," Sensors, vol. 18, no. 7, 2258, Jul. 2018.
[7] Y. Yang, H. Song, S. Sun, W. Zhang, Y. Chen, L. Rakal, and Y. Fang, "A fast and effective video vehicle detection method leveraging feature fusion and proposal temporal link," Journal of Real-Time Image Processing, vol. 18, no. 4, pp. 1261-1274, May 2021.
[8] S. Sri Jamiya and P. Esther Rani, "An efficient method for moving vehicle detection in real-time video surveillance," in Advances in Smart System Technologies, Springer, Singapore, pp. 577-585, 2021.
[9] A. Fedorov, K. Nikolskaia, S. Ivanov, V. Shepelev, and A. Minbaleev, "Traffic flow estimation with data from a video surveillance camera," Journal of Big Data, vol. 6, no. 1, pp. 1-15, Aug. 2019.
[10] A. B. Mabrouk and E. Zagrouba, "Abnormal behavior recognition for intelligent video surveillance systems: A review," Expert Systems with Applications, vol. 91, pp. 480-491, Jan. 2018.
[11] J. Lim, M. I. Al Jobayer, V. M. Baskaran, J. M. Lim, J. See, and K. Wong, "Deep multi-level feature pyramids: Application for non-canonical firearm detection in video surveillance," Engineering Applications of Artificial Intelligence, vol. 97, 104094, Jan. 2021.
[12] F. Pérez-Hernández, S. Tabik, A. Lamas, R. Olmos, H. Fujita, and F. Herrera, "Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance," Knowledge-Based Systems, vol. 194, 105590, Apr. 2020.
[13] J. Sun, J. Shao, and C. He, "Abnormal event detection for video surveillance using deep one-class learning," Multimedia Tools and Applications, vol. 78, no. 3, pp. 3633-3647, Sep. 2019.
[14] J. T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, and R. S. M. Goh, "AnomalyNet: An anomaly detection network for video surveillance," IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2537-2550, Feb. 2019.
[15] J. T. Zhou, L. Zhang, Z. Fang, J. Du, X. Peng, and Y. Xiao, "Attention-driven loss for anomaly detection in video surveillance," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4639-4647, Dec. 2019.
[16] R. Nawaratne, D. Alahakoon, D. De Silva, and X. Yu, "Spatiotemporal anomaly detection using deep learning for real-time video surveillance," IEEE Transactions on Industrial Informatics, vol. 16, no. 1, pp. 393-402, Aug. 2019.
[17] K. Muhammad, J. Ahmad, Z. Lv, P. Bellavista, P. Yang, and S. W. Baik, "Efficient deep CNN-based fire detection and localization in video surveillance applications," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 7, pp. 1419-1434, Jun. 2018.
[18] H. Wei, M. Laszewski, and N. Kehtarnavaz, "Deep learning-based person detection and classification for far field video surveillance," in 2018 IEEE 13th Dallas Circuits and Systems Conference (DCAS), pp. 1-4, Nov. 2018.
[19] I. M. Nasir, M. Raza, J. H. Shah, S. H. Wang, U. Tariq, and M. A. Khan, "HAREDNet: A deep learning based architecture for autonomous video surveillance by recognizing human actions," Computers and Electrical Engineering, vol. 99, 107805, Apr. 2022.
[20] M. A. Khan, K. Javed, S. A. Khan, T. Saba, U. Habib, J. A. Khan, and A. A. Abbasi, "Human action recognition using fusion of multiview and deep features: an application to video surveillance," Multimedia Tools and Applications, pp. 1-27, Mar. 2020.
[21] G. Sreenu and S. Durai, "Intelligent video surveillance: a review through deep learning techniques for crowd analysis," Journal of Big Data, vol. 6, no. 1, pp. 1-27, Jun. 2019.
[22] M. Shorfuzzaman, M. S. Hossain, and M. F. Alhamid, "Towards the sustainable development of smart cities through mass video surveillance: A response to the COVID-19 pandemic," Sustainable Cities and Society, vol. 64, 102582, Jan. 2021.
[23] H. H. Tseng, M. D. Yang, R. Saminathan, Y. C. Hsu, C. Y. Yang, and D. H. Wu, "Rice seedling detection in UAV images using transfer learning and machine learning," Remote Sensing, vol. 14, no. 12, 2837, Jun. 2022.
[24] J. Lee, J. Wang, D. Crandall, S. Šabanović, and G. Fox, "Real-time, cloud-based object detection for unmanned aerial vehicles," in 2017 First IEEE International Conference on Robotic Computing (IRC), pp. 36-43, Apr. 2017.
[25] A. Anjum, T. Abdullah, M. F. Tariq, Y. Baltaci, and N. Antonopoulos, "Video stream analysis in clouds: An object detection and classification framework for high performance video analytics," IEEE Transactions on Cloud Computing, vol. 7, no. 4, pp. 1152-1167, Jan. 2016.
[26] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637-646, Jun. 2016.
[27] Y. Siriwardhana, P. Porambage, M. Liyanage, and M. Ylianttila, "A survey on mobile augmented reality with 5G mobile edge computing: architectures, applications, and technical aspects," IEEE Communications Surveys & Tutorials, vol. 23, no. 2, pp. 1160-1192, Feb. 2021.
[28] N. Balamuralidhar, S. Tilon, and F. Nex, "MultEYE: Monitoring system for real-time vehicle detection, tracking and speed estimation from UAV imagery on edge-computing platforms," Remote Sensing, vol. 13, no. 4, 573, Feb. 2021.
[29] P. Gupta, B. Pareek, G. Singal, and D. V. Rao, "Edge device based military vehicle detection and classification from UAV," Multimedia Tools and Applications, vol. 81, no. 14, pp. 19813-19834, Jul. 2022.
[30] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969, 2017.
[31] ultralytics/yolov5: YOLOv5 in PyTorch > ONNX > CoreML > TFLite (github.com).
[32] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," arXiv preprint arXiv:2107.08430, 2021.
[33] Y. Zhao, Y. Yin, and G. Gui, "Lightweight deep learning based intelligent edge surveillance techniques," IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 4, pp. 1146-1154, Jun. 2020.
[34] L. Zhou, S. Wei, Z. Cui, J. Fang, X. Yang, and W. Ding, "Lira-YOLO: a lightweight model for ship detection in radar images," Journal of Systems Engineering and Electronics, vol. 31, no. 5, pp. 950-956, Oct. 2020.
[35] A. Ullah, K. Muhammad, W. Ding, V. Palade, I. U. Haq, and S. W. Baik, "Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications," Applied Soft Computing, vol. 103, 107102, May 2021.
[36] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B. Y. Choi, and T. Faughnan, "Smart surveillance as an edge network service: From Haar-cascade, SVM to a lightweight CNN," in 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC), pp. 256-265, Oct. 2018.
[37] L. Steinmann, L. Sommer, A. Schumann, and J. Beyerer, "Fast and lightweight person detector for unmanned aerial vehicles," in EUSIPCO, 2019.
[38] H. Zhao, Y. Zhou, L. Zhang, Y. Peng, X. Hu, H. Peng, and X. Cai, "Mixed YOLOv3-LITE: a lightweight real-time object detection method," Sensors, vol. 20, no. 7, 1861, Mar. 2020.
[39] RangiLyu/nanodet: NanoDet-Plus: Super fast and lightweight anchor-free object detection model. Only 980 KB (int8) / 1.8 MB (fp16) and runs 97 FPS on cellphone (github.com).
[40] A. N. Amudhan and A. P. Sudheer, "Lightweight and computationally faster Hypermetropic Convolutional Neural Network for small size object detection," Image and Vision Computing, vol. 119, 104396, Mar. 2022.
[41] J. Liu, C. Hu, J. Zhou, and W. Ding, "Object detection algorithm based on lightweight YOLOv4 for UAV," in 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), pp. 425-429, Apr. 2022.
[42] dog-qiuqiu/Yolo-FastestV2: Based on Yolo's low-power, ultra-lightweight universal target detection algorithm; the parameter size is only 250k, and the speed on smart phone mobile terminals can reach ~300 FPS+ (github.com).
[43] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759-9768, 2020.
Hairong Zhang received the Ph.D. degree in Geodesy and Survey Engineering from China University of Mining and Technology, Xuzhou, China, in 2002. He is currently an Associate Professor with the School of Environment and Spatial Informatics, China University of Mining and Technology, Xuzhou, China. He is also a guest professor in the Key Laboratory of Natural Resources Monitoring in Tropical and Subtropical Area of South China, Ministry of Natural Resources. His research interests include GIS, deep learning, and geographical modeling.

Dongsheng Xu received the B.S. degree in human geography and urban & rural planning from the School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo, China, in 2020. He is currently working toward the M.S. degree in cartography and geographical information engineering with China University of Mining and Technology, Xuzhou, China. His research interests include deep learning and model compression.

Dayu Cheng received the Ph.D. degree in cartography and geographic information engineering from China University of Mining and Technology, Xuzhou, China, in 2012. He is currently an Associate Professor with the School of Mining and Geomatics Engineering, Hebei University of Engineering, Handan, China. His research interests include geographic big data mining, computing vision, and GIS development and applications.

Geng Xu received the B.S. degree in geophysics from the School of Resources and Geosciences, China University of Mining and Technology, Xuzhou, China, in 2020, where he is currently working toward the M.S. degree. His research interests include intelligent recognition of mine water inrush with multi-field coupling based on deep learning.

Wei Liu received the M.S. degree in cartography and geographic information engineering from China University of Mining and Technology, Xuzhou, China, in 2007, and the Ph.D. degree in cartography and geographic information engineering from China University of Mining and Technology, Xuzhou, China, in 2010. He is currently an Associate Professor with the School of Geography, Geomatics and Planning, Jiangsu Normal University, Xuzhou, China. His research interests include spatial data quality checking, high-resolution remote sensing image processing, and GIS development and applications.

Teng Wang received the Master's degree in Telecommunications Engineering and Telecommunication Networks from the University of Technology Sydney, Sydney, Australia, in 2014. He is currently a Senior Engineer with the Surveying and Mapping Institute, Lands and Resource Department of Guangdong Province, Guangzhou, China. His research interests include GIS and computing vision.