
This article has been accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/JSTARS.2023.3249216

An Improved Lightweight Yolo-Fastest V2 for Engineering Vehicle Recognition Fusing Location Enhancement and Adaptive Label Assignment

Hairong Zhang, Dongsheng Xu, Dayu Cheng, Xiaoliang Meng, Geng Xu, Wei Liu, and Teng Wang

Abstract— Engineering vehicle recognition based on video surveillance is one of the key technologies to assist illegal land use monitoring. At present, engineering vehicle recognition mainly adopts traditional deep learning models with a large number of floating-point operations, so it cannot be achieved in real-time on edge devices with limited computing power and storage. In addition, some lightweight models have problems with inaccurate bounding box locating, a low recognition rate, and unreasonable selection of positive training samples for small objects. To solve these problems, this paper proposes an improved lightweight Yolo-Fastest V2 for engineering vehicle recognition fusing location enhancement and adaptive label assignment. The location-enhanced Feature Pyramid Network (FPN) structure combines deep and shallow feature maps to accurately localize bounding boxes. The grouping k-means clustering strategy and adaptive label assignment algorithm select an appropriate anchor for each object based on its shape and Intersection over Union (IoU). The study was conducted on a Raspberry Pi 4B 2018 using two datasets and different models. Experiments show that our method achieves the optimal combination of speed and accuracy. Specifically, mAP50 is increased by 7.02% at a speed of 11.24FPS on engineering vehicle data obtained by video surveillance in a rural area of China.

Index Terms—lightweight model, feature fusion, k-means clustering, adaptive label assignment.

This research was supported by the project of major scientific and technological achievements transformation in Hebei Province of China: 22287401Z, NSFC: 41971352, Alibaba Cloud Computing Co. Ltd. funding: 17289315, and Guangxi JMRH Funding: 202203. (Corresponding author: Dayu Cheng.)
Hairong Zhang is with the School of Environment and Spatial Informatics, China University of Mining and Technology, Xuzhou 221116, China, and also with the Key Laboratory of Natural Resources Monitoring in Tropical and Subtropical Area of South China, Ministry of Natural Resources, Guangzhou 510500, China (e-mail: [email protected]).
Dongsheng Xu and Geng Xu are with the School of Environment and Spatial Informatics, China University of Mining and Technology, Xuzhou 221116, China (e-mail: [email protected]; [email protected]).
Dayu Cheng is with the School of Mining and Geomatics Engineering, Hebei University of Engineering, Handan 056038, China (e-mail: [email protected]).
Xiaoliang Meng is with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China, and also with the Key Laboratory of Natural Resources Monitoring in Tropical and Subtropical Area of South China, Ministry of Natural Resources, Guangzhou 510500, China (e-mail: [email protected]).
Wei Liu is with the School of Geography, Geomatics and Planning, Jiangsu Normal University, Xuzhou 221116, China (e-mail: [email protected]).
Teng Wang is with the Surveying and Mapping Institute, Lands and Resource Department of Guangdong Province, Guangzhou 510500, China (e-mail: [email protected]).

I. INTRODUCTION

With the development of the economy, urban population, and transportation level, Chinese urbanization has entered a stage of rapid expansion. This will lead to dramatic changes in Chinese urban land use and an increasingly prominent contradiction between people and land. In recent years, some illegal land use behaviors have occurred frequently, especially illegal construction on cultivated land, forest land, and rivers. In addition, coupled with the serious national conditions of land pollution and soil erosion, we must cherish land resources and strengthen the fight against illegal land use. Engineering vehicles (such as trucks, excavators, etc.) are important tools for illegal land use. Therefore, identifying engineering vehicles from video is a useful way to monitor illegal land use. The traditional method identifies engineering vehicles manually through surveillance video by staff [5], a time-consuming and labor-intensive process. With the rapid advancement of deep learning, automatic image recognition based on convolutional neural networks (CNN) has become widespread in traffic management [6]-[9], crime prevention [10]-[12], anomaly detection [13]-[16], fire warning [17], human detection [18]-[22], and agricultural management [23], among other fields. This also enables the identification of engineering vehicles from video in real-time. However, the real-time video captured by the online camera must be uploaded to a cloud or local server on which the deep learning recognition model has been deployed [24]-[25]. This process is affected by network bandwidth, preventing data transmission in real-time. Not only are a significant number of operations concentrated on the server, but the server's performance requirements are also relatively high. Consequently, in the process of video recognition, deploying computing power at the front end is a crucial way to alleviate the pressure of data transmission and central server computing. Some researchers have proposed smart devices that advance computing tasks to the network edge with the advent of edge computing [26]-[27]. The object recognition model is directly deployed in the edge device, and the edge device's computing resources are used to perform a portion of the computing tasks required for object recognition [28]-[29], thereby enhancing real-time recognition and relieving the burden on the central server.


However, because current mainstream deep learning object recognition models require large numbers of floating-point operations (FLOPs) [30]-[32], real-time recognition cannot be achieved on edge devices with limited computing and storage resources. Therefore, developing a lightweight model for object recognition in land and resource management has become a popular research direction [33]-[35].

Currently, some researchers are investigating lightweight object recognition models. Nikouei et al. [36] developed a lightweight human detection model, L-CNN, using depth-wise separable convolutions and only 23 layers. On the edge device Raspberry Pi 3 Model B, it has an average recognition speed of 1.79FPS for 224×224 video images. Even though this model has a small number of parameters, the computation required is still substantial, and real-time detection cannot be achieved on devices with limited computing resources. Steinmann et al. [37] replaced the SSD detector's backbone network with the lightweight backbone network PELEE and combined it with a filter pruning strategy to reduce the number of parameters. On Jetson AGX, the recognition speed is 24 times faster than the original model, but the recognition accuracy suffers. Based on YOLO-Lite as the backbone network, Zhao et al. [38] added a residual block, balanced the high- and low-resolution sub-networks, and proposed the Mixed YOLO V3-Lite lightweight model. Using Jetson AGX Xavier equipment, the speed of recognizing 224×224 video images can reach 43FPS; still, the number of parameters is substantial, reaching 20.5MB. Balamuralidhar et al. [28] introduced the MultEYE network in the UAV traffic detection system MultEYE. The network implements pruning to reduce the number of CSP Bottleneck and convolution blocks and adds a space-to-depth layer to decrease the spatial size and increase the number of channels. The speed of recognizing 512×320 video images on NVIDIA Xavier NX can reach 29FPS. RangiLyu [39] proposed the anchor-free NanoDet model, whose Head, Neck, and Backbone are all lightweight; it can recognize data with a size of 320×320 on a mobile ARM CPU at a speed of 97FPS. However, the anchor-free model has significant issues with positive and negative sample imbalance and semantic ambiguity, leading to the phenomenon of missing objects. Amudhan et al. [40] proposed a lightweight model, PmA, that can detect small objects in remote sensing images effectively and quickly by extracting more feature information from shallow features and transferring shallow features to deep features for detection. Compared to YOLO V3-tiny and YOLO V4-tiny, its detection speed on Jetson Nano is 32% faster. Liu et al. [41] addressed the issues of limited UAV computing resources and too-small objects by enhancing the network structure and the target prediction frame selection algorithm of YOLO V4, improving recognition accuracy and achieving 23.8FPS on Nvidia Jetson TX2 and 9.6FPS on Raspberry Pi 4B. Dog-qiuqiu [42] proposed the Yolo-FastestV2 model for real-time detection on mobile devices. The model replaces the backbone network of YOLO V5 with ShuffleNet V2 while simultaneously lightening the FPN structure; the parameter size of the model is just 237.55KB. On the Mate 30 Kirin 990 CPU, the image recognition speed at 352×352 can reach more than 300 FPS, and a real-time detection effect can also be achieved on embedded devices with limited computing resources. However, Yolo-Fastest V2 lacks rich feature fusion, which prevents it from making full use of the rich location information in shallow feature maps and from selecting suitable positive training samples effectively. This results in insufficient extraction of location and semantic features of small objects, which leads to inaccurate positioning of the bounding boxes (bboxes) of small objects and a low recognition rate.

Yolo-FastestV2 is a widely used lightweight object recognition model. This paper proposes a method of fusing location-enhanced FPN and adaptive label assignment to address the issues encountered by Yolo-FastestV2. The method improves the FPN structure by fusing feature maps of different scales to ensure that the object's location and feature information are fully extracted. A grouping k-means clustering strategy and an adaptive label assignment algorithm are proposed to assign appropriate anchor boxes to each object, combating the issues of a low recognition rate and unreasonable selection of positive training samples. The algorithm assigns each object in the image an anchor with a similar shape and the largest IoU; in other words, a suitable positive training sample is selected to facilitate the regression of the bbox.

In summary, the following are the article's contributions:
1) Feature fusion is added based on the original model's FPN. The shallow feature map, rich in location information, is integrated with the deep feature map channels, rich in semantic information, to enhance the localization ability of the model;
2) A threshold α is introduced and a grouping k-means clustering strategy is proposed. By adjusting the α value, multi-scale anchors are generated for small objects, so small-scale anchor boxes can be set reasonably to make small bboxes easier to learn;
3) An adaptive label assignment algorithm based on shape and IoU is proposed. On top of shape label assignment, an IoU label assignment algorithm is introduced. The shape threshold and IoU threshold are dynamically calculated from the mean and variance of the IoU and of the shape ratio. Under the dual constraints of shape and IoU, an appropriate anchor box is assigned to each object.

II. PROPOSED METHOD

As shown in Fig. 1, the model consists of three components: the backbone network ShuffleNet V2, the FPN, and the prediction head. This paper improves the Yolo-FastestV2 model in three aspects: FPN structure, anchor box settings, and label assignment.


Fig. 1. Overall framework of the improved Yolo-Fastest V2 (ShuffleNet V2 backbone with ShuffleV2Blocks 1-3, the Location-Enhanced Light-FPN, and the prediction heads with Adaptive Label Assignment).

In recent years, ShuffleNet V2 has been an excellent network for lightweight feature extraction. It consists of three stacked ShuffleV2Block convolution blocks, and the entire network ensures optimal performance for feature extraction by minimizing memory access. The location-enhanced FPN structure (red dotted line in Fig. 1) integrates the location information of the shallow features into the deep features to improve the model's ability to locate the bbox of a small object. In the label assignment algorithm of the prediction head (blue dotted line), the grouping k-means clustering strategy generates reasonable anchors for small objects, while the adaptive label assignment algorithm matches appropriate anchors for each object. Details are as follows.

A. Location-enhanced FPN
In Fig. 2(a), the original FPN structure predicts small objects using the feature map obtained by convolution of the third ShuffleV2Block convolution block in the ShuffleNet V2 network structure, in conjunction with 1×1 convolution. Then, the feature map obtained by the 1×1 convolution is up-sampled and subjected to concatenation feature fusion with the feature map obtained by the second ShuffleV2Block convolution block. Finally, 1×1 convolution is performed to predict objects whose scale is greater than that of the small object. Thus, the FPN utilizes only two layers of shallow feature maps, lacks rich object location information, and cannot fully extract semantic information. This will result in inaccurate center coordinate positioning when the model predicts the bbox of a small object.

This paper proposes a location-enhanced FPN structure (Fig. 2(b)), which primarily contributes to feature fusion by introducing the 1/8-scale shallow feature map obtained by the first ShuffleV2Block convolution block into the deep feature map.

For medium and large-scale objects, the feature map obtained from the first ShuffleV2Block convolution block is down-sampled by a factor of 4 to obtain C1. Second, C11 is obtained by performing concatenation feature fusion on C1 and the feature map obtained by the third ShuffleV2Block convolution block. C11 is subjected to 1×1 convolution to predict objects of medium and large scale.

For small-scale objects, the result after the C11 convolution is first up-sampled by a factor of 2 to obtain C13. Second, C2 is obtained by down-sampling by a factor of 2 the feature map obtained by the first ShuffleV2Block convolution block. The feature maps of C2 and the second ShuffleV2Block convolution block are fused by concatenation to obtain C21, which incorporates deep semantic information. C22 is obtained after 1×1 convolution. After each concatenation feature fusion operation, a 1×1 convolution is used because, during the whole model training process, the convolution operation can extract the effective feature information of the previous feature map and weaken the noise information. Finally, C22 and C13 perform addition feature fusion to predict small objects. Through addition feature fusion, shallow and deep features are fully fused, resulting in a fused feature map with rich object location information that improves the original model's ability to locate bboxes. This structure enables the prediction of both large and small objects to benefit from the fusion of shallow features with rich location information, thereby improving locating accuracy.
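Read literally, the fusion above is a few pooling, concatenation, 1×1 Conv-BN-ReLU, up-sampling, and addition steps. The PyTorch sketch below only illustrates that pattern and is not the authors' released code: the stage widths (48/96/192 channels into 72-channel heads) are read off Fig. 1, and max-pooling and nearest-neighbor up-sampling are assumed stand-ins for the unspecified down- and up-sampling operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationEnhancedFPN(nn.Module):
    """Sketch of the location-enhanced Light-FPN fusion in Fig. 2(b)."""
    def __init__(self, c1=48, c2=96, c3=192, out=72):
        super().__init__()
        # 1x1 Conv-BN-ReLU applied after each concatenation fusion
        self.conv_large = nn.Sequential(
            nn.Conv2d(c1 + c3, out, 1), nn.BatchNorm2d(out), nn.ReLU())
        self.conv_small = nn.Sequential(
            nn.Conv2d(c1 + c2, out, 1), nn.BatchNorm2d(out), nn.ReLU())

    def forward(self, s1, s2, s3):
        # s1: 1/8-scale map (ShuffleV2Block 1), s2: 1/16 (block 2), s3: 1/32 (block 3)
        c1 = F.max_pool2d(s1, 4)                       # down-sample 1/8 -> 1/32
        c11 = self.conv_large(torch.cat([c1, s3], 1))  # medium/large-object map
        c13 = F.interpolate(c11, scale_factor=2.0)     # up-sample 1/32 -> 1/16
        c2 = F.max_pool2d(s1, 2)                       # down-sample 1/8 -> 1/16
        c21 = torch.cat([c2, s2], 1)                   # concatenation fusion -> C21
        c22 = self.conv_small(c21)                     # 1x1 conv -> C22
        return c11, c22 + c13                          # addition fusion for small objects
```

For a 352×352 input, feeding maps of shapes (1, 48, 44, 44), (1, 96, 22, 22), and (1, 192, 11, 11) returns a 72-channel 1/32-scale map for the medium/large head and a 72-channel 1/16-scale map for the small-object head.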
B. Grouping K-means Clustering Strategy
For anchor-based object recognition, appropriate anchors must be set before label assignment. However, the sizes of the objects in an image vary greatly. If we directly apply k-means clustering to the bboxes of the training set, the bboxes of some very small objects are easily clustered into a wide range of anchors (Fig. 3(a)).


This is due to the influence of the threshold k in the k-means method, which results in assigning such anchors to small objects (Fig. 3(b)). Ultimately, these anchors cause harmful gradient values to be passed back.

This paper proposes a grouping k-means clustering strategy in response to this issue. As shown in Fig. 4, the strategy includes a threshold α. The width and height of each object's ground-truth (GT) box are compared with the image's width and height. GT boxes whose minimum value of this ratio is less than the threshold α are clustered into n (n≥1) classes; GT boxes larger than α are clustered into m (m≥1) classes:

$$\text{Number of anchor box types} = \begin{cases} n, & \min(h/H,\ w/W) < \alpha \\ m, & \min(h/H,\ w/W) \ge \alpha \end{cases}$$

The threshold α is used as the demarcation point between small objects and medium objects. It can be adjusted by analyzing the relative width and height distribution of the objects in the dataset, so that the small objects form a single set. k-means clustering is performed on this set to obtain three anchors of different scales, allowing each small object to be assigned a more reasonable label. For the setting of the m and n values, models from the Yolo series [26] usually cluster into three types of anchors for each object level. Therefore, this paper continues the Yolo practice and sets both n and m to 3.

Fig. 2. (a) FPN; (b) Location-enhanced FPN
Fig. 3. (a) Small objects are aggregated as a whole into a class of anchors; (b) Smaller objects do not match the anchors obtained by clustering
Fig. 4. Ratio of width to height
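A minimal sketch of this grouping rule follows, assuming normalized (w/W, h/H) box arrays and scikit-learn's Euclidean KMeans (anchor clustering is often done with an IoU-based distance instead; the function name and interface here are illustrative, not the authors' code).

```python
import numpy as np
from sklearn.cluster import KMeans

def grouped_anchor_clustering(boxes_wh, alpha=0.1, n=3, m=3):
    """boxes_wh: (N, 2) array of GT (w/W, h/H), normalized to the image size."""
    ratio = boxes_wh.min(axis=1)        # min(h/H, w/W) for each GT box
    small = boxes_wh[ratio < alpha]     # small-object group -> n anchors
    other = boxes_wh[ratio >= alpha]    # medium/large group -> m anchors
    small_anchors = KMeans(n_clusters=n, n_init=10).fit(small).cluster_centers_
    other_anchors = KMeans(n_clusters=m, n_init=10).fit(other).cluster_centers_
    return small_anchors, other_anchors
```

With α = 0.1 and n = m = 3, as used in this paper, the call returns three small-scale anchors and three medium/large-scale anchors.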
C. Adaptive Label Assignment
After acquiring appropriate anchors, shape label assignment is applied (Fig. 5): each object is matched with an appropriate anchor, which is then assigned cls and reg labels. However, the original method of shape label assignment causes two issues: (1) the fixed shape threshold cannot guarantee that each object matches an anchor; in other words, the selection of positive training samples is unreasonable.


For example, objects smaller or larger than the shape label assignment threshold at the different levels fail to match any anchor (Fig. 6(a), Fig. 6(b)); (2) it is possible to match multiple anchors with similar shapes, among which the anchors with low IoU participate in the backpropagation, which increases the complexity of model training.

This paper proposes an adaptive label assignment strategy inspired by the ATSS algorithm [43]. The strategy is divided into shape similarity and maximum IoU. In terms of shape, first, at the same level, calculate the mean and standard deviation of the aspect ratios $(r_W, r_H)$ between each GT box and the anchors. The sum of the mean and standard deviation yields the upper boundary constraint value over each object's anchors. As shown in Fig. 7, this constraint guarantees that each object is assigned to at least one similarly-shaped anchor. The shape similarity threshold is calculated as follows:

$$d_W^{ij} = \begin{cases} W_{gt}^{i} / W_{anchor}^{j}, & \text{if } r_W \ge 1 \\ W_{anchor}^{j} / W_{gt}^{i}, & \text{if } r_W < 1 \end{cases}, \qquad d_H^{ij} = \begin{cases} H_{gt}^{i} / H_{anchor}^{j}, & \text{if } r_H \ge 1 \\ H_{anchor}^{j} / H_{gt}^{i}, & \text{if } r_H < 1 \end{cases} \quad (1)$$

$$r_{ij} = \max\left(d_W^{ij}, d_H^{ij}\right) \quad (2)$$

$$\mu_i = \frac{\sum_{j=1}^{m} r_{ij}}{m} \quad (3)$$

$$\sigma_i = \sqrt{\frac{\sum_{j=1}^{m} \left(\mu_i - r_{ij}\right)^2}{m}} \quad (4)$$

$$\alpha_{shape}^{i} = \mu_i + \sigma_i \quad (5)$$

In (1)-(5), $H_{gt}^{i}$ represents the height of the i-th object GT box; $H_{anchor}^{j}$ represents the height of the j-th anchor; $W_{gt}^{i}$ denotes the width of the i-th object GT box; $W_{anchor}^{j}$ represents the width of the j-th anchor; $d_W^{ij}$ denotes the width ratio between the i-th object GT box and the j-th anchor; $d_H^{ij}$ represents the height ratio between the i-th object GT box and the j-th anchor; m represents the number of anchors; $r_{ij}$ denotes the maximum of the width and height ratios between the i-th object GT box and the j-th anchor; $\mu_i$ represents the r-mean of the i-th object GT box; $\sigma_i$ denotes the r-standard deviation of the i-th object GT box; $\alpha_{shape}^{i}$ represents the shape threshold of the i-th GT box.

In terms of IoU, determine the mean and standard deviation of the IoU between each object's GT box and the anchors at the same level. The sum of the mean and standard deviation is used as the lower boundary constraint value of each object's GT box IoU. This constraint is combined with the results of the shape similarity constraint to determine the positive sample anchors participating in training. If the anchors assigned based on shape similarity do not meet the adaptive IoU threshold, the object is included in the label assignment for the following level. This ensures that each object has a corresponding anchor. The lower bound threshold is calculated as follows:

$$\mu_{IoU}^{i} = \frac{\sum_{j=1}^{m} IoU_{ij}}{m} \quad (6)$$

$$\sigma_{IoU}^{i} = \sqrt{\frac{\sum_{j=1}^{m} \left(\mu_{IoU}^{i} - IoU_{ij}\right)^2}{m}} \quad (7)$$

$$\alpha_{IoU}^{i} = \mu_{IoU}^{i} + \sigma_{IoU}^{i} \quad (8)$$

In (6)-(8), $IoU_{ij}$ represents the IoU of the i-th object GT box and the j-th anchor; $\mu_{IoU}^{i}$ denotes the mean of all IoUs for the i-th object GT box; $\sigma_{IoU}^{i}$ represents the standard deviation of all IoUs for the i-th object GT box; $\alpha_{IoU}^{i}$ denotes the IoU threshold for the i-th object GT box.

Fig. 5. Shape Assignment (Yolo-Fastest V2)
Fig. 6. (a) GT is too small; (b) GT is too large
Fig. 7. Adaptive shape similarity threshold constraint

In a lightweight network structure, too many pyramid levels will affect the recognition speed, which is not conducive to embedded deployment and operation. The proposed method adapts anchors at different scales for each object at the same feature scale level; therefore, the method is also applicable to network structures without an FPN. The proposed method operates at the multi-scale anchor level based on the relationship between each object's GT box and the anchors. The adaptive label assignment strategy dynamically adjusts the shape similarity threshold and the IoU threshold so that the object matches the anchor in terms of both shape similarity and overlap. It can mitigate the issue of unreasonable positive training sample selection to some extent.
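The two thresholds in (1)-(8) reduce to a few vectorized operations. Below is a NumPy sketch for one GT box against the m anchors of a single level; the width/height-only IoU helper is an assumption of this illustration (a common choice for anchor matching), and np.std matches the population standard deviation used in (4) and (7).

```python
import numpy as np

def wh_iou(w_gt, h_gt, anchors_wh):
    """IoU of one GT box against m anchors with all boxes aligned at the origin
    (the usual width/height-only IoU for anchor matching; an assumption here)."""
    inter = np.minimum(w_gt, anchors_wh[:, 0]) * np.minimum(h_gt, anchors_wh[:, 1])
    union = w_gt * h_gt + anchors_wh[:, 0] * anchors_wh[:, 1] - inter
    return inter / union

def adaptive_assign(w_gt, h_gt, anchors_wh):
    """anchors_wh: (m, 2) anchors of one level; returns a positive-sample mask."""
    rw = w_gt / anchors_wh[:, 0]            # width ratio r_W
    rh = h_gt / anchors_wh[:, 1]            # height ratio r_H
    d_w = np.where(rw >= 1, rw, 1.0 / rw)   # (1): fold each ratio so d >= 1
    d_h = np.where(rh >= 1, rh, 1.0 / rh)
    r = np.maximum(d_w, d_h)                # (2)
    alpha_shape = r.mean() + r.std()        # (3)-(5): shape upper bound
    ious = wh_iou(w_gt, h_gt, anchors_wh)
    alpha_iou = ious.mean() + ious.std()    # (6)-(8): IoU lower bound
    # positives must be similar in shape AND have sufficient overlap
    return (r <= alpha_shape) & (ious >= alpha_iou)
```

Anchors surviving both masks are the positive samples; per the strategy above, a GT box whose mask comes back empty would be handed to the next level's assignment.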


III. EXPERIMENTAL RESULTS AND ANALYSIS

This paper uses engineering vehicle images collected by the monitor as the data source. First, the model is trained. Then, ablation experiments are conducted to determine the effect of the location-enhanced FPN, the grouping k-means clustering strategy, and the adaptive label assignment on the model. Finally, it is compared to other mainstream lightweight object detection models to validate the method's advancement.

A. Data Introduction and Experimental Environment
There are 679 images with 1080*1920 pixels in the dataset of this experiment. The dataset is divided into a training set and a testing set at a ratio of 8:2: 543 training images and 136 testing images. The recognized object types include car, truck, agricultural car, and excavator. In the training set, the numbers of vehicles are 587, 389, 161, and 131, respectively; in the testing set, they are 166, 110, 36, and 37, respectively. During the training phase, transfer learning and data augmentation (e.g., rotations, gamut transformation, etc.) are used to train the model.

Description of the Experimental Environment. The model was trained on an NVIDIA GeForce RTX 2080 Ti GPU with 8GB of memory (TABLE I). As shown in Fig. 8, a Raspberry Pi 4B 2018 with an ARM v7l CPU is used as a limited-resource edge computing device to infer the model (TABLE II).

Fig. 8. Raspberry Pi 4B 2018

TABLE I
Model Training Environment
Software and Hardware | Parameter
Operating System | Windows 10
GPU | NVIDIA GeForce RTX 2080 Ti 8GB
cuDNN | V7.6.5 for CUDA 10.2
IDE | PyCharm
Programming Language | Python
Frame | PyTorch

TABLE II
Embedded Device Operating Environment
Software and Hardware | Parameter
Device | Raspberry Pi 4B 2018
Operating System | Linux 32bit
CPU | ARM v7l
Programming Language | C++
Frame | NCNN
(NCNN: a high-performance neural network inference computing framework produced by Tencent.)

Model Training. Due to privacy restrictions on surveillance video, only a limited number of images could be obtained. Therefore, this paper only divides the data into a train set and a test set, without a val set, to ensure a sufficient number of training samples. The improved model is pre-trained on the MS COCO 2017 train set to obtain pre-trained weights; the other methods employ official COCO pre-trained weights. On the premise that no model freezes the parameters of any layer, the pre-trained model is fine-tuned using transfer learning on the train set in this paper. The experiment adopts data augmentation based on image processing because it can avoid overfitting of the model on a dataset with a small amount of data. It has a weak impact on the recognition accuracy.


Model Parameters. The mini-batch size is set to 4 for all other methods, and all of their other parameters retain their official configuration. The improved model is trained with an initial learning rate of 0.001 for 240 epochs. At the 130th, 160th, 178th, and 185th epochs, the learning rate decreases by a factor of 10. The mini-batch size is set to 4, and the stochastic gradient descent (SGD) algorithm is used. In this experiment, CrossEntropyLoss and CIoU Loss are used as the loss functions for classification and regression. The foreground-background classification loss ℒobj has a weighting factor of 34, the type classification loss ℒcls has a weighting factor of 64, and the bbox regression loss ℒreg has a weighting factor of 3.4. In the inference phase of all methods, the Non-Maximum Suppression (NMS) threshold is 0.45, the classification (cls) threshold is 0.5, the object (obj) threshold is 0.5, and the confidence (conf) score of a bbox is the cls score multiplied by the obj score.
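These settings map onto standard PyTorch components. The following is a hedged sketch of the schedule and loss weighting only; the nn.Linear stand-in and the loss-term names are placeholders, not the paper's training code.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the detector; illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[130, 160, 178, 185], gamma=0.1)

def total_loss(l_obj, l_cls, l_reg):
    """Weighted sum quoted in the text: 34*obj + 64*cls + 3.4*reg."""
    return 34.0 * l_obj + 64.0 * l_cls + 3.4 * l_reg

for epoch in range(240):
    # ... one mini-batch pass (batch size 4) over the train set goes here ...
    scheduler.step()  # drops the learning rate 10x at epochs 130/160/178/185

# At inference, conf = cls_score * obj_score is computed for boxes whose
# cls and obj scores clear the 0.5 thresholds, then NMS (IoU 0.45) is applied.
```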
The threshold α is determined primarily by the distribution of the data, in two steps. First, determining the threshold α requires an analysis of the relative width and height distribution of the bounding boxes of all objects in the dataset. Second, the α value is determined according to the distribution of the relative widths and heights of the small objects. In Fig. 9, the relative widths and heights of the small objects in the training set fall within the smallest increment of 0.1. Therefore, the threshold α is set to 0.1 in grouping k-means clustering.

Fig. 9. The relative width and height distribution of the objects in the training set

B. Ablation Experiment
(1) Impact of image size
To achieve the goal of real-time detection, this paper investigates the impact of different input image sizes on the recognition speed of the model. The picture size is divided into 256*256, 320*320, 352*352, 416*416, 480*480, and 640*640. For a fair comparison, all models are embedded on the Raspberry Pi 4B 2018, and the recognition speed is calculated using NCNN. *Yolo-Fastest V2 denotes the improved model. As shown in Fig. 10, the recognition speed curve of *Yolo-FastestV2 is similar to that of the original model, demonstrating that the proposed method has no significant effect on the model's recognition speed. Video surveillance recognition mostly uses edge computing devices with limited computing power in the real world. Because such devices tend to use smaller image inputs and we want the mAP50 to be as high as possible, the experimental standard for image input is set at 352*352 on the Raspberry Pi.

(2) Impact of the proposed method
Ablation experiments were conducted on Yolo-Fastest V2 to verify the effect of each proposed method on mAP50. TABLE III shows that introducing the location-enhanced FPN and the grouping k-means clustering strategy can increase mAP50 by 1.01% and 2.04%, respectively. The experimental results indicate that the feature fusion in the location-enhanced FPN improves the model's ability to learn features, and the grouping k-means clustering strategy contributes positively to assigning an appropriate anchor box to each small-scale object. Replacing the original shape matching in Yolo-FastestV2 with the proposed adaptive label assignment leads to a 1.53% improvement in mAP50. The experimental results demonstrate that adaptive label assignment ensures each object is assigned to an anchor with a similar shape and high overlap, so the model is trained with a large number of positive examples. The two methods of location-enhanced FPN and grouping k-means clustering together improve mAP50 by 4.36%. Together, the three proposed methods increase mAP50 by 7.02%, proving that the combined scheme is effective.

To demonstrate the robustness of the proposed method, it is implemented within the yolov5-n model and validated using the Pascal VOC (VOC2007+VOC2012) dataset. The experimental results are shown in TABLE V. The FPN structure of yolov5-n increases mAP50 by 0.10% after adding the location enhancement method. The grouping k-means clustering strategy improves mAP50 by 0.50%. Adaptive label assignment increases mAP50 by 0.90%. After applying the grouping k-means clustering, the location-enhanced structure improves mAP50 by 0.80%. Moreover, the method that combines location enhancement and adaptive label assignment increases mAP50 by 1.10%. Consequently, the experimental results demonstrate the robustness of the proposed method.


C. Comparison with State-of-the-Art Models
To evaluate the impact of the proposed method on the model's ability to extract features, six current state-of-the-art lightweight detectors are compared with the model enhanced by the proposed method. Among the six methods, Yolo-Fastest and Yolo-FastestV2 are lightweight versions of YOLOV3 and YOLOV5, respectively; the others are official lightweight models.

Compared to the original Yolo-FastestV2, the proposed method enhances the model's ability to extract features. Fig. 11 demonstrates that when the input image size exceeds 352, the mAP50 curve of Yolo-FastestV2 tends to be flat; at this point, its feature extraction capability reaches a bottleneck. With the improved model, the ability to extract features is enhanced: the mAP50 curve still maintains an increasing trend when the input image size is greater than 352. When the input image size is less than or equal to 352, the improved model has a higher mAP50 than models with the same parameter level. When the input image size exceeds 352, the improved model exhibits a slightly lower mAP50 growth trend than NanoDet-m, YOLOV3-Tiny, and Yolo-Fastest; due to its model size, the extraction of the object's feature information is insufficient.

Fig. 10. Recognition speed comparisons at different sizes on the validation dataset
Fig. 11. mAP50 comparisons at different sizes on the validation dataset

TABLE VI compares the models quantitatively in terms of FLOPs, parameter quantities, recognition speed, and mAP50. The proposed method increases the number of parameters by 105.4K, but does not increase FLOPs. Specifically, compared to YOLOV3-Tiny, YOLOX-Nano, NanoDet-m, Yolo-Fastest, and Yolo-FastestV2, the proposed method increases mAP50 by 9.42%, 9.55%, 3.09%, 11.42%, and 7.02%, respectively, and is 7.35FPS, 8.33FPS, 7.94FPS, 4.92FPS, and 1.28FPS faster in terms of recognition speed. The improved model is 17.38% lower in mAP50 than YOLOV5-n, but 8.28FPS faster. Therefore, in a trade-off between speed and accuracy, the performance of the enhanced model is optimal. In TABLE IV, the experiment uses the UA-DETRAC vehicle benchmark dataset, which includes car, bus, van, and others. The numbers of cars, buses, vans, and others in the training set are 503853, 57051, 33651, and 3726, respectively; in the test set, they are 548555, 38519, 71785, and 16915, respectively. The proposed method increases car, bus, van, and others by 0.40%, 0.10%, 1.10%, and 6.25% mAP50, respectively, and by 1.97% mAP50 overall. The experimental results show that this method can reasonably select positive training samples and fully learn vehicle features under video surveillance equipment.

To visually demonstrate the effect of vehicle detection on the Raspberry Pi, the experimental comparison results for each model are presented in Fig. 12. The comparison of Fig. 12(f) and Fig. 12(g) shows that the proposed method identifies fewer errors and makes bbox positioning more accurate. This demonstrates that our method fully learns the location features and semantic features of small objects under video surveillance. In Fig. 12(a)-(f), the other detectors exhibit missing detections, while Fig. 12(g) shows that the number of missing detections is smaller for the improved Yolo-Fastest V2. This demonstrates that our method can choose better positive training samples in order to transmit an effective gradient to the model during training and thus fully learn the effective features of the object. The recognition results of the improved Yolo-Fastest V2 in other scenes are shown in Fig. 13.


TABLE III
Validation of the proposed method on Yolo-FastestV2 (monitoring dataset)
Method | Size | FLOPs | mAP50
Yolo-Fastest V2 | 352*352 | 0.11G | 42.90
Yolo-Fastest V2+LE-FPN | 352*352 | 0.11G | 43.91 (+1.01)
Yolo-Fastest V2+GKC | 352*352 | 0.11G | 44.94 (+2.04)
Yolo-Fastest V2+ALA | 352*352 | 0.11G | 44.43 (+1.53)
Yolo-Fastest V2+LE-FPN+GKC | 352*352 | 0.11G | 47.26 (+4.36)
*Yolo-Fastest V2 | 352*352 | 0.11G | 49.92 (+7.02)
GKC: Grouping K-means Clustering; LE-FPN: Location-enhanced FPN; ALA: Adaptive Label Assignment; *Yolo-Fastest V2: the improved Yolo-Fastest V2

TABLE IV
Validation of the proposed method on Yolo-Fastest V2 (UA-DETRAC)
Method | Size | All | car | bus | van | others
Yolo-Fastest V2 | 352*352 | 40.00 | 60.20 | 67.50 | 26.20 | 6.15
*Yolo-Fastest V2 | 352*352 | 41.97 (+1.97) | 60.60 (+0.40) | 67.60 (+0.10) | 27.30 (+1.10) | 12.40 (+6.25)

TABLE V
Validation of the proposed method on yolov5-n (Pascal VOC dataset)
Method | Size | FLOPs | mAP50
yolov5-n | 352*352 | 0.63G | 74.00
yolov5-n+LE-FPN | 352*352 | 0.63G | 74.10 (+0.10)
yolov5-n+GKC | 352*352 | 0.63G | 74.50 (+0.50)
yolov5-n+ALA | 352*352 | 0.63G | 74.90 (+0.90)
yolov5-n+LE-FPN+GKC | 352*352 | 0.63G | 74.80 (+0.80)
*yolov5-n | 352*352 | 0.63G | 75.10 (+1.10)
LE-FPN: Location-enhanced FPN; *yolov5-n: the improved yolov5-n

TABLE VI
Model Comparison
Model | Size | FLOPs | Parameters | Speed (NCNN) | mAP50
YOLOV3-Tiny | 352*352 | 1.97G | 8.68M | 324.08ms / 3.89FPS | 40.50
YOLOX-Nano | 352*352 | 2.08G | 2.24M | 343.92ms / 2.91FPS | 40.37
YOLOV5-n | 352*352 | 4.20G | 1.27M | 337.77ms / 2.96FPS | 67.30
NanoDet-m | 352*352 | 0.41G | 937.23K | 303.40ms / 3.30FPS | 46.83
Yolo-Fastest | 352*352 | 0.16G | 291.42K | 158.22ms / 6.32FPS | 38.50
Yolo-Fastestv2 | 352*352 | 0.11G | 237.55K | 100.41ms / 9.96FPS | 42.90
*Yolo-Fastestv2 | 352*352 | 0.11G | 342.95K | 88.95ms / 11.24FPS | 49.92


Fig. 12. Comparison of the NCNN inference computing results on Raspberry Pi 4B: (a) YOLOV3-tiny; (b) Yolo-Fastest; (c) YOLOV5-n; (d) YOLOX-Nano; (e) NanoDet-m; (f) Yolo-FastestV2; (g) Improved Yolo-Fastest V2


Fig. 13. Detections for different scenes on Raspberry Pi 4B

IV. CONCLUSION

To overcome the low recognition rate and inaccurate bbox localization when identifying small-scale engineering vehicles in surveillance videos using lightweight object detection models on embedded devices, this paper proposes a method that combines location enhancement and adaptive label assignment. This method improves model recognition performance without decreasing model recognition speed. It is crucial for the monitor to identify engineering vehicles involved in illegal land transfer, as this can reduce government departments' human and financial costs. The methods include the location-enhanced FPN structure, the grouping k-means clustering strategy, and adaptive label assignment. The location-enhanced FPN structure utilizes concatenation and addition feature fusion to fuse the shallow- and deep-layer feature maps together. This structure enables the deep feature maps to contain rich location information and improves the model's capacity to locate the bboxes of small objects. Second, the grouping k-means clustering strategy is capable of obtaining multiscale anchors for small objects. Finally, the proposed adaptive label assignment algorithm selects an appropriate anchor for each object to be identified and assigns the anchor to a label.

Tested on the monitoring dataset of engineering vehicles, this method improves the recognition accuracy of the original model from 42.90% mAP50 to 49.92% mAP50. The number of model parameters is increased by 105.4K, but there is no significant increase in FLOPs, and the recognition speed on the Raspberry Pi can still reach 11.24FPS. To demonstrate that the proposed method has a certain degree of robustness, it is applied to yolov5-n, where it provides a 1.10% mAP50 increase. This also demonstrates that the proposed method is applicable to both lightweight and large-scale models.

The proposed method still has some shortcomings. For instance, if the application scenario changes, the model must perform k-means clustering again. Therefore, follow-up research will move in the anchor-free direction.

V. ACKNOWLEDGEMENTS

This research was supported by the project of major scientific and technological achievements transformation in Hebei Province of China: 22287401Z, NSFC: 41971352, Alibaba Cloud Computing Co. Ltd. funding: 17289315, and Guangxi JMRH Funding: 202203. The authors would like to thank Pascal VOC (http://host.robots.ox.ac.uk/pascal/VOC/) and UA-DETRAC (https://detrac-db.rit.albany.edu) for making the object recognition datasets publicly available. The authors would also like to thank the anonymous reviewers for their comments to improve this paper.

REFERENCES

[1] X. Cui, S. Li, X. Wang, and X. Xue, "Driving factors of urban land growth in Guangzhou and its implications for sustainable development," Frontiers of Earth Science, vol. 13, no. 3, pp. 464-477, Apr. 2019.
[2] D. Zhang, X. Liu, X. Wu, Y. Yao, X. Wu, and Y. Chen, "Multiple intra-urban land use simulations and driving factors analysis: a case study in Huicheng, China," GIScience & Remote Sensing, vol. 56, no. 2, pp. 282-308, 2018.
[3] H. Wu, A. Lin, X. Xing, D. Song, and Y. Li, "Identifying core driving factors of urban land use change from global land cover products and POI data using the random forest method," International Journal of Applied Earth Observation and Geoinformation, vol. 103, 2021.


[4] X. Zhang, J. Song, Y. Wang, W. Deng, and Y. Liu, "Effects of land use on slope runoff and soil loss in the Loess Plateau of China: A meta-analysis," Science of The Total Environment, vol. 755, pp. 142418, Feb. 2021.
[5] H. Zhang, Z. Wang, B. Yang, J. Chai, and C. Wei, "Spatial–temporal characteristics of illegal land use and its driving factors in China from 2004 to 2017," International Journal of Environmental Research and Public Health, vol. 18, no. 3, pp. 1336, Feb. 2021.
[6] X. Xiang, N. Lv, X. Guo, S. Wang, and A. El Saddik, "Engineering Vehicles Detection Based on Modified Faster R-CNN for Power Grid Surveillance," Sensors, vol. 18, no. 7, pp. 2258, Jul. 2018.
[7] Y. Yang, H. Song, S. Sun, W. Zhang, Y. Chen, L. Rakal, and Y. Fang, "A fast and effective video vehicle detection method leveraging feature fusion and proposal temporal link," Journal of Real-Time Image Processing, vol. 18, no. 4, pp. 1261-1274, May 2021.
[8] S. Sri Jamiya and P. Esther Rani, "An efficient method for moving vehicle detection in real-time video surveillance," Advances in Smart System Technologies, Springer, Singapore, pp. 577-585, 2021.
[9] A. Fedorov, K. Nikolskaia, S. Ivanov, V. Shepelev, and A. Minbaleev, "Traffic flow estimation with data from a video surveillance camera," Journal of Big Data, vol. 6, no. 1, pp. 1-15, Aug. 2019.
[10] A. B. Mabrouk and E. Zagrouba, "Abnormal behavior recognition for intelligent video surveillance systems: A review," Expert Systems with Applications, vol. 91, pp. 480-491, Jan. 2018.
[11] J. Lim, M. I. Al Jobayer, V. M. Baskaran, J. M. Lim, J. See, and K. Wong, "Deep multi-level feature pyramids: Application for non-canonical firearm detection in video surveillance," Engineering Applications of Artificial Intelligence, vol. 97, 104094, Jan. 2021.
[12] F. Pérez-Hernández, S. Tabik, A. Lamas, R. Olmos, H. Fujita, and F. Herrera, "Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance," Knowledge-Based Systems, vol. 194, 105590, Apr. 2020.
[13] J. Sun, J. Shao, and C. He, "Abnormal event detection for video surveillance using deep one-class learning," Multimedia Tools and Applications, vol. 78, no. 3, pp. 3633-3647, Sep. 2019.
[14] J. T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, and R. S. M. Goh, "AnomalyNet: An anomaly detection network for video surveillance," IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2537-2550, Feb. 2019.
[15] J. T. Zhou, L. Zhang, Z. Fang, J. Du, X. Peng, and Y. Xiao, "Attention-driven loss for anomaly detection in video surveillance," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4639-4647, Dec. 2019.
[16] R. Nawaratne, D. Alahakoon, D. De Silva, and X. Yu, "Spatiotemporal anomaly detection using deep learning for real-time video surveillance," IEEE Transactions on Industrial Informatics, vol. 16, no. 1, pp. 393-402, Aug. 2019.
[17] K. Muhammad, J. Ahmad, Z. Lv, P. Bellavista, P. Yang, and S. W. Baik, "Efficient deep CNN-based fire detection and localization in video surveillance applications," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 7, pp. 1419-1434, Jun. 2018.
[18] H. Wei, M. Laszewski, and N. Kehtarnavaz, "Deep learning-based person detection and classification for far field video surveillance," in 2018 IEEE 13th Dallas Circuits and Systems Conference (DCAS), pp. 1-4, Nov. 2018.
[19] I. M. Nasir, M. Raza, J. H. Shah, S. H. Wang, U. Tariq, and M. A. Khan, "HAREDNet: A deep learning based architecture for autonomous video surveillance by recognizing human actions," Computers and Electrical Engineering, vol. 99, 107805, Apr. 2022.
[20] M. A. Khan, K. Javed, S. A. Khan, T. Saba, U. Habib, J. A. Khan, and A. A. Abbasi, "Human action recognition using fusion of multiview and deep features: an application to video surveillance," Multimedia Tools and Applications, pp. 1-27, Mar. 2020.
[21] G. Sreenu and S. Durai, "Intelligent video surveillance: a review through deep learning techniques for crowd analysis," Journal of Big Data, vol. 6, no. 1, pp. 1-27, Jun. 2019.
[22] M. Shorfuzzaman, M. S. Hossain, and M. F. Alhamid, "Towards the sustainable development of smart cities through mass video surveillance: A response to the COVID-19 pandemic," Sustainable Cities and Society, vol. 64, 102582, Jan. 2021.
[23] H. H. Tseng, M. D. Yang, R. Saminathan, Y. C. Hsu, C. Y. Yang, and D. H. Wu, "Rice Seedling Detection in UAV Images Using Transfer Learning and Machine Learning," Remote Sensing, vol. 14, no. 12, pp. 2837, Jun. 2022.
[24] J. Lee, J. Wang, D. Crandall, S. Šabanović, and G. Fox, "Real-time, cloud-based object detection for unmanned aerial vehicles," in 2017 First IEEE International Conference on Robotic Computing (IRC), pp. 36-43, Apr. 2017.
[25] A. Anjum, T. Abdullah, M. F. Tariq, Y. Baltaci, and N. Antonopoulos, "Video stream analysis in clouds: An object detection and classification framework for high performance video analytics," IEEE Transactions on Cloud Computing, vol. 7, no. 4, pp. 1152-1167, Jan. 2016.
[26] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637-646, Jun. 2016.
[27] Y. Siriwardhana, P. Porambage, M. Liyanage, and M. Ylianttila, "A survey on mobile augmented reality with 5G mobile edge computing: architectures, applications, and technical aspects," IEEE Communications Surveys & Tutorials, vol. 23, no. 2, pp. 1160-1192, Feb. 2021.
[28] N. Balamuralidhar, S. Tilon, and F. Nex, "MultEYE: Monitoring system for real-time vehicle detection, tracking and speed estimation from UAV imagery on edge-computing platforms," Remote Sensing, vol. 13, no. 4, 573, Feb. 2021.
[29] P. Gupta, B. Pareek, G. Singal, and D. V. Rao, "Edge device based military vehicle detection and classification from UAV," Multimedia Tools and Applications, vol. 81, no. 14, pp. 19813-19834, Jul. 2022.
[30] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969, 2017.
[31] ultralytics/yolov5: YOLOv5 in PyTorch > ONNX > CoreML > TFLite (github.com).
[32] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," arXiv preprint arXiv:2107.08430, 2021.
[33] Y. Zhao, Y. Yin, and G. Gui, "Lightweight deep learning based intelligent edge surveillance techniques," IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 4, pp. 1146-1154, Jun. 2020.
[34] L. Zhou, S. Wei, Z. Cui, J. Fang, X. Yang, and W. Ding, "Lira-YOLO: a lightweight model for ship detection in radar images," Journal of Systems Engineering and Electronics, vol. 31, no. 5, pp. 950-956, Oct. 2020.
[35] A. Ullah, K. Muhammad, W. Ding, V. Palade, I. U. Haq, and S. W. Baik, "Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications," Applied Soft Computing, vol. 103, 107102, May 2021.
[36] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B. Y. Choi, and T. Faughnan, "Smart surveillance as an edge network service: From Harr-Cascade, SVM to a lightweight CNN," in 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC), pp. 256-265, Oct. 2018.
[37] L. Steinmann, L. Sommer, A. Schumann, and J. Beyerer, "Fast and lightweight person detector for unmanned aerial vehicles," EUSIPCO, 2019.
[38] H. Zhao, Y. Zhou, L. Zhang, Y. Peng, X. Hu, H. Peng, and X. Cai, "Mixed YOLOv3-LITE: a lightweight real-time object detection method," Sensors, vol. 20, no. 7, 1861, Mar. 2020.
[39] RangiLyu/nanodet: NanoDet-Plus, a super fast and lightweight anchor-free object detection model; only 980 KB (int8) / 1.8 MB (fp16), running at 97FPS on a cellphone (github.com).
[40] A. N. Amudhan and A. P. Sudheer, "Lightweight and computationally faster Hypermetropic Convolutional Neural Network for small size object detection," Image and Vision Computing, vol. 119, 104396, Mar. 2022.
[41] J. Liu, C. Hu, J. Zhou, and W. Ding, "Object Detection Algorithm Based on Lightweight YOLOv4 for UAV," in 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), pp. 425-429, Apr. 2022.
[42] dog-qiuqiu/Yolo-FastestV2: Based on Yolo's low-power, ultra-lightweight universal target detection algorithm; the parameter size is only 250k, and the speed on a smartphone mobile terminal can reach ~300fps+ (github.com).
[43] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759-9768, 2020.

Hairong Zhang received the Ph.D. degree in Geodesy and Survey Engineering from China University of Mining and Technology, Xuzhou, China, in 2002. He is currently an Associate Professor with the School of Environment and Spatial Informatics, China University of Mining and Technology, Xuzhou, China. He is also a guest professor in the Key Laboratory of Natural Resources Monitoring in Tropical and Subtropical Area of South China, Ministry of Natural Resources. His research interests include GIS, deep learning, and geographical modeling.

Dongsheng Xu received the B.S. degree in human geography and urban & rural planning from the School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo, China, in 2020. He is currently working toward the M.S. degree in cartography and geographical information engineering with China University of Mining and Technology, Xuzhou, China. His research interests include deep learning and model compression.

Dayu Cheng received the Ph.D. degree in cartography and geographic information engineering from China University of Mining and Technology, Xuzhou, China, in 2012. He is currently an Associate Professor with the School of Mining and Geomatics Engineering, Hebei University of Engineering, Handan, China. His research interests include geographic big data mining, computing vision, and GIS development and applications.

Xiaoliang Meng received the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2009. He is currently a 'Luo Jia Distinguished Professor' at Wuhan University. He is also a guest professor in the Key Laboratory of Natural Resources Monitoring in Tropical and Subtropical Area of South China, Ministry of Natural Resources. His research interests include intelligent sensor networks and geospatial information services in the field of ecology and environment. He won the 'Best Young Authors Award' of the International Society for Photogrammetry and Remote Sensing (ISPRS) in 2012.

Geng Xu received the B.S. degree in geophysics from the School of Resources and Geosciences, China University of Mining and Technology, Xuzhou, China, in 2020, where he is currently working toward the M.S. degree. His research interests include intelligent recognition of mine water inrush with multi-field coupling based on deep learning.

Wei Liu received the M.S. and Ph.D. degrees in cartography and geographic information engineering from China University of Mining and Technology, Xuzhou, China, in 2007 and 2010, respectively. He is currently an Associate Professor with the School of Geography, Geomatics and Planning, Jiangsu Normal University, Xuzhou, China. His research interests include spatial data quality checking, high-resolution remote sensing image processing, and GIS development and applications.

Teng Wang received the Master degree in Telecommunications Engineering and Telecommunication Networks from the University of Technology Sydney in 2014. He is currently a Senior Engineer with the Surveying and Mapping Institute, Lands and Resource Department of Guangdong Province, Guangzhou. His research interests include GIS and computing vision.
