Article
Toward Fully Automated Inspection of Critical Assets Supported
by Autonomous Mobile Robots, Vision Sensors, and
Artificial Intelligence
Javier Sanchez-Cubillo 1 , Javier Del Ser 2,3,∗ and José Luis Martin 2,3
Abstract: Robotic inspection is advancing in performance capabilities and is now being considered for industrial applications beyond laboratory experiments. As industries increasingly rely on complex machinery, pipelines, and structures, the need for precise and reliable inspection methods becomes paramount to ensure operational integrity and mitigate risks. AI-assisted autonomous mobile robots offer the potential to automate inspection processes, reduce human error, and provide real-time insights into asset conditions. A primary concern is the necessity to validate the performance of these systems under real-world conditions. While laboratory tests and simulations can provide valuable insights, the true efficacy of AI algorithms and robotic platforms can only be determined through rigorous field testing and validation. This paper aligns with this need by evaluating the performance of one-stage models for object detection in tasks that support and enhance the perception capabilities of autonomous mobile robots. The evaluation addresses both the execution of assigned tasks and the robot's own navigation. Our benchmark of classification models for robotic inspection considers three real-world transportation and logistics use cases, as well as several generations of the well-known YOLO architecture. The performance results from field tests using real robotic devices equipped with such object detection capabilities are promising, and expose the enormous potential and actionability of autonomous robotic systems for fully automated inspection and maintenance in open-world settings.
Keywords: autonomous mobile robot; autonomous guided vehicles; artificial intelligence; object detection
Our experimental results consider the three novel robotic AMR- and AGV-based
applications mentioned above, which perform inspection and loading tasks in complex
environments. Using real data collected from these use cases by the robotic platforms, we
conduct a performance benchmark of one-stage ML models for object detection, employing
established evaluation protocols and performance scores. To perform the assigned tasks,
the three robots must possess extraordinary detection and discrimination capabilities
to perceive the environment and manipulate objects, which cannot be achieved with
conventional computer vision and require the inclusion of AI. Based on the results, we
conclude that the synergy between AMR/AGV and object detection is promising, paving
the way for new use cases that leverage the autonomy and intelligence embodied by this
technological crossroads.
Contribution: The contribution to the state of the art made by our research herein lies
in providing an in-depth analysis of the real benefits of artificial intelligence on industrial
robotic platforms. Inspection robots work in complex, unpredictable, and very unstructured
environments, so for the sake of meeting their objectives, our research aims to provide
metrics on the performance achieved with AI inclusion that cannot be achieved otherwise.
We also develop and provide the frameworks that enable the inclusion of these models
effectively in the proposed rail, port, and logistics sectors. These sectors are known for being
very conservative regarding the adoption of new and emerging technologies. However, the
results of our field tests in real environments for each use case demonstrate that adopting
AI for robotic inspection is an actionable and practical approach for fully automating this
task, and it outlines a promising future for applications leveraging AI-assisted robotics.
The rest of the manuscript is structured as follows: we first briefly review the state
of the art in visual perception in AGV/AMR (Section 2), followed by a description of the
robotic design of these platforms within the three use cases under consideration (Section 3).
We then describe the experimental setup in Section 4, followed by a discussion on the
obtained results and the insights drawn from the field tests conducted for the three use
cases (Section 5). Finally, Section 6 concludes the paper with a summary of the main
conclusions drawn from this research, together with a glimpse at future research.
2. Related Work
Before proceeding with the description of the AMR and AGV robotic platforms used
in our research, we first provide some background on these robotic technologies themselves
(Sections 2.1 and 2.2), followed by visual perception in AMR platforms (Section 2.3).
2.2. Autonomous Mobile Robots (AMRs) and Automated Guided Vehicles (AGVs)
Advances in AGV and AMR technology and their industrial applications have recently
experienced an unprecedented boom, driven by the emergence of demand for this type of
system in new scenarios previously limited to human labor. These mobile robots have been
integrated into warehousing and logistics activities, using track-guided magnetic systems,
optical sensors, and color bars as guidance technologies [6]. Zhang et al. [7] thoroughly
reviewed the use of AGVs and AMRs for recognition and tracking in civil engineering,
along with an analysis of the challenges and prospects in detection, construction, and
disease repair. More recently, Patruno et al. [8] proposed an architectural taxonomy of
several AMRs and AGVs. Loganathan et al. [9] provided an exhaustive and
systematic analysis of the strengths and shortcomings of navigation and path planning
techniques for AMRs.
Depending on the environment where the robot will operate, a different motion drive
system must be selected. Robots with drive systems that allow them to move in any
direction, independent of their orientation, are called omnidirectional or holonomic robots.
Omnidirectional robots are considerably easier to control than robots based on simple
wheels because their motion is independent of their pose [3]; they use special types of
wheels such as omnidirectional, mecanum, or fully 360° turnable wheels. However, they
are usually intended for operation on flat floors. The demands of rough terrain, where our
robots will operate in two of the three use cases, preclude the use of these types of wheels
due to their motion characteristics and fragility, which would cause the robots to become
stuck. Siegwart et al. [10] provided an excellent review of these techniques for wheeled
robots in their book Introduction to Autonomous Mobile Robots.
Mobile robots equipped with differential drives can also turn on their own axis, as is the case with the AMR designed for the rail use case (detailed later). Mobile robots with Ackermann steering [11] or simple wheels require more space for turns, as the turning maneuver always involves motion along their x-axis. Selecting the right motion strategy is crucial, as
it defines the navigation and path planning constraints that the control of the AMR will
face [12].
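To make the distinction concrete, the following minimal Python sketch (illustrative only; the function name, variable names, and Euler integration scheme are our own) integrates the standard differential-drive kinematic model, where equal and opposite wheel velocities produce an in-place rotation that an Ackermann-steered vehicle cannot perform.

```python
import math

def differential_drive_step(x, y, theta, v_left, v_right, wheelbase, dt):
    """One Euler step of the differential-drive kinematic model.
    Setting v_left = -v_right rotates the robot on its own axis (zero turning radius)."""
    v = (v_right + v_left) / 2.0            # linear velocity of the chassis center
    omega = (v_right - v_left) / wheelbase  # angular velocity around the z-axis
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta

# In-place rotation: equal and opposite wheel speeds leave (x, y) unchanged.
pose = (0.0, 0.0, 0.0)
for _ in range(100):
    pose = differential_drive_step(*pose, v_left=-0.2, v_right=0.2, wheelbase=0.5, dt=0.05)
print(pose)  # x and y stay at 0.0; only the heading changes
```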
In order to obtain a robust and reliable navigation system for mobile robots, it is
essential to use the ROS/ROS2 [13] framework. All three robots presented here operate on
the ROS2 framework running on a high-level processing computer based on the NVIDIA
Jetson Orin architecture. They interact with other low-level industrial programmable
logic controllers (PLCs), responsible for controlling all the input–output interfaces to
the mechatronics and information buses, e.g., the controller area network (CAN), inter-
integrated circuit (I2C), and serial peripheral interface (SPI) of the embedded electronics.
ROS, the open-source Robot Operating System, supports most modern robotics research and many commercial products. Its success stems from its focused design and enhanced
capabilities tailored for product-grade robotic systems. ROS not only supports the gener-
ation of trajectories (global and local path planners) but also facilitates the integration of
image processing libraries for environmental visual perception. This combination provides
a robust programming framework for determining the optimal actions of the robot at
all times.
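As an illustration of how such a perception pipeline plugs into ROS 2, the following minimal rclpy sketch subscribes to a camera topic and could forward each frame to a detector; the node and topic names are assumptions, not those used on the actual robots.

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image


class PerceptionNode(Node):
    """Minimal ROS 2 node that receives camera frames for downstream AI processing."""

    def __init__(self):
        super().__init__('perception_node')
        # Topic name is a placeholder; the actual camera driver may publish elsewhere.
        self.create_subscription(Image, '/camera/image_raw', self.on_image, 10)

    def on_image(self, msg: Image):
        # A real implementation would convert the message (e.g., via cv_bridge)
        # and run the object detector here.
        self.get_logger().info(f'Received frame {msg.width}x{msg.height}')


def main():
    rclpy.init()
    node = PerceptionNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```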
Modern robotic platforms also incorporate more robust sets of sensors with enhanced accuracy for environmental perception. These include radar, 2D and 3D LiDAR, high-resolution stereoscopic and time-of-flight cameras, and inertial measurement units (IMUs). New processing algorithms and libraries have also entered the scene. Since the late 1990s, OpenCV (Open Source Computer Vision Library) has become a key player in computer vision. It was initially created as part of an Intel Research initiative to advance CPU-intensive applications but has grown into a powerful open-source computer vision tool, provided and maintained by researchers and programmers for the community. The library is continuously evolving and today offers more than 2500 optimized computer vision algorithms [18].
The use of software libraries like OpenCV [18], scikit-image [19], PyTorch [20], and TensorFlow [21], together with proprietary software packages from Google, Amazon, and Microsoft (e.g., Azure), has enabled and simplified the processing of data from new sensors. These libraries are not only used for processing data from pictures, videos, and vision cameras; they also provide processing algorithms for the new sensors mentioned above, delivering tools such as image filtering, camera calibration, structure-from-stereo/structure-from-motion algorithms, visual odometry, feature detectors for cameras (Hough, Harris, FAST, SURF, and SIFT), and processing of laser point clouds. Some of
the algorithms used within this research are worth mentioning due to their importance in
object detection and classification tasks, namely the SIFT (scale-invariant feature transform)
and SURF (speeded-up robust features) algorithms. SIFT detects distinctive key points or
features in an image, keeping it resilient to variations in object size, orientation, rotation,
or affine transformations. SURF is another algorithm for key-point detection and image
feature description that offers increased computational speed, which is useful for real-time
applications. Both are computer vision algorithms included in the OpenCV package for
detecting and describing key features in images. They are of key importance since they
lay the foundation for the detection and extraction of intrinsic features in images, which
can subsequently be put on top of the layers for more complex AI-based detection and
classification stages. Based on these well-known computer-vision algorithms, the key
elements of this research focus on the added value of combining conventional algorithms
with new AI-based ones. Sharma et al. [22] provide an extensive comparison of diverse
feature detectors and descriptors. The most recent versions of OpenCV include packages
for Markov and Kalman filter localization, simultaneous localization and mapping (SLAM)
in its extended Kalman filter, graph-based SLAM, or particle-filter versions, and the latest
monocular visual SLAM. Moreover, the OpenCV package supports the use of graph-search
algorithms for path planning, such as breadth-first, depth-first, Dijkstra, A*, D*, and rapidly
exploring random trees, which are useful for navigation purposes [10].
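As a brief illustration of these classical feature detectors, the sketch below extracts SIFT key points with OpenCV (SIFT ships with the main package in recent OpenCV releases; SURF requires the non-free contrib build, so it is omitted here). The input file name is a placeholder.

```python
import cv2

# Load a grayscale frame; the file name is illustrative.
img = cv2.imread('frame.png', cv2.IMREAD_GRAYSCALE)

# Detect scale- and rotation-invariant key points and compute their descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(f'{len(keypoints)} key points detected')

# Visualize the detections with their scale and orientation.
out = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite('frame_keypoints.png', out)
```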
In general, AGV and AMR recognition and tracking technology involve self-positioning,
environmental perception, map construction, and path planning among the required abili-
ties of the robots [7]. Apart from the ability to capture and process intrinsic characteristics of
environmental images based on chromatic and morphological features as can be obtained
with the described algorithms, the robots in our study require complex discrimination,
detection, and classification of objects. Similar to how Blubaugh et al. [23] analyze the need
for a mobile robot or rover to extract information from images, and recognize objects by
their patterns and features to navigate the environment while avoiding obstacles, our robots
require precise information about the environment and the scenarios in which they operate.
One strict requirement is that they avoid collisions, as the objects in the infrastructure are
customer-owned and sometimes operationally critical, and must be kept intact at all times.
Today, a new paradigm is opening up in the field of perception for mobile robotics
through the use of artificial intelligence and deep learning techniques for visual detection.
In their review, Cebollada et al. [24] showed how a variety of computer vision, AI, and
deep learning tools are currently employed in mobile robotics. The use of deep learning
techniques for inspection activities in industrial environments stems from the need for a
computer vision technique that provides better and more accurate information about the
target object to be inspected. In warehouses or manufacturing plants, the target objects to be
inspected usually come in fixed positions, e.g., in roller belts with minimal degrees of free-
dom for variable positions, with controlled illumination sources, and in delimited spaces.
The environment where objects are inspected is highly structured and well-equipped with
sensors throughout the infrastructure.
The rise of inspection activities outside of these environments has led to the adoption of new, more sophisticated vision and detection techniques that enable the understanding of the scenarios to be inspected in unstructured environments. In these unstructured scenarios, one cannot predict many of the objects that will appear in the scene during the inspection. This has led to the introduction of object detection using convolutional neural networks. AI-based object detection is useful for understanding which objects appear in an image, describing what is in it (classifying the objects individually and discriminating one from another for a better understanding of the scene), and determining where those objects are located in the image.
These robotic platforms and the use cases themselves are described in Sections 3.1–3.3.
Use Case A addresses growing vegetation and weeds between the railway rails, which need to be detected, located, and eliminated.
Figure 2 illustrates the robot’s concept design, co-existing with rail traffic. The system
is designed to accurately discriminate between weeds and, for example, cables, screws, or
rail sleepers. Blue laser equipment is used for the elimination of the weeds once they have
been precisely detected and located. In addition, the robot must operate without interfering
with or stopping rail traffic. Therefore, it is flat enough to function under trains without
colliding with the mechanical elements underneath. It also needs to withstand the suction
forces from trains passing overhead. For this purpose, the robot was developed with lateral
restraint arms that enable it to attach itself to the tracks when a train is passing.
Figure 2. General concept design of the AMR designed to address Use Case A, including its flat
design, to ensure co-existence with rail traffic; a general view of the robot’s restraint arms, enabling its
attachment to the tracks when a train passes overhead; and weed detection and potential elimination
from the robot. Source: ZeniaLabs AI.
Figure 3. Automated inspection concepts for automatically detecting damages in TEU containers
in maritime ports: (a) inspection gate; (b) forklift-based robotic solution; (c) sensorized inspection
drone; and (d) proposed AMR approach. Source: ZeniaLabs AI.
• Scenario (a) in this figure is based on an inspection gate, which is the simplest and most obvious way of inspecting defects without human intervention. It is a useful tool, yet unfeasible for supporting inspection procedures outside the specific inspection
sites and areas where the gates are located. Usually, for on-site inspections, inspection
personnel move to the port terminal, near the ships, to enhance agility and reduce
time, thereby shortening the duration that containers remain in the terminal.
• Scenario (b) is instead based on a forklift-based robotic solution that allows the robot
to move around the container while analyzing potential damages and the general
structural health of the container being monitored. The main advantage of this solution
is that the robot can move to the location of the container, e.g., near the ships.
• Scenario (c) is a sensorized flying drone that flies above the top of the containers,
equipped with the necessary vision sensors and hardware to capture images of the
monitored asset.
• Scenario (d) is the one this research focuses on; it addresses the fact that containers are sometimes placed next to each other, with little or no space between their sides. The inspection should be performed in a very agile and safe manner. Using
small inspection robots deployed by the robotized forklift system, several containers
can be inspected concurrently, while the robotized forklift inspects the sides of the
containers with its onboard cameras.
In all four scenarios, the methodology involves acquiring pictures and videos to be
processed by high-speed electronics and generating a report in real time on the structural
health of the inspected containers. The report needs to be available as quickly as possible
on, for example, a handheld device. The main contribution of this research to this use case
is focused on enhancing the robot’s perception with AI techniques for fully automated
inspection of the containers.
Only scenario (d) has been developed, deployed, and tested for our research, justified
by the fact that the containers to be inspected were located in an area where it was not
feasible to install inspection arches or to operate drones. The drone scenario was discarded
due to environmental restrictions at the test location (port of Bilbao, Bizkaia, Spain), which
prohibited flying a drone without the appropriate permits from the airport authorities.
The airport is too close to the test environment and poses a risk of interfering with air
traffic. Despite these limitations, scenarios (b) and (d) are quite similar, with (d) being the
evolution of the former and providing better insights on top of the container through the
concurrent deployment of several small inspection robots. Moreover, we envision that
scenarios (a) and (c) could be relevant for future research.
Figure 4. Examples of the detections from the LiDAR system used to address Use Case C: (a) LiDAR
scans; (b) detection by the object detection model.
4. Experimental Setup
Figure 5 presents an overview of the experimental setup for the rail and port ap-
plications. In Figure 5(a1), the rail robot in the operation of weed removal is presented.
Figure 5(a2) presents a bottom view of the robot setup. In Figure 5(a3), the rail testing
environment for AI benchmarking is displayed. In Figure 5(b1), the laboratory setup of the
side cameras and small robot for inspection of the top of TEUs is depicted. Figure 5(b2)
illustrates the field test setup in the port environment. Finally, Figure 5(b3) shows a TEU as
the inspection environment for the AI benchmark presented in this study.
Figure 5. (a1,a2) Rail robot. (a3) Rail testing environment. (b1) Laboratory setup of the top inspection
robot and side cameras. (b2) Field test setup in the port. (b3) TEU container for inspection activities.
In order to analyze the benefits of using YOLO-based detection and classification algo-
rithms for improving the robot’s performance, the necessary image acquisition hardware
was mounted on prototype robots for the execution of experiments in laboratory and field
test scenarios.
Figure 6 illustrates the main location of the image and LiDAR acquisition systems of
the three proposed robotic solutions. The rail inspection robot holds two MIPI cameras
(Figure 6a), one pointing at the front for obstacle detection and navigation, and another one
pointing to the ballasted floor to detect and locate the weeds to be eliminated. The robot
solution for TEU inspection is equipped with two stereo cameras: one on the robotized
forklift to inspect the container sides and another on the small robotic rover deployed by
the forklift on top of the container. The forklift moves around the container, and the rover
navigates on the roof of the container, both acquiring images for AI processing (Figure 6b).
Figure 6c presents the experimental prototype setup for the robotized forklift for loading
goods into trucks. The location of a 3D LiDAR from RoboSense is highlighted, featuring
16 laser layers that generate a point cloud of the warehouse and inside the truck. The image
processing first creates a zenithal bird-view of the scene to determine the pose of the truck
in the docking bay. The AI processing system then decides whether the position is suitable
for the maneuver of autonomously entering the trailer of the truck for unloading. The
system calculates the angle and displacement of the trailer relative to the AGV coordinates
to establish the correct entry and navigation trajectory and to plan the appropriate path.
If at any point the pose is detected to be incorrect, the AGV initiates an emergency stop.
Additionally, the AGV is equipped with a camera on the bottom to inspect the site for
potential obstacles (such as fallen material or goods) in its path.
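A minimal sketch of how such a zenithal bird's-eye view can be obtained from a LiDAR point cloud is given below (the grid resolution, ranges, and height encoding are assumptions chosen for illustration; the actual preprocessing pipeline running on the AGV is not reproduced here).

```python
import numpy as np

def bird_eye_view(points, x_range=(0.0, 20.0), y_range=(-10.0, 10.0), res=0.05):
    """Project an (N, 3) LiDAR point cloud onto a zenithal grayscale image.

    Pixel intensity encodes the maximum height observed in each grid cell,
    which is enough to make the trailer and docking-bay borders stand out.
    """
    xs, ys, zs = points[:, 0], points[:, 1], points[:, 2]
    mask = (xs >= x_range[0]) & (xs < x_range[1]) & \
           (ys >= y_range[0]) & (ys < y_range[1])
    xs, ys, zs = xs[mask], ys[mask], zs[mask]

    h = int((y_range[1] - y_range[0]) / res)
    w = int((x_range[1] - x_range[0]) / res)
    img = np.zeros((h, w), dtype=np.float32)
    if xs.size == 0:
        return img.astype(np.uint8)

    cols = ((xs - x_range[0]) / res).astype(np.int32)
    rows = ((ys - y_range[0]) / res).astype(np.int32)
    np.maximum.at(img, (rows, cols), zs - zs.min())  # keep the tallest return per cell
    return (255.0 * img / max(float(img.max()), 1e-6)).astype(np.uint8)
```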
Figure 6. Overview of the sensors in the experimental setup of the three use cases, highlighting the
cameras and LiDAR (laboratory environment) sensors for AI processing.
The robot's onboard AI processing capability is a critical factor for our research. We initiate a YOLO performance benchmark on object detection for the proposed use case with this module.
A crucial point in our detection requirements is that real-time detection on the robot is conditioned by its AI processing capabilities. Without real-time processing power, the only
chance to detect and classify any target objects is to record video frames for a subsequent
offline analysis. However, this possibility is not considered because it clashes with the
operational needs of the robot in the field tests and would only be useful in laboratory tests.
With the development of the YOLOv4 detector, Bochkovskiy et al. [34] aimed to
improve the accuracy of real-time object detectors, enabling their use not only for hint-
generating recommendation systems but also for stand-alone process management and
reducing human input. Previous neural networks did not generally operate in real time
and required a large number of GPUs for training, with a large mini-batch size. The
architecture of YOLOv4 uses a CSPDarknet53 [35] backbone comprising 53 convolutional
layers. Within the neck part of the model, it uses PANet (Path Aggregation Network) [36]
for feature aggregation of the network. Additionally, YOLOv4 adds a spatial pyramid
pooling (SPP) block after CSPDarknet53 to increase the receptive field and separate the most
significant features from the backbone. YOLOv4 employs a Bag of Freebies, mostly focused on data augmentation and so termed because these techniques improve the performance of the network without adding any inference-time penalty in production. For data augmentation, the
strategy uniquely incorporates self-adversarial training (SAT), aiming to find the portion
of the image that the network relies on during training. Within the so-termed Bag of
Specials, the authors used a type of non-maximum suppression (NMS) where Distance-IoU
is used instead of regular IoU [37], cross mini-batch normalization (CmBN), and DropBlock
regularization to significantly increase performance without adding a noticeable inference
time penalty.
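For reference, the sketch below computes the plain IoU and the Distance-IoU used by YOLOv4's NMS variant; boxes are given in (x1, y1, x2, y2) corner format and the helper names are our own.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def diou(a, b):
    """Distance-IoU: IoU minus the squared center distance normalized by the
    squared diagonal of the smallest enclosing box."""
    cax, cay = (a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0
    cbx, cby = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
    rho2 = (cax - cbx) ** 2 + (cay - cby) ** 2
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return iou(a, b) - rho2 / c2
```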
The next AI modules within our benchmark are YOLOv5 and YOLOv6. YOLOv5 was
released only a month after its predecessor, YOLOv4 [38]. A significant modification that
YOLOv5 included over YOLOv4 was the integration of an anchor box selection process
into the model. Later, the target detection framework YOLOv6 was designed in [33] for industrial applications, featuring a strong focus on detection accuracy and inference efficiency. The resulting model, coined Meituan-YOLOv6 (MT-YOLOv6), is not a part of
the official YOLO series. Nevertheless, it has been referred to as YOLOv6 as it was inspired
by one-stage YOLO algorithms. YOLOv6 improves over its predecessors by adopting
an anchor-free paradigm that significantly enhances speed compared to anchor-based
detectors. It also introduces a novel label assignment strategy to dynamically allocate
positive samples, further improving detection accuracy, and includes an SIoU (SCYLLA-
IoU) [39] bounding box regression loss to supervise the learning process of the network.
SIoU is a variant of IoU that incorporates new cost functions. In terms of architecture,
YOLOv6 utilizes an EfficientRep Backbone instead of the CSP-Backbone used by YOLOv5.
YOLOv6 also replaces the common convolution layer with the RepConv layer and CSPBlock
with RepBlock. These changes allow the object detection model to efficiently utilize the
computing power of GPU hardware while maintaining strong characterization ability.
Small enhancements in its architecture lead to considerable improvements in small target
detection [40]. Regarding its internal architecture, the neck design of the YOLOv6 model
features a Rep-PAN Neck to make hardware utilization more efficient and to achieve a
better balance between accuracy and speed. It also offers a more effective feature fusion
network based on a hardware-suited neural network design, along with a decoupled head
that considers the balance between the representation ability of the relevant operators and
the computing overhead on the hardware.
The next model in the performance benchmark across Use Cases A, B, and C is
YOLOv7 [41], which is notable for its efficient E-ELAN layer aggregation. This aggregation
is an extended version of the ELAN computational block designed to control the shortest
longest gradient path. It relates to the amount of memory required to keep layers in memory
and the distance it takes for a gradient to back-propagate through the layers. Furthermore,
the network depth and width are scaled up while concatenating layers together, optimizing
the model architecture for different sizes. YOLOv7 also utilizes gradient flow propagation
paths to determine which modules in the network should use re-parametrization strategies.
Different from the previous YOLO generations, YOLOv8 [42] uses mosaic data aug-
mentation that mixes four images to provide the model with better contextual information.
The change in YOLOv8 compared to previous versions is that the augmentation stops in the
last ten training epochs to enhance performance. Also, the model switches to anchor-free
detection to improve generalization, directly predicting an object’s midpoint and reducing
the number of bounding box predictions. This speeds up NMS to discard incorrect predic-
tions. To accelerate the training process and enhance gradient flow, the model’s backbone
includes a C2f module, where the model concatenates the output of all bottleneck modules,
unlike the C3 module in previous YOLO models, which utilized the output of the last
bottleneck module. A decoupled head performs classification and regression separately.
Finally, the benchmark is completed with YOLOv9, the latest in the series of YOLO
object detectors to date. In their recent presentation to the community [43], Wang et al.
proposed a new concept called programmable gradient information (PGI). PGI addresses the
information bottleneck problem without data loss in the feed-forward process. The model
generates reliable gradients via an auxiliary reversible branch. Deep features still execute
the target task and the auxiliary branch avoids the semantic loss due to multi-path features.
The reversible architecture of PGI is built on the auxiliary branch, so there is no additional
computational burden. YOLOv9 has shown strong competitiveness, reducing the number
of parameters and the number of calculations when compared to the state-of-the-art.
Table 1 depicts the number of layers and internal parameters for the different versions
of the YOLO models considered in our benchmark.
Table 1. Variants of each YOLO model considered in the benchmark, alongside their trainable
parameters, layers, and floating-point operations (FLOPS).
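Figures of this kind can be obtained for the more recent models with the Ultralytics API, which prints the layer count, trainable parameters, and GFLOPs of each variant; the weight files listed below are examples, and the pretrained checkpoints are downloaded on first use.

```python
from ultralytics import YOLO

for weights in ("yolov8n.pt", "yolov8s.pt", "yolov8m.pt"):
    model = YOLO(weights)
    model.info()  # prints layers, parameters, gradients, and GFLOPs
```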
Use Case A
The use case for rail maintenance and weed removal is trained on three classes for
detection: (1) the weeds that need to be detected, classified, and located for elimination
within the rail tracks; (2) the cables within the rail tracks, which are part of the rail
infrastructure and must not be damaged under any circumstances; and (3) other objects
to be found on the tracks, like bottles, plastic bags, or cardboard pieces. Such objects (and
others alike) can be essentially ignored for inspection purposes. However, it is useful to detect them and inform the railway operators about their presence and location.
This is especially true when dealing with paper and cardboard, which can be dangerous
when weeds are removed via laser if the beam hits them too long, creating the risk of
starting a fire.
The training dataset collected for this first use case is composed of 3736 images
containing objects of the three classes (9216 weed objects, 5742 cable objects, and 733 other
objects). The most prevalent objects are weeds, followed by cables and certain sporadic
appearances of foreign objects. The dataset has been preprocessed and augmented via
different transformations to increase its diversity and potentially yield a more generalizable
object detector. We also created a version of the dataset with only weed annotations to
evaluate whether the object detector, when focused solely on the most relevant label for the
inspection task, achieves better performance figures. The preprocessing transformations of
the images include auto-orienting and stretching each image to 640 × 640 pixels, a static
crop between 25% and 75% in both horizontal and vertical ranges, a gray-scaling operation
to make the models faster and insensitive to subject color, adaptive equalization to improve
normalization and line detection in varying lighting conditions, and splitting the images
into tiles to enhance accuracy on small objects, which is beneficial for cable detection due
to the small and confined space of the cables within the rail.
The data augmentation strategy in this first use case includes flipping the images
horizontally and vertically, as well as ±15 degree random rotations to make the model
more insensitive to object orientation. Added image modifications include 2.5-px random
Gaussian blurring to enhance resilience to camera focus issues, saturation adjustments
(±25%), over- and under-exposure (±10%), and brightness adjustments (increasing the
picture or darkening it by 15%). Moreover, noising the new images by 0.1% aims to make
the model more resilient to camera artifacts. These steps have resulted in a dataset totaling
35,896 images. It has been split into 88% (31,428 images) for the training dataset, 8%
(2960 images) for the validation dataset, and 4% (1508 images) for the test dataset.
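A sketch approximating the augmentation pipeline described above is shown below using Albumentations; the exact tooling and parameter values used to build the dataset are not reproduced here, so the transform list and probabilities are assumptions.

```python
import albumentations as A

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=15, p=0.5),                       # ±15° random rotation
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),        # camera-focus resilience
        A.ColorJitter(saturation=0.25, brightness=0.15,
                      contrast=0.0, hue=0.0, p=0.5),     # saturation/brightness shifts
        A.RandomBrightnessContrast(brightness_limit=0.10, p=0.3),  # over/under-exposure
        A.GaussNoise(p=0.2),                             # mild sensor-noise simulation
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: out = augment(image=image, bboxes=yolo_boxes, class_labels=labels)
```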
Use Case B
A preprocessing and data augmentation procedure similar to that of Use Case A has been performed here. Use Case B comprises a dataset of 1035 images, split into a train–val–test
proportion of 70%–20%–10% (992 images for training, 35 images for validation, and 18
for test purposes). The defects that the AI models discover are bowed panels, holes, and
oxide. The defects may come together in the same scene in some cases. As a result, three
different classes are annotated in the images: oxide (232 objects), hole (73 objects), and bow
(41 objects). Preprocessing and augmentation are applied to the training set to minimize
the spurious effects of light, contrast, visual noise, and other camera-related effects. After
augmentation, the training set of images increases to 6200 images.
Use Case C
This third use case handles the images differently. The input data come from LiDAR
point clouds and are converted into grayscale images for analysis. The goal is to find
the correct or anomalous pose of the truck in the bay according to the synthetic images
received from the LiDAR sensor. Therefore, the two classes trained within the model
are (a) correct, standing for the correct pose of the truck, and (b) anomalous, denoting
incorrect poses, where the AGV triggers an emergency stop. After converting all point
clouds to images, the dataset consists of 10,667 images with 9082 correct pose objects and
534 anomalous pose objects. This dataset is split into a 70%–20%–10% stratified proportion,
yielding 6754 images for training, 1919 images for validation, and 943 for testing. Neither
preprocessing nor augmentation is required for this dataset since they are LiDAR point
cloud grayscale images with few camera-dependent features. Consequently, the original
dataset suffices for training an object detection model with good generalization capabilities,
as later shown by our reported results.
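A minimal sketch of such a stratified 70/20/10 split at the image level is given below; scikit-learn is an assumption, as the authors' exact splitting procedure is not detailed, and the placeholder file names and label counts are illustrative only.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: per-image file names and class labels ("correct"/"anomalous").
images = [f"scan_{i:05d}.png" for i in range(10000)]
labels = ["anomalous" if i < 500 else "correct" for i in range(10000)]

# First carve out 30%, then split that remainder into 20% validation and 10% test.
train_x, rest_x, train_y, rest_y = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=1 / 3, stratify=rest_y, random_state=42)
print(len(train_x), len(val_x), len(test_x))  # ~70% / 20% / 10%
```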
Training Hyperparameters
Models were trained on a Colab Jupyter Notebook using an NVIDIA T4 GPU. The
training parameters were set to T = 100 epochs and a batch size of B = 16. Values of the
rest of the hyperparameters are listed in Table 2.
For the most recent YOLO models (from V5 onward), internal data augmentation is
performed. For our experiments, such augmentation includes HSV (hue, saturation, value)
with fractional parameters hsv-h = 0.015, hsv-s = 0.7 and hsv-v = 0.4; image translation
with fractional parameter translate = 0.1; and scaling with parameter scale = 0.9. Neither
shearing nor perspective-changing was set. Flipping left–right (horizontally) augmentation was configured with parameter fliplr = 0.5, image mosaic generation had a probability of 1.0, image mixup augmentation had a probability of 0.15, and copy-paste augmentation was configured with parameter copy-paste = 0.3.
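The configuration above maps directly onto the Ultralytics training API; a sketch is given below, where the dataset YAML path and the choice of the medium-sized checkpoint are assumptions.

```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # illustrative variant; any benchmarked model could be used
model.train(
    data="use_case_a.yaml",   # hypothetical dataset descriptor
    epochs=100, batch=16, imgsz=640,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,   # HSV augmentation
    translate=0.1, scale=0.9,            # geometric augmentation
    shear=0.0, perspective=0.0,          # disabled, as reported above
    fliplr=0.5, flipud=0.0,
    mosaic=1.0, mixup=0.15, copy_paste=0.3,
)
```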
Evaluation Metrics
We consider standardized metrics for this particular modeling task [44], specifically the
Precision and Recall values, the mean average precision at a specific Intersection over Union
(IoU) threshold of 0.5 (denoted as [email protected]), and averaged over IoU thresholds from 0.5 to
0.95 in steps of 0.05 ([email protected]:.95), and the F1 score. The Precision score informs about the
quality of a positive prediction made by the model by counting the number of true positives
divided by the total number of positive predictions. The Recall metric denotes the fraction
of correctly classified positive cases. With the Recall metric, we capture the positive cases
that are misclassified by the model as negatives. It is an important metric in our use cases for
the criticality of certain cases in each application, where a false negative can have damaging
consequences. This is especially true for Use Case A if the robot incorrectly shoots at
cardboard with the weed-removing laser, and for Use Case C, where an incorrect detection
of the truck’s pose could result in a collision with the AGV. The mean average precision
metric mAP measures the average precision of detection across all classes in the model. It is
evaluated at two different detection thresholds. Firstly, it measures precision at a threshold
where the positive detection aligns with the ground truth within the selected intersection
range, called the intersection over union (IoU) of 50% (mAP@IoU = [email protected]). The second
metric, [email protected]:.95, is much more restrictive and gives the values of the positive detections
within an intersection range between 50% and 95% of the detected object according to the
ground truth. Finally, the F1 score measures the performance of the model by combining its
precision and recall scores as 2 · Precision · Recall/( Precision + Recall ).
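For clarity, the sketch below computes Precision, Recall, and the F1 score from detection counts at a fixed IoU threshold; the worked numbers are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 90 true positives, 10 false positives, 5 false negatives.
print(precision_recall_f1(90, 10, 5))  # (0.90, ~0.947, ~0.923)
```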
Table 4. YOLO performance comparison in Use Case A when detecting object instances of 3 classes
(weed, cable and other).
Table 5. Performance of YOLOv9 per every object class measured over the 2960 validation images of
Use Case A.
We begin by discussing the results of the binary detection model. The model was
trained to only find weeds on the rail tracks, without detecting any other object class. The
model specializes in weed detection, yielding very good detection metrics for all YOLO
versions considered in the benchmark. The field tests demonstrate high performance, as
evidenced by the metrics in Table 3, reporting values as high as 0.96 for [email protected] and an F1
score of 0.972 for YOLOv9. Due to the specialization of the trained model to the weed class,
its application is strictly limited to the Use Case and its environment. Consequently, it has
no further utility outside the scope of the rail environment.
We follow by considering the alternative experiment in which the model is trained
to detect weeds, cables, and other objects (e.g., bags, cardboard) within the rail tracks.
We first focus on evaluating the performance of the model across all three classes and
compare it to the performance of the model specialized in weed detection. As shown
in Table 4, detecting and classifying more than one class with a single object detector
involves a substantial overall decrease in the performance metrics for all models. We
further inspect the detection results by analyzing what affects the overall performance
and how such performance degradation spreads over the object classes considered in this
second experiment. This is the purpose of Table 5, which breaks down the performance
figures for the three classes and the best-performing YOLOv9 model. The main objective of
the model is to detect weeds. The relatively higher presence of annotated weed objects in
the collected dataset makes this class well-represented. However, the high visual variability
of weeds and the model having to detect the other two classes give rise to lower detection
scores for this class. Nevertheless, precision and recall are well balanced, as opposed to the
scores associated with the other class. In this second class, the recall degrades significantly,
meaning that many objects belonging to the other class are not detected by the model.
The third cable class, however, undergoes severe degradation of both precision and recall.
Cables laid all over the train route are well-detected by the model. However, the model
not only fails to detect individual cables (low recall) but also produces false positives
(low precision) due to the inherent difficulty of detecting filament shapes in real-world
image data.
Table 7. YOLO performance comparison in detecting the correct or anomalous truck pose for
automated loading in a warehouse (Use Case C).
Due to the simplicity of the grayscale images resulting from LiDAR preprocessing,
training YOLO algorithms on these images yields exceptionally good object detection
models, achieving [email protected]:.95 values of up to 84% (with YOLOv9).
This is the highest score obtained in our experimentation and exposes the practical relevance
of signal preprocessing in achieving actionable levels of performance without requiring
convoluted modeling choices.
[F1-confidence curve legend: weed, cable, other, all classes (maximum F1: 0.56)]
Figure 7. (left) Examples of detected objects by YOLOv9 over Use Case A, and (right) F1 confidence
curve for weeds, cables, and other objects.
We proceed now with the results corresponding to Use Case B and Use Case C. In the
four pictures on the left of Figure 8, we depict the detection of YOLOv9 for different types
of defects, including bowed and holed containers with oxide parts on them. The right
plot depicts the F1 confidence curve corresponding to each class, together with the one
corresponding to all objects. It can be observed that all objects seem to be well-detected by
the YOLOv9 model since the metallic surface of the TEU containers is chromatically less
heterogeneous than the scene in Use Case A. Finally, the qualitative results of Use Case C
are illustrated through examples with several correct and anomalous detected poses in
Figure 9 (left), together with the F1 confidence curve of both classes (right). In this case,
the examples on the right highlight the practical utility of preprocessing the LiDAR point
cloud data (as shown in Figure 4) to remove contextual variability, allowing the model to
receive only information that is predictive of its commanded task (the borders of the truck).
[F1-confidence curve legend: bowed, hole, oxide, all classes (maximum F1: 0.73)]
Figure 8. (left) Examples of defects detected by YOLOv9 over Use Case B, including holes, bowed
panels, and oxide on top of the monitored asset; (right) F1 confidence curve of each class.
[F1-confidence curve legend: correct, anomalous, all classes (maximum F1: 0.76)]
Figure 9. (left) Exemplifying instances of Use Case C where YOLOv9 detects correct or anomalous
truck poses; (right) F1 confidence curve for each class.
Use Case B
The usual procedure for the inspection of the containers first requires that the AGV
navigates to the container. Once there, the AGV circles around the container (Figure 11c)
in the first inspection, with the cameras installed in the liftable receptacle (black box)
(Figure 11b,f). At a specific point, the elevator stops, opens the deployment ramp, and lets
the surface inspection robots exit to the top of the container (Figure 11c). It first releases a
robot from the first floor of the receptacle, then raises the elevator again to let the second
robot go out. As it elevates, the side cameras of the receptacle acquire lateral images
(Figure 11e). The robots concurrently inspect the top of the container (Figure 11a,c,g), while the AGV continues to circle the container and inspect its sides. Finally, the AGV raises the lift pod again
to pick up the robots. It opens the access ramp, and the first robot enters the receptacle. It
then lowers the receptacle slightly, deploys the second ramp, and the second robot enters it
(Figure 11e).
In the second use case, the field trials showed unexpected limitations of the devised
solution: the AMR was unable to inspect the sides of containers placed adjacent to each other, a limitation that also affects conventional manual visual inspection. In this case, the following steps
were taken:
• The container targeted for inspection was separated from the others to allow access
to its sides. In the port terminal, containers were in constant motion as they were
loaded and unloaded from ships. Therefore, while this container movement slowed
down the inspection and was more inconvenient, it was not a critical maneuver for
the port operations.
• If the container was empty, it was inspected from the inside for light leaks (Figure 11h), in-
dicating the presence of a hole. This workaround only allowed identifying hole defects.
As a result of our field trials in Use Case B, defects could not be detected by the
AMR more effectively than by the port experts. The port premises house very experienced
operators who directly understand the potential causes of each defect. However, the
method did achieve one of the desired safety outcomes by preventing operators from climbing to the top of the containers.
Also, by automating the inspection, we enhanced the digitization of the entire process and its data, because images sent and stored by the system are useful for the traceability of the inspection and the accountability of decisions made. In all cases, operators
decide whether to remove a container from circulation and set it for repair. However, the
developed AMR system provides an informational database that can be used to safely
validate such decisions.
From a mechanical perspective, one of the biggest limitations identified during the
trials emerged when the upper robots moved from one container to another from the
top. The initial idea was to let them move on their own in areas with many containers
placed close to one another, traversing across all the containers by navigating through
the small gaps and spaces between them. This did not work as expected. Although containers were close enough together to make lateral inspection infeasible (as noted above), there was too much space for the top AMRs to move from one container to the next by relying solely on their tracks without falling or becoming stuck between the two containers. To amend this issue, the containers should have been placed less than three
or four centimeters apart, but many of them were slightly more separated than this critical
distance. The underlying trade-off between the maneuverability of container deployment
in the port premises and the autonomy of the AMR to navigate through contiguous assets
has captured the interest of the management of the port authority and is expected to drive
applied research studies in the future.
When it comes to the object detection model itself, a common problem occurred with
containers damaged with large dents, i.e., those covering almost the entire side panel of
a container. Models trained to identify those defects effectively ended up detecting any
container structure as a defect, annotating any panel as such. The reason for this detection
failure is twofold: (1) the visual information varies significantly when the dent is viewed
from different angles, which can be challenging even for the human eye; and (2) there is
not as much chromatic information as when the dent is small, e.g., a scratch removing a
significant amount of paint from the container’s surface. We envision that for this particular
type of defect, the AMR should be equipped with additional sensors, increasing the cost of
the overall robotic approach.
Despite these unexpected eventualities, the test trials of Use Case B demonstrated the improved safety of automating the entire operation rather than performing it manually.
The key for port operators to embrace this solution was the incorporation of AI-empowered
object detection models for the defects; otherwise, the performance differences compared
to visual inspection would have been too significant for the AMR-based approach to be of
any practical usefulness.
Use Case C
The cargo transport operation tackled in Use Case C involved a maneuver that none
of the operators wanted to perform. They had to drive the AGV so close to the walls of the
truck (where they could hardly see anything) that very few of them had managed to do it
without bumping into the sides of the cargo truck. Most operators typically struggle with
orientation; they start moving the forks inside, but often end up getting stuck inside the
truck, requiring many maneuvers to deposit the load. Only minimal correction maneuvers
are possible inside the truck, both laterally and angularly. The angle must be precisely
defined before entering, taking into account that the truck is not always positioned the same
way in the bay: there is always some lateral and angular displacement that complicates the
loading maneuver. The trucker parks it with some references to the bay, but there is always
some displacement. For manual loading, this displacement is irrelevant. However, for the
AGV to operate autonomously, it is crucial that the maneuver is planned in advance. In
this case, the AI-based object detector indicates whether the AGV is correctly aligned with
the trailer. Upon a positive response, we can then calculate the angle at which the truck
has been docked, in order to adjust the AGV’s pose to match the truck’s angle. The object
detector aids in identifying the shapes within the point cloud that are characteristic of the
bay entrance and the rear of the trailer, as well as indicating whether it is correctly oriented.
To verify the operational performance of the robotic solution devised to address this
use case, a metallic structure was constructed to simulate the load to be deployed by the
AGV inside the trailer (Figure 12a–d). Once inside the trailer, measurements were taken
with the lateral ultrasound and LiDAR sensors installed in the structure (Figure 12e,f). It
should be noted that in a fully real-world scenario, the same sensors are located in similar
positions on a palletizer rather than on the aluminum frame used in our experiments. In
addition, the robotic forklift is a pallet truck with a higher load capacity (Figure 1c) because
it must lift several tons.
In this case, once the robot has entered the truck perfectly aligned and with the correct
orientation, it adjusts 1 cm at a time inside the container, moving slowly but steadily until
it deposits the load. When exiting, it proceeds likewise in reverse. This is only possible if
perfect orientation and movement are ensured upon entering. In our field trials, the correct
or anomalous pose of the truck in the bay, detected by the AI-based approach from the
images generated from the point cloud data captured by the LiDAR sensor, was found to
be very valuable in safely completing the loading maneuver. However, in the field tests,
the machine also failed to autonomously correct minimal angular and lateral deviations
inside the truck. Despite the slow-motion dynamics imposed on the robotic solution (1 cm
per cycle), the correction was not successfully completed in several spatial configurations.
As a result, the AGV ended up hitting the lateral panels of the truck, causing catastrophic
structural damage due to the high inertial force of its load. In such cases, the solution was provided by proximity-based collision detection using the ultrasound sensors, which triggered an emergency stop in the AGV. The pose estimation method (based on YOLOv8,
which elicited the best detection performance for this use case) and the correction prior to
the entrance maneuver were found to be the only effective ways to perform a correct entry
maneuver of the AGV into the truck. This, combined with the lateral collision avoidance
system, comprised the overall AGV automated maneuver system that led to satisfactory
results in the conducted field tests.
Our main conclusions drawn from our results can be summarized as follows:
• In Use Case A (railway inspection), the goal of the AMR is to detect weeds between
the railway tracks so that they can be eliminated by a laser beam. This mechanical
solution can be an alternative to the current method using glyphosate, which is spread
from maintenance wagons or trucks. An AMR supported by AI-based detection has
been shown to perform effectively in discriminating between weeds to be eliminated
and other critical rail infrastructure objects (e.g., beacons, cables, and debris), which
the robot must detect to avoid causing damage.
• In Use Case B, the complexity of detecting and classifying specific defects in TEU con-
tainers is currently achievable only by human workers. The challenges and dangerous situations associated with such tasks are substantially alleviated by AGVs and camera-based AI object detection, where robots equipped with cameras are deployed
at the inspection site. The detection results in this second use case have also been
promising, mainly due to the better discriminability of defects on the metallic panels
of the TEU containers under inspection.
• Finally, when it comes to Use Case C (automated loading of goods in trucks), we have
proven that this risky task can be fully automated by a robotized AGV. In this case,
the support of AI models is required in addition to the robot’s perception system. This
support ensures that a docking maneuver is initiated from the correct position and
that the AGV does not start from an anomalous position, which could potentially
lead to an incorrect trajectory. The inclusion of AI in this task provides enhanced
capabilities beyond what traditional computer vision systems can achieve.
On an overarching note, our threefold benchmark of YOLO object detectors over real-
world data and the results of the field tests with robotic platforms equipped with them have
proven the feasibility of these AI-based models as an additional, high-accuracy perception
sensor for robotic inspection tasks.
Several research directions can be pursued in the future based on the findings made
in this work, beyond overcoming the practical limitations noticed during the field tests.
Among them, we highlight the need to ensure proper generalization of the performance
of the learned model across similar inspection scenarios that are subject to different envi-
ronmental artifacts at their input (e.g., different characteristics of the asset) and/or their
output. Regarding the latter, an interesting line to follow is to explore the possibilities of
class-incremental learning [45] to make the model autonomously detect and characterize
new objects in the scene. To this end, we plan to investigate whether uncertainty estimation
techniques for object detection [46] can be exploited and used to detect unknown objects
that appear recurrently over time, so that the model can consolidate them as a new class to
be discriminated.
Author Contributions: Conceptualization, J.S.-C., J.D.S. and J.L.M.; methodology, J.S.-C. and J.D.S.;
software, J.S.-C.; validation, J.S.-C. and J.D.S.; data curation, J.S.-C.; writing—original draft prepa-
ration, J.S.-C.; writing—review and editing, J.S.-C., J.D.S. and J.L.M.; supervision, J.D.S. and J.L.M.;
project administration, J.S.-C. and J.D.S.; funding acquisition, J.S.-C. and J.D.S. All authors have read
and agreed to the published version of the manuscript.
Funding: The authors of this research acknowledge indirect (cascade-funding) financial support from
the European ESMERA project (grant agreement 780265) for the rail inspection robot and from the
GALATEA project (grant agreement 873026) for the maritime port inspection robot. J. Del Ser received
funding support from the Basque Government through ELKARTEK grants and the consolidated
research group MATHMODE (IT1456-22). J. L. Martin received support from the Department of
Education of the Basque Government via funds for research groups of the Basque University system
(ref. IT1440-22).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data collected during this research work cannot be made publicly
available due to confidentiality clauses of authorities, institutions, and companies involved in the
projects where the research was conducted.
Conflicts of Interest: The authors declare no conflicts of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.
References
1. Lattanzi, D.; Miller, G. Review of robotic infrastructure inspection systems. J. Infrastruct. Syst. 2017, 23, 04017004. [CrossRef]
2. Wong, C.; Yang, E.; Yan, X.T.; Gu, D. Autonomous robots for harsh environments: A holistic overview of current solutions and
ongoing challenges. Syst. Sci. Control Eng. 2018, 6, 213–219. [CrossRef]
3. Dudek, G.; Jenkin, M. Computational Principles of Mobile Robotics; Cambridge University Press: Cambridge, UK, 2024.
4. Alatise, M.B.; Hancke, G.P. A review on challenges of autonomous mobile robot and sensor fusion methods. IEEE Access 2020,
8, 39830–39846. [CrossRef]
5. Zghair, N.A.K.; Al-Araji, A.S. A one decade survey of autonomous mobile robot systems. Int. J. Electr. Comput. Eng. 2021,
11, 4891. [CrossRef]
6. Yasuda, Y.D.; Martins, L.E.G.; Cappabianco, F.A. Autonomous visual navigation for mobile robots: A systematic literature review.
ACM Comput. Surv. (CSUR) 2020, 53, 1–34. [CrossRef]
7. Zhang, J.; Yang, X.; Wang, W.; Guan, J.; Ding, L.; Lee, V.C. Automated guided vehicles and autonomous mobile robots for
recognition and tracking in civil engineering. Autom. Constr. 2023, 146, 104699. [CrossRef]
8. Patruno, C.; Renò, V.; Nitti, M.; Mosca, N.; di Summa, M.; Stella, E. Vision-based omnidirectional indoor robots for autonomous
navigation and localization in manufacturing industry. Heliyon 2024, 10, e26042. [CrossRef]
9. Loganathan, A.; Ahmad, N.S. A systematic review on recent advances in autonomous mobile robot navigation. Eng. Sci. Technol.
Int. J. 2023, 40, 101343. [CrossRef]
10. Siegwart, R.; Nourbakhsh, I.; Scaramuzza, D. Introduction to Autonomous Mobile Robots, 2nd ed.; Intelligent Robotics and
Autonomous Agents Series; MIT Press: Cambridge, MA, USA, 2011.
11. Mitchell, W.C.; Staniforth, A.; Scott, I. Analysis of Ackermann Steering Geometry; SAE International: Warrendale, PA, USA, 2006.
12. Mohanty, P.K.; Parhi, D.R. Controlling the motion of an autonomous mobile robot using various techniques: A review. J. Adv.
Mech. Eng. 2013, 1, 24–39. [CrossRef]
13. Macenski, S.; Moore, T.; Lu, D.V.; Merzlyakov, A.; Ferguson, M. From the desks of ROS maintainers: A survey of modern &
capable mobile robotics algorithms in the robot operating system 2. Robot. Auton. Syst. 2023, 168, 104493. [CrossRef]
14. Papadopoulos, E.; Aghili, F.; Ma, O.; Lampariello, R. Robotic manipulation and capture in space: A survey. Front. Robot. AI 2021,
8, 686723. [CrossRef]
15. Kortenkamp, D. Perception for Mobile Robot Navigation: A Survey of the State of the Art. In Proceedings of the Dual-Use Space
Technology Transfer Conference and Exhibition, Houston, TX, USA, 1–3 February 1994; Volume 2.
16. Borenstein, J.; Koren, Y. Real-time obstacle avoidance for fast mobile robots. IEEE Trans. Syst. Man Cybern. 1989, 19, 1179–1187.
[CrossRef]
17. Slack, M. Navigation templates: Mediating qualitative guidance and quantitative control in mobile robots. IEEE Trans. Syst. Man
Cybern. 1993, 23, 452–466. [CrossRef]
18. Culjak, I.; Abram, D.; Pribanic, T.; Dzapo, H.; Cifrek, M. A brief introduction to OpenCV. In Proceedings of the 2012 35th
International Convention MIPRO, Opatija, Croatia, 21–25 May 2012; pp. 1725–1730.
19. Van der Walt, S.; Schönberger, J.L.; Nunez-Iglesias, J.; Boulogne, F.; Warner, J.D.; Yager, N.; Gouillart, E.; Yu, T. scikit-image: Image
processing in Python. PeerJ 2014, 2, e453. [CrossRef]
20. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic
differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017),
Long Beach, CA, USA, 4–9 December 2017.
21. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow:
Large-Scale Machine Learning on Heterogeneous Systems, 2015. Available online: https://www.tensorflow.org/ (accessed on
3 June 2024).
22. Sharma, S.; Jain, K.; Shukla, A. A Comparative Analysis of Feature Detectors and Descriptors for Image Stitching. Appl. Sci. 2023,
13, 6015. [CrossRef]
23. Blubaugh, D.A.; Harbour, S.D.; Sears, B.; Findler, M.J. OpenCV and Perception. In Intelligent Autonomous Drones with Cognitive
Deep Learning; Apress: Berkeley, CA, USA, 2022; pp. 327–361. [CrossRef]
24. Cebollada, S.; Payá, L.; Flores, M.; Peidró, A.; Reinoso, O. A state-of-the-art review on mobile robotics tasks using artificial
intelligence and visual data. Expert Syst. Appl. 2021, 167, 114195. [CrossRef]
25. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
arXiv 2013, arXiv:1311.2524. Available online: http://arxiv.org/abs/1311.2524 (accessed on 3 June 2024).
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv 2014,
arXiv:1406.4729. Available online: http://arxiv.org/abs/1406.4729 (accessed on 3 June 2024).
27. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015,
arXiv:1506.02640. Available online: http://arxiv.org/abs/1506.02640 (accessed on 3 June 2024).
28. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
29. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
30. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2015,
arXiv:1512.02325. Available online: http://arxiv.org/abs/1512.02325 (accessed on 3 June 2024).
31. Lin, T.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. Available
online: http://arxiv.org/abs/1708.02002 (accessed on 3 June 2024).
32. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany,
2020; pp. 213–229.
33. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection
Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. Available online: http://arxiv.org/abs/2209.02976
(accessed on 3 June 2024).
34. Bochkovskiy, A.; Wang, C.; Liao, H.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
Available online: http://arxiv.org/abs/2004.10934 (accessed on 3 June 2024).
35. Wang, C.; Liao, H.M.; Yeh, I.; Wu, Y.; Chen, P.; Hsieh, J. CSPNet: A New Backbone that can Enhance Learning Capability of CNN.
arXiv 2019, arXiv:1911.11929. Available online: http://arxiv.org/abs/1911.11929 (accessed on 3 June 2024).
36. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. arXiv 2018, arXiv:1803.01534. Available
online: http://arxiv.org/abs/1803.01534 (accessed on 3 June 2024).
37. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression.
arXiv 2019, arXiv:1911.08287. Available online: http://arxiv.org/abs/1911.08287 (accessed on 3 June 2024).
38. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection
models. Digit. Signal Process. 2022, 126, 103514. [CrossRef]
39. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740. Available online:
http://arxiv.org/abs/2205.12740 (accessed on 3 June 2024).
40. Wei, J.; Qu, Y. Lightweight improvement of YOLOv6 algorithm for small target detection. Res. Sq. 2023. [CrossRef]
41. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696. Available online: http://arxiv.org/abs/2207.02696 (accessed on 3 June 2024).
42. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO, Version 8.0.0. 2023. Available online: https://github.com/ultralytics/
ultralytics (accessed on 3 June 2024).
43. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information.
arXiv 2024, arXiv:2402.13616. Available online: http://arxiv.org/abs/2402.13616 (accessed on 3 June 2024).
44. Padilla, R.; Netto, S.L.; Da Silva, E.A. A survey on performance metrics for object-detection algorithms. In Proceedings of the
2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; pp. 237–242.
45. Belouadah, E.; Popescu, A.; Kanellos, I. A comprehensive study of class incremental learning algorithms for visual tasks. Neural
Netw. 2021, 135, 38–54. [CrossRef]
46. Feng, D.; Harakeh, A.; Waslander, S.L.; Dietmayer, K. A review and comparative study on probabilistic object detection in
autonomous driving. IEEE Trans. Intell. Transp. Syst. 2021, 23, 9961–9980. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.