arXiv:2304.00501v5 [cs.CV]

Title: A Comprehensive Review of YOLO: From YOLOv1 and Beyond

Abstract: YOLO has become a central real-time object detection system for robotics, driverless cars, and video monitoring applications. We present a comprehensive analysis of YOLO's evolution, examining the innovations and contributions in each iteration from the original YOLO up to YOLOv8, YOLO-NAS, and YOLO with Transformers. We start by describing the standard metrics and postprocessing; then, we discuss the major changes in network architecture and training tricks for each model. Finally, we summarize the essential lessons from YOLO's development and provide a perspective on its future, highlighting potential research directions to enhance real-time object detection systems.
Comments: 34 pages, 19 figures, 4 tables, submitted to ACM Computing Surveys. This version adds information about YOLO with transformers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
ACM classes: I.2.10
Cite as: arXiv:2304.00501 [cs.CV] (or arXiv:2304.00501v5 [cs.CV] for this version)



Statistical Analysis of Design Aspects of Various YOLO-Based Deep Learning Models for Object Detection

  • Review Article
  • Open access
  • Published: 02 August 2023
  • Volume 16, article number 126 (2023)


  • U. Sirisha,
  • S. Phani Praveen,
  • Parvathaneni Naga Srinivasu,
  • Paolo Barsocchi (ORCID: orcid.org/0000-0002-6862-7593) &
  • Akash Kumar Bhoi


Object detection is a critical and complex problem in computer vision, and deep neural networks have significantly enhanced its performance in the last decade. There are two primary types of object detectors: two-stage and one-stage. Two-stage detectors use a complex architecture to select regions for detection, while one-stage detectors detect all potential regions in a single shot. When evaluating the effectiveness of an object detector, both detection accuracy and inference speed are essential considerations. Two-stage detectors usually outperform one-stage detectors in detection accuracy; however, YOLO and its successor architectures have substantially improved their detection accuracy. In some scenarios, the speed at which YOLO detectors produce inferences matters more than detection accuracy. This study explores the performance metrics, regression formulations, and design of single-stage YOLO detectors. Additionally, it briefly discusses various YOLO variations, including their design, performance, and use cases.


1 Introduction

Computer vision is a highly researched field, with efforts directed toward enabling machines to comprehend and interpret complex visual content. Object detection is a significant challenge in this domain, which involves identifying and locating objects of interest in images or videos. Deep learning, a subfield of machine learning and AI, gained prominence in the early 2000s after artificial neural networks, multilayer perceptrons, and support vector machines became popular. However, initially, deep learning faced scalability issues and high computing power requirements, which limited its adoption. Still, the availability of large datasets and powerful computers since 2006 has significantly contributed to the widespread popularity of deep learning.

Object detection is a method in computer vision that involves identifying and localizing objects within images or videos. The main objective is to precisely detect objects' existence, location, and dimensions in an image or video and label them with an appropriate class label. Deep learning techniques of this kind have found applications including, but not limited to, prediction of stock values [1], speech recognition [2], object detection [3], character recognition [4], intrusion detection [5], landslide detection [6], time-series problems [7], text classification [8], gene expression [9], micro-blogs [10], data handling [11], fault classification of irregular data [12], captioning of images [13, 14], aspect-based sentiment analysis [15], and caption generation from videos [16]. Object detection models utilize a range of algorithms and deep learning architectures to detect and classify objects in real-world scenarios.

Object detection can be performed on different forms of data: images, video, and even audio. In images, the goal is for a computer or software system to find and identify individual items within a scene. Object detection in video operates much as it does in images: the system locates, recognizes, and categorizes the objects visible in the moving frames. Detectors can also identify objects based on characteristic sounds.

Object detection using machine learning models refers to a set of algorithms that can automatically identify and locate objects in images or videos. These models employ feature extraction, feature selection, and classification techniques to recognize objects in visual data. To train these models, labeled images are provided in which each object of interest is labeled with its corresponding class. The model then utilizes these labeled images to learn features specific to each class of objects. Several machine learning models are available for object detection, including support vector machines (SVM), decision trees, and random forests [17, 18]. These models differ in their feature extraction and classification approaches and may perform differently depending on the task and data at hand. Some of these models require manual feature engineering, while others can automatically learn features from the input data.

Deep learning models refer to a class of neural networks that can automatically identify and locate objects in images or videos. These models utilize multiple layers of processing units to extract complex features from the input data, which makes them efficient for object detection tasks. Examples include CNNs, R-CNNs, SSDs, and You Only Look Once (YOLO) models, which can recognize objects accurately and detect multiple objects in a single image or video. Training a deep learning model for object detection involves providing a large dataset of labeled images or videos, with each object labeled by class and bounding-box coordinates. The model learns to identify and locate objects by minimizing a loss function that measures the difference between predicted and ground-truth labels and bounding boxes. These models are used in applications such as autonomous driving, surveillance, and robotics.

Classification, localization, and segmentation (Fig. 1) are three crucial tasks that object detection models aim to accomplish. Classification refers to identifying the object's category in an image or video by assigning a class label to the entire image or a specific region of interest. For example, a model can identify a car, a pedestrian, or a traffic sign in an image. Localization refers to identifying the object's location in an image or video by drawing a bounding box around it, which provides the coordinates of the object's position within the image, enabling the model to locate the object accurately. Segmentation refers to identifying the pixels that belong to an object in an image or video, enabling the model to create a pixel-level mask that outlines the object's shape. Each segment usually shares similar color, texture, and pixel intensity. This technique is more precise than bounding-box localization and can be useful in scenarios where precise object boundaries are necessary, such as medical imaging or satellite imagery analysis. Object detection models typically aim to perform all of these tasks together to comprehensively understand the objects in an image or video.

Figure 1: Image classification and object detection

The key implementation steps for object detection are listed below (a minimal annotation-parsing sketch follows the list):

**Data collection and annotation**: Collect a large dataset of images or videos with labeled objects. The labels should include the class of each object and its corresponding bounding-box coordinates.

**Pre-processing of data**: Prepare the data for training, for example by resizing and normalizing the images and cleaning or converting the annotations.

**Selecting a model**: Choose a suitable object detection model based on the specific requirements of the task, such as accuracy, speed, and computational resources.

**Training the model**: Train the selected model on the labeled dataset using a suitable training algorithm and optimization technique. This involves adjusting the weights and biases of the model to minimize the loss function.

**Validation and testing**: Validate the model on a separate dataset to check its performance and fine-tune the hyperparameters if necessary. Test the final model on a dataset to evaluate its generalization ability.

**Deployment**: Deploy the trained model on a production environment or integrate it into a larger system for real-world use. This involves optimizing the model for inference speed and memory usage and ensuring its compatibility with the target hardware and software platform.
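As a concrete illustration of the data collection and annotation step, YOLO-family tooling conventionally stores labels as one plain-text file per image, where each line holds `class x_center y_center width height` with coordinates normalized to [0, 1]. The minimal sketch below (the file path and image size are placeholder inputs) parses such a file and converts each box to pixel corner coordinates:

```python
from pathlib import Path

def load_yolo_labels(label_path, img_w, img_h):
    """Parse a YOLO-format annotation file (one object per line:
    'class x_center y_center width height', all normalized to [0, 1])
    and convert each box to pixel (class, x1, y1, x2, y2) tuples."""
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        if not line.strip():
            continue
        cls, xc, yc, w, h = line.split()
        xc, yc = float(xc) * img_w, float(yc) * img_h
        w, h = float(w) * img_w, float(h) * img_h
        boxes.append((int(cls), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2))
    return boxes

# Example: for a 640x480 image whose label file contains "0 0.5 0.5 0.25 0.5",
# the function returns [(0, 240.0, 120.0, 400.0, 360.0)].
```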

Training techniques for object detection involve the methods and strategies used to train models to accurately detect and locate objects in images or videos. Here are some common training techniques for object detection:

Supervised learning: This is the most common training technique for object detection. Each object in the dataset is annotated with its corresponding class label and bounding-box coordinates. The model learns to detect and localize objects in the input data by optimizing a loss function.

Transfer learning: This technique involves using a pre-trained model. This can save time and computing resources, as the model has already learned general features useful for object detection.

Augmentation of data: Applying transformations such as flips, crops, rotations, and color jitter to the training images (and their boxes) can help improve the model's generalization to new and unseen data.

One-shot learning: This technique involves training the model to detect objects with only one or a few examples of each class. It can be useful in scenarios where obtaining a large labeled dataset is difficult.

Active learning: Involves selecting the most informative and uncertain samples from a pool of unlabeled data and presenting them to a human annotator for labeling. The labeled data is then used to train the object detection model, and the process is repeated iteratively to improve the model’s performance.

Reinforcement learning: This involves training the object detection model using a reward-based system, where the model learns to maximize a reward signal by detecting objects accurately. Reinforcement learning can be useful in scenarios where the object detection task involves complex and dynamic environments, such as robotics or autonomous driving.

Object detection is a crucial component of computer vision, with commercial uses in video surveillance, healthcare, and in-vehicle sensing. Although it remains a challenging problem, the field has advanced considerably over the past decade, with the research community setting a new standard for performance each year. Deep neural networks and the massive processing capacity of NVIDIA graphics processing units made this possible. The development of object detection can be split into two distinct periods:

Up until 2012, conventional computer vision methods were in use.

When AlexNet won the ImageNet Visual Recognition Challenge in 2012, a new era for convolutional neural networks began.

In Fig.  2 , the development of object detection algorithms is depicted. Early object detection methods, such as Viola-Jones, Histogram of Oriented Gradients, and Deformable Parts Model, relied on manual feature extraction from the image, such as edges, corners, and gradients, and traditional machine learning algorithms.

Figure 2: Evolution of object detection algorithms

After that, cutting-edge image classification architectures were adopted as feature extractors for object detection; the two problems are connected, as both depend on learning reliable high-level features. The paper "Rich feature hierarchies for accurate object detection and semantic segmentation" introduced R-CNN and demonstrated how convolutional features can be employed for object detection. Recent years have seen tremendous advancement, and deep learning detection techniques can be divided into two categories.

Two-stage object detection: The first step is object region proposal, followed by object classification within each proposed region and bounding-box regression. Although slower than other detectors, this family has the highest accuracy. Examples include the R-CNN, Faster R-CNN, and Mask R-CNN algorithms.

One-stage object detection eliminates the region-proposal step and predicts bounding boxes directly from the image. These detectors are much faster than two-stage detectors but have trouble detecting small objects. Their fast inference speed makes single-stage detectors, such as SSD, YOLO, and EfficientDet, suitable for practical applications.

This section provides an overview of computer vision and deep learning, object detection and related terminology, key implementation steps, a timeline of how object detection algorithms have developed, and the review's main contributions. Our analysis focuses on an in-depth examination of the design of YOLO and its architectural successors, the optimizations introduced in each successor, and how they compare with various two-stage object detectors.

This is the outline of the paper. Section 2 looks at a few survey papers on YOLO architectures. Section 3 reviews the different YOLO versions, YOLO's design concepts, and the pre-trained models used in them. Section 4 reviews the datasets and evaluation metrics used with YOLO. Section 5 compares the YOLO versions in terms of performance, architecture, and input size, providing some statistics on their relative effectiveness. Section 6 provides a detailed analysis of challenges and future research directions. Finally, we wrap up the paper with the conclusion.

2 Related Work

2.1 Prior Analysis of YOLO Algorithms

Only a few survey studies have been published, but they all provide a solid overview of the history of YOLO algorithms. The authors in [19] presented a review of two-stage and one-stage techniques, an architectural overview of the YOLO versions, and a comparative analysis among them. The author of [20] focused on an overview of the YOLO versions using public data.

2.2 Novelty and Contributions

Most evaluations and reviews cover both one-stage and two-stage object detection techniques. As far as we know, this assessment is devoted specifically to single-stage techniques based on YOLO algorithms. Here, we thoroughly analyze YOLO algorithms in terms of their fundamental architectures, benefits and drawbacks, comparative and incremental approaches in this field, well-known datasets, outcomes, and potential future applications. The contributions include the following:

Highlight each stage's difficulties and significance in the object detection process.

Motivate the need for single-stage object detectors and thoroughly analyze YOLO's incremental architectural features, suggested optimization methods, and YOLO-based applications.

Compare several versions of YOLO in terms of performance and outcomes, and discuss potential directions for future study of single-stage object detectors.

3 Evolution of YOLO Algorithms

This section presents the basic principles, designs, and incremental improvements of the various YOLO algorithms, summarized in Fig.  3 .

Figure 3: Timeline of YOLO algorithms

The basic terms related to YOLO architecture are briefed below.

CNN: Object detection is a crucial task in computer vision, and CNNs have played a significant role in advancing this field. CNNs can extract relevant features from images and use them to classify and locate objects within the image, making them well suited for this task. The Region-based CNN (R-CNN) family of algorithms is a popular approach for object detection using CNNs. These algorithms generate a set of region proposals, use a CNN to extract features from each proposal, and use these features to classify and refine the object's location within each proposal. Subsequent advancements have produced faster and more accurate CNN-based detection algorithms.

Convolutional layer: DenseNet-169 is a layered architecture used for classification that incorporates convolutional layers. When an input is fed into a convolutional layer, a filter is applied to it, producing a feature map that shows the relative importance of different features within the data. The activation function, ReLU, is then applied to the feature map. The convolutional layer output is computed as a dot product between the filter and the input. In the DenseNet-169 architecture, a convolutional layer with a d × d filter applied after a square neuron layer of size \(S \times S\) produces an output of size \(\left( {S - d + 1} \right) \times \left( {S - d + 1} \right)\). Equation ( 1 ) computes the non-linear feed to component \(ij\) by incorporating input from the cells of the preceding layer

The non-linearity of the model is assessed through Eq. ( 2 )

Max-pooling layer: Max pooling is a technique that subsamples a tensor's spatial dimensions while preserving its depth. Overlapping max pooling refers to pooling windows that overlap, i.e., the stride is smaller than the window size. To improve convergence and generalization while avoiding scaling issues, it is recommended to include a max-pooling layer, which can be attached to every convolutional layer or to a subset of them. Equation ( 3 ) illustrates how max pooling is performed by a max-pooling layer \(M_{p}\) using a filter of size k with dimensions \(k_{x} ,k_{y} ,k_{z}\)

Global average pooling: The global average pooling layer, which has no trainable parameters, can replace the flattening layer typically placed after the last pooling layer in a convolutional neural network. This technique significantly reduces the size of the input passed to the subsequent classification layer. In fully connected layers, overfitting is a concern that can be addressed using dropout, and the global average pooling layer can also help with this. Global average pooling performs an even more extensive form of dimensionality reduction by reducing a tensor with original dimensions of \(l \times b \times h\) to dimensions of \(1 \times 1 \times h\): each of the \(h\) feature maps is collapsed to a single value by taking the mean of its \(l \times b\) entries.

Fully connected layer (FCL): A fully connected neural network layer establishes a linear connection between input and output neurons. The information learned by lower layers can then be used to classify data at the FCL. An advantage of FCLs is that they make no structural assumptions about the input data. To interpret the activation at a given layer with dimensions \(l_{1} \times l_{2} \times l_{3}\), a multilayer perceptron function (MPF) is constructed to produce a class probability distribution. The final layer of the MPF-based multilayer perceptron has \(1 \times 1 \times d\) output neurons, where \(d\) is the number of output classes. Equation ( 4 ) is used to compute the MPF

The purpose of the fully connected structure would be to provide a probability interpretation of each category by altering the weight parameters \(w_{i,j}^{l} { }\) based on the feature map produced by the linear combination of the convolutional, non-linearity, rectification, and pooling layers.

Softmax layer: The softmax function takes a vector of numbers as input, where each element can be positive, negative, or zero, and yields a probability distribution: each input is exponentiated and divided by a normalization factor in the denominator, so large inputs map to high probabilities and very negative inputs map to probabilities near zero.
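For reference, the normalization just described can be written explicitly. For an input vector \(z = (z_{1}, \ldots, z_{K})\), the softmax output is

\[
{\text{softmax}}(z)_{i} = \frac{e^{z_{i}}}{\sum_{j = 1}^{K} e^{z_{j}}}, \qquad i = 1, \ldots, K,
\]

so the outputs are positive and sum to one, and a large \(z_{i}\) receives most of the probability mass while a very negative \(z_{i}\) receives almost none.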

3.1 YOLO (V1)

On June 8th, 2015, YOLO (V1) [21] was introduced. It employs a convolutional neural network with two main stages: early convolutional layers extract image features, and fully connected layers predict the output probabilities and coordinates. The model's architecture is inspired by the GoogLeNet framework and was trained and evaluated on the PASCAL VOC 2007 and 2012 datasets using the Darknet framework. In YOLO (V1), GoogLeNet's inception modules are replaced with (1 × 1) convolutional filters followed by (3 × 3) filters, except for the first layer, which uses a (7 × 7) filter. Figure  4 illustrates that YOLO (V1) has 24 convolutional layers and two fully connected layers, and only four of the convolutional layers are followed by max-pooling layers. This version of the method highlights the use of (1 × 1) convolutions and global average pooling.

Figure 4: Localization and detection of objects based on the YOLO architecture

The authors spent around a week pre-training the first 20 convolutional layers on the ImageNet 2012 dataset, followed by an average pooling layer and a fully connected layer. Four more convolutional layers and two fully connected layers with random initialization are then added to the model to fine-tune it for object detection. Large localization errors and limited recall are two key issues with this implementation of YOLO compared to two-stage object detectors.

Fast YOLO, a variant of YOLO (V1) with a simpler model of nine convolutional layers and fewer filters, is suggested for quicker object recognition. YOLO-LITE [22] is a different version of YOLO designed specifically for real-time object recognition on non-GPU machines. The authors show that shallower networks can detect objects without explicitly requiring accelerators. Additionally, they show that batch normalization hinders a shallow network's object detection ability. Table 1 summarizes the features of YOLO.

3.2 YOLO (V2)

The "YOLO9000: Better, Faster, Stronger" [23] paper was released by Redmon and Farhadi in 2017 at the CVPR conference. In this study, the authors offered two cutting-edge YOLO variants, YOLOv2 and YOLO9000, which were identical in architecture but trained differently. YOLO9000, built on YOLO (V2), the successor to YOLO (V1), can detect over 9000 categories. Most object detection methods in use at the time could only classify objects into a small number of categories, due to a lack of labeled object data, so the authors experimented with scaling the object detection task to more categories. Combining the COCO dataset and ImageNet produced more than 9418 object categories.

The architecture of YOLO (V2) is influenced by VGG and Network in Network. As shown in Table 2, it employs the Darknet-19 structure, which consists of 19 convolutional layers and max-pooling layers. In contrast to the base version, it adds considerably more functionality. For model training, various data-augmentation methods are used, including random crops, rotations, and more; nevertheless, this version has trouble detecting smaller objects. In addition to using pre-existing ideas such as global average pooling and (1 × 1) convolutions, the authors also introduced new optimization approaches. Table 3 summarizes the key features of YOLO (V2).

3.3 YOLO (V3)

The third iteration, YOLO (V3), was introduced in Joseph Redmon and Ali Farhadi's paper "YOLOv3: An Incremental Improvement" [24] in 2018. Although slightly larger than the prior models, it remained adequate in both speed and accuracy. YOLO (V3) is an enhanced version of YOLO (V1) and YOLO (V2): it recognizes objects in real time in videos, live streams, or still photographs. While the first version of YOLO had localization issues, the second version had difficulty detecting smaller objects. Using the COCO dataset [24], the third iteration addresses these problems and provides a quick and easy way to find objects. This version excels at handling smaller objects but struggles with medium and large ones.

The YOLO (V3) design is built on the Darknet-53 framework, a network with 53 convolutional layers that employs 3 × 3 and 1 × 1 convolutional filters and shortcut connections. It is twice as fast as ResNet-152 without sacrificing performance. Figure  5 illustrates the general architecture that underpins YOLO (V3).

Figure 5: The architecture of YOLO (V3)

The Feature Pyramid Network (FPN) served as an inspiration for YOLO (V3): it uses FPN-like up-sampling, skip connections, and residual blocks. Like FPN, YOLO (V3) detects objects from feature maps using (1 × 1) convolutions and produces feature maps at three different scales, with the input down-sampled by factors of 32, 16, and 8. The first detection is made after an initial run of 81 convolutional layers with a stride of 32, where a (1 × 1) convolution is applied to a (13 × 13) feature map. The second detection is made at the 94th layer with a stride of 16: the feature map from the 79th layer is up-sampled by 2× and concatenated with the 61st layer, and further convolutions produce a (26 × 26) feature map. The third detection is made at the 106th layer with a stride of 8, using a (52 × 52) feature map.
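As a worked example, assuming the commonly used \(416 \times 416\) input resolution (an assumption here; any multiple of 32 works), the three strides yield exactly the grid sizes quoted above:

\[
\frac{416}{32} = 13, \qquad \frac{416}{16} = 26, \qquad \frac{416}{8} = 52,
\]

so the coarse \(13 \times 13\) map is responsible for the largest objects and the fine \(52 \times 52\) map for the smallest.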

Fine-grained features for detecting tiny objects are extracted by concatenating down-sampled and up-sampled feature maps. The three feature maps \(\left( {13 \times 13,\ 26 \times 26,\ 52 \times 52} \right)\) are employed to detect large, medium, and small objects, respectively. Table 4 summarizes the key features of YOLO (V3).

3.4 YOLO (V4)

The YOLO (V4) architecture is the result of a series of experiments and studies aimed at improving the accuracy and speed of the convolutional neural network. The paper "YOLOv4: Optimal Speed and Accuracy of Object Detection", published in 2020, aims to create an object detector suitable for production systems. YOLO (V4) surpassed all previous versions in both speed and accuracy. Figure  6 presents the key features of YOLO (V4).

Figure 6: YOLO (V4) possible attributes

To create the YOLO (V4) architecture, the authors compared CSP-ResNeXt50, CSP-Darknet53, and EfficientNetB3. They chose CSP-Darknet53, which has 29 convolutional layers with 3 × 3 filters and about 27.6 M parameters, as the backbone network, because it outperforms the other architectures. Figure  7 shows that CSPNets provide rich gradient combinations at low computational cost.

Figure 7: Comparison of the CSPNet and DenseNet architectures used in YOLO (V4)

The classification backbone is pre-trained on ImageNet before detection training on COCO [25]. This study employed spatial pyramid pooling (SPP), a method also utilized by R-CNN. Fully connected layers placed after a CNN constrain the input and output volumes; with SPP, the input does not need to be resized or cropped, since SPP maps CNN feature maps of arbitrary size to fixed-length inputs for the fully connected layers. The Path Aggregation Network, which uses adaptive feature pooling, is preferred over the Feature Pyramid Network as a bottom-up path augmentation method. Table 5 summarizes the key features of YOLO (V4).

3.5 Scaled YOLO V4

The authors presented a paper named "Scaled-YOLOv4: Scaling Cross Stage Partial Network" [26]. By effectively extending the network's design and scale, Scaled-YOLOv4 improves on the Google Research Brain team's EfficientDet model.

Regarding speed and accuracy, the proposed detection network, which is based on the Cross-Stage Partial approach, outperforms previous benchmarks across small and large object detection models. The scaling technique allows a network's depth, width, resolution, and structure to be changed.

A simplified variant known as Scaled-YOLOv4-tiny uses TensorRT optimization (batch size = 4) to reach 22.0% AP at approximately 443 FPS. Scaled YOLO (V4) differs from YOLO (V4) in the following aspects:

Optimized network scaling techniques are used in ScaledYOLOv4.

Increased network training speed with modified activations for width and height.

CSP connections and MISH activation are used in the Neck (Path-Aggregation Network) as part of improved network architecture.

The YOLOv4 network was trained on multiple resolutions using a single network rather than training a network for each resolution. Table 6 summarizes the key features of ScaledYOLO (V4).

3.6 PP-YOLO

In August 2020, researchers published "PP-YOLO: An Effective and Efficient Implementation of Object Detector" [27]. The PP-YOLO object detector is built on the YOLO (V3) architecture. Previous YOLO versions were implemented in the Darknet and PyTorch frameworks, whereas PP-YOLO is implemented in PaddlePaddle.

The main objective is a PP-YOLO object detector that can be used immediately in real-world application scenarios with fairly balanced efficacy and efficiency, an objective that aligns with the motivation behind the PaddleDetection development kit. Combining these tips and tactics makes the detector more effective and efficient, and the paper shows how each step improves performance.

Like YOLO (V4), the PP-YOLO model combines various existing tricks to reduce model parameters and FLOPs while improving detector accuracy and keeping the detector speed almost unchanged. Unlike YOLO (V4), PP-YOLO did not examine Darknet-53 or ResNeXt50, nor did it apply neural architecture search to find model hyperparameters. Table 7 summarizes the key features of PP-YOLO.

With all these tricks and techniques combined, PP-YOLO achieved 45.2% mAP at 72.9 FPS when evaluated on a Volta V100 GPU with a batch size of one. This detector surpasses YOLO (V4), EfficientDet, and RetinaNet in efficiency and effectiveness. The PP-YOLO detector consists of three sections:

Detection backbone: The suggested model utilizes ResNet50-vd-dcn as the backbone. Fully convolutional networks serve as the object detector's backbone, helping to extract feature maps from the input image, and the backbone shares many characteristics with a trained image classification model. In the final stage of the proposed backbone, deformable convolutions are substituted for the 3 × 3 convolution layers. ResNet50-vd has significantly fewer parameters and FLOPs than Darknet-53, and with these changes a mAP of 39.1% was achieved, which is better than YOLO (V3).

Detection neck: A Feature Pyramid Network constructs the pyramid of features.

Detection head: The detection head, the last step in the pipeline, predicts the box coordinates and classes of objects. The PP-YOLO head is identical to the YOLO (V3) head: a 3 × 3 convolution layer followed by a 1 × 1 convolution layer produces the output predictions.

3.7 YOLO (V5)

Glenn Jocher, CEO of Ultralytics, posted YOLO (V5) on GitHub in 2020, two months after YOLO (V4). A collection of object detection architectures pre-trained on the MS-COCO dataset is available in YOLO (V5). Its debut came shortly after those of EfficientDet and YOLOv4. The fact that this is the only YOLO object detector without a research paper caused some controversy initially; still, as soon as its capabilities were demonstrated, the controversy subsided. The important features of YOLO (V5) are represented in Fig.  8 .

Figure 8: YOLO (V5) possible attributes

A recent and cutting-edge entry in the YOLO object detection series, YOLO (V5) has raised the bar for object detection models through constant effort and 58 open-source contributors. It is a set of compound-scaled object detection models trained on the COCO dataset. The model has several useful features, such as test-time augmentation, model ensembling, hyperparameter evolution, and export to various formats including ONNX, CoreML, and TFLite.
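As a usage sketch of the points above (loading pretrained weights and running inference), YOLO (V5) can be loaded through PyTorch Hub, which downloads the Ultralytics repository and the pretrained weights on first use; the image path below is a placeholder, and option names may vary between releases:

```python
import torch

# Load the small pretrained YOLOv5 model from PyTorch Hub
# (downloads the ultralytics/yolov5 repo and the yolov5s weights on first call).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Run inference on an image path or URL (placeholder shown here).
results = model("path/to/image.jpg")

results.print()               # human-readable summary of the detections
detections = results.xyxy[0]  # tensor of [x1, y1, x2, y2, confidence, class] rows
```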

Although YOLO (V5) is not a direct successor to YOLO (V4), its structural architecture is essentially the same. The following are its components:

Input: An image, patch, or other piece of data is presented to the system as input.

Backbone: The backbone is the neural network that learns feature representations. YOLOv5's backbone is built from Cross-Stage Partial (CSP) networks.

Neck: Feature pyramids are built by the neck. Before being passed on for prediction, its layers mix and combine visual features. PANet serves as YOLO (V5)'s neck.

Head: The head receives the output from the neck and uses it to generate predictions for both classes and boxes. The head might be either one or two stages for dense or sparse prediction.

YOLO (V5) processes an image with a single neural network and divides it into various regions. Using an automatic anchoring technique, each region receives its own anchor boxes, which increases accuracy; the process is fully automated, and if the default anchor boxes fit poorly, a new anchor-box computation is made. With this, the system analyses the image and forecasts the outcome. Table 8 summarizes the key features of YOLO (V5).

3.8 YOLO-X

"YOLOX: Exceeding YOLO Series in 2021" [28] was published in 2021. Among earlier versions, only YOLO (V1) was anchor-free; YOLOX returns to an anchor-free design. Decoupled heads, strong data-augmentation approaches, and SimOTA label assignment are used to obtain state-of-the-art results. As part of CVPR 2021's Workshop on Autonomous Driving, YOLOX came in first with its YOLOX-L model. On the MS-COCO dataset, YOLOX-Nano achieved 25.3% AP, exceeding NanoDet by 1.8% AP. COCO accuracy increased from 44.3% to 47.3% after adding various modifications to YOLOv3. At 68.9 frames per second on a Tesla V100, the YOLOX-L model achieved 50.0% average precision on COCO, exceeding YOLOv5-L by 1.8% AP.

Developers and researchers can readily use YOLO-X, since it was implemented in the PyTorch framework. ONNX, TensorRT, and OpenVINO deployment versions are also available for YOLOX. Table 9 summarizes the key features of YOLO-X.

3.9 YOLO-R

YOLO-R, unlike YOLO (V1)–YOLO (V5), takes a different approach in terms of authorship, design, and model infrastructure for object detection. While YOLO stands for "You Only Look Once", YOLO-R stands for "You Only Learn One Representation". The YOLO-R network incorporates both implicit information and explicit knowledge, both of which are considered beneficial for learning from data. YOLO-R is based on the co-encoding of implicit and explicit knowledge, similar to mammalian brains, and creates a unified network that can represent multiple tasks simultaneously. This is achieved through a convolutional neural network that performs three notable procedures: kernel space alignment, prediction refinement, and multi-task learning. Figure  9 demonstrates that a neural network already trained with explicit knowledge performs better when implicit knowledge is added.

Figure 9: The YOLO-R architecture

YOLO-R achieved object-detection precision comparable to Scaled YOLO (V4) while increasing inference speed by 88%. According to the study, YOLO-R's mean average precision is 3.8% greater than that of PP-YOLO (V2). Table 10 provides a summary of YOLO-R.

3.10 PP-YOLOV2

"PP-YOLO (V2): A Practical Object Detector" was released by Baidu in 2021 and made a significant impact in the object detection field. The project aimed to create a fast and accurate object detector, building on the success of the original PP-YOLO. The authors used a strategy of assembling different methods and procedures, with an emphasis on ablation studies, to create a well-balanced and effective detector. PP-YOLO (V2) incorporated several enhancements that significantly improved performance, increasing mean average precision from 45.9% to 49.5% on the MS-COCO 2017 test dataset. Moreover, it achieved a high frame rate of 68.9 FPS at 640 × 640 image resolution. Unlike PP-YOLO, which used only the ResNet-50 backbone, PP-YOLO (V2) used two different backbone architectures.

When the detector's ResNet50 backbone was changed to ResNet101, PP-YOLO (V2) reached 50.3% mAP, matching YOLOv5x's performance while outperforming it in speed by approximately 16%. Table 11 summarizes the key features of PP-YOLO (V2).

3.11 YOLO (V6)

YOLO (V6) aims to address practical issues that arise in industrial applications. Meituan's Visual Intelligence Department developed the detection framework, MT-YOLOv6 [29], in 2022.

YOLO (V6) is a single-stage object detection framework with a hardware-friendly architecture. Its detection precision and inference speed are superior to YOLO (V5). The main attributes of YOLO (V6) are shown in Fig.  10 .

Figure 10: Possible attributes of YOLO (V6)

The YOLOv6 architecture focuses on the following primary advancements:

EfficientRep backbone: This backbone architecture is different from YOLO V5 CSP-Backbone and has been designed to have powerful representational abilities and optimize hardware processing resources.

Rep-PAN neck: For the neck design, a more effective feature fusion network was created for YOLO (V6) based on hardware-aware neural network design ideas. It was designed to improve hardware utilization and better balance accuracy and speed.

A decoupled head: In YOLO (V6), the Decoupled Head structure was used, simplifying the decoupling head's design while balancing the pertinent operators’ representative capacity and the computational burden on the hardware.

Effective training strategies: The anchor-free paradigm, the SimOTA label-assignment strategy, and the SIoU bounding-box regression loss are used by YOLOv6 to increase detection accuracy.

Model deployment is made significantly simpler by YOLO (V6)’s support for a variety of deployment techniques. YOLO (V6) has 2 × faster inference time and greater mean Average Precision (mAP) than V5. Table 12 summarizes the key features of YOLO (V6).

3.12 YOLO (V7)

YOLO (V7) [30] is an object detector whose outstanding features transformed the computer vision market in 2022. The official YOLO (V7) offers incredible speed and accuracy compared to its earlier iterations. No pre-trained backbones are employed; instead, YOLO (V7) weights are trained from scratch on Microsoft's COCO dataset. The main attributes of YOLO (V7) are shown in Fig.  11 .

Figure 11: Possible attributes of YOLO (V7)

The YOLO (V7) architecture's primary focal features include the following:

"Extended Efficient Layer Aggregation Network (E-ELAN)": mainly concentrates on computational density and model architectural characteristics. By regulating the gradient path, ELAN's key benefit is improving the learning and convergence capabilities of deeper networks.

“Model Scaling for Concatenation-Based Models”: Concatenation-based model scaling involves scaling calculation block depth and transmission layer width.

"Planned re-parameterized convolution": a RepConvN layer without identity connections can take the place of RepConv.

"Coarse for auxiliary and fine for lead loss": this label assigner uses ground-truth labels together with the lead head's predictions to create coarse labels for the auxiliary head and fine labels for the lead head during training.

Effectiveness of YOLO (V7) Object Detection

YOLO (V7) is the most recent entry in the YOLO series covered here. Building on prior work, the network considerably enhances detection speed and accuracy. The research recommends E-ELAN as the overall design and explains how expanding, shuffling, and merging cardinality can continuously improve the learning capacity of the network. E-ELAN can direct different groups of computational blocks to learn various features.

YOLO (V7) is a young algorithm that is still being developed, and the difficulties its developers are trying to solve leave considerable room for advancement. The algorithm will be very helpful in resolving many computer vision problems once it becomes widely used. Table 13 summarizes the key features of YOLO (V7).

3.13 YOLO (V8)

Ultralytics, the company behind the development of YOLO (V5), released YOLO (V8) in January 2023. While there are no published papers on this version yet, it has been noted that YOLOv8 follows the recent trend of anchor-free models, resulting in fewer box predictions and faster non-maximum suppression (NMS). Additionally, YOLO (V8) uses mosaic augmentation during training. However, it has been observed that using this technique throughout the entire training process can be harmful, so it has been disabled for the last ten epochs. YOLO (V8) is available both as a command line interface (CLI) tool and as a PIP package, and it includes various integrations for labeling, training, and deployment.
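To illustrate the CLI and PIP packaging just described, here is a minimal sketch of the Ultralytics Python API and CLI; the image path and dataset YAML are placeholders, and exact argument names may differ between releases:

```python
# pip install ultralytics
from ultralytics import YOLO

# Load a pretrained YOLOv8 nano model (weights are downloaded on first use).
model = YOLO("yolov8n.pt")

# Run inference on a placeholder image; returns a list of Results objects.
results = model("path/to/image.jpg")

# Fine-tune on a custom dataset described by a YAML file (placeholder path).
model.train(data="my_dataset.yaml", epochs=50, imgsz=640)

# Export the trained model to ONNX for deployment.
model.export(format="onnx")
```

The equivalent command-line form for inference is `yolo predict model=yolov8n.pt source=path/to/image.jpg`.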

YOLOv8x was tested on the MS-COCO dataset using the test-dev 2017 split and achieved an average precision (AP) of 53.9% when evaluated at an image size of 640 pixels; in comparison, YOLOv5 achieved an AP of 50.7% at the same input size. Table 14 summarizes the key features of YOLO (V8).

4 Training Parameters, Datasets, and Evaluation Metrics

4.1 Training Parameters

Here are some common training parameters used in YOLO and its variants (an illustrative configuration sketch follows the list):

Batch size: The batch size determines the number of images that are processed in a single forward/backward pass of the neural network during training. A larger batch size can improve training speed, but it also requires more memory.

Learning rate: The learning rate controls how much the model’s parameters are adjusted with each update during training. A higher learning rate can lead to faster training but may also cause the model to converge on a suboptimal solution. A lower learning rate may result in slower training, but the model is more likely to converge to a better solution.

Number of epochs: The number of epochs is the number of times the entire training dataset is processed during training. A higher number of epochs can lead to overfitting, while too few epochs can result in underfitting.

Augmentation of data: It refers to the process of creating new training data from existing data by applying transformations, such as rotation, scaling, and flipping. Data augmentation can help improve the model's ability to generalize to new data and reduce overfitting.

Objectness threshold: The objectness threshold is the minimum score required for an object to be considered a positive detection. Increasing the objectness threshold can reduce false positives, but it can also increase false negatives.

Intersection over Union (IoU) threshold: The IoU threshold is used to determine whether a predicted bounding box overlaps sufficiently with a ground-truth bounding box. Increasing the IoU threshold can increase the accuracy of the model but can also result in fewer positive detections.
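A compact way to see how these knobs fit together is a configuration dictionary such as the sketch below; the specific values are illustrative defaults, not recommendations drawn from the reviewed papers:

```python
# Illustrative training/inference configuration for a YOLO-style detector.
config = {
    "batch_size": 16,              # images per forward/backward pass
    "learning_rate": 1e-3,         # initial optimizer step size
    "epochs": 100,                 # full passes over the training set
    "augmentation": {              # data-augmentation switches
        "hflip": True,
        "scale_range": (0.5, 1.5),
        "mosaic": True,
    },
    "objectness_threshold": 0.25,  # minimum score for a positive detection
    "iou_threshold": 0.45,         # IoU used for NMS / matching to ground truth
}

def keep_detection(score, cfg=config):
    """Apply the objectness threshold from the configuration."""
    return score >= cfg["objectness_threshold"]

print(keep_detection(0.4))   # True
print(keep_detection(0.1))   # False
```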

4.1.1 Multi-scale Training in YOLO

It is a technique used to enhance the performance of the YOLO model in detecting objects. Unlike traditional training, where the model is trained on a fixed input image size, multi-scale training involves training the model on multiple scales of input images. During training, input images are randomly resized to different scales, and the model is trained on batches of images with different scales. The YOLO model is updated with the gradients computed from the loss function for each image scale, allowing it to effectively detect objects at different scales. This technique is helpful in scenarios where objects of various sizes may be present.
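A minimal sketch of this schedule is shown below, using PyTorch-style tensors and the YOLOv2-like convention of drawing a new input size (a multiple of 32) every 10 batches; since YOLO box labels are normalized to the image size, they need no rescaling when the images are resized:

```python
import random
import torch
import torch.nn.functional as F

SIZES = list(range(320, 609, 32))   # candidate resolutions: 320, 352, ..., 608

def resize_batch(images, size):
    """Bilinearly resize a (N, 3, H, W) batch to (size, size)."""
    return F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)

current_size = 416
for step in range(100):                      # training-loop skeleton
    if step % 10 == 0:                       # pick a new scale every 10 batches
        current_size = random.choice(SIZES)
    images = torch.rand(8, 3, 416, 416)      # placeholder batch of images
    images = resize_batch(images, current_size)
    # ... forward pass, loss computation, backward pass, optimizer step ...
```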

4.1.2 Attention Mechanisms in YOLO

Attention mechanisms are used in many computer vision tasks, including object detection. In YOLO, they can help concentrate the model's attention on the parts of the image that are critical for detection. One technique is spatial attention, which weights the network's feature maps based on their relevance to the detection task; these attention weights are then used to adjust the feature maps before the final object detection step.

Another approach is channel-wise attention, which involves weighing the feature maps based on their relevance to the detection task across channels. This can be achieved by calculating a channel-wise attention vector based on the feature map's global statistics, such as mean and variance. The channel-wise attention vector is then applied to re-weight the feature maps before the final object detection step. YOLO (V4) introduced Spatial Pyramid Pooling (SPP) attention, a new mechanism that uses a spatial pyramid pooling layer to extract multi-scale features from the image. A convolutional block is then utilized to apply various attention mechanisms to the feature maps. Overall, utilizing attention mechanisms in YOLO can enhance the model's accuracy and speed by focusing on the most relevant parts of the image.
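The channel-wise variant described above can be sketched in a few lines of NumPy. The squeeze-and-excite style gating below uses random weights purely for illustration; it is not the exact mechanism of any particular YOLO release:

```python
import numpy as np

def channel_attention(feature_map, reduction=4, rng=np.random.default_rng(0)):
    """Re-weight the channels of a (C, H, W) feature map using a gate computed
    from the per-channel global average (channel-wise attention sketch)."""
    c = feature_map.shape[0]
    squeeze = feature_map.mean(axis=(1, 2))               # (C,) global average pool
    w1 = rng.standard_normal((c // reduction, c)) * 0.1   # illustrative random weights
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    hidden = np.maximum(w1 @ squeeze, 0.0)                # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))           # sigmoid gate in (0, 1), shape (C,)
    return feature_map * gate[:, None, None]              # channel re-weighting

reweighted = channel_attention(np.random.rand(64, 13, 13))
print(reweighted.shape)   # (64, 13, 13)
```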

4.1.3 Non-maximum Suppression

Non-maximum suppression (NMS) is a technique used in object detection models to improve their accuracy by removing redundant bounding boxes. Since object detection models tend to generate multiple bounding boxes with varying confidence scores for the same object, NMS filters out boxes that are redundant or overlap too strongly with a higher-scoring box and retains only the most confident ones. Figure  12 illustrates the effect of NMS on a detector's output by reducing the number of overlapping bounding boxes.
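A greedy, class-agnostic NMS pass can be written as below; this is a minimal sketch (production implementations usually run per class and are further vectorized), with an IoU threshold of 0.45 chosen only for illustration:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.45):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the near-duplicate of box 0 is suppressed
```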

Figure 12: Results before and after processing through NMS

4.1.4 Activation Functions

Activation functions play a crucial role in deep learning models, including YOLO and its variants, by introducing non-linearity to the output of each layer. Here are some commonly used activation functions in YOLO and its variants:

Rectified Linear Unit (ReLU): A simple and widely used activation function that returns the input if it is positive and 0 otherwise. It is defined as \(f\left( x \right) = {\text{max}}\left( {0,x} \right)\) .

Leaky ReLU: A variant of ReLU that adds a small slope to negative values to avoid dying neurons. It is defined as \(f\left( x \right) = {\text{max}}\left( {\alpha x,x} \right)\) , where \(\alpha\) is a small positive constant.

Swish: A relatively new activation function that is a smoothed version of ReLU, defined as \(f\left( x \right) = x \times {\text{sigmoid}}\left( x \right)\) .

Mish: Another novel activation function similar to Swish, but with a more gradual transition from linear to non-linear behavior. It is defined as \(f\left( x \right) = x \times {\text{tanh}}\left( {{\text{softplus}}\left( x \right)} \right)\) .

Hardswish: A faster and more memory-efficient variant of Swish that replaces the sigmoid with a piecewise-linear approximation.

Sigmoid: A commonly used activation function for binary classification tasks, defined as \(f\left( x \right) = 1/\left( {1 + e^{ - x} } \right)\) .

Softmax: A function used to convert a vector of real numbers into a probability distribution over several classes, often used in the final layer of a classification network.

The choice of activation function can significantly impact the performance and convergence speed of a deep learning model. Different activation functions may be more suitable for different tasks and architectures, so their selection should be carefully evaluated.
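The functions listed above can be implemented directly in NumPy, as in the sketch below; the leaky slope of 0.1 is the value commonly used in Darknet-based YOLO models and is an assumption here:

```python
import numpy as np

def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.1):  return np.where(x > 0, x, a * x)   # slope a on negatives
def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def swish(x):              return x * sigmoid(x)
def softplus(x):           return np.log1p(np.exp(x))
def mish(x):               return x * np.tanh(softplus(x))

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), swish(x), mish(x), sep="\n")
print(softmax(x), softmax(x).sum())   # probabilities summing to 1.0
```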

4.2 Datasets

One of the most commonly used and recognized datasets for computer vision applications is the MS-COCO dataset, illustrated in Fig.  13 a and b. The dataset contains fewer categories than some alternatives, but each category has more instances. It has 91 distinct categories of items, such as people, dogs, trains, and other everyday objects. Many instances are observed in each category [31], along with various attributes per image.

Figure 13: (a) Sample images from the MS-COCO dataset. (b) Classes in the MS-COCO dataset

The Pascal Visual Object Classes (VOC) dataset [32] is another benchmark for object categorization, segmentation, and detection. Growing from 4 classes in 2005 to 20 classes in 2007, the dataset community consistently made contributions, keeping it on par with more recent developments. Figure 14 a and b lists the classes of Pascal VOC. The training dataset consists of approximately 11,530 images, 27,450 Regions of Interest, and 6,929 segmentations.

Figure 14: (a) Sample images of Pascal VOC. (b) Classes of Pascal VOC

Several other datasets are commonly used as follows:

ImageNet [ 33 ]: This dataset includes over 1 million labeled images of objects from 1000 different categories. Although ImageNet is commonly used for image classification tasks, it has also been used as a pre-training dataset for object detection models.

KITTI [ 34 ]: This dataset includes images and videos captured from a car driving around urban environments, with annotations for various objects, such as cars, pedestrians, and cyclists. KITTI is often used to test object detection models designed for use in autonomous vehicles.

Open Images [ 25 ]: This dataset includes millions of images with annotations for various objects, including some rare and unusual classes. Open Images is a large and diverse dataset for training and testing object detection models.

Visual Genome [ 35 ]: This dataset includes images with rich annotations describing the objects, attributes, and relationships in the scene. Visual Genome has been used to train object detection models that can reason about the context and relationships between objects in the scene.

4.3 Evaluation Metrics

Several evaluation metrics are used in YOLO and its variants for object detection. Here are some of the most commonly used ones; their standard formulations are collected after the list:

Average precision (AP): Average precision (AP) is a widely used metric in object detection that measures the model's accuracy in detecting objects at different levels of precision. It calculates the area under the precision–recall curve (AUC-PR) for different thresholds. The formula for calculating AP is shown in Eq. ( 5 )

Intersection over Union (IoU): Intersection over Union (IoU) is a metric that measures the overlap between the predicted bounding box and the ground-truth bounding box. It is calculated as the ratio of the intersection area to the union area of the two boxes. The formula for calculating IoU is shown in Eq. ( 6 )

Mean Average Precision (mAP): Mean Average Precision (mAP) is the average of the AP values calculated for each class. It is used to measure the model's overall performance across all classes. The mAP can be calculated as shown in Eq. ( 7 )

False-Positive Rate (FPR): The False-Positive Rate (FPR) is the proportion of negative samples incorrectly classified as positive. It is used to measure the model's performance in terms of false detections. The formula for calculating FPR is shown in Eq. ( 8 )

Recall: The recall is the proportion of positive samples the model correctly identifies. It is used to measure the model's performance in detecting true positives. The formula for calculating recall is shown in Eq. ( 9 )
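Because Eqs. (5)–(9) are only referenced above, their standard formulations are collected here as a sketch (the notation may differ slightly from the original equations):

\[
{\text{AP}} = \int_{0}^{1} p\left( r \right)\, dr, \qquad
{\text{IoU}} = \frac{{\left| {B_{p} \cap B_{gt} } \right|}}{{\left| {B_{p} \cup B_{gt} } \right|}}, \qquad
{\text{mAP}} = \frac{1}{N}\sum\limits_{i = 1}^{N} {\text{AP}}_{i} ,
\]

\[
{\text{FPR}} = \frac{FP}{FP + TN}, \qquad
{\text{Recall}} = \frac{TP}{TP + FN},
\]

where \(p(r)\) is precision as a function of recall, \(B_{p}\) and \(B_{gt}\) are the predicted and ground-truth boxes, \(N\) is the number of classes, and \(TP\), \(FP\), \(TN\), \(FN\) denote true/false positives and negatives.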

5 Comparison Analysis of YOLO in Different Aspects

This section presents a comparison of the different YOLO models across several aspects. The comparison of YOLO (V7) with other models is shown in Table 15 .

Table 16 compares the YOLO versions. Most YOLO versions are implemented in Darknet. As mentioned before, YOLO (V2) adds several optimizations, and multi-scale training improves its performance. YOLO (V3) introduced the FPN architecture to improve performance in detecting objects at different scales. YOLO (V4) and YOLO (V5) improved the architecture using the Cross-Stage Partial (CSP) design. YOLO-X, on the other hand, introduced a decoupled head and backbone architecture to achieve better performance with fewer parameters. YOLO-Tiny uses a lightweight architecture with few layers to reduce the computational cost, making it suitable for deployment on mobile devices with limited computational resources. YOLO (V6) addresses practical issues relating to industrial applications, and YOLO (V7) offers impressive speed and accuracy.

Table 17 and Fig.  15 show YOLO version performance in terms of frames per sec (FPS), mean average precision (mAP), and average precision (AP). A single- or two-stage technique may be utilized depending on the applications and dataset.

Figure 15: Performance results for different versions of YOLO

Table 18 analyzes YOLO's performance with varied input sizes. Performance results of the YOLO versions with respect to parameter counts and FLOPs are analyzed and shown in Table 19 . Note that the AP50 (COCO) metric is the Average Precision (AP) at a 50% IoU threshold on the COCO validation dataset.

Table 20 summarizes the activation function, optimizer, momentum, weight decay, and learning rate used in different versions of the YOLO object detection algorithm.

Table 21 shows that YOLOv5 has the highest mAP@0.5 on the COCO dataset among the listed algorithms, followed by PP-YOLOv2 and YOLOv3. Scaled YOLOv4, PP-YOLO, and YOLOv4 also have high mAP@0.5 scores but lower FPS compared to YOLOv5 and PP-YOLOv2. Table 22 compares YOLO-based algorithms on the Open Images dataset and other popular object detection datasets.

Table 23 compares YOLO-based algorithms on the KITTI dataset and other popular object detection datasets.

Table 24 compares YOLO-based algorithms on the Visual Genome dataset and other popular object detection datasets.

Finally, Table 25 lists some YOLO-based detection and recognition applications.

The tables summarize the work regarding illustrative comparisons, empirical findings, and practical implications.

6 Challenges and Future Directions

Here are some challenges in object detection:

Variability in object appearance: Objects in images can have different shapes, sizes, and colors, which makes it difficult to detect them accurately.

Occlusion: Objects in real-world scenarios can be partially or fully occluded by other objects or environmental factors such as shadows or reflections, making it challenging for the detection model to locate them.

Scale variation: Objects can appear at different scales in an image or video, and detecting them at all scales is computationally expensive.

Illumination changes: Changes in lighting conditions can affect the appearance of objects, making it difficult for the model to recognize them accurately.

Limited training data: Training an accurate object detection model requires a large amount of labeled data, which can be time-consuming and expensive to collect and annotate.

Computational complexity: Object detection models can be computationally expensive, requiring powerful hardware such as GPUs to train and deploy.

Here are some potential future directions for object detection models:

One potential development area for object detection models is improving their speed and efficiency. While many current models are highly accurate, they can be computationally expensive and time-consuming, especially in real-time applications. Future research could focus on developing more lightweight and efficient models without sacrificing too much in terms of accuracy.

Another potential development area is improving the robustness of object detection models. Current models are often highly dependent on their training data and can struggle to generalize to new and different environments. Future research could focus on developing more adaptable and flexible models that can perform well even in highly variable and dynamic environments.

Developing more specialized and task-specific object detection models is another potential direction. While many current models are general purpose and can be used for various applications, there are often specific use cases where a more specialized model would be more effective. For example, a model designed specifically for detecting objects in medical images might be more effective than a general-purpose model.

Finally, another potential development area is integrating object detection models with other models and technologies, such as natural language processing or augmented reality. By combining object detection with other technologies, it may be possible to create more sophisticated and powerful applications to understand and interact with the world in new and exciting ways.

7 Conclusion

This study provides a detailed understanding of the YOLO architecture and its variants, along with their strengths and weaknesses, making it a useful resource for anyone interested in object detection with YOLO. The paper presents a detailed analysis of the latest progress in object detection using YOLO and its variants, discusses the evolution of the YOLO architecture and the improvements made in each version, and reviews the techniques used to improve the performance of YOLO and its variants, including multi-scale training, feature pyramid networks, and attention mechanisms. Additionally, it compares the performance of YOLO and its variants with other state-of-the-art object detection algorithms on various benchmark datasets. Overall, the paper concludes that YOLO and its variants have achieved state-of-the-art performance on various benchmark datasets in terms of accuracy, speed, and memory consumption. The paper also highlights the limitations of YOLO and its variants, such as their difficulty in detecting small objects and their sensitivity to object aspect ratios, and it suggests that future research can address these limitations, explore new architectures, and develop techniques to further improve accuracy and speed. Finally, the paper highlights the potential applications of object detection algorithms in domains such as autonomous driving, robotics, and surveillance systems.

Data Availability

The authors utilized publicly available datasets, i.e., the MS-COCO dataset [ 31 ] and the Pascal Visual Object Classes dataset [ 32 ].

Abbreviations

AI: Artificial intelligence
ANN: Artificial neural network
AP: Average precision
CNN: Convolutional neural network
COCO: Common objects in context
CSP: Cross-stage partial
DL: Deep learning
E-ELAN: Extended efficient layer aggregation network
FPN: Feature pyramid network
FPS: Frames per second
GPU: Graphics processing unit
KNN: K-nearest neighbor
mAP: Mean average precision
Monotonic activation function
ML: Machine learning
MLP: Multilayer perceptron
PAN: Path aggregation network
R-CNN: Region-based convolutional neural network
ReLU: Rectified linear activation unit
SSD: Single-shot detector
SVM: Support vector machine
YOLO: You only look once

Rather, A.M., Agarwal, A., Sastry, V.N.: Recurrent neural network and a hybrid model for prediction of stock returns. Expert Syst. Appl. 42 (6), 3234–3241 (2015)

Sak, H., Senior, A., Rao, K., Beaufays, F.: Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947 (2015)

Liang, M., Hu, X.: Recurrent convolutional neural network for object recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, pp. 3367–3375. (2015) https://doi.org/10.1109/CVPR.2015.7298958

Zhang, X.Y., Yin, F., Zhang, Y.M., Liu, C.L., Bengio, Y.: Drawing and recognizing Chinese characters with recurrent neural network. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), 849–862 (2017)

Kim, J., Kim, J., Thu, H.L.T., Kim, H.: Long short term memory recurrent neural network classifier for intrusion detection. In: 2016 International Conference on Platform Technology and Service (PlatCon), Jeju, Korea, pp. 1–5. (2016). https://doi.org/10.1109/PlatCon.2016.7456805

Mezaal, M.R., Pradhan, B., Sameen, M.I., Shafri, M., Zulhaidi, H., Yusoff, Z.M.: Optimized neural architecture for automatic landslide detection from high resolution airborne laser scanning data. Appl Sci 7 (7), 730 (2017). https://doi.org/10.3390/app7070730

Swamy, S.R., Praveen, S.P., Ahmed, S., Srinivasu, P.N., Alhumam, A.: Multi-features disease analysis based smart diagnosis for COVID-19. Comput. Syst. Sci. Eng. 45 , 869–886 (2023)

Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Twenty-ninth AAAI Conference on Artificial Intelligence (2015), Austin, Texas, USA.

Quang, D., Xie, X.: DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 44 (11), e107–e107 (2016). https://doi.org/10.1093/nar/gkw226

Arava, K., Chaitanya, R.S.K., Sikindar, S., Praveen, S.P., Swapna, D.: Sentiment analysis using deep learning for use in recommendation systems of various public media applications. In: 2022 3rd International Conference on Electronics and Sustainable Communication Systems (ICESC), pp. 739–744. IEEE (2022)

Liao, S., Wang, J., Yu, R., Sato, K., Cheng, Z.: CNN for situations understanding based on sentiment analysis of twitter data. Procedia Comput. Sci. 111 , 376–381 (2017). https://doi.org/10.1016/j.procs.2017.06.037

Wei, D., Wang, B., Lin, G., Liu, D., Dong, Z., Liu, H., Liu, Y.: Research on unstructured text data mining and fault classification based on RNN-LSTM with malfunction inspection report. Energies 10 (3), 406 (2017). https://doi.org/10.3390/en10030406

Sirisha, U., Bolem, S.C.: Semantic interdisciplinary evaluation of image captioning models. Cogent Eng. 9 (1), 2104333 (2022)

Sirisha, U., Bolem, S.C.: GITAAR-GIT based Abnormal Activity Recognition on UCF Crime Dataset. 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT). IEEE (2023)

Sirisha, U., Bolem, S.C.: Aspect based sentiment & emotion analysis with ROBERTa, LSTM. Int. J. Adv. Comput. Sci. Appl. (2022). https://doi.org/10.14569/IJACSA.2022.0131189

Xu, N., Liu, A.A., Wong, Y., Zhang, Y., Nie, W., Su, Y., Kankanhalli, M.: Dual-stream recurrent neural network for video captioning. IEEE Trans. Circuits Syst. Vid Technol. 29 (8), 2482–2493 (2018). https://doi.org/10.1109/TCSVT.2018.2867286

Thai, L.H., Hai, T.S., Thuy, N.T.: Image classification using support vector machine and artificial neural network. Int. J. Inform. Technol. Comput. Sci. 4 (5), 32–38 (2012)

Guleria, P., Naga Srinivasu, P., Ahmed, S., Almusallam, N., Alarfaj, F.K.: XAI framework for cardiovascular disease prediction using classification techniques. Electronics 11 (24), 4086 (2022). https://doi.org/10.3390/electronics11244086

Diwan, T., Anirudh, G., Tembhurne, J.V.: Object detection using YOLO: challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 82 , 9243–9275 (2022)

Jiang, P., Ergu, D., Liu, F., Cai, Y., Ma, B.: A review of Yolo algorithm developments. Procedia Comput. Sci. 199 , 1066–1073 (2022)

Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp 779–788. (2016)

Huang, R., Pedoeem, J., Chen, C.: YOLO-LITE: a real-time object detection algorithm optimized for non-GPU computers. In: 2018 IEEE International Conference on Big Data (big data), Seattle, WA, USA, pp. 2503–2510. (2018). https://doi.org/10.1109/BigData.2018.8621865

Muthumari, M., Akash, V., Charan, K.P., Akhil, P., Deepak, V., Praveen, S.P.: Smart and multi-way attendance tracking system using an image-processing technique. 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), pp. 1805–1812. (2022). https://doi.org/10.1109/ICSSIT53264.2022.9716349

Redmon, J., Farhadi, A.: Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)

Jiang, T., Wang, J., Cheng, Y., Zhou, J., Cai, H., Liu, X., Zhang, X.: Pp-yolov2: an improved faster version of yolov2. In: Proceedings of the 2021 3rd International Conference on Advances in Image Processing (ICAIP 2021), pp. 136–141. Association for Computing Machinery (2021).

Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Scaled-yolov4: Scaling cross stage partial network. In: Proceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, pp 13029–13038. (2021)

Long, X., Deng, K., Wang, G., Zhang, Y., Dang, Q., Gao, Y., Wen, S.: PP-YOLO: An effective and efficient implementation of object detector. arXiv preprint arXiv:2007.12099 (2020)

Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021)

Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Wei, X.: YOLOv6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022)

Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696 (2022)

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014) Microsoft coco: common objects in context. In: European Conference Computer Vision, pp. 740–755. arXiv:1405.0312

Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111 (1), 98–136 (2015)

https://storage.googleapis.com/openimages/web/index.html . Accessed 3 May 2023

Mathurinache. (n.d.). Visual Genome. Retrieved from https://www.kaggle.com/datasets/mathurinache/visual-genome . Accessed 3 May 2023

Wong, A.: Yolo v5: improving real-time object detection with yolo. arXiv preprint arXiv:2011.08036 (2020)

Bochkovskiy, A., Wang, C.Y., Liao, H.Y.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)

Shafiee, M.J., et al.: Fast YOLO: A fast you only look once system for real-time embedded object detection in video. arXiv preprint arXiv:1709.05943 (2017)

Wang, C.Y., Yeh, I.H., Liao, H.Y.M.: You only learn one representation: unified network for multiple tasks. arXiv preprint arXiv:2105.04206 (2021)

Huang, X., Wang, X., Lv, W., Bai, X., Long, X., Deng, K., Yoshie, O. (2021). PP-YOLOv2: a practical object detector. arXiv preprint arXiv:2104.10419 (2021)

Ultralytics LLC. (n.d.). Ultralytics documentation. https://docs.ultralytics.com/ . Accessed 3 May 2023

Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp 248–255. IEEE (2009). https://www.image-net.org/ . Accessed 3 May 2023

Zhang, T., Yang, C., Chen, C.: Yolor: you only look once for real-time embedded object detection. IEEE Trans. Ind. Electron. 68 (4), 3374–3384 (2021)

Ye, A., Pang, B., Jin, Y., Cui, J.: A YOLO-based neural network with VAE for intelligent garbage detection and classification. In: 2020 3rd International Conference on Algorithms Computing and Artificial Intelligence, pp. 1–7. (2020)

Zheng, Y., Ge, J.: Binocular intelligent following robot based on YOLO-LITE. In: MATEC web of conferences, vol. 336, pp. 03002. EDP sciences (2021).

Rastogi, A., Ryuh, B.S.: Teat detection algorithm: YOLO vs Haar-cascade. J. Mech. Sci. Technol. 33 (4), 1869–1874 (2019)

Li, X., Liu, Y., Zhao, Z., Zhang, Y., He, L.: A deep learning approach of vehicle multitarget detection from traffic video. J. Adv. Transport. (2018). https://doi.org/10.1155/2018/7075814

Loey, M., Manogaran, G., Taha, M.H.N., Khalifa, N.E.M.: Fighting against COVID-19: a novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection. Sustain. Cities Soc. 65 , 102600 (2021)

Zhang, X., Qiu, Z., Huang, P., Hu, J., Luo, J.: Application research of YOLO v2 combined with color identification. In: 2018 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), pp. 138–1383. (2018)

Cao, Z., Liao, T., Song, W., Chen, Z., Li, C.: Detecting the shuttlecock for a badminton robot: a YOLO based approach. Expert Syst Appl 164 , 113833 (2021). https://doi.org/10.1016/j.eswa.2020.113833

Chen, B., Miao, X.: Distribution line pole detection and counting based on YOLO using UAV inspection line video. J. Electr. Eng. Technol. 15 (1), 441–448 (2020). https://doi.org/10.1007/s42835-019-00230-w

Mao, Q.C., Sun, H.M., Liu, Y.B., Jia, R.S.: Mini-YOLOv3: real-time object detector for embedded applications. IEEE Access 7 , 133529–133538 (2019)

Li, J., Gu, J., Huang, Z., Wen, J.: Application research of improved YOLO V3 algorithm in PCB electronic component detection. Appl. Sci. 9 (18), 3750 (2019)

Kannadaguli, P.: YOLO v4 based human detection system using aerial thermal imaging for UAV based surveillance applications. In: 2020 International Conference on Decision Aid Sciences and Application (DASA), pp. 1213–1219. (2020)

Jiang, J., Fu, X., Qin, R., Wang, X., Ma, Z.: High-speed lightweight ship detection algorithm based on YOLO-V4 for three-channels RGB SAR image. Remote Sens. 13 (10), 1909 (2021)

Wu, D., Lv, S., Jiang, M., Song, H.: Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 178 , 105742 (2020). https://doi.org/10.1016/j.compag.2020.105742

Kasper-Eulaers, M., Hahn, N., Berger, S., Sebulonsen, T., Myrland, Ø., Kummervold, P.E.: Detecting heavy goods vehicles in rest areas in winter conditions using YOLOv5. Algorithms 14 (4), 114 (2021)

Haque, M.E., Rahman, A., Junaeid, I., Hoque, S.U., Paul, M.: Rice leaf disease classification and detection using YOLOv5. arXiv preprint arXiv:2209.01579 (2022).

Mathew, M.P., Mahesh, T.Y.: Leaf-based disease detection in bell pepper plant using YOLO v5. SIViP 16 (3), 841–847 (2022)

Sirisha, U., Chandana, B.S.: Privacy preserving image encryption with optimal deep transfer learning based accident severity classification model. Sensors 23 (1), 519 (2023)

Patel, D., Patel, S., Patel, M.: Application of image-to-image translation in improving pedestrian detection. arXiv preprint arXiv:2209.03625 (2022)

Liang, Z., Xiao, G., Hu, J. et al. MotionTrack: rethinking the motion cue for multiple object tracking in USV videos. Vis Comput (2023). https://doi.org/10.1007/s00371-023-02983-y

Hussain, M., Al-Aqrabi, H., Munawar, M., Hill, R., Alsboui, T.: Domain feature mapping with YOLOv7 for automated edge-based pallet racking inspections. Sensors 22 (18), 6927 (2022)

Aboah, A., et al.: Real-time multi-class helmet violation detection using few-shot data sampling technique and yolov8. arXiv preprint arXiv:2304.08256 (2023)

Ahmed, D., et al.: Machine vision-based crop-load estimation using YOLOv8. arXiv preprint arXiv:2304.13282 (2023)

Ju, R.-Y., Weiming, C.: Fracture detection in pediatric wrist trauma X-ray images using YOLOv8 algorithm. arXiv preprint arXiv:2304.05071 (2023)

Morris, T.: Computer Vision and Image Processing, 1st edn., pp. 1–320. Palgrave Macmillan Ltd, London (2004)

Zhang, H., Deng, Q.: Deep learning-based fossil-fuel power plant monitoring in high resolution remote sensing images: a comparative study. Remote Sens. 11 (9), 1117 (2019)

Wang, C.Y., Mark Liao, H.Y., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.H.: CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391. (2020)

Changyong, S., Yifan, L., Jianfei, G., Zheng, Y., Chunhua, S.: Channel-wise knowledge distillation for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5311–5320. (2021)

Xiaohan, D., Honghao, C., Xiangyu, Z., Kaiqi, H., Jungong, H., Guiguang, D. Reparameterizing your optimizers rather than architectures. arXiv preprint arXiv:2205.15242 (2022)

Anuradha, C., Swapna, D., Thati, B., Sree, V.N., Praveen, S.P.: Diagnosing for liver disease prediction in patients using combined machine learning models. 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), pp. 889–896. (2022). https://doi.org/10.1109/ICSSIT53264.2022.9716312

Srinivasu, P.N., Shafi, J., Krishna, T.B., Sujatha, C.N., Praveen, S.P., Ijaz, M.F.: Using recurrent neural networks for predicting type-2 diabetes from genomic and tabular data. Diagnostics 12 (12), 3067 (2022). https://doi.org/10.3390/diagnostics12123067

Gao, H., Zhuang, L., Van Der Laurens, M., Kilian, Q.W.: Densely connected convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708. (2017)

Xiaohan, D., Xiangyu, Z., Ningning, M., Jungong, H., Guiguang, D., Jian, S.: RepVGG: making VGG-style convnets great again. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13733–13742. (2021).

Vidushi Meel.: https://viso.ai/deep-learning/yolor/ . Accessed 3 May 2023

Funding

The authors declare that no funds, grants, or other support was received during the preparation of this manuscript.

Author information

Authors and Affiliations

School of Computer Science and Engineering, VIT-AP University, Amaravati, 522237, India

Department of Computer Science and Engineering, Prasad V Potluri Siddhartha Institute of Technology, Vijayawada, 520007, India

S. Phani Praveen & Parvathaneni Naga Srinivasu

Directorate of Research, Sikkim Manipal University, Gangtok, Sikkim, 737102, India

Akash Kumar Bhoi

KIET Group of Institutions, Delhi-NCR, Ghaziabad, 201206, India

Institute of Information Science and Technologies, National Research Council, 56124, Pisa, Italy

Paolo Barsocchi & Akash Kumar Bhoi

Contributions

All the authors have designed the study, developed the methodology, performed the analysis, and written the manuscript. All authors have contributed equally to this work.

Corresponding authors

Correspondence to Paolo Barsocchi or Akash Kumar Bhoi .

Ethics declarations

Conflict of Interest

The authors have no relevant financial or non-financial interests to disclose.

Ethical Approval

Not applicable as the study did not require ethical approval.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Sirisha, U., Praveen, S.P., Srinivasu, P.N. et al. Statistical Analysis of Design Aspects of Various YOLO-Based Deep Learning Models for Object Detection. Int J Comput Intell Syst 16 , 126 (2023). https://doi.org/10.1007/s44196-023-00302-w

Received : 28 February 2023

Accepted : 19 July 2023

Published : 02 August 2023

DOI : https://doi.org/10.1007/s44196-023-00302-w

Keywords

  • Object detection
  • Performance analysis
  • Open access
  • Published: 01 August 2024

MPE-YOLO: enhanced small target detection in aerial imaging

  • Yichang Qin 1 ,
  • Ze Jia 1 &
  • Ben Liang 1  

Scientific Reports, volume 14, Article number: 17799 (2024)

Subjects

  • Aerospace engineering
  • Electrical and electronic engineering

Aerial image target detection is essential for urban planning, traffic monitoring, and disaster assessment. However, existing detection algorithms struggle with small target recognition and accuracy in complex environments. To address this issue, this paper proposes an improved model based on YOLOv8, named MPE-YOLO. Initially, a multilevel feature integrator (MFI) module is employed to enhance the representation of small target features, which meticulously moderates information loss during the feature fusion process. For the backbone network of the model, a perception enhancement convolution (PEC) module is introduced to replace traditional convolutional layers, thereby expanding the network’s fine-grained feature processing capability. Furthermore, an enhanced scope-C2f (ES-C2f) module is designed, utilizing channel expansion and stacking of multiscale convolutional kernels to enhance the network’s ability to capture small target details. After a series of experiments on the VisDrone, RSOD, and AI-TOD datasets, the model has not only demonstrated superior performance in aerial image detection tasks compared to existing advanced algorithms but also achieved a lightweight model structure. The experimental results demonstrate the potential of MPE-YOLO in enhancing the accuracy and operational efficiency of aerial target detection. Code will be available online (https://github.com/zhanderen/MPE-YOLO).


Introduction

Aerial images, acquired through aerial photography technology, feature high-resolution and extensive area coverage, providing critical support to fields such as traffic monitoring 1 and disaster relief 2 through the automated extraction and analysis of geographic information. With continuous advancements in remote sensing technology, aerial image detection offers valuable data support for geographic information systems and related applications, playing a significant role in enhancing the identification and monitoring of surface objects and the development of geographic information technology.

Aerial images are characterized by complex terrain, varying light conditions, and difficulties in data acquisition and storage. However, the high-dimensionality and massive volume of aerial image data pose numerous challenges to image detection, particularly because aerial images often contain small targets, making detection even more challenging 3 . In light of these issues, target detection algorithms are increasingly vital as the core technology for aerial image analysis.

Traditional object detection algorithms often rely on manually designed feature extraction methods such as scale-invariant feature transform (SIFT), and speeded up robust feature (SURF). These methods represent targets by extracting local features from images but might fail to capture higher-level semantic information. Machine learning approaches such as support vector machines (SVMs) 4 , random forests 5 , etc., have effectively improved the accuracy and efficiency of aerial detection, but struggle with the detection of complex backgrounds. With the rapid development of deep learning technology, neural network-based image object detection methods have become mainstream. The end-to-end learning capability of deep learning allows algorithms to automatically learn and extract more abstract and higher-level semantic features, replacing traditionally manually designed features.

Deep learning-based object detection algorithms can be divided into single-stage and two-stage algorithms. The two-stage algorithms are represented by the R-CNN 6 , 7 , 8 series, which adopts a two-stage detection process: first, candidate regions are generated via a region proposal network (RPN), and then location and classification are refined through classifiers and regressors. Such algorithms can precisely locate and identify various complex land objects, especially when dealing with small or densely arranged targets, and have received widespread attention and application. However, two-stage detection algorithms still have room for improvement in terms of speed and efficiency. Single-stage detection algorithms, represented by the SSD 9 and YOLO 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 series, approach object detection as a regression problem and predict the categories and locations of targets directly from the global image, enabling real-time detection. These algorithms offer good real-time performance and accuracy and are particularly suitable for processing large-scale aerial image data. They hold significant application prospects for quickly obtaining geographic information and monitoring urban changes and natural disasters. However, single-stage object detection algorithms still face challenges in the accurate detection and positioning of small targets.

In the context of UAV aerial imagery, object detection encounters several specific challenges:

Dense small objects and occlusion Images captured from low altitudes often contain a large number of dense small objects, particularly in urban or complex terrains. Due to the considerable distance, these objects appear smaller in the images and are prone to occlusion. For instance, buildings might obscure each other, or trees might cover parked vehicles. Such occlusion leads to partial hiding of target object features, thereby affecting the performance of detection algorithms. Even advanced detection algorithms struggle to accurately identify and locate all objects in highly dense and severely occluded environments.

Real-time requirements vs. accuracy trade-off UAV aerial image object detection must meet real-time requirements, particularly in monitoring and emergency response scenarios. Achieving real-time detection necessitates a reduction in algorithmic computational complexity, which frequently conflicts with detection accuracy. High-accuracy detection algorithms typically require substantial computational resources and time, whereas real-time demands necessitate algorithms that can process vast amounts of data swiftly. The challenge lies in maintaining high detection accuracy while ensuring real-time performance. This requires optimization in network architecture to balance the number of parameters and accuracy effectively.

Complex backgrounds Aerial images often include a significant amount of irrelevant background information like buildings, trees, and roads. The complexity and diversity of background information can interfere with the correct detection of small objects. Moreover, the features of small objects are inherently less pronounced. Traditional single-stage and two-stage algorithms primarily focus on global features and may overlook the fine-grained features crucial for detecting small objects. These algorithms often fail to capture the details of small objects, resulting in lower detection accuracy. Therefore, there is a pressing need for more advanced deep learning models and algorithms that can handle these subtle features, thereby enhancing the accuracy of small object detection.

To address the aforementioned issues, this study proposes an algorithm called MPE-YOLO, which is based on the YOLOv8 model, and enhances the detection accuracy of small objects while maintaining a lightweight model. The main contributions of this study are as follows.

We developed a multilevel feature integrator (MFI) module with a hierarchical structure to merge image features at different levels, enhancing scene comprehension and boosting object detection accuracy.

A perception enhancement convolution (PEC) module is proposed, which uses multislice operations and channel dimension concatenation to expand the receptive field, thereby improving the model’s ability to capture detailed target information.

By incorporating the proposed enhanced scope-C2f (ES-C2f) operation and introducing an efficient feature selection and utilization mechanism, the selective use of features is further enhanced, effectively improving the accuracy and robustness of small object detection.

After comprehensive comparative experiments with various other object detection models, MPE-YOLO has demonstrated a significant improvement in performance , proving its effectiveness.

The rest of this paper includes the following content: Section 2 briefly introduces the recent research results on aerial image detection and the main idea of YOLOv8. Section 3 introduces the innovations of this paper. Section 4 describes the experimental setup, including the experimental environment, parameter configuration, datasets used, and performance evaluation metrics, and presents detailed experimental steps and results, verifying the effectiveness of the improvement strategies. Section 5 summarizes the main contributions of this research and discusses future directions of work.

Background and related works

Related works

Deep learning-based object detection algorithms are widely applied in fields such as aerial image detection, medical image processing, precision agriculture, and robotics due to their high detection accuracy and inference speed. The following are some algorithms used in aerial image detection: Cheng et al. 18 proposed a method combining cross-scale feature fusion to enhance the network’s ability to distinguish similar objects in aerial images. Guo et al. 19 presented a novel object detection algorithm that improves the accuracy and efficiency of highway intrusion detection by refining feature extraction, feature fusion, and computational complexity methods. Sahin et al. 20 introduced YOLODrone, an improved version of the YOLOv3 algorithm that increases the number of detection layers to enhance the model’s capability to detect objects of various sizes, although this adds to the model’s complexity. Chen et al. 21 enhanced the feature extraction capability of the model by optimizing residual blocks in the multi-level local structure of DW-YOLO and improved accuracy by increasing the number of convolution kernels. Zhu et al. 22 incorporated the CBAM attention mechanism into the YOLOv5 model to address the issue of blurred objects in aerial images. Additionally, Yang 23 enhanced small object detection capability by adding upsampling in the neck part of the YOLOv5 network. And integrated an image segmentation layer into the detection network. Lin et al. 24 proposed GDRS-YOLO, which first constructs multi-scale features through deformable convolution and gathering-dispersing mechanisms, and then introduces normalized Wasserstein distance for mixed loss training, effectively improving the accuracy of object detection in remote sensing images. Jin et al. 25 improved the robustness and generalization of UAV image detection under different shooting conditions by decomposing domain-invariant features, domain-specific features, and using balanced sampling data augmentation techniques. Bai et al.’s CCNet 26 suppresses interference in deep feature maps using high-level RGB feature maps while achieving cross-modality interaction, enhancing salient object detection.

In the field of medical image processing, typical object detection algorithms include the following: Pacal et al. 27 demonstrated that by improving the YOLO algorithm and using the latest data augmentation and transfer learning techniques, the efficiency and accuracy of polyp detection could be significantly enhanced. Xu et al. 28 showed that the improved Faster R-CNN model exhibited excellent performance in lung nodule detection, particularly in small object detection capability and overall detection accuracy. Xi et al. 29 improved the sensitivity of small object detection by introducing a super-resolution reconstruction branch and an attention fusion module in the MSP-YOLO network. In the agricultural field, Zhu et al. 30 demonstrated how to achieve high-precision drone control systems through a combination of hardware and software; their application to agricultural spraying provides a reference for the performance of automated control systems in practice. In the field of robotics, Wang et al. 31 researched robotic mechanical models and optimized jumping behavior through bionic methods. This combination of biological observation and mechanical modeling can inspire the development of other robots or systems that require motion optimization, using bionic mechanisms to achieve efficient and reliable motion control.

The aforementioned methods face challenges such as the limitations of the receptive field and insufficient feature fusion in highly complex backgrounds or dense small object scenes, resulting in poor performance in low-resolution and densely occluded situations. Driven by these motivations, we propose an algorithm called MPE-YOLO that improves the detection accuracy of small objects while maintaining a lightweight model. Numerous experiments have demonstrated that by integrating multilevel features and strengthening detail information perception modules, we can achieve higher detection accuracy across different datasets.

Figure 1: YOLOv8 network structure.

YOLOv8 is the latest generation of object detection algorithms developed by Ultralytics, officially released on January 10, 2023. YOLOv8 improves upon YOLOv5 by replacing the C3 module with the C2f module. The head utilizes a modern decoupled structure, separating the classification and detection heads, and transitions from an anchor-based to an anchor-free approach, resulting in higher detection accuracy and speed. The YOLOv8 model comprises an input layer, a backbone network, a neck network, and a head network, as shown in Fig. 1. The input image is first resized to 640 \(\times \) 640 to meet the size requirements of the input layer, and the backbone network performs downsampling and feature extraction via multiple convolutional operations, with each convolutional layer equipped with batch normalization and a SiLU 32 activation function. To improve the network's gradient flow and feature extraction capacity, the C2f block was introduced, drawing on the E-ELAN structure from YOLOv7 and employing multilayer branch connections. Furthermore, the SPPF 33 block is positioned at the end of the backbone network and combines multiscale feature processing to enhance the feature abstraction capability. The neck network adopts the FPN 34 and PAN 35 structures for effective fusion of feature maps at different scales, which are then passed on to the head network. The head network is designed in a decoupled manner, including two parallel convolutional branches that handle regression and classification tasks separately to improve focus and performance on each task. The YOLOv8 series offers five differently scaled models for users to choose from: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Compared to the other models, YOLOv8s strikes a balance between accuracy and model complexity. Therefore, this study chooses YOLOv8s as the baseline network.
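As an illustration of the SPPF block mentioned above, the sketch below follows the publicly documented YOLOv5/YOLOv8 design (one 1×1 convolution, three stacked 5×5 max-pooling layers, concatenation, and a final 1×1 convolution); the channel widths used here are illustrative rather than taken from any specific configuration.

    import torch
    from torch import nn

    class SPPF(nn.Module):
        # Spatial pyramid pooling - fast: pool the same feature map three times
        # with a stride-1 5x5 max pool and concatenate all four tensors.
        def __init__(self, c_in, c_out, k=5):
            super().__init__()
            c = c_in // 2
            self.cv1 = nn.Sequential(nn.Conv2d(c_in, c, 1), nn.BatchNorm2d(c), nn.SiLU())
            self.cv2 = nn.Sequential(nn.Conv2d(4 * c, c_out, 1), nn.BatchNorm2d(c_out), nn.SiLU())
            self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

        def forward(self, x):
            x = self.cv1(x)
            y1 = self.pool(x)
            y2 = self.pool(y1)
            y3 = self.pool(y2)
            return self.cv2(torch.cat((x, y1, y2, y3), dim=1))

    # Example: a 512-channel, 20x20 backbone feature map keeps its spatial size.
    print(SPPF(512, 512)(torch.randn(1, 512, 20, 20)).shape)  # torch.Size([1, 512, 20, 20])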

Methodology

Figure 2: MPE-YOLO network structure.

To address the need for detecting small objects in aerial and drone imagery, we propose the MPE-YOLO algorithm, which adjusts the structure of the original YOLOv8 components. As shown in Fig. 2, the multilevel feature integrator (MFI) module optimizes the representation and fusion of small-target features, reducing the information loss incurred during feature fusion. The perception enhancement convolution (PEC) module replaces the traditional convolutional layer, expands the network's fine-grained feature processing capability, and significantly improves the recognition accuracy of small targets in complex backgrounds. We replaced the last two downsampling layers of the backbone network and the detection layer for 20 \(\times \) 20 targets with a detection layer for small 160 \(\times \) 160 targets, enabling the model to focus more on the details of small targets. Finally, the enhanced scope-C2f (ES-C2f) module further improves the feature extraction and computational efficiency of the model through channel expansion and the stacking of multiscale convolution kernels. Combining these improvements, MPE-YOLO performs well in small object detection tasks in complex environments and significantly improves the accuracy and performance of the model. To differentiate it from the baseline model, MPE-YOLO marks the improved modules with darker colors in the figure: the gray area at the bottom represents the removal of the 20 \(\times \) 20 detection head, while the yellow area at the top represents the addition of the 160 \(\times \) 160 detection head.

Multilevel feature integrator

In object detection tasks, the feature representation of small objects is often unclear due to size restrictions, which can lead to them being overlooked or lost in the feature fusion process, resulting in decreased detection performance. To effectively address this issue, we adopted the structure of Res2Net 36 and designed an innovative multilevel feature integrator (MFI). The structure of the MFI module, as shown in Fig.  3 , aims to optimize the feature representation and information fusion of small objects through a series of detailed strategies, reducing the loss of feature information and suppressing redundancy and noise.

Figure 3: Multilevel feature integrator structure.

First, the MFI module uses convolutional operations to reduce the channel dimension of the input feature maps, simplifying the subsequent computation. The reduced feature maps are then uniformly divided into four groups (Group 1 to Group 4), each containing 25% of the channels. This partition is not random but a uniform segmentation along the channel dimension, aimed at optimizing computational efficiency and the subsequent feature fusion. A squeeze convolution layer shapes and compresses the feature maps from all groups, resulting in output Out1, which focuses on key target features, reduces feature redundancy, and preserves details helpful for small object detection. Second, proportional feature fusion of Group 1 and Group 2 constructs richer low-level feature representations, forming output Out2 and enhancing the feature details of small objects. Additionally, the bottleneck module 17 is applied to Group 3 to refine high-level semantic information and produce Out3; this higher-level output helps capture richer contextual information, improving the detection of small objects.

Out4 is obtained by fusing the high-level features from Out3 with the Group 4 features and then processing them again through the bottleneck module. The purpose of this step is to integrate the low-level features with the high-level features, enabling the model to understand the characteristics of small objects more comprehensively. Finally, by concatenating the four outputs (Out1, Out2, Out3, and Out4) along the channel direction, features at all scales are fully utilized, improving the overall performance of the model in small object detection tasks.

Ultimately, the MFI module adopts a channel-wise feature integration approach to aggregate features from various levels, enhancing the ability to recognize different target behaviors and, in particular, improving the accuracy of capturing small-object behaviors and interactions in dynamic scenes.
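The paper describes the MFI module only at this block-diagram level, so the following PyTorch sketch is our schematic reading of it rather than the authors' code: the 1×1 squeeze convolutions, the form of the "proportional fusion" of Groups 1 and 2, and the residual bottleneck are assumptions made purely for illustration.

    import torch
    from torch import nn

    class Bottleneck(nn.Module):
        # Assumed YOLO-style residual bottleneck: 1x1 conv followed by 3x3 conv.
        def __init__(self, c):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(c, c, 1), nn.SiLU(),
                nn.Conv2d(c, c, 3, padding=1), nn.SiLU())
        def forward(self, x):
            return x + self.block(x)

    class MFI(nn.Module):
        # Schematic multilevel feature integrator: reduce channels, split into four
        # equal groups, derive Out1..Out4 at increasing semantic depth, and
        # concatenate them back along the channel dimension.
        def __init__(self, c_in, c_out):
            super().__init__()
            c = c_out // 4                           # channels per group (assumption)
            self.reduce = nn.Conv2d(c_in, 4 * c, 1)
            self.squeeze1 = nn.Conv2d(4 * c, c, 1)   # Out1: compress all groups
            self.fuse12 = nn.Conv2d(2 * c, c, 1)     # Out2: fuse Groups 1 and 2
            self.bottleneck3 = Bottleneck(c)         # Out3: refine Group 3
            self.fuse34 = nn.Conv2d(2 * c, c, 1)
            self.bottleneck4 = Bottleneck(c)         # Out4: fuse Out3 with Group 4
        def forward(self, x):
            x = self.reduce(x)
            g1, g2, g3, g4 = torch.chunk(x, 4, dim=1)
            out1 = self.squeeze1(x)
            out2 = self.fuse12(torch.cat((g1, g2), dim=1))
            out3 = self.bottleneck3(g3)
            out4 = self.bottleneck4(self.fuse34(torch.cat((out3, g4), dim=1)))
            return torch.cat((out1, out2, out3, out4), dim=1)

    # Example: fuse a 256-channel map into a 256-channel multilevel representation.
    print(MFI(256, 256)(torch.randn(1, 256, 40, 40)).shape)  # torch.Size([1, 256, 40, 40])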

Perception enhancement convolution

Figure 4: Perception enhancement convolution structure.

When dealing with multiscale object detection tasks, traditional convolutional neural networks typically face challenges such as fixed receptive fields 37 , insufficient use of context information, and limited environmental perception. In the detection of small objects in particular, these limitations can significantly suppress model performance. To overcome these issues, we introduce the perception enhancement convolution (PEC) module, shown in Fig. 4, which is specifically designed for the backbone network and intended to replace traditional convolutional layers. The main advantage of PEC is that it introduces a new dimension during the extraction of primary features, which significantly expands the receptive field and more effectively integrates context information, thereby deepening the model's understanding of small objects and their environment.

In detail, the PEC module begins by precisely cutting the input feature map into four smaller feature map blocks, each of which is reduced in size by half in the spatial dimension. This cutting process involves the selection of specific pixels, ensuring that representative information from the top-left, top-right, bottom-left, and bottom-right of the original feature map is captured separately in each channel. Through such a meticulous division of the spatial dimension, the resulting small blocks retain important spatial information while ensuring even coverage of information. Subsequently, these small blocks are concatenated in the channel dimension to form a new feature map, with an increased number of channels but reduced spatial resolution, thus significantly reducing the computational burden while maintaining a large receptive field.

To further enhance feature expressiveness and computational efficiency, a squeeze layer is integrated into the PEC, which reduces model parameters by compressing feature dimensions while ensuring that key features are emphasized even as the model is simplified. For deeper feature extraction, we apply the classic bottleneck structure, which not only refines the hierarchical representation of features but also significantly enhances the model’s sensitivity and cognitive ability for small objects, further boosting the computational efficiency of features.

Overall, through the PEC module, the model is endowed with stronger environmental adaptability and a better understanding of object relations. The innovative design of the PEC enables feature maps to capture more comprehensive and detailed information about targets and their environment while expanding the receptive field. This is particularly crucial in areas such as traffic monitoring for object classification and behavior prediction, as these areas depend heavily on accurate interpretation of subtle changes and complex scenes.
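For illustration, the slicing-and-stacking step of PEC can be sketched as below; the pixel-offset slicing is analogous to the Focus operation used in earlier YOLO versions, while the squeeze and bottleneck widths and the residual connection are our assumptions rather than the authors' exact design.

    import torch
    from torch import nn

    class PEC(nn.Module):
        # Schematic perception enhancement convolution: slice the feature map into
        # four pixel-offset sub-maps (halving spatial size), stack them on the
        # channel axis, squeeze channels, then refine with a bottleneck block.
        def __init__(self, c_in, c_out):
            super().__init__()
            self.squeeze = nn.Sequential(nn.Conv2d(4 * c_in, c_out, 1), nn.SiLU())
            self.bottleneck = nn.Sequential(
                nn.Conv2d(c_out, c_out, 1), nn.SiLU(),
                nn.Conv2d(c_out, c_out, 3, padding=1), nn.SiLU())
        def forward(self, x):
            # top-left, bottom-left, top-right, bottom-right pixels of each 2x2 patch
            x = torch.cat((x[..., ::2, ::2], x[..., 1::2, ::2],
                           x[..., ::2, 1::2], x[..., 1::2, 1::2]), dim=1)
            y = self.squeeze(x)
            return y + self.bottleneck(y)   # residual refinement (assumption)

    # Example: spatial size halves while channel capacity is expanded.
    print(PEC(64, 128)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 40, 40])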

Enhanced Scope-C2f

Figure 5: Enhanced Scope-C2f structure.

In the YOLOv8 model, researchers designed the C2f module 17 to maintain a lightweight network while obtaining richer gradient flow information. However, when dealing with small targets or low-contrast targets in aerial images, this module does not sufficiently express fine features, affecting the detection accuracy of targets with complex scales. To address this issue, this study proposes an improved module called Enhanced Scope-C2f (ES-C2f), as shown in Fig.  5 , which focuses on improving the network’s ability to capture details and feature utilization efficiency, especially in expressing small targets and low-contrast targets.

The ES-C2f module enhances the network’s representation capability for targets by expanding the channel capacity of feature maps, enabling the model to capture more subtle feature variations. This strategy is dedicated to enhancing the network’s sensitivity to small target details and improving the adaptability to low-contrast target environments through a wider range of feature representations.

To expand the channel capacity while considering computational efficiency, the ES-C2f module cleverly integrates a series of squeeze layers. These layers perform intelligent selection and compression of feature channels, not only streamlining feature representations but also preserving the capture of key information. The design of this feature operation fully considers the need to enhance identification capabilities while reducing model complexity and computational load. ES-C2f further employs a strategy of stacking multiscale convolutional kernels as well as combining local and global features. This provides an effective means to integrate features at different levels, enabling the model to make decisions on a richer feature dimension. Deep semantic information is cleverly woven with shallow texture details, enhancing the perception of scale diversity.

An optimized squeeze layer is introduced at the end of the module to further refine the essence of the features and adapt to the needs of subsequent processing layers. This engineering not only enhances the feature representation capacity but also improves the information decoding efficiency of subsequent layers, allowing the model to detect and recognize targets with greater precision. With the improvements made to the original C2f module in the YOLOv8 architecture, the proposed ES-C2f module provides a more effective solution for small targets and low-contrast scenes. The ES-C2f module not only maintains the lightweight structure and response speed of the model in extremely challenging scenarios but also significantly improves the overall recognition ability for complex-scale target detection.
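A schematic sketch of the ES-C2f idea, channel expansion followed by stacked multiscale kernels and a final squeeze, is given below; the expansion factor, the use of depthwise 3×3 and 5×5 convolutions, and the concatenation pattern are illustrative assumptions, not the published implementation.

    import torch
    from torch import nn

    class ESC2f(nn.Module):
        # Schematic enhanced scope-C2f: expand channels, run stacked 3x3 and 5x5
        # branches to mix local detail with wider context, then squeeze back down.
        def __init__(self, c_in, c_out, expansion=2):
            super().__init__()
            c = c_in * expansion                   # channel expansion (assumption)
            self.expand = nn.Sequential(nn.Conv2d(c_in, c, 1), nn.SiLU())
            self.branch3 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c), nn.SiLU())
            self.branch5 = nn.Sequential(nn.Conv2d(c, c, 5, padding=2, groups=c), nn.SiLU())
            self.squeeze = nn.Sequential(nn.Conv2d(3 * c, c_out, 1), nn.SiLU())
        def forward(self, x):
            x = self.expand(x)
            local = self.branch3(x)
            wide = self.branch5(local)             # stacked multiscale kernels
            return self.squeeze(torch.cat((x, local, wide), dim=1))

    # Example: the block maps 128 input channels to 128 output channels.
    print(ESC2f(128, 128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])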

Experiments

Experimental setup

The batch size was set to 4 to avoid memory overflow, the learning rate was set to 0.01 and adjusted with a cosine annealing schedule, the momentum of the stochastic gradient descent (SGD) optimizer was set to 0.937, and the mosaic method was used for data augmentation. The resolution of the input images was uniformly set to 640 \(\times \) 640. All models were trained for 200 epochs, and no pretrained models were used during training to ensure the fairness of the experiment. We opted for random weight initialization, ensuring that the initial weights of each model originate from the same distribution. Although the specific initial values differ, this guarantees that all models start from a fair and balanced point, enabling comparison under identical training conditions without the influence of historical biases from pretrained models. Pretrained models are typically trained on large datasets that may not align with our target dataset distribution, potentially introducing unforeseen biases; therefore, we decided against using them. To mitigate the impact of randomness in weight initialization, we conducted multiple independent experiments and averaged the results. Table 1 lists the training environment configurations.
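As a minimal sketch of the optimization settings above (SGD with momentum 0.937, an initial learning rate of 0.01, and cosine annealing over 200 epochs), the skeleton below shows how they map onto standard PyTorch components; the stand-in model, the omitted data loader, and any unlisted hyperparameters are placeholders, not the authors' training script.

    from torch import nn, optim

    model = nn.Conv2d(3, 16, 3, padding=1)   # stand-in for the MPE-YOLO network

    epochs = 200
    optimizer = optim.SGD(model.parameters(),
                          lr=0.01,           # initial learning rate
                          momentum=0.937)    # SGD momentum
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    for epoch in range(epochs):
        # ... iterate over the 640x640, batch-size-4 training loader here ...
        scheduler.step()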

To ensure the rationality of the experimental data, this article selected three representative public datasets for experiments, namely VisDrone2019 38 , RSOD 39 , and AI-TOD 40 . VisDrone2019, as the main dataset of this experiment, was subjected to very detailed comparative and ablation studies. To validate the generalizability and universality of the model, experiments were conducted on the RSOD and AI-TOD datasets.

Considering the consistency of the dataset and the continuity of the study, we selected the VisDrone2019 dataset, collected and released by Tianjin University's Machine Learning and Data Mining Lab, which comprises a total of 8629 images. Among them, 6471 images were used for training, 548 for validation, and 1610 for testing. The dataset encompasses 10 categories from daily scenes: pedestrian, person, bicycle, car, van, truck, tricycle, awning tricycle, bus, and motorcycle. The category proportions are unbalanced, and most images contain small targets, making detection difficult.

The RSOD dataset is a public dataset released by Wuhan University in 2017. It consists of 976 optical remote sensing images taken from Google Earth and Tianditu and comprises four object classes: aircraft, oiltank, overpass, and playground, totaling 6950 targets. To increase the number of samples, the dataset was expanded by rotation, translation, and splicing, increasing the total to 2000 images. To avoid data leakage, augmentation was performed only on the training set, and the validation and test sets remain in their original state. The data were then randomly split into training, validation, and test sets at a ratio of 8:1:1, with the training set comprising 1600 images and the validation and test sets containing 200 images each.
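For clarity, an 8:1:1 split of the kind described above can be produced as in the sketch below; the file names are placeholders, and the point is that the split is fixed before augmentation so that augmented copies never leak into the validation or test sets.

    import random

    def split_dataset(paths, ratios=(0.8, 0.1, 0.1), seed=0):
        # Shuffle once with a fixed seed, then cut into train/val/test;
        # augmentation should be applied to the training portion only, after this split.
        paths = list(paths)
        random.Random(seed).shuffle(paths)
        n = len(paths)
        n_train = int(ratios[0] * n)
        n_val = int(ratios[1] * n)
        return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]

    train, val, test = split_dataset([f"img_{i}.jpg" for i in range(976)])
    print(len(train), len(val), len(test))   # roughly an 8:1:1 split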

The AI-TOD dataset is a specialized remote sensing image dataset focused on tiny objects, consisting of 28,036 images and 700,621 targets. These targets are divided into eight categories: bridge, ship, vehicle, storage-tank, person, swimming-pool, wind-mill, and airplane. Compared to other aerial remote sensing datasets, the average size of targets in AI-TOD is approximately 12.8 pixels, which is significantly smaller than that in other datasets, increasing the difficulty of detection. The dataset is divided into training, validation, and test sets at a ratio of 6:1:3.

Evaluation criteria

We selected mAP0.5, mAP0.5:0.95, and APs as indicators to measure the model’s accuracy in small target detection. To evaluate the model’s efficiency, we used the number of parameters and model size as indicators of its lightweight nature. Additionally, latency was chosen to assess the model’s real-time detection performance.

Precision is the ratio of the number of samples correctly predicted as positive to the number of all samples predicted as positive:

\(\text{Precision} = \frac{TP}{TP + FP}\)

Recall is the ratio of the number of samples correctly predicted as positive to the number of all truly positive samples:

\(\text{Recall} = \frac{TP}{TP + FN}\)

TP (true positives) represents the number of correctly identified positive instances, FP (false positives) represents the number of incorrectly identified negative instances as positive, and FN (false negatives) represents the number of incorrectly identified positive instances as negative.

AP is the area under the precision-recall curve, and mAP is the average AP over all categories; the larger the mAP, the better the model's overall detection performance across categories. The formulas are as follows:

\(AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i\)

where \(P(R)\) is the precision at recall \(R\) and \(N\) is the number of categories.

The APs metric is the average precision computed over small objects only; it indicates how well the model performs when detecting small targets. The number of parameters is the count of parameters used by the model, measured in millions, and provides a direct indicator of model complexity: a larger number of parameters usually means greater representational power but can also lead to longer training times and a risk of overfitting. Model size refers to the size of the model file stored on disk, usually quantified in megabytes (MB), and reflects the amount of storage the model occupies, which is especially important in resource-constrained environments such as mobile devices or embedded deployments. Latency refers to the time required to process one frame during object detection and is one of the metrics used to judge whether a model meets real-time requirements.
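The parameter-count and latency figures reported later can be reproduced in principle with simple helpers like the ones below; this is a generic measurement sketch (CPU timing, batch size 1), not the authors' exact benchmarking protocol.

    import time
    import torch
    from torch import nn

    def count_parameters_m(model: nn.Module) -> float:
        # Trainable parameters in millions.
        return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

    @torch.no_grad()
    def latency_ms(model: nn.Module, img_size: int = 640, runs: int = 100) -> float:
        # Average forward-pass time per single frame, in milliseconds.
        model.eval()
        x = torch.randn(1, 3, img_size, img_size)
        for _ in range(10):                # warm-up iterations
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        return (time.perf_counter() - start) / runs * 1000.0

    toy = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU())
    print(count_parameters_m(toy), latency_ms(toy, img_size=320, runs=10))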

Ablation study

To validate the effectiveness of the proposed modules in aerial image detection, we conducted ablation studies for each module, using the YOLOv8s model as the baseline. The experimental results are shown in Table 2, where ✓ indicates the addition of the module to the model, A represents adding the MFI module, B represents improving the network structure, C represents adding the PEC module, and D represents adding the ES-C2f module.

Incorporating the multilevel feature integrator (MFI) module yields a notable enhancement in small object detection performance, reflected in a 1.6% increase in mAP0.5 and a 0.9% increase in mAP0.5:0.95. Simultaneously, the total number of model parameters is reduced by 0.8 million and the model size decreases by 1.6 MB, while latency drops to 8.5 ms, indicating that the MFI module improves the model's computational efficiency and feature extraction capability, particularly in integrating multilevel semantic information and reducing redundant calculations.

By optimizing the network structure, removing redundant deep feature mappings, and introducing detection heads optimized for small object detection, the precision of the model is significantly enhanced, as is its ability to capture low-frequency detail information. These changes resulted in an improvement of 1.8% in mAP0.5 and 1.3% in mAP0.5:0.95. By compressing the number of channels and reducing the number of network layers, the model can abstractly extract semantic information from deeper feature maps, further enhancing the recognition of small objects. The simplification of the structure not only reduced the parameter count by 7.2 M but also reduced the model size to 6.3 MB. However, latency increased to 12 ms, suggesting that the dedicated small object detection head comes at an additional computational cost.

Subsequently, by introducing the PEC module, the feature maps are finely sliced and fused along the channel dimension, enhancing the spatial integrity and richness of the features. At the same time, with the introduction of squeeze layers, we compress key information while reducing computational complexity, thus improving the efficiency of feature processing. By using the bottleneck structure for deep feature processing, the small object detection and processing capabilities of the module are enhanced, and the complexity of the model increases only slightly compared to that of the baseline model, maintaining the latency at 12.5 ms, resulting in a 1.2% improvement in the mAP0.5 and a 0.7% improvement in the mAP0.5:0.95. This result shows that even with a slight increase in complexity, the PEC module achieves a significant improvement in the accuracy of small object detection, especially in complex scenarios, where the model’s performance has been effectively improved.

Finally, by integrating the ES-C2f module, the model combines the advantages of \(3 \times 3\) and \(5 \times 5\) convolutional kernels to capture local detail features of the target more efficiently than the traditional C2f module while integrating a wider range of contextual information. This module not only improves computational efficiency but also enhances the model's representational capacity through internal feature channel transformation and information compression, allowing the model to analyze image content more comprehensively and accurately capture the details of small objects. As a result, the model's mAP0.5 and mAP0.5:0.95 increased by approximately 1.1% and 0.6%, respectively, while the number of parameters and the model size were reduced by 6.7 M and 12.7 MB compared to the baseline; latency increased to 14 ms, which is still reasonable.

These results validate our improvement strategy, which effectively enhances the accuracy of target detection in aerial images while ensuring that the model is lightweight, demonstrating the profound significance of the research.

Compared with the baseline model, MPE-YOLO shows a significant improvement in the detection accuracy of all categories. As shown in Table  3 , the accuracy of both the pedestrian and people categories is improved by more than 8 points, which indicates that the MPE-YOLO model has a strong detail capture ability for small-scale targets. Overall, the average accuracy of the MPE-YOLO model (mAP0.5) reached 37.0%, which is nearly 6% higher than that of YOLOv8, proving the effectiveness of MPE-YOLO.

Comparative experiments

To validate the effectiveness of the model, we compared MPE-YOLO with the most popular object detection algorithms, including YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOX 41, RT-DETR 42, and two recent research results, Gold-YOLO 43 and ASF-YOLO 44, as shown in Table  4 .

The test results on the VisDrone2019 dataset reveal clear performance differences among the object detection algorithms. The classical YOLOv5s model achieved 26.8% mAP0.5 and 7.0% APs for small target detection, reflecting the difficulty the basic YOLO models have with small targets in aerial image datasets. YOLOv6s performed slightly worse, with 26.6% mAP0.5 and 6.7% APs; although the two methods differ little in accuracy, they differ significantly in model size and parameter count, with YOLOv6s being nearly three times larger than YOLOv5s and having more than twice as many parameters. YOLOX-s increased mAP0.5 to 29.5% and APs to 8.8%, a clear improvement in detection, but this comes at the cost of a larger model size (50.4 MB) and more parameters (8.9 M).

We then analyzed the more advanced YOLOv8s and YOLOv8m models. YOLOv8s achieves 31.3% mAP0.5 and 8.2% APs, indicating that its structural optimizations bring significant improvements. YOLOv8m reaches 35.4% mAP0.5 and 9.8% APs, further confirming that larger models can achieve better accuracy, especially on the more demanding task of small object detection.

The RT-DETR-R18 model, which is built on the DETR architecture rather than the traditional YOLO design, scores highly on both mAP0.5 (35.9%) and APs (10.2%), indicating the potential of attention mechanisms for more accurate object detection; its model size and parameter count are also lower than those of YOLOv8m.

To further validate the superiority of the MPE-YOLO model, we included two advanced models from the recent literature, Gold-YOLO and ASF-YOLO, in the comparison. Gold-YOLO achieved 33.2% mAP0.5 and 9.5% APs with a model size of 26.3 MB and 13.4 million parameters, while ASF-YOLO achieved 34.0% mAP0.5 and 9.6% APs with a model size of 22.8 MB and 11.3 million parameters. Both models show significant improvements in overall performance and small object detection compared with the early YOLO series.

In the end, the MPE-YOLO model achieved the highest mAP0.5 of 37.0% and APs of 10.8%, while maintaining a model size of only 11.5 MB and 4.4 million parameters. This demonstrates that MPE-YOLO not only outperforms other current models in terms of performance but also achieves low resource consumption through its lightweight design, making it highly practical and attractive for real-world applications.

Visual analytics

Figure 6. Comparison of YOLOv8 (middle) and MPE-YOLO (right) on the VisDrone dataset.

By carefully selecting image samples, we applied the baseline model and the MPE-YOLO model to the same scenes and compared their detection results. As shown in Fig.  6 , the detection confidence of MPE-YOLO is clearly better than that of the baseline under multiple scenarios and challenging conditions: the bounding boxes it identifies carry higher confidence scores, and these scores correspond more closely to the actual targets. More importantly, MPE-YOLO also shows significant improvements in reducing false positives and false negatives, correctly identifying most targets while minimizing misidentification of non-target areas. Even under suboptimal shading or lighting conditions, MPE-YOLO achieves a low missed detection rate. These comparisons highlight the effectiveness of the enhanced feature extraction network in MPE-YOLO when dealing with overlapping targets, size changes, and complex backgrounds, indicating more robust feature learning and more accurate target prediction.

Figure 7. Heat map comparison of the baseline model and MPE-YOLO.

In Fig.  7 , the improved MPE-YOLO model demonstrates superior feature extraction and target localization, evident from its more concentrated and reinforced high-response regions. These appear as brighter areas on the heat map that closely follow the actual position and contour of the target, showing that MPE-YOLO focuses effectively on the important signals. In addition, compared with the baseline model, the heat maps generated by the improved model contain fewer scattered hot spots around the target, reducing the likelihood of false detections and false alarms and demonstrating the precision and robustness of MPE-YOLO in small target detection tasks. The night scene in the first row reveals MPE-YOLO’s recognition ability under low-light conditions: the brightest regions map accurately onto the target locations, indicating efficient feature capture even at low illumination. In the second row, faced with a complex background, the heat map generated by MPE-YOLO still identifies the target accurately without being distracted by the cluttered environment, verifying its ability to separate targets from clutter in real scenes. Finally, for the dense small targets in the third row, the MPE-YOLO heat map shows excellent discrimination even when the targets are very close to each other: the highlights correspond densely and distinctly to the contours of each small target, showing the model’s ability to accurately locate multiple targets.

This visual evidence is consistent with the gains in mAP0.5 and mAP0.5:0.95 observed in the experiments, providing intuitive and strong support for our research.

Figure 8. Relationship between mAP0.5:0.95 and model parameter count for different models.

Figure  8 shows the relationship between mAP0.5:0.95 and the parameter count of each model, with the x-axis representing the number of parameters and the y-axis the detection performance. As can be seen from the figure, MPE-YOLO improves detection accuracy while remaining lightweight; compared with all the other models, it is the best suited to drone-based detection tasks.

Generalization study

Comprehensive comparative tests on two remote sensing image datasets, RSOD and AI-TOD (Table  5 ), demonstrate the strong generalizability of the MPE-YOLO model. In these tests, MPE-YOLO achieved high accuracy on the two key performance indicators, mAP0.5 and mAP0.5:0.95, compared with several existing advanced object detection models, especially on the AI-TOD dataset, whose average target size is only 12.8 pixels.

The experimental results reveal the strong detection ability of MPE-YOLO, which maintains high accuracy even in small target detection scenarios, confirming its practicability and effectiveness in the field of remote sensing image analysis. These conclusions support the use of the MPE-YOLO model as a remote sensing target detection algorithm with strong adaptability and generalizability, and indicate its broad potential for future practical applications.

Figure 9. Comparison of YOLOv8 (middle) and MPE-YOLO (right) on the RSOD dataset.

Figure 10. Comparison of YOLOv8 (middle) and MPE-YOLO (right) on the AI-TOD dataset.

To demonstrate more clearly the strength of our algorithm in detecting small targets, we selected several representative photographs from the RSOD and AI-TOD datasets. Figures 9 and 10 show that YOLOv8 misses far more small targets than MPE-YOLO, which has significantly fewer missed cases; MPE-YOLO also shows a general improvement in detection precision. These comparative visuals underscore that MPE-YOLO is the more suitable model for practical detection in aerial imagery applications.

Upon examining these sets of illustrations, it becomes evident that our MPE-YOLO outperforms YOLOv8, especially in scenarios with smaller and easily overlooked targets, reinforcing its efficacy and reliability for deployment in aerial target detection tasks.

Conclusions

In this study, we propose the MPE-YOLO model, which effectively improves the accuracy of small and medium-sized object detection in aerial images and optimizes detection performance in complex environments. The proposed MFI module improves the efficiency of feature fusion, reduces information loss, and qualitatively improves the features available for small target detection. The PEC module enhances the network’s ability to capture detailed target features, which has a significant effect on object detection against complex backgrounds. The ES-C2f module further strengthens the feature representation of small targets by optimizing the receptive range. The model has been tested on multiple aerial image datasets, confirming its excellent performance, especially in terms of real-time processing and detection accuracy. Future work will focus on improving the generalization ability of the model and optimizing its operational efficiency, with a view to deploying it in a wider range of practical applications.

Data availability

All images and experimental test images in this paper are from the open-source VisDrone, RSOD, and AI-TOD datasets. The datasets analyzed during the current study are available at the following websites: VisDrone (https://github.com/VisDrone/VisDrone-Dataset), RSOD (https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset-), and AI-TOD (https://github.com/jwwangchn/AI-TOD).

Liu, H. et al. Improved gbs-yolov5 algorithm based on yolov5 applied to uav intelligent traffic. Sci. Rep. 13 , 9577 (2023).

Bravo, R. Z. B., Leiras, A. & Cyrino Oliveira, F. L. The use of UAVs in humanitarian relief: An application of POMDP-based methodology for finding victims. Prod. Oper. Manag. 28, 421–440 (2019).

Suthaharan, S. & Suthaharan, S. Support vector machine. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning 207–235 (2016).

Biau, G. & Scornet, E. A random forest guided tour. TEST 25 , 197–227 (2016).

Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) Vol. 1, 886–893 (IEEE, 2005).

Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 580–587 (2014).

Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision 1440–1448 (2015).

Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst. 28 (2015).

Liu, W. et al. Ssd: Single shot multibox detector. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14 21–37 (Springer, 2016).

Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 779–788 (2016).

Redmon, J. & Farhadi, A. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6517–6525 (2017).

Redmon, J. & Farhadi, A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).

Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).

Glenn, J. Ultralytics yolov5 (2022).

Li, C. et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).

Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7464–7475 (2023).

Glenn, J. Ultralytics yolov8 (2023).

Cheng, G., Si, Y., Hong, H., Yao, X. & Guo, L. Cross-scale feature fusion for object detection in optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 18 , 431–435 (2020).

Guo, J. et al. A new detection algorithm for alien intrusion on highway. Sci. Rep. 13 , 10667 (2023).

Sahin, O. & Ozer, S. Yolodrone: Improved yolo architecture for object detection in drone images. In 2021 44th International Conference on Telecommunications and Signal Processing (TSP) , 361–365 (IEEE, 2021).

Chen, Y., Zheng, W., Zhao, Y., Song, T. H. & Shin, H. Dw-yolo: An efficient object detector for drones and self-driving vehicles. Arab. J. Sci. Eng. 48 , 1427–1436 (2023).

Zhu, X., Lyu, S., Wang, X. & Zhao, Q. Tph-yolov5: Improved yolov5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2778–2788 (2021).

Yang, Y. Drone-view object detection based on the improved yolov5. In 2022 IEEE International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA) 612–617 (IEEE, 2022).

Lin, Y., Li, J., Shen, S., Wang, H. & Zhou, H. GDRS-YOLO: More efficient multiscale features fusion object detector for remote sensing images. 21, 1–5 (2024).

Jin, R., Jia, Z., Yin, X., Niu, Y. & Qi, Y. Domain feature decomposition for efficient object detection in aerial images. 16, 1626 (2024).

Bai, Z., Liu, Z., Li, G., Ye, L. & Wang, Y. Circular Complement Network for RGB-D Salient Object Detection Vol. 451, 95–106 (Elsevier, 2021).

Pacal, I. et al. An efficient real-time colonic polyp detection with YOLO algorithms trained by using negative samples and large datasets. 141, 105031 (2022).

Xu, J., Ren, H., Cai, S. & Zhang, X. An Improved faster R-CNN Algorithm for Assisted Detection of Lung Nodules Vol. 153, 106470 (Elsevier, 2023).

Chen, X., Zheng, H., Tang, H. & Li, F. Multi-Scale Perceptual YOLO for Automatic Detection of Clue Cells and Trichomonas in Fluorescence Microscopic Images 108500 (Elsevier, 2024).

Zhu, H. et al. Development of a PWM Precision Spraying Controller for Unmanned Aerial Vehicles Vol. 7, 276–283 (Elsevier, 2010).

Wang, M., Zang, X.-Z., Fan, J.-Z. & Zhao, J. Biological Jumping Mechanism Analysis and Modeling for Frog Robot Vol. 5, 181–188 (Elsevier, 2008).

Nishiyama, T., Kumagai, A., Kamiya, K. & Takahashi, K. Silu: Strategy involving large-scale unlabeled logs for improving malware detector. In 2020 IEEE Symposium on Computers and Communications (ISCC) 1–7 (IEEE, 2020).

He, K., Zhang, X., Ren, S. & Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37 , 1904–1916 (2015).

Lin, T.-Y. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2117–2125 (2017).

Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 8759–8768 (2018).

Gao, S.-H. et al. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43 , 652–662 (2019).

Luo, W., Li, Y., Urtasun, R. & Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inform. Process. Syst. 29 (2016).

Du, D. et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF international conference on computer vision workshops (2019).

Long, Y., Gong, Y., Xiao, Z. & Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 55 , 2486–2498 (2017).

Wang, J., Yang, W., Guo, H., Zhang, R. & Xia, G.-S. Tiny object detection in aerial images. In 2020 25th International Conference on Pattern Recognition (ICPR) 3791–3798 (IEEE, 2021).

Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).

Lv, W. et al. Detrs beat yolos on real-time object detection. arXiv preprint arXiv:2304.08069 (2023).

Wang, C. et al. Gold-yolo: Efficient object detector via gather-and-distribute mechanism. Adv. Neural Inform. Process. Syst. 36 (2024).

Kang, M., Ting, C.-M., Ting, F. & Phan, R. Asf-yolo: A novel yolo model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 147 , 105057 (2024).

Acknowledgements

This work was supported by a Grant from the National Natural Science Foundation of China (No. 62105093).

Author information

Authors and Affiliations

College of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, 050018, China

Jia Su, Yichang Qin, Ze Jia & Ben Liang

Contributions

J.S. conceived the experiments, J.S. and Y.Q. conducted the experiments, Z.J. and B.L. analysed the results. Y.Q. wrote the main manuscript text. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yichang Qin.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Su, J., Qin, Y., Jia, Z. et al. MPE-YOLO: enhanced small target detection in aerial imaging. Sci Rep 14 , 17799 (2024). https://doi.org/10.1038/s41598-024-68934-2

Received: 29 February 2024

Accepted: 30 July 2024

Published: 01 August 2024

DOI: https://doi.org/10.1038/s41598-024-68934-2

Keywords:
  • Object detection
  • Aerial image
  • Small target
  • Model lightweight

DFLM-YOLO: A Lightweight YOLO Model with Multiscale Feature Fusion Capabilities for Open Water Aerial Imagery

1. Introduction

  • The paper introduces a new data augmentation algorithm called SOM, which aims to expand the number of objects in specific categories without adding actual objects. This algorithm ensures that the characteristics of the added objects remain consistent with the original ones. The experiments demonstrate that this method enhances dataset balance and improves the model’s accuracy and generalization capabilities.
  • Depthwise separable convolutions were utilized as the feature extraction module in the backbone network, reducing model parameters, the computation required for convolution operations, and network inference latency (a minimal sketch of this operation is given after this list).
  • A new plug-and-play module, FC-C2f, was designed to optimize the backbone network structure, reduce computational redundancy, and lower the model’s parameters and FLOPs.
  • By gradually integrating features from different levels, the connections between layers are effectively increased, and the model’s feature fusion process is optimized. This enhances the model’s capability to fuse multiscale features, improving detection accuracy for objects of various scales.
  • Dilated convolutions with small kernels and cascaded convolutions were combined into a re-parameterized large-kernel convolution. This approach retains the benefits of small kernels, such as reduced computational load and fewer parameters, while achieving the large effective receptive field of large kernels. The experimental results demonstrate that this structure reduces model parameters and increases the receptive field.
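As noted in the second item above, depthwise separable convolution factorizes a standard convolution into a per-channel (depthwise) spatial convolution followed by a 1 × 1 pointwise convolution. The following is a minimal, generic PyTorch sketch of that operation; the normalization, activation, and exact placement used in DFLM-YOLO’s backbone are not specified here and are assumptions.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Generic depthwise separable convolution: depthwise conv (groups equal
    to the channel count) followed by a 1x1 pointwise conv.  BatchNorm and
    SiLU are common in YOLO-style backbones but are assumptions here."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```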

2. Related Work

2.1. Data Augmentation

2.2. Lightweight Methods for Object Detection Networks Based on Deep Learning

2.3. Multiscale Feature Fusion

3. Materials and Methods

3.1. Small Object Multiplication Data Augmentation Algorithm

Algorithm: The working steps of the SOM data augmentation algorithm

Input: Image (I), Annotation (A), Object category index (L), Copy-Paste times (n).

Step 1. Iterate through the annotation information (A) and check whether any objects have the category index L and a pixel size smaller than 32 × 32. If these conditions are met, save the size and pixel information of these objects in List1.

  for class_number in A:
    if (class_number == L) and (object size < 32 × 32):
      List1.append(A[class_number])
    else:
      continue

Step 2. Paste regions are randomly generated in the original image according to the number of duplications, and their suitability is assessed; if a region is unsuitable, a new one is generated. The qualifying object areas are pasted into these regions, and the annotation information for the newly generated objects is added to the annotation file.

  for object_information in List1:
    for i in range(0, n):
      top-left coordinates = random(0, X), random(0, Y)
      bottom-right coordinates = top-left coordinates (x, y) + object size
      if bottom-right coordinates(x) > X or bottom-right coordinates(y) > Y:
        n = n - 1
        continue
      SOM_Image = Paste(top-left coordinates, bottom-right coordinates, object information)
      SOM_Annotation = Add_New_Annotation(class number, object coordinates)

Step 3. Use a Gaussian filter to smooth the edges of the pasted objects. Blurred_image is the image after Gaussian smoothing, and GaussianBlur is the Gaussian filter applied to the entire image, particularly to the edges of the pasted regions.

  Blurred_image = GaussianBlur(SOM_Image, Gaussian kernel size, Gaussian kernel standard deviation)

Output: The image after SOM data augmentation (Blurred_image) and the corresponding annotation file (SOM_Annotation).
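A runnable interpretation of the pseudocode above is sketched below. It assumes YOLO-format labels (class, x-centre, y-centre, width, height, all normalised), uses OpenCV and NumPy, and simplifies the suitability check by only sampling paste positions that are guaranteed to lie inside the image. All function and variable names are ours, not the authors’.

```python
import random
import cv2
import numpy as np

def som_augment(image, labels, target_class, n_copies, small_px=32, blur_ksize=5):
    """Small Object Multiplication (SOM) sketch following the steps above.
    image  : HxWx3 BGR array.
    labels : list of (cls, xc, yc, w, h) in normalised YOLO format.
    Objects of `target_class` smaller than `small_px` pixels are copy-pasted
    `n_copies` times at random positions; the result is lightly blurred."""
    h_img, w_img = image.shape[:2]
    out_img, out_labels = image.copy(), list(labels)

    # Step 1: collect qualifying small-object patches.
    patches = []
    for cls, xc, yc, w, h in labels:
        pw, ph = int(w * w_img), int(h * h_img)
        if cls == target_class and 0 < pw < small_px and 0 < ph < small_px:
            x1 = min(max(int(xc * w_img - pw / 2), 0), w_img - pw)
            y1 = min(max(int(yc * h_img - ph / 2), 0), h_img - ph)
            patches.append(out_img[y1:y1 + ph, x1:x1 + pw].copy())

    # Step 2: paste each patch at random in-bounds locations and add labels.
    for patch in patches:
        ph, pw = patch.shape[:2]
        for _ in range(n_copies):
            x1 = random.randint(0, w_img - pw)
            y1 = random.randint(0, h_img - ph)
            out_img[y1:y1 + ph, x1:x1 + pw] = patch
            out_labels.append((target_class,
                               (x1 + pw / 2) / w_img, (y1 + ph / 2) / h_img,
                               pw / w_img, ph / h_img))

    # Step 3: Gaussian smoothing, applied to the whole image as in the text.
    out_img = cv2.GaussianBlur(out_img, (blur_ksize, blur_ksize), 0)
    return out_img, out_labels
```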

3.2. Depthwise Separable Convolution

3.3. Improved C2f Module Based on Convolutional Gated Linear Unit and Faster Block

3.4. Lightweight Multiscale Feature Fusion Network

4. Experiments and Analysis

4.1. Experimental Environment and Parameter Settings

4.2. Experimental Metrics

4.3. Ablation Experiments

4.3.1. The Effect of SOM Data Augmentation on the Original Model

4.3.2. The Effect of DSConv on the Original Model

4.3.3. The Effect of FC-C2f on the Original Model

4.3.4. The Effect of LMFN on the Original Model

4.3.5. The Effect of Combining Multiple Improvement Modules on the Original Model

4.4. Comparative Experiment

5. Conclusions

Author Contributions

Data Availability Statement

Conflicts of Interest

Algorithms | FLOPs (G) | Pre-Process (ms) | Inference (ms) | NMS (ms)
YOLOv8s | 28.8 | 0.2 | 1.6 | 0.8
YOLOv8s + HGNetV2 | 23.3 | 0.2 | 1.6 | 0.7

Experimental Environment | Parameter/Version
Operating System | Ubuntu 20.04
GPU | NVIDIA GeForce RTX 4090
CPU | Intel(R) Xeon(R) Gold 6430
CUDA | 11.3
PyTorch | 1.10.0
Python | 3.8

Parameter | Setup
Image size | 640 × 640
Momentum | 0.937
Batch size | 16
Epochs | 200
Initial learning rate | 0.01
Final learning rate | 0.0001
Weight decay | 0.0005
Warmup epochs | 3
IoU | 0.7
Close Mosaic | 10
Optimizer | SGD

Classes | YOLOv8s P (%) | YOLOv8s R (%) | YOLOv8s mAP (%) | YOLOv8s + SOM P (%) | YOLOv8s + SOM R (%) | YOLOv8s + SOM mAP (%)
swimmer | 78.7 | 66.5 | 69.6 | 80.1 (+1.4) | 64.8 (−1.7) | 70.3 (+0.7)
boat | 89.8 | 86 | 91.6 | 89.9 (+0.1) | 87.4 (+1.4) | 91.2 (+0.4)
jetski | 76.8 | 82.2 | 83.7 | 86.3 (+9.5) | 82.5 (+0.3) | 84.6 (+0.9)
life_saving_appliances | 78.2 | 14.5 | 28.2 | 81 (+2.8) | 25.5 (+11) | 35.4 (+6.9)
buoy | 77.7 | 50.5 | 57.2 | 88.5 (+10.8) | 51.6 (+1.1) | 61.4 (+4.2)
All | 80.2 | 59.9 | 66.1 | 85.2 (+5) | 62.4 (+2.5) | 68.6 (+2.5)

Algorithms | P (%) | R (%) | mAP50 (%) | Params (M) | FLOPs (G) | Speed RTX4090 b16 (ms)
YOLOv8s | 80.2 | 59.9 | 66.1 | 11.14 | 28.7 | 2.6
YOLOv8s + DSConv | 79.8 (−0.4) | 60.5 (+0.6) | 66.6 (+0.5) | 9.59 (−1.55) | 25.7 (−3) | 1.2 (−1.4)

Algorithms | P (%) | R (%) | mAP50 (%) | Params (M) | FLOPs (G) | Speed RTX4090 b16 (ms)
YOLOv8s | 80.2 | 59.9 | 66.1 | 11.14 | 28.7 | 2.6
YOLOv8s + FB-C2f | 80.4 (+0.2) | 61.2 (+1.3) | 66.1 (+0) | 9.69 (−1.45) | 24.4 (−4.3) | 2.2 (−0.4)
YOLOv8s + FC-C2f | 82.8 (+2.6) | 59.5 (−0.4) | 66.5 (+0.4) | 9.48 (−1.66) | 23.8 (−4.9) | 1.4 (−1.2)

Algorithms | P (%) | R (%) | mAP50 (%) | Params (M) | FLOPs (G) | Speed RTX4090 b16 (ms)
YOLOv8s | 80.2 | 59.9 | 66.1 | 11.14 | 28.7 | 2.6
AFPN | 83.2 (+3.0) | 71.4 (+11.5) | 76.3 (+10.2) | 8.76 (−2.38) | 38.9 (+10.2) | 3.1 (+0.5)
AFPN_C2f | 84.5 (+4.3) | 69.6 (+9.7) | 76.0 (+9.9) | 7.09 (−4.05) | 34.2 (+5.5) | 2.8 (+0.2)
LMFN | 86.8 (+6.6) | 70.5 (+10.6) | 76.7 (+10.6) | 6.86 (−4.28) | 31.1 (+2.4) | 2.3 (−0.3)

Class | Algorithms | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | FLOPs (G) | Speed RTX4090 b16 (ms)
1 | YOLOv8s | 80.2 | 59.9 | 66.1 | 39.9 | 11.14 | 28.7 | 2.6
2 | a | 84.1 | 62.2 | 67.7 | 40.2 | 11.14 | 28.7 | 2.6
3 | b | 79.8 | 60.5 | 66.6 | 39.9 | 9.59 | 25.7 | 1.2
4 | c | 82.8 | 59.5 | 66.5 | 39.3 | 9.48 | 23.8 | 1.4
5 | d | 86.8 | 70.5 | 76.7 | 42.7 | 6.86 | 31.1 | 2.4
6 | b + c | 81 | 59.4 | 66.1 | 39.0 | 7.93 | 20.9 | 1.1
7 | b + c + d | 82.6 | 71.3 | 76.6 | 43.5 | 3.65 | 23.3 | 2.1
8 | a + b + c + d (our) | 85.5 | 71.6 | 78.3 | 43.7 | 3.64 | 22.9 | 2.1

Class | Algorithms | Swimmer | Boat | Jetski | Life_Saving_Appliances | Buoy | mAP50 (%)
1 | YOLOv8s | 69.6 | 91.6 | 83.7 | 28.2 | 57.2 | 66.1
2 | a | 66.0 | 91.2 | 84.6 | 35.4 | 61.4 | 67.7
3 | b | 69.4 | 91.1 | 85.6 | 30.7 | 56.1 | 66.6
4 | c | 70.4 | 91.7 | 81.4 | 28.7 | 60.2 | 66.5
5 | d | 77.6 | 95.6 | 87.4 | 45.5 | 77.2 | 76.7
6 | b + c | 69.8 | 90.5 | 82.4 | 27.6 | 60.1 | 66.1
7 | b + c + d | 78.3 | 95.5 | 86.3 | 45.8 | 77.3 | 76.6
8 | a + b + c + d (our) | 79.6 | 95.5 | 85.9 | 54.8 | 75.7 | 78.3

Class | Algorithms | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | FLOPs (G) | Speed RTX4090 b16 (ms)
1 | YOLOv5s | 82.7 | 57.9 | 65.4 | 38.6 | 9.11 | 23.8 | 2.1
2 | YOLOv6n | 79.5 | 57.7 | 60.6 | 35.9 | 4.23 | 11.8 | 1.7
3 | YOLOv8n | 79.0 | 58.8 | 63.6 | 37.1 | 3.0 | 8.1 | 1.6
4 | YOLOv9t | 74.1 | 58.5 | 62.3 | 37.8 | 2.62 | 10.7 | 4.5
5 | YOLOv10s | 82.3 | 59.3 | 63.8 | 37.7 | 8.04 | 24.5 | 1.0
6 | YOLO-OW | 83.1 | 75.5 | 75.5 | 39.9 | 42.1 | 94.8 | 4.6
7 | RT-DETR-R18 | 88.4 | 82.6 | 83.6 | 49.6 | 20.0 | 57.0 | 3.8
8 | DFLM-YOLO (our) | 85.5 | 71.6 | 78.3 | 43.7 | 3.64 | 22.9 | 2.1

Class | Algorithms | Swimmer | Boat | Jetski | Life_Saving_Appliances | Buoy | mAP50 (%)
1 | YOLOv5s | 69.9 | 91.6 | 81.1 | 26.0 | 58.5 | 65.4
2 | YOLOv6n | 65.4 | 90.7 | 78.6 | 14.2 | 54.1 | 60.6
3 | YOLOv8n | 66.7 | 92.0 | 83.9 | 19.8 | 55.6 | 63.6
4 | YOLOv9t | 68.5 | 91.5 | 84.5 | 14.3 | 52.7 | 62.3
5 | YOLOv10s | 67.5 | 90.1 | 86.0 | 23.6 | 51.9 | 63.8
6 | YOLO-OW | 67.9 | 92.0 | 93.0 | 49.8 | 74.8 | 75.5
7 | RT-DETR-R18 | 82.2 | 97.8 | 92.3 | 54.6 | 90.9 | 83.6
8 | DFLM-YOLO (our) | 79.6 | 95.5 | 85.9 | 54.8 | 75.7 | 78.3
Share and Cite

Sun, C.; Zhang, Y.; Ma, S. DFLM-YOLO: A Lightweight YOLO Model with Multiscale Feature Fusion Capabilities for Open Water Aerial Imagery. Drones 2024 , 8 , 400. https://doi.org/10.3390/drones8080400

ORIGINAL RESEARCH article

YOLO-CFruit: A Robust Object Detection Method for Camellia oleifera Fruit in Complex Environments

Yuanyin Luo

  • 1 Engineering Research Center for Forestry Equipment of Hunan Province, Central South University of Forestry and Technology, Changsha, China
  • 2 Engineering Research Center for Smart Agricultural Machinery Beidou Navigation Adaptation Technology and Equipment of Hunan Province, Hunan Automotive Engineering Vocational University, Zhuzhou, China

Introduction: In the field of agriculture, automated harvesting of Camellia oleifera fruit has become an important research area. However, accurately detecting Camellia oleifera fruit in natural environments is complex due to factors such as shadows, which can impede the performance of traditional detection techniques, highlighting the need for more robust methods.

Methods: To overcome these challenges, we propose an efficient deep learning method called YOLO-CFruit, which is specifically designed to accurately detect Camellia oleifera fruits in challenging natural environments. First, we collected images of Camellia oleifera fruits, created a dataset, and then applied data augmentation to further increase its diversity. Our YOLO-CFruit model combines a CBAM module, which identifies regions of interest in scenes containing Camellia oleifera fruit, with a CSP module incorporating a Transformer to capture global information. In addition, we improve YOLO-CFruit by replacing the CIoU loss with the EIoU loss in the original YOLOv5.

Results: Testing the trained network shows that the method performs well, achieving an average precision of 98.2%, a recall of 94.5%, an accuracy of 98%, an F1 score of 96.2, and an average detection time of 19.02 ms per image. Compared with the conventional YOLOv5s network, our method improves the average precision by 1.2% and achieves the highest accuracy and a higher F1 score among all the state-of-the-art networks evaluated.

Discussion: The robust performance of YOLO-CFruit under different real-world conditions, including different light and shading scenarios, signifies its high reliability and lays a solid foundation for the development of automated picking devices.

1 Introduction

Camellia oleifera is an oil tree species unique to China that produces an edible oil recognized as healthy by the Food and Agriculture Organization (FAO) ( Zhang and Wang, 2022 ). The picking season for Camellia oleifera fruit occurs in October each year, with a short harvesting period ( Yan et al., 2020 ). Consequently, timely picking is crucial to ensure optimal fruit quality and quantity for maximum profitability. However, the complex and labor-intensive growth environment necessitates the use of localized picking robots to achieve high efficiency ( Wang et al., 2019 ). These robots need to accurately identify the target crops in their natural environment to optimize the harvesting process. Therefore, timely and accurate identification of ripe Camellia oleifera fruit is critical for improving overall picking efficiency.

The growth environment of Camellia oleifera presents several technical challenges that hinder the efficiency and accuracy of machine vision systems. The primary issues include uneven lighting conditions, which can distort the color and texture features of the fruit, making it difficult for detection algorithms to differentiate between the fruit and the background. Additionally, occlusions caused by dense foliage and overlapping branches obscure the fruit from the view of picking robots, leading to a significant reduction in detection rates. These occlusions not only impede the visual access to the fruit but also create a dynamic and unpredictable environment that current machine vision systems struggle to adapt to in real-time ( Zhou et al., 2022 ). To address these challenges, traditional image processing methods leveraging fruit color, contour, and texture features have been widely employed for detection ( Chen and Wang, 2020 ). In scenarios where there are variations in fruit colors and backgrounds, extraction algorithms based on color and shape features have commonly been used, resulting in successful fruit segmentation ( Nguyen et al., 2016 ; Yu et al., 2021 ). Despite the prevalence of traditional image processing methods that leverage fruit color, contour, and texture features, they exhibit notable limitations when the coloration of the fruit and the surrounding leaves are similar. This similarity in coloration hampers the system’s ability to accurately segment the fruit from the background, thereby limiting the overall recognition capability and leading to increased false negatives in detection ( Wang et al., 2016 ). Consequently, some researchers have proposed a more effective detection method that combines fruit texture with color features to enhance target identification ( Kurtulmus et al., 2011 ; Rakun et al., 2011 ). These studies have demonstrated that incorporating both color and texture/shape features can significantly improve fruit recognition accuracy. While traditional image processing methods offer certain benefits, they are not without their drawbacks, particularly in complex and variable scenes typical of Camellia oleifera cultivation. These methods often exhibit reduced robustness due to their sensitivity to environmental changes and the need for frequent recalibration to maintain optimal performance. The requirement for specialized calibration conditions further limits their practicality in real-world scenarios, where conditions are rarely controlled and can fluctuate widely ( Gongal et al., 2015 ; Koirala et al., 2019 ).

With the rapid development of deep learning, it has emerged as a widely used tool in image processing tasks. Among the various types of deep neural networks used for visual recognition, convolutional neural networks (CNNs) have shown promising outcomes ( Gu et al., 2018 ). Currently, CNN-based object detectors can be categorized into two types: one-stage detectors and two-stage detectors. Two-stage detectors have garnered preference among researchers due to their higher accuracy and robustness. For instance, Yu et al. ( Yu et al., 2019 ) proposed a Mask-RCNN-based model capable of detecting ripe fruit in non-structured environments, achieving an average detection precision rate of 0.957 and a recall rate of 0.954. In another study, Sa et al. ( Sa et al., 2016 ) employed the Faster-RCNN model, which used both RGB (red, green, and blue) and near-infrared images to detect sweet pepper; this model also demonstrated the ability to identify several other fruits, such as oranges and melons. Despite the higher accuracy and robustness of two-stage detectors, their application in the development of picking robots is significantly hindered by the substantial computational resources they require for region selection. The relatively long inference time of these detectors is a critical limitation, as it impedes real-time performance, a crucial requirement for robotic picking systems operating in dynamic and time-sensitive agricultural environments ( Fu et al., 2020 ). Consequently, one-stage detectors, especially the YOLO series ( Redmon et al., 2016 ; Redmon and Farhadi, 2017 , 2018 ; Bochkovskiy et al., 2020 ; Jocher et al., 2022 ; Li et al., 2022 ; Wang et al., 2022 ), are becoming increasingly popular for object recognition in orchards due to their real-time detection capability and strong robustness under complex field conditions.

Tang et al. ( Tang et al., 2023 ) proposed an improved version of the YOLOv4-tiny model for detecting Camellia oleifera . They utilized the k-means++ clustering algorithm to determine the bounding box priors and optimized the network structure to reduce computational complexity. The performance of this model surpassed that of both the YOLOv3-tiny and YOLOv4-tiny models, achieving faster processing speed and a higher average precision (AP) value. Similarly, Lu et al. ( Lu et al., 2022 ) developed the Swin-transformer-YOLOv5 model for detecting premium grape bunches, combining the Swin transformer and YOLOv5 to enhance performance. The results demonstrated that Swin-transformer-YOLOv5 outperformed the Faster R-CNN, YOLOv3, YOLOv4, and YOLOv5 models, achieving higher average precision (AP). Wang et al. ( Wang et al., 2023 ) used an improved YOLOv5s model to recognize and localize apples, which improved apple detection accuracy.

Vision transformers ( Dosovitskiy et al., 2020 ), a relatively new approach in image processing, have shown promise in addressing some of the limitations of CNNs. Unlike CNNs, transformers excel at capturing global contextual information and establishing dependencies among image feature blocks using multi-head self-attention while preserving spatial information, which can be advantageous in scenarios with occlusions and perturbations. However, the integration of transformers into practical picking robots is still in its infancy, and there are technical gaps to be bridged, such as how to effectively combine the strengths of transformers with the real-time requirements of robotic systems. Several studies ( Rosenfeld and Tsotsos, 2019 ; Huang et al., 2023 ) have shown that visual transformers exhibit enhanced robustness to challenges such as occlusions and perturbations compared with CNNs. Sun et al. ( Sun et al., 2023 ) proposed the FBoT-Net model specifically for detecting small green apples. They modified the transformer layer by replacing it with a 3 × 3 convolutional layer in the last three bottleneck structures of the ResNet-50 architecture. The experimental results demonstrated impressive performance, with high average precision scores for small and large-scale apple detection on the Small Apple and Pascal VOC datasets.

To address the challenges of detecting Camellia oleifera fruits in natural environments, we propose an approach called YOLO-CFruit. Our approach incorporates the following strategies:

(1) Data augmentation and extension: We apply data augmentation techniques to enhance the robustness of the target detection model by augmenting the acquired image data of Camellia oleifera fruits.

(2) CSP bottleneck transformer (CBT) module: To enable the interaction of local and global information, we integrate the CSP structure with a transformer. This CBT module is introduced into the network backbone.

(3) CBAM integration: We incorporate the CBAM module into YOLOv5, which aids the network in recognizing regions of interest in images with large spatial coverage.

(4) EIoU loss replacement: To improve the accuracy of bounding box detection, we replace the original CIoU loss in YOLOv5 with an EIoU loss, allowing for more accurate measurement of similarity between detected bounding boxes.
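For reference, the EIoU loss extends the IoU term with penalties on the centre distance and on the width and height differences, each normalised by the corresponding dimension of the smallest enclosing box. The PyTorch sketch below follows that published formulation; how YOLO-CFruit weights it and wires it into the YOLOv5 training loop is not shown here and would follow the original implementation.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format:
    1 - IoU + centre-distance^2 / enclosing-diagonal^2
            + dw^2 / enclosing-width^2 + dh^2 / enclosing-height^2."""
    # intersection and union
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # smallest enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # centre distance and width/height differences
    pcx = (pred[..., 0] + pred[..., 2]) / 2
    pcy = (pred[..., 1] + pred[..., 3]) / 2
    tcx = (target[..., 0] + target[..., 2]) / 2
    tcy = (target[..., 1] + target[..., 3]) / 2
    rho2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    dw2 = ((pred[..., 2] - pred[..., 0]) - (target[..., 2] - target[..., 0])) ** 2
    dh2 = ((pred[..., 3] - pred[..., 1]) - (target[..., 3] - target[..., 1])) ** 2

    return 1 - iou + rho2 / c2 + dw2 / (cw ** 2 + eps) + dh2 / (ch ** 2 + eps)
```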

In “Section 2 Materials and Methods”, we will focus on the construction of the dataset and the structural principles of the algorithm. In “Section 3 Results and Discussion”, we will verify the correctness of the structural theory analysis through experiments and evaluate the performance and discussion of our algorithm. In “Section 4 Conclusions”, we summarize the conclusions drawn from our experiments.

2 Materials and methods

2.1 Camellia oleifera image acquisition

The image dataset utilized in this study was obtained from a Camellia oleifera orchard located in Liuyang City, Hunan Province, China. The orchard follows standardized planting arrangements, with approximately 2 meters of row spacing and 1 meter of plant spacing. The Camellia oleifera fruit trees have a height ranging from 1 to 3 meters. During the growth period, the color of the Camellia oleifera fruit transitions from green to reddish brown.

On October 12, 2022, image data of Camellia oleifera fruit were captured using an iPhone 12 and saved in JPEG format at pixel resolutions of 4302 × 2268 (16:9) and 3024 × 3024 (1:1). The images were captured at angles ranging from 0° to 45° with respect to the vertical line perpendicular to the tree trunk. The shooting height and distance were adjusted according to the tree’s height, ranging from 0.9 m to 1.8 m and 0.6 m to 1.8 m, respectively. The camera position relative to the tree is shown in Figure 1 .

Figure 1. The method of image acquisition.

To enhance the robustness and generalization performance of the visual recognition module, the data collection process took into account the working hours (from morning to evening) and weather conditions (sunny and cloudy) of the Camellia fruit harvesting robot. This ensured the inclusion of images captured under different lighting conditions, such as natural light, exposure variations, and backlight caused by the camera’s orientation relative to the sun, and factors such as occlusion were also considered.

2.2 Image preprocessing

The unprocessed images were manually annotated using the “LabelImg” image annotation software. Each annotated image was stored as a txt file containing the object’s category, normalized central coordinates, and the normalized width and height of the bounding box outlining the target. Once the entire set of original images had been annotated, the dataset was enriched via data augmentation. These techniques include horizontal flip, vertical flip, added noise, random rotation, intensity adjustment, and combinations of these. Applying different augmentations to each original image produces multiple augmented versions that expand the original dataset; an example of data enhancement is shown in Figure 2 . The augmented dataset is designed to significantly enhance the target detection model’s capacity for generalization and resilience, as outlined in reference ( Liu et al., 2023 ).
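Minimal OpenCV/NumPy versions of the augmentations listed above are sketched below for illustration. The specific parameters (noise level, rotation range, gain range) are assumptions, and in practice the bounding-box labels must be transformed consistently with each image, which is omitted here for brevity.

```python
import cv2
import numpy as np

def augment(image):
    """Produce the augmentation variants described in the text:
    horizontal flip, vertical flip, added Gaussian noise, random rotation,
    and intensity adjustment (parameter ranges are illustrative)."""
    h_flip = cv2.flip(image, 1)                            # horizontal flip
    v_flip = cv2.flip(image, 0)                            # vertical flip

    noise = np.random.normal(0, 10, image.shape).astype(np.float32)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    h, w = image.shape[:2]
    angle = np.random.uniform(-30, 30)                     # random rotation
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, m, (w, h))

    gain = np.random.uniform(0.7, 1.3)                     # intensity adjustment
    adjusted = np.clip(image.astype(np.float32) * gain, 0, 255).astype(np.uint8)

    return [h_flip, v_flip, noisy, rotated, adjusted]
```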

Figure 2. Data augmentation in the Camellia oleifera fruit dataset. (A) horizontal flip; (B) vertical flip; (C) adding noise; (D) random rotation; (E) intensity adjustment; (F) combination of two techniques.
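
The augmentation code itself is not given in the paper; a minimal OpenCV/NumPy sketch of the five listed transforms might look like the following (the file name, noise level, rotation range, and intensity gain are illustrative assumptions). Note that for flips and rotations the YOLO-format labels must be transformed consistently, e.g. the normalized x-coordinate becomes 1 − x after a horizontal flip.

```python
import random

import cv2
import numpy as np

def horizontal_flip(img):
    return cv2.flip(img, 1)                       # mirror around the vertical axis

def vertical_flip(img):
    return cv2.flip(img, 0)                       # mirror around the horizontal axis

def add_gaussian_noise(img, sigma=15.0):
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def random_rotation(img, max_angle=30.0):
    h, w = img.shape[:2]
    angle = random.uniform(-max_angle, max_angle)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))

def adjust_intensity(img, low=0.6, high=1.4):
    gain = random.uniform(low, high)
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    image = cv2.imread("camellia_fruit.jpg")      # hypothetical input image
    augmented = {
        "hflip": horizontal_flip(image),
        "vflip": vertical_flip(image),
        "noise": add_gaussian_noise(image),
        "rot": random_rotation(image),
        "gain": adjust_intensity(image),
        "combo": adjust_intensity(random_rotation(image)),  # combination of two techniques
    }
    for name, aug in augmented.items():
        cv2.imwrite(f"camellia_fruit_{name}.jpg", aug)
```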

The dataset encompasses 4,780 images of Camellia oleifera fruit. A stratified random sample of 3,824 images (80%) constitutes the training set, while the remaining 956 images (20%) form the validation set. This division ensures that each original image and its augmented versions are consistently assigned to either the training set or the validation set.
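
The splitting code is not specified in the paper; assuming a hypothetical naming convention in which each augmented file shares the stem of its original image plus an "_aug" suffix, a grouped 80/20 split that keeps originals and their augmented copies together could be sketched as:

```python
import random
from pathlib import Path

def grouped_split(image_dir, train_ratio=0.8, seed=0):
    """Split images so that an original image and all of its augmented
    variants end up in the same subset."""
    # Assumed naming convention: "IMG_0001.jpg" and "IMG_0001_aug3.jpg"
    groups = {}
    for path in Path(image_dir).glob("*.jpg"):
        stem = path.stem.split("_aug")[0]   # strip the augmentation suffix
        groups.setdefault(stem, []).append(path)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    cut = int(len(keys) * train_ratio)
    train = [p for k in keys[:cut] for p in groups[k]]
    val = [p for k in keys[cut:] for p in groups[k]]
    return train, val
```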

2.3 YOLOv5 model

YOLOv5 (Jocher et al., 2022), and specifically the YOLOv5s variant, stands out as an efficient target detection model with a relatively modest parameter count. This characteristic makes it particularly well suited to real-time applications, such as those involving picking robots. The YOLOv5 model is organized into four fundamental components: Input, Backbone, Neck, and Head.

The Input stage involves resizing and normalizing the image to match the network’s input dimensions. The Mosaic data augmentation algorithm, a variant of CutMix (Yun et al., 2019), is applied to improve model training speed and network accuracy.
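
For illustration only, a much-simplified version of the Mosaic idea, tiling four images around a random centre on one canvas, is sketched below; the actual YOLOv5 implementation also remaps the bounding-box labels and uses random scaling and cropping rather than plain resizing.

```python
import random

import cv2
import numpy as np

def mosaic4(images, out_size=640):
    """Minimal Mosaic sketch: tile four images around a random centre point.
    Labels are omitted for brevity; a real implementation must remap them."""
    assert len(images) == 4
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey background
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    # Target regions: top-left, top-right, bottom-left, bottom-right of (cx, cy)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        w, h = x2 - x1, y2 - y1
        if w > 0 and h > 0:
            canvas[y1:y2, x1:x2] = cv2.resize(img, (w, h))
    return canvas
```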

The backbone network comprises three central structures: the Convolution block (Conv block), the Cross Stage Partial (CSP) unit (comprising the C3_1 and C3_2 blocks), and Spatial Pyramid Pooling-Fast (SPPF). The CSP architecture serves to increase network depth and receptive field, thereby augmenting feature extraction capabilities. SPPF is an upgraded iteration of the Spatial Pyramid Pooling (SPP) technique (He et al., 2015), amalgamating features of varying resolutions to yield a more comprehensive information substrate for the network’s neck.

The neck network, which combines an FPN (Feature Pyramid Network) and a PAN (Path Aggregation Network), fuses image features. The FPN conveys semantic features from top to bottom, while the PAN transmits localization features from bottom to top. The fusion of FPN and PAN enhances feature extraction in the network.

The head consists mainly of three detection layers, which include convolutional, pooling, and fully connected components. The detection head uses grid-based anchor points to predict objects on the feature maps produced at different scales by the neck.
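
A quick way to exercise this Input–Backbone–Neck–Head pipeline is to load the pretrained YOLOv5s model through torch.hub, as in the sketch below (the image path is hypothetical; the weights are downloaded from the Ultralytics repository on first use).

```python
import torch

# Load the pretrained YOLOv5s model published by Ultralytics
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Run the full Input -> Backbone -> Neck -> Head pipeline on one image
results = model("camellia_sample.jpg")   # hypothetical image path
results.print()                          # class, confidence and box summary
boxes = results.xyxy[0]                  # tensor of (x1, y1, x2, y2, conf, cls)
```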

2.4 Model improvements

2.4.1 YOLO-CFruit network architecture

The original YOLOv5 adopts a pure CNN architecture, with a primary emphasis on capturing localized details. To provide global modeling capability, the introduction of a transformer element becomes pertinent. Thus, a novel module, the CSP Bottleneck Transformer (CBT), has been devised. This module combines convolutional and transformer structures, improving the accuracy and precision of Camellia oleifera fruit identification. It is important to note that incorporating a vision transformer can be constrained by the quadratic computational complexity of image processing. Additionally, when the network is shallow and the feature maps are large, applying the transformer layer too early to enforce regression boundaries could inadvertently lead to the loss of crucial contextual information, as underscored by Zhang et al. (2021). In YOLOv5s, this module therefore replaces only the C3 modules in layers 8 and 26.

Moreover, to enhance the network’s ability to focus on the target and extract fine-grained features, the neck integrates the Convolutional Block Attention Module (CBAM).

The structural depiction of YOLO-CFruit is illustrated in Figure 3 . Notably, this configuration boasts a low computational burden, rendering it ideally suited for the detection of Camellia oleifera fruit within natural environments.

Figure 3 . Architecture of YOLO-CFruit network.

2.4.2 CSP Bottleneck Transformer module

In contrast to the original C3 module, the CSP Bottleneck Transformer (CBT) demonstrates the ability to encompass both global and contextual information regarding Camellia oleifera fruit features. Refer to Figure 4A for an illustrative representation of its structure.

Figure 4 . Architecture of CSP bottleneck transformer and bottleneck transformer. (A) Architecture of CSP Bottleneck Transformer; (B) Architecture of Bottleneck Transformer.

Traditional CNN-based models primarily aggregate local information and often struggle to capture comprehensive global insights. Conversely, Transformer-based models inherently excel at acquiring global context. The Bottleneck Transformers (BoT) block ( Srinivas et al., 2021 ), as depicted in Figure 4B , harmoniously merges ResNet bottleneck components with transformer architecture, with spatial 3x3 convolutions replaced by a Multi-Head Self Attention (MHSA) layer.

Within the MHSA framework, for the self-attention of the h-th head, the same input is passed through three separate 1 × 1 convolutions to yield the vectors q, k, and v. Because feature maps are two-dimensional, the position encodings r used in the self-attention mechanism are also two-dimensional rather than one-dimensional. The query q_h, key k_h, value v_h, and position encoding r_h of the h-th head are shown in Equation 1.

where X is the input vector, and W_h^q, W_h^k, and W_h^v are the linear transformations mapping X to the vectors q, k, and v of the h-th head. R_h^H and R_h^W represent the relative positional information in the vertical and horizontal directions, respectively. O_h denotes the self-attention output of the h-th head, computed using scaled dot-product attention; the calculation of O_h is shown in Equation 2.
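
Equations 1 and 2 are rendered only as images in the source; a PyTorch sketch of the BoT-style multi-head self-attention with 2D relative position encodings that they describe is given below. The head count, feature-map size, and the learnable parameters rel_h and rel_w are illustrative assumptions, not the exact YOLO-CFruit configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHSA2D(nn.Module):
    """Multi-head self-attention over a feature map with 2D relative
    position encodings, in the spirit of the Bottleneck Transformer."""

    def __init__(self, channels, height, width, heads=4):
        super().__init__()
        self.heads = heads
        self.d_head = channels // heads
        # 1x1 convolutions produce q, k, v (cf. Equation 1)
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)
        # Learnable relative position encodings for height (R^H) and width (R^W)
        self.rel_h = nn.Parameter(torch.randn(1, heads, self.d_head, height, 1))
        self.rel_w = nn.Parameter(torch.randn(1, heads, self.d_head, 1, width))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.to_q(x).view(b, self.heads, self.d_head, h * w)
        k = self.to_k(x).view(b, self.heads, self.d_head, h * w)
        v = self.to_v(x).view(b, self.heads, self.d_head, h * w)
        # 2D position encoding r = R^H + R^W, flattened over the spatial dims
        r = (self.rel_h + self.rel_w).view(1, self.heads, self.d_head, h * w)
        # Scaled dot-product attention with a content-position term (cf. Equation 2)
        content = torch.einsum("bhdi,bhdj->bhij", q, k)
        position = torch.einsum("bhdi,bhdj->bhij", q, r.expand(b, -1, -1, -1))
        attn = F.softmax((content + position) / self.d_head ** 0.5, dim=-1)
        out = torch.einsum("bhij,bhdj->bhdi", attn, v)
        return out.reshape(b, c, h, w)

# Example: a 512-channel, 20 x 20 feature map, as might appear deep in YOLOv5s
feat = torch.randn(2, 512, 20, 20)
print(MHSA2D(512, 20, 20)(feat).shape)   # torch.Size([2, 512, 20, 20])
```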

2.4.3 Convolutional Block Attention Module

To address the issue of foliage obscuring fruits and improve the model’s sensitivity to fruit features, this study incorporates the Convolutional Block Attention Module (CBAM) ( Woo et al., 2018 ) within the Neck network. CBAM is an effective attention module designed for convolutional neural networks (CNNs). Its lightweight design allows seamless integration into existing CNN architectures with minimal overhead. It can be jointly trained with the base CNN, enabling end-to-end learning. The CBAM module consists of two sub-modules: the channel attention module and the spatial attention module. The process begins with the feature map traversing the channel attention module, which generates a weighted outcome. It then proceeds to the spatial attention module, further refining the weighting process. Figure 5 provides a conceptual illustration of the CBAM module.

Figure 5 . Schematic of CBAM and each attention sub-module in CBAM. ⊗ denotes element-wise multiplication.

In the channel attention module, the input feature map F is compressed into two one-dimensional channel descriptors using global max pooling (“MaxPool”) and global average pooling (“AvgPool”). Both descriptors are passed through a shared multi-layer perceptron (MLP), with weights W_0 and W_1, that first reduces and then restores the channel dimensionality. The two resulting vectors are added element-wise and activated with a sigmoid function to produce the channel attention map M_c.

The channel attention-adjusted feature map F′ is obtained by element-wise multiplication of the original feature map F and the channel attention map M_c. This modified feature map F′ is then fed into the spatial attention module to further enhance the model’s ability to focus on relevant details.

In the spatial attention module, the channel attention-adjusted feature map F′ undergoes global max pooling and global average pooling along the channel dimension, yielding two two-dimensional maps. These are concatenated along the channel dimension and passed through a standard convolutional layer for dimensionality reduction, producing a single-channel two-dimensional map that is activated with a sigmoid function to give the spatial attention map M_s. Finally, M_s and the channel attention-adjusted feature map F′ are multiplied element-wise to obtain the final refined output F″. In summary, the CBAM process is given by Equations 3–6.

where c denotes the channel attention module and s the spatial attention module, ⊗ denotes element-wise multiplication, and σ denotes the sigmoid function. F_avg^x and F_max^x represent the average-pooled and max-pooled features, respectively, where x can be c or s. f^(7×7) denotes a convolution operation with a 7 × 7 kernel.

The introduction of the Convolutional Block Attention Module (CBAM) does not alter the original spatial dimensions of the feature map. Instead, it assigns weights to each feature channel and uses these weights to highlight important features. This emphasis on fine-grained features allows the network to obtain improved feature mappings, leading to enhanced accuracy.
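
Since Equations 3–6 are also rendered as images, the channel- and spatial-attention steps described above can be sketched in PyTorch roughly as follows (the reduction ratio of 16 and the 7 × 7 spatial kernel follow Woo et al., 2018; other details are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP (1x1 convolutions) with weights W_0 (reduce) and W_1 (restore)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))     # MLP(AvgPool(F))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))      # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx)                  # M_c, shape (B, C, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # channel-wise average pooling
        mx, _ = x.max(dim=1, keepdim=True)              # channel-wise max pooling
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s, (B, 1, H, W)

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)                              # F'  = M_c(F)  (x) F
        return x * self.sa(x)                           # F'' = M_s(F') (x) F'

feat = torch.randn(1, 256, 40, 40)
print(CBAM(256)(feat).shape)                            # torch.Size([1, 256, 40, 40])
```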

2.4.4 Bounding box regression loss function

The bounding box regression loss function is a crucial tool for evaluating the accuracy of model predictions and is commonly used in conjunction with the Intersection over Union (IoU) (Yu et al., 2016). The IoU is calculated by dividing the intersection of the predicted box (A) and the ground-truth box (B) by their union, as defined in Equation 7.

The Complete Intersection over Union (CIoU) loss (Zheng et al., 2020) introduced in YOLOv5 addresses the limitations of the traditional IoU loss by considering the distance and aspect-ratio discrepancies between the candidate bounding box and the ground-truth bounding box. This provides a more comprehensive assessment of the model’s detection performance and enhances its ability to locate objects accurately. The CIoU loss is shown in Equations 8, 9, where ρ(·) is the Euclidean distance, b and b^gt denote the central points of B and B^gt, c is the diagonal length of the smallest enclosing box covering the two boxes, w and h are the width and height of the predicted box, and w^gt and h^gt are the width and height of the ground-truth box.

The aspect-ratio term v in the CIoU used by YOLOv5 reflects only the difference in aspect ratio, rather than the actual differences in width and height and their confidences, which can sometimes hinder effective optimization of box similarity.

The EIoU (Zhang et al., 2022), derived from the CIoU penalty term, splits the aspect-ratio factor into separate width and height terms computed between the target box and the anchor box. The loss function is composed of three parts: overlap loss, center-distance loss, and width-height loss. The first two components follow the CIoU approach, while the width-height loss directly minimizes the differences in width and height between the target box and the anchor box, resulting in faster convergence. EIoU is defined in Equation 10.

where w_c and h_c are the width and height of the minimum enclosing box that covers both boxes. The schematic of CIoU and EIoU is presented in Figure 6.

Figure 6 . Schematic of CIoU and EIoU.
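
Equations 7–10 are not reproduced in this extracted text; the IoU and EIoU terms described above can be sketched for boxes in (x1, y1, x2, y2) format as follows (the epsilon value and the reduction to a mean are assumptions).

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4).

    L_EIoU = 1 - IoU + d^2 / c^2 + (w - w_gt)^2 / w_c^2 + (h - h_gt)^2 / h_c^2,
    where c, w_c, h_c describe the smallest box enclosing both boxes.
    """
    # Widths, heights and centres of the two boxes
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # Intersection over union (cf. Equation 7)
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Centre-distance, width and height penalty terms (cf. Equation 10)
    center_dist2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    loss = (1 - iou + center_dist2 / c2
            + (w1 - w2) ** 2 / (cw ** 2 + eps)
            + (h1 - h2) ** 2 / (ch ** 2 + eps))
    return loss.mean()

pred = torch.tensor([[10., 10., 60., 80.]])
gt = torch.tensor([[12., 15., 58., 90.]])
print(eiou_loss(pred, gt))
```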

2.5 Evaluation indicators of network model

The performance evaluation of Camellia oleifera fruit detection in this study utilized four indicators: Precision, Recall, F1 score, and Average Precision (AP). These parameters are commonly used in object detection tasks to assess the accuracy and effectiveness of the detection model. These parameters are defined as shown in Equations 11 – 14 :

where TP represents the number of true positives (i.e., positive samples predicted as positive), FN represents the number of false negatives (i.e., positive samples predicted as negative), and FP denotes the number of false positives (i.e., negative samples predicted as positive). The IoU indicates the overlap ratio between the predicted bounding box and the ground-truth bounding box. Typically, the IoU threshold is set to 0.5: detections with an IoU greater than 0.5 are considered true positives, while those with an IoU less than 0.5 are considered false positives.

Higher values of Precision, Recall, F1 score, and AP indicate better performance in Camellia oleifera fruit detection. Precision reflects the accuracy of fruit recognition by the network, while Recall indicates the ability to correctly detect all instances of Camellia oleifera fruits. The F1 score combines Precision and Recall into a single metric, providing a balance between the two. Average Precision (AP) measures the overall detection performance across different recall levels.
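
Equations 11–14 are likewise missing from the extracted text; the standard definitions they refer to, with AP computed as the area under the interpolated P-R curve, can be written directly:

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0                                   # Equation 11
    recall = tp / (tp + fn) if tp + fn else 0.0                                      # Equation 12
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0  # Equation 13
    return precision, recall, f1

def average_precision(recalls, precisions):
    """AP as the area under the P-R curve (Equation 14), using the all-point
    interpolation common in object-detection evaluation."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # make precision monotonically decreasing
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(precision_recall_f1(tp=90, fp=2, fn=5))
print(average_precision(np.array([0.5, 1.0]), np.array([1.0, 0.8])))  # 0.9
```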

2.6 Training platform

Training was conducted on a computer equipped with an Intel Xeon W-2223 CPU, 128 GB of RAM, and an NVIDIA RTX A4000 GPU. The software tools used include CUDA 11.1, cuDNN 7.6.5, OpenCV 3.4.1, and Visual Studio 2017.

The detection model for Camellia oleifera fruit was established by fine-tuning the YOLOv5s model on the self-made Camellia oleifera fruit dataset using transfer learning, with the YOLOv5s weights used to initialize the model parameters. YOLO-CFruit was trained with 640 × 640 pixel input images, a batch size of 32, a learning rate of 0.01, and 150 epochs.
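
As a rough, non-authoritative illustration: assuming training is launched through the Ultralytics YOLOv5 repository's train.py script, the run described above might look like the sketch below. The dataset YAML name is hypothetical, and the 0.01 learning rate corresponds to the default lr0 in the repository's hyperparameter file rather than a command-line flag.

```python
# Hedged sketch of launching the training run with the Ultralytics YOLOv5 repository
# (https://github.com/ultralytics/yolov5); the dataset YAML is a hypothetical name.
import subprocess

subprocess.run([
    "python", "train.py",
    "--img", "640",             # 640 x 640 input resolution
    "--batch", "32",            # batch size
    "--epochs", "150",          # training epochs
    "--data", "camellia.yaml",  # hypothetical dataset config
    "--weights", "yolov5s.pt",  # transfer learning from pretrained YOLOv5s
], check=True)
```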

3 Results and discussion

3.1 Ablation experiments with different modifications

The proposed method aims to improve the object detection accuracy of the YOLO-CFruit model by integrating the CBT module and the CBAM attention module into the YOLOv5s network structure and by improving the loss function. To evaluate the effectiveness of these changes, ablation experiments were conducted by removing each improved module one at a time and retraining the model to measure the impact of each modification. The goal was to identify specific substructures of the model and optimize them for better performance. To ensure the validity of the experiments, the model was trained using consistent hyperparameters and the same operating environment.

Table 1 shows the results of the ablation experiments, with mean average precision and F1 score used as the evaluation metrics. The modifications made to different parts of the model all had a positive impact on accuracy. Notably, the EIoU module has the most significant impact relative to the original YOLOv5s: it accelerates the convergence of predicted boxes, enhances their regression accuracy, and increases the F1-score to 95.9%, mAP@0.5 to 98.1%, and AP@[0.5:0.95] to 76.5%.

Table 1 . Results of the ablation experiments.

The addition of CBT and CBAM modules improves the model’s ability to acquire global information and accurately capture the regions of interest. Performance is also enhanced compared to the baseline. These results show that the EIoU module, the CBT module, and the CBAM module are all effective in improving the accuracy of the detection model.

The YOLO-CFruit algorithm, implemented by combining all of the sub-modules, improves mAP@0.5 by 1.2%, AP@[0.5:0.95] by 7.6%, and F1-score by 3.7% relative to the original YOLOv5 (the first group), gains that exceed those of each sub-module alone. This demonstrates that YOLO-CFruit performs very well in detecting Camellia oleifera fruit in complex environments.

Figure 7 compares the P-R curves obtained when each sub-module is removed in the ablation experiments. The P-R curve of YOLO-CFruit lies closest to the upper right corner, indicating the best model performance.

Figure 7. P-R curves of the improved YOLOv5 models with different modifications.

To qualitatively analyze the impact of CBAM, the Grad-CAM (Gradient-weighted Class Activation Mapping) technique ( Selvaraju et al., 2017 ) was employed to compare different networks. Grad-CAM is a gradient-based visualization method that identifies the significance of spatial locations within convolutional layers, effectively highlighting regions of interest.

Figure 8 illustrates the Grad-CAM masks obtained from YOLOv5 combined with CBAM and YOLOv5 alone. The Grad-CAM masks of YOLOv5 with CBAM more accurately cover the regions of the target objects compared to YOLOv5 alone. This indicates that the combination of YOLOv5 with CBAM enables the network to effectively learn and consolidate features from the regions of interest, resulting in improved localization accuracy.

Figure 8 . Grad-CAM visualization results. (A) original; (B) YOLOv5s; (C) CBAM.
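
A minimal sketch of how Grad-CAM heat maps such as those in Figure 8 can be produced with forward and backward hooks is given below; the choice of target layer and of the scalar score to backpropagate (for example, a sum of objectness confidences) depends on the detector's output format and is an assumption here.

```python
import torch
import torch.nn.functional as F

class GradCAM:
    """Minimal Grad-CAM: weight a convolutional layer's activations by the
    spatial average of their gradients with respect to a chosen score."""

    def __init__(self, model, target_layer):
        self.model = model
        self.acts, self.grads = None, None
        target_layer.register_forward_hook(self._save_act)
        target_layer.register_full_backward_hook(self._save_grad)

    def _save_act(self, module, inputs, output):
        self.acts = output.detach()

    def _save_grad(self, module, grad_input, grad_output):
        self.grads = grad_output[0].detach()

    def __call__(self, image, score_fn):
        self.model.zero_grad()
        output = self.model(image)
        score_fn(output).backward()                           # scalar score to explain
        weights = self.grads.mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
        cam = F.relu((weights * self.acts).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Hypothetical usage: heatmap = GradCAM(model, some_conv_layer)(img_tensor, my_score_fn)
```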

3.2 Performance of YOLO-CFruit model

To evaluate the detection capabilities of the YOLO-CFruit model specifically for Camellia oleifera fruit, we applied YOLO-CFruit to the self-made Camellia oleifera fruit test set. The precision-recall curve and the training loss curve are shown in Figure 9. The validation loss of YOLO-CFruit decreases from 0.091 to 0.023 during training. The precision (P), recall (R), average precision (AP), and F1 score of YOLO-CFruit are 98%, 94.5%, 98.2%, and 96.2%, respectively. Therefore, the model maintains a high detection performance for Camellia oleifera fruits.

Figure 9 . P-R curve and loss curve of YOLO-CFruit model. (A) P-R curve of YOLO-CFruit model; (B) loss curve of YOLO-CFruit model.

To further assess the model’s robustness under various lighting angles, we selected 120 images from the test set covering three distinct lighting environments: natural light, backlight, and exposure. These images were divided into three groups according to lighting condition. Within each group, we counted the total number of Camellia oleifera fruits, along with the numbers of missed detections and false detections. The statistical results are presented in Table 2. They demonstrate that the YOLO-CFruit model reliably identifies the majority of Camellia oleifera fruits across diverse lighting scenarios, with low rates of false and missed detections.

Table 2 . Detection results under different lighting conditions.

As shown in Figure 10, most of the Camellia oleifera fruits missed under the three lighting scenarios were either heavily occluded or in an exposed environment. In these cases, few features can be effectively extracted, so the probability of missed detection is relatively high.

Figure 10 . YOLO-CFruit detection results under different light environments. (A) natural light; (B) back light; (C) sidelight.

When detecting Camellia oleifera fruit in practical application scenarios, the captured images often contain numerous fruits with varying sizes, severe occlusions, and disordered densities. These factors make detection challenging and lead to low detection accuracy. By effectively detecting Camellia oleifera fruit under such conditions, YOLO-CFruit provides a feasible solution for applying deep neural networks in agriculture.

3.3 Comparison of results with other detection models

For a comprehensive assessment of YOLO-CFruit, we compared it with contemporary models, including Faster-RCNN, YOLOv4, YOLOv7, YOLOv8s, and the original YOLOv5s. Employing an identical test dataset and experimental conditions, we scrutinized their prediction results. Table 3 and Figure 11 summarize the specific outcomes.

Table 3 . Comparison of results with other detection models.

Figure 11 . P-R curves for different detection models.

As detailed in Table 3, among the various detection metrics, YOLO-CFruit achieves a mAP@0.5 of 98.2%, outperforming Faster-RCNN, YOLOv4, the original YOLOv5, YOLOv7, and YOLOv8s by 15.3%, 9.6%, 1.2%, 0.4%, and 0.6%, respectively. Its F1-score reaches 96.2%, exceeding Faster-RCNN, YOLOv4, the original YOLOv5, YOLOv7, and YOLOv8 by 16.3%, 16.2%, 3.7%, 1.5%, and 2.0%, respectively. Both of these key metrics are above those of the other models. The trend in Figure 11 shows YOLO-CFruit’s P-R curve converging towards the upper right corner more closely than those of the alternative models. Collectively, these findings confirm that YOLO-CFruit excels in detection accuracy, surpassing its counterparts among the compared target detection models.

As for model size, YOLO-CFruit occupies 11.77 MB, which is 370.41 MB, 50 MB, 1.95 MB, 9.69 MB, and 59.55 MB smaller than Faster-RCNN, YOLOv4, the original YOLOv5s, YOLOv7, and YOLOv8s, respectively, indicating that YOLO-CFruit is better suited to the recognition system of harvesting robots and is conducive to model deployment and migration.

Although YOLO-CFruit is slightly slower than the original YOLOv5s in inference speed, it performs well on all other detection metrics, so it can be used for Camellia oleifera fruit recognition in complex environments.

Figure 12 compares the detection results of YOLO-CFruit with those of other target detection algorithms in complex scenes under different lighting conditions. The YOLO-CFruit model detects Camellia oleifera fruits that are missed by the other models in different scenes. This shows that YOLO-CFruit is better suited than the other models to Camellia oleifera fruit detection in complex scenes.

Figure 12. Comparison of YOLO-CFruit detection with five classical models under exposure, natural light, and backlight conditions. (A) Faster-RCNN; (B) YOLOv4; (C) YOLOv5s; (D) YOLOv7; (E) YOLOv8s; (F) YOLO-CFruit.

Overall, YOLO-CFruit can detect all Camellia oleifera fruit with the highest localization accuracy. It has great potential for detection and harvesting applications on mobile devices with limited computing capabilities.

Our improved YOLO-CFruit model achieves an impressive average precision (AP) of 98.2% at 19.02 FPS, surpassing the YOLOv4, YOLOv5s, YOLOv7, and YOLOv8 models as well as Faster-RCNN. The model shows significant potential for detecting and harvesting Camellia oleifera fruit using mobile devices with limited computational power. Implementing automatic harvesting based on this model could reduce costs, improve efficiency, and benefit the Camellia oleifera fruit industry and local economy.

Cultivating Camellia oleifera fruit trees within an intricate and open environment presents inherent challenges to achieving accurate fruit detection. In response, we introduce YOLO-CFruit, a deep learning-based model meticulously designed for the purpose of detecting Camellia oleifera fruit.

However, we observed in our experiments that the model has a higher probability of missed detection and lower localization precision when severe occlusion is present. Therefore, further optimization and improvement of the proposed method are necessary. In future research, we plan to collect more images using different methods and tools and build a larger dataset for detecting Camellia oleifera fruit. In particular, adding images of heavily occluded Camellia oleifera fruits would improve the ability to extract fruit features in occluded situations. In addition, we aim to further optimize the network model to reduce the computational cost while maintaining high detection accuracy and improving the inference speed of YOLO-CFruit. The model will also be applied to detecting the ripeness of Camellia oleifera fruit.

4 Conclusions

In this study, the proposed YOLO-CFruit network model is used for the recognition of Camellia oleifera fruits in natural environments. The necessity for this research is driven by the critical need for accurate and efficient fruit detection to facilitate automated harvesting, which can substantially enhance productivity and reduce labor costs in the agricultural sector.

The YOLO-CFruit model has proven its effectiveness through rigorous evaluation, showcasing a mean average precision of 98.2%, a recall of 94.5%, precision of 98.0%, and an impressive F1 score of 0.962. These exemplary metrics, aligned with well-known objective evaluation standards, underscore the model’s high accuracy and reliability in fruit detection under diverse conditions.

Moreover, the model’s efficiency is evidenced by a frame rate of 19.02 FPS, positioning YOLO-CFruit as a viable candidate for real-time applications. This swift performance, coupled with its superior accuracy, sets YOLO-CFruit apart from its counterparts, such as Faster-RCNN, YOLOv4, YOLOv5s, YOLOv7, and YOLOv8, in both average accuracy and overall performance.

In conclusion, the YOLO-CFruit model not only meets but exceeds the current benchmarks for object detection models, offering a compelling solution for the automated harvesting of Camellia oleifera fruits. The contributions of this study are multifaceted, including the development of a high-performing detection model and the potential to revolutionize agricultural practices. Future work will focus on further refining the model and exploring its applicability to other agricultural products, thereby expanding the impact of our research.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

YYL: Writing – original draft, Validation, Software, Conceptualization. YL: Writing – review & editing, Conceptualization. HW: Writing – review & editing. HC: Writing – review & editing, Methodology, Data curation. KL: Writing – review & editing, Methodology, Data curation. LL: Writing – review & editing, Supervision, Resources, Conceptualization.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research was supported by Emergency Science and Technology item of the State Forestry and Grassland Administration of China (No.202202-2), Provincial Science and Technology Special item (No.20222-051247) of Jingangshan National Agricultural High-Tech Industrial Demonstration Zone, National Key Research and Development Program of China (No.2022YFD2202103), and Natural Science Foundation of Hunan Province (No.2024JJ8037).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y. M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. doi: 10.48550/arXiv.2004.10934

Chen, X., Wang, B. (2020). Invariant leaf image recognition with histogram of gaussian convolution vectors. Comput. Electron. Agric. 178, 105714. doi: 10.1016/j.compag.2020.10571

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Houlsby, N., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. doi: 10.48550/arXiv.2010.11929

Fu, L., Duan, J., Zou, X., Lin, J., Zhao, L., Li, J., et al. (2020). Fast and accurate detection of banana fruits in complex background orchards. IEEE Access 8, 196835–196846. doi: 10.1109/Access.6287639

Gongal, A., Amatya, S., Karkee, M., Zhang, Q., Lewis, K. (2015). Sensors and systems for fruit detection and localization: A review. Comput. Electron. Agric. 116, 8–19. doi: 10.1016/j.compag.2015.05.021

Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., et al. (2018). Recent advances in convolutional neural networks. Pattern recogn. 77, 354–377. doi: 10.1016/j.patcog.2017.10.013

He, K., Zhang, X., Ren, S., Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1904–1916. doi: 10.1109/TPAMI.2015.2389824

Huang, Y., He, J., Liu, G., Li, D., Hu, R., Hu, X., et al. (2023). Yolo-ep: A detection algorithm to detect eggs of pomacea canaliculata in rice fields. Ecol. Inf. 77, 102211. doi: 10.1016/j.ecoinf.2023.102211

Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., Kwon, Y., Michael, K., et al. (2022). ultralytics/yolov5: v6.2 - YOLOv5 classification models, Apple M1, reproducibility, ClearML and Deci.ai integrations. Zenodo. doi: 10.5281/zenodo.7002879

Koirala, A., Walsh, K. B., Wang, Z., McCarthy, C. (2019). Deep learning–method overview and review of use for fruit detection and yield estimation. Comput. Electron. Agric. 162, 219–234. doi: 10.1016/j.compag.2019.04.017

Kurtulmus, F., Lee, W. S., Vardar, A. (2011). Green citrus detection using ‘eigenfruit’, color and circular gabor texture features under natural outdoor conditions. Comput. Electron. Agric. 78, 140–149. doi: 10.1016/j.compag.2011.07.001

Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., et al. (2022). Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 . doi: 10.48550/arXiv.2209.02976

Liu, Y., Wang, H., Liu, Y., Luo, Y., Li, H., Chen, H., et al. (2023). A trunk detection method for camellia oleifera fruit harvesting robot based on improved yolov7. Forests 14, 1453. doi: 10.3390/f14071453

Lu, S., Liu, X., He, Z., Zhang, X., Liu, W., Karkee, M. (2022). Swin-transformer-yolov5 for real-time wine grape bunch detection. Remote Sens. 14, 5853. doi: 10.3390/rs14225853

Nguyen, T. T., Vandevoorde, K., Wouters, N., Kayacan, E., De Baerdemaeker, J. G., Saeys, W. (2016). Detection of red and bicoloured apples on tree with an rgb-d camera. Biosyst. Eng. 146, 33–44. doi: 10.1016/j.biosystemseng.2016.01.007

Rakun, J., Stajnko, D., Zazula, D. (2011). Detecting fruits in natural scenes by using spatial-frequency based texture analysis and multiview geometry. Comput. Electron. Agric. 76, 80–88. doi: 10.1016/j.compag.2011.01.007

Redmon, J., Divvala, S., Girshick, R., Farhadi, A. (2016). “You only look once: Unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Las Vegas, NV, USA: IEEE) 779–788. doi: 10.1109/CVPR.2016.91

Redmon, J., Farhadi, A. (2017). “Yolo9000: better, faster, stronger,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Honolulu, HI, USA: IEEE) 7263–7271.

Redmon, J., Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 . doi: 10.48550/arXiv.1804.02767

Rosenfeld, A., Tsotsos, J. K. (2019). “Intriguing properties of randomly weighted networks: Generalizing while learning next to nothing,” in 2019 16th Conference on Computer and Robot Vision (CRV). (Kingston, QC, Canada: IEEE) 9–16. doi: 10.1109/CRV.2019.00010

Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T., McCool, C. (2016). Deepfruits: A fruit detection system using deep neural networks. sensors 16, 1222. doi: 10.3390/s16081222

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D. (2017). “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in 2017 IEEE International Conference on Computer Vision (ICCV). (Venice, Italy: IEEE). doi: 10.1109/ICCV.2017.74

Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A. (2021). “Bottleneck transformers for visual recognition,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (Nashville, TN, USA: IEEE) 16519–16529. doi: 10.48550/arXiv.2101.11605

Sun, M., Zhao, R., Yin, X., Xu, L., Ruan, C., Jia, W. (2023). Fbot-net: Focal bottleneck transformer network for small green apple detection. Comput. Electron. Agric. 205, 107609. doi: 10.1016/j.compag.2022.107609

Tang, Y., Zhou, H., Wang, H., Zhang, Y. (2023). Fruit detection and positioning technology for a camellia oleifera c. abel orchard based on improved yolov4-tiny model and binocular stereo vision. Expert Syst. Appl. 211, 118573. doi: 10.1016/j.eswa.2022.118573

Wang, C., Bochkovskiy, A., Liao, H. (2022). Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , Vancouver, BC, Canada, 7464-7475. doi: 10.1109/CVPR52729.2023.00721

Wang, C., Luo, T., Zhao, L., Tang, Y., Zou, X. (2019). Window zooming–based localization algorithm of fruit and vegetable for harvesting robot. IEEE Access 7, 103639–103649. doi: 10.1109/Access.6287639

Wang, C., Zou, X., Tang, Y., Luo, L., Feng, W. (2016). Localisation of litchi in an unstructured environment using binocular stereo vision. Biosyst. Eng. 145, 39–51. doi: 10.1016/j.biosystemseng.2016.02.004

Wang, J., Su, Y., Yao, J., Liu, M., Du, Y., Wu, X., et al. (2023). Apple rapid recognition and processing method based on an improved version of yolov5. Ecol. Inf. 77, 102196. doi: 10.1016/j.ecoinf.2023.102196

Woo, S., Park, J., Lee, J.-Y., Kweon, I. S. (2018). “Cbam: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV). (Munich, Germany: Springer) doi: 10.48550/arXiv.1807.06521

Yan, F., Li, X., Huang, G., Li, X. (2020). “ Camellia oleifera fresh fruit harvesting in China,” in 2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE). (Harbin, China: IEEE) 699–702. doi: 10.1109/ICMCCE51767.2020.00154

Yu, J., Jiang, Y., Wang, Z., Cao, Z., Huang, T. (2016). “Unitbox: An advanced object detection network,” in Proceedings of the 24th ACM international conference on Multimedia. (Amsterdam, The Netherlands: Association for Computing Machinery) 516–520, MM ‘16. doi: 10.1145/2964284.2967274

Yu, L., Xiong, J., Fang, X., Yang, Z., Chen, Y., Lin, X., et al. (2021). A litchi fruit recognition method in a natural environment using rgb-d images. Biosyst. Eng. 204, 50–63. doi: 10.1016/j.biosystemseng.2021.01.015

Yu, Y., Zhang, K., Yang, L., Zhang, D. (2019). Fruit detection for strawberry harvesting robot in non-structural environment based on mask-rcnn. Comput. Electron. Agric. 163, 104846. doi: 10.1016/j.compag.2019.06.001

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., Yoo, Y. (2019). “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV). (Seoul, Korea (South: IEEE) 6023–6032. doi: 10.1109/ICCV.2019.00612

Zhang, Z., Lu, X., Cao, G., Yang, Y., Jiao, L., Liu, F. (2021). “Vit-yolo: Transformer-based yolo for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision. 2799–2808.

Zhang, Y.-F., Ren, W., Zhang, Z., Jia, Z., Wang, L., Tan, T. (2022). Focal and efficient iou loss for accurate bounding box regression. Neurocomputing 506, 146–157. doi: 10.1016/j.neucom.2022.07.042

Zhang, L., Wang, L. (2022). Prospect and development status of oil-tea camellia industry in China. China Oils and Fats. 46 (6), 6–9. doi: 10.19902/j.cnki.zgyz.1003-7969.2021.06.002

Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D. (2020). “Distance-iou loss: Faster and better learning for bounding box regression,” in Proceedings of the AAAI Conference on Artificial Intelligence, (New York, USA: AAAI Press) Vol. 34. 12993–13000. doi: 10.1109/ICCVW54120.2021.00314

Zhou, Y., Tang, Y., Zou, X., Wu, M., Tang, W., Meng, F., et al. (2022). Adaptive active positioning of camellia oleifera fruit picking points: Classical image processing and yolov7 fusion algorithm. Appl. Sci. 12, 12959. doi: 10.3390/app122412959

Keywords: Camellia oleifera , fruit detection, CBAM module, transformer, EIoU loss

Citation: Luo Y, Liu Y, Wang H, Chen H, Liao K and Li L (2024) YOLO-CFruit: a robust object detection method for Camellia oleifera fruit in complex environments. Front. Plant Sci. 15:1389961. doi: 10.3389/fpls.2024.1389961

Received: 22 February 2024; Accepted: 15 July 2024; Published: 14 August 2024.

Copyright © 2024 Luo, Liu, Wang, Chen, Liao and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Lijun Li, [email protected]

† These authors have contributed equally to this work and share first authorship
