2.1. YOLOv7-GX Model
The YOLOv7 algorithm, a typical representative of one-stage target detection algorithms, uses deep neural networks for object recognition and localization and runs fast enough for real-time systems. YOLOv7 performs well in the range from 5 FPS to 160 FPS, and outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, and many other object detectors in both speed and accuracy. The main contributions of YOLOv7 include the introduction of model reparameterization into the network architecture, the adoption of the cross-grid search label-assignment strategy of YOLOv5, and the adoption of the matching strategy of YOLOX [12]. In addition, a more efficient ELAN network architecture is introduced. YOLOv7 also proposes an auxiliary-head training method, which improves accuracy at the expense of a higher training cost but without affecting inference time, since the auxiliary head appears only during training. The model structure of YOLOv7 is shown in Figure 1.
Although the YOLOv7 algorithm has achieved remarkable success in terms of speed and accuracy and has become an important milestone in the field of one-stage target detection, it still faces certain challenges when dealing with a large number of dense and very small-sized targets. Especially in the defect detection dataset of photovoltaic panels involved in this study, the feature information of dense and small targets is often severely lost after the multilayer convolutional network of YOLOv7, which directly affects the model’s ability to detect tiny defects, such as hot spots.
As a typical fault type in PV panels, hot spots appear as tiny bright spots in the original image, occupying a very limited number of pixels. As indicated by the blue arrow in Figure 2, these tiny bright spots represent hot-spot faults. Because the original YOLOv7 architecture is designed to balance detection speed against overall target-recognition accuracy, its deep convolutional processing easily ignores or loses the feature information of such very small targets, which leads to missed detections of small-sized faults such as hot spots.
To address this problem, the improved algorithm needs to enhance its ability to capture small target features while maintaining the original advantages of YOLOv7 to ensure accurate recognition of dense small targets. This involves adjusting the network architecture, optimizing the feature-fusion strategy, or introducing a more refined feature extraction mechanism. The sensitivity of the network to small-sized targets can be improved by tuning the parameters of the convolutional layers or introducing special attention mechanisms. In addition, the optimization of the feature-fusion strategy, such as the use of multi-scale feature fusion, can also effectively enhance the model’s ability to detect dense small targets.
In summary, although the YOLOv7 algorithm performs well in many respects, it still has limitations when dealing with dense small-target detection tasks. Therefore, targeted improvements to YOLOv7 that enhance its performance in specific application scenarios, such as the defect detection of PV panels, are an important research direction of this paper. Through these improvements, the applicability and effectiveness of the YOLOv7 algorithm in practical applications can be further extended, especially in scenarios where the accurate detection of dense small targets is required.
Standard convolution (SConv) is commonly used in deep learning models to extract image features by applying different convolution kernels to multiple channels at the same time, thus enabling the model to capture rich spatial information. However, the number of parameters required for standard convolution increases rapidly with the increase in network depth and the demand for feature extraction, resulting in computational complexity and network operation speed being the main factors limiting its application.
In contrast, depthwise separable convolution (DSC, built on depthwise convolution, DWConv) is widely used in various network architectures. It first applies a depthwise convolution to each input channel independently [13], and then combines the results with a 1 × 1 convolution (also known as pointwise convolution). This approach significantly reduces the number of parameters and the computational burden of the model and speeds up inference. However, while reducing parameters, depthwise separable convolution may also lose some of the semantic information between channels, affecting the overall accuracy of the model.
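The parameter saving described above can be checked with a short calculation. The following is a minimal sketch that counts convolution weights only (biases ignored); the channel and kernel sizes are illustrative assumptions, not values taken from the YOLOv7-GX configuration.

```python
def standard_conv_params(c_in, c_out, k):
    """Number of weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # the 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative sizes (assumed for this example only).
c_in, c_out, k = 256, 256, 3
sc = standard_conv_params(c_in, c_out, k)
dsc = depthwise_separable_params(c_in, c_out, k)
print(sc, dsc, dsc / sc)
```

The ratio equals 1/c_out + 1/k², so for a 3 × 3 kernel the depthwise separable form needs roughly one-ninth of the weights; the price is that channels interact only in the 1 × 1 step.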
To overcome these limitations, GSConv was proposed as a novel convolutional structure [14]. As shown in Figure 3, it blends the advantages of standard convolution (SC) and depthwise separable convolution (DSC). The key drawback of DSC is that, although it reduces computational cost, it processes channels separately, which significantly weakens feature extraction and fusion and is especially detrimental when detecting small targets, because the model can no longer capture the fine details needed to recognize small objects accurately. In contrast, GSConv combines SC, DSC, and a shuffle (mixing) operation, allowing the model to reduce computational requirements while retaining information about interactions between channels. The SC branch ensures that the depthwise-separated information is combined with channel-dense features, compensating for the loss of detail encountered when using DSC alone. This design improves the model’s ability to identify and accurately localize small targets, since critical spatial and feature-level details are preserved during the convolution operation. In addition, the shuffle operation within GSConv mixes the information uniformly, allowing SC-extracted features to permeate the output of the DSC operation. The result is a more comprehensive feature representation and a higher sensitivity to small-scale details, which is critical for detecting small targets.
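The mixing step described above can be illustrated with a generic two-group channel shuffle. This is a minimal sketch on channel labels rather than tensors; the function name `channel_shuffle` and the channel names are hypothetical, and the exact channel ordering used in the GSConv reference implementation may differ.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels taken from `groups` equal-sized contiguous groups."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # Equivalent to viewing the list as (groups, per_group),
    # transposing to (per_group, groups), and flattening.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Hypothetical channel labels: half from the standard-convolution branch,
# half from the depthwise branch, concatenated before shuffling.
sc_out = ["sc0", "sc1", "sc2"]
dw_out = ["dw0", "dw1", "dw2"]
mixed = channel_shuffle(sc_out + dw_out, groups=2)
print(mixed)  # ['sc0', 'dw0', 'sc1', 'dw1', 'sc2', 'dw2']
```

After the shuffle, every neighbouring pair of channels contains one densely computed feature and one depthwise feature, which is how the SC information is spread through the DSC output.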
By optimizing the balance between computational efficiency and model accuracy while preserving inter-channel linkages, GSConv is particularly suitable for application scenarios that require efficient real-time processing and highly accurate detection, such as PV-panel defect detection. It is able to effectively represent the fault deformation and overlap on PV panels due to variations in shooting angles, heights, and environmental conditions, providing an effective means to improve the generalization ability and robustness of the model. The introduction of this convolutional structure opens up new possibilities for the application of deep learning models in areas such as PV-panel defect detection, allowing the model to maintain a low computational complexity while effectively improving detection performance.
Replacing all the convolutions of the model with GSConv may significantly increase the number of layers of the model, which increases the inference time for PV-panel images. In the YOLOv7 network, the backbone requires enough convolution operations to extract the defect information on the PV panels. Therefore, we chose to replace the convolution operation only in the neck. Performing the convolutional substitution in the neck reduces redundant and repetitive information and improves the efficiency and performance of the model [15]. Such an optimization strategy reduces the computational complexity of the model without sacrificing its accuracy, making the model more feasible in resource-constrained environments [16]. Through extensive experiments, we redesigned the neck structure and proposed a new GhostSlimFPN network structure, as shown in Figure 4. This structure replaces the top-down and bottom-up pyramid structure of YOLOv7; it greatly compresses the number of layers of the model without loss of accuracy, which not only reduces the computational cost but also maintains the connectivity between channels as much as possible [17].
2.2. Custom Convolution with GAM Attention Mechanism
The GAM attention mechanism amplifies global interaction features while reducing information dispersion [18]. It adopts the sequential channel–spatial attention arrangement of CBAM and redesigns the CBAM sub-modules [19]. To integrate the GAM attention mechanism into the model effectively, while keeping the model computationally efficient and practical, we embedded the GAM in a 1 × 1 convolutional block to form a new convolutional block, Conv_ATT, as shown in Figure 5. This design not only retains the advantages of 1 × 1 convolution in reducing the number of parameters and the computational burden, but also introduces the GAM attention mechanism, which enables the model to focus on and amplify the key features in the image more accurately and reduces the dispersion of information during transmission.
By integrating the GAM attention mechanism in a 1 × 1 convolutional block, Conv_ATT is able to achieve an efficient extraction and enhancement of the global features of an image, which is particularly important in tasks requiring fine characterization such as PV-panel defect detection. PV-panel defects, such as tiny occlusions and hot spots, often occupy a very small area in the image and are not easy to clearly distinguish, and the introduction of the GAM mechanism can help the model better recognize these subtle differences, thus improving the accuracy and reliability of detection.
In deep learning models, especially those used for image recognition and target detection, the introduction of the attention mechanism has become one of the most important methods to enhance the performance of the model. The GAM attention mechanism effectively enhances the model’s ability to understand and capture the global features of an image through the design of its channel-attention sub-module and spatial-attention sub-module.
In the GAM attention mechanism (see Figure 6), given an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$, the intermediate feature map $F_2$ and the output feature map $F_3$ are defined as

$$F_2 = M_c(F_1) \otimes F_1 \tag{1}$$

$$F_3 = M_s(F_2) \otimes F_2 \tag{2}$$

where $M_c$ and $M_s$ denote the channel-attention and spatial-attention feature maps, respectively, and $\otimes$ denotes element-wise multiplication.
The channel-attention sub-module preserves the three-dimensional information of the input features by employing a three-dimensional permutation, as shown in Figure 7. This design enables the module to strengthen the cross-dimensional channel–spatial dependencies while preserving the spatial information. By applying a two-layer MLP (multilayer perceptron), the sub-module not only amplifies the inter-channel correlations, but also enhances the model’s ability to perceive information at different spatial locations. The MLP adopts an encoder–decoder structure whose reduction ratio r limits the number of parameters while still allowing information to be transferred and processed efficiently.
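The channel-attention path described above (permute, two-layer MLP with reduction ratio r, reverse permute, gate the input) can be sketched in plain Python on a tiny feature map. This is a minimal illustration, not the authors’ implementation: the function name, the omission of biases, the random weights, and the use of a sigmoid gate are assumptions made for the example.

```python
import math
import random


def gam_channel_attention(feat, w1, w2):
    """feat: nested list [C][H][W]; w1: [C//r][C]; w2: [C][C//r].

    Returns the gated feature map with the same [C][H][W] shape.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            # Permutation: gather the channel vector at this spatial position.
            vec = [feat[c][h][w] for c in range(C)]
            # Encoder: C -> C/r, followed by ReLU.
            hidden = [max(0.0, sum(a * v for a, v in zip(row, vec))) for row in w1]
            # Decoder: C/r -> C.
            logits = [sum(a * v for a, v in zip(row, hidden)) for row in w2]
            # Reverse permutation and gating (sigmoid gate is an assumption).
            for c in range(C):
                gate = 1.0 / (1.0 + math.exp(-logits[c]))
                out[c][h][w] = gate * feat[c][h][w]
    return out


random.seed(0)
C, H, W, r = 4, 2, 2, 2  # illustrative sizes
feat = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
f2 = gam_channel_attention(feat, w1, w2)
```

Because the MLP is applied at every spatial position separately, no pooling is needed and the spatial layout of the input is kept, which is the property the sub-module relies on.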
The spatial-attention sub-module focuses on capturing and fusing spatial information. By using two convolutional layers, this sub-module fuses spatial information from different locations, enhancing the model’s understanding of the spatial features of the image. Like the channel-attention sub-module, the spatial-attention sub-module employs the reduction ratio r to maintain the efficiency and accuracy of information processing. Notably, to avoid the information loss that max pooling can cause, this sub-module removes the pooling step so that the feature map is further preserved.
To balance the number of parameters against computational efficiency, the spatial-attention sub-module employs group convolution with channel shuffle in models such as ResNet50. Group convolution reduces the number of parameters by processing the channels in groups, and channel shuffle restores the exchange of information between groups, so that the model’s ability to capture spatial features is maintained or improved.
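The effect of group convolution on the parameter budget of the two spatial-attention convolutions can be estimated as follows. This is a minimal sketch; the channel count, the 7 × 7 kernel size, the reduction ratio, and the number of groups are assumptions chosen for illustration.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weights in a k x k convolution with `groups` groups (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return k * k * (c_in // groups) * c_out


# Assumed sizes: C input channels, reduction ratio r, 7 x 7 kernels.
C, r, k = 512, 4, 7
dense = conv_params(C, C // r, k) + conv_params(C // r, C, k)
grouped = conv_params(C, C // r, k, groups=4) + conv_params(C // r, C, k, groups=4)
print(dense, grouped)
```

With g groups the weight count falls by a factor of g, which is why the grouped variant is paired with channel shuffle: the shuffle reintroduces the cross-group information flow that grouping removes.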
With these sub-modules, the GAM attention mechanism provides an effective way for deep learning models to enhance their understanding of channel and spatial information in images. Introducing this mechanism is intended to preserve the efficiency of the model while improving its performance in complex image-processing tasks, especially target detection and image recognition. The spatial-attention sub-module without group convolution is shown in Figure 8.
At the same time, we reconstructed the ELAN and SPPCSPC modules. We obtained the ELANA module by replacing the last 1 × 1 convolution in the ELAN module with the custom convolution Conv_ATT, as shown in Figure 9. In the same way, we obtained the SPPCSPC_ATT module, as shown in Figure 10.
2.3. Loss Function
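YOLOv7 uses a bounding-box regression loss function known as CIOU, which considers three geometric aspects: the overlapping area, the center-point distance, and the aspect ratio of the predicted and ground-truth bounding boxes. In addition to the IOU metric, which evaluates the shared area between the predicted and ground-truth boxes, CIOU calculates the Euclidean distance between their centers and penalizes the difference between their aspect ratios, measured as an angle difference [20]. The formula for the CIOU loss is expressed as follows:

$$L_{CIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{3}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{4}$$

$$\alpha = \frac{v}{(1 - IOU) + v} \tag{5}$$

where $b$ and $b^{gt}$ are the centers of the predicted and ground-truth boxes, $\rho(\cdot)$ is the Euclidean distance, $c$ is the diagonal length of the smallest box enclosing both boxes, and $w$, $h$ and $w^{gt}$, $h^{gt}$ are the widths and heights of the predicted and ground-truth boxes, respectively.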
Nevertheless, the arctan function in the last term of the loss, intended as a penalty for aspect-ratio variance, introduces two issues that affect the speed of convergence and the stability of the loss. First, the sensitivity of this term to outliers can cause significant swings in the value of the loss function, undermining its effectiveness. Second, the output range of the arctan function does not align with the normalization standards set for the loss function, so additional coefficients are needed to normalize the penalty term numerically, which also increases the computational demand.
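For reference, the standard CIOU loss discussed above can be sketched in a few lines. This is a minimal scalar version for a single pair of boxes in (x1, y1, x2, y2) format; the function name and the small `eps` stabilizer are assumptions for the example, and it is not the YOLOv7 training code.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIOU loss for two boxes given as (x1, y1, x2, y2) with positive size."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Overlap term: intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    iou = inter / (w * h + gw * gh - inter + eps)

    # Distance term: squared center distance over the squared diagonal
    # of the smallest box enclosing both boxes.
    rho2 = ((x1 + x2 - gx1 - gx2) / 2) ** 2 + ((y1 + y2 - gy1 - gy2) / 2) ** 2
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2 + eps

    # Aspect-ratio term: the arctan penalty, scaled by 4 / pi^2.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # close to 0 for identical boxes
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # greater than 1 for disjoint boxes
```

The factor 4/π² is the extra normalization coefficient mentioned above: arctan of a positive ratio lies in (0, π/2), so without it the squared difference would not fall in the unit range.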
Based on the above, we designed a more efficient loss function, the XIOU regression loss function. The XIOU loss function improves the overlap metric between the bounding boxes. It adds parameter tuning on top of the underlying IOU-based loss and also takes into account the shape factor of the bounding boxes through the convex diagonal loss. Specifically, the XIOU loss function introduces two additional parameters, which are calculated by applying Sigmoid functions to the width and height of the bounding box. By adjusting the terms in the calculation formula, the XIOU loss function provides a more accurate measure of the bounding-box overlap. The formula is as follows:
From Equations (6)–(8), it can be seen that the XIOU loss function introduces two parameters in the parameter-tuning section, which are obtained by applying the Sigmoid function to the width and height of the bounding box. These two parameters adjust the terms in the equation when the loss is calculated, providing a more accurate loss metric.
The XIOU loss function not only mitigates the problems of the CIOU loss, but also provides a more accurate and robust bounding-box regression loss for the target-detection model. Applying the XIOU loss function in advanced target-detection models such as YOLOv7 can further improve their performance in various detection tasks, especially in precise localization and in scenarios with large variations in target aspect ratios.
2.4. Weighting Strategies
To address the sample-imbalance problem in the target-detection task, we introduced category-weight calculation and image-weight calculation into the implementation of the YOLOv7 model. These two methods adjust the model’s focus on different categories and images, thereby improving the accuracy and performance of target detection. In category-weight calculation, we assigned weights to the categories based on their distribution in the training data [21]. By analyzing the training-label data, we counted the number of occurrences of each category to obtain its frequency. To prevent categories that never appear in the training data from distorting the weight calculation, we processed the categories with zero occurrences separately: we replaced the weights of these empty categories with a default value so that they play an appropriate role in the weight calculation. In addition, because “hot spot” is a small target that is difficult to detect in this dataset, we doubled the original weight of the “hot spot” category to increase its importance in the training process.
Equations (9)–(13) define the category weights. Equation (9) gives the category frequency statistics, that is, the number of times the ith category appears in the training set. Equation (10) handles the empty categories: for categories that do not appear, the frequency is set to 1. Equation (11) is the category-weight adjustment: for the specific category “hot spot”, the weight is doubled. Equation (12) computes the inverse-frequency weights. Equation (13) normalizes the weights over the total number of categories.
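The category-weight steps described above (frequency count, empty-category handling, hot-spot adjustment, inverse frequency, normalization) can be sketched as follows. This is a hedged illustration rather than the authors’ code: the function and argument names are hypothetical, the label data are made up, and applying the doubling to the inverse-frequency weight before normalization is an assumption about where the adjustment takes effect.

```python
def category_weights(labels, num_classes, boost_class=None, boost=2.0):
    """Normalized inverse-frequency weights for integer class labels."""
    # Frequency statistics: occurrences of each category.
    counts = [0] * num_classes
    for c in labels:
        counts[c] += 1
    # Empty categories: set the frequency to 1 to avoid division by zero.
    counts = [n if n > 0 else 1 for n in counts]
    # Inverse-frequency weights: rare categories receive larger weights.
    weights = [1.0 / n for n in counts]
    # Adjustment (assumed placement): double the weight of one category.
    if boost_class is not None:
        weights[boost_class] *= boost
    # Normalization: weights sum to 1 over all categories.
    total = sum(weights)
    return [w / total for w in weights]


# Made-up labels: class 2 plays the role of "hot spot", class 3 never appears.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = category_weights(labels, num_classes=4, boost_class=2)
print(weights)
```

In this toy example the rare boosted class ends up with twice the weight of the empty class, even though both have an effective count of 1.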
In addition to category-weight calculation, we also introduced image-weight calculation to better handle the sample-imbalance problem at the image level. By analyzing the distribution of each category in each image of the training data, we calculated a weight for each image. These image weights adjust how much attention the model pays to different images, thus balancing the importance of different images in the training process [22]. Specifically, we calculated a category-weighted sum for each image based on the category weights and the number of occurrences of each category in that image. Then, by processing and normalizing the category-weighted sum, we obtained the final weight of each image. With category-weight calculation and image-weight calculation, we are able to better deal with the sample-imbalance problem during training and improve the performance and accuracy of the target-detection algorithm. These methods were applied in the YOLOv7-GX model and achieved significant improvements.
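The image-weight calculation described above can be sketched in the same spirit. This is a minimal illustration under stated assumptions: the function name and data are hypothetical, normalizing the weights to sum to 1 is one possible choice for the normalization step, and using the weights to sample images with `random.choices` is shown only as one way they might be applied.

```python
import random


def image_weights(per_image_labels, class_w):
    """Weight each image by the category-weighted sum of its labels."""
    raw = [sum(class_w[c] for c in labels) for labels in per_image_labels]
    total = sum(raw)
    if total == 0:
        # No labels anywhere: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


# Made-up category weights and per-image label lists.
class_w = [0.1, 0.2, 0.7]
per_image_labels = [[0, 0], [2], [1, 2, 2]]
img_w = image_weights(per_image_labels, class_w)
print(img_w)

# Possible use: draw training images in proportion to their weights.
random.seed(0)
sampled = random.choices(range(len(img_w)), weights=img_w, k=8)
print(sampled)
```

Images that contain more instances of highly weighted categories are drawn more often, which is how the category-level rebalancing carries over to the image level.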