1 Introduction

Electricity is central to sustainable energy development and remains one of the most pressing energy issues worldwide. Transmission-line insulators are indispensable insulation equipment in power systems [34]. Rapid, automatic detection and localization of defective insulators helps repair equipment promptly and improves the stability of the power system. With the continuous promotion of intelligent power-grid inspection, unmanned aerial vehicles have been widely adopted to replace labor-intensive manual inspection [24]. However, the equipment defects in the captured images still have to be identified manually, which easily leads to missed and false detections [30]. To address these problems, power utilities have introduced artificial intelligence techniques, such as target detection and classification approaches [4, 42], to automate defect identification. Although detection- and classification-based algorithms [31] have achieved good results in other fields, such as IoT security [2, 12] and medical diagnosis [27], their performance cannot yet fully meet the requirements of practical power-line inspection.

In recent years, State Grid Co., Ltd. has undertaken the task of evaluating artificial intelligence (AI)-based algorithms for detecting common defects in power systems, focusing on 8–9 typical issues [44]. The findings from these assessments reveal that the accuracy of these detection algorithms is generally unsatisfactory and their generalization is limited, making practical application challenging. A significant factor contributing to the suboptimal performance of existing power defect detection algorithms is the scarcity of defect samples available for training [1]. Moreover, certain power-equipment defects are inherently rare, which hinders the broader adoption and advancement of AI technologies in this field.

Traditional data augmentation methods are already used in other fields to alleviate the problem of insufficient samples, such as mirror flipping [37, 40], image rotation [33], and affine transformation [1]. These methods can improve the performance of classification models to varying degrees [39]. Color jittering, affine transformation, and image fusion are also used extensively by target detection algorithms [5]. Traditional data augmentation methods, while effective in enhancing model performance by introducing variability, are limited by their inability to generate new, unseen samples. In contrast, GAN-based augmentation techniques can synthesize novel, realistic data, improving the robustness and generalization of models and offering a deeper, more nuanced expansion of the data distribution, which is crucial for complex problems in various domains.

Therefore, to reduce the energy consumption of data collection and the difficulty of obtaining data, this paper proposes a style transfer method for small-sample self-exploding insulator defect data based on an improved Star Generative Adversarial Network 2 (StarGAN2) [8]. The method can be trained with non-defective samples, reducing reliance on defect samples, and it produces diverse outputs. By introducing an identity loss, the insulator image content is globally constrained to remain unchanged, ensuring the integrity of the defect semantics. A perceptual loss is introduced to preserve image texture details and increase generation realism. The method requires no additional manual labeling and achieves end-to-end data generation. Unlike previous studies, which focus more on the contribution of generated samples to classification models, this paper aims to use the generated style-transfer samples to improve target detection performance.

The main contributions of this paper are as follows.

  • This paper introduces a groundbreaking StarGAN2-based style transfer method that significantly reduces the reliance on defective samples for insulator detection tasks by training with non-defective samples.

  • This innovative approach maintains defect semantics integrity through identity loss integration and enhances sample realism by preserving texture details via perceptual loss.

  • Empirically validated, our method not only proves its reliability but also significantly improves the detection performance of DNN and SSD models on real-sample test sets, as evidenced by substantial increases in the Ap and \(Ap^{75}\) metrics.

This paper is organized as follows. Section 1 briefly describes the background of the study, the problem, and the solution. Section 2 delves into related work, highlighting a range of sophisticated methods based on Generative Adversarial Networks (GANs). Section 3 introduces the StarGAN2 style transfer model, detailing its framework and applications. In Section 4, we meticulously describe the experimental setup and analyze the results. Finally, Section 5 offers a summary of our findings and suggests directions for future research.

2 Related Work

Generative Adversarial Networks (GANs) [13] have also been used for data augmentation owing to their strong generative power. Shorten et al. [32] used GAN-based approaches to generate data and combined them with geometric transformations to effectively improve classification performance. Thanaka et al. [38] used Deep Convolutional Generative Adversarial Networks (DCGAN) to generate data, and a model trained on the synthetic data performed similarly to one trained on the original data. Wang et al. [41] used a GAN to generate palm-print images to improve recognition rates. Although the above augmentation methods can improve model performance, training a GAN to generate target-distribution samples from the latent space requires a large amount of data, and once more data are available, GAN-based augmentation becomes less effective than geometric transformation [35].

GAN-based style transfer methods are also applied to data generation [6]. Hammami et al. [15] used CycleGAN [9] to generate training data to improve the accuracy of a Yolo detector. Yang et al. [46] proposed an unsupervised domain-adaptive SAR ship detection method that leverages CycleGAN-SCA for cross-domain feature interaction and data contribution balance. Yan et al. [45] introduced ADE-CycleGAN, a detail-enhanced image dehazing CycleGAN that retains detail information during the defogging process. Niu et al. [26] used the Surface Defect-Generation Adversarial Network (SDGAN) to convert non-defective samples into defective samples to compensate for the lack of training samples. There are many other similar methods for generating surface defects; He et al. [16] proposed a defect image generation method for defect detection based on progressive generative adversarial networks, which introduces a D2 adversarial loss function, a cycle consistency loss function, a data augmentation module, and a self-attention mechanism. Wang et al. [43] used the Star Generative Adversarial Network (StarGAN) [7] to generate different facial expression images to improve the expression recognition rate. Kanghyeok et al. [20] modified StarGAN to address its limitations in handling large-scale domain translations and its inability to capture subtle feature variations by integrating training-independent classifiers and data augmentation techniques. A et al. [3] proposed the Instance GAN (IGAN) for modeling complex distributions on datasets such as ImageNet and COCO-Stuff, learning the distribution around each data point in order to generate images that are close to photo-realistic. Pan et al. [28] proposed a Multi-Instance GAN to solve the Multi-Instance Multi-Label problem of overlapping signal waveform recognition, using adversarial learning to approximate the distribution of the training data and achieve better performance.

Currently, many advanced GAN-based methods are used in various fields. Qin et al. [29] improve CycleGAN for laser-visible face image translation by introducing an adversarial loss function and modifying the feature extraction module to enhance translation performance. Meftah et al. [25] explored GAN-based models for emotional speech transitions in English, evaluating their scalability and versatility across different speakers and moods. Kang et al. [18] propose GigaGAN, a model designed for large-scale text-to-image generation, trained on a large number of web-crawled text-image pairs. Gupta et al. [14] focus on generating novel 3D models from image data based on 3DGAN, unifying radiance fields and implicit surfaces to learn an underlying distribution from 2D images of objects in a class, allowing new objects of this class to be generated. However, there are relatively few studies on GAN-based insulator defect generation. In general, current GAN-based data generation methods either rely on a large number of samples or generate samples of insufficient quality. The advent of StarGAN overcomes the aforementioned limitation: its significant advantage is the ability to realize multi-domain image translation in a single model, meaning that a unified framework can convert input images into multiple output images with different styles or attributes. Thus, a StarGAN-based approach is well suited to the current scenario in which insulator defect samples are scarce.

Table 1 Data enhancement methods based on different GAN algorithms and their brief descriptions

3 Method

This section discusses the StarGAN2 style transfer model in Sect. 3.1, the loss function of StarGAN2 in Sect. 3.2, and finally proposes the improvements in Sect. 3.3.

3.1 StarGAN2 Style Transfer Model

StarGAN2 consists of four parts: the Generator G, the Discriminator D, the Mapping Network M, and the Style Encoder E. The Generator produces the image, the Discriminator distinguishes real from fake images, the Mapping Network generates the style code of the target domain, and the Style Encoder extracts the style code of a reference image. Figure 1 shows the Generator and Discriminator modules of StarGAN2, containing the adversarial and cyclic generation of samples.

Fig. 1
figure 1

StarGAN2 generates samples by style transfer. Given an input image \(x^a\), which belongs to image domain a, and a reference style code \(s^b\), the output \(G(x^a,s^b)\) with style \(s^b\) is obtained by feeding \(s^b\) and \(x^a\) into the generator G. The code \(s^b\) can be generated from a latent variable z via the Mapping Network M or from a reference image via the Style Encoder E. When the reference image style is extracted using the Style Encoder E, \(s^b\) reflects the style of the reference image

StarGAN2 necessitates the preliminary establishment of two distinct sets of image domains prior to training, which are essentially collections of visually distinguishable images. As illustrated in Fig. 1, \(X^a\) is categorized within the image style domain a, characterized by yellow background styles, whereas \(X^b\) is associated with the image style domain b, distinguished by green background styles. StarGAN2 introduces a novel concept of a style code, enabling multiple cross-domain and varied style transformations. In the context of generating defect samples for the specific task of insulator self-explosion, the style represents global information. From a comprehensive viewpoint, samples exhibiting local defects are minimally different from those without defects. Consequently, StarGAN2 can undergo training utilizing an extensive collection of non-defective samples. Upon completion of the training phase, the model is capable of transferring defect samples based on the style code of non-defective samples, thereby enhancing the authenticity and diversity of the generated samples.

3.2 Loss Function of StarGAN2

StarGAN2 facilitates style transfer through the formulation of various loss functions. Specifically, the adversarial loss guarantees the authenticity of the generated samples. Cycle consistency loss is employed for the transfer of unpaired images, ensuring that an image translated from one domain to another can be translated back to the original domain, thereby preserving content integrity. Diversity-sensitive loss encourages the generation of diverse outputs within the same domain, enhancing the variability of generated images. Finally, style reconstruction loss ensures that the Generator accurately interprets and applies the style code, maintaining consistency in style transfer.

In the following equations, \(G\), \(D\), \(E\), and \(M\) denote the Generator, Discriminator, Style Encoder, and Mapping Network, respectively. The input \(x^a\) represents an image in domain \(a\), while \(x^b\) represents an image in domain \(b\). Image domains \(a\) and \(b\) are associated with the domain labels \(0\) and \(1\), respectively. The variable \(z\) represents a randomly generated latent variable, with \(z \in Z\), where \(Z\) is a Gaussian-distributed latent space. The variable \(s\) denotes a fixed-dimensional image style code, generated either by \(M\) or by \(E\). As illustrated in Fig. 2a, by supplying \(M\) with a latent variable \(z\) and the target domain label \(1\), a style code for domain \(b\), \(s^b\), can be obtained; similarly, a style code for domain \(a\), \(s^a\), is obtained when the domain label \(0\) is provided. Conversely, as shown in Fig. 2b, by providing \(E\) with a reference image and a target domain label, the target-domain style codes \(s^a\) and \(s^b\) can be derived. Unlike the \(s\) generated by \(M\), the \(s\) obtained from \(E\) reflects the style of the reference image, enabling multiple samples to be generated from a single image based on different reference images.
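To make the two routes for obtaining a style code concrete, the following is a minimal PyTorch sketch with stand-in Mapping Network and Style Encoder modules. The architectures are drastically simplified, and the dimensions (latent_dim=16, style_dim=64, two domains) follow the public StarGAN v2 reference implementation rather than this paper, so they should be read as assumptions.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions following the public StarGAN v2 code, not this paper).
latent_dim, style_dim, num_domains = 16, 64, 2

class MappingNetwork(nn.Module):
    """Maps a latent z and a target-domain label y to a style code s (Fig. 2a)."""
    def __init__(self):
        super().__init__()
        # one small output head per domain; the label selects which head's output is used
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, style_dim))
             for _ in range(num_domains)])
    def forward(self, z, y):
        out = torch.stack([head(z) for head in self.heads], dim=1)  # (B, num_domains, style_dim)
        return out[torch.arange(z.size(0)), y]                      # pick the target-domain code

class StyleEncoder(nn.Module):
    """Extracts a style code from a reference image for a given domain (Fig. 2b)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.heads = nn.ModuleList([nn.Linear(32, style_dim) for _ in range(num_domains)])
    def forward(self, x, y):
        h = self.backbone(x)
        out = torch.stack([head(h) for head in self.heads], dim=1)
        return out[torch.arange(x.size(0)), y]

M, E = MappingNetwork(), StyleEncoder()
z = torch.randn(4, latent_dim)                 # z ~ N(0, I), sampled from the latent space Z
y_b = torch.ones(4, dtype=torch.long)          # target domain label 1 (domain b)
x_ref = torch.rand(4, 3, 512, 512)             # reference images assumed to come from domain b

s_b_latent = M(z, y_b)    # style code sampled from the latent space
s_b_ref = E(x_ref, y_b)   # style code reflecting the reference images
```

The per-domain output heads are what allow a single network to serve both domain labels 0 and 1.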

Fig. 2
figure 2

The process of generating style codes. a Generating different style codes by M. b Generating different style codes by E

Fig. 3
figure 3

Style diversity loss and Style reconstruction loss. Style diversity loss enables the generator to generate samples with distinct styles, while style reconstruction loss ensures that the generated samples align with the styles of the real samples

Equation (1) takes the transfer from domain a to domain b as an example, and it is similar when performing the transfer from domain b to domain a. The adversarial loss (Fig. 1) is

$$\begin{aligned} L_{\textrm{adv}}=\log D_{y^{a}}\left( x^{a}\right) +\log \left( 1-D_{y^{b}}\left( G\left( x^{a}, s^{b}\right) \right) \right) , \end{aligned}$$
(1)

where \(y^a\) and \(y^b\) are the labels of domain a and domain b, respectively. Feeding \(x^a\) and \(s^b\) into G yields the target-domain image \(G(x^a,s^b)\). D expects the discriminant result of the real image \(x^a\) to be 1 and that of the generated image \(G(x^a,s^b)\) to be 0, while G expects the discriminant result of the generated image to be 1. It is worth noting that, because the image and its domain label are input into D simultaneously, D can judge image authenticity separately for each output domain. Then, the cycle consistency loss (Fig. 1) is

$$\begin{aligned} L_{\textrm{cyc}}=\left\| x^{a}-G\left( G\left( x^{a}, s^{b}\right) , s^{a}\right) \right\| _{1}, \end{aligned}$$
(2)

where \(s^b\) is the style code produced by M or E (used alternately). G generates the output \(G(x^a,s^b)\) under reference style \(s^b\), and \(s^a\) is the style code of the input \(x^a\) extracted by E. The loss expects G to regenerate the original input \(x^a\) as the cycle-consistent output \(G(G(x^a,s^b),s^a)\), minimizing the distance between the input and the cycle-consistent output. Then, the style diversity loss (Fig. 3) is

$$\begin{aligned} L_{\textrm{ds}}=-\left\| G\left( x^{a}, s_{1}^{b}\right) -G\left( x^{a}, s_{2}^{b}\right) \right\| _{1}, \end{aligned}$$
(3)

where M maps the latent variables \(z_1\) and \(z_2\) to different style codes \(s_1^b\) and \(s_2^b\). The loss expects images generated from different style codes to differ as much as possible, which ensures the diversity of the generated image styles. The style reconstruction loss (Fig. 3) is

$$\begin{aligned} L_{\text{ sty } }=\left\| s^{b}-E_{y^{b}}\left( G\left( x^{a}, s^{b}\right) \right) \right\| _{1}. \end{aligned}$$
(4)

where \(G(x^a,s^b)\) is the target style image and E encodes the generated image as the style code \(E_{y^b}(G(x^a,s^b))\). The loss expects that a style code similar to \(s^b\) can still be extracted from the image transferred with style \(s^b\), which forces G to make use of the style code.

Fig. 4
figure 4

Identity loss. The purpose of identity loss is to minimize the difference between the generated image and the input image, so that the generated image can maintain the identity of the original image
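Before turning to the improvements, the following is a minimal PyTorch sketch of how Eqs. (1)–(4) can be combined in one training step. The module interfaces (D and E take a domain label alongside the image) mirror the description above, while the binary cross-entropy form of the adversarial terms and the function signature are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def stargan2_losses(G, D, E, x_a, y_a, y_b, s_a, s_b1, s_b2):
    """Sketch of Eqs. (1)-(4). D(x, y) is assumed to return the real/fake logit of the
    discriminator branch for domain y; E(x, y) returns the style code of x for domain y."""
    x_fake = G(x_a, s_b1)

    # Eq. (1): adversarial loss (non-saturating BCE form)
    d_real = D(x_a, y_a)
    d_fake_d = D(x_fake.detach(), y_b)           # detached copy so D's update ignores G
    loss_adv_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
               + F.binary_cross_entropy_with_logits(d_fake_d, torch.zeros_like(d_fake_d))
    d_fake_g = D(x_fake, y_b)                    # G wants its output to be judged real
    loss_adv_g = F.binary_cross_entropy_with_logits(d_fake_g, torch.ones_like(d_fake_g))

    # Eq. (2): cycle consistency -- translate back with the source style code s_a
    x_cyc = G(x_fake, s_a)
    loss_cyc = torch.mean(torch.abs(x_a - x_cyc))

    # Eq. (3): style diversity -- two style codes of domain b should give different outputs
    loss_ds = -torch.mean(torch.abs(x_fake - G(x_a, s_b2)))

    # Eq. (4): style reconstruction -- E should recover s_b1 from the generated image
    loss_sty = torch.mean(torch.abs(s_b1 - E(x_fake, y_b)))

    return loss_adv_d, loss_adv_g, loss_cyc, loss_ds, loss_sty
```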

3.3 Improvements

Unlike the Multimodal Unsupervised Image-to-Image Translation (MUNIT) approach, which segregates content and style codes through an encoding–decoding process, the StarGAN2 method extracts style codes without distinguishing between the image’s style and content. This distinction results in StarGAN2’s reliance on a global setting for loss construction. Consequently, the extracted style codes encapsulate not only stylistic attributes such as lighting and color but also morphological and textural information of the object. This methodology, while innovative, leads to the amalgamation of content information from the reference image during style transfer. Specifically, in the task of generating defective insulator samples, StarGAN2 may inadvertently produce artifacts, such as glass flakes, thereby compromising the integrity of defect semantics. Although the integration of cycle consistency loss aims to preserve content across domain transformations, it presupposes a bidirectional mapping relationship between image domains, a condition that is challenging to satisfy in practice. To address this issue and ensure the preservation of original data sample content, we introduce an identity loss component, as depicted in Fig. 4 and defined as

$$\begin{aligned} L_{\textrm{id}} = \Vert x^{a} - G(x^{a}, s^{a})\Vert _{1}, \end{aligned}$$
(5)

where \(s^a\) is the style code of the input image \(x^a\) extracted by E. Feeding the image \(x^a\) together with its own style code \(s^a\) into G yields the output \(G(x^a,s^a)\), and the identity loss computes the L1 distance between \(x^a\) and \(G(x^a,s^a)\). Because G receives a target style similar to that of the input, it updates fewer parameters. The identity loss constrains the consistency of the input and output images, so G changes the content of the input image less, ensuring that a defect sample remains a defect sample after the transformation.
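A minimal sketch of Eq. (5), assuming the same G and E interfaces as in the sketches above:

```python
import torch

def identity_loss(G, E, x_a, y_a):
    """Eq. (5): regenerate x_a under its own style code; the output should stay close to x_a."""
    s_a = E(x_a, y_a)                          # style code of the input itself
    x_id = G(x_a, s_a)                         # reconstruction under the input's own style
    return torch.mean(torch.abs(x_a - x_id))   # L1 distance
```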

As the generated image size grows to \(512 \times 512\), generation becomes gradually more difficult: the image becomes too smooth, loses many texture details, and its realism is significantly reduced. The cycle-consistent generation part of the style transfer process is similar to super-resolution image reconstruction. As shown in Fig. 5, \(x^a\) is fed into G to obtain \(G(x^a,s^b)\), and \(G(x^a,s^b)\) is fed into G to obtain \(G(G(x^a,s^b),s^a)\), which should be consistent with the detail-rich original input. Likewise, in image super-resolution, CNN networks are applied to encode and decode the images, and the generated results are made to approximate the high-resolution target through a mean-squared-error loss [10]. Therefore, we introduce techniques from super-resolution work into our method to improve the quality of the generated images.

Fig. 5
figure 5

Perceptual loss. The deep image features extracted by perceptual loss constructed using VGG network have stronger abstraction ability and richer semantic information, which helps to improve the quality and realism of the generated images

In super-resolution work, it is challenging to recover high-frequency information of the image, such as texture, using the mean-squared error. Because of the inherent uncertainty of this high-frequency information, the mean-squared error tends to settle on a pixel-mean solution rather than an exact one, resulting in poorer image quality. Research has shown that within pre-trained VGG networks, the lower level features predominantly replicate the precise pixel values of the input images [11], whereas the higher level features capture abstract elements such as contours and textures, which are essential for understanding the image's content. SRGAN [21], leveraging this insight, incorporates a pre-trained VGG network to construct a perceptual loss function. By extracting deep image features from the higher levels of the VGG network, this loss significantly enhances the quality of generated images. Perceptual loss, by focusing on these abstract features rather than pixel-to-pixel accuracy, facilitates a more nuanced and contextually rich image generation process. Analogously to the SRGAN method, we can optimize G by adding a similar loss to increase the detail of the output image. The perceptual loss (Fig. 5) is described as

$$\begin{aligned} L_{\textrm{per}}=\left\| V\left( x^{a}\right) -V\left( G\left( G\left( x^{a}, s^{b}\right) , s^{a}\right) \right) \right\| _{1}, \end{aligned}$$
(6)

where V is the pre-trained VGG network and \(G(G(x^a,s^b),s^a)\) is the result of the secondary generation from the input \(x^a\). As shown in Fig. 5, we minimize the difference between the VGG features \(V(x^a)\) and \(V(G(G(x^a,s^b),s^a))\).
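A minimal sketch of Eq. (6); vgg_features stands for any frozen feature extractor V (one possible construction is sketched in Sect. 4), and the G interface follows the earlier sketches:

```python
import torch

def perceptual_loss(vgg_features, G, x_a, s_a, s_b):
    """Eq. (6): compare deep features of the input and of the cycle-reconstructed image."""
    x_cyc = G(G(x_a, s_b), s_a)                # secondary (cycle-consistent) generation
    return torch.mean(torch.abs(vgg_features(x_a) - vgg_features(x_cyc)))
```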

4 Experiment

The experimental data in this paper are real insulator samples from a power system, containing a large number of non-defective samples and a small number of defective samples; the defect type is the typical insulator self-explosion. The data were captured across multiple seasons and can be divided into two image domains: data taken in spring and summer form one image domain, whose overall visual characteristic is a green style with abundant vegetation, while data taken in autumn and winter form the other image domain, whose overall visual characteristic is a yellow style with sparse vegetation. The data composition is as follows.

Table 2 Data distribution

To verify the enhancement brought by identity loss and perceptual loss on the generated samples, we perform ablation comparison experiments in Sect. 4.1. To verify the authenticity of the generated samples, we conduct sample replacement experiments based on Deep Neural Network (DNN) [36] and Single Shot Multibox Detector (SSD) [23] algorithms in Sect. 4.2. To verify the effectiveness of the samples generated by the method as a means of data augmentation, we conduct DNN and SSD sample augmentation experiments in Sect. 4.3. To distinguish from traditional data augmentation methods, we conduct comparative experiments based on DNN data augmentation methods in Sect. 4.4.

In the ablation comparison experiment, the Frechet Inception Distance (FID) [17] and Learned Perceptual Image Patch Similarity (LPIPS) [19] metrics are used to examine the effects of identity loss and perceptual loss on the quality of the generated samples. FID measures the difference between the generated images and the real images; the lower its value, the more similar the generated images are to the real ones. FID is defined as

$$\begin{aligned} FID = ||\mu _x - \mu _g||^2 + Tr(\Sigma _x + \Sigma _g - 2(\Sigma _x\Sigma _g)^{\frac{1}{2}}), \end{aligned}$$
(7)

where \(\mu _x\) and \(\mu _g\) are the mean vectors of the real image set and the generated image set in a chosen feature space, respectively, \(\Sigma _x\) and \(\Sigma _g\) are the corresponding covariance matrices in that feature space, and \(Tr\) is the trace operation, which computes the sum of the diagonal elements of a square matrix. LPIPS measures the visual similarity between images, reflecting differences in human visual perception; a lower LPIPS indicates that the images are visually closer. It is defined as

$$\begin{aligned} LPIPS(x, y) = \sum _{l} w_l \cdot \frac{1}{H_lW_l} \sum _{h,w} ||\phi _l(x)_{h,w}- \phi _l(y)_{h,w}||^2, \end{aligned}$$
(8)

where \(x\) and \(y\) are the two compared images, \(l\) indexes a layer of the convolutional network, \(w_l\) is the weight of the \(l\)-th layer, \(H_l\) and \(W_l\) are the height and width of the \(l\)-th layer's feature map, and \(\phi _l(x)_{h,w}\) and \(\phi _l(y)_{h,w}\) are the feature vectors at position \((h, w)\) of the \(l\)-th layer's feature maps of \(x\) and \(y\), respectively. In the sample replacement experiments, sample augmentation experiments, and data augmentation comparison experiments, the average precision (Ap), \(Ap^{50}\), and \(Ap^{75}\) metrics are applied to evaluate the results. In the detection experiments based on DNN and SSD, for each class, Ap is the area under the curve of precision as a function of recall, which is defined as

$$\begin{aligned} AP = \int _{0}^{1} p(r) \, dr. \end{aligned}$$
(9)

where \(p(r) = \frac{TP}{TP + FP}\) is the precision, defined as the ratio of correctly detected positive samples (TP) to all detected samples (\(TP+FP\)), evaluated as a function of recall \(r\). \(Ap^{50}\) is the AP calculated at a specific IoU (Intersection over Union) threshold of 0.5, where IoU is the ratio of the area of overlap between the predicted bounding box and the ground-truth bounding box to the area of their union and is used to assess the accuracy of the prediction. \(Ap^{75}\) is defined analogously with an IoU threshold of 0.75.
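For reference, the two closed-form metrics above can be computed directly from feature statistics and precision-recall points; the sketch below assumes the features have already been extracted (e.g., Inception-v3 pool features for FID) and uses trapezoidal integration to approximate Eq. (9).

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_real, feat_gen):
    """Eq. (7): feat_real/feat_gen are (N, d) feature matrices of real and generated images."""
    mu_x, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_x = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    cov_sqrt = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(cov_sqrt):          # numerical noise can introduce tiny imaginary parts
        cov_sqrt = cov_sqrt.real
    return np.sum((mu_x - mu_g) ** 2) + np.trace(sigma_x + sigma_g - 2.0 * cov_sqrt)

def average_precision(precision, recall):
    """Eq. (9): area under the precision-recall curve, approximated by trapezoidal integration."""
    order = np.argsort(recall)
    return np.trapz(np.asarray(precision)[order], np.asarray(recall)[order])
```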

The style transfer model was trained using PyTorch 1.4 with two RTX 2080 Ti (11 GB) GPUs. The model input size is set to \(512\times 512\), and the batch size is set to 2. The coefficient of each loss weight is 1. The Adam optimizer is used with its decay coefficient set to 0.99, the learning rate of G, D, and E set to 0.0001, and the learning rate of M set to 0.000001 [8]. Specifically, the learning rate determines how quickly a model updates its parameters in response to the error it observes. A higher learning rate can lead to faster convergence but may overshoot the optimal solution, while a lower learning rate ensures more stable convergence but requires more training iterations. For G and E, a carefully chosen learning rate ensures effective style transfer and encoding; for D, it balances discrimination accuracy; for M, a lower learning rate fine-tunes the style mapping with greater precision.
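A minimal sketch of the optimizer setup described above; the Adam beta values are taken from the public StarGAN v2 implementation [8] and are an assumption rather than a value stated in this paper.

```python
import torch

def build_optimizers(G, D, E, M):
    """Adam optimizers with the learning rates given above (1e-4 for G, D, E and 1e-6 for M).
    betas=(0.0, 0.99) follows the public StarGAN v2 code and is an assumption here."""
    make = lambda net, lr: torch.optim.Adam(net.parameters(), lr=lr, betas=(0.0, 0.99))
    return make(G, 1e-4), make(D, 1e-4), make(E, 1e-4), make(M, 1e-6)

# All loss weights are 1, so the generator objective is a plain (unweighted) sum of the loss terms.
```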

The VGG network is pre-trained on ImageNet. Following SRGAN, we do not use batch normalization layers and remove the fully connected layers, so the network can accept inputs of multiple sizes; the output is taken from the fourth convolutional layer of convolution block 5 (before its pooling layer). This convolution block is a deep structure that can effectively extract the high-level semantic features of images.
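The following sketch shows one way to build such a feature extractor with torchvision. The cut point at the fourth convolution of block 5 (the VGG54 feature used by SRGAN) is our reading of the description above, and VGG-19 without batch normalization is assumed.

```python
import torch.nn as nn
from torchvision.models import vgg19

def build_vgg_features():
    """Frozen VGG-19 (no batch norm) feature extractor for the perceptual loss.
    The FC layers are dropped so multi-size images can be fed in; the slice below
    ends at relu5_4 (the 4th conv of block 5) -- an assumed cut point."""
    features = vgg19(pretrained=True).features[:36]   # conv1_1 ... relu5_4, no classifier head
    for p in features.parameters():
        p.requires_grad = False                       # fixed weights, used only for features
    return features.eval()
```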

4.1 Ablation Comparison Experiments

Ablation experiments were performed to verify the validity of each newly added loss. Current GAN-based sample generation commonly uses the FID distance to assess generation quality and the LPIPS score to assess generation diversity. FID uses a pre-trained Inception-v3 network to extract features and calculates the difference between the generated-sample features and the real target-sample features; a smaller value indicates better quality of the generated samples. The LPIPS score uses a pre-trained AlexNet to calculate the similarity between two generated images; a higher score indicates richer diversity of the generated samples. It is worth noting that identity loss and perceptual loss globally constrain the content information of the image, so non-defective samples can be used for training. After the model is trained, defect samples and reference images are needed to generate new samples. For quantitative evaluation, each image of the test set was transferred based on 10 different reference images. As shown in Fig. 6, each newly added loss reduces the FID score, indicating that the quality of the generated samples has improved. Meanwhile, higher LPIPS scores are obtained with our method, indicating that the diversity of the generated samples has also increased.

Fig. 6
figure 6

a Scores for FID. b Scores for LPIPS. A lower FID value indicates that the generated image is closer to the real image and performs better, and a higher LPIPS value indicates richer diversity of the generated samples

We also qualitatively compare the effect of each loss; the generation results of the models trained for the same number of iterations are shown in Fig. 7. The first column (Source) is the input defect sample and the second column (Ref) is the reference image. When only the StarGAN2 method is used (Star), insulator glass pieces appear at the defect locations and the defect semantics are destroyed. When the identity loss is added alone (Star+Id), the overall image deformation is reduced and almost no insulator glass pieces are generated at the defects, but the images tend to become smooth. When the perceptual loss is added alone (Star+Per), the overall texture detail of the images is improved, but in the images at row two, column five and row four, column five, some insulator glass pieces are still generated at the defects. When both losses are added simultaneously (Star+all), the generated images better retain the semantic morphology of the defects and the texture detail is also improved.

Fig. 7
figure 7

Ablation map multiple generation results

Fig. 8
figure 8

Ablation map multiple generation results

In Fig. 8, the first column shows the reference non-defective samples, the first row shows the input defect samples, and the rest are the generated defect samples. Each row of generated samples retains the content information of the input image and incorporates the style of the reference image, while the defect semantics are preserved.

4.2 Sample Replacement Experiments Based on DNN and SSD

To verify the authenticity of the generated samples, we conducted sample replacement experiments based on two mainstream single-stage detection models, DNN and SSD. From the data distribution in Table 2, there are 274 defect samples, of which 154 are in the green style and 120 in the yellow style. After the style transfer model is trained, we obtain the transfer model between the yellow and green style domains and generate defective insulator images in these two styles based on different reference images. Each defect source image generates one transferred sample (a 1:1 ratio) based on a real reference image, and no defect samples are used as reference images. The generated samples are labeled with Labelme and converted to VOC format, and the samples are divided into training, validation, and test sets in the ratio 0.65:0.1:0.25. The input image size is set to \(416\times 416\) for DNN and \(300\times 300\) for SSD.
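A small sketch of the 0.65/0.1/0.25 split described above; the function name and the fixed random seed are illustrative assumptions.

```python
import random

def split_dataset(samples, ratios=(0.65, 0.10, 0.25), seed=0):
    """Split the labeled (VOC-format) samples into train/val/test with the ratios above."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)           # fixed seed only for reproducibility of the sketch
    n = len(samples)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]
```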

Moreover, the above methods apply the mainstream online data augmentation used in target detection (image shift, mirror flip, color jittering). We use \(Ap^{0.5:0.95}\) (the mean of the average precision computed at 10 IoU thresholds from 0.5 to 0.95 in increments of 0.05, abbreviated as Ap hereafter), \(Ap^{50}\) (IoU=0.5), and \(Ap^{75}\) (IoU=0.75) as the test-set evaluation metrics. To prevent overfitting, we used the Keras ReduceLROnPlateau and EarlyStopping methods: the learning rate is halved when the validation loss does not decrease for three epochs, and training is stopped when the validation loss does not reach a new minimum for ten epochs.
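A minimal sketch of this callback configuration in Keras, monitoring the validation loss as described:

```python
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

# Callbacks matching the schedule described above.
callbacks = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1),  # halve LR after 3 stagnant epochs
    EarlyStopping(monitor='val_loss', patience=10),                            # stop after 10 epochs without a new minimum
]
# model.fit(..., validation_data=val_data, callbacks=callbacks)  # hypothetical detector training call
```

The experimental results are as follows.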

Table 3 Sample replacement experiment

As shown in Table 3, the first group of DNN achieved the highest scores when trained and tested with real data. In the second group, after replacing the real training set with synthetic samples, \(Ap^{50}\) dropped slightly, \(Ap^{75}\) dropped by 18.3%, and Ap dropped by 4.8%, indicating that the distribution of the synthetic samples matches the real sample distribution to some extent. The third group was trained on real data and tested on synthetic data: \(Ap^{50}\) still reached 81.2%, \(Ap^{75}\) reached 38.2%, and Ap reached 42.6%, indicating that the generated samples have a certain degree of realism. The fourth group was trained and tested on synthetic data: \(Ap^{50}\) improved by 5%, \(Ap^{75}\) by 1.6%, and Ap by 1.7% compared with the third group. For the SSD algorithm, we obtain similar results. Overall, the highest scores are recorded when both training and testing use real data, setting a performance benchmark. A notable observation is the drop in performance, especially in \(Ap^{75}\), when generated data are used for training, indicating a disparity between the generated and real data distributions. However, the generated data's efficacy is evident when it is used for both training and testing, showing an improvement over scenarios where the training and testing datasets differ; this highlights the adaptability of models to consistent data environments. The comparative analysis of the DNN and SSD models reveals a consistent trend: models trained with generated data and tested with real data perform better than those trained with real data and tested with generated data, suggesting the generated data's utility for training but also its limitations in fully replicating real data characteristics.

Fig. 9
figure 9

The effect of target detection. The DNN detection model trained with real samples is used to test some of the generated samples

4.3 Sample Augmentation Experiment Based on DNN and SSD

To verify the effectiveness of training with generated-sample augmentation, we use the following experimental configuration. First, 25% of the 274 real samples are randomly selected as the test set, and all augmentation experiments are evaluated on this fixed test set. Second, from the remaining samples we generate 1–6 times as much data based on reference images and add the generated data to the remaining defect samples to form the experimental sets. Finally, because data augmentation changes the number of training samples, the remaining images were also expanded 1–6 times by rotating them at different angles to form the control groups. Rotational augmentation is used for the control group because the rotated images differ less from the real images than other augmentation means; when rotational augmentation is used, the online augmentation does not include mirror flipping.
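The control-group expansion can be sketched as follows; the specific rotation angles are an illustrative assumption (the text only states "different angles"), and the corresponding bounding-box annotations would need to be transformed accordingly (not shown).

```python
import os
from PIL import Image

def rotation_copies(image_path, out_dir, times):
    """Create `times` rotated copies of one image for the rotational-augmentation control group."""
    angles = [15, -15, 30, -30, 45, -45][:times]   # illustrative angles, not from the paper
    img = Image.open(image_path)
    os.makedirs(out_dir, exist_ok=True)
    for i, angle in enumerate(angles, 1):
        img.rotate(angle, expand=True).save(
            os.path.join(out_dir, f"rot{i}_{os.path.basename(image_path)}"))
```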

For the training set, 10% is randomly selected for validation and 90% for training; note that the synthetic samples may also be randomly assigned to the validation data. We again use the DNN and SSD detection algorithms for validation, with online data augmentation and ReduceLROnPlateau with EarlyStopping to prevent overfitting and reduce randomness. The Ap, \(Ap^{50}\), and \(Ap^{75}\) results are shown in Fig. 10, where \(R1+\)–\(R6+\) are the control training groups with 1–6 times rotational augmentation and \(S1+\)–\(S6+\) are the experimental training groups with 1–6 times style transfer augmentation.

Fig. 10
figure 10

Experimental results based on DNN, SSD sample augmentation

\(Ap^{50}\) scores are shown in Fig. 10a and b. At IoU=0.5, both style transfer augmentation and rotational augmentation can improve the \(Ap^{50}\) scores of the DNN and SSD detection models, but the improvement is unstable and small, and in some cases the scores even decrease.

\(Ap^{75}\) scores are shown in Fig. 10c and d. At IoU=0.75 the detection task becomes harder, and style transfer augmentation substantially improves the evaluation scores of both detection models, with a maximum improvement of 13.37% for DNN, which is 7.5% higher than the rotational augmentation method, and a maximum improvement of 8.37% for SSD, which is 3.95% higher than rotational augmentation. The rotational-augmentation control group also improved the SSD score, but the improvement was significantly lower than that of style transfer augmentation. For DNN, rotational augmentation makes the model performance fluctuate, and its improvement is much lower than that of the style transfer method.

Ap scores are shown in Fig. 10e and f. When the IoU is varied from 0.5 to 0.95 in increments of 0.05 and the mean average precision Ap over the 10 settings is computed, DNN shows a maximum improvement of 5.85% with style transfer augmentation, 3.53% more than rotational augmentation; SSD shows a maximum improvement of 4.16%, 1.24% more than rotational augmentation. With the style transfer augmentation method, the overall Ap scores of both algorithms improve significantly, and the improvement is larger than that of the control group.

We also found that a style transfer augmentation multiplier of 4–6 times yields higher Ap scores for the two models above, and a multiplier of 3–5 times yields higher \(Ap^{75}\) scores, while different multipliers do not stably improve the \(Ap^{50}\) score and may even cause it to drop.

The presented data compellingly demonstrate the efficacy of the style transfer enhancement approach in augmenting the accuracy of DNN and SSD detection models. The impact is particularly pronounced in the \(Ap^{75}\) scores, where the style transfer approach outperforms rotational enhancement. The study elucidates that the optimal multipliers for enhancement vary, indicating a nuanced relationship between the strength of enhancement and model performance. However, it is noted that the benefits in \(Ap^{50}\) scores are less consistent, occasionally leading to a decrease in performance, underscoring the complex dynamics of model enhancement through style transfer augmentation.

4.4 Comparison of Data Augmentation Methods

To further examine the difference between style transfer augmentation and other methods, it is compared with the image shift + random image flip (IS+IF) and color jittering (CJ) augmentation methods. Based on the results above, a threefold sample expansion is chosen, with which DNN obtains a higher performance improvement.

Table 4 Data augmentation and comparison experiment

As shown in Table 4, \(Ap^{50}\), \(Ap^{75}\), and Ap reached 87.97%, 54.25%, and 48.73% when no data augmentation was used. When IS+IF, CJ, or S3+ was used alone, the metrics improved to different degrees. When IS+IF or CJ was combined with S3+, there was a further improvement over using either alone. Using all three together, \(Ap^{50}\), \(Ap^{75}\), and Ap achieved the best results of 94.98%, 68.78%, and 57.82%. As can be seen, the style transfer augmentation in this paper complements the other data augmentation methods well. We compared not only with traditional methods but also with recent advanced deep-learning-based methods, and the results show that our method performs better under the \(Ap^{50}\) and Ap metrics. This comparison validates the effectiveness of our approach.

4.5 Analysis of Experimental Results

The method in this paper innovatively trains with real non-defective insulator samples, which allows a small number of insulator defect samples to be expanded substantially. Since the original StarGAN2 method cannot distinguish between style and content, the generated defect samples tend to incorporate content information from the reference image, which alters the defect semantics. The ablation comparison experiments in Sect. 4.1 show that identity loss strengthens the morphological constraints on the generated samples and ensures the semantic integrity of the defects, but it reduces the Generator's ability to reproduce texture details. Constructing a perceptual loss from deep semantic features extracted by VGG increases the Generator's ability to reproduce texture details and improves the realism of the generated samples; at the same time, the improvements also increase generation diversity. The sample replacement experiments based on DNN and SSD in Sect. 4.2 show that the accuracy of a model trained with real samples and tested with synthetic samples is lower than that of a model trained with synthetic samples and tested with real samples. This is because the synthetic samples retain the semantic information of the real images and combine the reference images' stylistic features; thus, the defect set to which the synthetic samples belong has a wider range and better generalization. The DNN and SSD sample augmentation experiments in Sect. 4.3 show that augmentation training with style transfer samples significantly improves the Ap and \(Ap^{75}\) scores of the models on the real test set. This is because training with diverse, perturbed augmented data can strengthen the weaknesses in the model's decision boundary and improve its robustness, thereby expanding the detection range. The comparison of data augmentation methods in Sect. 4.4 shows that style transfer augmentation contributes to detection performance along a different direction than image shift + random image flip and color jittering, and the methods do not conflict with one another.

5 Conclusion and Future Work

In this paper, we adeptly tackled the paucity of defective power insulator samples by developing a novel style transfer methodology based on StarGAN2, significantly curtailing the energy expenditure inherent in manual sample collection. Our approach integrates identity loss, preserving the integrity of defect semantics, and perceptual loss, augmenting the quality of generated samples. This method, negating the need for additional manual labeling and facilitating end-to-end training, demonstrates promising potential for broader applicability in power defect detection. Empirical enhancements in Ap and \(Ap^{75}\) scores of DNN and SSD models attest to the method’s efficacy, enriching conventional data augmentation techniques. Nonetheless, its performance in complex scenarios and detailed reconstruction necessitates further exploration, directing future research endeavors.

However, the current method still has shortcomings: it is less effective at transferring samples with overlapping insulators, and it remains challenging to reconstruct images with finer pixel details. While the proposed StarGAN2 style transfer method demonstrates substantial efficacy in generating artificial insulator defect samples for electric power systems, its application to other contexts may encounter certain limitations. For instance, the method's performance could vary when adapting to different types of defects, including corrosion detection on metallic parts of power systems, cable damage identification, and structural integrity assessment of electrical towers, particularly in industries with vastly different operational and environmental conditions. Future work is warranted to explore and mitigate these limitations, ensuring the broader applicability and adaptability of our approach in varying real-world scenarios.