1. Introduction
Is seeing really believing? The answer is no. With rapid advances in AI-generated content (AIGC) technology for image processing, individuals can now produce highly realistic images using generative models. However, generative models are a double-edged sword. While they offer numerous conveniences in areas such as image editing [1], image repair [2], and image fusion [3], when AI-tampered images are used with malicious intent, people cannot discern their authenticity through visual inspection alone. Furthermore, traditional detection models struggle to accurately assess image authenticity, let alone pinpoint the manipulated regions. As illustrated in Figure 1a, traditional detectors frequently misclassify real and forged images, exposing humans to unprecedented security risks in the domains of information and cognitive security.
The most prevalent image generation methods fall into three main categories: Variational Auto-Encoders (VAEs) [4], Generative Adversarial Networks (GANs) [5], and Diffusion Models (DMs) [6]. These generative models fundamentally learn the distribution of images from extensive training datasets in order to produce similar images. Initially, generative models such as DDPM [7] and GDDIM [8] focused on generating complete images by learning the image distribution, so that every pixel was synthetic. However, as generative models have advanced, recent methods support partial image editing, known as "inpainting"; examples include Inpaint Anything [3] and HD-Painter [2]. As depicted in Figure 1b, tampered images now contain both real and synthetic pixels, with the attacker's primary focus being the manipulated region. In contrast to the complex and intricate requirements of traditional manual image tampering, generative-model-based partial editing significantly simplifies the tampering process, requiring only input instructions. This approach is gradually becoming the mainstream method of image tampering, posing greater challenges for the identification of synthetic images.
Indeed, the detection and localization of forged images have long been a pivotal research area within artificial intelligence security. When an image is found to be forged, people want not only to identify the forgery but also to pinpoint the specific forged region within the image. This makes it possible to understand the attacker's intentions and ultimately mitigate the adverse effects of such forged information.
Previous studies [9,10,11,12,13] have attained notable accomplishments in forged image detection and localization. However, most of these methods focus on traditional manual tampering techniques, such as copy-move and splicing, or are restricted to classifying image authenticity. Consequently, their efficacy in detecting and localizing the latest AI-generated forged images is significantly diminished. Hence, beyond the legal constraints imposed on generation models, it is imperative to improve the detection and localization of forged images at the technical level.
This paper addresses the challenges posed by AI-generated forged images. We propose a progressive hierarchical network, based on UNet, for the refined detection and localization of forged images. First, to facilitate progressive detection and localization, we re-classified the forged images in our previously established AI-generated forged image dataset and assigned more reasonable multi-level labels. The hierarchical classification and multi-level labels are depicted in Figure 1c. Specifically, from the first to the fourth layer, we gradually subdivide the image attributes, transitioning from the coarse-grained category of authenticity to fine-grained forgery methods. For instance, the forged image in Figure 1b is assigned the multi-level label 'Forgery -> partial tampering -> DM -> INDM'.
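The four-level labels can be viewed as root-to-leaf paths in a small taxonomy, with one classification target per level. The sketch below illustrates this idea; only the example path 'Forgery -> partial tampering -> DM -> INDM' comes from the text, while the remaining branch names in the toy taxonomy are our illustrative assumptions.

```python
# Hypothetical four-level label taxonomy; only the path
# 'Forgery -> partial tampering -> DM -> INDM' is given in the paper,
# the other branch names are illustrative placeholders.
TAXONOMY = {
    "Real": {},
    "Forgery": {
        "full synthesis": {"GAN": {}, "DM": {}},
        "partial tampering": {"GAN": {}, "DM": {"INDM": {}}},
    },
}

def path_to_level_labels(path, max_levels=4):
    """Expand a root-to-leaf path into one label per hierarchy level,
    padding shallow paths (e.g. 'Real') with None for unused levels."""
    labels = list(path) + [None] * (max_levels - len(path))
    return labels[:max_levels]

print(path_to_level_labels(["Forgery", "partial tampering", "DM", "INDM"]))
# -> ['Forgery', 'partial tampering', 'DM', 'INDM']
print(path_to_level_labels(["Real"]))
# -> ['Real', None, None, None]
```

Each level of such a label can then supervise one stage of a coarse-to-fine detector, from authenticity down to the specific forgery method.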
Furthermore, it has been observed in [11,14,15] that images generated by different forgery methods exhibit distinct frequency-domain deviations, which can serve as forgery fingerprints for detection tasks. We have also noticed that the generation process of fake images is closely related to noise. Moreover, the quality of generative models varies, often leading to color fluctuations in the forged regions. Therefore, drawing inspiration from previous methods [10,12,16], we jointly exploit multiple types of image features from both the spatial and frequency domains. In our method, the spatial-domain features comprise RGB features, which capture abnormal fluctuations in color space, and noise features, which mine the noise level of the image. The frequency-domain features are extracted with multiple Laplacian operators to capture frequency fluctuations in the image. In addition, to learn richer feature representations, we fuse the spatial features with a dual-branch attention fusion module, in which we introduce external attention [17] to model the positional relationships of the spatial features. At the same time, we use the detection and localization results as thresholds to select channel features that are strongly related to detection performance. Applying these channel thresholds to the spatial features enhances features strongly related to the task and suppresses weakly related ones.
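The frequency branch described above can be sketched as high-pass filtering with Laplacian-style kernels, whose responses expose the frequency fluctuations that act as forgery fingerprints. The minimal sketch below uses two standard discrete Laplacian kernels; the paper's exact choice of operators and scales is not specified, so these kernels are an assumption.

```python
import numpy as np

# Two common discrete Laplacian kernels (4- and 8-neighbour); the paper's
# exact operators are not given, so these are illustrative assumptions.
LAPLACIANS = [
    np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float),
    np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=float),
]

def conv2d_same(img, kernel):
    """Naive 'same' 2D convolution with zero padding (for clarity, not speed)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def laplacian_features(img):
    """Stack the high-pass responses of all kernels along a channel axis."""
    return np.stack([conv2d_same(img, k) for k in LAPLACIANS])

# A flat (constant) image has zero high-frequency response in its interior.
flat = np.ones((8, 8))
feats = laplacian_features(flat)
print(feats.shape)                          # (2, 8, 8)
print(np.abs(feats[:, 2:-2, 2:-2]).max())   # 0.0
```

Regions whose high-pass statistics deviate from the rest of the image are candidate forgery traces; a learned network would operate on these stacked responses rather than on hand-set rules.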
Subsequently, because the size of the tampered region varies across generative models, a fixed-resolution feature map restricts the feature field of view for forged regions of different scales. Therefore, to mitigate the adverse effects of scale variation in the forged region on detection and localization, we employ three sets of densely interconnected hierarchical networks. Through multiple upsampling and downsampling operations, these networks enable comprehensive information exchange between features of different resolutions, preserving both local and global features and addressing the issue of a limited feature field of view.
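The dense cross-resolution exchange can be sketched as follows: each branch receives the sum of every branch's features resized to its own resolution. Mean pooling for downsampling and nearest-neighbour upsampling are simplifying assumptions; the actual module would use learned convolutions.

```python
import numpy as np

def downsample(x, factor):
    """Mean-pool a 2D map by an integer factor (illustrative stand-in)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(x, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), np.repeat(factor, 1)[0], axis=1)

def exchange(branches):
    """branches: list of 2D maps whose sizes halve from branch to branch.
    Each output is the sum of all branches resized to that resolution."""
    fused = []
    for i, target in enumerate(branches):
        acc = np.zeros_like(target)
        for j, src in enumerate(branches):
            f = 2 ** abs(i - j)
            if j < i:        # higher-resolution source: pool down
                acc += downsample(src, f)
            elif j > i:      # lower-resolution source: upsample
                acc += upsample(src, f)
            else:
                acc += src
        fused.append(acc)
    return fused

maps = [np.ones((8, 8)), np.ones((4, 4)) * 2, np.ones((2, 2)) * 3]
fused = exchange(maps)
print([m.shape for m in fused])   # [(8, 8), (4, 4), (2, 2)]
print(fused[0][0, 0])             # 1 + 2 + 3 = 6.0
```

After the exchange, every resolution carries information from all the others, so small and large forged regions are both visible at every scale.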
Finally, we adopt a hierarchical and progressive approach to detection and localization. To establish dependencies between feature maps of different resolutions, we integrate a multi-scale feature interaction module into the UNet [18] network structure. Using the decoder, we fuse low-resolution feature maps with high-resolution feature maps from bottom to top. Additionally, we use the detection and localization results from the low-resolution feature maps as priors to guide detection and localization at higher resolutions. The experimental results demonstrate the effectiveness of this approach: the coarse outcomes from the low-resolution feature maps prove beneficial as priors for the higher-resolution stages. In the skip connections of our UNet structure, we introduce a convolutional block attention module with soft-threshold constraints (t-CBAM) to capture rich contextual dependencies. The thresholds are obtained by multiplying the channel and spatial average pooling with the channel attention weights and the spatial attention weights, respectively. With the proposed model, we achieve significant improvements in the detection and localization accuracy of AI-tampered images, surpassing the baseline methods.
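The channel branch of such a soft-threshold attention module can be sketched as follows: a per-channel threshold is formed by multiplying the channel's average pooling with its attention weight, and is then applied by soft shrinkage. The attention logits below stand in for the learned channel-attention sub-network, and this is a sketch under those assumptions rather than the paper's exact t-CBAM.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold_channel(x, attn_logits):
    """Illustrative channel branch of a soft-threshold attention block.
    x: (C, H, W) feature map; attn_logits: (C,) stand-in attention scores.
    Threshold per channel = attention weight * average pooling of |x|,
    applied via soft shrinkage: sign(x) * max(|x| - tau, 0)."""
    tau = sigmoid(attn_logits) * np.abs(x).mean(axis=(1, 2))   # (C,)
    tau = tau[:, None, None]
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([[[1.0, -0.2], [0.1, 2.0]]])          # one channel, 2x2
out = soft_threshold_channel(x, np.array([10.0]))   # sigmoid(10) ~ 1
# threshold ~ mean(|x|) = 0.825, so the small activations are zeroed
print(out.round(3))
```

Soft shrinkage keeps the strong responses (shifted toward zero by the threshold) and removes weak, likely irrelevant activations, which matches the stated goal of retaining the main semantic features in the skip connections.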
The contributions of this article are the following:
This article combines external attention and channel attention into a dual-branch attention feature enhancement module, using the feedback from the detection and localization results as dynamic thresholds to enhance strongly related features and suppress weakly related ones.
This article combines a hierarchical network with a UNet structure equipped with soft-threshold attention, establishing hierarchical dependency relations.
This article proposes a hierarchical and progressive forged image detection method, HPUNet, which achieves accurate detection and localization of AI-generated forged images and further improves detection and localization accuracy over the baseline methods.
This work extends our previous research [19] in several key aspects. First, instead of merely discussing whether an image is generated from text, we assign more reasonable multi-level labels to our AI-generated forged image dataset, so that the hierarchical detection results align better with human cognition. Second, we introduce an external attention mechanism to optimize the spatial attention over the features, and we use the detection results as dynamic thresholds to constrain the dual-branch feature fusion within the context feature enhancement module, amplifying the task-relevant features while suppressing the weakly relevant ones. Third, we incorporate the UNet network structure, leveraging the decoder to connect feature maps of different resolutions, and introduce a soft-threshold dual-attention mechanism in the skip connections to retain the main semantic features and eliminate irrelevant ones.
5. Conclusions
In this paper, we propose a progressive UNet-based network for the detection and localization of AI-generated forged images, achieving further improvements over the baseline methods. The approach combines spatial-domain noise and RGB features with frequency-domain features to capture richer forgery traces. First, the image features are extracted through a context feature enhancement module and fused in a dual-branch attention fusion module that incorporates multiple types of image features. Then, the detection results are used as dynamic thresholds to enhance the task-relevant features and suppress the task-irrelevant ones. Subsequently, a multi-branch feature interaction network enables information exchange between features of different resolutions, addressing the limited field of view of the feature maps caused by the varying sizes of forged regions. Finally, in a hierarchical network structure, the decoder correlates feature maps of different resolutions, and a soft-threshold dual-attention feature enhancement mechanism is introduced in the skip connections to preserve the main features. The model progressively refines the higher-level results under the guidance of the lower-level results in a hierarchical manner.
We have conducted extensive experiments to demonstrate the effectiveness of our method for detecting and localizing AI-generated fake images. We uniformly evaluated HPUNet and the baseline models on the AITfake dataset, and our method achieved the best scores on both the fake image detection and localization tasks. We then conducted cross-dataset validation on four open datasets: CoCoGLIDE, HIFI-IFDL, GenImage, and CASIA. The results show that HPUNet is more stable than the baseline models. We also performed extensive ablation experiments to verify the effectiveness of HPUNet from multiple perspectives.