1. Introduction
During image acquisition and transmission, the original image is often corrupted by noise introduced by the imaging equipment and the transmission channel. This noise causes a loss of useful image information and degrades subsequent analysis and processing tasks such as image segmentation, target recognition, and edge extraction. Accordingly, image denoising has become a classic problem and a popular research topic in vision applications and image processing. An efficient image-denoising algorithm removes the noise while preserving the structural information of the image, which benefits downstream image-processing tasks and is useful in remote sensing, medical imaging, surveillance, and other fields [1,2].
Current image-denoising algorithms can be classified into two main types: conventional denoising algorithms and deep learning-based denoising algorithms. Conventional methods mainly exploit the structural properties of the image itself, as in algorithms based on nonlocal self-similarity or on sparse representations. Filter-based methods include Gaussian filtering [3,4], bilateral filtering [5,6], and median filtering [7,8]. Nonlocal self-similarity algorithms take advantage of the fact that patches in a natural image resemble one another: they search the whole image for patches similar to the one centered on the current pixel and process these similar patches jointly. Representative methods are nonlocal means (NLM) [9,10,11], block-matching and 3D filtering (BM3D) [12], and weighted nuclear norm minimization (WNNM) [13]. Classical sparse representation-based denoising methods include the dictionary learning algorithm K-SVD [14] and nonlocally centralized sparse representation (NCSR) [15,16]. However, such methods must first establish prior information about the image and then solve the model iteratively with an optimization algorithm. As a result, conventional denoising methods are time-consuming and computationally expensive, require manual parameter tuning, and generalize poorly. They are also prone to image blurring and loss of detail.
With the growth in computing power, researchers have gradually introduced deep learning methods into image processing, and deep learning is now applied broadly in computer vision [17,18,19,20,21,22]. The main idea of deep learning-based denoising is to train a deep neural network end to end on a large number of noisy and clean image pairs. Schmidt and Roth proposed the cascade of shrinkage fields (CSF) [23], which unifies random field-based models and unrolled half-quadratic optimization in a single learning framework. Chen et al. [24] proposed the trainable nonlinear reaction–diffusion (TNRD) model. Burger et al. [25] implemented image denoising with a multilayer perceptron (MLP). Zhang et al. [26] proposed the deep denoising convolutional neural network DnCNN, which was the first to apply batch normalization [27] and residual learning [28] to image denoising and handles uniform Gaussian noise effectively. Subsequently, Zhang et al. [29] proposed FFDNet, which takes the noise level and the noisy image as joint inputs so that a single trained model can handle different noise levels. To further improve denoising performance, Tian et al. [30] proposed an enhanced convolutional network, ECNDNet, which combines dilated convolution with ordinary convolution to enlarge the receptive field of the network. The work in [31] introduced residual optimization into a convolutional neural network to address the vanishing gradients that arise during propagation as the number of layers grows. The work in [32] first trained a flexible and efficient CNN denoiser as a baseline deep denoiser and then inserted it as a module into an iterative algorithm based on half-quadratic splitting (HQS) to solve various image restoration problems. The work in [33] proposed a robust deformed denoising CNN that uses deformable learnable kernels and a stacked convolutional architecture to extract more representative noise features. The model in [34] is a hybrid denoising model built from a transformer encoder and a convolutional decoder that achieves state-of-the-art denoising performance on real images at relatively low computational cost. The work in [35] designed a dual network with a sparse mechanism that extracts complementary features to recover clean images and is applicable to real noisy images.
Although the above deep learning-based denoising algorithms produce good results, problems remain. Edge and texture information is very important for image recovery, but a denoising network typically treats all extracted features equally and does not focus on the edges and textures of the input image, which leads to poor recovery in the edge regions of the denoised image. How to extract edge and texture features from limited features is therefore a key difficulty for denoising networks. To address the above problems, Hu et al. [36] proposed a channel attention mechanism to learn the correlation between channels. Woo et al. [37] proposed CBAM to better learn the correlation between feature maps across channels and spatial locations. Both attention mechanisms generate weights through global pooling operations and convolution. Yang et al. [38] proposed SimAM (a simple, parameter-free attention module), which uses statistical properties to learn channel and spatial correlations at each position of the feature map without additional parameters. In addition to single-network designs, BRDNet [39], which has a two-network structure, has also been proposed; it increases the width of the network by combining two networks to obtain more features and improves training speed and effectiveness by applying batch renormalization, residual learning, and dilated convolution together. Cai et al. [40] proposed a two-stage image denoising model in which the input image is first processed by a specialized denoiser, and the resulting intermediate denoised image is then passed to a kernel prediction network that estimates a denoising kernel for each pixel. The robustness of this method to noise parameters exceeds that of comparable blind denoisers while approaching state-of-the-art denoising quality for camera sensor noise.
Building on previous research, this paper presents a parallel denoising network with nonparametric attention and multiscale feature fusion (NAMFPDNet). The main contributions are as follows:
- (1) To address blurred edges and unclear textures in recovered images, a dual-branch image denoising network (NAMFPDNet) is proposed, which builds on the residual denoising network and incorporates a nonparametric attention mechanism and multiscale feature fusion.
- (2) A dual-branch deep feature extraction module is designed. The upper branch adopts densely connected blocks to extract local features of the image noise, and the lower branch combines ordinary convolution with dilated convolution to form residual blocks that extract global information about the image noise, which strengthens the feature extraction capability of the network. Compared with a single-branch structure, the dual-branch network addresses both the insufficient feature extraction of single-branch models and the performance saturation of deep CNNs.
- (3) A parameter-free attention module based on SimAM is designed to focus on critical regions in important channels of the feature map from both the spatial and channel aspects, so that the network can recover clear edges and texture details.
- (4) A multiscale feature fusion module is designed that deeply fuses global and local features using three convolutional layers with different kernel sizes. Compared with the traditional single-scale convolution operation, multiscale feature fusion better recovers image contour and texture information.
2. Theory and Methodology
This paper designs a parallel image denoising network based on nonparametric attention and multiscale feature fusion (NAMFPDNet). The network follows the idea of DnCNN [26] by combining residual learning with batch normalization. The structure of the denoising network is shown in Figure 1. Following the DnCNN formulation, the input of the network is the noisy image $y = x + v$, where $x$ is the clean image and $v$ is the noise, and the output is the residual image $R(y)$ learned by NAMFPDNet. The purpose of the algorithm is to learn a residual $R(y)$ that approximates the noise $v$, i.e., $R(y) \approx v$, and then to remove the residual $R(y)$ from the noisy image $y$.
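The residual-learning idea can be illustrated with a minimal Python sketch. It assumes the additive noise model (noisy image = clean image + noise) and uses a hypothetical `predict_residual` callable as a stand-in for the trained network; it shows only the final subtraction step and is not the NAMFPDNet implementation.

```python
from typing import Callable, List

Image = List[List[float]]  # a single-channel image as rows of pixel values


def denoise(noisy: Image, predict_residual: Callable[[Image], Image]) -> Image:
    """Subtract the predicted noise residual from the noisy image."""
    residual = predict_residual(noisy)
    return [
        [pixel - res for pixel, res in zip(noisy_row, residual_row)]
        for noisy_row, residual_row in zip(noisy, residual)
    ]


# Toy data: the "network" here is an oracle that returns the true noise,
# which is an assumption made purely for illustration.
clean = [[10.0, 20.0], [30.0, 40.0]]
noise = [[1.0, -2.0], [0.5, 0.0]]
noisy = [[c + n for c, n in zip(rc, rn)] for rc, rn in zip(clean, noise)]

restored = denoise(noisy, lambda _image: noise)
```

With a perfect residual predictor the clean image is recovered exactly; in practice the quality of the result depends entirely on how well the learned residual approximates the noise.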
The network first applies a 3 × 3 convolution followed by ReLU (Conv3×3 + ReLU) to the input image as an initial feature extraction module, with 1 input channel and 64 output channels, to extract the initial (shallow) features. These features are then passed through the parallel feature extraction module (PFEM), in which the upper and lower branch networks run in parallel to extract deep features of the image; the outputs of the two branches are concatenated to obtain the deep features. The deep features are passed through the multiscale feature fusion module (MFM) to obtain the fused features. The fused features are then passed through the nonparametric attention module (NAM), which focuses on critical regions in important channels of the feature map both spatially and channel-wise, to obtain the enhanced features from which the network can recover clear edges and texture details.
The shallow features and the enhanced deep features are merged through a long skip connection and then passed to the residual reconstruction module. This module uses a single convolutional layer with a 3 × 3 kernel, 64 input channels, and 1 output channel to reconstruct the noise residual. The noise residual map learned by the network is given by Equation (1). The long skip connection integrates shallow and deep feature information, which helps to stabilize training and improve denoising performance. Finally, the residual map is subtracted from the noisy input image to obtain the denoised image.
2.1. Parallel Feature Extraction Module
The parallel feature extraction module (PFEM) connects an upper branch network and a lower branch network in parallel, as shown in Figure 2.
The upper branch network uses a connectivity pattern similar to that of DenseNet. It mainly consists of three tightly connected blocks (TCBs) in series, which extract the local features of the noisy image. These local features are then processed by a nonparametric attention module (NAM) and combined through a local residual connection. For the extracted features, NAM learns the correlation of each position across the spatial and channel dimensions and adaptively adjusts the weight of each position; the weights are multiplied with the extracted features so that important local features are emphasized and uninformative features are suppressed. The output of the upper branch network is one of the two inputs to the multiscale feature fusion module.
The structure of the TCB is shown in Figure 3. The input of each normalization layer in the TCB comes from the outputs of all the preceding convolutional layers. This dense connectivity not only alleviates the vanishing gradient problem but also provides strong feature extraction capability and enhances feature propagation. The TCB contains five convolutional layers with a kernel size of 3 × 3, whose parameter settings are listed in Table 1. A 1 × 1 convolutional layer at the end of the TCB reduces the number of channels so that the output feature map has 64 channels, which effectively reduces the computation.
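The channel bookkeeping behind this kind of dense connectivity can be sketched as follows. The growth of 32 channels per layer and the inclusion of the block input in every concatenation are illustrative assumptions in the style of DenseNet; the actual layer widths of the TCB are those listed in Table 1, and the function name is hypothetical.

```python
from typing import List, Tuple


def dense_block_channels(
    in_channels: int, growth: int, num_layers: int
) -> Tuple[List[int], int]:
    """Return the input width of each layer and the width of the final concatenation."""
    layer_inputs: List[int] = []
    channels = in_channels
    for _ in range(num_layers):
        layer_inputs.append(channels)  # layer sees the block input plus all earlier outputs
        channels += growth             # its own output joins the concatenation
    return layer_inputs, channels


# Assumed values: 64 input channels, growth of 32, five 3x3 layers.
layer_inputs, concat_channels = dense_block_channels(64, 32, 5)

# Weights of a 1x1 convolution that maps the concatenation back to 64 channels
# (bias terms omitted).
bottleneck_weights = concat_channels * 64 * 1 * 1
```

Under these assumptions the concatenated feature map is several times wider than the block input, which is why a 1 × 1 convolution at the end of the block is an inexpensive way to return to 64 channels before the next block.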
The lower branch network uses four dilated convolutional residual blocks (DCRBs) in series, followed by NAM for feature enhancement. The lower branch is designed to compensate, through a different structure, for the damage to the image structure and the loss of noise information in the upper branch. The output of the lower branch network is the second input to the multiscale feature fusion module.
The structure of the DCRB is shown in Figure 4. The DCRB mainly consists of a series of dilated convolutions with dilation rates of 1, 2, and 3, respectively. Each convolution uses a 3 × 3 kernel and 64 kernels. The specific parameter settings are listed in Table 2.
Combining dilated convolution with ordinary convolution forms a sparse structure that expands the receptive field of the network without additional learnable parameters. This alleviates the saturation of feature extraction caused by using a single kernel size in deep networks and effectively improves denoising performance. Using different dilation rates prevents the gridding effect caused by a single dilation rate. Local residual connections are also added inside the DCRB to further enhance the feature extraction capability of the module and thus improve model performance.
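The effect of dilation on the receptive field can be checked with a short calculation. The sketch below assumes stride-1 convolutions stacked in series, for which each layer adds (kernel size − 1) × dilation rate to the receptive field; the function name is illustrative.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stride-1 convolutions stacked in series."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


# Three 3x3 convolutions with dilation rates 1, 2, and 3.
rf_dilated = receptive_field(3, [1, 2, 3])

# Three ordinary 3x3 convolutions (dilation rate 1) for comparison.
rf_plain = receptive_field(3, [1, 1, 1])
```

Both stacks use the same number of 3 × 3 weights, but the dilated stack covers a 13 × 13 neighborhood instead of 7 × 7, which is the sense in which the receptive field grows without additional parameters.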
2.2. The Nonparametric Attention Module
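Yang et al. [38] proposed SimAM, a simple attention module derived from statistical principles. Inspired by human visual neurons, SimAM is a 3D attention module that attends to the spatial and channel dimensions simultaneously. Specifically, an energy function grounded in neuroscience theory is optimized to find the importance of each neuron. The energy function measures the linear separability between a target neuron and the other neurons in the same channel, which determines whether the neuron should be attended to. Deriving a closed-form solution of the energy function gives the minimum energy of a neuron, as shown in Equation (2):
$$e_t^{*} = \frac{4(\hat{\sigma}^{2} + \lambda)}{(t - \hat{\mu})^{2} + 2\hat{\sigma}^{2} + 2\lambda} \qquad (2)$$
where $\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i$ and $\hat{\sigma}^{2} = \frac{1}{M}\sum_{i=1}^{M} (x_i - \hat{\mu})^{2}$.
Here, $X \in \mathbb{R}^{C \times H \times W}$ indicates the input feature map, $t$ is the target neuron, $x_i$ are the neurons in the same channel, and $M = H \times W$ indicates the number of neurons in the same channel. $\hat{\mu}$ denotes the mean of all neurons within the same channel, and $\hat{\sigma}^{2}$ represents their variance. $\lambda$ is a regularization coefficient, for which SimAM [38] uses $10^{-4}$ by default. A lower energy means that the neuron is more distinct from the other neurons and therefore more deserving of attention. Thus, the importance of a neuron can be obtained by $1/e_t^{*}$.
Since attention is applied by weighting, the formula for SimAM is given by Equation (3):
$$\tilde{X} = \operatorname{sigmoid}\left(\frac{1}{E}\right) \odot X \qquad (3)$$
where $E$ groups all $e_t^{*}$ across the channel and spatial dimensions, and $\odot$ denotes element-wise multiplication.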
Based on SimAM theory, the specific steps for implementing the nonparametric attention module (NAM) designed in this paper are as follows:
Input: the input feature map $X \in \mathbb{R}^{C \times H \times W}$.
Output: the enhanced feature map $\tilde{X}$.
Step 1: Calculate the mean of $X$ for each channel, i.e., squeeze the feature map along the spatial dimensions to obtain the mean $\mu$ on each channel.
Step 2: Calculate the squared deviation from the channel mean for each position in the same channel, obtaining $d = (X - \mu)^{2}$.
Step 3: Compute the variance of $X$ for each channel, i.e., sum $d$ over the spatial dimensions and divide by $n$, where $n = H \times W - 1$. The result $v$ serves as the channel attention information.
Step 4: Calculate the importance (inverse energy) of each pixel using $E^{-1} = \frac{d}{4(v + \lambda)} + 0.5$, which follows from Equation (2) with regularization coefficient $\lambda$.
Step 5: Enhance the feature map using the sigmoid function, i.e., apply Equation (3) to the obtained $E^{-1}$ to produce the enhanced feature map $\tilde{X}$.
To implement NAM, a custom neural network layer named NAM, inheriting from the nnet.layer.Layer base class, was defined, and the NAM computation was implemented in the forward propagation function of this class. The forward propagation method for this layer, based on the above steps, is shown in Figure 5 below.
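Independently of Figure 5, the five steps can also be written as a small, dependency-free Python sketch. This is an illustration under stated assumptions rather than the layer used in the paper: the feature map is a nested list indexed as [channel][row][column], the function name `nam_forward` is hypothetical, and the default `lam = 1e-4` is the SimAM default rather than a value confirmed by the text.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def nam_forward(x: FeatureMap, lam: float = 1e-4) -> FeatureMap:
    """Parameter-free attention following Steps 1-5 (SimAM-style weighting)."""
    enhanced: FeatureMap = []
    for channel in x:
        values = [value for row in channel for value in row]
        n = max(len(values) - 1, 1)                # H*W - 1, guarded for 1x1 maps
        mean = sum(values) / len(values)           # Step 1: per-channel mean
        sq_dev = [[(value - mean) ** 2 for value in row] for row in channel]  # Step 2
        variance = sum(sum(row) for row in sq_dev) / n                        # Step 3
        out_channel = []
        for row, dev_row in zip(channel, sq_dev):
            out_row = []
            for value, dev in zip(row, dev_row):
                inv_energy = dev / (4.0 * (variance + lam)) + 0.5  # Step 4
                weight = 1.0 / (1.0 + math.exp(-inv_energy))       # Step 5: sigmoid
                out_row.append(value * weight)
            out_channel.append(out_row)
        enhanced.append(out_channel)
    return enhanced


# One 2x2 channel in which a single pixel stands out from its neighbors.
features = [[[0.0, 0.0], [0.0, 4.0]]]
attended = nam_forward(features)
```

Pixels that deviate strongly from their channel mean receive an inverse energy above 0.5 and therefore a larger sigmoid weight, while a perfectly flat channel is scaled uniformly by sigmoid(0.5). No learnable parameters appear anywhere in the computation.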
The existing channel attention mechanism SE and the mixed attention mechanism CBAM can greatly improve the accuracy of a network, but they introduce additional parameters because they rely on fully connected layers and pooling layers for weight allocation. SimAM, in contrast, provides three-dimensional attention weights without adding any network parameters, as shown in Table 3. This property allows SimAM to reduce the complexity and computational cost of the model while maintaining high performance.
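To make the parameter argument concrete, the following sketch counts the learnable weights of each attention type for one assumed configuration. The reduction ratio of 16, the 7 × 7 spatial kernel, and the omission of bias terms are assumptions based on the common SE and CBAM designs; the resulting numbers are illustrative and are not the figures reported in Table 3.

```python
def se_weights(channels: int, reduction: int = 16) -> int:
    """Weights of the two fully connected layers in an SE block (no biases)."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels


def cbam_weights(channels: int, reduction: int = 16, spatial_kernel: int = 7) -> int:
    """Weights of CBAM: shared channel MLP plus one spatial convolution (no biases)."""
    channel_mlp = se_weights(channels, reduction)
    # The spatial branch convolves a 2-channel map (average and max pooled)
    # down to a single attention map.
    spatial_conv = 2 * 1 * spatial_kernel * spatial_kernel
    return channel_mlp + spatial_conv


def simam_weights(channels: int) -> int:
    """SimAM derives its weights from feature statistics, so it has none to learn."""
    return 0


counts = {
    "SE": se_weights(64),
    "CBAM": cbam_weights(64),
    "SimAM": simam_weights(64),
}
```

For a 64-channel feature map the learned attention modules add a few hundred weights per insertion point, and this cost is repeated wherever the module is used, whereas the statistics-based module adds none.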
2.3. Multiscale Feature Fusion Module
This paper uses multiscale feature fusion to extract image features at different scales. The structure of the multiscale feature fusion module (MFM) is shown in Figure 6. The inputs of the MFM are the feature maps output by the upper and lower branch networks. The feature maps of the two branches are first concatenated, and features are then extracted by three parallel convolutional layers with kernel sizes of 1 × 1, 3 × 3, and 5 × 5, each with 64 kernels. The extracted features are finally summed to fuse them. Compared with the traditional single-scale convolution operation, multiscale feature fusion better recovers image contour and texture information.
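A short calculation shows why the three parallel branches can be summed element-wise and what they cost. The sketch assumes stride 1, zero padding of (k − 1)/2 on each side, and 128 input channels (two concatenated 64-channel branch outputs); the text does not state these settings explicitly, and the 40 × 40 patch size is purely hypothetical.

```python
def same_padding(kernel_size: int) -> int:
    """Padding per side that keeps the spatial size unchanged at stride 1."""
    return (kernel_size - 1) // 2


def output_size(size: int, kernel_size: int, padding: int) -> int:
    """Spatial output size of a stride-1 convolution."""
    return size + 2 * padding - kernel_size + 1


def conv_weights(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Number of convolution weights (bias terms omitted)."""
    return in_channels * out_channels * kernel_size * kernel_size


IN_CHANNELS = 128   # assumed: two 64-channel branch outputs concatenated
OUT_CHANNELS = 64   # 64 kernels per branch, as stated in the text
PATCH_SIZE = 40     # hypothetical spatial size

kernel_sizes = (1, 3, 5)
branch_sizes = {k: output_size(PATCH_SIZE, k, same_padding(k)) for k in kernel_sizes}
branch_weights = {k: conv_weights(IN_CHANNELS, OUT_CHANNELS, k) for k in kernel_sizes}
total_weights = sum(branch_weights.values())
```

With this padding all three branches produce 64-channel maps of the same spatial size, so they can be added directly; the 5 × 5 branch accounts for most of the weights under these assumptions.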