1. Introduction
Communication through voice is the most natural and efficient form of human communication [1]; however, noise often degrades voice quality, damages hearing, and reduces intelligibility. Speech enhancement algorithms are a crucial means of addressing this problem: they improve speech quality and intelligibility by removing noise from speech audio. Currently, most existing speech enhancement algorithms are designed for speech audio with a high Signal-to-Noise Ratio (SNR) and achieve relatively ideal performance in scenarios with little noise [2]. However, their performance is limited in scenes with heavy noise interference and low SNR, such as forging factories, racing cars, extreme sports, and even battlefields. Speech enhancement for low-SNR scenarios is challenging [3], and related research is still in its infancy. In this paper, we propose a model for low-SNR scenarios.
With the development of deep learning technology, great progress has been made in many fields, such as computer vision (CV), natural language processing (NLP), and audio processing. In the field of speech enhancement, Lu [4] proposed a deep-neural-network-based speech enhancement method in 2013, and deep learning-based methods have since become mainstream. Among the many network architectures, the UNET [5] architecture is widely used in speech enhancement tasks. It was originally designed for image segmentation, which requires a neural network to classify images semantically pixel by pixel, a task that shares similarities with speech separation and speech enhancement. Stoller et al. [6] used the UNET to achieve speech separation, and Macartney et al. [7] used a UNET-based network to perform speech enhancement on the time-domain audio signal, achieving the state of the art (SOTA) at that time. Hao et al. [8] designed UNET-GAN to achieve low-SNR speech enhancement in the time domain. Luo et al. [9] extended the UNET network to the time–frequency domain and performed speech enhancement on the complex audio spectrum. Subsequently, DCCRN [10] introduced a complex feature fusion module into the network to better model correlations in the speech time–frequency domain.
The speech signal can be converted into its corresponding time–frequency spectrum through the short-time Fourier transform (STFT) [11]. Most speech enhancement algorithms operate on the time–frequency spectrum, which differs from image enhancement: the time–frequency spectrum is a causal sequence along the time axis, whereas images are not time series. Therefore, most speech enhancement models introduce a recurrent neural network (RNN) as a sequence model to extract the temporal correlation of spectral features [12]. As a Markov-style sequence model [13], an RNN focuses on modeling the correlation between adjacent elements of a sequence and has strong position awareness, which allows it to model the relationship between elements and their positions well. RNN-based models perform well when enhancing speech audio with high SNR. However, their performance is limited on low-SNR speech audio because RNNs struggle to learn long-distance dependencies between sequence elements. In addition, RNNs process sequences causally, feeding elements into the model one by one, which results in poor parallelizability and high computational cost [14].
The transformer [14] is a sequence model based on the attention mechanism; it can model long-distance dependencies between sequence elements while offering good parallelism, and it has achieved SOTA results in the NLP [15] and CV [16] fields. In this work, we use a transformer module to address the problems faced by RNNs. However, in the speech enhancement task, the traditional transformer has difficulty handling complex-valued features. The full attention mechanism struggles to focus on adjacent elements of the sequence, and its computational complexity is high. Moreover, a purely attention-based sequence model has a weak ability to perceive elements together with their positions, whereas the speech enhancement task requires strong position awareness. Limited by these problems, traditional transformer models leave room for improvement in low-SNR speech enhancement tasks.
In this paper, we propose a complex sparse transformer (CST) model to improve the transformer in low-SNR speech enhancement tasks. Our contributions are as follows:
We propose a transformer-based speech enhancement model suitable for low-SNR scenarios.
We extend the transformer operator to the complex domain so that it can efficiently model the correlation between the elements of the real and imaginary parts of the complex sequence features.
We improve the transformer module with a sparse attention mask, which yields better performance in speech enhancement tasks at a lower computational cost.
We design a pre-layer positional embedding to enhance the position awareness of the transformer model in speech enhancement tasks.
2. Related Work
Neural-network-based methods are the current mainstream solution for speech enhancement. Neural speech enhancement algorithms can be divided into generative and discriminative schemes [17]. Generative models realize speech enhancement by directly predicting clean speech or by following a generation paradigm. Typical methods include Wave-UNET [7], the earliest use of the UNET network to generate clean speech from noisy input. SEGAN [18] is a generative adversarial network (GAN) that aims to generate clean speech from noisy speech; its generator uses a time-domain UNET structure, and its discriminator judges whether the audio produced by the generator is the clean speech corresponding to the noisy input. Hao et al. [8] designed UNETGAN for low-SNR scenarios, using dilated convolution to fuse temporal signal features. When a GAN is used to generate clean speech directly, its discriminative criteria rarely reflect the actual speech quality evaluation metrics; MetricGAN [19] was proposed for this problem, allowing the model to take one or more speech quality evaluation metrics into account and thus improve performance from the metrics' perspective. Unlike the above models, Soni et al. [20] proposed MMSE-GAN, which trains a GAN to generate time–frequency spectral masks. Donahue et al. [21] designed a speech enhancement algorithm on the log-Mel spectrum, focusing on improving the quality of downstream speech recognition; by directly generating the log-Mel spectrum corresponding to clean speech, they achieved a 7% word error rate (WER) improvement.
The effect of additive background noise on the speech signal can be regarded as the noise masking the speech signal in the time–frequency spectrum. A speech enhancement model can therefore predict a mask of the noise signal over the time–frequency spectrum or the magnitude spectrum; such models are called discriminative models. Typical masks include magnitude-spectrum masks, such as the ideal binary mask (IBM) [22] and the ideal ratio mask (IRM) [23], and complex time–frequency masks, such as the complex ratio mask (CRM) [24]. The IBM treats denoising as a binary classification problem, using 1 or 0 to mark whether the energy of clean speech dominates a spectral unit. Heymann et al. [25] proposed a deep neural network method to predict the IBM. Predicting the IBM is relatively simple; however, it is difficult to accurately describe the proportional energy relationship between clean speech and noise within a specific time–frequency unit in complex acoustic scenarios. To better distinguish noise in speech audio, Tu et al. [26] proposed the IRM, which uses a ratio to describe the energy relationship between the clean speech and noise signals in a spectral unit. Compared with IBM-based models, IRM-based models can better model the spectral features of clean speech and eliminate noise more naturally. Strake et al. [27] proposed a two-stage speech enhancement scheme combining the IBM and IRM for low-SNR scenarios: in the first stage, a UNET model predicts the IBM for coarse noise reduction, and in the second stage a partial convolution model repairs the loss of the speech signal. Both the IBM and IRM are masks for the magnitude spectrum of the audio signal. Models that work on the magnitude spectrum usually assume that the signal's magnitude carries more information than its phase, so noise reduction is achieved by attenuating or eliminating the magnitude of some time–frequency units. To make full use of the information in the time–frequency spectrum, Williamson et al. [24] proposed the CRM, which describes the proportions of the real and imaginary parts of the clean speech within the noisy signal for each time–frequency unit. Compared with the IBM and IRM, predicting the CRM yields better speech enhancement performance with the same model structure. To better predict the CRM, Luo et al. [9] proposed DCUNET, which uses a complex UNET structure to model complex features. Building on DCUNET, DCCRN [10] designed a complex feature fusion module based on long short-term memory (LSTM), which improved the performance of [9]. Among discriminative models, predicting the CRM has become the mainstream method in this field.
RNN-based modules such as the LSTM and the Gated Recurrent Unit (GRU) are widely used in UNET-based speech enhancement models [28]. However, it is difficult for RNNs to model long-distance dependencies between sequence elements, which limits their performance in several tasks, including speech enhancement. In 2017, Vaswani et al. [14] proposed the attention-based transformer model, which can efficiently model long-distance dependencies and has therefore achieved good results in many fields. Building on the transformer, the Complex Transformer [29] was defined and applied to the music generation task with good performance. However, when the transformer structure is applied directly to the speech enhancement task, it is difficult to achieve the expected performance improvement in practice. Kim et al. [30] combined the transformer with a Gaussian model to achieve better performance. Yu et al. [31] added Local LSTM modules before the multi-head attention to improve the performance of the transformer. Currently, there is no transformer-based speech enhancement algorithm designed for low-SNR scenarios.
3. Method
In this work, we assume that the most important factor affecting speech quality is additive background noise, so the noisy speech can be expressed as:
$$y_i = s_i + n_i,$$
where $y_i$, $s_i$, and $n_i$ represent the $i$th frame of the noisy speech, clean speech, and noise, respectively. We aim to estimate $\hat{s}$ from $y$ as close as possible to the original $s$.
In this work, we propose the CST network, which uses complex convolution and deconvolution modules arranged in the UNET form as the codec to extract and restore high-level features of the speech spectrum; it uses the CST module to fuse and edit the complex audio features, and background noise removal is achieved by predicting the CRM. The basic structure of the model is shown in Figure 1. The first part of this section introduces the basic architecture of the model, and the second part introduces our CST module.
3.1. Model Basic Structure
Our model acts on the complex time–frequency spectrum and uses complex convolution and deconvolution modules to extract and restore the time–frequency features of speech audio. The basic structure of the complex convolution module is shown in Figure 2; the deconvolution module is similar. These modules accept complex feature inputs and output complex features. The complex convolution operation can be expressed as:
$$Y = \big(\mathrm{Conv}_r(X_r) - \mathrm{Conv}_i(X_i)\big) + j\big(\mathrm{Conv}_r(X_i) + \mathrm{Conv}_i(X_r)\big),$$
where $X = X_r + jX_i$ and $Y$ are complex features representing the model's input and output, respectively, and $\mathrm{Conv}_r$ and $\mathrm{Conv}_i$ are the real convolution and the imaginary convolution, respectively. The complex convolution and deconvolution modules consist of a complex convolution (deconvolution) layer, a PReLU layer, and a complex batch normalization layer. The codec based on the complex residual module is arranged in an inverted-pyramid layout, and the skip connections follow the basic structure of the UNET architecture.
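As an illustration of this complex convolution rule, a minimal PyTorch sketch is given below; the kernel size, stride, and module names are our own illustrative choices rather than the paper's settings, and per-part batch normalization stands in for the complex batch normalization described above.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex 2-D convolution: Y = (Conv_r(X_r) - Conv_i(X_i)) + j(Conv_r(X_i) + Conv_i(X_r))."""
    def __init__(self, in_ch, out_ch, kernel_size=(5, 2), stride=(2, 1), padding=(2, 1)):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # real-kernel branch
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # imaginary-kernel branch

    def forward(self, x_r, x_i):
        y_r = self.conv_r(x_r) - self.conv_i(x_i)
        y_i = self.conv_r(x_i) + self.conv_i(x_r)
        return y_r, y_i

class ComplexConvBlock(nn.Module):
    """Complex convolution followed by normalization and PReLU, as in the encoder blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = ComplexConv2d(in_ch, out_ch)
        # Simplification: separate batch norms on the real/imaginary parts instead of complex batch norm.
        self.bn_r, self.bn_i = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x_r, x_i):
        y_r, y_i = self.conv(x_r, x_i)
        return self.act(self.bn_r(y_r)), self.act(self.bn_i(y_i))
```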
The model achieves speech enhancement by predicting the CRM, which removes noise by learning to scale the noise component on the complex time–frequency spectrum. This process can be expressed as follows:
$$\hat{S} = M \cdot X,$$
where $M$ is the complex ratio mask, $X$ is the model's complex spectral input, $\cdot$ denotes complex multiplication, and $\hat{S}$ is the model's prediction of the spectrum of the clean speech signal. After the inverse short-time Fourier transform, the corresponding time-domain signal is obtained. The model uses the Scale-Invariant SNR (SI-SNR) as the loss function, which can be expressed as:
$$s_{\mathrm{target}} = \frac{\langle \hat{s}, s\rangle\, s}{\|s\|^2}, \qquad e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}, \qquad \mathrm{SI\text{-}SNR} = 10\log_{10}\frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{noise}}\|^2},$$
where $\langle\cdot,\cdot\rangle$ represents the vector dot product, $\|\cdot\|$ represents the $\ell_2$ norm, $\hat{s}$ represents the audio time-domain signal predicted by the model, and $s$ represents the clean speech signal.
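As a concrete reference for the loss defined above, the following is a minimal PyTorch sketch of the SI-SNR loss; the function name and the (batch, samples) tensor layout are assumptions, not the authors' code.

```python
import torch

def si_snr_loss(est: torch.Tensor, clean: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SNR, averaged over the batch; est and clean have shape (batch, samples)."""
    # Remove the mean so the measure is invariant to DC offset.
    est = est - est.mean(dim=-1, keepdim=True)
    clean = clean - clean.mean(dim=-1, keepdim=True)
    # Project the estimate onto the clean signal: s_target = <est, clean> * clean / ||clean||^2
    dot = torch.sum(est * clean, dim=-1, keepdim=True)
    s_target = dot * clean / (torch.sum(clean ** 2, dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps
    )
    return -si_snr.mean()  # negate so that minimizing the loss maximizes SI-SNR
```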
3.2. Complex Sparse Transformer
The CST module is designed to fuse and edit the complex features of speech audio extracted by the complex UNET codec. To allow the transformer structure to process complex features, we regard the transformer as an overall operator and use the complex multiplication rule to fuse the real and imaginary parts of the complex features. The process can be expressed as follows:
$$Y = \big(T_r(X_r) - T_i(X_i)\big) + j\big(T_r(X_i) + T_i(X_r)\big).$$
Figure 3 shows the complex-domain transformer structure. Each complex transformer layer contains two transformer sub-modules, $T_r$ and $T_i$, representing the real network and the imaginary network, respectively. From past research, there is another way to extend the transformer to complex features, which uses the ternary complex-number operation rules to calculate the attention matrix; our early experiments show that our approach is more effective.
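A sketch of the complex-multiplication fusion described above, treating two ordinary real-valued transformer encoder layers as the operators $T_r$ and $T_i$; the model width, head count, and class name are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ComplexTransformerLayer(nn.Module):
    """Applies Y = (T_r(X_r) - T_i(X_i)) + j(T_r(X_i) + T_i(X_r)) with two real transformer sub-modules."""
    def __init__(self, d_model: int = 256, nhead: int = 8):
        super().__init__()
        self.t_r = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)  # real sub-network
        self.t_i = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)  # imaginary sub-network

    def forward(self, x_r: torch.Tensor, x_i: torch.Tensor):
        # Complex multiplication rule applied at the operator level.
        y_r = self.t_r(x_r) - self.t_i(x_i)
        y_i = self.t_r(x_i) + self.t_i(x_r)
        return y_r, y_i
```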
Short-range correlations in speech signals are stronger than long-range correlations, so it is necessary to introduce this prior knowledge into the model. RNN-based models receive the elements of a sequence one by one; in this process, information farther from the current node is forgotten to a greater degree, and such a structure matches this characteristic of the speech signal. The full-attention transformer, in contrast, is not affected by the distance between elements when generating the attention matrix, so it can model the correlation between elements at any distance, but it also fails to allocate enough attention to short-range correlations. To enhance the transformer module, we introduce a sparse attention mask [32] to balance the model's attention to long-range and short-range dependencies. The attention computation with a mask can be expressed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V, \qquad B_{ij} = \begin{cases} 0, & M_{ij} = 1 \\ -\infty, & M_{ij} = 0, \end{cases}$$
where $n$ is the length of the sequence $V$ and the mask matrix $M \in \{0,1\}^{n \times n}$ is a binary matrix with values of 0 and 1. When $M_{ij} = 1$, the $i$th element of the sequence $V$ attends to the $j$th element; otherwise, if $M_{ij} = 0$, the $i$th element of the sequence $V$ does not attend to the $j$th element.
Figure 4 shows the attention mask we use.
Our attention mask consists of global, neighbor, and random attention. Global attention corresponds to a group of special elements at the head of the sequence; they can attend to elements across the whole sequence during the attention calculation, and elements across the whole sequence can also attend to them. To avoid the head elements of the sequence features undertaking training tasks different from those of the other elements, the sequence elements corresponding to the global attention are trainable parameters in this model; they are concatenated to the head of the input features before the attention calculation and discarded afterwards. Neighbor attention means that each element can attend to a fixed number of elements around it. Random attention is randomly distributed in the attention mask to accelerate the establishment of association relationships. Our sparse attention mask reduces the space complexity of the attention matrix from $O(n^2)$ to $O(n)$, which greatly reduces the computational cost of the model during training and inference; the experimental results are discussed in Section 4.3.
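The following sketch shows one way such a binary mask combining global, neighbor, and random attention could be assembled; the numbers of global tokens, neighbors, and random links are illustrative assumptions, not the paper's settings.

```python
import torch

def sparse_attention_mask(n: int, n_global: int = 2, window: int = 3,
                          n_random: int = 2, seed: int = 0) -> torch.Tensor:
    """Build an (n x n) binary mask where True means position i may attend to position j."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Global attention: the first n_global (trainable, prepended) tokens attend everywhere
    # and are attended to by every position.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # Neighbor attention: each token attends to a fixed local window around itself.
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
    # Random attention: a few random links per row to speed up information propagation.
    rand_cols = torch.randint(0, n, (n, n_random), generator=g)
    mask.scatter_(1, rand_cols, torch.ones_like(rand_cols, dtype=torch.bool))
    return mask

# Note: PyTorch's nn.MultiheadAttention uses the opposite convention for boolean masks
# (True = position is NOT allowed to attend), so pass `~mask` if using it there.
```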
The transformer is weaker than the RNN and the Temporal Convolutional Network (TCN) in time-sequence modeling tasks. Zeng et al. [33] argue that the main reason is that the transformer has a weak ability to perceive the position and order of elements in a sequence, which creates a modeling bottleneck. In this paper, our technical route to speech enhancement is to remove the noise contained in the audio by predicting a mask, and this task places high demands on the model's position awareness and modeling ability. Therefore, we introduce pre-layer positional embedding to enhance the position awareness of the transformer model. To cooperate with this layer-by-layer injection of positional embeddings, we adjust the transformer's internal structure: we move the layer normalization module to the front to avoid the shift in feature statistics introduced by the additive positional embedding. The module structure is shown in Figure 5. Regarding the choice of positional embedding scheme, we implement one based on sine functions and one based on trainable parameters; the differences between the two are discussed in Section 4.3.
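A minimal sketch of a pre-norm transformer block that re-injects a positional embedding before every layer, with either sinusoidal or trainable embeddings selectable; all dimensions and names are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PreLayerPosTransformerBlock(nn.Module):
    """Pre-norm transformer block that re-injects a positional embedding before the layer."""
    def __init__(self, d_model=256, nhead=8, max_len=512, trainable_pos=True):
        super().__init__()
        if trainable_pos:
            self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # trainable positional embedding
        else:
            pe = torch.zeros(1, max_len, d_model)
            pos = torch.arange(max_len).unsqueeze(1).float()
            div = torch.exp(torch.arange(0, d_model, 2).float()
                            * (-torch.log(torch.tensor(10000.0)) / d_model))
            pe[0, :, 0::2], pe[0, :, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
            self.register_buffer("pos", pe)  # fixed sinusoidal embedding
        self.norm1 = nn.LayerNorm(d_model)   # placed *before* attention (pre-norm)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x, attn_mask=None):
        # attn_mask follows the PyTorch convention (True = blocked position).
        x = x + self.pos[:, : x.size(1)]   # per-layer positional injection
        h = self.norm1(x)                  # pre-norm limits the statistics shift from the additive embedding
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + a
        x = x + self.ff(self.norm2(x))
        return x
```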
The convolution modules extract audio features along the frequency dimension, and the energy of the speech signal is mainly concentrated in the low-frequency band. However, in the speech enhancement task, especially at low SNR, the low-frequency speech components may be masked and confused by noise. We therefore want the model to fuse convolutional features from different channels and assign them weights so as to preprocess different kinds of noisy audio more effectively. To achieve this, we use the Squeeze-and-Excitation [34] module as channel attention for feature enhancement along the channel dimension. The basic structure of the channel attention module is shown in Figure 6.
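A compact sketch of the Squeeze-and-Excitation channel attention used here for channel-wise re-weighting; the reduction ratio and the (batch, channel, frequency, time) layout are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-Excitation: global-average-pool each channel, then gate channels with learned weights."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                    # squeeze: (B, C, F, T) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # excitation: per-channel weights in (0, 1)
        return x * w                                           # re-weight feature channels
```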
4. Experiment
In this paper, we use the MUSAN [35] clean speech and noise dataset, CSTR's VCTK Corpus [36] (Centre for Speech Technology Voice Cloning Toolkit) clean speech dataset, the NoiseX92 [37] noise dataset, and Babble noise data [38] for our experiments. The MUSAN dataset contains 60 h of speech and 48 h of music and noise data; we use all of the clean speech data and treat all 48 h of noise and music data as noise. We divide the speech and noise data in MUSAN into training, validation, and test sets at a ratio of 8:1:1 and ensure that the three sets do not intersect. To improve data diversity and avoid overfitting, during each training and validation step we randomly sample a clean utterance from the training set and a random noise clip of the same length, and dynamically synthesize the noisy speech at a random SNR drawn from [−15 dB, 15 dB]; for the test set, we randomly select a noise clip for each clean utterance and select a random SNR from {−15, −10, −5, 0} dB to synthesize each noisy utterance. VCTK is a large English speech dataset that includes more than 400 sentences read by 100 speakers and is widely used in speech synthesis and speech enhancement tasks. The NoiseX92 dataset contains a set of stationary noises collected in real scenes, such as factory floors, engine rooms, and vehicle interiors. The Babble data contain an hour of noise from MyNoise that simulates the noisy environment of a cafe; it mainly includes messy, irregular human voices and cafe-specific sounds such as cutlery and knife-and-fork collisions. Its noise components are more complex and closer to speech signals in energy distribution, which makes them more likely to be confused with speech.
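The dynamic mixing step can be illustrated with the following sketch, which scales a noise clip so that the mixture reaches a randomly drawn SNR; the function name and NumPy-based implementation are our own, not the authors' data pipeline.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float, eps: float = 1e-12) -> np.ndarray:
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add it to `clean`."""
    # Crop or tile the noise to the length of the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2) + eps
    p_noise = np.mean(noise ** 2) + eps
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: synthesize one training mixture with an SNR drawn uniformly from [-15, 15] dB.
# snr = np.random.uniform(-15, 15)
# noisy = mix_at_snr(clean, noise, snr)
```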
We set up three different test sets to evaluate the performance of the model. The MUSAN test set includes a variety of environmental noises and various types of music; we use this set to evaluate the model's ability to reduce different kinds of single-type noise. The Babble test set is based on the noise data in the Babble dataset and the clean speech in MUSAN; since Babble noise is similar in energy distribution to speech signals, we use this set to evaluate the model's anti-interference ability. The VCTK test set is synthesized from VCTK clean speech and NoiseX92 noise data and is used to evaluate the model's ability to remove stationary noise. All three test sets are synthesized with the same SNR distribution. Typical audio magnitude spectra from the test sets are shown in Figure 7.
We use Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) as objective metrics to evaluate the performance of all models. PESQ is recommended by the ITU Telecommunication Standardization Sector (ITU-T), which is a widely-used full-reference speech quality metric with values ranging from −0.5 to 4.5. A higher PESQ value indicates better speech quality. STOI, on the other hand, is used to measure speech intelligibility, with values ranging from 0 to 1. A higher STOI value indicates better intelligibility. To ensure compatibility with these metrics, we divide long audio samples into 10-second segments and discard segments containing less than 50% speech signals.
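For reference, both metrics are available in open-source Python packages (`pesq` and `pystoi`); a hedged example of computing them for a single 16 kHz utterance is shown below. This is not the evaluation code used in the paper.

```python
import numpy as np
from pesq import pesq     # pip install pesq
from pystoi import stoi   # pip install pystoi

def evaluate(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000):
    """Return (PESQ, STOI) for one utterance; 'wb' selects wideband PESQ."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')
    stoi_score = stoi(clean, enhanced, fs, extended=False)
    return pesq_score, stoi_score
```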
In the codec of our models, we use five layers of complex convolution and deconvolution modules, respectively. Each processed audio clip has a length of 8192 frames. For the short-time Fourier transform, we use a Hanning window with a length of 512 and a hop size of 128. During training, each batch contains 16 voice clips; the optimizer is AdamW [39] with a learning rate of 0.0004, beta1 of 0.9, and beta2 of 0.999. Training is performed on a single RTX 3090.
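The STFT and optimizer settings above correspond roughly to the following PyTorch calls; this is a sketch of the configuration, not the authors' training script.

```python
import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)

def to_spectrum(wave: torch.Tensor) -> torch.Tensor:
    """Complex time-frequency spectrum with a 512-point Hann window and 128-sample hop."""
    return torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)

# Optimizer settings from the paper: AdamW, lr = 4e-4, betas = (0.9, 0.999).
# optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, betas=(0.9, 0.999))
```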
We designed three experiments in this work. In the first experiment, we compared our model with other well-established speech enhancement models. In the second experiment, we explored the effect of the number of layers on the model’s performance. In the last experiment, we investigated the effectiveness of the improvements we implemented to the model through ablation experiments.
4.1. Compare with Other Models
In this section, we compare our models with four typical speech enhancement models: DCCRN [10], CRN [40], SEGAN [18], and SEFORMER [31]. Regarding model structure, DCCRN, CRN, and SEGAN are based on the UNET structure, while SEFORMER is based on the transformer decoder structure. Specifically, DCCRN uses a complex UNET network to enhance speech by predicting the CRM and is one of the SOTA models on the DNS challenge leaderboard. CRN is a real-valued model that realizes speech enhancement by predicting the IRM and is a widely recognized classic model in the field. SEGAN is a GAN-based generative speech enhancement model that generates target audio directly from noisy audio. SEFORMER is a pure-transformer speech enhancement model adapted for speech enhancement tasks; it uses a Local-LSTM structure to address the transformer's lack of position awareness, thus endowing the transformer with the ability to handle speech tasks. SEFORMER is a discriminative model that enhances speech by predicting the IRM. All of the above models are implemented as described in their original papers. To strengthen the SEFORMER baseline, four SEFORMER layers were used in our experiments.
The comparative experimental results on the MUSAN, Babble, and VCTK test sets are shown in Table 1, Table 2, and Table 3, respectively. In the tables, Noisy denotes the noisy speech; DCCRN, CRN, SEGAN, and SEFORMER denote the four baseline models; and CST-8 and CST-16 are the models proposed in this paper, with eight and sixteen layers, respectively.
The experimental results on the three test sets show that our proposed models, CST-8 and CST-16, outperform the other models in terms of both average PESQ and STOI. On the MUSAN test set, our model improved PESQ by 29.6% and STOI by 9.7% compared with the noisy speech and outperformed DCCRN, the best-performing baseline, by 4.1% and 1.8%, respectively. On the Babble test set, our model improved PESQ by 11.9% and STOI by 4.7% compared with the noisy speech, and outperformed DCCRN by 2.8% on PESQ and 1.1% on STOI. On the VCTK test set, our model greatly improved speech quality compared with the noisy input: relative to the noisy speech, PESQ improved by 21.88% and STOI by 2.29%. Among the baselines, DCCRN has the best overall performance; compared with it, our model improves PESQ by 7.53% and is comparable on STOI.
On the Babble test set at −15 dB, our model shows a decline in clarity and intelligibility compared with the original noisy speech; SEFORMER has the best results on this task and surpasses both our model and the noisy speech on the intelligibility metric. On the VCTK test set at −15 dB, although our model improves clarity the most among all models, it is not as good as SEFORMER in intelligibility and declines relative to the original noisy speech. On the VCTK −10 dB task, our model performs slightly worse than SEFORMER in terms of speech intelligibility.
Overall, our model performs better than the other models across the low-SNR scenarios and is better suited to scenes with diverse noise types. However, at extremely low SNRs, the model's adaptability to scenes where the noise is speech babble or low-frequency stationary noise is relatively insufficient.
4.2. Effect of Layers
The number of layers in a deep learning model can significantly impact its performance. As the number of layers increases, the model’s ability to represent complex relationships increases, allowing it to fit more intricate mapping functions. However, the relationship between the number of layers and performance is not always linear. When the number of layers reaches a certain point, adding more layers may no longer provide performance gains and can even decrease performance. In this experiment, we explored the effect of the number of transformer model layers on the performance and provided recommendations for using this model.
We use a sparse transformer model to model audio features and establish dependencies between the elements of a sequence; because of the sparse attention pattern, at least two layers are required to connect every pair of elements. Therefore, we use a 4-layer transformer as the starting point and experiment with 4-, 8-, 12-, and 16-layer transformer models on the MUSAN, Babble, and VCTK test sets. The results are shown in
Table 4,
Table 5 and
Table 6, respectively.
As shown in
Table 4, on the MUSAN test set, the four-layer model already has good basic performance: compared with the original noisy speech, PESQ improves by 25.02% and STOI by 8.31%. Compared with the four-layer model, the eight-layer model brings a clear further improvement in speech quality, with gains of 3.59% in PESQ and 1.20% in STOI. However, compared with the 8-layer model, the 12-layer model yields only a small improvement in speech clarity (a 0.80% increase in PESQ), and the 16-layer and 12-layer models deliver the same speech quality. In terms of speech clarity, the 12-layer model performs best, 30.54% higher than the noisy speech and 4.41% higher than the 4-layer model. In terms of intelligibility, the 16-layer model performs best, 9.74% higher than the noisy speech and 1.32% higher than the 4-layer model.
As shown in
Table 5, the four-layer model also has strong basic performance on the Babble test set, improving PESQ by 8.99% and STOI by 3.02% compared with the noisy speech. Compared with the four-layer model, the eight-layer model improves speech quality slightly, with gains of 2.67% in PESQ and 1.69% in STOI. However, the 12-layer and 16-layer models fail to improve on the 8-layer model in this task. On this test set, the 8-layer model performs best: compared with the noisy speech, PESQ increases by 11.91% and STOI by 4.76%.
As shown in
Table 6, on the VCTK test set, the 4-, 8-, 12-, and 16-layer models show no significant difference in performance. In terms of speech clarity, the best model is the 12-layer model, which improves PESQ by 22.01% relative to the noisy speech; in terms of intelligibility, the best model is the four-layer model, which improves STOI by 2.29% compared with the noisy speech.
Overall, in scenes with diverse noise types, more layers mean better performance, but increasing the number of layers beyond 12 no longer yields significant gains. For scenes where speech babble is the dominant background noise, the eight-layer model is already close to the performance ceiling. For stationary noise scenes, the four-layer model already produces good results, and increasing the number of layers does not improve performance significantly.
4.3. Ablation Experiments
We designed ablation experiments to explore the impact of sparse attention masks, pre-layer positional embedding, and channel attention on model performance. The previous experiment has proved that when the number of layers of the model is more than eight, increasing the number of layers can no longer significantly improve the model’s performance. Therefore, we only conduct research based on the eight-layer transformer in the ablation experiment. The results of the ablation experiments are shown in
Table 7,
Table 8 and
Table 9.
In the above tables, NO-TF denotes a complex UNET model without any feature fusion module in the middle, TF denotes the full-attention transformer model, +MASK denotes the transformer with the sparse attention mask, +SIN and +TRAIN denote pre-layer positional embeddings based on sine functions and on trainable parameters, respectively, and +CA denotes the introduction of channel attention.
As shown in
Table 7, on the MUSAN test set, adding the Complex Transformer as a feature fusion module brings a significant improvement in both clarity and intelligibility compared with NO-TF, with PESQ increasing by 4.22% and STOI by 2.68%. Compared with TF, adding the sparse attention mask slightly decreases speech clarity while slightly improving intelligibility; introducing the pre-layer positional embedding and channel attention improves both PESQ and STOI. On this test set, the best combination in terms of speech clarity is TF + MASK + SIN + CA, whose PESQ increases by 4.47% over NO-TF and by 1.24% over TF; in terms of speech intelligibility, the best combination is TF + MASK + TRAIN + CA, whose STOI increases by 3.83% over NO-TF and by 1.12% over TF.
As shown in
Table 8, on the Babble test set, speech quality improves significantly after the feature fusion module is introduced compared with NO-TF, with PESQ increasing by 4.22% and STOI by 2.68%. After adding the sparse attention mask, speech quality is almost unchanged. Introducing the pre-layer positional embedding and channel attention slightly improves speech quality. On this test set, the best combination is TF + MASK + TRAIN + CA: compared with NO-TF, PESQ improves by 4.47% and STOI by 3.83%; compared with TF, PESQ increases by 1.24% and STOI by 1.12%.
As shown in
Table 9, on the VCTK test set, the introduction of the feature fusion module improves speech clarity compared with NO-TF, while intelligibility is comparable. After adding the attention mask, both clarity and intelligibility improve: compared with NO-TF, PESQ improves by 2.94% and STOI by 0.42%. Introducing the pre-layer positional embedding and channel attention brings no significant improvement on this test set. In this task, the best combination is TF + MASK, which improves PESQ by 1.41% and STOI by 0.67% compared with TF.
In general, after introducing the CST as a feature fusion module, the model's performance improves on all three test sets, and the improvement is relatively larger on the MUSAN and Babble test sets, which correspond to non-stationary noise. Among the CST variants, compared with the full-attention model, the model using sparse attention improves speech intelligibility in non-stationary noise scenes and speech quality in stationary noise scenes. We believe this supports our hypothesis in Section 3.2 that short-range correlations of speech signals are stronger than long-range correlations. On the other hand, as shown in Table 10, the introduction of the sparse attention matrix significantly reduces the computational cost of the model: the complexity of the attention matrix is reduced from $O(n^2)$ to $O(n)$, and GPU memory usage during training drops by 50%. The introduction of pre-layer positional embedding and channel attention strengthens the model's ability to handle non-stationary noise, but there is no significant improvement in stationary noise scenarios; regarding the choice of pre-layer positional embedding scheme, the trainable parameter scheme performs relatively better.
5. Conclusions
In this work, we designed the CST model for speech enhancement in low-SNR scenarios. Compared with previous RNN-based and transformer-based models, our model better captures long-distance dependencies between audio features. Compared with the original transformer structure, the attention distribution in our model is more suitable for processing speech features and has a lower computational cost, and the pre-layer positional embedding and channel attention modules strengthen the transformer's speech signal processing capability. The experimental results show that our model removes the background noise contained in the audio more effectively in low-SNR scenes and achieves remarkable improvements in the STOI and PESQ metrics.
At present, our model uses a multi-layer transformer structure. Although the sparse attention mechanism reduces the computational cost compared with the original transformer, the cost is still too high for devices with limited computing power, such as edge devices. In future work, we will reduce the model's computation and compress its size using distillation, quantization, and pruning to achieve efficient inference on devices with limited computing power.