Our proposed forced aligner pipeline, illustrated in
Figure 1, comprises three essential components: (1) a phoneme recogniser based on
wav2vec 2.0; (2) a preliminary
unsupervised phoneme segmenter based on
UnsupSeg; and (3) a novel forced aligner. The first two components process the audio inputs, and the third combines their outputs in a new forced alignment method, which is the core of the proposed pipeline. Details of each component are given in the following sub-sections.
2.1.1. Phoneme Recognition
Wav2vec 2.0 is a self-supervised end-to-end model comprising convolutional and transformer layers. The model encodes raw audio inputs $\mathcal{X}$ into latent speech representations $z_1, \ldots, z_T$ for $T$ time-steps through a multi-layer convolutional feature encoder $f: \mathcal{X} \mapsto \mathcal{Z}$. The speech representations are then fed to a Transformer network $g: \mathcal{Z} \mapsto \mathcal{C}$, applied over masked time-steps, to build contextualised representations $c_1, \ldots, c_T$. Meanwhile, the latent speech representations are discretised to $q_1, \ldots, q_T$ via a quantisation module $\mathcal{Z} \mapsto \mathcal{Q}$. The quantised representations serve as the targets in the self-supervised learning objective [31]. The feature encoder is composed of seven convolutional blocks with 512 channels, strides of (5, 2, 2, 2, 2, 2, 2) and kernel widths of (10, 3, 3, 3, 3, 2, 2). The Transformer network contains 24 transformer blocks with a model dimension of 1024, an inner (feed-forward) dimension of 4096, and 16 attention heads. The complete architecture of this model is shown in Figure 1 of the original paper [31].
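For reference, the temporal resolution implied by these strides and kernel widths can be worked out directly; the short sketch below is an illustrative calculation (assuming 16 kHz input), not part of the pipeline itself.

```python
# Illustrative sketch: temporal resolution of the wav2vec 2.0 feature encoder,
# derived from the strides and kernel widths quoted above (assumed 16 kHz input).
strides = (5, 2, 2, 2, 2, 2, 2)
kernels = (10, 3, 3, 3, 3, 2, 2)

total_stride = 1
receptive_field = 1
for k, s in zip(kernels, strides):
    receptive_field += (k - 1) * total_stride
    total_stride *= s

sample_rate = 16_000
print(f"samples per latent frame: {total_stride}")                        # 320 samples
print(f"frame step: {1000 * total_stride / sample_rate} ms")              # 20 ms
print(f"receptive field: {1000 * receptive_field / sample_rate:.1f} ms")  # 25 ms
```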
We fine-tuned a pre-trained wav2vec 2.0 acoustic model based on the wav2vec2-xls-r-1b checkpoint, which is available through the Hugging Face wav2vec 2.0 implementation. The initial step is pre-processing the datasets. On the Hugging Face platform, the datasets library [41] is employed to load and pre-process our datasets efficiently. This library leverages a mapping function that enables batch loading and multi-threading, resulting in a significant reduction in dataset processing time. It also includes various public datasets, such as TIMIT, with ready-to-use loading scripts. However, the TORGO dataset is not among its offerings, so a similar script was developed to load the TORGO dataset. In this script, each audio sample in both the TORGO-TD and TORGO-SSD groups was treated as an individual instance; the number of instances in each group is tabulated in Table 3. Each instance is associated with several attributes, as detailed in the following list. Attributes such as File, Text, and Phonetic Detail are deemed essential, while the others are optional. During the pre-processing phase, we excluded the optional attributes to streamline data handling.
File: Path to the corresponding audio file.
Text: The corresponding transcription for the audio file.
Phonetic Detail: The corresponding phonetic transcription for the audio file, represented as <BEGIN_SAMPLE><END_SAMPLE><PHONEME>. BEGIN_SAMPLE is the starting integer sample number of the segment and END_SAMPLE is the ending integer sample number of the segment. PHONEME is a single unit of the phonetic transcription, typically written with ARPABET phonetic symbols.
Word Detail: The word-level transcription for the audio file, represented as <BEGIN_SAMPLE><END_SAMPLE><WORD>. BEGIN_SAMPLE is the starting integer sample number of the segment and END_SAMPLE is the ending integer sample number of the segment. WORD is a single word from the orthographic transcription.
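For illustration, a single loaded instance then takes roughly the following form; the file path, transcription, and sample numbers below are hypothetical and serve only to make the attribute layout concrete.

```python
# Hypothetical example of one loaded TORGO instance; the path, transcription,
# and sample numbers are illustrative only, not actual values from the corpus.
example_instance = {
    "file": "TORGO/F01/Session1/wav_arrayMic/0001.wav",  # File: path to the audio
    "text": "the quick brown fox",                       # Text: orthographic transcription
    "phonetic_detail": {                                 # Phonetic Detail per segment
        "start": [0, 2400, 4800],                        # BEGIN_SAMPLE
        "stop":  [2400, 4800, 7200],                     # END_SAMPLE
        "utterance": ["dh", "ah", "k"],                  # PHONEME (ARPABET)
    },
    "word_detail": {                                     # Word Detail per segment
        "start": [0, 4800],
        "stop":  [4800, 12000],
        "utterance": ["the", "quick"],
    },
}
```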
Pre-processing the data for fine-tuning wav2vec 2.0 includes creating a tokeniser, a feature extractor, a processor, and a data collator. In this study, the tokeniser was a dictionary mapping phonemes to numerical representations. The 45 unique ARPABET phonemes in the TORGO dataset were collected in a vocabulary list and then converted into an enumerated dictionary. Because of limitations in the current version of Hugging Face, some multi-character ARPABET phonemes, such as “aa”, “ay”, and “zh”, could not be represented in the dictionary. Therefore, we encoded the phonemes as Unicode emojis starting from U+1F600, and a Hugging Face wav2vec 2.0 tokeniser was created from the resulting Unicode-to-index dictionary. To extract sequential features from the input speech, a feature extractor was declared with: feature size = 1, sampling rate = 16 kHz, padding value = 0, and normalise = False. The processor combined the tokeniser and the feature extractor to pre-process our datasets. Additionally, a data collator was created to collate a batch of data into a format suitable for model training. Because the input sequences of the wav2vec 2.0 model are significantly longer than the output sequences, we dynamically padded the training batches to the longest sample in each batch rather than the overall longest sample, which improves fine-tuning efficiency.
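The following sketch illustrates this set-up with the Hugging Face transformers API; the vocabulary file name, the special tokens, and the truncated phoneme list are illustrative assumptions rather than the exact values used in our implementation.

```python
import json
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor)

# Map each ARPABET phoneme to a single Unicode character starting at U+1F600,
# so multi-character phonemes such as "aa" become one token each.
arpabet_phonemes = ["aa", "ae", "ah", "ay", "zh"]  # truncated illustrative list
unicode_map = {p: chr(0x1F600 + i) for i, p in enumerate(arpabet_phonemes)}

# Enumerated vocabulary over the Unicode symbols, plus padding/unknown specials
# (the special-token names here are assumptions, not taken from the paper).
vocab = {u: i for i, u in enumerate(unicode_map.values())}
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                              padding_value=0.0, do_normalize=False)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)
```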
Finally, we fine-tuned a large-scale pre-trained model, wav2vec2-xls-r-1b, on a disordered speech dataset, in contrast to our preliminary work [40]. This model was pre-trained on 436k hours of unlabelled speech, sampled at 16 kHz, across 128 languages. During pre-training, the model learned latent representations of many languages; however, these representations are not directly useful until the model is trained further on a downstream task. Therefore, we fine-tuned the learned representations on labelled data and added a randomly initialised output layer on top of the Transformer to predict phonemes. During fine-tuning, the model was optimised by minimising a CTC loss [21]. Performance was measured with the phoneme error rate (PER), obtained by comparing the predictions of the fine-tuned model against the ground truth provided by the TORGO dataset. PER is derived from the Levenshtein distance, a string metric that counts the substitutions, insertions, and deletions required to transform one sequence into the other. We used 50 epochs, a batch size of 8, and a learning rate of
, which was warmed up for the first 10% of the training.
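As a concrete illustration of how the PER follows from the Levenshtein distance, the sketch below counts substitutions, insertions, and deletions between a predicted and a reference phoneme sequence and normalises by the reference length; it is a generic implementation written for illustration, not the exact evaluation code used in this work.

```python
def phoneme_error_rate(reference, hypothesis):
    """PER = Levenshtein distance between phoneme sequences / reference length.
    Generic illustration; not the exact evaluation script used in this work."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # deletions
    for j in range(n + 1):
        dp[0][j] = j                              # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[m][n] / max(m, 1)

# Example: one substitution over four reference phonemes -> PER = 0.25
print(phoneme_error_rate(["k", "ae", "t", "s"], ["k", "eh", "t", "s"]))
```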
2.1.2. Unsupervised Phoneme Segmentation
UnsupSeg The unsupervised segmentation model UnsupSeg was utilised to identify phoneme boundaries in raw waveform data [
39].
UnsupSeg is a convolutional neural network that operates directly on the raw waveform of the speech signal. A feature extractor transforms the input waveform into a sequence of latent vectors via $f: \mathcal{X} \mapsto \mathcal{Z}$. The network $f$ learns to identify spectral changes in the signal using the Noise-Contrastive Estimation principle [42], a technique for learning representations by contrasting positive and negative examples. The feature encoder comprises five blocks of 1D strided convolution, each followed by Batch Normalisation and a Leaky ReLU [43] nonlinear activation. The network $f$ has kernel sizes of (10, 8, 4, 4, 4), strides of (5, 4, 2, 2, 2) and 256 channels per layer. The complete architecture of this model is depicted in Figure 1 of the original paper [
39].
The model is trained in a self-supervised manner, meaning that it does not require human annotations in the form of target boundaries or phonetic transcriptions. We trained the model on TIMIT with the following parameters: learning rate = , epochs = 200, batch size = 8. For the TORGO dataset, we explored different hyper-parameters; the UnsupSeg model achieved its best performance (R-value = 0.65) using the same settings as for TIMIT. At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries.
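A minimal sketch of such a peak-picking step is given below, assuming the segmentation network outputs a per-frame boundary score; the use of scipy.signal.find_peaks and the prominence threshold are our illustrative choices, not details taken from UnsupSeg.

```python
import numpy as np
from scipy.signal import find_peaks

def boundaries_from_scores(scores, frame_rate_hz, prominence=0.1):
    """Convert per-frame boundary scores into boundary times (seconds).

    `scores` stands in for the segmentation network's frame-level output;
    the prominence threshold is an illustrative choice, not UnsupSeg's value.
    """
    peak_indices, _ = find_peaks(scores, prominence=prominence)
    return peak_indices / frame_rate_hz

# Toy example: synthetic scores with two clear peaks at frames 2 and 6
scores = np.array([0.0, 0.1, 0.9, 0.1, 0.05, 0.2, 0.8, 0.1])
print(boundaries_from_scores(scores, frame_rate_hz=100))  # [0.02 0.06]
```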
Voice Activity Detection Voice activity detection (VAD) was incorporated to eliminate extraneous segments during silent periods. The public implementation we used produces a frame-based voice activity sequence in which each frame is marked 0 (non-speech) or 1 (speech). To combine this output with the UnsupSeg model, we developed Algorithm A1 (in Appendix A) to convert the frame sequence into rising and falling edge pairs. Any segments found within a non-speech region were subsequently deleted. The entire process is visually represented in Figure 2. To detect voice activity in disordered speech accurately and efficiently, we experimented with different parameter values. Taking the trade-off between training efficiency and accuracy into account, our implementation performed well with the following parameters: number of FFT points = , window length = 0.025 s, hop length = 0.01 s, threshold = 0.5.
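The following simplified sketch conveys the idea behind this conversion and filtering; it is an illustrative stand-in for Algorithm A1, assuming a binary frame sequence and a fixed hop length.

```python
def frames_to_speech_regions(frames, hop_length=0.01):
    """Convert a 0/1 frame-level VAD sequence into (rising, falling) edge pairs,
    returned as (start_time, end_time) in seconds. Simplified stand-in for
    Algorithm A1 in Appendix A, assuming a fixed hop length."""
    regions, start = [], None
    for i, v in enumerate(frames):
        if v == 1 and start is None:          # rising edge
            start = i * hop_length
        elif v == 0 and start is not None:    # falling edge
            regions.append((start, i * hop_length))
            start = None
    if start is not None:                     # speech runs to the end
        regions.append((start, len(frames) * hop_length))
    return regions

def keep_boundaries_in_speech(boundaries, regions):
    """Discard candidate segment boundaries that fall inside non-speech regions."""
    return [b for b in boundaries if any(s <= b <= e for s, e in regions)]

frames = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
print(frames_to_speech_regions(frames))  # [(0.02, 0.05), (0.07, 0.09)] up to rounding
```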
2.1.3. Forced Alignment
The forced alignment algorithm was developed to combine the outputs of
wav2vec 2.0 and
UnsupSeg models. As shown in
Figure 3, (a) shows the recognised tokens and weak positional information provided by wav2vec 2.0, and (c) shows the unlabelled segments produced by UnsupSeg. We used the recognised phoneme within each segment to annotate that segment. For instance, when a segment spans from time $t_{start}$ to $t_{end}$ and contains the label $L$, we assign the label $L$ to the entire segment. However, some segments may contain no phoneme or several conflicting phonemes. In
Figure 3, we depict the segments with conflicting phonemes, highlighting the challenge we faced. This issue was addressed by our novel algorithm which, instead of directly assigning the recognised phoneme to the segment, uses a nearest-neighbour approach to determine the class region (boundaries) for each phoneme. The class region $R_i$ for each phoneme is calculated using the midpoint of two successive label impulses:

$$R_i = \left\{ x \;\middle|\; \frac{t_{i-1} + t_i}{2} \le t_x < \frac{t_i + t_{i+1}}{2} \right\}$$

Here, $R_i$ denotes region $i$ belonging to class $c_i$, $x$ denotes the item to be classified, $t_i$ denotes the time of label impulse $i$, and $t_x$ denotes the time of any segment that is not at $t_i$ or $t_{i+1}$.
However, the phonemes recognised by wav2vec 2.0 might be located closer to either the start or the end of the true segment. To address this, we introduced a bias factor into the calculation of the class region boundaries, as shown in Equation (3). The bias factor $\beta$ allows us to adjust the boundary position, bringing it closer to (for $\beta < 0.5$) or farther from (for $\beta > 0.5$) the uppermost segment:

$$R_i = \left\{ x \;\middle|\; t_{i-1} + \beta\,(t_i - t_{i-1}) \le t_x < t_i + \beta\,(t_{i+1} - t_i) \right\} \qquad (3)$$

As can be seen, the mid-point boundary is a special case of the biased boundary when $\beta = 0.5$. The bias allows the boundary to be tuned more specifically to the data.
After obtaining the class regions, as shown in Figure 4b, we compared the overlapping sections between each class region and the corresponding segment. The phoneme whose class region has the greatest overlap was selected as the label for that segment, as illustrated in Figure 5. Within the segment spanning from 0.54 s to 0.63 s, the class regions of three phonemes (“t”, “r”, and “ey”) overlap with it. Upon calculating the overlap sections, the phoneme “r” is dominant and is therefore selected as the final label for this segment.
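To make the labelling rule concrete, the sketch below builds (optionally biased) class regions around the recognised label impulses and assigns each unlabelled segment the phoneme whose region overlaps it most; the function names and the impulse/segment values are illustrative assumptions, not the exact implementation.

```python
def class_regions(impulses, bias=0.5):
    """Build (start, end, phone) class regions around wav2vec 2.0 label impulses.

    `impulses` is a list of (time, phone) pairs; each internal boundary is placed
    at t_i + bias * (t_{i+1} - t_i), so bias = 0.5 gives the mid-point boundary.
    Names and structure are illustrative, not the paper's exact implementation.
    """
    regions = []
    for i, (t_i, phone) in enumerate(impulses):
        left = impulses[i - 1][0] + bias * (t_i - impulses[i - 1][0]) if i > 0 else 0.0
        right = t_i + bias * (impulses[i + 1][0] - t_i) if i + 1 < len(impulses) else float("inf")
        regions.append((left, right, phone))
    return regions

def label_segment(segment, regions):
    """Label a (start, end) segment with the phoneme whose region overlaps it most."""
    start, end = segment
    overlaps = [(max(0.0, min(end, r_end) - max(start, r_start)), phone)
                for r_start, r_end, phone in regions]
    return max(overlaps)[1]

# Illustrative values loosely following the example in the text
impulses = [(0.50, "t"), (0.58, "r"), (0.62, "ey")]
regions = class_regions(impulses, bias=0.5)
print(label_segment((0.54, 0.63), regions))  # -> "r"
```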
To further increase the accuracy of the predictions, we applied post-processing methods to remove overfitted phonemes through the following steps:
Get the spoken word with CTC collapse.
Calculate transitions from every pair of adjacent letters, i.e., cat = (c-a, a-t).
Scan through the labelled segments and amalgamate every two adjacent labels that are the same but do not form a permissible transition.
Cleaning helps merge successive duplicate segments that result from over-fitting, while preserving successive duplicate segments where this is the expected behaviour. Words with expected behaviour are those whose true pronunciation contains two similar successive sounds, such as the word “ca-ck-ck-al” (cackal).
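As a small illustration of the first two steps, the sketch below collapses a raw CTC output into tokens and derives the adjacent-pair transitions; the collapse shown is the standard removal of repeats and blanks, and the token values are invented for the example.

```python
def ctc_collapse(raw_tokens, blank="<pad>"):
    """Standard CTC collapse: merge repeated tokens, then drop blanks.
    Illustrative only; the pipeline obtains this from the wav2vec 2.0 decoder."""
    collapsed = []
    previous = None
    for token in raw_tokens:
        if token != previous:
            collapsed.append(token)
        previous = token
    return [t for t in collapsed if t != blank]

def permissible_transitions(tokens):
    """Adjacent-pair transitions, e.g. ['c', 'a', 't'] -> [('c', 'a'), ('a', 't')]."""
    return list(zip(tokens, tokens[1:]))

raw = ["c", "c", "<pad>", "a", "a", "<pad>", "t"]
tokens = ctc_collapse(raw)              # ['c', 'a', 't']
print(permissible_transitions(tokens))  # [('c', 'a'), ('a', 't')]
```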
We implemented the above strategy in two different ways. The first method, soft cleaning (see Algorithm 1), can be considered a local clean. It scanned a sequence of segments and, when it found each transition, moved on to the next transition. It also moved on to the next transition when it found a duplicate. The limitation of this approach was that only the first duplicate segment pair would be amalgamated. The benefit was that it would amalgamate segments even when there was a permissible transition elsewhere in the sequence, so duplicates would only be amalgamated where wav2vec 2.0 specified that they could be.
Algorithm 1 Soft Clean

procedure SoftClean(segList, wavPath)
    segList is a list of segments which have a start, stop and phone label
    transitionsList ← a new list
    segList ← copy of segList by value
    tokens ← predict wavPath with W2V2 + CTC collapse
    for ii ← 0 to (length of tokens) − 2 do
        append tuple (tokens[ii], tokens[ii + 1]) to transitionsList
    end for
    index ← 0
    for jj ← 0 to (length of transitionsList) − 1 do
        found ← false
        limitReached ← false
        while found is false and limitReached is false do
            if index ≥ (length of segList) − 1 then
                limitReached ← true
            else
                segFrom ← segList[index]
                segTo ← segList[index + 1]
                if segFrom[label] == segTo[label] and segFrom[label] == transitionsList[jj][0] then
                    segList[index] ← tuple (segList[index][label], segList[index][start], segList[index + 1][stop])
                    remove segList[index + 1] from segList
                else if segFrom[label] == transitionsList[jj][0] and segTo[label] == transitionsList[jj][1] then
                    found ← true
                    index ← index + 1
                else
                    index ← index + 1
                    break
                end if
            end if
        end while
    end for
    return segList
end procedure
The second method, hard cleaning (see Algorithm 2), did not take into account where the transition happened in the sequence. If wav2vec 2.0 specified that a duplicate transition (i.e., “ah” → “ah”) was allowed to occur at the end of the sequence, but the cleaning step found one at the start of the sequence, it would amalgamate it automatically.
Algorithm 2 Hard Clean

procedure HardClean(segList, wavPath)
    segList is a list of segments which have a start, stop and phone label
    transitionsList ← a new list
    segList ← copy of segList by value
    tokens ← predict wavPath with W2V2 + CTC collapse
    for ii ← 0 to (length of tokens) − 2 do
        append tuple (tokens[ii], tokens[ii + 1]) to transitionsList
    end for
    ceiling ← (length of segList) − 2
    jj ← 0
    finished ← false
    while finished is false do
        if jj ≤ ceiling then
            if segList[jj][label] == segList[jj + 1][label] then
                if tuple (segList[jj][label], segList[jj + 1][label]) is not in transitionsList then
                    newSeg ← tuple (segList[jj][label], segList[jj][start], segList[jj + 1][stop])
                    segList[jj] ← newSeg
                    remove segList[jj + 1] from segList
                    ceiling ← ceiling − 1
                    jj ← jj − 1
                end if
            end if
            jj ← jj + 1
        else
            finished ← true
        end if
    end while
    return segList
end procedure