Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

Yang, Dong; Koriyama, Tomoki; Saito, Yuki; Saeki, Takaaki; Xin, Detai; Saruwatari, Hiroshi

arXiv:2302.13652 (eess)

[Submitted on 27 Feb 2023]

Abstract: Pause insertion, also known as phrase break prediction and phrasing, is an essential part of text-to-speech (TTS) systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore the fact that speakers differ in how they insert silent pauses, which can degrade the performance of a model trained on a multi-speaker speech corpus. To address this, we propose more powerful pause insertion frameworks based on a pre-trained language model. Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus and injects a speaker embedding to capture speaker-specific characteristics. We also leverage duration-aware pause insertion for more natural multi-speaker TTS. We develop and evaluate two types of models. The first improves on conventional phrasing models in predicting the positions of respiratory pauses (RPs), i.e., silent pauses at word transitions without punctuation. It performs speaker-conditioned RP prediction using contextual information and is used to demonstrate the effect of speaker information on the prediction. The second model is designed for phoneme-based TTS models and performs duration-aware pause insertion, predicting both RPs and punctuation-indicated pauses (PIPs), each categorized by duration. The evaluation results show that our models improve the precision and recall of pause insertion and the rhythm of synthetic speech.
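The labeling scheme described in the abstract (RPs at unpunctuated word transitions, PIPs at punctuated ones, each categorized by duration) and the precision/recall evaluation can be illustrated with a minimal Python sketch. This is not the authors' implementation: the duration boundaries `SHORT_MAX` and `MEDIUM_MAX`, the three class names, the punctuation set, and the helper names are all assumptions made for illustration, since the abstract does not state them.

```python
from typing import List, Tuple

# Assumed punctuation set; the abstract does not list the marks considered.
PUNCTUATION = (",", ".", "!", "?", ";", ":")

# Hypothetical duration boundaries in seconds (not taken from the paper).
SHORT_MAX = 0.3
MEDIUM_MAX = 0.7


def duration_class(seconds: float) -> str:
    """Bucket a pause duration into an assumed short/medium/long class."""
    if seconds <= SHORT_MAX:
        return "short"
    if seconds <= MEDIUM_MAX:
        return "medium"
    return "long"


def label_pauses(tokens: List[str], pauses: List[float]) -> List[str]:
    """Label the transition after each token.

    pauses[i] is the silence (in seconds) observed after tokens[i];
    0.0 means no pause. A pause after punctuation is a PIP, any other
    pause is an RP.
    """
    labels = []
    for token, pause in zip(tokens, pauses):
        if pause <= 0.0:
            labels.append("none")
            continue
        kind = "PIP" if token.endswith(PUNCTUATION) else "RP"
        labels.append(f"{kind}-{duration_class(pause)}")
    return labels


def precision_recall(
    reference: List[str], predicted: List[str], prefix: str = "RP"
) -> Tuple[float, float]:
    """Position-level precision and recall for labels starting with `prefix`."""
    ref_pos = [r.startswith(prefix) for r in reference]
    pred_pos = [p.startswith(prefix) for p in predicted]
    true_pos = sum(r and p for r, p in zip(ref_pos, pred_pos))
    precision = true_pos / sum(pred_pos) if any(pred_pos) else 0.0
    recall = true_pos / sum(ref_pos) if any(ref_pos) else 0.0
    return precision, recall


tokens = ["well,", "I", "think", "that", "works."]
pauses = [0.25, 0.0, 0.5, 0.0, 0.9]
reference = label_pauses(tokens, pauses)
# A made-up model output with one spurious RP after "I".
predicted = ["PIP-short", "RP-short", "RP-medium", "none", "PIP-long"]
print(reference)
print(precision_recall(reference, predicted))
```

Here precision and recall are computed on pause position only; scoring the duration class as well would require matching the full label rather than its prefix.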

Comments:	Accepted by ICASSP2023
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2302.13652 [eess.AS]
	(or arXiv:2302.13652v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2302.13652

Submission history

From: Dong Yang
[v1] Mon, 27 Feb 2023 10:40:41 UTC (123 KB)
