Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech

Aggarwal, Vatsal; Cotescu, Marius; Prateek, Nishant; Lorenzo-Trueba, Jaime; Barra-Chicote, Roberto

arXiv:1911.12760 (cs)

[Submitted on 28 Nov 2019 (v1), last revised 17 Feb 2020 (this version, v2)]

Abstract: We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over state-of-the-art. At synthesis time we use one example of expressive style as a reference input to the encoder for generating any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).
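The abstract names a VAE posterior refined by a Householder Flow but gives no architectural detail. As a rough illustration only, the sketch below shows the generic Householder transformation (a reflection of a latent sample across a hyperplane) together with the closed-form KL term of a diagonal-Gaussian VAE. The function names, latent dimension, number of flow steps, and the random reflection vectors are all assumptions made for this example; in a trained model the vectors would come from the encoder, and nothing here is taken from the authors' implementation.

```python
import numpy as np

def householder_reflect(z, v):
    """Reflect z across the hyperplane orthogonal to v: z - 2 v (v.z) / ||v||^2."""
    z = np.asarray(z, dtype=float)
    v = np.asarray(v, dtype=float)
    return z - 2.0 * v * (v @ z) / (v @ v)

def householder_flow(z0, vs):
    """Apply a sequence of Householder reflections (hypothetical helper).

    Each reflection is orthogonal, so the flow is volume-preserving and
    contributes no log-determinant term to the objective.
    """
    z = np.asarray(z0, dtype=float)
    for v in vs:
        z = householder_reflect(z, v)
    return z

def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Illustrative values only: an 8-dimensional latent and 3 flow steps.
rng = np.random.default_rng(0)
mu = rng.normal(size=8)
log_var = 0.1 * rng.normal(size=8)
z0 = mu + np.exp(0.5 * log_var) * rng.normal(size=8)  # reparameterised sample
vs = rng.normal(size=(3, 8))                          # stand-in reflection vectors
zT = householder_flow(z0, vs)
kl = kl_diag_gaussian_to_standard(mu, log_var)
```

Because each step is a reflection, the transformed sample keeps the norm of the original sample while its direction changes, which lets the flow express a rotated, full-covariance posterior on top of the diagonal one.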

Comments:	Accepted to ICASSP 2020
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Cite as:	arXiv:1911.12760 [cs.LG]
	(or arXiv:1911.12760v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1911.12760

Submission history

From: Vatsal Aggarwal
[v1] Thu, 28 Nov 2019 15:57:14 UTC (988 KB)
[v2] Mon, 17 Feb 2020 13:56:04 UTC (988 KB)
