
MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning

Published: 10 May 2024

Abstract

Voice cloning in text-to-speech (TTS) is the task of replicating the voice of a target speaker from limited data. Among the various voice cloning settings, this article focuses on zero-shot voice cloning. Although existing TTS models can generate high-quality speech for seen speakers, cloning the voice of an unseen speaker remains challenging. The key step in zero-shot voice cloning is obtaining a speaker embedding for the target speaker. Previous works use a speaker encoder to extract a fixed-size speaker embedding from a single reference audio in an unsupervised manner, but they suffer from insufficient speaker information and from content information leaking into the speaker embedding. To address these issues, this article proposes MRMI-TTS, a FastSpeech2-based framework that uses the speaker embedding as a conditioning variable to provide speaker information. MRMI-TTS extracts a speaker embedding and a content embedding from multiple reference audios using a speaker encoder and a content encoder. To obtain sufficient speaker information, the reference audios are selected based on sentence similarity. The model then applies mutual information minimization to the two embeddings to remove the information entangled between them. Experiments on the public English dataset VCTK show that our method improves synthesized speech in terms of both similarity and naturalness, even for unseen speakers. Compared with state-of-the-art reference-embedding learning methods, our method achieves the best performance on the zero-shot voice cloning task. Furthermore, we demonstrate that the proposed method better preserves the speaker embedding across different languages. Sample outputs are available on the demo page.
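
To make the disentanglement step concrete, the following is a minimal sketch of mutual information minimization between a speaker embedding and a content embedding using a CLUB-style variational upper bound (Cheng et al., ICML 2020). The abstract does not state which MI estimator MRMI-TTS uses, so the estimator choice, module names, embedding dimensions, and the alternating update scheme below are illustrative assumptions rather than the authors' implementation.

# Illustrative sketch (not the authors' code) of CLUB-style mutual information
# minimization between a speaker embedding and a content embedding.
# Dimensions and hyper-parameters are assumptions for illustration only.
import torch
import torch.nn as nn


class CLUBEstimator(nn.Module):
    """Variational CLUB upper bound on I(speaker; content).

    q(content | speaker) is modelled as a diagonal Gaussian whose mean and
    log-variance are predicted from the speaker embedding.
    """

    def __init__(self, spk_dim=256, con_dim=256, hidden=512):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(), nn.Linear(hidden, con_dim))
        self.logvar = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, con_dim), nn.Tanh())

    def log_likelihood(self, spk, con):
        # log q(con | spk), summed over feature dims, averaged over the batch.
        mu, logvar = self.mu(spk), self.logvar(spk)
        return (-0.5 * ((con - mu) ** 2) / logvar.exp() - 0.5 * logvar).sum(-1).mean()

    def mi_upper_bound(self, spk, con):
        # CLUB: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)],
        # with negative samples built by pairing each speaker embedding
        # with a shuffled content embedding from the same batch.
        mu, logvar = self.mu(spk), self.logvar(spk)
        pos = (-0.5 * ((con - mu) ** 2) / logvar.exp()).sum(-1)
        con_shuffled = con[torch.randperm(con.size(0))]
        neg = (-0.5 * ((con_shuffled - mu) ** 2) / logvar.exp()).sum(-1)
        return (pos - neg).mean()


if __name__ == "__main__":
    estimator = CLUBEstimator()
    opt_q = torch.optim.Adam(estimator.parameters(), lr=1e-4)

    spk_emb = torch.randn(8, 256)   # stand-in for speaker-encoder output
    con_emb = torch.randn(8, 256)   # stand-in for content-encoder output

    # Step 1: fit the variational approximation q(con | spk).
    opt_q.zero_grad()
    (-estimator.log_likelihood(spk_emb.detach(), con_emb.detach())).backward()
    opt_q.step()

    # Step 2: the TTS model would minimize reconstruction_loss + lambda * mi,
    # pushing the two encoders toward disentangled embeddings.
    mi_loss = estimator.mi_upper_bound(spk_emb, con_emb)
    print(f"estimated MI upper bound: {mi_loss.item():.4f}")

In a full system, the likelihood step for the variational network and the MI-penalized TTS update would typically alternate each training iteration, with the MI term weighted against the FastSpeech2 reconstruction losses.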

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 5, May 2024, 297 pages
    EISSN: 2375-4702
    DOI: 10.1145/3613584

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 10 May 2024
    Online AM: 30 March 2024
    Accepted: 17 December 2023
    Revised: 27 July 2023
    Received: 17 July 2022
    Published in TALLIP Volume 23, Issue 5

    Author Tags

    1. Multiple reference audios
    2. Zero-shot
    3. Text-to-speech
    4. Mutual information

    Qualifiers

    • Short-paper
