
MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning

Published: 10 May 2024

Abstract

Voice cloning in text-to-speech (TTS) is the task of replicating the voice of a target speaker from limited data. Among the various voice cloning settings, this article focuses on zero-shot voice cloning. Although existing TTS models can generate high-quality speech for seen speakers, cloning the voice of an unseen speaker remains challenging. The key step in zero-shot voice cloning is obtaining a speaker embedding for the target speaker. Previous works use a speaker encoder to extract a fixed-size speaker embedding from a single reference audio in an unsupervised manner, but they suffer from insufficient speaker information and from content information leaking into the speaker embedding. To address these issues, this article proposes MRMI-TTS, a FastSpeech2-based framework that uses the speaker embedding as a conditioning variable to provide speaker information. MRMI-TTS extracts a speaker embedding and a content embedding from multiple reference audios using a speaker encoder and a content encoder. To obtain sufficient speaker information, the reference audios are selected based on sentence similarity. The model then applies mutual information minimization to the two embeddings to remove the information entangled between them. Experiments on the public English dataset VCTK show that our method improves synthesized speech in terms of both similarity and naturalness, even for unseen speakers. Compared with state-of-the-art reference-embedding learning methods, our method achieves the best performance on the zero-shot voice cloning task. Furthermore, we demonstrate that the proposed method better preserves the speaker embedding across different languages. Sample outputs are available on the demo page.
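
To make the disentanglement step concrete, the following is a minimal sketch of mutual information minimization between a speaker embedding and a content embedding using a CLUB-style variational upper bound (Cheng et al., ICML 2020). The abstract does not state which MI estimator MRMI-TTS uses, so the estimator choice, module names, embedding dimensions, and the alternating update scheme below are illustrative assumptions rather than the authors' implementation.

# Illustrative sketch (not the authors' code) of CLUB-style mutual information
# minimization between a speaker embedding and a content embedding.
# Dimensions and hyper-parameters are assumptions for illustration only.
import torch
import torch.nn as nn


class CLUBEstimator(nn.Module):
    """Variational CLUB upper bound on I(speaker; content).

    q(content | speaker) is modelled as a diagonal Gaussian whose mean and
    log-variance are predicted from the speaker embedding.
    """

    def __init__(self, spk_dim=256, con_dim=256, hidden=512):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(), nn.Linear(hidden, con_dim))
        self.logvar = nn.Sequential(
            nn.Linear(spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, con_dim), nn.Tanh())

    def log_likelihood(self, spk, con):
        # log q(con | spk), summed over feature dims, averaged over the batch.
        mu, logvar = self.mu(spk), self.logvar(spk)
        return (-0.5 * ((con - mu) ** 2) / logvar.exp() - 0.5 * logvar).sum(-1).mean()

    def mi_upper_bound(self, spk, con):
        # CLUB: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)],
        # with negative samples built by pairing each speaker embedding
        # with a shuffled content embedding from the same batch.
        mu, logvar = self.mu(spk), self.logvar(spk)
        pos = (-0.5 * ((con - mu) ** 2) / logvar.exp()).sum(-1)
        con_shuffled = con[torch.randperm(con.size(0))]
        neg = (-0.5 * ((con_shuffled - mu) ** 2) / logvar.exp()).sum(-1)
        return (pos - neg).mean()


if __name__ == "__main__":
    estimator = CLUBEstimator()
    opt_q = torch.optim.Adam(estimator.parameters(), lr=1e-4)

    spk_emb = torch.randn(8, 256)   # stand-in for speaker-encoder output
    con_emb = torch.randn(8, 256)   # stand-in for content-encoder output

    # Step 1: fit the variational approximation q(con | spk).
    opt_q.zero_grad()
    (-estimator.log_likelihood(spk_emb.detach(), con_emb.detach())).backward()
    opt_q.step()

    # Step 2: the TTS model would minimize reconstruction_loss + lambda * mi,
    # pushing the two encoders toward disentangled embeddings.
    mi_loss = estimator.mi_upper_bound(spk_emb, con_emb)
    print(f"estimated MI upper bound: {mi_loss.item():.4f}")

In a full system, the likelihood step for the variational network and the MI-penalized TTS update would typically alternate each training iteration, with the MI term weighted against the FastSpeech2 reconstruction losses.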

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 5, May 2024, 297 pages
    EISSN: 2375-4702
    DOI: 10.1145/3613584

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 10 May 2024
    Online AM: 30 March 2024
    Accepted: 17 December 2023
    Revised: 27 July 2023
    Received: 17 July 2022
    Published in TALLIP Volume 23, Issue 5

    Author Tags

    1. Multiple reference audios
    2. Zero-shot
    3. Text-to-speech
    4. Mutual information

    Qualifiers

    • Short-paper
