Audio Generation with Multiple Conditional Diffusion Model

Authors

  • Zhifang Guo, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
  • Jianguo Mao, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
  • Rui Tao, Toshiba China R&D Center, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
  • Long Yan, Toshiba China R&D Center, Beijing, China
  • Kazushige Ouchi, Toshiba China R&D Center, Beijing, China
  • Hong Liu, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  • Xiangdong Wang, Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v38i16.29773

Keywords:

NLP: Language Grounding & Multi-modal NLP, NLP: Generation

Abstract

Text-based audio generation models are inherently limited: a text prompt cannot capture all the information in an audio signal, so controllability is restricted when relying on text alone. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions, namely content (timestamp) and style (pitch contour and energy contour), as supplements to the text. This approach enables fine-grained control over the temporal order, pitch, and energy of the generated audio. To preserve generation diversity, we employ a trainable control condition encoder, enhanced by a large language model, and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Because suitable datasets and evaluation metrics are lacking, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions, and we adopt a series of evaluation metrics to assess controllability. Experimental results demonstrate that our model achieves fine-grained control and thereby accomplishes controllable audio generation.
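The conditioning scheme described in the abstract (a frozen pre-trained text-to-audio diffusion backbone, a trainable condition encoder for timestamp, pitch-contour, and energy-contour inputs, and a trainable Fusion-Net that injects the encoded conditions) can be illustrated with a minimal PyTorch sketch. All module names, tensor shapes, and the gated residual fusion rule below are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Minimal sketch under assumed shapes and module names (not the paper's code).
import torch
import torch.nn as nn


class ConditionEncoder(nn.Module):
    """Maps per-frame (timestamp, pitch, energy) values to condition features."""

    def __init__(self, cond_dim: int = 3, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, conditions: torch.Tensor) -> torch.Tensor:
        # conditions: (batch, frames, cond_dim) -> (batch, frames, hidden_dim)
        return self.net(conditions)


class FusionNet(nn.Module):
    """Fuses condition features into the frozen backbone's output features."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, backbone_h: torch.Tensor, cond_h: torch.Tensor) -> torch.Tensor:
        # Gated residual injection: near-zero gate output leaves the frozen
        # backbone's behaviour (and hence generation diversity) untouched.
        fused = torch.cat([backbone_h, cond_h], dim=-1)
        return backbone_h + torch.tanh(self.gate(fused))


class ControllableTTA(nn.Module):
    """Frozen text-to-audio backbone plus trainable conditioning modules."""

    def __init__(self, pretrained_backbone: nn.Module, hidden_dim: int = 256):
        super().__init__()
        self.backbone = pretrained_backbone
        for p in self.backbone.parameters():      # freeze pre-trained weights
            p.requires_grad = False
        self.cond_encoder = ConditionEncoder(hidden_dim=hidden_dim)  # trainable
        self.fusion = FusionNet(hidden_dim=hidden_dim)               # trainable

    def forward(self, noisy_latent, timestep, text_emb, conditions):
        h = self.backbone(noisy_latent, timestep, text_emb)  # (B, frames, hidden_dim)
        c = self.cond_encoder(conditions)                     # (B, frames, hidden_dim)
        return self.fusion(h, c)                              # conditioned prediction
```

In this sketch only the ConditionEncoder and FusionNet parameters receive gradients during fine-tuning, mirroring the abstract's design choice of keeping the pre-trained text-to-audio weights frozen so that the backbone's generation diversity is preserved.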

Published

2024-03-24

How to Cite

Guo, Z., Mao, J., Tao, R., Yan, L., Ouchi, K., Liu, H., & Wang, X. (2024). Audio Generation with Multiple Conditional Diffusion Model. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18153-18161. https://doi.org/10.1609/aaai.v38i16.29773

Section

AAAI Technical Track on Natural Language Processing I