A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model
Abstract
1. Introduction
2. Materials
2.1. Convolutional Neural Network
2.1.1. U-Net
2.1.2. Deep Residual Network-18 (ResNet-18)
2.1.3. ShuffleNet-V2
2.1.4. Temporal Convolutional Network (TCN)
2.2. Training Target
3. Methods
3.1. Audio Separation Module
3.2. Visual Module
3.2.1. Face Module
3.2.2. Lip Module
3.3. Training Loss
4. Experiments
4.1. Dataset
4.2. Setup
4.3. Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Cherry, E.C. Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am. 1953, 25, 975–979. [Google Scholar] [CrossRef]
- Wang, D.; Brown, G.J. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Trans. Neural Netw. 2008, 19, 199. [Google Scholar]
- Brown, G.J.; Cooke, M. Computational auditory scene analysis. Comput. Speech Lang. 1994, 8, 297–336. [Google Scholar] [CrossRef]
- Smaragdis, P. Convolutive speech bases and their application to supervised speech separation. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1–12. [Google Scholar] [CrossRef]
- Hershey, J.R.; Rennie, S.J.; Olsen, P.A.; Kristjansson, T.T. Superhuman multi-talker speech recognition: A graphical modeling approach. Comput. Speech Lang. 2010, 24, 45–66. [Google Scholar] [CrossRef]
- Hershey, J.R.; Chen, Z.; Le Roux, J.; Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. In Proceedings of the 41st International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 31–35. [Google Scholar]
- Chen, Z.; Luo, Y.; Mesgarani, N. Deep Attractor Network for Single-microphone Speech Separation. arXiv 2017, arXiv:1611.08930v2. [Google Scholar]
- Luo, Y.; Chen, Z.; Mesgarani, N. Speaker-Independent Speech Separation with Deep Attractor Network. arXiv 2018, arXiv:1707.03634v3. [Google Scholar] [CrossRef]
- Han, C.; Luo, Y.; Mesgarani, N. Online Deep Attractor Network for Real-time Single-channel Speech Separation. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 361–365. [Google Scholar]
- Isik, Y.; Roux, J.L.; Chen, Z.; Watanabe, S.; Hershey, J.R. Single-Channel Multi-Speaker Separation Using Deep Clustering. In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 545–549. [Google Scholar]
- Huang, P.; Kim, M.; Hasegawa-Johnson, M.; Smaragdis, P. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 2136–2147. [Google Scholar] [CrossRef]
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.; Woo, W. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. arXiv 2015, arXiv:1506.04214. [Google Scholar]
- Stoller, D.; Ewert, S.; Dixon, S. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. arXiv 2018, arXiv:1806.03185. [Google Scholar]
- Chen, J.; Wang, D. Long short-term memory for speaker generalization in supervised speech separation. J. Acoust. Soc. Am. 2017, 141, 4705–4714. [Google Scholar] [CrossRef] [PubMed]
- Luo, Y.; Mesgarani, N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation. arXiv 2017, arXiv:1711.00541. [Google Scholar]
- Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. arXiv 2018, arXiv:1809.07454. [Google Scholar]
- Arango-Sánchez, J.A.; Arias-Londoño, J.D. An enhanced Conv-TasNet model for speech separation using a speaker distance-based loss function. arXiv 2022, arXiv:2205.13657. [Google Scholar]
- Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Lip reading sentences in the wild. arXiv 2016, arXiv:1611.05358. [Google Scholar]
- Partan, S.; Marler, P. Communication Goes Multimodal. Science 1999, 283, 1272–1273. [Google Scholar] [CrossRef]
- Golumbic, E.Z.; Cogan, G.B.; Schroeder, C.E.; Poeppel, D. Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party”. J. Neurosci. 2013, 33, 1417–1426. [Google Scholar] [CrossRef]
- Mesgarani, N.; Chang, E.F. Selective cortical representation of attended speaker in multi-talker speech perception. Nature 2012, 485, 233–236. [Google Scholar] [CrossRef]
- Ephrat, A.; Mosseri, I.; Lang, O.; Dekel, T.; Wilson, K.; Hassidim, A.; Freeman, W.T.; Rubinstein, M. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Trans. Graph. 2018, 37, 1–11. [Google Scholar] [CrossRef]
- Liu, Y.; Wei, Y. Multi-Modal Speech Separation Based on Two-Stage Feature Fusion. In Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 22–24 October 2021; pp. 800–805. [Google Scholar]
- Wu, J.; Xu, Y.; Zhang, S.; Chen, L.; Yu, M.; Xie, L.; Yu, D. Time domain audio visual speech separation. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 667–673. [Google Scholar]
- Lu, R.; Duan, Z.; Zhang, C. Listen and Look: Audio-Visual Matching Assisted Speech Source Separation. IEEE Signal Process. Lett. 2018, 25, 1315–1319. [Google Scholar] [CrossRef]
- Ito, K.; Yamamoto, M.; Nagamatsu, K. Audio-visual speech enhancement method conditioned in the lip motion and speaker-discriminative embeddings. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
- Deng, Y.; Wei, Y. Vision-Guided Speaker Embedding Based Speech Separation. In Proceedings of the 2022 15th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 5 November 2022; pp. 1–6. [Google Scholar]
- Makishima, N.; Ihori, M.; Takashima, A.; Tanaka, T.; Orihashi, S.; Masumura, R. Audio Visual Speech Separation Using Cross-Modal Correspondence Loss. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
- Li, C.; Qian, Y. Deep audio-visual speech separation with attention mechanism. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar]
- Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted Res-UNet for High-quality Retina Vessel Segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education, Hangzhou, China, 19–21 October 2018; pp. 327–331. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. arXiv 2017, arXiv:1707.01083v2. [Google Scholar]
- Ma, N.; Zhang, X.; Zheng, H.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. arXiv 2018, arXiv:1807.11164. [Google Scholar]
- Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
- Wang, D.; Chen, J. Supervised speech separation based on deep learning: An overview. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1702–1726. [Google Scholar] [CrossRef]
- Roman, N.; Wang, D.; Brown, G.J. Speech segregation based on sound localization. J. Acoust. Soc. Am. 2003, 114, 2236–2252. [Google Scholar] [CrossRef] [PubMed]
- Srinivasan, S.; Roman, N.; Wang, D.L. Binary and ratio time-frequency masks for robust speech recognition. Speech Commun. 2006, 48, 1486–1501. [Google Scholar] [CrossRef]
- Williamson, D.S.; Wang, Y.; Wang, D. Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 483–492. [Google Scholar] [CrossRef]
- Mangalam, K.; Salzmann, M. On Compressing U-net Using Knowledge Distillation. arXiv 2018, arXiv:1812.00249. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2017, arXiv:1709.01507. [Google Scholar]
- Gao, R.; Grauman, K. Visualvoice: Audio-visual speech separation with cross-modal consistency. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 15490–15500. [Google Scholar]
- Martinez, B.; Ma, P.; Petridis, S.; Pantic, M. Lipreading using temporal convolutional networks. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
- Barron, J.T. A General and Adaptive Robust Loss Function. arXiv 2017, arXiv:1701.03077. [Google Scholar]
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. arXiv 2015, arXiv:1503.03832. [Google Scholar]
- Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. arXiv 2018, arXiv:1806.05622. [Google Scholar]
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, 7–11 May 2001; pp. 749–752. [Google Scholar]
Method | SDR (dB) | PESQ | STOI
---|---|---|---
No attention | 8.47 | 2.51 | 0.78
Add SE | 8.45 | 2.52 | 0.78
Add CBAM | 8.96 | 2.53 | 0.82
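For readers unfamiliar with the attention variants compared above, the sketch below shows a minimal CBAM-style block (channel attention followed by spatial attention, after Woo et al., arXiv:1807.06521) in PyTorch. The reduction ratio, kernel size, and feature shape are illustrative assumptions, not the configuration used in this paper.

```python
# Minimal CBAM-style attention sketch; hyperparameters and shapes are assumptions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Pool over the spatial dimensions; share one MLP for avg- and max-pooled features.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * scale


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Concatenate channel-wise average and max maps, then convolve to one spatial mask.
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))


class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        # Channel attention first, then spatial attention, as in the CBAM paper.
        return self.spatial_att(self.channel_att(x))


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)  # (batch, channels, freq, time); illustrative shape
    print(CBAM(64)(feats).shape)        # torch.Size([2, 64, 32, 32])
```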
Method | SDR (dB) | PESQ | STOI
---|---|---|---
No residual blocks | 8.96 | 2.53 | 0.82
Two residual blocks | 9.41 | 2.56 | 0.82
Six residual blocks | 10.1 | 2.60 | 0.85
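The second ablation varies how many residual units are stacked (none, two, or six). As a rough illustration only, the sketch below shows one plausible residual convolution block and how several of them might be stacked; the channel count, kernel sizes, and normalization are assumptions rather than the authors' exact design.

```python
# Illustrative residual convolution block; not the paper's exact architecture.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity shortcut: the block learns a residual that is added to its input.
        return self.relu(x + self.body(x))


if __name__ == "__main__":
    stack = nn.Sequential(*[ResidualBlock(64) for _ in range(6)])  # e.g. the six-block variant
    print(stack(torch.randn(1, 64, 32, 32)).shape)                 # torch.Size([1, 64, 32, 32])
```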