DOI: 10.1145/3485730.3485945

Wavoice: A Noise-resistant Multi-modal Speech Recognition System Fusing mmWave and Audio Signals

Published: 15 November 2021

Abstract

With advances in automatic speech recognition, voice user interfaces (VUIs) have gained popularity in recent years. Since the COVID-19 pandemic, VUIs have been increasingly preferred for online communication because they are non-contact. However, ambient noise impedes the public deployment of VUIs, since audio-only speech recognition methods require a high signal-to-noise ratio. In this paper, we present Wavoice, the first noise-resistant multi-modal speech recognition system that fuses two distinct voice sensing modalities: millimeter-wave (mmWave) signals and audio signals from a microphone. One key contribution is that we model the inherent correlation between mmWave and audio signals. Building on this correlation, Wavoice enables real-time noise-resistant voice activity detection and targeting of a specific user among multiple speakers. Furthermore, we integrate two novel modules into a neural attention mechanism for multi-modal signal fusion, yielding accurate speech recognition. Extensive experiments verify Wavoice's effectiveness under various conditions, with a character error rate below 1% at distances of up to 7 meters. Wavoice outperforms existing audio-only speech recognition methods, achieving both lower character error rate (CER) and lower word error rate (WER). Evaluation in complex scenes validates Wavoice's robustness.
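
To make the fusion idea concrete, the following PyTorch sketch illustrates one plausible form of attention-based multi-modal fusion, in which audio frames attend to mmWave-derived vibration features so that noise-corrupted acoustic frames can be re-weighted. This is a hypothetical sketch under assumed shapes and layer choices, not the authors' implementation; the module name `CrossModalFusion`, the 256-dimensional features, and the single cross-attention layer are all assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical cross-attention fusion of audio and mmWave features."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio frames act as queries; mmWave frames supply keys/values,
        # so vibration cues decide how much to trust each acoustic frame.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(2 * dim, dim)  # merge the two streams channel-wise

    def forward(self, audio: torch.Tensor, mmwave: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, dim); mmwave: (batch, T_mmwave, dim)
        attended, _ = self.cross_attn(query=audio, key=mmwave, value=mmwave)
        fused = self.proj(torch.cat([audio, self.norm(attended)], dim=-1))
        return fused  # (batch, T_audio, dim), e.g. input to an ASR decoder

# Dummy features: 100 audio frames and 80 mmWave frames per utterance.
fusion = CrossModalFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 80, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```

The character error rate reported in the evaluation is conventionally computed as the Levenshtein edit distance between the recognized and reference transcripts divided by the reference length; a CER below 1% therefore means fewer than one character edit per hundred reference characters. A minimal, self-contained implementation of this standard metric:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    # Dynamic programming over one row of the edit-distance table at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(cer("turn on the light", "turn of the light"))  # 1 edit / 17 chars ~ 0.059
```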

    Published In

    SenSys '21: Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems
    November 2021
    686 pages
    ISBN:9781450390972
    DOI:10.1145/3485730
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 November 2021

    Author Tags

    1. Speech recognition
    2. mmWave sensing
    3. multimodal fusion
    4. voice user interface

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Acceptance Rates

    SenSys '21 paper acceptance rate: 25 of 139 submissions (18%)
    Overall acceptance rate: 174 of 867 submissions (20%)

    Article Metrics

    • Downloads (last 12 months): 283
    • Downloads (last 6 weeks): 25
    Reflects downloads up to 13 Jan 2025
