Research article · Open access
DOI: 10.1145/3643832.3661890

Enabling Hands-Free Voice Assistant Activation on Earphones

Published: 04 June 2024

Abstract

We present the design and implementation of EarVoice, a lightweight mobile service that enables hands-free voice assistant activation on commodity earphones. EarVoice comprises two design modules: one for joint speech detection and primary user identification, which exploits the distinct attributes of the over-the-air channel and the in-body audio pathway to differentiate the primary user from others nearby; and another for accurate wakeup word enhancement, which employs a "copy, paste, and adapt" approach to reconstruct the missing high-frequency components in speech recordings. To minimize false positives, enhance agility, and preserve privacy, we deploy EarVoice on a dongle, where the proposed signal processing algorithms are streamlined with a gating mechanism that permits only the primary user's speech to enter the paired device (e.g., a smartphone) for wakeup word recognition, preventing unintended disclosure of ambient conversations. We implemented the dongle on a 4-layer PCB and conducted extensive experiments with 23 participants in both controlled and uncontrolled scenarios. The results show that EarVoice achieves around 90% wakeup word recognition accuracy in stationary scenarios, on par with the high-end, multi-sensor-fusion-based AirPods Pro earbuds. EarVoice's accuracy drops to 84% in mobile scenarios, slightly below the AirPods Pro (around 90%).
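To make the abstract's pipeline concrete, below is a minimal, hypothetical Python sketch of the two stages it names: a gate that passes audio only when the in-ear (body-conducted) channel is energetic and correlated with the air microphone, followed by a crude spectral band-replication step standing in for the "copy, paste, and adapt" enhancement. The sample rate, thresholds, filter choices, and replication rule are all assumptions for illustration — this is not EarVoice's actual algorithm, which runs on the dongle's own signal path.

```python
import numpy as np
from scipy.signal import butter, sosfilt, stft, istft

FS = 16_000  # assumed sample rate; not specified in the abstract

def primary_user_gate(air, inear, frame=512, corr_thresh=0.5, energy_thresh=1e-4):
    """Pass audio frames only when the in-ear (body-conducted) channel is both
    energetic and correlated with the air microphone -- a crude proxy for
    'the wearer is speaking'. All thresholds are illustrative."""
    # Body conduction carries mostly low frequencies, so compare low bands.
    sos = butter(4, 2000, btype="low", fs=FS, output="sos")
    air_lp = sosfilt(sos, air)
    inear_lp = sosfilt(sos, inear)
    gated = np.zeros_like(air)
    for start in range(0, len(air) - frame + 1, frame):
        a = air_lp[start:start + frame]
        b = inear_lp[start:start + frame]
        if np.mean(b ** 2) > energy_thresh and np.corrcoef(a, b)[0, 1] > corr_thresh:
            gated[start:start + frame] = air[start:start + frame]
    return gated

def copy_paste_adapt(x, cutoff_hz=2000, gain=0.3):
    """Fill in missing high frequencies by copying the low band's spectrum
    upward in frequency and damping it -- a stand-in for the 'copy, paste,
    and adapt' idea, not the authors' actual enhancement algorithm."""
    f, _, Z = stft(x, fs=FS, nperseg=512)
    k = np.searchsorted(f, cutoff_hz)      # first bin at/above the cutoff
    hi = min(2 * k, len(f))                # paste one band's width above it
    Z[k:hi, :] = gain * Z[:hi - k, :]      # copy, paste, damp
    _, y = istft(Z, fs=FS, nperseg=512)
    return y[:len(x)]

# Toy usage with noise stand-ins for the two microphone channels.
rng = np.random.default_rng(0)
air = rng.standard_normal(FS)      # 1 s of "air microphone" signal
inear = rng.standard_normal(FS)    # 1 s of "in-ear microphone" signal
wakeup_candidate = copy_paste_adapt(primary_user_gate(air, inear))
```

In the system the abstract describes, gating of this kind runs on the dongle before any audio reaches the paired phone, which is what prevents bystanders' speech from ever leaving the device.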


Cited By

  • EchoGuide: Active Acoustic Guidance for LLM-Based Eating Event Analysis from Egocentric Videos. In Proceedings of the 2024 ACM International Symposium on Wearable Computers, 40-47. DOI: 10.1145/3675095.3676611. Published: 5 Oct 2024.


Published In

MOBISYS '24: Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services, June 2024, 778 pages.
ISBN: 9798400705816
DOI: 10.1145/3643832
This work is licensed under a Creative Commons Attribution 4.0 International License.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. voice activation
  2. bone conduction
  3. earable computing



Acceptance Rates

Overall acceptance rate: 274 of 1,679 submissions, 16%.

Article Metrics

  • Downloads (last 12 months): 818
  • Downloads (last 6 weeks): 170

Reflects downloads up to 13 January 2025.
