Anti-Spoofing Voice Commands: A Generic Wireless Assisted Design

Published: 14 September 2021

Abstract

This paper presents an anti-spoofing design that verifies whether a voice command is spoken by a live, legal user. The design supplements existing speech recognition systems and could enable new applications in which crucial voice commands require a higher standard of verification. In the literature, the liveness and the legality of a command's speaker have been studied separately. However, prior solutions cannot be combined directly to accept a voice command only from a live legal user, for two reasons. First, previous methods introduce various sensing channels for liveness detection, but the security of the sensing channel itself cannot be guaranteed. Second, a direct combination remains vulnerable when an attacker plays a recorded voice command from the legal user while simultaneously mimicking that user speaking the command. In this paper, we introduce an anti-spoofing sensing channel to realize this design. More importantly, our design provides a generic interface for forming the sensing channel, which is compatible with a variety of widely used signals, including RFID, Wi-Fi, and acoustic signals. This offers the flexibility to balance system cost against verification requirements. We develop a prototype system in three versions, one for each of these sensing signals, and conduct extensive experiments in six real-world environments under a variety of settings to examine the effectiveness of our design.

      Published In

      Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies  Volume 5, Issue 3
      Sept 2021
      1443 pages
      EISSN:2474-9567
      DOI:10.1145/3486621
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Voice commands
      2. speaker verification
      3. wireless sensing

      Qualifiers

      • Research-article
      • Research
      • Refereed
