DNN controlled adaptive front-end for replay attack detection systems

Published: 01 October 2023

Highlights

· Conventional methods fall short in detecting replay spoofing attacks effectively.
· Auditory-based dynamic filters can detect artefacts in high-quality replayed signals.
· Deep neural networks can adaptively learn filter traits based on the application.
· Auditory concepts and deep learning combined may generalize replay countermeasures.

Abstract

Developing robust countermeasures to protect automatic speaker verification systems against replay spoofing attacks is a well-recognized challenge. Current approaches to spoofing detection are generally based on a fixed front-end, typically a time-invariant filter bank, followed by a machine learning back-end. In this paper, we propose a novel approach in which the front-end comprises an adaptive filter bank with a deep neural network-based controller that is jointly trained with a neural network back-end. Specifically, the controller tunes the selectivity and sensitivity of the front-end filter bank at every frame to capture replay-related artefacts. We evaluate the proposed framework on a synthesized dataset and on the ASVspoof 2019 and ASVspoof 2021 challenge datasets. Compared with conventional non-adaptive front-ends, we demonstrate its effectiveness in terms of equal error rate and of its ability to capture artefacts that differentiate replayed signals from genuine ones.
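To make the idea of a controller-tuned front-end concrete, the following is a minimal sketch and not the authors' implementation. It assumes Gaussian-shaped bands applied to the frame power spectrum, with selectivity modelled as a quality factor Q (bandwidth = centre frequency / Q) and sensitivity as a per-channel gain. `ToyController`, `adaptive_filterbank`, the frame sizes, and the Q and gain ranges are all hypothetical choices made for illustration. The controller uses random, untrained weights, whereas the paper trains the controller jointly with the back-end.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


class ToyController:
    """Hypothetical stand-in for the DNN controller (assumption, untrained).

    Maps one frame's log power spectrum to a per-channel selectivity (Q)
    and sensitivity (gain) through a single random hidden layer.
    """

    def __init__(self, n_bins, n_channels, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(n_bins, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, 2 * n_channels))
        self.n_channels = n_channels

    def __call__(self, log_spec):
        out = np.tanh(log_spec @ self.w1) @ self.w2
        # Squash to assumed ranges: Q in (1, 10), gain in (1/e, e).
        q = 1.0 + 9.0 / (1.0 + np.exp(-out[: self.n_channels]))
        gain = np.exp(np.tanh(out[self.n_channels:]))
        return q, gain


def adaptive_filterbank(x, sr, centres_hz, controller, frame_len=400, hop=160):
    """Return log band energies of shape (n_frames, n_channels).

    The band shapes are recomputed for every frame from the controller
    output, so the filter bank is time-varying.
    """
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    feats = np.empty((power.shape[0], len(centres_hz)))
    for t, p in enumerate(power):
        q, gain = controller(np.log(p + 1e-10))
        bandwidth = centres_hz / q  # higher Q gives a narrower band
        weights = np.exp(
            -0.5 * ((freqs[None, :] - centres_hz[:, None]) / bandwidth[:, None]) ** 2
        )
        feats[t] = np.log(gain * (weights @ p) + 1e-10)
    return feats


# Usage on one second of synthetic noise at 16 kHz.
sr = 16000
x = np.random.default_rng(1).normal(size=sr)
centres = np.linspace(200.0, 7000.0, 20)
ctrl = ToyController(n_bins=400 // 2 + 1, n_channels=20)
feats = adaptive_filterbank(x, sr, centres, ctrl)
print(feats.shape)  # (98, 20)
```

In a trainable version, the same computation would be written in an automatic differentiation framework so that the back-end's loss can update the controller weights. A fixed front-end corresponds to holding `q` and `gain` constant across frames.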

Published In

Speech Communication, Volume 154, Issue C
Oct 2023
97 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Adaptive feature extraction
  2. Anti-spoofing
  3. Auditory system modeling
  4. Deep learning
  5. Automatic speaker verification

Qualifiers

  • Research-article
