
TimbreSense: Timbre Abnormality Detection for Bel Canto with Smart Devices


Abstract

With the rise of mobile devices, bel canto practitioners increasingly use smart devices as auxiliary tools for improving their singing skills. However, they frequently encounter timbre abnormalities during practice which, if left unaddressed, can harm their vocal organs. Existing singing assessment systems focus primarily on pitch and melody and lack real-time detection of bel canto timbre abnormalities. Moreover, the diverse vocal habits and timbre compositions among individuals make cross-user recognition of such abnormalities particularly challenging. To address these limitations, we propose TimbreSense, a novel bel canto timbre abnormality detection system. TimbreSense detects, in real time, the five major timbre abnormalities commonly observed in bel canto singing. We introduce an effective feature extraction pipeline that captures the acoustic characteristics of bel canto singing: by applying temporal average pooling to the Short-Time Fourier Transform (STFT) spectrogram, we reduce redundancy while preserving essential frequency-domain information. The system then leverages a transformer model with self-attention to extract correlation and semantic features of overtones in the frequency domain. Additionally, we employ a few-shot learning approach comprising pre-training, meta-learning, and fine-tuning to enhance cross-domain recognition performance while minimizing the effort required of users. Experimental results demonstrate the system's strong cross-user recognition performance and real-time capability.
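As a rough illustration of the feature pipeline described in the abstract, the sketch below applies an STFT, averages the spectrogram over time (temporal average pooling), and feeds the pooled frequency bins through a transformer encoder with self-attention before a five-class head. It is a minimal sketch assuming PyTorch and torchaudio; the class name `TimbreEncoder`, the layer sizes, window length, sample rate, and tokenization of individual frequency bins are illustrative assumptions, not the paper's exact configuration, and the meta-learning and fine-tuning stages are not shown.

```python
# Minimal sketch: STFT -> temporal average pooling -> transformer encoder -> classifier.
# All hyperparameters below are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torchaudio

class TimbreEncoder(nn.Module):
    def __init__(self, n_fft=1024, hop_length=256, d_model=128, n_classes=5):
        super().__init__()
        self.spec = torchaudio.transforms.Spectrogram(
            n_fft=n_fft, hop_length=hop_length, power=2.0)
        n_bins = n_fft // 2 + 1                      # number of STFT frequency bins
        self.proj = nn.Linear(1, d_model)            # embed each pooled bin value as a token
        self.pos = nn.Parameter(torch.zeros(n_bins, d_model))  # learned positional embedding
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, wav):                          # wav: (batch, samples)
        spec = self.spec(wav)                        # (batch, n_bins, frames)
        pooled = spec.mean(dim=-1)                   # temporal average pooling over frames
        tokens = self.proj(pooled.unsqueeze(-1))     # one token per frequency bin
        tokens = tokens + self.pos
        feats = self.encoder(tokens)                 # self-attention over frequency bins
        return self.head(feats.mean(dim=1))          # logits for the 5 abnormality classes

model = TimbreEncoder()
logits = model(torch.randn(2, 16000))                # two 1-second clips at an assumed 16 kHz
print(logits.shape)                                   # torch.Size([2, 5])
```

Pooling over time before the transformer keeps only a frequency-domain summary of each clip, which is what lets the attention layers model relationships among overtone bins rather than among time frames.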



Published In

ACM Transactions on Sensor Networks, Just Accepted
EISSN: 1550-4867
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Publication History

Online AM: 17 December 2024
Accepted: 06 December 2024
Revised: 28 September 2024
Received: 02 May 2024


Author Tags

  1. Bel canto
  2. Timbre abnormality
  3. Few-shot learning
  4. Meta learning

Qualifiers

  • Research-article

