Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention
Abstract
1. Introduction
2. Methods
2.1. Dual-Input Temporal-Frequency Attention
2.2. Feature Space Attention
2.3. BGRU
3. Experiment
3.1. Environmental Sound Datasets
3.2. Evaluation Metrics
3.3. Experimental Configurations
4. Discussion
4.1. Comparison of Different Features
4.2. Comparison of Different Fusion Strategies
4.3. Comparison of Different Classifiers
4.4. Comparison of Different Methods
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Baumann, J.; Meyer, P.; Lohrenz, T.; Roy, A.; Papendieck, M.; Fingscheidt, T. A New DCASE 2017 Rare Sound Event Detection Benchmark under Equal Training Data: CRNN with Multi-Width Kernels. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–9 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 865–869.
- De Benito-Gorrón, D.; Ramos, D.; Toledano, D.T. A multi-resolution CRNN-based approach for semi-supervised sound event detection in DCASE 2020 challenge. IEEE Access 2021, 9, 89029–89042.
- Luo, L.; Zhang, L.; Wang, M.; Liu, Z.; Liu, X.; He, R.; Jin, Y. A System for the Detection of Polyphonic Sound on a University Campus Based on CapsNet-RNN. IEEE Access 2021, 9, 147900–147913.
- Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, M. Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Trans. Intell. Transp. Syst. 2015, 17, 279–288.
- Mnasri, Z.; Rovetta, S.; Masulli, F. Anomalous sound event detection: A survey of machine learning based methods and applications. Multimed. Tools Appl. 2022, 81, 5537–5586.
- Aljshamee, M.; Mousa, A.H.; Omran, A.A.; Ahmed, S. Sound Signal Control on Home Appliances Using Android Smart-Phone; AIP Publishing LLC: Melville, NY, USA, 2020; Volume 2290, p. 040023.
- Serizel, R.; Turpault, N.; Shah, A.; Salamon, J. Sound Event Detection in Synthetic Domestic Environments. In Proceedings of the ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 86–90.
- Chaudhary, M.; Prakash, V.; Kumari, N. Identification vehicle movement detection in forest area using MFCC and KNN. In Proceedings of the 2018 International Conference on System Modeling & Advancement in Research Trends (SMART), Moradabad, India, 23–24 November 2018; pp. 158–164.
- Florentin, J.; Dutoit, T.; Verlinden, O. Identification of European woodpecker species in audio recordings from their drumming rolls. Ecol. Inform. 2016, 35, 61–70.
- Guo, G.; Li, S.Z. Content-based audio classification and retrieval by support vector machines. IEEE Trans. Neural Netw. 2003, 14, 209–215.
- Heittola, T.; Mesaros, A.; Eronen, A.; Virtanen, T. Audio context recognition using audio event histograms. In Proceedings of the 2010 18th European Signal Processing Conference, Aalborg, Denmark, 23–27 August 2010; pp. 1272–1276.
- Degara, N.; Davies, M.E.; Pena, A.; Plumbley, M.D. Onset event decoding exploiting the rhythmic structure of polyphonic music. IEEE J. Sel. Top. Signal Process. 2011, 5, 1228–1239.
- Sidiropoulos, P.; Mezaris, V.; Kompatsiaris, I.; Meinedo, H.; Bugalho, M.; Trancoso, I. On the use of audio events for improving video scene segmentation. In Analysis, Retrieval and Delivery of Multimedia Content; Springer: Berlin/Heidelberg, Germany, 2013; pp. 3–19.
- Liu, Y.; Tang, J.; Song, Y.; Dai, L. A capsule based approach for polyphonic sound event detection. In Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 12–15 November 2018; pp. 1853–1857.
- Parascandolo, G.; Huttunen, H.; Virtanen, T. Recurrent neural networks for polyphonic sound event detection in real life recordings. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6440–6444.
- Jeong, I.-Y.; Lee, S.; Han, Y.; Lee, K. Audio Event Detection Using Multiple-Input Convolutional Neural Network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017, Munich, Germany, 16 November 2017; pp. 51–54.
- Adavanne, S.; Virtanen, T. A report on sound event detection with different binaural features. arXiv 2017, arXiv:1710.02997.
- Dinkel, H.; Yu, K. Duration robust weakly supervised sound event detection. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 311–315.
- Imoto, K.; Mishima, S.; Arai, Y.; Kondo, R. Impact of sound duration and inactive frames on sound event detection performance. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–7 June 2021; pp. 860–864.
- Lim, H.; Park, J.-S.; Han, Y. Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017, Munich, Germany, 16 November 2017; pp. 80–84.
- Salamon, J.; Bello, J.P. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 2017, 24, 279–283.
- Zhang, X.; Zou, Y.; Shi, W. Dilated convolution neural network with LeakyReLU for environmental sound classification. In Proceedings of the 2017 22nd International Conference on Digital Signal Processing (DSP), London, UK, 23–25 August 2017; pp. 1–5.
- Phan, H.; Hertel, L.; Maass, M.; Mertins, A. Robust audio event recognition with 1-max pooling convolutional neural networks. arXiv 2016, arXiv:1604.06338.
- Cakır, E.; Virtanen, T. Convolutional recurrent neural networks for rare sound event detection. Deep Neural Netw. Sound Event Detect. 2019, 12, 141–145.
- Luo, Y.; Chen, Z.; Yoshioka, T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 46–50.
- Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. arXiv 2017, arXiv:1704.06904.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
- Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the NAACL-HLT 2016, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
- Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. Adv. Neural Inf. Process. Syst. 2015, 28.
- Chiu, C.-C.; Sainath, T.N.; Wu, Y.; Prabhavalkar, R.; Nguyen, P.; Chen, Z.; Kannan, A.; Weiss, R.J.; Rao, K.; Gonina, E. State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4774–4778.
- Zhang, Z.; Xu, S.; Zhang, S.; Qiao, T.; Cao, S. Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing 2021, 453, 896–903.
- Mu, W.; Yin, B.; Huang, X.; Xu, J.; Du, Z. Environmental sound classification using temporal-frequency attention based convolutional neural network. Sci. Rep. 2021, 11, 21552.
- Li, D.; Xu, J.; Wang, J.; Fang, X.; Ji, Y. A multi-scale fusion convolutional neural network based on attention mechanism for the visualization analysis of EEG signals decoding. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 28, 2615–2626.
- Tang, X.; Meng, F.; Zhang, X.; Cheung, Y.-M.; Ma, J.; Liu, F.; Jiao, L. Hyperspectral image classification based on 3-D octave convolution with spatial–spectral attention network. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2430–2447.
- Xia, X.; Pan, J.; Wang, Y. Audio Sound Determination Using Feature Space Attention Based Convolution Recurrent Neural Network. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3382–3386.
- Tang, P.; Du, P.; Xia, J.; Zhang, P.; Zhang, W. Channel attention-based temporal convolutional network for satellite image time series classification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
- Shen, Y.-H.; He, K.-X.; Zhang, W.-Q. Learning how to listen: A temporal-frequential attention model for sound event detection. arXiv 2018, arXiv:1810.11939.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Li, X.; Chebiyyam, V.; Kirchhoff, K. Multi-stream network with temporal attention for environmental sound classification. arXiv 2019, arXiv:1901.08608.
- Mesaros, A.; Heittola, T.; Virtanen, T. Metrics for polyphonic sound event detection. Appl. Sci. 2016, 6, 162.
- Poliner, G.E.; Ellis, D.P. A discriminative model for polyphonic piano transcription. EURASIP J. Adv. Signal Process. 2006, 2007, 1–9.
- Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August–2 September 2016; pp. 1128–1132.
- Cakır, E.; Parascandolo, G.; Heittola, T.; Huttunen, H.; Virtanen, T. Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1291–1303.
- Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models; Atlanta, GA, USA, 2013; Volume 30, p. 3. Available online: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.693.1422&rep=rep1&type=pdf (accessed on 2 September 2022).
- Jin, W.; Liu, J.; Feng, M.; Ren, J. Polyphonic Sound Event Detection Using Capsule Neural Network on Multi-Type-Multi-Scale Time-Frequency Representation; IEEE: Piscataway, NJ, USA, 2022; pp. 146–150.
- Ding, W.; He, L. Adaptive multi-scale detection of acoustic events. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 28, 294–306.
- Meng, J.; Wang, X.; Wang, J.; Teng, X.; Xu, Y. A capsule network with pixel-based attention and BGRU for sound event detection. Digit. Signal Process. 2022, 123, 103434.
- Wang, M.; Yao, Y.; Qiu, H.; Song, X. Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection. Symmetry 2022, 14, 366.
Event Label (DCASE 2016 Task3, Residential Area) | Instances | Event Label (DCASE 2016 Task3, Home) | Instances | Event Label (DCASE 2017 Task3, Street) | Instances |
---|---|---|---|---|---|
(object) banging | 23 | (object) rustling | 60 | brakes squeaking | 24 |
bird singing | 271 | (object) snapping | 57 | car | 110 |
car passing by | 108 | cupboard | 40 | children | 19 |
children shouting | 31 | cutlery | 76 | large vehicle | 24 |
people speaking | 52 | dishes | 151 | people speaking | 47 |
people walking | 44 | drawer | 51 | people walking | 48 |
wind blowing | 30 | glass jingling | 36 | | |
| | | object impact | 250 | | |
| | | people walking | 54 | | |
| | | washing dishes | 84 | | |
| | | water tap running | 47 | | |
Layer Type | Configurations |
---|---|
Output | output shape = (256,6) |
Recurrent | hidden unit number = 32 |
Recurrent | hidden unit number = 32 |
Merge | mode = 'mul' |
Repeat and Reshape | output shape = (256,128,2) |
Softmax activation | - |
Feedforward | hidden unit number = 256 |
Reshape | output shape = (256,256) |
Permute | output shape = (256,128,2) |
Max pooling | sub-sampling rate = 2 |
ReLU activation | - |
Convolution | filter number, kernel size = 128, (3,3) |
Max pooling | sub-sampling rate = 2 |
ReLU activation | - |
Convolution | filter number, kernel size = 128, (3,3) |
Max pooling | sub-sampling rate = 5 |
ReLU activation | - |
Convolution | filter number, kernel size = 128, (3,3) |
Merge | mode = 'TF-Attention' |
Multiply on the T/F direction | mode = 'T-Attention' and 'F-Attention' |
Softmax activation | - |
Convolution | filter number, kernel size = 1, (1,1) |
ReLU activation | - |
Convolution | filter number, kernel size = 32, F(1,3) × 254/T(2,1) × 39 |
Input | Input shape = (256,40) |
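The feature space attention rows of the configuration table (Permute, Reshape, Feedforward, Softmax, Repeat and Reshape, Merge with mode 'mul') can be read as a per-frame reweighting of the convolutional features. The following is a minimal NumPy sketch of that reading, not the authors' implementation: the function name `feature_space_attention`, the random `weight` and `bias`, the softmax axis, and the assumption that pooling acts only on the 40-band frequency axis are all illustrative choices that the table does not confirm.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def feature_space_attention(features, weight, bias):
    """Reweight a (frames, channels, freq) feature map frame by frame.

    Hypothetical helper: shapes follow the table, weights are placeholders.
    """
    t, c, f = features.shape
    flat = features.reshape(t, c * f)    # Reshape: (256, 256)
    scores = flat @ weight + bias        # Feedforward with 256 hidden units
    attn = softmax(scores, axis=-1)      # Softmax over the feature axis (assumed)
    attn = attn.reshape(t, c, f)         # back to (256, 128, 2)
    return features * attn               # Merge, mode = 'mul'


# Assumption: the three max-pooling layers (rates 5, 2, 2) shrink only the
# frequency axis of the (256, 40) input, leaving 40 / 5 / 2 / 2 = 2 bins.
freq_bins = 40 // 5 // 2 // 2

rng = np.random.default_rng(0)
features = rng.standard_normal((256, 128, freq_bins))
weight = rng.standard_normal((256, 256)) * 0.01   # placeholder, untrained
bias = np.zeros(256)

attended = feature_space_attention(features, weight, bias)
print(freq_bins, attended.shape)
```

Under these assumptions the attended map keeps the (256,128,2) shape listed in the table, so it can be passed to the two recurrent layers without further reshaping.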
Feature (DCASE 2016 Task3) | F1 | ER | Feature (DCASE 2017 Task3) | F1 | ER |
---|---|---|---|---|---|
Log-Mel | 58.1% | 0.42 | Log-Mel | 63.2% | 0.51 |
MFCCs | 57.8% | 0.45 | MFCCs | 58.5% | 0.55 |
STFT | 50.9% | 0.52 | STFT | 55.1% | 0.60 |
FMS | 58.5% | 0.43 | FMS | 63.7% | 0.52 |
FLS | 59.0% | 0.46 | FLS | 63.2% | 0.57 |
FLM | 60.2% | 0.40 | FLM | 66.9% | 0.42 |
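The F1 and ER columns in the result tables are most naturally read as the segment-based F-score and error rate defined in the cited "Metrics for polyphonic sound event detection"; that reading is an assumption here, since the metric definitions are not part of this extract. A minimal sketch of those two quantities on binary activity matrices follows, with a hypothetical function name and a toy input that is not taken from the paper's data.

```python
import numpy as np


def segment_based_scores(reference, estimate):
    """Return (F1, ER) for binary arrays of shape (segments, classes)."""
    ref = np.asarray(reference, dtype=bool)
    est = np.asarray(estimate, dtype=bool)

    # Counts accumulated over all segments and classes.
    tp = np.sum(ref & est)
    fp_per_seg = np.sum(~ref & est, axis=1)
    fn_per_seg = np.sum(ref & ~est, axis=1)
    fp, fn = fp_per_seg.sum(), fn_per_seg.sum()

    # Micro-averaged F-score (undefined if there are no events at all).
    f1 = 2 * tp / (2 * tp + fp + fn)

    # Per segment: a false positive paired with a false negative counts as
    # one substitution; what remains is a deletion or an insertion.
    subs = np.minimum(fp_per_seg, fn_per_seg).sum()
    dels = np.maximum(0, fn_per_seg - fp_per_seg).sum()
    ins = np.maximum(0, fp_per_seg - fn_per_seg).sum()

    # Error rate is normalized by the number of active reference events.
    er = (subs + dels + ins) / ref.sum()
    return float(f1), float(er)


# Toy example: 3 segments, 3 event classes.
reference = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
estimate = [[1, 0, 0], [1, 0, 1], [0, 0, 0]]
f1, er = segment_based_scores(reference, estimate)
print(round(f1, 3), er)
```

In the toy example the second segment contributes one substitution and the third one deletion, against four reference events. Note that ER can exceed 1 when a system inserts many false events, which is why lower ER and higher F1 are reported side by side.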
Strategy (DCASE 2016 Task3) | F1 | ER | Strategy (DCASE 2017 Task3) | F1 | ER |
---|---|---|---|---|---|
LM-T&F-Average | 58.6% | 0.42 | LM-T&F-Average | 61.4% | 0.46 |
LM-T&F-Weight | 60.2% | 0.40 | LM-T&F-Weight | 66.9% | 0.42 |
LM-T&F-Channel | 55.3% | 0.47 | LM-T&F-Channel | 58.9% | 0.55 |
L-T&F-Average | 52.6% | 0.49 | L-T&F-Average | 57.4% | 0.45 |
L-T&F-Weight | 55.1% | 0.41 | L-T&F-Weight | 59.1% | 0.46 |
L-T&F-Channel | 50.3% | 0.48 | L-T&F-Channel | 56.9% | 0.50 |
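The strategy names in this table suggest three ways of merging the temporal and frequency attention branches: averaging, weighted summation, and stacking along the channel axis. The text that defines them is not present in this extract, so the sketch below is only one plausible reading. The function names, the scalar weights `alpha` and `beta` (which would presumably be learned in a real model), the branch shape (256, 40, 1), and the concatenation axis are all assumptions.

```python
import numpy as np


def fuse_average(t_branch, f_branch):
    # "Average": equal contribution from both attention branches.
    return 0.5 * (t_branch + f_branch)


def fuse_weighted(t_branch, f_branch, alpha, beta):
    # "Weight": weighted sum; alpha and beta are hypothetical scalars here.
    return alpha * t_branch + beta * f_branch


def fuse_channel(t_branch, f_branch):
    # "Channel": keep both branches and stack them on the last axis.
    return np.concatenate([t_branch, f_branch], axis=-1)


rng = np.random.default_rng(0)
t_branch = rng.random((256, 40, 1))   # temporal-attention output (assumed shape)
f_branch = rng.random((256, 40, 1))   # frequency-attention output (assumed shape)

avg = fuse_average(t_branch, f_branch)
weighted = fuse_weighted(t_branch, f_branch, alpha=0.7, beta=0.3)
stacked = fuse_channel(t_branch, f_branch)
print(avg.shape, weighted.shape, stacked.shape)
```

Averaging and weighted summation keep the channel count unchanged, whereas channel stacking doubles it and leaves the mixing to the following convolution.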
Classifier (DCASE 2016 Task3) | F1 | ER | Classifier (DCASE 2017 Task3) | F1 | ER |
---|---|---|---|---|---|
DNN | 36.0% | 0.76 | DNN | 41.3% | 0.90 |
CNN | 34.8% | 0.52 | CNN | 43.4% | 0.77 |
RNN | 32.7% | 0.65 | RNN | 38.6% | 0.81 |
CRNN | 38.3% | 0.56 | CRNN | 43.7% | 0.75 |
FS-CRNN | 41.6% | 0.49 | FS-CRNN | 48.2% | 0.57 |
TF-CRNN | 55.1% | 0.41 | TF-CRNN | 59.1% | 0.46 |
TFFS-CRNN | 60.2% | 0.40 | TFFS-CRNN | 66.9% | 0.42 |
Method (DCASE 2016 Task3) | F1 | ER | Method (DCASE 2017 Task3) | F1 | ER |
---|---|---|---|---|---|
MFCCs+GMM [42] * | 34.3% | 0.88 | Log-Mel+RNN [15] | 39.6% | 0.83 |
Binaural Mel energy+RNN [45] ** | 47.8% | 0.81 | Log-Mel+CNN [16] | 40.8% | 0.81 |
MFCCs+CNN [44] | 59.8% | 0.56 | Log-Mel+CRNN [17] ** | 41.7% | 0.79 |
Log-Mel+CapsNet [45] | - | 0.62 | Binaural Mel energy+CRNN(BGRU) [47] | 57.6% | 0.62 |
Log-Mel+CRNN [46] | 48.7% | 0.78 | Log-Mel+CRNN [48] | 49.6% | 0.68 |
FLM+TFFS-CRNN | 60.2% | 0.40 | FLM+TFFS-CRNN | 66.9% | 0.42 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jin, Y.; Wang, M.; Luo, L.; Zhao, D.; Liu, Z. Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention. Sensors 2022, 22, 6818. https://doi.org/10.3390/s22186818