In this paper, we evaluate two deep learning models that integrate convolutional and recurrent neural networks, implementing both sequential and parallel architectures for fine-grained musical subgenre classification. Because of the exceptionally low signal-to-noise ratio (SNR) of our low-level mel-spectrogram dataset, more sensitive yet robust learning models are required to produce meaningful results. We investigate the effects of three commonly used optimizers, dropout, batch normalization, and sensitivity to varying initialization distributions. The results demonstrate that the sequential model specifically requires the RMSprop optimizer, while the parallel model implemented with the Adam optimizer yielded encouraging and stable results, achieving an average F1 score of $0.63$. When all factors are considered, the optimized hybrid parallel model outperformed the sequential model in both classification accuracy and system stability.
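As a concrete illustration of the two architectures compared above, the following Keras sketch contrasts a sequential CRNN (convolutional blocks feeding a GRU) with a parallel CNN-RNN (convolutional and recurrent branches merged before the classifier). The input shape, layer widths, and number of subgenre classes are illustrative assumptions, not the configuration reported in the paper.

```python
# Hedged sketch of the two hybrid architectures; shapes and layer sizes are
# assumptions for illustration, not the authors' exact configuration.
from tensorflow.keras import layers, models

INPUT_SHAPE = (128, 640, 1)   # (mel bands, time frames, channels) -- assumed
NUM_CLASSES = 10              # assumed number of subgenres

def sequential_crnn():
    """CNN feature extractor feeding a GRU in series (CRNN)."""
    inp = layers.Input(shape=INPUT_SHAPE)
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Permute((2, 1, 3))(x)                      # put time axis first
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.GRU(64)(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inp, out)

def parallel_cnn_rnn():
    """Convolutional and recurrent branches run in parallel, then merged."""
    inp = layers.Input(shape=INPUT_SHAPE)
    # Convolutional branch.
    c = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inp)
    c = layers.MaxPooling2D((4, 4))(c)
    c = layers.Flatten()(c)
    # Recurrent branch operating directly on the spectrogram frames.
    r = layers.Permute((2, 1, 3))(inp)                    # (time, mel, channel)
    r = layers.Reshape((INPUT_SHAPE[1], INPUT_SHAPE[0]))(r)
    r = layers.GRU(64)(r)
    merged = layers.concatenate([c, r])
    out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)
    return models.Model(inp, out)
```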
Table 1. F1 scores for optimizer evaluation
| Optimizer | CNN | CRNN | CNN-RNN |
|---|---|---|---|
| Adam | 0.45 | 0.32 | 0.63 |
| Adadelta | 0.30 | 0.31 | 0.35 |
| RMSprop | 0.41 | 0.54 | 0.60 |
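The comparison in Table 1 can be reproduced, at a sketch level, by training each model once per optimizer under otherwise identical settings and reporting the macro-averaged F1 score. The epoch count, batch size, and default learning rates below are assumptions; the paper's exact training schedule may differ.

```python
# Sketch of the optimizer comparison; hyperparameters are assumed defaults.
import tensorflow as tf
from sklearn.metrics import f1_score

OPTIMIZERS = {
    "Adam": tf.keras.optimizers.Adam,
    "Adadelta": tf.keras.optimizers.Adadelta,
    "RMSprop": tf.keras.optimizers.RMSprop,
}

def compare_optimizers(build_model, x_train, y_train, x_test, y_test):
    """Train a fresh model per optimizer and return macro F1 for each."""
    scores = {}
    for name, opt_cls in OPTIMIZERS.items():
        model = build_model()                     # fresh weights each run
        model.compile(optimizer=opt_cls(),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train, y_train, epochs=50, batch_size=32, verbose=0)
        y_pred = model.predict(x_test, verbose=0).argmax(axis=1)
        scores[name] = f1_score(y_test, y_pred, average="macro")
    return scores
```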
Table 2. Optimal classification accuracy
|  | CNN | CRNN | CNN-RNN |
|---|---|---|---|
| Optimizer | Adam | RMSprop | Adam |
| Accuracy | 0.31 | 0.57 | 0.64 |
Table 3. Macro F1 scores for effect of regularization
| Model | Data | Dropout | Batch Normalization | Dropout + Batch Normalization |
|---|---|---|---|---|
| CRNN | Train | 0.67 | 1.00 | 0.98 |
|  | Validation | 0.65 | 0.58 | 0.60 |
|  | Test | 0.62 | 0.57 | 0.41 |
| CNN-RNN | Train | 0.65 | 1.00 | 0.90 |
|  | Validation | 0.65 | 0.58 | 0.60 |
|  | Test | 0.63 | 0.61 | 0.63 |
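Table 3 indicates that dropout alone generalizes best for both models, while batch normalization alone drives training F1 to 1.00 with a visible drop on the test split. A minimal sketch of how the three regularization variants can be toggled inside one convolutional block is given below; the dropout rate and layer placement are assumptions, not the authors' exact settings.

```python
# Sketch of the regularization variants from Table 3 applied to one
# convolutional block; the 0.25 dropout rate and layer order are assumed.
from tensorflow.keras import layers

def conv_block(x, filters, use_dropout=False, use_batchnorm=False):
    """Conv2D block with optional batch normalization and/or dropout."""
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)
    if use_batchnorm:
        x = layers.BatchNormalization()(x)        # normalize pre-activation
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    if use_dropout:
        x = layers.Dropout(0.25)(x)               # assumed dropout rate
    return x
```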
Table 4. Average F1 scores for effects of initialization methods
| Initialization | CNN | CRNN | CNN-RNN |
|---|---|---|---|
| Glorot Normal | 0.31 | 0.63 | 0.63 |
| Glorot Uniform | 0.34 | 0.60 | 0.59 |
| Random Normal | 0.33 | 0.45 | 0.53 |
| Random Uniform | 0.33 | 0.37 | 0.57 |
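The four schemes in Table 4 map directly onto standard Keras initializers. The sketch below shows how they can be swapped into a layer's kernel initializer; the scale parameters for the random initializers are Keras defaults, not values taken from the paper.

```python
# Sketch of selecting a weight-initialization scheme by name; the random
# initializer bounds/stddev are Keras defaults and assumed.
from tensorflow.keras import initializers, layers

INITIALIZERS = {
    "Glorot Normal": initializers.GlorotNormal(),
    "Glorot Uniform": initializers.GlorotUniform(),
    "Random Normal": initializers.RandomNormal(mean=0.0, stddev=0.05),
    "Random Uniform": initializers.RandomUniform(minval=-0.05, maxval=0.05),
}

def dense_with_init(units, scheme):
    """Dense layer whose kernel initializer is chosen by scheme name."""
    return layers.Dense(units, activation="relu",
                        kernel_initializer=INITIALIZERS[scheme])
```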
Figure captions: Baseline CNN model; CRNN sequential architecture; Parallel CNN-RNN architecture; Visualization of one song from our dataset; RMSprop learning process on two axes; Classification accuracy across 50 epochs.