A light-weight multimodal framework for improved environmental audio tagging
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). ieeexplore.ieee.org
The lack of strong labels has severely limited the scaling of state-of-the-art fully supervised audio tagging systems to larger datasets. Meanwhile, audio-visual learning models based on unlabeled videos have been successfully applied to audio tagging, but they are inevitably resource-hungry and require a long time to train. In this work, we propose a light-weight, multimodal framework for environmental audio tagging. The audio branch of the framework is a convolutional and recurrent neural network (CRNN) based on multiple instance learning (MIL). It is trained on the audio tracks of a large collection of weakly labeled YouTube video excerpts; the video branch uses pretrained state-of-the-art image recognition networks and word embeddings to extract information from the video track and to map visual objects to sound events. Experiments on the audio tagging task of the DCASE 2017 challenge show that incorporating video information improves a strong baseline audio tagging system by 5.3% in terms of F1 score. The entire system can be trained within 6 hours on a single GPU, and can be easily carried over to other audio tasks such as speech sentiment analysis.
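The audio branch described above follows the standard MIL view of weak labels: the network scores every frame of a clip, and a pooling function aggregates those frame (instance) scores into a single clip (bag) score per tag, so only clip-level labels are needed for training. Below is a minimal PyTorch sketch of that idea; the layer sizes, the linear-softmax pooling, and the 17-tag output are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal MIL-style CRNN audio tagger (illustrative sketch; layer sizes,
# pooling choice, and tag count are assumptions, not the paper's exact setup).
import torch
import torch.nn as nn

class MILCRNNTagger(nn.Module):
    def __init__(self, n_mels=64, n_tags=17):
        super().__init__()
        # Convolutional front end over (batch, 1, time, mel) log-mel spectrograms.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((1, 2)),               # pool along the mel axis only
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        # Recurrent layer models temporal context across frames.
        self.gru = nn.GRU(64 * (n_mels // 4), 128,
                          batch_first=True, bidirectional=True)
        # Frame-level tag probabilities: the MIL "instances".
        self.frame_fc = nn.Linear(256, n_tags)

    def forward(self, logmel):
        # logmel: (batch, time, mel)
        x = self.conv(logmel.unsqueeze(1))       # (batch, ch, time, mel')
        b, c, t, m = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * m)
        x, _ = self.gru(x)
        frame_probs = torch.sigmoid(self.frame_fc(x))  # (batch, time, tags)
        # MIL pooling: aggregate frame scores into one clip score per tag.
        # Linear-softmax pooling is one common choice for weakly labeled audio.
        clip_probs = (frame_probs ** 2).sum(1) / frame_probs.sum(1).clamp(min=1e-7)
        return clip_probs, frame_probs

# Usage: train the clip-level outputs against weak labels with BCE.
model = MILCRNNTagger()
batch = torch.randn(4, 240, 64)                  # 4 clips, 240 frames, 64 mels
clip_probs, _ = model(batch)
loss = nn.functional.binary_cross_entropy(clip_probs, torch.rand(4, 17).round())
```

Because only the pooled clip-level probabilities receive supervision, the frame-level outputs come for free and can also be inspected for rough temporal localization of each tag.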
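The video branch, per the abstract, runs pretrained image recognition networks on the video frames and uses word embeddings to relate the recognized visual objects to sound event labels. The sketch below shows one way such a mapping could work, using gensim's pretrained GloVe vectors; the embedding choice, the similarity rule, and the label names are all assumptions, not the paper's actual procedure.

```python
# Sketch of a visual-object-to-sound-event mapping via word embeddings
# (illustrative; embedding model, scoring rule, and labels are assumptions).
import numpy as np
import gensim.downloader

# Pretrained word vectors (downloads on first use; any embedding source works).
wv = gensim.downloader.load("glove-wiki-gigaword-100")

sound_events = ["siren", "train", "car"]              # hypothetical tag names
# Visual objects would come from, e.g., the top ImageNet classes per frame.

def tag_scores(object_probs):
    """Turn image-classifier outputs into sound-event evidence.

    object_probs: dict mapping a visual object name to its probability from a
    pretrained image recognition network applied to the video frames.
    """
    scores = np.zeros(len(sound_events))
    for obj, p in object_probs.items():
        for i, event in enumerate(sound_events):
            # Cosine similarity between the object and event word vectors;
            # weight it by the detector's confidence in the object.
            sim = wv.similarity(obj, event)
            scores[i] = max(scores[i], p * max(float(sim), 0.0))
    return scores

print(tag_scores({"ambulance": 0.9, "taxi": 0.5}))
```

Because both the image networks and the word embeddings are pretrained, this branch needs no video-side training at all, which is consistent with the abstract's claim that the full system trains within 6 hours on a single GPU.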