Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching

Zhuobin Zheng; Youcheng Ben; Chun Yuan

Proceedings of The Eleventh Asian Conference on Machine Learning, PMLR 101:940-955, 2019.

Abstract

Bird communities in mangrove areas are an important component of the mangrove wetland ecosystem and serve as indicators for assessing the environmental health of mangrove wetlands. Classifying bird species by their sounds in mangrove areas has the advantages of low interference with the environment and a wide monitoring range. In this paper, we propose a novel method that combines a feature recalibration mechanism with depthwise separable convolution for mangrove bird sound classification. As the baseline network, we use Xception, in which depthwise separable convolutions, which have fewer parameters and a lower computational cost than traditional convolutions, are stacked in a residual manner. We fuse the feature recalibration mechanism into the depthwise separable convolution to actively learn the weights of the feature channels in the network layers, so that important features in bird sound signals are enhanced and classification performance improves. In the proposed method, we first extract three-channel log-mel features from the bird sound signals and apply the mixup method to augment the extracted features. Second, we construct recalibrated feature maps containing information at different scales to obtain the classification results. To verify the effectiveness of the proposed method, we build a dataset of 9282 samples covering 25 species of mangrove birds, such as Egretta alba, Parus major, and Charadrius dubius, that inhabit the mangroves of Fangcheng Port, China, and run experiments on this dataset. Furthermore, we validate the adaptability of the proposed method on the TAU Urban Acoustic Scenes 2019 dataset and achieve a better result.
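
The abstract's remark that depthwise separable convolution needs fewer parameters than traditional convolution can be illustrated with a small count. The sketch below is not taken from the paper: the function names and the layer sizes (128 input channels, 256 output channels, 3 x 3 kernel) are hypothetical, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a depthwise k x k convolution followed by a
    1 x 1 pointwise convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    c_in, c_out, k = 128, 256, 3
    std = standard_conv_params(c_in, c_out, k)
    sep = depthwise_separable_params(c_in, c_out, k)
    print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

For these assumed sizes the separable layer uses roughly 11.5% of the weights of the standard layer; in general the ratio is 1/c_out + 1/k^2.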

Cite this Paper

BibTeX


@InProceedings{pmlr-v101-zheng19a,
  title     = {Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching},
  author    = {Zheng, Zhuobin and Ben, Youcheng and Yuan, Chun},
  booktitle = {Proceedings of The Eleventh Asian Conference on Machine Learning},
  pages     = {940--955},
  year      = {2019},
  editor    = {Lee, Wee Sun and Suzuki, Taiji},
  volume    = {101},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--19 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v101/zheng19a/zheng19a.pdf},
  url       = {https://proceedings.mlr.press/v101/zheng19a.html},
  abstract = 	 {The bird community in the mangrove areas is an important component of the mangrove wetlands ecosystem and an indicator species for the assessment of the environmental health status of mangrove wetlands. The classification of bird species by the sound of bird in the mangrove areas has the advantages of less interference to the environment and wide monitoring range. In this paper, we propose a novel method that combines the feature recalibration mechanism with depthwise separable convolution for the mangrove bird sound classification. In the proposed method, we introduce Xception network in which depthwise separable convolution with lower parameter number and computational cost than traditional convolution can be stacked in a residual manner, as the baseline network. And we fuse the feature recalibration mechanism into the depthwise separable convolution for actively learning the weights of the feature channels in the network layer, so that we can enhance the important features in bird sound signals to improve the performance of the classification. In the proposed method, firstly we extract three-channel log-mel features of the bird sound signals and we introduce the mixup method to augment the extracted features. Secondly, we construct the recalibrated feature maps including the different scales of information to get the classification results. To verify the effectiveness of the proposed method, we build a dataset with 9282 samples including 25 kinds of the mangrove birds such as Egretta alba, Parus major, Charadrius dubius, etc. habiting in the mangroves of Fangcheng Port of China, and execute the experiments on the built dataset. Furthermore, we also validate the adaptability of our proposed method on the dataset of TAU Urban Acoustic Scenes 2019, and achieve a better result.}
}

Endnote

%0 Conference Paper
%T Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching
%A Zhuobin Zheng
%A Youcheng Ben
%A Chun Yuan
%B Proceedings of The Eleventh Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Wee Sun Lee
%E Taiji Suzuki
%F pmlr-v101-zheng19a
%I PMLR
%P 940--955
%U https://proceedings.mlr.press/v101/zheng19a.html
%V 101
%X The bird community in the mangrove areas is an important component of the mangrove wetlands ecosystem and an indicator species for the assessment of the environmental health status of mangrove wetlands. The classification of bird species by the sound of bird in the mangrove areas has the advantages of less interference to the environment and wide monitoring range. In this paper, we propose a novel method that combines the feature recalibration mechanism with depthwise separable convolution for the mangrove bird sound classification. In the proposed method, we introduce Xception network in which depthwise separable convolution with lower parameter number and computational cost than traditional convolution can be stacked in a residual manner, as the baseline network. And we fuse the feature recalibration mechanism into the depthwise separable convolution for actively learning the weights of the feature channels in the network layer, so that we can enhance the important features in bird sound signals to improve the performance of the classification. In the proposed method, firstly we extract three-channel log-mel features of the bird sound signals and we introduce the mixup method to augment the extracted features. Secondly, we construct the recalibrated feature maps including the different scales of information to get the classification results. To verify the effectiveness of the proposed method, we build a dataset with 9282 samples including 25 kinds of the mangrove birds such as Egretta alba, Parus major, Charadrius dubius, etc. habiting in the mangroves of Fangcheng Port of China, and execute the experiments on the built dataset. Furthermore, we also validate the adaptability of our proposed method on the dataset of TAU Urban Acoustic Scenes 2019, and achieve a better result.

APA


Zheng, Z., Ben, Y. & Yuan, C. (2019). Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching. Proceedings of The Eleventh Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 101:940-955. Available from https://proceedings.mlr.press/v101/zheng19a.html.
