Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering

Published: 15 October 2018

Abstract

Given only a few image-text pairs, humans can learn to detect semantic concepts and describe the content. Machine learning algorithms, in contrast, usually require large amounts of data to train a deep neural network for the same task. It is therefore challenging for existing systems to generalize to the few-shot multi-modal scenario, because the learner must understand not only images and text but also their relationships from only a few examples. In this paper, we tackle two multi-modal problems, i.e., image captioning and visual question answering (VQA), in the few-shot setting.
We propose Fast Parameter Adaptation for Image-Text Modeling (FPAIT), which learns how to jointly understand image and text data from only a few examples. In practice, FPAIT has two benefits. (1) Fast learning ability. FPAIT learns good initial parameters for the joint image-text learner from a large number of different tasks; when a new task arrives, it reaches strong performance with only a small number of gradient steps. (2) Robustness to few examples. In few-shot tasks, the small training set introduces large biases into Convolutional Neural Network (CNN) features and harms the learner's performance. FPAIT leverages dynamic linear transformations to alleviate the side effects of the small training set, flexibly normalizing the features and thus reducing the biases during training. Quantitatively, FPAIT achieves superior performance on both few-shot image captioning and VQA benchmarks.
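The two mechanisms described in the abstract — a meta-learned initialization adapted with a few gradient steps, and dynamic linear transformations that rescale CNN features — can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration only: the module names, dimensions, conditioning signal, and the first-order inner-loop update are assumptions made for clarity (it also assumes PyTorch 2.x for torch.func.functional_call), not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DynamicLinearTransform(nn.Module):
    """Hypothetical FiLM-style layer: predicts a per-channel scale and shift
    for CNN feature maps from a conditioning vector (e.g. a task or text
    embedding), so the features can be re-normalized for the current task."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, num_channels)
        self.to_shift = nn.Linear(cond_dim, num_channels)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) CNN features; cond: (B, cond_dim)
        scale = self.to_scale(cond)[..., None, None]
        shift = self.to_shift(cond)[..., None, None]
        return (1.0 + scale) * feats + shift


def adapt_to_task(model, loss_fn, support_inputs, support_targets,
                  inner_lr=0.01, steps=5):
    """MAML-style inner loop (first-order variant, assumed for brevity):
    starting from the meta-learned initialization, take a few gradient steps
    on the new task's small support set and return the adapted weights."""
    fast_weights = {name: p.clone() for name, p in model.named_parameters()}
    for _ in range(steps):
        preds = torch.func.functional_call(model, fast_weights, support_inputs)
        loss = loss_fn(preds, support_targets)
        grads = torch.autograd.grad(loss, list(fast_weights.values()))
        fast_weights = {name: w - inner_lr * g
                        for (name, w), g in zip(fast_weights.items(), grads)}
    return fast_weights
```

In a full meta-training loop, the query-set loss evaluated under the adapted weights would be backpropagated into the shared initialization; that outer update is omitted from this sketch.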

Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. few-shot learning
  2. image captioning
  3. visual question answering

Qualifiers

  • Research-article

Conference

MM '18
Sponsor:
MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 Paper Acceptance Rate 209 of 757 submissions, 28%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
