Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering

Published: 15 October 2018

Abstract

Given only a few image-text pairs, humans can learn to detect semantic concepts and describe the content. Machine learning algorithms, in contrast, usually require large amounts of data to train a deep neural network for the same task. It is therefore challenging for existing systems to generalize to the few-shot multi-modal scenario, because the learner must understand not only images and text but also their relationships from only a few examples. In this paper, we tackle two multi-modal problems, i.e., image captioning and visual question answering (VQA), in the few-shot setting.
We propose Fast Parameter Adaptation for Image-Text Modeling (FPAIT), which learns how to jointly understand image and text data from only a few examples. In practice, FPAIT has two benefits. (1) Fast learning ability. FPAIT learns good initial parameters for the joint image-text learner from a large number of different tasks; when a new task arrives, it reaches strong performance with only a small number of gradient steps. (2) Robustness to few examples. In few-shot tasks, the small training set introduces large biases into Convolutional Neural Network (CNN) features and harms the learner's performance. FPAIT leverages dynamic linear transformations to alleviate the side effects of the small training set, flexibly normalizing the features and thus reducing the biases during training. Quantitatively, FPAIT achieves superior performance on both few-shot image captioning and VQA benchmarks.
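The two mechanisms described in the abstract — a meta-learned initialization adapted with a few gradient steps, and dynamic linear transformations that rescale CNN features — can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration only: the module names, dimensions, conditioning signal, and the first-order inner-loop update are assumptions made for clarity (it also assumes PyTorch 2.x for torch.func.functional_call), not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DynamicLinearTransform(nn.Module):
    """Hypothetical FiLM-style layer: predicts a per-channel scale and shift
    for CNN feature maps from a conditioning vector (e.g. a task or text
    embedding), so the features can be re-normalized for the current task."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(cond_dim, num_channels)
        self.to_shift = nn.Linear(cond_dim, num_channels)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) CNN features; cond: (B, cond_dim)
        scale = self.to_scale(cond)[..., None, None]
        shift = self.to_shift(cond)[..., None, None]
        return (1.0 + scale) * feats + shift


def adapt_to_task(model, loss_fn, support_inputs, support_targets,
                  inner_lr=0.01, steps=5):
    """MAML-style inner loop (first-order variant, assumed for brevity):
    starting from the meta-learned initialization, take a few gradient steps
    on the new task's small support set and return the adapted weights."""
    fast_weights = {name: p.clone() for name, p in model.named_parameters()}
    for _ in range(steps):
        preds = torch.func.functional_call(model, fast_weights, support_inputs)
        loss = loss_fn(preds, support_targets)
        grads = torch.autograd.grad(loss, list(fast_weights.values()))
        fast_weights = {name: w - inner_lr * g
                        for (name, w), g in zip(fast_weights.items(), grads)}
    return fast_weights
```

In a full meta-training loop, the query-set loss evaluated under the adapted weights would be backpropagated into the shared initialization; that outer update is omitted from this sketch.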

Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. few-shot learning
  2. image captioning
  3. visual question answering

Qualifiers

  • Research-article

Conference

MM '18
Sponsor:
MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 Paper Acceptance Rate 209 of 757 submissions, 28%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
