
FVQA: Fact-Based Visual Question Answering

Published: 01 October 2018

Abstract

Visual Question Answering (VQA) has attracted much attention in both the computer vision and natural language processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited: it excludes, for example, questions which require common sense or basic factual knowledge to answer. Here we introduce FVQA (Fact-based VQA), a VQA dataset which requires, and supports, much deeper reasoning. FVQA primarily contains questions that require external information to answer. We thus extend a conventional visual question answering dataset, which contains image-question-answer triplets, with additional image-question-answer-supporting-fact tuples. Each supporting fact is represented as a structural triplet, such as <Cat, CapableOf, ClimbingTrees>. We evaluate several baseline models on the FVQA dataset, and describe a novel model which is capable of reasoning about an image on the basis of supporting facts.
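To make the data format concrete, the sketch below shows one way an image-question-answer-supporting-fact tuple could be represented, with the supporting fact stored as a subject-relation-object triplet. This is a minimal illustration in Python: the class and field names and the example image path are hypothetical, not the dataset's actual schema; only the triplet <Cat, CapableOf, ClimbingTrees> comes from the abstract.

    from dataclasses import dataclass

    @dataclass
    class SupportingFact:
        # A structural triplet drawn from an external knowledge base,
        # e.g. <Cat, CapableOf, ClimbingTrees>.
        subject: str
        relation: str
        obj: str

    @dataclass
    class FVQASample:
        # The conventional image-question-answer triplet ...
        image_path: str
        question: str
        answer: str
        # ... extended with the supporting fact needed to answer.
        fact: SupportingFact

    # Hypothetical sample mirroring the triplet quoted in the abstract.
    sample = FVQASample(
        image_path="images/cat_in_tree.jpg",  # illustrative path only
        question="Why can the animal in the image climb the tree?",
        answer="It is a cat, and cats are capable of climbing trees.",
        fact=SupportingFact("Cat", "CapableOf", "ClimbingTrees"),
    )
    print(sample.fact)  # SupportingFact(subject='Cat', relation='CapableOf', obj='ClimbingTrees')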



Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Issue 10
October 2018
252 pages

Publisher

IEEE Computer Society

United States


Qualifiers

  • Research-article
