
FVQA: Fact-Based Visual Question Answering

Published: 01 October 2018

Abstract

Visual Question Answering (VQA) has attracted much attention in both the computer vision and natural language processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited: it excludes, for example, questions which require common sense or basic factual knowledge to answer. Here we introduce FVQA (Fact-based VQA), a VQA dataset which requires, and supports, much deeper reasoning. FVQA primarily contains questions that require external information to answer. We thus extend a conventional visual question answering dataset, which contains image-question-answer triplets, with additional image-question-answer-supporting-fact tuples. Each supporting fact is represented as a structural triplet, such as <Cat, CapableOf, ClimbingTrees>. We evaluate several baseline models on the FVQA dataset, and describe a novel model which is capable of reasoning about an image on the basis of supporting facts.
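To make the data format concrete, the sketch below shows one way an image-question-answer-supporting-fact tuple could be represented, with the supporting fact stored as a subject-relation-object triplet. This is a minimal illustration in Python: the class and field names and the example image path are hypothetical, not the dataset's actual schema; only the triplet <Cat, CapableOf, ClimbingTrees> comes from the abstract.

    from dataclasses import dataclass

    @dataclass
    class SupportingFact:
        # A structural triplet drawn from an external knowledge base,
        # e.g. <Cat, CapableOf, ClimbingTrees>.
        subject: str
        relation: str
        obj: str

    @dataclass
    class FVQASample:
        # The conventional image-question-answer triplet ...
        image_path: str
        question: str
        answer: str
        # ... extended with the supporting fact needed to answer.
        fact: SupportingFact

    # Hypothetical sample mirroring the triplet quoted in the abstract.
    sample = FVQASample(
        image_path="images/cat_in_tree.jpg",  # illustrative path only
        question="Why can the animal in the image climb the tree?",
        answer="It is a cat, and cats are capable of climbing trees.",
        fact=SupportingFact("Cat", "CapableOf", "ClimbingTrees"),
    )
    print(sample.fact)  # SupportingFact(subject='Cat', relation='CapableOf', obj='ClimbingTrees')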



Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Issue 10
October 2018
252 pages

Publisher

IEEE Computer Society

United States


Qualifiers

  • Research-article
