Showing 1–50 of 68 results for author: Rohrbach, M

Searching in archive cs.
  1. arXiv:2311.15964  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Efficient Pre-training for Localized Instruction Generation of Videos

    Authors: Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

    Abstract: Procedural videos, exemplified by recipe demonstrations, are instrumental in conveying step-by-step instructions. However, understanding such videos is challenging as it involves the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leverag…

    Submitted 20 July, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: ECCV 2024

  2. arXiv:2306.08751  [pdf, other]

    cs.CV

    Improving Selective Visual Question Answering by Learning from Your Peers

    Authors: Corentin Dancette, Spencer Whitehead, Rishabh Maheshwary, Ramakrishna Vedantam, Stefan Scherer, Xinlei Chen, Matthieu Cord, Marcus Rohrbach

    Abstract: Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored. Recent work has shown that VQA models, out-of-the-box, can have difficulties abstaining from answering when they are wrong. The option to abstain, also called Selective Prediction, is highly relevant when deploying systems to users who must trust the system's output (e.…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: CVPR 2023. Code available here: https://github.com/facebookresearch/selective-vqa_ood

  3. arXiv:2305.07021  [pdf, other]

    cs.CV

    Simple Token-Level Confidence Improves Caption Correctness

    Authors: Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

    Abstract: The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet…

    Submitted 11 May, 2023; originally announced May 2023.

  4. arXiv:2206.04790  [pdf, other]

    cs.CV

    Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition

    Authors: Shreyank N Gowda, Marcus Rohrbach, Frank Keller, Laura Sevilla-Lara

    Abstract: We address the problem of data augmentation for video action recognition. Standard augmentation strategies in video are hand-designed and sample the space of possible augmented data points either at random, without knowing which augmented points will be better, or through heuristics. We propose to learn what makes a good video for action recognition and select only high-quality samples for augment…

    Submitted 23 July, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: Accepted to ECCV-2022

  5. arXiv:2204.13631  [pdf, other]

    cs.CV

    Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

    Authors: Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

    Abstract: Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a question), such ability has been largely neglected in multimodal research, despite the importance of this problem to the usage of VQA in real settings. In this…

    Submitted 20 October, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

    Comments: ECCV 2022. Code and models are available here: https://github.com/facebookresearch/reliable_vqa

  6. arXiv:2201.10990  [pdf, other]

    cs.CV

    Learning To Recognize Procedural Activities with Distant Supervision

    Authors: Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

    Abstract: In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal d…

    Submitted 16 June, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: CVPR 2022. Code will be released here: https://github.com/facebookresearch/video-distant-supervision

  7. arXiv:2112.04482  [pdf, other]

    cs.CV cs.CL

    FLAVA: A Foundational Language And Vision Alignment Model

    Authors: Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

    Abstract: State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic u…

    Submitted 29 March, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR 2022

  8. arXiv:2107.13029  [pdf, other]

    cs.CV

    A New Split for Evaluating True Zero-Shot Action Recognition

    Authors: Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, Marcus Rohrbach

    Abstract: Zero-shot action recognition is the task of classifying action categories that are not available in the training set. In this setting, the standard evaluation protocol is to use existing action recognition datasets (e.g. UCF101) and randomly split the classes into seen and unseen. However, most recent work builds on representations pre-trained on the Kinetics dataset, where classes largely overlap…

    Submitted 13 September, 2021; v1 submitted 27 July, 2021; originally announced July 2021.

    Comments: Accepted to GCPR 2021

  9. arXiv:2101.07042  [pdf, other]

    cs.CV

    CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition

    Authors: Shreyank N Gowda, Laura Sevilla-Lara, Frank Keller, Marcus Rohrbach

    Abstract: Zero-shot action recognition is the task of recognizing action classes without visual examples, only with a semantic embedding which relates unseen to seen classes. The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes. Neural networks can model the complex boundaries between visual classes, which explain…

    Submitted 23 July, 2022; v1 submitted 18 January, 2021; originally announced January 2021.

    Comments: Accepted to ECCV-22

  10. arXiv:2012.11014  [pdf, other]

    cs.CV cs.CL

    KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA

    Authors: Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, Marcus Rohrbach

    Abstract: One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image. In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time. We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can…

    Submitted 20 December, 2020; originally announced December 2020.

  11. arXiv:2012.10671  [pdf, other]

    cs.CV

    SMART Frame Selection for Action Recognition

    Authors: Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara

    Abstract: Action recognition is computationally expensive. In this paper, we address the problem of frame selection to improve the accuracy of action recognition. In particular, we show that selecting good frames helps in action recognition performance even in the trimmed videos domain. Recent work has successfully leveraged frame selection for long, untrimmed videos, where much of the content is not releva…

    Submitted 19 December, 2020; originally announced December 2020.

    Comments: To be published in AAAI-21

  12. arXiv:2010.01528  [pdf, other]

    cs.CV cs.AI cs.LG

    Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting

    Authors: Sayna Ebrahimi, Suzanne Petryk, Akash Gokul, William Gan, Joseph E. Gonzalez, Marcus Rohrbach, Trevor Darrell

    Abstract: The goal of continual learning (CL) is to learn a sequence of tasks without suffering from the phenomenon of catastrophic forgetting. Previous work has shown that leveraging memory in the form of a replay buffer can reduce performance degradation on prior tasks. We hypothesize that forgetting can be further reduced when the model is encouraged to remember the evidence for previously made…

    Submitted 2 May, 2021; v1 submitted 4 October, 2020; originally announced October 2020.

    Comments: Accepted at ICLR 2021

  13. arXiv:2003.12462  [pdf, other]

    cs.CV cs.CL

    TextCaps: a Dataset for Image Captioning with Reading Comprehension

    Authors: Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh

    Abstract: Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to c…

    Submitted 4 August, 2020; v1 submitted 23 March, 2020; originally announced March 2020.

    Comments: To appear in ECCV 2020 (oral). Project page: https://textvqa.org/textcaps

  14. arXiv:2003.09553  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Adversarial Continual Learning

    Authors: Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, Marcus Rohrbach

    Abstract: Continual learning aims to learn new tasks without forgetting previously learned ones. We hypothesize that representations learned to solve each task in a sequence have a shared structure while containing some task-specific properties. We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representatio…

    Submitted 21 July, 2020; v1 submitted 20 March, 2020; originally announced March 2020.

    Comments: Accepted at ECCV 2020

  15. arXiv:2001.03615  [pdf, other]

    cs.CV

    In Defense of Grid Features for Visual Question Answering

    Authors: Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

    Abstract: Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g. better localization) are the key reasons for the success of bottom-up attention. In this pa…

    Submitted 2 April, 2020; v1 submitted 10 January, 2020; originally announced January 2020.

    Journal ref: CVPR, 2020

  16. arXiv:1912.02315  [pdf, other]

    cs.CV cs.CL cs.LG

    12-in-1: Multi-Task Vision and Language Representation Learning

    Authors: Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

    Abstract: Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training reg…

    Submitted 24 April, 2020; v1 submitted 4 December, 2019; originally announced December 2019.

    Comments: Jiasen Lu and Vedanuj Goswami contributed equally to this work

  17. arXiv:1911.06258  [pdf, other]

    cs.CV cs.CL

    Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

    Authors: Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

    Abstract: Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question. However, existing approaches for Te…

    Submitted 24 March, 2020; v1 submitted 14 November, 2019; originally announced November 2019.

    Comments: CVPR 2020

  18. arXiv:1910.09217  [pdf, other]

    cs.CV

    Decoupling Representation and Classifier for Long-Tailed Recognition

    Authors: Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, Yannis Kalantidis

    Abstract: The long-tail distribution of the visual world poses great challenges for deep learning based classification models on how to handle the class imbalance problem. Existing solutions usually involve class-balancing strategies, e.g., by loss re-weighting, data re-sampling, or transfer learning from head- to tail-classes, but most of them adhere to the scheme of jointly learning representations and cl…

    Submitted 19 February, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

    Journal ref: Published as a conference paper at ICLR 2020

  19. arXiv:1906.02425  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Uncertainty-guided Continual Learning with Bayesian Neural Networks

    Authors: Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, Marcus Rohrbach

    Abstract: Continual learning aims to learn new tasks without forgetting previously learned ones. This is especially challenging when one cannot access data from previous tasks and when the model has a fixed capacity. Current regularization-based continual learning algorithms need an external representation and extra computation to measure the parameters' importance. In contrast, we propose Uncertai…

    Submitted 19 February, 2020; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: Accepted at ICLR 2020

  20. arXiv:1906.00283  [pdf, other]

    cs.CV cs.CL cs.LG

    Learning to Generate Grounded Visual Captions without Localization Supervision

    Authors: Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

    Abstract: When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the model is hallucinating based on priors in the dataset and/or the language model. The most common way of relating image regions with words in caption models is t…

    Submitted 17 July, 2020; v1 submitted 1 June, 2019; originally announced June 2019.

    Comments: ECCV 2020. Code is available at https://github.com/chihyaoma/cyclical-visual-captioning

  21. arXiv:1904.08920  [pdf, other]

    cs.CL cs.CV cs.LG

    Towards VQA Models That Can Read

    Authors: Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach

    Abstract: Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of…

    Submitted 13 May, 2019; v1 submitted 18 April, 2019; originally announced April 2019.

    Comments: CVPR 2019

  22. arXiv:1904.05049  [pdf, ps, other]

    cs.CV

    Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

    Authors: Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng

    Abstract: In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their freq…

    Submitted 18 August, 2019; v1 submitted 10 April, 2019; originally announced April 2019.

    Comments: Accepted to ICCV 2019

  23. arXiv:1903.03166  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

    Authors: Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach

    Abstract: Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively-expensive complete annotation of the 'state' of all images and dialogs. We deve…

    Submitted 18 September, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

    Comments: 13 pages, 11 figures, 3 tables, accepted as a short paper at NAACL 2019

  24. arXiv:1902.10486  [pdf, other]

    cs.LG stat.ML

    On Tiny Episodic Memories in Continual Learning

    Authors: Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, Marc'Aurelio Ranzato

    Abstract: In continual learning (CL), an agent learns from a stream of tasks leveraging prior experience to transfer knowledge to future tasks. It is an ideal framework to decrease the amount of supervision in the existing learning algorithms. But for a successful knowledge transfer, the learner needs to remember how to perform previous tasks. One way to endow the learner the ability to perform tasks seen i…

    Submitted 4 June, 2019; v1 submitted 27 February, 2019; originally announced February 2019.

    Comments: Making the main point of the paper more clear

  25. arXiv:1902.07864  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering

    Authors: Ramakrishna Vedantam, Karan Desai, Stefan Lee, Marcus Rohrbach, Dhruv Batra, Devi Parikh

    Abstract: We propose a new class of probabilistic neural-symbolic models, that have symbolic functional programs as a latent, stochastic variable. Instantiated in the context of visual question answering, our probabilistic formulation offers two key conceptual advantages over prior neural-symbolic models for VQA. Firstly, the programs generated by our model are more understandable while requiring lesser num…

    Submitted 27 June, 2019; v1 submitted 20 February, 2019; originally announced February 2019.

    Comments: ICML 2019 Camera Ready + Appendix

  26. arXiv:1902.05660  [pdf, other]

    cs.CV

    Cycle-Consistency for Robust Visual Question Answering

    Authors: Meet Shah, Xinlei Chen, Marcus Rohrbach, Devi Parikh

    Abstract: Despite significant progress in Visual Question Answering over the years, robustness of today's VQA models leaves much to be desired. We introduce a new evaluation protocol and associated dataset (VQA-Rephrasings) and show that state-of-the-art VQA models are notoriously brittle to linguistic variations in questions. VQA-Rephrasings contains 3 human-provided rephrasings for 40k questions spanning 4…

    Submitted 14 February, 2019; originally announced February 2019.

    Comments: Technical Report

  27. arXiv:1901.03460  [pdf, other]

    cs.CV

    DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition

    Authors: Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, Zhicheng Yan

    Abstract: Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow. However, computing flow from video frames is very time-consuming. Recent works directly leverage the motion vectors and residuals readily available in the compressed video to represent motion at no cost. While this avoids flow computation, it also hurts accuracy since the motion vector is…

    Submitted 7 May, 2019; v1 submitted 10 January, 2019; originally announced January 2019.

    Comments: Accepted by CVPR'19

  28. arXiv:1812.10524  [pdf, other]

    cs.CV

    Exploring the Challenges towards Lifelong Fact Learning

    Authors: Mohamed Elhoseiny, Francesca Babiloni, Rahaf Aljundi, Marcus Rohrbach, Manohar Paluri, Tinne Tuytelaars

    Abstract: So far life-long learning (LLL) has been studied in relatively small-scale and relatively artificial setups. Here, we introduce a new large-scale alternative. What makes the proposed setup more natural and closer to human-like visual systems is threefold: First, we focus on concepts (or facts, as we call them) of varying complexity, ranging from single objects to more complex structures such as ob…

    Submitted 26 December, 2018; originally announced December 2018.

    Comments: This work got published at ACCV 2018 as a main conference paper

  29. arXiv:1812.06587  [pdf, other]

    cs.CV

    Grounded Video Description

    Authors: Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach

    Abstract: Video description is one of the most challenging problems in vision and language understanding due to the large variability both on the video and language side. Models, hence, typically shortcut the difficulty in recognition and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link the sentence to the evidence in the v…

    Submitted 5 May, 2019; v1 submitted 16 December, 2018; originally announced December 2018.

    Comments: CVPR 2019 oral, camera-ready version including appendix

  30. arXiv:1812.05634  [pdf, other]

    cs.CV cs.CL

    Adversarial Inference for Multi-Sentence Video Description

    Authors: Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach

    Abstract: While significant progress has been made in the image captioning task, video description is still in its infancy due to the complex nature of video data. Generating multi-sentence descriptions for long videos is even more challenging. Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video. Recently, reinforcement and adversarial learning…

    Submitted 15 April, 2019; v1 submitted 13 December, 2018; originally announced December 2018.

    Comments: Accepted to Computer Vision and Pattern Recognition (CVPR) 2019

  31. arXiv:1812.00420  [pdf, other]

    cs.LG stat.ML

    Efficient Lifelong Learning with A-GEM

    Authors: Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, Mohamed Elhoseiny

    Abstract: In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task. In this work, we investigate the efficiency of current lifelong approaches, in terms of sample complexity, computational and memory cost. Towards this end, we first introduce a new and a more realistic evaluation protocol, wh…

    Submitted 9 January, 2019; v1 submitted 2 December, 2018; originally announced December 2018.

    Comments: Published as a conference paper at ICLR 2019

  32. arXiv:1811.12814  [pdf, other]

    cs.CV

    Graph-Based Global Reasoning Networks

    Authors: Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, Yannis Kalantidis

    Abstract: Globally modeling and reasoning over relations between regions can be beneficial for many computer vision tasks on both images and videos. Convolutional Neural Networks (CNNs) excel at modeling local relations by convolution operations, but they are typically inefficient at capturing global relations between distant regions and require stacking multiple convolution layers. In this work, we propose…

    Submitted 30 November, 2018; originally announced November 2018.

  33. arXiv:1809.01816  [pdf, other]

    cs.CV cs.AI cs.CL

    Visual Coreference Resolution in Visual Dialog using Neural Module Networks

    Authors: Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach

    Abstract: Visual dialog entails answering a series of questions grounded in an image, using dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem called visual coreference resolution that involves determining which words, typically noun phrases and pronouns…

    Submitted 6 September, 2018; originally announced September 2018.

    Comments: ECCV 2018 + results on VisDial v1.0 dataset

  34. arXiv:1807.09956  [pdf, other]

    cs.CV

    Pythia v0.1: the Winning Entry to the VQA Challenge 2018

    Authors: Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, Devi Parikh

    Abstract: This document describes Pythia v0.1, the winning entry from Facebook AI Research (FAIR)'s A-STAR team to the VQA Challenge 2018. Our starting point is a modular re-implementation of the bottom-up top-down (up-down) model. We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, w…

    Submitted 27 July, 2018; v1 submitted 26 July, 2018; originally announced July 2018.

  35. arXiv:1806.05421  [pdf, other]

    stat.ML cs.AI cs.CV cs.LG

    Selfless Sequential Learning

    Authors: Rahaf Aljundi, Marcus Rohrbach, Tinne Tuytelaars

    Abstract: Sequential learning, also called lifelong learning, studies the problem of learning tasks in a sequence with access restricted to only the data of the current task. In this paper we look at a scenario with fixed model capacity, and postulate that the learning process should not be selfish, i.e. it should account for future tasks to be added and thus leave enough capacity for them. To achieve Selfl…

    Submitted 12 April, 2019; v1 submitted 14 June, 2018; originally announced June 2018.

    Comments: Published as a conference paper at ICLR 2019

  36. arXiv:1804.10660  [pdf, other]

    cs.CV

    Large-Scale Visual Relationship Understanding

    Authors: Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, Mohamed Elhoseiny

    Abstract: Large scale visual understanding is challenging, as it requires a model to handle the widely-spread and imbalanced distribution of <subject, relation, object> triples. In real-world scenarios with large numbers of objects and relations, some are seen very commonly while others are barely seen. We develop a new relationship detection model that embeds objects and relations into two vector spaces wh…

    Submitted 16 August, 2019; v1 submitted 27 April, 2018; originally announced April 2018.

  37. arXiv:1802.08129  [pdf, other]

    cs.AI cs.CL cs.CV

    Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

    Authors: Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

    Abstract: Deep models that are both effective and explainable are desirable in many settings; prior explainable models have been unimodal, offering either image-based visualization of attention weights or text-based generation of post-hoc justifications. We propose a multimodal approach to explanation, and argue that the two modalities provide complementary explanatory strengths. We collect two new datasets…

    Submitted 15 February, 2018; originally announced February 2018.

    Comments: arXiv admin note: text overlap with arXiv:1612.04757

  38. arXiv:1712.05558  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication

    Authors: Jin-Hwa Kim, Nikita Kitaev, Xinlei Chen, Marcus Rohrbach, Byoung-Tak Zhang, Yuandong Tian, Dhruv Batra, Devi Parikh

    Abstract: In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pi…

    Submitted 4 June, 2019; v1 submitted 15 December, 2017; originally announced December 2017.

    Comments: ACL 2019

  39. arXiv:1711.09601  [pdf, other]

    cs.CV cs.AI stat.ML

    Memory Aware Synapses: Learning what (not) to forget

    Authors: Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, Tinne Tuytelaars

    Abstract: Humans can learn in a continuous manner. Old rarely utilized knowledge can be overwritten by new incoming information while important, frequently used knowledge is prevented from being erased. In artificial learning systems, lifelong learning so far has focused mainly on accumulating knowledge over tasks and overcoming catastrophic forgetting. In this paper, we argue that, given the limited model…

    Submitted 5 October, 2018; v1 submitted 27 November, 2017; originally announced November 2017.

    Comments: ECCV 2018

  40. arXiv:1711.07373  [pdf, other]

    cs.CV

    Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)

    Authors: Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

    Abstract: Deep models are the de facto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotati…

    Submitted 17 November, 2017; originally announced November 2017.

    Comments: arXiv admin note: text overlap with arXiv:1612.04757

  41. arXiv:1704.05526  [pdf, other]

    cs.CV

    Learning to Reason: End-to-End Module Networks for Visual Question Answering

    Authors: Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko

    Abstract: Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture implements this approach to questi…

    Submitted 11 September, 2017; v1 submitted 18 April, 2017; originally announced April 2017.

  42. arXiv:1704.01518  [pdf, other]

    cs.CV

    Generating Descriptions with Grounded and Co-Referenced People

    Authors: Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele

    Abstract: Learning how to generate descriptions of images or videos has received major interest in both the Computer Vision and Natural Language Processing communities. While a few works have proposed to learn a grounding during the generation process in an unsupervised way (via an attention mechanism), it remains unclear how good the quality of the grounding is and whether it benefits the description quality.…

    Submitted 5 April, 2017; originally announced April 2017.

    Comments: Accepted to CVPR 2017

  43. arXiv:1703.10476  [pdf, other]

    cs.CV cs.AI cs.CL

    Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training

    Authors: Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele

    Abstract: While strong progress has been made in image captioning over the last years, machine and human captions are still quite distinct. A closer look reveals that this is due to the deficiencies in the generated word distribution, vocabulary size, and strong bias in the generators towards frequent captions. Furthermore, humans -- rightfully so -- generate multiple, diverse captions, due to the inherent…

    Submitted 6 November, 2017; v1 submitted 30 March, 2017; originally announced March 2017.

    Comments: 16 pages, Published in ICCV 2017

  44. arXiv:1612.04757  [pdf, other]

    cs.CV cs.AI cs.CL

    Attentive Explanations: Justifying Decisions and Pointing to the Evidence

    Authors: Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

    Abstract: Deep models are the de facto standard in visual decision models due to their impressive performance on a wide array of visual tasks. However, they are frequently seen as opaque and are unable to explain their decisions. In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions. We postulate that deep models can d…

    Submitted 25 July, 2017; v1 submitted 14 December, 2016; originally announced December 2016.

  45. arXiv:1611.09978  [pdf, other]

    cs.CV

    Modeling Relationships in Referential Expressions with Compositional Modular Networks

    Authors: Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko

    Abstract: People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire r…

    Submitted 29 November, 2016; originally announced November 2016.

  46. arXiv:1608.08305  [pdf, other]

    cs.CV

    Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions

    Authors: Ronghang Hu, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell

    Abstract: Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image; and the goal is to localize and segment the specific image region based on the given expression. One major difficulty in training such language-based image segmentation systems is the lack of datasets with joint v…

    Submitted 29 August, 2016; originally announced August 2016.

  47. arXiv:1606.07770  [pdf, other]

    cs.CV cs.CL

    Captioning Images with Diverse Objects

    Authors: Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, Kate Saenko

    Abstract: Recent captioning models are limited in their ability to scale and describe concepts unseen in paired image-text corpora. We propose the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets. Our model takes advantage of external sources -- labeled images from object recognition dat…

    Submitted 20 July, 2017; v1 submitted 24 June, 2016; originally announced June 2016.

    Comments: CVPR 2017 Camera ready version. 17 pages (8 + 9 supplement), 12 figures, 8 tables. Includes project page http://vsubhashini.github.io/noc.html

  48. arXiv:1606.01847  [pdf, other]

    cs.CV cs.AI cs.CL

    Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

    Authors: Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach

    Abstract: Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual repr…

    Submitted 23 September, 2016; v1 submitted 6 June, 2016; originally announced June 2016.

    Comments: Accepted to EMNLP 2016

  49. arXiv:1605.03705  [pdf, other]

    cs.CV cs.CL

    Movie Description

    Authors: Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele

    Abstract: Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full…

    Submitted 12 May, 2016; originally announced May 2016.

  50. arXiv:1605.02697  [pdf, other]

    cs.CV cs.AI cs.CL

    Ask Your Neurons: A Deep Learning Approach to Visual Question Answering

    Authors: Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

    Abstract: We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is condition…

    Submitted 24 November, 2016; v1 submitted 9 May, 2016; originally announced May 2016.

    Comments: Improved version, it also has a final table from the VQA challenge, and more baselines on DAQUAR