
Showing 1–47 of 47 results for author: Rohrbach, A

Searching in archive cs.
  1. arXiv:2311.17942  [pdf, other]

    cs.CV

    Object-based (yet Class-agnostic) Video Domain Adaptation

    Authors: Dantong Niu, Amir Bar, Roei Herzig, Trevor Darrell, Anna Rohrbach

    Abstract: Existing video-based action recognition systems typically require dense annotation and struggle in environments where there is significant distribution shift relative to the training data. Current methods for video domain adaptation typically fine-tune the model using fully annotated data on a subset of target domain data or align the representation of the two domains using bootstrapping or adversa…

    Submitted 28 November, 2023; originally announced November 2023.

  2. arXiv:2306.00576  [pdf, other]

    cs.CV

    MammalNet: A Large-scale Video Benchmark for Mammal Recognition and Behavior Understanding

    Authors: Jun Chen, Ming Hu, Darren J. Coker, Michael L. Berumen, Blair Costelloe, Sara Beery, Anna Rohrbach, Mohamed Elhoseiny

    Abstract: Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function. Automatic recognition of animals and their behaviors is critical for capitalizing on the large unlabeled datasets generated by modern video devices and for accelerating monitoring efforts at scale. However, the development of automated recognitio…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: CVPR 2023 proceeding

  3. arXiv:2305.07021  [pdf, other]

    cs.CV

    Simple Token-Level Confidence Improves Caption Correctness

    Authors: Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

    Abstract: The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet…

    Submitted 11 May, 2023; originally announced May 2023.

  4. arXiv:2212.00843  [pdf, other]

    cs.CV cs.CL

    Focus! Relevant and Sufficient Context Selection for News Image Captioning

    Authors: Mingyang Zhou, Grace Luo, Anna Rohrbach, Zhou Yu

    Abstract: News Image Captioning requires describing an image by leveraging additional context from a news article. Previous works only coarsely leverage the article to extract the necessary context, which makes it challenging for models to identify relevant events and named entities. In our paper, we first demonstrate that by combining more fine-grained context that captures the key named entities (obtained…

    Submitted 1 December, 2022; originally announced December 2022.

    Comments: Findings of EMNLP 2022

  5. arXiv:2212.00210  [pdf, other]

    cs.CV cs.AI cs.LG

    Shape-Guided Diffusion with Inside-Outside Attention

    Authors: Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell

    Abstract: We introduce precise object silhouette as a new form of user control in text-to-image diffusion models, which we dub Shape-Guided Diffusion. Our training-free method uses an Inside-Outside Attention mechanism during the inversion and generation process to apply a shape constraint to the cross- and self-attention maps. Our mechanism designates which spatial region is the object (inside) vs. backgro…

    Submitted 1 April, 2024; v1 submitted 30 November, 2022; originally announced December 2022.

    Comments: WACV 2024

  6. arXiv:2211.15521  [pdf, other]

    cs.CV cs.CL

    G^3: Geolocation via Guidebook Grounding

    Authors: Grace Luo, Giscard Biamby, Trevor Darrell, Daniel Fried, Anna Rohrbach

    Abstract: We demonstrate how language can improve geolocation: the task of predicting the location where an image was taken. Here we study explicit knowledge from human-written guidebooks that describe the salient and class-discriminative visual features humans use for geolocation. We propose the task of Geolocation via Guidebook Grounding that uses a dataset of StreetView images from a diverse set of locat…

    Submitted 28 November, 2022; originally announced November 2022.

    Comments: Findings of EMNLP 2022

  7. arXiv:2210.09520  [pdf, other]

    cs.CV

    Using Language to Extend to Unseen Domains

    Authors: Lisa Dunlap, Clara Mohri, Devin Guillory, Han Zhang, Trevor Darrell, Joseph E. Gonzalez, Aditi Raghunathan, Anja Rohrbach

    Abstract: It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply verbalizing the training domain (e.g. "photos of birds") as well as domains we want to extend to but do not have data for (e.g. "paintings of birds") can improve robustness. Using a multimodal model with a joint image and language embedding space, our m…

    Submitted 29 April, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

  8. arXiv:2208.06773  [pdf, other]

    cs.CV cs.IR cs.LG cs.MM

    TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

    Authors: Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

    Abstract: YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In compa…

    Submitted 14 August, 2022; originally announced August 2022.

    Comments: Accepted to ECCV 2022. Website: https://medhini.github.io/ivsum/

  9. arXiv:2206.07689  [pdf, other]

    cs.CV

    Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

    Authors: Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

    Abstract: This technical report describes the SViT approach for the Ego4D Point of No Return (PNR) Temporal Localization Challenge. We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights. First, as both images and videos contain structur…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: Ego4D CVPR22 Object State Localization challenge. arXiv admin note: substantial text overlap with arXiv:2206.06346

  10. arXiv:2206.06346  [pdf]

    cs.CV

    Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

    Authors: Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

    Abstract: Recent action recognition models have achieved impressive results by integrating objects, their locations and interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how cou…

    Submitted 29 November, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

    Comments: Tech report

  11. arXiv:2204.13631  [pdf, other]

    cs.CV

    Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

    Authors: Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

    Abstract: Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a question), such ability has been largely neglected in multimodal research, despite the importance of this problem to the usage of VQA in real settings. In this…

    Submitted 20 October, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

    Comments: ECCV 2022. Code and models are available here: https://github.com/facebookresearch/reliable_vqa

  12. arXiv:2204.09222  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    K-LITE: Learning Transferable Visual Models with External Knowledge

    Authors: Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao

    Abstract: The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, due to the broad concept coverage achieved via large-scale data collection process. Alternatively, we argue that learning with ext…

    Submitted 21 October, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

    Comments: NeurIPS 2022 camera ready

  13. arXiv:2204.05991  [pdf, other]

    cs.CV cs.CL

    ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

    Authors: Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach

    Abstract: Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP,…

    Submitted 2 May, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

    Comments: ACL 2022

  14. arXiv:2202.08926  [pdf, other]

    cs.CV

    On Guiding Visual Attention with Language Specification

    Authors: Suzanne Petryk, Lisa Dunlap, Keyan Nasseri, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach

    Abstract: While real world challenges typically define visual categories with language words or phrases, most visual classification methods define categories with numerical indices. However, the language specification of the classes provides an especially useful prior for biased and noisy datasets, where it can help disambiguate what features are task-relevant. Recently, large-scale multimodal models have b…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: 14 pages, 9 figures

  15. arXiv:2202.04800  [pdf, other]

    cs.CV cs.CL

    The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

    Authors: Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi

    Abstract: Humans have remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can't help but draw probable inferences beyond the literal scene based on our everyday experience and knowledge about the world. For example, if we see a "20 mph" sign alongside a road, we might as…

    Submitted 25 July, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: code, data, models at http://visualabduction.com/

    Journal ref: ECCV 2022

  16. arXiv:2112.10936  [pdf, other]

    cs.CV cs.AI cs.CL cs.CR cs.MM

    Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion

    Authors: Shruti Agarwal, Liwen Hu, Evonne Ng, Trevor Darrell, Hao Li, Anna Rohrbach

    Abstract: In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques. Such falsifications range from cheapfakes (e.g., lookalikes or audio dubbing) to deepfakes (e.g., sophisticated AI media synthesis methods), which are becoming perceptually indistinguishable from real videos. To tackle this challenge, we propose a multi-modal semantic foren…

    Submitted 1 December, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: Accepted in WACV 2023

  17. arXiv:2112.08594  [pdf, other]

    cs.CV cs.CL

    Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation

    Authors: Giscard Biamby, Grace Luo, Trevor Darrell, Anna Rohrbach

    Abstract: Detecting out-of-context media, such as "mis-captioned" images on Twitter, is a relevant problem, especially in domains of high public significance. In this work we aim to develop defenses against such misinformation for the topics of Climate Change, COVID-19, and Military Vehicles. We first present a large-scale multimodal dataset with over 884k tweets relevant to these topics. Next, we propose a…

    Submitted 2 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

    Comments: 11 pages, 6 figures

  18. arXiv:2112.05744  [pdf, other]

    cs.CV cs.GR

    More Control for Free! Image Synthesis with Semantic Diffusion Guidance

    Authors: Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell

    Abstract: Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from a reference image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We investigate fine-grained, continuous control of this m…

    Submitted 5 December, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

    Comments: WACV 2023. Project page https://xh-liu.github.io/sdg/

  19. arXiv:2110.06915  [pdf, other]

    cs.CV

    Object-Region Video Transformers

    Authors: Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

    Abstract: Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly…

    Submitted 9 June, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: CVPR 2022

  20. arXiv:2107.06383  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    How Much Can CLIP Benefit Vision-and-Language Tasks?

    Authors: Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

    Abstract: Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amou…

    Submitted 13 July, 2021; originally announced July 2021.

    Comments: 14 pages

  21. arXiv:2107.00650  [pdf, other]

    cs.CV cs.AI cs.MM

    CLIP-It! Language-Guided Video Summarization

    Authors: Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

    Abstract: A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited av…

    Submitted 7 December, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: NeurIPS 2021. Website at https://medhini.github.io/clip_it/

    Journal ref: Thirty-Fifth Conference on Neural Information Processing Systems. 2021

  22. arXiv:2106.04550  [pdf, other]

    cs.CV

    DETReg: Unsupervised Pretraining with Region Priors for Object Detection

    Authors: Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

    Abstract: Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of detection architecture. Instead, we introduce DETReg, a new self-supervised method that pretrains the entire object detection network, including the object localization and embedding components. During pretraining, DETReg predicts object localiza…

    Submitted 19 July, 2023; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: Project page: https://www.amirbar.net/detreg/

  23. arXiv:2104.05893  [pdf, other]

    cs.CV cs.CL

    NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media

    Authors: Grace Luo, Trevor Darrell, Anna Rohrbach

    Abstract: Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes. We are motivated by the threat scenario where an image is used out of context to support a certain narrative. While some prior datasets for detecting image-text inconsistency generate samples via text manipulation, we propose a dataset where both image and te…

    Submitted 21 September, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

    Comments: EMNLP 2021

  24. arXiv:2008.09791  [pdf, other]

    cs.CV

    Identity-Aware Multi-Sentence Video Description

    Authors: Jae Sung Park, Trevor Darrell, Anna Rohrbach

    Abstract: Standard video and movie description tasks abstract away from person identities, thus failing to link identities across sentences. We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires to re-identify persons locally within a set of consecutive clips. We introduce an auxiliary task of Fill-in the Identity, that aims to predict persons' IDs c…

    Submitted 22 August, 2020; originally announced August 2020.

    Comments: Project link at https://sites.google.com/site/describingmovies/lsmdc-2019/

  25. arXiv:2006.15327  [pdf, other]

    cs.CV cs.LG

    Compositional Video Synthesis with Action Graphs

    Authors: Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson

    Abstract: Videos of actions are complex signals containing rich compositional structure in space and time. Current video generation methods lack the ability to condition the generation on multiple coordinated and potentially simultaneous timed actions. To address this challenge, we propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthes…

    Submitted 10 June, 2021; v1 submitted 27 June, 2020; originally announced June 2020.

    Comments: ICML 2021 Camera Ready

  26. arXiv:1906.00347  [pdf, other]

    cs.CL

    Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

    Authors: Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko

    Abstract: Vision-and-Language Navigation (VLN) requires grounding instructions, such as "turn right and stop at the door", to routes in a visual environment. The actual grounding can connect language to the environment through multiple modalities, e.g. "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route. We investigate where the natur…

    Submitted 9 June, 2019; v1 submitted 2 June, 2019; originally announced June 2019.

    Comments: ACL 2019

  27. arXiv:1905.04405  [pdf, other]

    cs.CV

    Language-Conditioned Graph Networks for Relational Reasoning

    Authors: Ronghang Hu, Anna Rohrbach, Trevor Darrell, Kate Saenko

    Abstract: Solving grounded language tasks often requires reasoning about relationships between objects in the context of a given task. For example, to answer the question "What color is the mug on the plate?" we must check the color of the specific mug that satisfies the "on" relationship with respect to the plate. Recent work has proposed various methods capable of complex relational reasoning. However, mo…

    Submitted 16 August, 2019; v1 submitted 10 May, 2019; originally announced May 2019.

  28. arXiv:1901.02527  [pdf, other]

    cs.CV cs.AI

    Robust Change Captioning

    Authors: Dong Huk Park, Trevor Darrell, Anna Rohrbach

    Abstract: Describing what has changed in a scene can be useful to a user, but only if generated text focuses on what is semantically relevant. It is thus important to distinguish distractors (e.g. a viewpoint change) from relevant changes (e.g. an object has moved). We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning. Our model learns to distinguish distractors from se…

    Submitted 16 April, 2019; v1 submitted 8 January, 2019; originally announced January 2019.

  29. arXiv:1812.05634  [pdf, other]

    cs.CV cs.CL

    Adversarial Inference for Multi-Sentence Video Description

    Authors: Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach

    Abstract: While significant progress has been made in the image captioning task, video description is still in its infancy due to the complex nature of video data. Generating multi-sentence descriptions for long videos is even more challenging. Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video. Recently, reinforcement and adversarial learning…

    Submitted 15 April, 2019; v1 submitted 13 December, 2018; originally announced December 2018.

    Comments: Accepted to Computer Vision and Pattern Recognition (CVPR) 2019

  30. arXiv:1809.02156  [pdf, other]

    cs.CL cs.CV

    Object Hallucination in Image Captioning

    Authors: Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, Kate Saenko

    Abstract: Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and asses…

    Submitted 29 March, 2019; v1 submitted 6 September, 2018; originally announced September 2018.

    Comments: Rohrbach and Hendricks contributed equally; accepted to EMNLP 2018

  31. arXiv:1807.11546  [pdf, other]

    cs.CV

    Textual Explanations for Self-Driving Vehicles

    Authors: Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata

    Abstract: Deep neural perception and control networks have become key components of self-driving vehicles. User acceptance is likely to benefit from easy-to-interpret textual explanations which allow end-users to understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller's output, namely rat…

    Submitted 30 July, 2018; originally announced July 2018.

    Comments: Accepted to ECCV 2018

    Journal ref: European Conference on Computer Vision (ECCV), 2018

  32. arXiv:1807.00517  [pdf, other]

    cs.CV

    Women also Snowboard: Overcoming Bias in Captioning Models (Extended Abstract)

    Authors: Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, Anna Rohrbach

    Abstract: Most machine learning methods are known to capture and exploit biases of the training data. While some biases are beneficial for learning, others are harmful. Specifically, image captioning models tend to exaggerate biases present in training data. This can lead to incorrect captions in domains where unbiased captions are desired, or required, due to over reliance on the learned prior and image co…

    Submitted 2 July, 2018; originally announced July 2018.

    Comments: Burns and Hendricks contributed equally. 2018 ICML Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018)

  33. arXiv:1806.02724  [pdf, other]

    cs.CV cs.CL

    Speaker-Follower Models for Vision-and-Language Navigation

    Authors: Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell

    Abstract: Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it…

    Submitted 26 October, 2018; v1 submitted 7 June, 2018; originally announced June 2018.

    Comments: NIPS 2018

  34. arXiv:1803.09797  [pdf, other]

    cs.CV

    Women also Snowboard: Overcoming Bias in Captioning Models

    Authors: Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, Anna Rohrbach

    Abstract: Most machine learning methods are known to capture and exploit biases of the training data. While some biases are beneficial for learning, others are harmful. Specifically, image captioning models tend to exaggerate biases present in training data (e.g., if a word is present in 60% of training sentences, it might be predicted in 70% of sentences at test time). This can lead to incorrect captions i…

    Submitted 13 March, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

    Comments: 22 pages, 6 figures, Burns and Hendricks contributed equally

  35. arXiv:1803.08006  [pdf, other]

    cs.CV

    Video Object Segmentation with Language Referring Expressions

    Authors: Anna Khoreva, Anna Rohrbach, Bernt Schiele

    Abstract: Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical…

    Submitted 5 February, 2019; v1 submitted 21 March, 2018; originally announced March 2018.

    Comments: ACCV 2018: 14th Asian Conference on Computer Vision

  36. arXiv:1802.08129  [pdf, other]

    cs.AI cs.CL cs.CV

    Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

    Authors: Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

    Abstract: Deep models that are both effective and explainable are desirable in many settings; prior explainable models have been unimodal, offering either image-based visualization of attention weights or text-based generation of post-hoc justifications. We propose a multimodal approach to explanation, and argue that the two modalities provide complementary explanatory strengths. We collect two new datasets…

    Submitted 15 February, 2018; originally announced February 2018.

    Comments: arXiv admin note: text overlap with arXiv:1612.04757

  37. arXiv:1711.07373  [pdf, other]

    cs.CV

    Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)

    Authors: Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

    Abstract: Deep models are the de facto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotati…

    Submitted 17 November, 2017; originally announced November 2017.

    Comments: arXiv admin note: text overlap with arXiv:1612.04757

  38. arXiv:1710.05958  [pdf, other]

    cs.LG cs.AI cs.CV

    Gradient-free Policy Architecture Search and Adaptation

    Authors: Sayna Ebrahimi, Anna Rohrbach, Trevor Darrell

    Abstract: We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks. By learning from both demonstration and environmental reward we develop a model that can learn with relatively few early catastrophic failures. We first learn an architecture of appropriate complexity to perceive aspects of world state relevant to th…

    Submitted 16 October, 2017; originally announced October 2017.

    Comments: Accepted in Conference on Robot Learning, 2017

  39. arXiv:1709.08693  [pdf, other]

    cs.AI

    Fooling Vision and Language Models Despite Localization and Attention Mechanism

    Authors: Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darrell, Dawn Song

    Abstract: Adversarial attacks are known to succeed on classifiers, but it has been an open question whether more complex vision systems are vulnerable. In this paper, we study adversarial examples for vision and language models, which incorporate natural language understanding and complex structures such as attention, localization, and modular architectures. In particular, we investigate attacks on a dense…

    Submitted 5 April, 2018; v1 submitted 25 September, 2017; originally announced September 2017.

    Comments: CVPR 2018

  40. arXiv:1704.01518  [pdf, other]

    cs.CV

    Generating Descriptions with Grounded and Co-Referenced People

    Authors: Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele

    Abstract: Learning how to generate descriptions of images or videos received major interest both in the Computer Vision and Natural Language Processing communities. While a few works have proposed to learn a grounding during the generation process in an unsupervised way (via an attention mechanism), it remains unclear how good the quality of the grounding is and whether it benefits the description quality.…

    Submitted 5 April, 2017; originally announced April 2017.

    Comments: Accepted to CVPR 2017

  41. arXiv:1611.07810  [pdf, other]

    cs.CV

    A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering

    Authors: Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, Christopher Pal

    Abstract: While deep convolutional neural networks frequently approach or exceed human-level performance at benchmark tasks involving static images, extending this success to moving images is not straightforward. Having models which can learn to understand video is of interest for many applications, including content recommendation, prediction, summarization, event/object detection and understanding human v…

    Submitted 5 February, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

  42. arXiv:1606.01847  [pdf, other]

    cs.CV cs.AI cs.CL

    Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

    Authors: Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach

    Abstract: Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual repr…

    Submitted 23 September, 2016; v1 submitted 6 June, 2016; originally announced June 2016.

    Comments: Accepted to EMNLP 2016

  43. arXiv:1605.03705  [pdf, other]

    cs.CV cs.CL

    Movie Description

    Authors: Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele

    Abstract: Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full…

    Submitted 12 May, 2016; originally announced May 2016.

  44. Grounding of Textual Phrases in Images by Reconstruction

    Authors: Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

    Abstract: Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth spatial localization of phrases, thus it is desirable to learn from data with no or little grounding supervision. We propose a novel approach which learns groundin…

    Submitted 17 February, 2017; v1 submitted 11 November, 2015; originally announced November 2015.

    Comments: published at ECCV 2016 (oral); updated to final version

  45. arXiv:1506.01698  [pdf, other]

    cs.CV cs.CL

    The Long-Short Story of Movie Description

    Authors: Anna Rohrbach, Marcus Rohrbach, Bernt Schiele

    Abstract: Generating descriptions for videos has many applications including assisting blind people and human-robot interaction. The recent advances in image captioning as well as the release of large-scale movie description datasets such as MPII Movie Description allow to study this task in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long-Sho…

    Submitted 4 June, 2015; originally announced June 2015.

  46. Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data

    Authors: Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele

    Abstract: Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work we approach both tasks and present a dataset which provides detailed annotations to address them. The first challenge is to detect fine-grained activities, which…

    Submitted 15 October, 2015; v1 submitted 23 February, 2015; originally announced February 2015.

    Comments: in International Journal of Computer Vision (IJCV) 2015

  47. arXiv:1501.02530  [pdf, other]

    cs.CV cs.CL cs.IR

    A Dataset for Movie Description

    Authors: Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele

    Abstract: Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned…

    Submitted 11 January, 2015; originally announced January 2015.