Showing 1–50 of 52 results for author: Fragkiadaki, K

  1. arXiv:2408.04380  [pdf, other]

    cs.RO cs.LG

    Deep Generative Models in Robotics: A Survey on Learning from Multimodal Demonstrations

    Authors: Julen Urain, Ajay Mandlekar, Yilun Du, Mahi Shafiullah, Danfei Xu, Katerina Fragkiadaki, Georgia Chalvatzaki, Jan Peters

    Abstract: Learning from Demonstrations, the field that proposes to learn robot behavior models from data, is gaining popularity with the emergence of deep generative models. Although the problem has been studied for years under names such as Imitation Learning, Behavioral Cloning, or Inverse Reinforcement Learning, classical methods have relied on models that don't capture complex data distributions well or…

    Submitted 21 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

    Comments: 20 pages, 11 figures, submitted to TRO

  2. arXiv:2407.08737  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Video Diffusion Alignment via Reward Gradients

    Authors: Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, Deepak Pathak

    Abstract: We have made significant progress towards building foundational video diffusion models. As these models are trained using large-scale unsupervised data, it has become crucial to adapt these models to specific downstream tasks. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we utilize pre-trained reward…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: Project Webpage: https://vader-vid.github.io; Code available at: https://github.com/mihirp1998/VADER

  3. arXiv:2406.14596  [pdf, other]

    cs.CV cs.AI cs.LG

    ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

    Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki

    Abstract: Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Ab…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Project website: http://ical-learning.github.io/

  4. arXiv:2405.02280  [pdf, other]

    cs.CV

    DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

    Authors: Wen-Hsuan Chu, Lei Ke, Katerina Fragkiadaki

    Abstract: View-predictive generative models provide strong priors for lifting object-centric images and videos into 3D and 4D through rendering and score distillation objectives. A question then remains: what about lifting complete multi-object dynamic scenes? There are two challenges in this direction: First, rendering error gradients are often insufficient to recover fast object motion, and second, view p…

    Submitted 23 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

    Comments: Project page: https://dreamscene4d.github.io/

  5. arXiv:2404.19065  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models

    Authors: Gabriel Sarch, Sahil Somani, Raghav Kapoor, Michael J. Tarr, Katerina Fragkiadaki

    Abstract: Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners, a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt to improve the performance of the LLM in inferring the correct action and task plans. In this technical report, we extend the capabilities of HELP…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Videos and code: https://helper-agent-llm.github.io/

  6. arXiv:2403.07232  [pdf, other]

    cs.RO cs.LG

    Tractable Joint Prediction and Planning over Discrete Behavior Modes for Urban Driving

    Authors: Adam Villaflor, Brian Yang, Huangyuan Su, Katerina Fragkiadaki, John Dolan, Jeff Schneider

    Abstract: Significant progress has been made in training multimodal trajectory forecasting models for autonomous driving. However, effectively integrating these models with downstream planners and model-based control approaches is still an open problem. Although these models have conventionally been evaluated for open-loop prediction, we show that they can be used to parameterize autoregressive closed-loop…

    Submitted 11 March, 2024; originally announced March 2024.

  7. arXiv:2402.10885  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

    Authors: Tsung-Wei Ke, Nikolaos Gkanatsios, Katerina Fragkiadaki

    Abstract: Diffusion policies are conditional diffusion models that learn robot action distributions conditioned on the robot and environment state. They have recently been shown to outperform both deterministic and alternative action distribution learning formulations. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have been shown to g…

    Submitted 25 July, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: First two authors contributed equally

  8. arXiv:2402.06559  [pdf, other]

    cs.LG cs.AI cs.CL cs.RO

    Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous Driving and Zero-Shot Instruction Following

    Authors: Brian Yang, Huangyuan Su, Nikolaos Gkanatsios, Tsung-Wei Ke, Ayush Jain, Jeff Schneider, Katerina Fragkiadaki

    Abstract: Diffusion models excel at modeling complex and multimodal trajectory distributions for decision-making and control. Reward-gradient guided denoising has been recently proposed to generate trajectories that maximize both a differentiable reward function and the likelihood under the data distribution captured by a diffusion model. Reward-gradient guided denoising requires a differentiable reward fun…

    Submitted 16 July, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

  9. arXiv:2401.02416  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    ODIN: A Single Model for 2D and 3D Segmentation

    Authors: Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki

    Abstract: State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that…

    Submitted 25 June, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Camera Ready (CVPR 2024, Highlight)

  10. arXiv:2311.16102  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Diffusion-TTA: Test-time Adaptation of Discriminative Models via Generative Feedback

    Authors: Mihir Prabhudesai, Tsung-Wei Ke, Alexander C. Li, Deepak Pathak, Katerina Fragkiadaki

    Abstract: The advancements in generative modeling, particularly the advent of diffusion models, have sparked a fundamental question: how can these models be effectively used for discriminative tasks? In this work, we find that generative models can be great test-time adapters for discriminative models. Our method, Diffusion-TTA, adapts pre-trained discriminative models such as image classifiers, segmenters…

    Submitted 29 November, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted at NeurIPS 2023. Webpage with Code: https://diffusion-tta.github.io/

  11. arXiv:2311.01455  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation

    Authors: Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, Chuang Gan

    Abstract: We present RoboGen, a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation. RoboGen leverages the latest advancements in foundation and generative models. Instead of directly using or adapting these models to produce policies or low-level actions, we advocate for a generative scheme, which uses these models to automatically generate diversifi…

    Submitted 14 June, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: ICML 2024

  12. arXiv:2310.18308  [pdf, other]

    cs.RO cs.AI cs.LG

    Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models

    Authors: Pushkal Katara, Zhou Xian, Katerina Fragkiadaki

    Abstract: Generalist robot manipulators need to learn a wide variety of manipulation skills across diverse environments. Current robot training pipelines rely on humans to provide kinesthetic demonstrations or to program simulation environments and to code up reward functions for reinforcement learning. Such human involvement is an important bottleneck towards scaling up robot learning across diverse tasks…

    Submitted 27 October, 2023; originally announced October 2023.

  13. arXiv:2310.15127  [pdf, other]

    cs.AI cs.CL cs.LG cs.RO

    Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models

    Authors: Gabriel Sarch, Yue Wu, Michael J. Tarr, Katerina Fragkiadaki

    Abstract: Pre-trained and frozen large language models (LLMs) can effectively map simple scene rearrangement instructions to programs over a robot's visuomotor functions through appropriate few-shot example prompting. To parse open-domain natural language and adapt to a user's idiosyncratic procedures, not known during prompt engineering time, fixed prompts fall short. In this paper, we introduce HELPER, an…

    Submitted 20 November, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Project page with code & videos: https://helper-agent-llm.github.io

  14. arXiv:2310.06992  [pdf, other]

    cs.CV

    Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

    Authors: Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas Guibas, Katerina Fragkiadaki

    Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained…

    Submitted 25 January, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: Project page available at https://wenhsuanchu.github.io/ovtracktor/

  15. arXiv:2310.03739  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Aligning Text-to-Image Diffusion Models with Reward Backpropagation

    Authors: Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki

    Abstract: Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Due to their unsupervised training, controlling their behavior in downstream tasks, such as maximizing human-perceived image quality, image-text alignment, or ethical image generation, is difficult. Recent works fi…

    Submitted 22 June, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: Code available at https://align-prop.github.io/

  16. arXiv:2306.17817  [pdf, other]

    cs.RO cs.AI cs.LG

    Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation

    Authors: Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, Katerina Fragkiadaki

    Abstract: 3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, foregoing…

    Submitted 19 October, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

  17. arXiv:2304.14391  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement

    Authors: Nikolaos Gkanatsios, Ayush Jain, Zhou Xian, Yunchu Zhang, Christopher Atkeson, Katerina Fragkiadaki

    Abstract: Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energ…

    Submitted 23 January, 2024; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: First two authors contributed equally | RSS 2023

  18. arXiv:2304.14382  [pdf, other]

    cs.CV cs.AI cs.LG

    Analogy-Forming Transformers for Few-Shot 3D Parsing

    Authors: Nikolaos Gkanatsios, Mayank Singh, Zhaoyuan Fang, Shubham Tulsiani, Katerina Fragkiadaki

    Abstract: We present Analogical Networks, a model that encodes domain knowledge explicitly, in a collection of structured labelled 3D scenes, in addition to implicitly, as model parameters, and segments 3D object scenes with analogical reasoning: instead of mapping a scene to part segments directly, our model first retrieves related scenes from memory and their corresponding part structures, and then predic…

    Submitted 30 May, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: ICLR 2023

  19. arXiv:2303.02346  [pdf, other]

    cs.RO cs.AI cs.LG

    FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation

    Authors: Zhou Xian, Bo Zhu, Zhenjia Xu, Hsiao-Yu Tung, Antonio Torralba, Katerina Fragkiadaki, Chuang Gan

    Abstract: Humans manipulate various kinds of fluids in their everyday life: creating latte art, scooping floating objects from water, rolling an ice cream cone, etc. Using robots to augment or replace human labor in these daily settings remains a challenging task due to the multifaceted complexities of fluids. Previous research in robotic fluid manipulation mostly considers fluids governed by an ideal, Ne…

    Submitted 4 March, 2023; originally announced March 2023.

  20. arXiv:2210.15751  [pdf, other]

    cs.RO cs.AI

    Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation

    Authors: Xingyu Lin, Carl Qi, Yunchu Zhang, Zhiao Huang, Katerina Fragkiadaki, Yunzhu Li, Chuang Gan, David Held

    Abstract: Effective planning of long-horizon deformable object manipulation requires suitable abstractions at both the spatial and temporal levels. Previous methods typically either focus on short-horizon tasks or make strong assumptions that full-state information is available, which prevents their use on deformable objects. In this paper, we propose PlAnning with Spatial-Temporal Abstraction (PASTA), whic…

    Submitted 23 June, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Published at the Conference on Robot Learning (CoRL 2022)

  21. arXiv:2207.10761  [pdf, other]

    cs.CV

    TIDEE: Tidying Up Novel Rooms using Visuo-Semantic Commonsense Priors

    Authors: Gabriel Sarch, Zhaoyuan Fang, Adam W. Harley, Paul Schydlo, Michael J. Tarr, Saurabh Gupta, Katerina Fragkiadaki

    Abstract: We introduce TIDEE, an embodied agent that tidies up a disordered scene based on learned commonsense object placement and room arrangement priors. TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects. Commonsense priors are encoded in three modules…

    Submitted 21 July, 2022; originally announced July 2022.

  22. arXiv:2206.07959  [pdf, other]

    cs.CV

    Simple-BEV: What Really Matters for Multi-Sensor BEV Perception?

    Authors: Adam W. Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, Katerina Fragkiadaki

    Abstract: Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR systems compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV)…

    Submitted 29 September, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

  23. arXiv:2204.04153  [pdf, other]

    cs.CV

    Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories

    Authors: Adam W. Harley, Zhaoyuan Fang, Katerina Fragkiadaki

    Abstract: Tracking pixels in videos is typically studied as an optical flow estimation problem, where every pixel is described with a displacement vector that locates it in the next frame. Even though wider temporal context is freely available, prior efforts to take this into account have yielded only small gains over 2-frame methods. In this paper, we revisit Sand and Teller's "particle video" approach, an…

    Submitted 25 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

  24. arXiv:2203.11194  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Test-time Adaptation with Slot-Centric Models

    Authors: Mihir Prabhudesai, Anirudh Goyal, Sujoy Paul, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gaurav Aggarwal, Thomas Kipf, Deepak Pathak, Katerina Fragkiadaki

    Abstract: Current visual detectors, though impressive within their training distribution, often fail to parse out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the ta…

    Submitted 27 June, 2023; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Accepted at ICML 2023. Project website at https://slot-tta.github.io/

  25. arXiv:2112.08879  [pdf, other]

    cs.CV cs.CL

    Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds

    Authors: Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki

    Abstract: Most models tasked to ground referential utterances in 2D and 3D scenes learn to select the referred object from a pool of object proposals provided by a pre-trained detector. This is limiting because an utterance may refer to visual entities at various levels of granularity, such as the chair, the leg of the chair, or the tip of the front leg of the chair, which may be missed by the detector. We…

    Submitted 21 July, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: First two authors contributed equally | ECCV 2022 Camera Ready

  26. arXiv:2104.03851  [pdf, other]

    cs.CV

    CoCoNets: Continuous Contrastive 3D Scene Representations

    Authors: Shamit Lal, Mihir Prabhudesai, Ishita Mediratta, Adam W. Harley, Katerina Fragkiadaki

    Abstract: This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos, agnostic to object and scene semantic content, and evaluates the resulting scene representations in the downstream tasks of visual correspondence, object tracking, and object detection. The model infers a latent 3D representation of the scene in the form of 3D feature points…

    Submitted 8 April, 2021; originally announced April 2021.

  27. arXiv:2104.03424  [pdf, other]

    cs.CV

    Track, Check, Repeat: An EM Approach to Unsupervised Tracking

    Authors: Adam W. Harley, Yiming Zuo, Jing Wen, Ayush Mangal, Shubhankar Potdar, Ritwick Chaudhry, Katerina Fragkiadaki

    Abstract: We propose an unsupervised method for detecting and tracking moving objects in 3D, in unlabelled RGB-D videos. The method begins with classic handcrafted techniques for segmenting objects using motion cues: we estimate optical flow and camera motion, and conservatively segment regions that appear to be moving independently of the background. Treating these initial segments as pseudo-labels, we lea…

    Submitted 7 April, 2021; originally announced April 2021.

  28. arXiv:2103.09439  [pdf, other]

    cs.RO cs.AI cs.LG

    HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks

    Authors: Zhou Xian, Shamit Lal, Hsiao-Yu Tung, Emmanouil Antonios Platanios, Katerina Fragkiadaki

    Abstract: We propose HyperDynamics, a dynamics meta-learning framework that conditions on an agent's interactions with the environment and optionally its visual observations, and generates the parameters of neural dynamics models based on inferred properties of the dynamical system. Physical and visual properties of the environment that are not part of the low-dimensional state yet affect its temporal dynam…

    Submitted 17 March, 2021; originally announced March 2021.

  29. arXiv:2012.00057  [pdf, other]

    cs.CV cs.AI cs.LG

    Move to See Better: Self-Improving Embodied Object Detection

    Authors: Zhaoyuan Fang, Ayush Jain, Gabriel Sarch, Adam W. Harley, Katerina Fragkiadaki

    Abstract: Passive methods for object detection and segmentation treat images of the same scene as individual samples and do not exploit object permanence across multiple views. Generalization to novel or difficult viewpoints thus requires additional training with lots of annotations. In contrast, humans often recognize objects by simply moving around, to get more informative viewpoints. In this paper, we pr…

    Submitted 29 March, 2021; v1 submitted 30 November, 2020; originally announced December 2020.

    Comments: First three authors contributed equally. Project Page: https://ayushjain1144.github.io/SeeingByMoving/

  30. arXiv:2011.06464  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators

    Authors: Hsiao-Yu Fish Tung, Zhou Xian, Mihir Prabhudesai, Shamit Lal, Katerina Fragkiadaki

    Abstract: We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space, inferred from RGB-D videos. In this 3D feature space, objects do not interfere with one another and their appearance persists over time and across viewpoints. This permits our model to predict future scenes long in the fu…

    Submitted 12 November, 2020; originally announced November 2020.

  31. arXiv:2011.03367  [pdf, other]

    cs.CV

    Disentangling 3D Prototypical Networks For Few-Shot Concept Learning

    Authors: Mihir Prabhudesai, Shamit Lal, Darshan Patil, Hsiao-Yu Tung, Adam W Harley, Katerina Fragkiadaki

    Abstract: We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification. Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay. They are trained end-to…

    Submitted 20 July, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

  32. arXiv:2010.16279  [pdf, other]

    cs.CV

    3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations

    Authors: Mihir Prabhudesai, Shamit Lal, Hsiao-Yu Fish Tung, Adam W. Harley, Shubhankar Potdar, Katerina Fragkiadaki

    Abstract: We propose a system that learns to detect objects and infer their 3D poses in RGB-D images. Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations. The challenge here is to achieve this without relying on strong supervision signals. To address this challenge, we propose a model that maps RGB-D images to a set of 3D visual feature map…

    Submitted 30 October, 2020; originally announced October 2020.

  33. arXiv:2008.01295  [pdf, other]

    cs.CV

    Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping

    Authors: Adam W. Harley, Shrinidhi K. Lakshmikanth, Paul Schydlo, Katerina Fragkiadaki

    Abstract: We hypothesize that an agent that can look around in static scenes can learn rich visual representations applicable to 3D object tracking in complex dynamic scenes. We are motivated in this pursuit by the fact that the physical world itself is mostly static, and multiview correspondence labels are relatively cheap to collect in static scenes, e.g., by triangulation. We propose to leverage multivie…

    Submitted 3 August, 2020; originally announced August 2020.

  34. arXiv:2005.04551  [pdf, other]

    cs.CV

    Epipolar Transformers

    Authors: Yihui He, Rui Yan, Katerina Fragkiadaki, Shoou-I Yu

    Abstract: A common approach to localize 3D human joints in a synchronized and calibrated multi-view setup consists of two steps: (1) apply a 2D detector separately on each view to localize joints in 2D, and (2) perform robust triangulation on 2D detections from each view to acquire the 3D joint locations. However, in step 1, the 2D detector is limited to solving challenging cases which could potentially be…

    Submitted 9 May, 2020; originally announced May 2020.

    Comments: CVPR 2020

  35. arXiv:1910.01210  [pdf, other]

    cs.CV cs.LG cs.RO

    Embodied Language Grounding with 3D Visual Feature Representations

    Authors: Mihir Prabhudesai, Hsiao-Yu Fish Tung, Syed Ashar Javed, Maximilian Sieb, Adam W. Harley, Katerina Fragkiadaki

    Abstract: We propose associating language utterances to 3D visual abstractions of the scene they describe. The 3D visual abstractions are encoded as 3-dimensional visual feature maps. We infer these 3D visual scene feature maps from RGB images of the scene via view prediction: when the generated 3D scene feature map is neurally projected from a camera viewpoint, it should match the corresponding RGB image.…

    Submitted 17 June, 2021; v1 submitted 2 October, 2019; originally announced October 2019.

    Journal ref: Conference on Computer Vision and Pattern Recognition. 2020, pp. 2220-2229

  36. arXiv:1907.05518  [pdf, other]

    cs.RO cs.AI cs.CV

    Graph-Structured Visual Imitation

    Authors: Maximilian Sieb, Zhou Xian, Audrey Huang, Oliver Kroemer, Katerina Fragkiadaki

    Abstract: We cast visual imitation as a visual correspondence problem. Our robotic agent is rewarded when its actions result in better matching of relative spatial configurations for corresponding visual entities detected in its workspace and teacher's demonstration. We build upon recent advances in Computer Vision, such as human finger keypoint detectors, object detectors trained on-the-fly with synthetic a…

    Submitted 4 March, 2020; v1 submitted 11 July, 2019; originally announced July 2019.

    Comments: 8 pages, 3 figures, 1 table

  37. arXiv:1906.03764  [pdf, other]

    cs.CV

    Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping

    Authors: Adam W. Harley, Shrinidhi K. Lakshmikanth, Fangyu Li, Xian Zhou, Hsiao-Yu Fish Tung, Katerina Fragkiadaki

    Abstract: Predictive coding theories suggest that the brain learns by predicting observations at various levels of abstraction. One of the most basic prediction tasks is view prediction: how would a given scene look from an alternative viewpoint? Humans excel at this task. Our ability to imagine and fill in missing information is tightly coupled with perception: we feel as if we see the world in 3 dimension…

    Submitted 16 May, 2020; v1 submitted 9 June, 2019; originally announced June 2019.

  38. arXiv:1901.03628  [pdf, other]

    cs.CV

    Image Disentanglement and Uncooperative Re-Entanglement for High-Fidelity Image-to-Image Translation

    Authors: Adam W. Harley, Shih-En Wei, Jason Saragih, Katerina Fragkiadaki

    Abstract: Cross-domain image-to-image translation should satisfy two requirements: (1) preserve the information that is common to both domains, and (2) generate convincing images covering variations that appear in the target domain. This is challenging, especially when there are no example translations available as supervision. Adversarial cycle consistency was recently proposed as a solution, with beautifu…

    Submitted 19 October, 2019; v1 submitted 11 January, 2019; originally announced January 2019.

  39. arXiv:1901.00003  [pdf, other]

    cs.CV

    Learning Spatial Common Sense with Geometry-Aware Recurrent Networks

    Authors: Hsiao-Yu Fish Tung, Ricson Cheng, Katerina Fragkiadaki

    Abstract: We integrate two powerful ideas, geometry and deep visual representation learning, into recurrent network architectures for mobile visual scene understanding. The proposed networks learn to "lift" and integrate 2D visual features over time into latent 3D feature maps of the scene. They are equipped with differentiable geometric operations, such as projection, unprojection, egomotion estimation and…

    Submitted 8 April, 2019; v1 submitted 31 December, 2018; originally announced January 2019.

  40. arXiv:1811.08086  [pdf, other]

    cs.RO cs.AI cs.LG

    Model Learning for Look-ahead Exploration in Continuous Control

    Authors: Arpit Agarwal, Katharina Muelling, Katerina Fragkiadaki

    Abstract: We propose an exploration method that incorporates look-ahead search over basic learnt skills and their dynamics, and use it for reinforcement learning (RL) of manipulation policies. Our skills are multi-goal policies learned in isolation in simpler environments using existing multigoal RL formulations, analogous to options or macroactions. Coarse skill dynamics, i.e., the state transition caused…

    Submitted 20 November, 2018; originally announced November 2018.

    Comments: This is a pre-print of our paper, which has been accepted at AAAI 2018

  41. arXiv:1811.08067  [pdf, other]

    cs.RO cs.CV cs.LG

    Reinforcement Learning of Active Vision for Manipulating Objects under Occlusions

    Authors: Ricson Cheng, Arpit Agarwal, Katerina Fragkiadaki

    Abstract: We consider artificial agents that learn to jointly control their gripper and camera in order to reinforcement learn manipulation policies in the presence of occlusions from distractor objects. Distractors often occlude the object of interest and cause it to disappear from the field of view. We propose hand/eye controllers that learn to move the camera to keep the object within the field of view an…

    Submitted 16 February, 2019; v1 submitted 19 November, 2018; originally announced November 2018.

    Comments: The paper was presented at the Conference on Robot Learning 2018

    Journal ref: Proceedings of Machine Learning Research 87 (2018) 422--431

  42. arXiv:1811.01292  [pdf, other

    cs.CV

    Geometry-Aware Recurrent Neural Networks for Active Visual Recognition

    Authors: Ricson Cheng, Ziyan Wang, Katerina Fragkiadaki

    Abstract: We present recurrent geometry-aware neural networks that integrate visual information across multiple views of a scene into 3D latent feature tensors, while maintaining a one-to-one mapping between 3D physical locations in the world scene and latent feature locations. Object detection, object segmentation, and 3D reconstruction are then carried out directly using the constructed 3D feature memory,… ▽ More

    Submitted 13 November, 2018; v1 submitted 3 November, 2018; originally announced November 2018.

    Comments: To appear in NIPS2018

  43. arXiv:1804.10692  [pdf, other

    cs.CV cs.RO

    Reward Learning from Narrated Demonstrations

    Authors: Hsiao-Yu Fish Tung, Adam W. Harley, Liang-Kang Huang, Katerina Fragkiadaki

    Abstract: Humans effortlessly "program" one another by communicating goals and desires in natural language. In contrast, humans program robotic behaviours by indicating desired object locations and poses to be achieved, by providing RGB images of goal configurations, or supplying a demonstration to be imitated. None of these methods generalize across environment variations, and they convey the goal in awkwa… ▽ More

    Submitted 27 April, 2018; originally announced April 2018.

    Comments: The work has been accepted to Conference on Computer Vision and Pattern Recognition (CVPR) 2018

  44. arXiv:1801.00508  [pdf, other

    cs.CV

    Depth-Adaptive Computational Policies for Efficient Visual Tracking

    Authors: Chris Ying, Katerina Fragkiadaki

    Abstract: Current convolutional neural network algorithms for video object tracking spend the same amount of computation for each object and video frame. However, it is harder to track an object in some frames than others, due to the varying amount of clutter, scene complexity, amount of motion, and object's distinctiveness against its background. We propose a depth-adaptive convolutional Siamese network t… ▽ More

    Submitted 1 January, 2018; originally announced January 2018.

    Comments: presented at EMMCVPR 2017 in Venice, Italy

  45. arXiv:1712.01337  [pdf, other

    cs.CV

    Self-supervised Learning of Motion Capture

    Authors: Hsiao-Yu Fish Tung, Hsiao-Wei Tung, Ersin Yumer, Katerina Fragkiadaki

    Abstract: Current state-of-the-art solutions for motion capture from a single camera are optimization driven: they optimize the parameters of a 3D human model so that its re-projection matches measurements in the video (e.g. person segmentation, optical flow, keypoint detections etc.). Optimization models are susceptible to local minima. This has been the bottleneck that forced using clean green-screen like… ▽ More

    Submitted 4 December, 2017; originally announced December 2017.

    Comments: Neural Information Processing Systems (NIPS) 2017

  46. arXiv:1705.11166  [pdf, other

    cs.CV

    Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision

    Authors: Hsiao-Yu Fish Tung, Adam W. Harley, William Seto, Katerina Fragkiadaki

    Abstract: Researchers have developed excellent feed-forward models that learn to map images to desired outputs, such as to the images' latent factors, or to other images, using supervised learning. Learning such mappings from unlabelled data, or improving upon supervised models by exploiting unlabelled data, remains elusive. We argue that there are two important parts to learning without annotations: (i) ma… ▽ More

    Submitted 1 September, 2017; v1 submitted 31 May, 2017; originally announced May 2017.

    Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4354-4362

  47. arXiv:1705.02082  [pdf, other

    cs.CV

    Motion Prediction Under Multimodality with Conditional Stochastic Networks

    Authors: Katerina Fragkiadaki, Jonathan Huang, Alex Alemi, Sudheendra Vijayanarasimhan, Susanna Ricco, Rahul Sukthankar

    Abstract: Given a visual history, multiple future outcomes for a video scene are equally probable, in other words, the distribution of future outcomes has multiple modes. Multimodality is notoriously hard to handle by standard regressors or classifiers: the former regress to the mean and the latter discretize a continuous high dimensional output space. In this work, we present stochastic neural network arch… ▽ More

    Submitted 5 May, 2017; originally announced May 2017.

  48. arXiv:1704.07804  [pdf, other

    cs.CV

    SfM-Net: Learning of Structure and Motion from Video

    Authors: Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki

    Abstract: We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), different… ▽ More

    Submitted 25 April, 2017; originally announced April 2017.

  49. arXiv:1511.07404  [pdf, other

    cs.CV

    Learning Visual Predictive Models of Physics for Playing Billiards

    Authors: Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, Jitendra Malik

    Abstract: The ability to plan and execute goal specific actions in varied, unexpected settings is a central requirement of intelligent agents. In this paper, we explore how an agent can be equipped with an internal model of the dynamics of the external world, and how it can use this model to plan novel actions by running multiple internal simulations ("visual imagination"). Our models directly process raw v… ▽ More

    Submitted 19 January, 2016; v1 submitted 23 November, 2015; originally announced November 2015.

  50. arXiv:1508.00271  [pdf, other

    cs.CV

    Recurrent Network Models for Human Dynamics

    Authors: Katerina Fragkiadaki, Sergey Levine, Panna Felsen, Jitendra Malik

    Abstract: We propose the Encoder-Recurrent-Decoder (ERD) model for recognition and prediction of human body pose in videos and motion capture. The ERD model is a recurrent neural network that incorporates nonlinear encoder and decoder networks before and after recurrent layers. We test instantiations of ERD architectures in the tasks of motion capture (mocap) generation, body pose labeling and body pose for… ▽ More

    Submitted 28 September, 2015; v1 submitted 2 August, 2015; originally announced August 2015.

    Comments: International Conference on Computer Vision 2015