Showing 1–50 of 429 results for author: Wang, A

Searching in archive cs.
  1. arXiv:2408.14968  [pdf, other]

    cs.IR cs.CL

    MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce

    Authors: Hao Jiang, Haoxiang Zhang, Qingshan Hou, Chaofeng Chen, Weisi Lin, Jingchang Zhang, Annan Wang

    Abstract: Providing high-quality item recall for text queries is crucial in large-scale e-commerce search systems. Current Embedding-based Retrieval Systems (ERS) embed queries and items into a shared low-dimensional space, but uni-modality ERS rely too heavily on textual features, making them unreliable in complex contexts. While multi-modality ERS incorporate various data sources, they often overlook indi…

    Submitted 27 August, 2024; originally announced August 2024.

  2. arXiv:2408.14527  [pdf, other]

    cs.RO cs.AI cs.MA

    Multi-Agent Path Finding with Real Robot Dynamics and Interdependent Tasks for Automated Warehouses

    Authors: Vassilissa Lehoux-Lebacque, Tomi Silander, Christelle Loiodice, Seungjoon Lee, Albert Wang, Sofia Michel

    Abstract: Multi-Agent Path Finding (MAPF) is an important optimization problem underlying the deployment of robots in automated warehouses and factories. Despite the large body of work on this topic, most approaches make heavy simplifications, both on the environment and the agents, which make the resulting algorithms impractical for real-life scenarios. In this paper, we consider a realistic problem of onl…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: Accepted to ECAI-2024. For related videos, see https://europe.naverlabs.com/research/publications/MAPF_IPP

  3. arXiv:2408.12928  [pdf, other]

    cs.CV

    ParGo: Bridging Vision-Language with Partial and Global Views

    Authors: An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, Wei-Shi Zheng

    Abstract: This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, our ParGo bridges the representation gap between the separately pre-trained vision encoders and the LLMs by integrating global and partial views, which alleviates the ove…

    Submitted 23 August, 2024; originally announced August 2024.

  4. arXiv:2408.12817  [pdf, other]

    cs.LG physics.chem-ph

    Data-Driven Parametrization of Molecular Mechanics Force Fields for Expansive Chemical Space Coverage

    Authors: Tianze Zheng, Ailun Wang, Xu Han, Yu Xia, Xingyuan Xu, Jiawei Zhan, Yu Liu, Yang Chen, Zhi Wang, Xiaojie Wu, Sheng Gong, Wen Yan

    Abstract: A force field is a critical component in molecular dynamics simulations for computational drug discovery. It must achieve high accuracy within the constraints of molecular mechanics' (MM) limited functional forms, which offers high computational efficiency. With the rapid expansion of synthetically accessible chemical space, traditional look-up table approaches face significant challenges. In this…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: ByteFF, a machine learning parametrized MMFF

  5. arXiv:2408.11208  [pdf, other]

    cs.CV cs.LG

    PooDLe: Pooled and dense self-supervised learning from naturalistic videos

    Authors: Alex N. Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, Mengye Ren

    Abstract: Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-b…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Project page: https://poodle-ssl.github.io

  6. arXiv:2408.10681  [pdf, other]

    cs.CL cs.LG

    HMoE: Heterogeneous Mixture of Experts for Language Modeling

    Authors: An Wang, Xingwu Sun, Ruobing Xie, Shuaipeng Li, Jiaqi Zhu, Zhen Yang, Pinxue Zhao, J. N. Han, Zhanhui Kang, Di Wang, Naoaki Okazaki, Cheng-zhong Xu

    Abstract: Mixture of Experts (MoE) offers remarkable performance and computational efficiency by selectively activating subsets of model parameters. Traditionally, MoE models use homogeneous experts, each with identical capacity. However, varying complexity in input data necessitates experts with diverse capabilities, while homogeneous MoE hinders effective expert specialization and efficient parameter util…

    Submitted 20 August, 2024; originally announced August 2024.

  7. arXiv:2408.09320  [pdf, other]

    cs.HC cs.SD eess.AS

    Auptimize: Optimal Placement of Spatial Audio Cues for Extended Reality

    Authors: Hyunsung Cho, Alexander Wang, Divya Kartik, Emily Liying Xie, Yukang Yan, David Lindlbauer

    Abstract: Spatial audio in Extended Reality (XR) provides users with better awareness of where virtual elements are placed, and efficiently guides them to events such as notifications, system alerts from different windows, or approaching avatars. Humans, however, are inaccurate in localizing sound cues, especially with multiple sources due to limitations in human auditory perception such as angular discrimi…

    Submitted 17 August, 2024; originally announced August 2024.

    Comments: UIST 2024

    ACM Class: H.5.1; H.5.2; H.5.5

  8. arXiv:2408.05517  [pdf, other]

    cs.CL

    SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning

    Authors: Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, Yingda Chen

    Abstract: Recent developments in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have leveraged Attention-based Transformer architectures and achieved superior performance and generalization capabilities. They have since covered extensive areas of traditional learning tasks. For instance, text-based tasks such as text-classification and sequence-labeling, as well as multi-modal task…

    Submitted 18 August, 2024; v1 submitted 10 August, 2024; originally announced August 2024.

  9. arXiv:2408.04958  [pdf, other]

    cs.CV cs.RO

    Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery

    Authors: Long Bai, Guankun Wang, Mobarakol Islam, Lalithkumar Seenivasan, An Wang, Hongliang Ren

    Abstract: Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making, enabling doctors to extract understanding from clinical images and videos. In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions. However, the inability of VQA models to visually indicat…

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: Accepted by Information Fusion. Code and data availability: https://github.com/longbai1006/Surgical-VQLAPlus

  10. arXiv:2408.04593  [pdf, other]

    cs.CV cs.RO eess.IV

    SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation

    Authors: Jieming Yu, An Wang, Wenzhen Dong, Mengya Xu, Mobarakol Islam, Jie Wang, Long Bai, Hongliang Ren

    Abstract: The recent Segment Anything Model (SAM) 2 has demonstrated remarkable foundational competence in semantic segmentation, with its memory mechanism and mask decoder further addressing challenges in video tracking and object occlusion, thereby achieving superior results in interactive segmentation for both images and videos. Building upon our previous empirical studies, we further explore the zero-sh…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Empirical study. Previous work "SAM Meets Robotic Surgery" is accessible at: arXiv:2308.07156

  11. arXiv:2408.04426  [pdf, other]

    cs.CV cs.RO

    A Review of 3D Reconstruction Techniques for Deformable Tissues in Robotic Surgery

    Authors: Mengya Xu, Ziqi Guo, An Wang, Long Bai, Hongliang Ren

    Abstract: As a crucial and intricate task in robotic minimally invasive surgery, reconstructing surgical scenes using stereo or monocular endoscopic video holds immense potential for clinical applications. NeRF-based techniques have recently garnered attention for the ability to reconstruct scenes implicitly. On the other hand, Gaussian splatting-based 3D-GS represents scenes explicitly using 3D Gaussians a…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: To appear in MICCAI 2024 EARTH Workshop. Code availability: https://github.com/Epsilon404/surgicalnerf

  12. arXiv:2408.02211  [pdf, other]

    cs.GR

    SceneMotifCoder: Example-driven Visual Program Learning for Generating 3D Object Arrangements

    Authors: Hou In Ivan Tam, Hou In Derek Pun, Austin T. Wang, Angel X. Chang, Manolis Savva

    Abstract: Despite advances in text-to-3D generation methods, generation of multi-object arrangements remains challenging. Current methods exhibit failures in generating physically plausible arrangements that respect the provided text description. We present SceneMotifCoder (SMC), an example-driven framework for generating 3D object arrangements through visual program learning. SMC leverages large language m…

    Submitted 4 August, 2024; originally announced August 2024.

  13. A Recipe for Success? Exploring Strategies for Improving Non-Visual Access to Cooking Instructions

    Authors: Franklin Mingzhe Li, Ashley Wang, Patrick Carrington, Shaun K. Kane

    Abstract: Cooking is an essential activity that enhances quality of life by enabling individuals to prepare their own meals. However, cooking often requires multitasking between cooking tasks and following instructions, which can be challenging to cooks with vision impairments if recipes or other instructions are inaccessible. To explore the practices and challenges of recipe access while cooking, we conduc…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: ASSETS 2024

  14. arXiv:2407.17035  [pdf, other]

    cs.CV

    Q-Ground: Image Quality Grounding with Large Multi-modality Models

    Authors: Chaofeng Chen, Sensen Yang, Haoning Wu, Liang Liao, Zicheng Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin

    Abstract: Recent advances of large multi-modality models (LMM) have greatly improved the ability of image quality assessment (IQA) method to evaluate and explain the quality of visual content. However, these advancements are mostly focused on overall quality assessment, and the detailed examination of local quality, which is crucial for comprehensive visual understanding, is still largely unexplored. In thi…

    Submitted 24 July, 2024; originally announced July 2024.

    Comments: ACM Multimedia 2024 (Oral)

  15. arXiv:2407.13077  [pdf, other]

    cs.CY

    Visions of a Discipline: Analyzing Introductory AI Courses on YouTube

    Authors: Severin Engelmann, Madiha Zahrah Choksi, Angelina Wang, Casey Fiesler

    Abstract: Education plays an indispensable role in fostering societal well-being and is widely regarded as one of the most influential factors in shaping the future of generations to come. As artificial intelligence (AI) becomes more deeply integrated into our daily lives and the workforce, educational institutions at all levels are directing their focus on resources that cater to AI education. Our work inv…

    Submitted 30 May, 2024; originally announced July 2024.

  16. arXiv:2407.09271  [pdf, other]

    cs.CV cs.LG

    iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning

    Authors: Tom Fischer, Yaoyao Liu, Artur Jesslen, Noor Ahmed, Prakhar Kaushik, Angtian Wang, Alan Yuille, Adam Kortylewski, Eddy Ilg

    Abstract: Different from human nature, it is still common practice today for vision tasks to train deep learning models only initially and on fixed datasets. A variety of approaches have recently addressed handling continual data streams. However, extending these methods to manage out-of-distribution (OOD) scenarios has not effectively been investigated. On the other hand, it has recently been shown that no…

    Submitted 19 August, 2024; v1 submitted 12 July, 2024; originally announced July 2024.

    Comments: ECCV-24

  17. arXiv:2407.06939  [pdf, other]

    cs.RO cs.CV

    Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge

    Authors: Sriram Yenamandra, Arun Ramachandran, Mukul Khanna, Karmesh Yadav, Jay Vakil, Andrew Melnik, Michael Büttner, Leon Harz, Lyon Brown, Gora Chand Nandi, Arjun PS, Gaurav Kumar Yadav, Rahul Kala, Robert Haschke, Yang Luo, Jinxin Zhu, Yansen Han, Bingyi Lu, Xuan Gu, Qinyuan Liu, Yaping Zhao, Qiting Ye, Chenxiao Dou, Yansong Chua, Volodymyr Kuzma, et al. (20 additional authors not shown)

    Abstract: In order to develop robots that can effectively serve as versatile and capable home assistants, it is crucial for them to reliably perceive and interact with a wide variety of objects across diverse environments. To this end, we proposed Open Vocabulary Mobile Manipulation as a key benchmark task for robotics: finding any object in a novel environment and placing it on any receptacle surface withi…

    Submitted 9 July, 2024; originally announced July 2024.

  18. arXiv:2407.05769  [pdf, other]

    cs.CV

    Boosting 3D Object Detection with Semantic-Aware Multi-Branch Framework

    Authors: Hao Jing, Anhong Wang, Lijun Zhao, Yakun Yang, Donghan Bu, Jing Zhang, Yifan Zhang, Junhui Hou

    Abstract: In autonomous driving, LiDAR sensors are vital for acquiring 3D point clouds, providing reliable geometric information. However, traditional sampling methods of preprocessing often ignore semantic features, leading to detail loss and ground point interference in 3D object detection. To address this, we propose a multi-branch two-stage 3D object detection framework using a Semantic-aware Multi-bran…

    Submitted 8 July, 2024; originally announced July 2024.

  19. arXiv:2407.00320  [pdf, other]

    cs.CL cs.AI cs.LG

    LiteSearch: Efficacious Tree Search for LLM

    Authors: Ante Wang, Linfeng Song, Ye Tian, Baolin Peng, Dian Yu, Haitao Mi, Jinsong Su, Dong Yu

    Abstract: Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to be deployed in practical applications. This study introduces a novel guided tree s…

    Submitted 29 June, 2024; originally announced July 2024.

  20. arXiv:2406.16860  [pdf, other]

    cs.CV

    Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    Authors: Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie

    Abstract: We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Website at https://cambrian-mllm.github.io

  21. arXiv:2406.12723  [pdf, other]

    cs.LG

    BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

    Authors: Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang

    Abstract: As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establish several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by includin…

    Submitted 24 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

  22. arXiv:2406.12292  [pdf, other]

    cs.SD cs.AI eess.AS

    JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning

    Authors: Boyu Chen, Peike Li, Yao Yao, Alex Wang

    Abstract: Large models for text-to-music generation have achieved significant progress, facilitating the creation of high-quality and varied musical compositions from provided text prompts. However, input text prompts may not precisely capture user requirements, particularly when the objective is to generate music that embodies a specific concept derived from a designated reference collection. In this paper…

    Submitted 18 June, 2024; originally announced June 2024.

  23. arXiv:2406.09321  [pdf, other]

    cs.CR cs.AI cs.CL

    JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

    Authors: Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang

    Abstract: Jailbreak attacks aim to induce Large Language Models (LLMs) to generate harmful responses for forbidden instructions, presenting severe misuse threats to LLMs. Up to now, research into jailbreak attacks and defenses is emerging, however, there is (surprisingly) no consensus on how to evaluate whether a jailbreak attempt is successful. In other words, the methods to assess the harmfulness of an LL…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Our code is available at https://github.com/ThuCCSLab/JailbreakEval

  24. arXiv:2406.08877  [pdf, other]

    cs.CV cs.AI

    EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

    Authors: Yuan-Ming Li, Wei-Jin Huang, An-Lan Wang, Ling-An Zeng, Jing-Ke Meng, Wei-Shi Zheng

    Abstract: We present EgoExo-Fitness, a new full-body action understanding dataset, featuring fitness sequence videos recorded from synchronized egocentric and fixed exocentric (third-person) cameras. Compared with existing full-body action understanding datasets, EgoExo-Fitness not only contains videos from first-person perspectives, but also provides rich annotations. Specifically, two-level temporal bound…

    Submitted 16 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted by ECCV2024

  25. arXiv:2406.03035  [pdf, other]

    cs.CV

    Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control

    Authors: Jingyun Xue, Hongfa Wang, Qi Tian, Yue Ma, Andong Wang, Zhiyuan Zhao, Shaobo Min, Wenzhe Zhao, Kaihao Zhang, Heung-Yeung Shum, Wei Liu, Mengyang Liu, Wenhan Luo

    Abstract: Pose-controllable character video generation is in high demand with extensive applications for fields such as automatic advertising and content creation on social media platforms. While existing character image animation methods using pose sequences and reference images have shown promising performance, they tend to struggle with incoherent animation in complex scenarios, such as multiple characte…

    Submitted 12 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

  26. arXiv:2406.02547  [pdf, ps, other]

    cs.CV

    Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

    Authors: Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou

    Abstract: Training models with longer in-context lengths is a significant challenge for multimodal model due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to increase in-context text length in multi-modality large language models (MLLMs) efficiently. We present Visualized In-Context Text…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 12 pages. The website is https://fingerrec.github.io/visincontext

  27. arXiv:2406.01908  [pdf, other]

    cs.LG math.OC

    PDHG-Unrolled Learning-to-Optimize Method for Large-Scale Linear Programming

    Authors: Bingheng Li, Linxin Yang, Yupeng Chen, Senmiao Wang, Qian Chen, Haitao Mao, Yao Ma, Akang Wang, Tian Ding, Jiliang Tang, Ruoyu Sun

    Abstract: Solving large-scale linear programming (LP) problems is an important task in various areas such as communication networks, power systems, finance and logistics. Recently, two distinct approaches have emerged to expedite LP solving: (i) First-order methods (FOMs); (ii) Learning to optimize (L2O). In this work, we propose an FOM-unrolled neural network (NN) called PDHG-Net, and propose a two-stage L…

    Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted by ICML 2024

  28. arXiv:2406.00622  [pdf, other]

    cs.CV cs.AI

    Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering

    Authors: Xingrui Wang, Wufei Ma, Angtian Wang, Shuo Chen, Adam Kortylewski, Alan Yuille

    Abstract: For vision-language models (VLMs), understanding the dynamic properties of objects and their interactions within 3D scenes from video is crucial for effective reasoning. In this work, we introduce a video question answering dataset SuperCLEVR-Physics that focuses on the dynamics properties of objects. We concentrate on physical concepts -- velocity, acceleration, and collisions within 4D scenes, w…

    Submitted 2 June, 2024; originally announced June 2024.

  29. arXiv:2406.00061  [pdf, other]

    cs.LG cs.AI cs.CL

    STAT: Shrinking Transformers After Training

    Authors: Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle

    Abstract: We present STAT: a simple algorithm to prune transformer models without any fine-tuning. STAT eliminates both attention heads and neurons from the network, while preserving accuracy by calculating a correction to the weights of the next layer. Each layer block in the network is compressed using a series of principled matrix factorizations that preserve the network structure. Our entire algorithm t…

    Submitted 29 May, 2024; originally announced June 2024.

  30. arXiv:2405.20448  [pdf, other]

    cs.LG

    Knockout: A simple way to handle missing inputs

    Authors: Minh Nguyen, Batuhan K. Karaman, Heejong Kim, Alan Q. Wang, Fengbei Liu, Mert R. Sabuncu

    Abstract: Deep learning models can extract predictive and actionable information from complex inputs. The richer the inputs, the better these models usually perform. However, models that leverage rich inputs (e.g., multi-modality) can be difficult to deploy widely, because some inputs may be missing at inference. Current popular solutions to this problem include marginalization, imputation, and training mul…

    Submitted 3 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  31. arXiv:2405.20334  [pdf, other]

    cs.CV cs.GR

    VividDream: Generating 3D Scene with Ambient Dynamics

    Authors: Yao-Chih Lee, Yi-Ting Chen, Andrew Wang, Ting-Hsuan Liao, Brandon Y. Feng, Jia-Bin Huang

    Abstract: We introduce VividDream, a method for generating explorable 4D scenes with ambient dynamics from a single input image or text prompt. VividDream first expands an input image into a static 3D point cloud through iterative inpainting and geometry merging. An ensemble of animated videos is then generated using video diffusion models with quality refinement techniques and conditioned on renderings of…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Project page: https://vivid-dream-4d.github.io

  32. arXiv:2405.17537  [pdf, other]

    cs.AI cs.CL cs.CV

    BIOSCAN-CLIP: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

    Authors: ZeMing Gong, Austin T. Wang, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang

    Abstract: Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for the taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, DNA barcodes, and textual data in a unified embedding space. This allows for…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 16 pages with 9 figures

  33. arXiv:2405.17248  [pdf, other]

    stat.ML cs.LG

    Transformer In-Context Learning for Categorical Data

    Authors: Aaron T. Wang, Ricardo Henao, Lawrence Carin

    Abstract: Recent research has sought to understand Transformers through the lens of in-context learning with functional data. We extend that line of work with the goal of moving closer to language models, considering categorical outcomes, nonlinear underlying models, and nonlinear attention. The contextual data are of the form $\textsf{C}=(x_1,c_1,\dots,x_N,c_{N})$ where each $c_i\in\{0,\dots,C-1\}$ is draw…

    Submitted 27 May, 2024; originally announced May 2024.

  34. arXiv:2405.14782  [pdf, other]

    cs.CL

    Lessons from the Trenches on Reproducible Evaluation of Language Models

    Authors: Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, et al. (5 additional authors not shown)

    Abstract: Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons…

    Submitted 29 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  35. arXiv:2405.14458  [pdf, other]

    cs.CV

    YOLOv10: Real-Time End-to-End Object Detection

    Authors: Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, Guiguang Ding

    Abstract: Over the past years, YOLOs have emerged as the predominant paradigm in the field of real-time object detection owing to their effective balance between computational cost and detection performance. Researchers have explored the architectural designs, optimization objectives, data augmentation strategies, and others for YOLOs, achieving notable progress. However, the reliance on the non-maximum sup…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: Code: https://github.com/THU-MIG/yolov10

  36. Towards Feature Engineering with Human and AI's Knowledge: Understanding Data Science Practitioners' Perceptions in Human&AI-Assisted Feature Engineering Design

    Authors: Qian Zhu, Dakuo Wang, Shuai Ma, April Yi Wang, Zixin Chen, Udayan Khurana, Xiaojuan Ma

    Abstract: As AI technology continues to advance, the importance of human-AI collaboration becomes increasingly evident, with numerous studies exploring its potential in various fields. One vital field is data science, including feature engineering (FE), where both human ingenuity and AI capabilities play pivotal roles. Despite the existence of AI-generated recommendations for FE, there remains a limited und…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Computational Notebooks, Human-AI Collaboration, Feature Recommendation

  37. arXiv:2405.14019  [pdf, other]

    cs.CV

    BrainMorph: A Foundational Keypoint Model for Robust and Flexible Brain MRI Registration

    Authors: Alan Q. Wang, Rachit Saluja, Heejong Kim, Xinzi He, Adrian Dalca, Mert R. Sabuncu

    Abstract: We present a keypoint-based foundation model for general purpose brain MRI registration, based on the recently-proposed KeyMorph framework. Our model, called BrainMorph, serves as a tool that supports multi-modal, pairwise, and scalable groupwise registration. BrainMorph is trained on a massive dataset of over 100,000 3D volumes, skull-stripped and non-skull-stripped, from nearly 16,000 unique hea…

    Submitted 24 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

    Comments: arXiv admin note: text overlap with arXiv:2304.09941

  38. arXiv:2405.09713  [pdf, other]

    cs.CV cs.AI cs.CL

    SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

    Authors: Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, Chuang Gan

    Abstract: Learning commonsense reasoning from visual contexts and scenes in real-world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge in the real world. Our work aims to delve deeper into reasoning evaluations, specifically withi…

    Submitted 16 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

    Comments: CVPR

  39. arXiv:2405.08672  [pdf, other]

    eess.IV cs.CV

    EndoDAC: Efficient Adapting Foundation Model for Self-Supervised Depth Estimation from Any Endoscopic Camera

    Authors: Beilei Cui, Mobarakol Islam, Long Bai, An Wang, Hongliang Ren

    Abstract: Depth estimation plays a crucial role in various tasks within endoscopic surgery, including navigation, surface reconstruction, and augmented reality visualization. Despite the significant achievements of foundation models in vision tasks, including depth estimation, their direct application to the medical domain often results in suboptimal performance. This highlights the need for efficient adapt…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: early accepted by MICCAI 2024

  40. arXiv:2405.07518  [pdf, other]

    cs.AR cs.AI

    SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

    Authors: Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, et al. (5 additional authors not shown)

    Abstract: Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in compute-to-memory ratio of modern AI accelerators have created a memory wall, necessitating new methods to deploy AI. Composition of Expert…

    Submitted 13 May, 2024; originally announced May 2024.

  41. arXiv:2405.07060  [pdf, other]

    cs.RO

    Memory-Maze: Scenario Driven Benchmark and Visual Language Navigation Model for Guiding Blind People

    Authors: Masaki Kuribayashi, Kohei Uehara, Allan Wang, Daisuke Sato, Simon Chu, Shigeo Morishima

    Abstract: Visual Language Navigation (VLN) powered navigation robots have the potential to guide blind people by understanding and executing route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes…

    Submitted 11 May, 2024; originally announced May 2024.

  42. arXiv:2405.06201  [pdf, other]

    cs.CV

    PhysMLE: Generalizable and Priors-Inclusive Multi-task Remote Physiological Measurement

    Authors: Jiyao Wang, Hao Lu, Ange Wang, Xiao Yang, Yingcong Chen, Dengbo He, Kaishun Wu

    Abstract: Remote photoplethysmography (rPPG) has been widely applied to measure heart rate from face videos. To increase the generalizability of the algorithms, domain generalization (DG) attracted increasing attention in rPPG. However, when rPPG is extended to simultaneously measure more vital signs (e.g., respiration and blood oxygen saturation), achieving generalizability brings new challenges. Although…

    Submitted 9 May, 2024; originally announced May 2024.

  43. arXiv:2405.04605  [pdf]

    cs.CV cs.AI cs.LG

    AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets

    Authors: Fakrul Islam Tushar, Avivah Wang, Lavsen Dahal, Michael R. Harowicz, Kyle J. Lafata, Tina D. Tailor, Joseph Y. Lo

    Abstract: Lung cancer's high mortality rate can be mitigated by early detection, increasingly reliant on AI for diagnostic imaging. However, AI model performance depends on training and validation datasets. This study develops and validates AI models for both nodule detection and cancer classification tasks. For detection, two models (DLCSD-mD and LUNA16-mD) were developed using the Duke Lung Cancer Screeni…

    Submitted 12 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

    Comments: 16 pages, 2 tables, 5 figures

  44. arXiv:2405.03855  [pdf, other]

    cs.CY

    Strategies for Increasing Corporate Responsible AI Prioritization

    Authors: Angelina Wang, Teresa Datta, John P. Dickerson

    Abstract: Responsible artificial intelligence (RAI) is increasingly recognized as a critical concern. However, the level of corporate RAI prioritization has not kept pace. In this work, we conduct 16 semi-structured interviews with practitioners to investigate what has historically motivated companies to increase the prioritization of RAI. What emerges is a complex story of conflicting and varied factors, b…

    Submitted 28 July, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: AAAI/ACM Conference on AI, Ethics, and Society (AIES) 2024

  45. arXiv:2405.03113  [pdf, other]

    cs.RO cs.AI

    Robot Air Hockey: A Manipulation Testbed for Robot Learning with Reinforcement Learning

    Authors: Caleb Chuck, Carl Qi, Michael J. Munje, Shuozhe Li, Max Rudolph, Chang Shi, Siddhant Agarwal, Harshit Sikchi, Abhinav Peri, Sarthak Dayal, Evan Kuo, Kavan Mehta, Anthony Wang, Peter Stone, Amy Zhang, Scott Niekum

    Abstract: Reinforcement Learning is a promising tool for learning complex policies even in fast-moving and object-interactive domains where human teleoperation or hard-coded policies might fail. To effectively reflect this challenging category of tasks, we introduce a dynamic, interactive RL testbed based on robot air hockey. By augmenting air hockey with a large family of tasks ranging from easy tasks like…

    Submitted 5 May, 2024; originally announced May 2024.

  46. arXiv:2405.01691  [pdf, other]

    cs.CV cs.LG cs.RO

    Language-Enhanced Latent Representations for Out-of-Distribution Detection in Autonomous Driving

    Authors: Zhenjiang Mao, Dong-You Jhong, Ao Wang, Ivan Ruchkin

    Abstract: Out-of-distribution (OOD) detection is essential in autonomous driving, to determine when learning-based components encounter unexpected inputs. Traditional detectors typically use encoder models with fixed settings, thus lacking effective human interaction capabilities. With the rise of large foundation models, multimodal inputs offer the possibility of taking human language as a latent represent…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Presented at the Robot Trust for Symbiotic Societies (RTSS) Workshop, co-located with ICRA 2024

  47. arXiv:2404.16506  [pdf, other]

    cs.CL

    Building a Japanese Document-Level Relation Extraction Dataset Assisted by Cross-Lingual Transfer

    Authors: Youmi Ma, An Wang, Naoaki Okazaki

    Abstract: Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted LREC-COLING 2024

  48. arXiv:2404.05626  [pdf, other]

    cs.CV

    Learning a Category-level Object Pose Estimator without Pose Annotations

    Authors: Fengrui Tian, Yaoyao Liu, Adam Kortylewski, Yueqi Duan, Shaoyi Du, Alan Yuille, Angtian Wang

    Abstract: 3D object pose estimation is a challenging task. Previous works always require thousands of object images with annotated poses for learning the 3D pose correspondence, which is laborious and time-consuming for labeling. In this paper, we propose to learn a category-level 3D object pose estimator without pose annotations. Instead of using manually annotated images, we leverage diffusion models (e.g…

    Submitted 8 April, 2024; originally announced April 2024.

  49. arXiv:2404.05188  [pdf, other]

    cs.CR cs.AI cs.CL

    Have You Merged My Model? On The Robustness of Large Language Model IP Protection Methods Against Model Merging

    Authors: Tianshuo Cong, Delong Ran, Zesen Liu, Xinlei He, Jinyuan Liu, Yichen Gong, Qi Li, Anyu Wang, Xiaoyun Wang

    Abstract: Model merging is a promising lightweight model empowerment technique that does not rely on expensive computing devices (e.g., GPUs) or require the collection of specific training data. Instead, it involves editing different upstream model parameters to absorb their downstream task capabilities. However, uncertified model merging can infringe upon the Intellectual Property (IP) rights of the origin…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Technical Report

  50. arXiv:2404.05163  [pdf, other]

    cs.CV

    Semantic Flow: Learning Semantic Field of Dynamic Scenes from Monocular Videos

    Authors: Fengrui Tian, Yueqi Duan, Angtian Wang, Jianfei Guo, Shaoyi Du

    Abstract: In this work, we pioneer Semantic Flow, a neural semantic representation of dynamic scenes from monocular videos. In contrast to previous NeRF methods that reconstruct dynamic scenes from the colors and volume densities of individual points, Semantic Flow learns semantics from continuous flows that contain rich 3D motion information. As there is 2D-to-3D ambiguity problem in the viewing direction…

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: Accepted by ICLR 2024, Codes are available at https://github.com/tianfr/Semantic-Flow/