
Showing 1–50 of 578 results for author: Wei, X

  1. arXiv:2409.02483  [pdf, other]

    cs.CV cs.AI

    TASAR: Transferable Attack on Skeletal Action Recognition

    Authors: Yunfeng Diao, Baiqi Wu, Ruixuan Zhang, Ajian Liu, Xingxing Wei, Meng Wang, He Wang

    Abstract: Skeletal sequences, as well-structured representations of human behaviors, are crucial in Human Activity Recognition (HAR). The transferability of adversarial skeletal sequences enables attacks in real-world HAR scenarios, such as autonomous driving, intelligent surveillance, and human-computer interactions. However, existing Skeleton-based HAR (S-HAR) attacks exhibit weak adversarial transferabil…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: arXiv admin note: text overlap with arXiv:2407.08572

  2. arXiv:2408.16236  [pdf, other]

    cs.CV

    Neural Spectral Decomposition for Dataset Distillation

    Authors: Shaolei Yang, Shen Cheng, Mingbo Hong, Haoqiang Fan, Xing Wei, Shuaicheng Liu

    Abstract: In this paper, we propose Neural Spectrum Decomposition, a generic decomposition framework for dataset distillation. Unlike previous methods, we consider the entire dataset as a high-dimensional observation that is low-rank across all dimensions. We aim to discover the low-rank representation of the entire dataset and perform distillation efficiently. Toward this end, we learn a set of spectrum te…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: ECCV 2024

  3. arXiv:2408.15876  [pdf, other]

    cs.CV

    Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

    Authors: Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, Si Liu

    Abstract: In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spa…

    Submitted 28 August, 2024; originally announced August 2024.

  4. arXiv:2408.15668  [pdf, ps, other]

    cs.IT eess.SP

    Movable Antennas Meet Intelligent Reflecting Surface: When Do We Need Movable Antennas?

    Authors: Xin Wei, Weidong Mei, Qingqing Wu, Boyu Ning, Zhi Chen

    Abstract: Intelligent reflecting surface (IRS) and movable antenna (MA)/fluid antenna (FA) techniques have both received increasing attention in the realm of wireless communications due to their ability to reconfigure and improve wireless channel conditions. In this paper, we investigate the integration of MAs/FAs into an IRS-assisted wireless communication system. In particular, we consider the downlink tr…

    Submitted 1 September, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

    Comments: 6 pages, 6 figures

  5. arXiv:2408.15542  [pdf, other]

    cs.CV cs.AI cs.MM

    Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

    Authors: Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, Jie Hu

    Abstract: Rapid advancements have been made in extending Large Language Models (LLMs) to Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data remains a challenging endeavor, especially for long videos. Due to insufficient access to large-scale high-quality video data and the excessive compression of visual features, current methods exhibit limitations in effectively proce…

    Submitted 28 August, 2024; originally announced August 2024.

  6. arXiv:2408.12984  [pdf, other]

    cond-mat.mtrl-sci cs.AI

    Zeoformer: Coarse-Grained Periodic Graph Transformer for OSDA-Zeolite Affinity Prediction

    Authors: Xiangxiang Shen, Zheng Wan, Lingfeng Wen, Licheng Sun, Ou Yang Ming Jie, Xuan Tang, Xian Zeng, Mingsong Chen, Xiao He, Xian Wei

    Abstract: To date, the International Zeolite Association Structure Commission (IZA-SC) has cataloged merely 255 distinct zeolite structures, with millions of theoretically possible structures yet to be discovered. The synthesis of a specific zeolite typically necessitates the use of an organic structure-directing agent (OSDA), since the selectivity for a particular zeolite is largely determined by the affin…

    Submitted 25 August, 2024; v1 submitted 23 August, 2024; originally announced August 2024.

    Comments: 7 pages, 5 figures

  7. arXiv:2408.12454  [pdf, other]

    cs.CV cs.AI

    Relaxed Rotational Equivariance via $G$-Biases in Vision

    Authors: Zhiqiang Wu, Licheng Sun, Yingjie Liu, Jian Yang, Hanlin Dong, Shing-Ho J. Lin, Xuan Tang, Jinpeng Mi, Bo Jin, Xian Wei

    Abstract: Group Equivariant Convolution (GConv) can effectively handle rotational symmetry data. It assumes uniform and strict rotational symmetry across all features, as the transformations under the specific group. However, real-world data rarely conforms to strict rotational symmetry, a phenomenon commonly referred to as Rotational Symmetry-Breaking in the system or dataset, making GConv unable to adapt effectively t…

    Submitted 25 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

  8. arXiv:2408.11760  [pdf, other]

    cs.CV cs.AI

    SBDet: A Symmetry-Breaking Object Detector via Relaxed Rotation-Equivariance

    Authors: Zhiqiang Wu, Yingjie Liu, Hanlin Dong, Xuan Tang, Jian Yang, Bo Jin, Mingsong Chen, Xian Wei

    Abstract: Introducing Group Equivariant Convolution (GConv) empowers models to explore symmetries hidden in visual data, improving their performance. However, in real-world scenarios, objects or scenes often exhibit perturbations of a symmetric system, specifically a deviation from a symmetric architecture, which can be characterized by a non-trivial action of a symmetry group, known as Symmetry-Breaking. T…

    Submitted 21 August, 2024; originally announced August 2024.

  9. arXiv:2408.11567  [pdf, other]

    cs.CV

    Positional Prompt Tuning for Efficient 3D Representation Learning

    Authors: Shaochen Zhang, Zekun Qi, Runpei Dong, Xiuxiu Bai, Xing Wei

    Abstract: Point cloud analysis has achieved significant development and performs well in multiple downstream tasks such as point cloud classification and segmentation. Being conscious of the simplicity of the position encoding structure in Transformer-based architectures, we attach importance to the position encoding as a high-dimensional part and the patch encoder to offer multi-scale information. To…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: tech report

  10. arXiv:2408.10286  [pdf, other]

    cs.LG cs.AI

    GPT-Augmented Reinforcement Learning with Intelligent Control for Vehicle Dispatching

    Authors: Xiao Han, Zijian Zhang, Xiangyu Zhao, Guojiang Shen, Xiangjie Kong, Xuetao Wei, Liqiang Nie, Jieping Ye

    Abstract: As urban residents demand higher travel quality, vehicle dispatch has become a critical component of online ride-hailing services. However, current vehicle dispatch systems struggle to navigate the complexities of urban traffic dynamics, including unpredictable traffic conditions, diverse driver behaviors, and fluctuating supply and demand patterns. These challenges have resulted in travel difficu…

    Submitted 19 August, 2024; originally announced August 2024.

  11. arXiv:2408.10198  [pdf, other]

    cs.CV cs.GR

    MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

    Authors: Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Linghao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xiaoshuai Zhang, Isabella Liu, Hongzhi Wu, Hao Su

    Abstract: Open-world 3D reconstruction models have recently garnered significant attention. However, without sufficient 3D inductive bias, existing methods typically entail expensive training costs and struggle to extract high-quality 3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision. S…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: 20 pages, 9 figures

  12. arXiv:2408.09483  [pdf, other]

    cs.AR

    CMD: A Cache-assisted GPU Memory Deduplication Architecture

    Authors: Wei Zhao, Dan Feng, Wei Tong, Xueliang Wei, Bing Wu

    Abstract: Massive off-chip accesses in GPUs are the main performance bottleneck, and we divide these accesses into three types: (1) Write, (2) Data-Read, and (3) Read-Only. We also find that many writes are duplicates, and the duplication can be inter-dup or intra-dup: inter-dup means different memory blocks are identical, while intra-dup means all the 4B elements in a line are the same. In this wo…

    Submitted 18 August, 2024; originally announced August 2024.

  13. arXiv:2408.09230  [pdf, other]

    cs.AI

    Siamese Multiple Attention Temporal Convolution Networks for Human Mobility Signature Identification

    Authors: Zhipeng Zheng, Yuchen Jiang, Shiyao Zhang, Xuetao Wei

    Abstract: The Human Mobility Signature Identification (HuMID) problem stands as a fundamental task within the realm of driving style representation, dedicated to discerning latent driving behaviors and preferences from diverse driver trajectories for driver identification. Its solutions hold significant implications across various domains (e.g., ride-hailing, insurance), wherein their application serves to…

    Submitted 17 August, 2024; originally announced August 2024.

    Comments: 27th IEEE International Conference on Intelligent Transportation Systems (ITSC) (ITSC 2024)

  14. HybridOcc: NeRF Enhanced Transformer-based Multi-Camera 3D Occupancy Prediction

    Authors: Xiao Zhao, Bo Chen, Mingyang Sun, Dingkang Yang, Youxing Wang, Xukun Zhang, Mingcheng Li, Dongliang Kou, Xiaoyi Wei, Lihua Zhang

    Abstract: Vision-based 3D semantic scene completion (SSC) describes autonomous driving scenes through 3D volume representations. However, the occlusion of invisible voxels by scene surfaces poses challenges to current SSC methods in hallucinating refined 3D geometry. This paper proposes HybridOcc, a hybrid 3D volume query proposal method generated by Transformer framework and NeRF representation and refined…

    Submitted 17 August, 2024; originally announced August 2024.

    Comments: Accepted to IEEE RAL

  15. arXiv:2408.08322  [pdf, other]

    eess.SP cs.IT

    Movable-Antenna Position Optimization for Physical-Layer Security via Discrete Sampling

    Authors: Weidong Mei, Xin Wei, Yijie Liu, Boyu Ning, Zhi Chen

    Abstract: Fluid antennas (FAs) and mobile antennas (MAs) are innovative technologies in wireless communications that are able to proactively improve channel conditions by dynamically adjusting the transmit/receive antenna positions within a given spatial region. In this paper, we investigate an MA-enhanced multiple-input single-output (MISO) secure communication system, aiming to maximize the secrecy rate b…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: This paper is accepted by IEEE Globecom 2024. arXiv admin note: substantial text overlap with arXiv:2403.16886

  16. arXiv:2408.02066  [pdf, other]

    cs.CR

    PromptSAM+: Malware Detection based on Prompt Segment Anything Model

    Authors: Xingyuan Wei, Yichen Liu, Ce Li, Ning Li, Degang Sun, Yan Wang

    Abstract: Machine learning and deep learning (ML/DL) have been extensively applied in malware detection, and some existing methods demonstrate robust performance. However, several issues persist in the field of malware detection: (1) Existing work often overemphasizes accuracy at the expense of practicality, rarely considering false positive and false negative rates as important metrics. (2) Considering the…

    Submitted 4 August, 2024; originally announced August 2024.

    Comments: 13 pages, 10 figures

    MSC Class: F.2.2; I.2.7 ACM Class: F.2.2; I.2.7

  17. arXiv:2408.01661  [pdf, other]

    cs.CR

    Mitigating the Impact of Malware Evolution on API Sequence-based Windows Malware Detector

    Authors: Xingyuan Wei, Ce Li, Qiujian Lv, Ning Li, Degang Sun, Yan Wang

    Abstract: In dynamic Windows malware detection, deep learning models are extensively deployed to analyze API sequences. Methods based on API sequences play a crucial role in malware prevention. However, due to the continuous updates of APIs and the changes in API sequence calls leading to the constant evolution of malware variants, the detection capability of API sequence-based malware detection models sign…

    Submitted 3 August, 2024; originally announced August 2024.

    Comments: 13 pages, 11 figures

    ACM Class: F.2.2; I.2.7

  18. arXiv:2407.21475  [pdf, other]

    cs.CV cs.AI

    Fine-gained Zero-shot Video Sampling

    Authors: Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu

    Abstract: Incorporating a temporal dimension into pretrained image diffusion models for video generation is a prevalent approach. However, this method is computationally demanding and necessitates large-scale video datasets. More critically, the heterogeneity between image and video datasets often results in catastrophic forgetting of the image expertise. Recent attempts to directly extract video snippets f…

    Submitted 31 July, 2024; originally announced July 2024.

  19. arXiv:2407.21428  [pdf, other]

    cs.GR cs.AI

    Deformable 3D Shape Diffusion Model

    Authors: Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu

    Abstract: The Gaussian diffusion model, initially designed for image generation, has recently been adapted for 3D point cloud generation. However, these adaptations have not fully considered the intrinsic geometric characteristics of 3D shapes, thereby constraining the diffusion model's potential for 3D shape manipulation. To address this limitation, we introduce a novel deformable 3D shape diffusion model…

    Submitted 31 July, 2024; originally announced July 2024.

  20. arXiv:2407.21276  [pdf, other]

    cs.AI cs.CL

    Multi-Level Querying using A Knowledge Pyramid

    Authors: Rubing Chen, Xulu Zhang, Jiaxin Wu, Wenqi Fan, Xiao-Yong Wei, Qing Li

    Abstract: This paper addresses the need for improved precision in existing Retrieval-Augmented Generation (RAG) methods that primarily focus on enhancing recall. We propose a multi-layer knowledge pyramid approach within the RAG framework to achieve a better balance between precision and recall. The knowledge pyramid consists of three layers: Ontologies, Knowledge Graphs (KGs), and chunk-based raw text. We…

    Submitted 5 August, 2024; v1 submitted 30 July, 2024; originally announced July 2024.

  21. arXiv:2407.20870  [pdf, other]

    cs.CV cs.CG

    Mean of Means: A 10-dollar Solution for Human Localization with Calibration-free and Unconstrained Camera Settings

    Authors: Tianyi Zhang, Wengyu Zhang, Xulu Zhang, Jiaxin Wu, Xiao-Yong Wei, Jiannong Cao, Qing Li

    Abstract: Accurate human localization is crucial for various applications, especially in the Metaverse era. Existing high precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current vision solutions based on stereo vision face limitations due to rigid perspective transformation principles and error propagation in multi-sta…

    Submitted 30 July, 2024; originally announced July 2024.

  22. Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval

    Authors: Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiaoyong Wei, Chang Wen Chen, Qing Li

    Abstract: In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct a…

    Submitted 22 July, 2024; v1 submitted 21 July, 2024; originally announced July 2024.

    Comments: Accepted to ACM Multimedia 2024

  23. arXiv:2407.14904  [pdf, other]

    eess.IV cs.AI cs.CL cs.CV

    Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning

    Authors: Chen Shen, Chunfeng Lian, Wanqing Zhang, Fan Wang, Jianhua Zhang, Shuanliang Fan, Xin Wei, Gongji Wang, Kehan Li, Hongshu Mu, Hao Wu, Xinggong Liang, Jianhua Ma, Zhenyuan Wang

    Abstract: Forensic pathology is critical in determining the cause and manner of death through post-mortem examinations, both macroscopic and microscopic. The field, however, grapples with issues such as outcome variability, laborious processes, and a scarcity of trained professionals. This paper presents SongCi, an innovative visual-language model (VLM) designed specifically for forensic pathology. SongCi u…

    Submitted 20 July, 2024; originally announced July 2024.

    Comments: 28 pages, 6 figures, under review

  24. arXiv:2407.10759  [pdf, other]

    eess.AS cs.CL cs.LG

    Qwen2-Audio Technical Report

    Authors: Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou

    Abstract: We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data an…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: https://github.com/QwenLM/Qwen2-Audio. Checkpoints, code, and scripts will be open-sourced soon

  25. arXiv:2407.10671  [pdf, other]

    cs.CL cs.AI

    Qwen2 Technical Report

    Authors: An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, et al. (37 additional authors not shown)

    Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, a…

    Submitted 17 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: 25 pages, 1 figure

  26. arXiv:2407.09835  [pdf, other]

    cs.CL

    Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis

    Authors: Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre

    Abstract: State-of-the-art LLMs often rely on scale with high computational costs, which has sparked a research agenda to reduce parameter counts and costs without significantly impacting performance. Our study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. In contra…

    Submitted 24 July, 2024; v1 submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted by ICML 2024 Next Generation of Sequence Modeling Architectures Workshop. Short version of arXiv:2406.16450

  27. arXiv:2407.08739  [pdf, other]

    cs.CV

    MAVIS: Mathematical Visual Instruction Tuning

    Authors: Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li

    Abstract: Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, the mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, a…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: Work in progress. Data and Models are released at https://github.com/ZrrSkywalker/MAVIS

  28. arXiv:2407.08489  [pdf, other]

    cs.CV

    Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

    Authors: Zeyang Zhao, Qilong Xue, Yuhang He, Yifan Bai, Xing Wei, Yihong Gong

    Abstract: This paper introduces the point-axis representation for oriented object detection, emphasizing its flexibility and geometrically intuitive nature with two key components: points and axes. 1) Points delineate the spatial extent and contours of objects, providing detailed shape descriptions. 2) Axes define the primary directionalities of objects, providing essential orientation cues crucial for prec…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 19 pages, 7 figures, accepted by ECCV24!

  29. arXiv:2407.04521  [pdf, ps, other]

    math.OC cs.LG q-fin.CP

    Unified continuous-time q-learning for mean-field game and mean-field control problems

    Authors: Xiaoli Wei, Xiang Yu, Fengyi Yuan

    Abstract: This paper studies the continuous-time q-learning in the mean-field jump-diffusion models from the representative agent's perspective. To overcome the challenge when the population distribution may not be directly observable, we introduce the integrated q-function in decoupled form (decoupled Iq-function) and establish its martingale characterization together with the value function, which provide…

    Submitted 5 July, 2024; originally announced July 2024.

  30. arXiv:2407.02129  [pdf, other]

    cs.HC

    ReliaAvatar: A Robust Real-Time Avatar Animator with Integrated Motion Prediction

    Authors: Bo Qian, Zhenhuan Wei, Jiashuo Li, Xing Wei

    Abstract: Efficiently estimating the full-body pose with minimal wearable devices presents a worthwhile research direction. Despite significant advancements in this field, most current research neglects to explore full-body avatar estimation under low-quality signal conditions, which is prevalent in practical usage. To bridge this gap, we summarize three scenarios that may be encountered in real-world appli…

    Submitted 2 July, 2024; originally announced July 2024.

  31. arXiv:2407.01445  [pdf, other]

    cs.LG cs.CV

    FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources

    Authors: Xiyuan Wei, Fanjiang Ye, Ori Yonay, Xingyu Chen, Baixi Sun, Dingwen Tao, Tianbao Yang

    Abstract: Existing studies of training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds of or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been demonstra…

    Submitted 29 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: 24 pages

  32. arXiv:2406.16450  [pdf, other]

    cs.CL

    Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

    Authors: Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre

    Abstract: State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter count and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFN), which…

    Submitted 24 June, 2024; originally announced June 2024.

  33. arXiv:2406.15768  [pdf, other]

    cs.CV

    MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

    Authors: Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

    Abstract: In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding,…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: 14 pages, 8 figures

  34. arXiv:2406.12324  [pdf, other]

    cs.RO

    AutoDSL: Automated domain-specific language design for structural representation of procedures with constraints

    Authors: Yu-Zhe Shi, Haofei Hou, Zhangqian Bi, Fanxu Meng, Xiang Wei, Lecheng Ruan, Qining Wang

    Abstract: Accurate representation of procedures in restricted scenarios, such as non-standardized scientific experiments, requires precise depiction of constraints. Unfortunately, Domain-specific Language (DSL), as an effective tool to express constraints structurally, often requires case-by-case hand-crafting, necessitating customized, labor-intensive efforts. To overcome this challenge, we introduce the A…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL'24)

  35. arXiv:2406.11833  [pdf, other]

    cs.CV cs.AI cs.LG

    MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

    Authors: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history wit…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: This project is available at https://github.com/Liuziyu77/MMDU

  36. arXiv:2406.10527  [pdf, other]

    cs.CV

    Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center

    Authors: Zichen Yu, Changyong Shu, Qianpu Sun, Junjie Linghu, Xiaobao Wei, Jiangyong Yu, Zongdai Liu, Dawei Yang, Hui Li, Yan Chen

    Abstract: Panoptic occupancy poses a novel challenge by aiming to integrate instance occupancy and semantic occupancy within a unified framework. However, there is still a lack of efficient solutions for panoptic occupancy. In this paper, we propose Panoptic-FlashOcc, a straightforward yet robust 2D feature framework that enables realtime panoptic occupancy. Building upon the lightweight design of FlashOcc,…

    Submitted 15 June, 2024; originally announced June 2024.

  37. arXiv:2406.08418  [pdf, other]

    cs.CV cs.AI

    OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, et al. (15 additional authors not shown)

    Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an…

    Submitted 12 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  38. arXiv:2406.07057  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

    Authors: Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, Jun Zhu

    Abstract: Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchm…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 100 pages, 84 figures, 33 tables

  39. arXiv:2406.05485  [pdf, other]

    cs.CV

    Training-Free Robust Interactive Video Object Segmentation

    Authors: Xiaoli Wei, Zhaoqing Wang, Yandong Guo, Chunxia Zhang, Tongliang Liu, Mingming Gong

    Abstract: Interactive video object segmentation is a crucial video task, having various applications from video editing to data annotating. However, current approaches struggle to accurately segment objects across diverse domains. Recently, Segment Anything Model (SAM) introduces interactive visual prompts and demonstrates impressive performance across different domains. In this paper, we propose a training…

    Submitted 8 June, 2024; originally announced June 2024.

  40. arXiv:2406.04743  [pdf, other]

    cs.LG cs.CR cs.DC stat.AP

    When Swarm Learning meets energy series data: A decentralized collaborative learning design based on blockchain

    Authors: Lei Xu, Yulong Chen, Yuntian Chen, Longfeng Nie, Xuetao Wei, Liang Xue, Dongxiao Zhang

    Abstract: Machine learning models offer the capability to forecast future energy production or consumption and infer essential unknown variables from existing data. However, legal and policy constraints within specific energy sectors render the data sensitive, presenting technical hurdles in utilizing data from diverse sources. Therefore, we propose adopting a Swarm Learning (SL) scheme, which replaces the…

    Submitted 7 June, 2024; originally announced June 2024.

  41. arXiv:2406.04325  [pdf, other]

    cs.CV

    ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    Authors: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

    Abstract: We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating st…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://sharegpt4video.github.io/

  42. arXiv:2406.03794  [pdf, other]

    cs.LG

    Infusing Self-Consistency into Density Functional Theory Hamiltonian Prediction via Deep Equilibrium Models

    Authors: Zun Wang, Chang Liu, Nianlong Zou, He Zhang, Xinran Wei, Lin Huang, Lijun Wu, Bin Shao

    Abstract: In this study, we introduce a unified neural network architecture, the Deep Equilibrium Density Functional Theory Hamiltonian (DEQH) model, which incorporates Deep Equilibrium Models (DEQs) for predicting Density Functional Theory (DFT) Hamiltonians. The DEQH model inherently captures the self-consistency nature of Hamiltonian, a critical aspect often overlooked by traditional machine learning app…

    Submitted 6 June, 2024; originally announced June 2024.

  43. arXiv:2405.20323  [pdf, other]

    cs.CV cs.AI

    $\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving

    Authors: Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang

    Abstract: Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle b…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Code is available at: https://github.com/nnanhuang/S3Gaussian/

  44. arXiv:2405.18860  [pdf, other]

    cs.RO

    Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks

    Authors: Tianle Zhang, Dongjiang Li, Yihang Li, Zecui Zeng, Lin Zhao, Lei Sun, Yue Chen, Xuelong Wei, Yibing Zhan, Lusong Li, Xiaodong He

    Abstract: The advancements in embodied AI are increasingly enabling robots to tackle complex real-world tasks, such as household manipulation. However, the deployment of robots in these environments remains constrained by the lack of comprehensive bimanual-mobile robot manipulation data that can be learned. Existing datasets predominantly focus on single-arm manipulation tasks, while the few dual-arm datase…

    Submitted 6 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

  45. arXiv:2405.18729  [pdf, other]

    cs.LG cs.AI

    Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning

    Authors: Tianle Zhang, Jiayi Guan, Lin Zhao, Yihang Li, Dongjiang Li, Zecui Zeng, Lei Sun, Yue Chen, Xuelong Wei, Lusong Li, Xiaodong He

    Abstract: Offline reinforcement learning (RL) aims to learn optimal policies from previously collected datasets. Recently, due to their powerful representational capabilities, diffusion models have shown significant potential as policy models for offline RL issues. However, previous offline RL algorithms based on diffusion policies generally adopt weighted regression to improve the policy. This approach opt…

    Submitted 28 May, 2024; originally announced May 2024.

  46. arXiv:2405.18361  [pdf, other]

    cs.CV

    Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

    Authors: Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang

    Abstract: Rapid advancements in Autonomous Driving (AD) tasks turned a significant shift toward end-to-end fashion, particularly in the utilization of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car…

    Submitted 28 May, 2024; originally announced May 2024.

  47. arXiv:2405.17935  [pdf, other]

    cs.CL cs.AI

    Tool Learning with Large Language Models: A Survey

    Authors: Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen

    Abstract: Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems. Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization, posing barriers to entry for newcomers. This gap motivates us to conduct a comprehensive…

    Submitted 30 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

  48. arXiv:2405.16865  [pdf, other]

    q-bio.NC cs.LG stat.ML

    An Investigation of Conformal Isometry Hypothesis for Grid Cells

    Authors: Dehong Xu, Ruiqi Gao, Wen-Hao Zhang, Xue-Xin Wei, Ying Nian Wu

    Abstract: This paper investigates the conformal isometry hypothesis as a potential explanation for the emergence of hexagonal periodic patterns in the response maps of grid cells. The hypothesis posits that the activities of the population of grid cells form a high-dimensional vector in the neural space, representing the agent's self-position in 2D physical space. As the agent moves in the 2D physical space…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: arXiv admin note: text overlap with arXiv:2310.19192

  49. Towards Completeness-Oriented Tool Retrieval for Large Language Models

    Authors: Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen

    Abstract: Recently, integrating external tools with Large Language Models (LLMs) has gained significant attention as an effective strategy to mitigate the limitations inherent in their pre-training data. However, real-world systems often incorporate a wide array of tools, making it impractical to input all tools into LLMs due to length limitations and latency constraints. Therefore, to fully exploit the pot…

    Submitted 28 July, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

    Comments: Accepted by CIKM 2024; GitHub: https://github.com/quchangle1/COLT

  50. arXiv:2405.14702  [pdf, other]

    cs.CV cs.AI

    G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models

    Authors: Pengyue Jia, Yiding Liu, Xiaopeng Li, Xiangyu Zhao, Yuhao Wang, Yantong Du, Xiao Han, Xuetao Wei, Shuaiqiang Wang, Dawei Yin

    Abstract: Worldwide geolocalization aims to locate the precise location at the coordinate level of photos taken anywhere on the Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context. They may easily con…

    Submitted 23 May, 2024; originally announced May 2024.