
Showing 1–50 of 1,376 results for author: Zhu, X

Searching in archive cs.
  1. arXiv:2409.03525  [pdf, other]

    cs.CV

    FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

    Authors: Xi Chen, Haosen Yang, Sheng Jin, Xiatian Zhu, Hongxun Yao

    Abstract: Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements,…

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: 14 pages, 9 figures

  2. arXiv:2409.03439  [pdf, other]

    cs.RO cs.AI cs.PL

    KiloBot: A Programming Language for Deploying Perception-Guided Industrial Manipulators at Scale

    Authors: Wei Gao, Jingqiang Wang, Xinv Zhu, Jun Zhong, Yue Shen, Youshuang Ding

    Abstract: We would like industrial robots to handle unstructured environments with cameras and perception pipelines. In contrast to traditional industrial robots that replay offline-crafted trajectories, online behavior planning is required for these perception-guided industrial applications. Aside from perception and planning algorithms, deploying perception-guided manipulators also requires substantial ef…

    Submitted 5 September, 2024; originally announced September 2024.

  3. arXiv:2409.02914  [pdf, other]

    cs.CV

    Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving

    Authors: Yuhang Lu, Yichen Yao, Jiadong Tu, Jiangnan Shao, Yuexin Ma, Xinge Zhu

    Abstract: Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving d…

    Submitted 4 September, 2024; originally announced September 2024.

  4. arXiv:2409.02375  [pdf, other]

    cs.CL

    How Privacy-Savvy Are Large Language Models? A Case Study on Compliance and Privacy Technical Review

    Authors: Xichou Zhu, Yang Liu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Bolong Yang, Manman Wang, Zongxing Xie, Peng Liu, Dan Cai, Junhui Wang

    Abstract: The recent advances in large language models (LLMs) have significantly expanded their applications across various fields such as language generation, summarization, and complex question answering. However, their application to privacy compliance and technical privacy reviews remains under-explored, raising critical concerns about their ability to adhere to global privacy standards and protect sens…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: 8 pages, 4 figures

  5. arXiv:2409.02370  [pdf, other]

    cs.CL cs.AI

    Do Large Language Models Possess Sensitive to Sentiment?

    Authors: Yang Liu, Xichou Zhu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Zhiyang Xu, Wei Luo, Junhui Wang

    Abstract: Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in the text modality. As the integration of LLMs into diverse applications is on the rise, it becomes highly criti…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: 10 pages, 2 figures

  6. arXiv:2409.00353  [pdf, other]

    cs.CV

    RI-MAE: Rotation-Invariant Masked AutoEncoders for Self-Supervised Point Cloud Representation Learning

    Authors: Kunming Su, Qiuxia Wu, Panpan Cai, Xiaogang Zhu, Xuequan Lu, Zhiyong Wang, Kun Hu

    Abstract: Masked point modeling methods have recently achieved great success in self-supervised learning for point cloud data. However, these methods are sensitive to rotations and often exhibit sharp performance drops when encountering rotational variations. In this paper, we propose a novel Rotation-Invariant Masked AutoEncoders (RI-MAE) to address two major challenges: 1) achieving rotation-invariant lat…

    Submitted 31 August, 2024; originally announced September 2024.

  7. arXiv:2408.17207  [pdf, other]

    cs.CV cs.RO

    NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar

    Authors: Runwei Guan, Jianan Liu, Liye Jia, Haocheng Zhao, Shanliang Yao, Xiaohui Zhu, Ka Lok Man, Eng Gee Lim, Jeremy Smith, Yutao Yue

    Abstract: Recently, visual grounding and multi-sensor settings have been incorporated into perception systems for terrestrial autonomous driving systems and Unmanned Surface Vehicles (USVs), yet the high complexity of modern learning-based visual grounding models using multiple sensors prevents such models from being deployed on USVs in real life. To this end, we design a low-power multi-task model named NanoMVG f…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: 8 pages, 6 figures

  8. arXiv:2408.16768  [pdf, other]

    cs.CV cs.AI cs.CL

    SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

    Authors: Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Chengzhuo Tong, Peng Gao, Chunyuan Li, Pheng-Ann Heng

    Abstract: We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. SAM2Point interprets any 3D data as a series of multi-directional videos, and leverages SAM 2 for 3D-space segmentation, without further training or 2D-3D projection. Our framework supports various prompt types, including 3D points, boxes, and masks, and can gen…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: Work in progress. Online Demo: https://huggingface.co/spaces/ZiyuG/SAM2Point . Code: https://github.com/ZiyuGuo99/SAM2Point

  9. arXiv:2408.15829  [pdf, other]

    cs.CV

    SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization

    Authors: Sicheng Liu, Lintao Wang, Xiaogan Zhu, Xuequan Lu, Zhiyong Wang, Kun Hu

    Abstract: Extreme Multimodal Summarization with Multimodal Output (XMSMO) has become an attractive summarization approach by integrating various types of information to create extremely concise yet informative summaries for individual modalities. Existing methods overlook the issue that multimodal data often contains more topic-irrelevant information, which can mislead the model into producing inaccurate summa…

    Submitted 28 August, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

    Comments: 8 pages, 5 figures, submitted to ACM Multimedia Asia 2024

    ACM Class: I.2.10

  10. arXiv:2408.15491  [pdf, other]

    cs.CL

    Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

    Authors: Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu

    Abstract: Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate the issue of hallucinations, LLMs often incorporate a retrieval-augmented pipeline to provide them with rich external knowledge and context. Nevertheless, challenges stem from inaccurate and coarse-grained context retrieved from the retriever. Supplying irrel…

    Submitted 27 August, 2024; originally announced August 2024.

    Comments: 20 pages

  11. arXiv:2408.15122  [pdf, other]

    cs.CV physics.ao-ph

    Machine Learning for Methane Detection and Quantification from Space - A survey

    Authors: Enno Tiemann, Shanyu Zhou, Alexander Kläser, Konrad Heidler, Rochelle Schneider, Xiao Xiang Zhu

    Abstract: Methane ($CH_4$) is a potent anthropogenic greenhouse gas, contributing 86 times more to global warming than Carbon Dioxide ($CO_2$) over 20 years, and it also acts as an air pollutant. Given its high radiative forcing potential and relatively short atmospheric lifetime (9$\pm$1 years), methane has important implications for climate change; therefore, cutting methane emissions is crucial for effec…

    Submitted 27 August, 2024; originally announced August 2024.

  12. arXiv:2408.14472  [pdf, other]

    cs.RO cs.AI eess.SY

    Advancing Humanoid Locomotion: Mastering Challenging Terrains with Denoising World Model Learning

    Authors: Xinyang Gu, Yen-Jen Wang, Xiang Zhu, Chengming Shi, Yanjiang Guo, Yichen Liu, Jianyu Chen

    Abstract: Humanoid robots, with their human-like skeletal structure, are especially suited for tasks in human-centric environments. However, this structure is accompanied by additional challenges in locomotion controller design, especially in complex real-world environments. As a result, existing humanoid robots are limited to relatively simple terrains, either with model-based control or model-free reinfor…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: Robotics: Science and Systems (RSS), 2024. (Best Paper Award Finalist)

  13. arXiv:2408.14419  [pdf, other]

    cs.AI cs.CL cs.CV

    CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models

    Authors: Shubham Bharti, Shiyun Cheng, Jihyun Rho, Martina Rao, Xiaojin Zhu

    Abstract: We introduce CHARTOM, a visual theory-of-mind benchmark for multimodal large language models. CHARTOM consists of specially designed data visualizing charts. Given a chart, a language model needs to not only correctly comprehend the chart (the FACT question) but also judge if the chart will be misleading to a human reader (the MIND question). Both questions have significant societal benefits. We d…

    Submitted 26 August, 2024; originally announced August 2024.

  14. arXiv:2408.13854  [pdf, other]

    cs.CV cs.AI

    Tangram: A Challenging Benchmark for Geometric Element Recognizing

    Authors: Jiamin Tang, Chao Zhang, Xudong Zhu, Mengchi Liu

    Abstract: Significant advancements in Large Multimodal Models (LMMs) have enabled them to tackle complex problems involving visual-mathematical reasoning. However, their ability to identify geometric elements remains understudied. To bridge this gap, we introduce Tangram, a novel benchmark designed to evaluate the performance of LMMs on geometric element recognition. Tangram includes 1,080 diverse geometric…

    Submitted 25 August, 2024; originally announced August 2024.

    Comments: 12 pages, 7 figures

  15. arXiv:2408.12897  [pdf, other]

    eess.IV cs.CV

    When Diffusion MRI Meets Diffusion Model: A Novel Deep Generative Model for Diffusion MRI Generation

    Authors: Xi Zhu, Wei Zhang, Yijie Li, Lauren J. O'Donnell, Fan Zhang

    Abstract: Diffusion MRI (dMRI) is an advanced imaging technique characterizing tissue microstructure and white matter structural connectivity of the human brain. The demand for high-quality dMRI data is growing, driven by the need for better resolution and improved tissue contrast. However, acquiring high-quality dMRI data is expensive and time-consuming. In this context, deep generative modeling emerges as…

    Submitted 23 August, 2024; originally announced August 2024.

    Comments: 11 pages, 3 figures

  16. arXiv:2408.12353  [pdf, other]

    stat.ML cs.LG math.ST

    Distributed quasi-Newton robust estimation under differential privacy

    Authors: Chuhan Wang, Lixing Zhu, Xuehu Zhu

    Abstract: For distributed computing with Byzantine machines under Privacy Protection (PP) constraints, this paper develops a robust PP distributed quasi-Newton estimation, which only requires the node machines to transmit five vectors to the central processor with high asymptotic relative efficiency. Compared with the gradient descent strategy which requires more rounds of transmission and the Newton iterat…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: 38 pages, 6 figures

  17. arXiv:2408.08447  [pdf, other]

    cs.CV cs.AI

    SpectralEarth: Training Hyperspectral Foundation Models at Scale

    Authors: Nassim Ait Ali Braham, Conrad M Albrecht, Julien Mairal, Jocelyn Chanussot, Yi Wang, Xiao Xiang Zhu

    Abstract: Foundation models have triggered a paradigm shift in computer vision and are increasingly being adopted in remote sensing, particularly for multispectral imagery. Yet, their potential in hyperspectral imaging (HSI) remains untapped due to the absence of comprehensive and globally representative hyperspectral datasets. To close this gap, we introduce SpectralEarth, a large-scale multi-temporal data…

    Submitted 15 August, 2024; originally announced August 2024.

  18. arXiv:2408.07971  [pdf, other]

    cs.CL

    Predicting Lung Cancer Patient Prognosis with Large Language Models

    Authors: Danqing Hu, Bing Liu, Xiang Li, Xiaofeng Zhu, Nan Wu

    Abstract: Prognosis prediction is crucial for determining optimal treatment plans for lung cancer patients. Traditionally, such predictions relied on models developed from retrospective patient data. Recently, large language models (LLMs) have gained attention for their ability to process and generate text based on extensive learned knowledge. In this study, we evaluate the potential of GPT-4o mini and GPT-…

    Submitted 15 August, 2024; originally announced August 2024.

  19. arXiv:2408.05699  [pdf, other]

    cs.CV

    MacFormer: Semantic Segmentation with Fine Object Boundaries

    Authors: Guoan Xu, Wenfeng Huang, Tao Wu, Ligeng Chen, Wenjing Jia, Guangwei Gao, Xiatian Zhu, Stuart Perry

    Abstract: Semantic segmentation involves assigning a specific category to each pixel in an image. While Vision Transformer-based models have made significant progress, current semantic segmentation methods often struggle with precise predictions in localized areas like object boundaries. To tackle this challenge, we introduce a new semantic segmentation architecture, "MacFormer", which features two key co…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: 13 pages, 7 figures, submitted to TIP

  20. arXiv:2408.05456  [pdf, other]

    cs.CL

    Path-LLM: A Shortest-Path-based LLM Learning for Unified Graph Representation

    Authors: Wenbo Shang, Xuliang Zhu, Xin Huang

    Abstract: Unified graph representation learning aims to produce node embeddings, which can be applied to multiple downstream applications. However, existing studies based on graph neural networks and language models either require extensive training for specific downstream predictions or have shallow semantic features. In this work, we propose a novel Path-LLM model to learn…

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: 12 pages, 8 figures

  21. arXiv:2408.05411  [pdf, other]

    cs.CV

    How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model

    Authors: Yuxin Zhu, Huiyu Duan, Kaiwei Zhang, Yucheng Zhu, Xilei Zhu, Long Teng, Xiongkuo Min, Guangtao Zhai

    Abstract: Understanding and predicting viewer attention in omnidirectional videos (ODVs) is crucial for enhancing user engagement in virtual and augmented reality applications. Although both audio and visual modalities are essential for saliency prediction in ODVs, the joint exploitation of these two modalities has been limited, primarily due to the absence of large-scale audio-visual saliency databases and…

    Submitted 9 August, 2024; originally announced August 2024.

  22. arXiv:2408.05075  [pdf, other]

    cs.CV

    DeepInteraction++: Multi-Modality Interaction for Autonomous Driving

    Authors: Zeyu Yang, Nan Song, Wei Li, Xiatian Zhu, Li Zhang, Philip H. S. Torr

    Abstract: Existing top-performance autonomous driving systems typically rely on the multi-modal fusion strategy for reliable scene understanding. This design is, however, fundamentally restricted because it overlooks modality-specific strengths, ultimately hampering model performance. To address this limitation, in this work, we introduce a novel modality interaction strategy that allows individual per-…

    Submitted 15 August, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

    Comments: Journal extension of NeurIPS 2022. arXiv admin note: text overlap with arXiv:2208.11112

  23. arXiv:2408.02718  [pdf, other]

    cs.CV

    MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

    Authors: Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

    Abstract: The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not kept pace with their development. To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluatio…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Project Page: https://mmiu-bench.github.io/

  24. arXiv:2408.02710  [pdf, other]

    cs.LG cs.CV

    RCDM: Enabling Robustness for Conditional Diffusion Model

    Authors: Weifeng Xu, Xiang Zhu, Xiaoyong Li

    Abstract: The conditional diffusion model (CDM) enhances the standard diffusion model by providing more control, improving the quality and relevance of the outputs, and making the model adaptable to a wider range of complex tasks. However, inaccurate conditional inputs in the inverse process of CDM can easily lead to generating fixed errors in the neural network, which diminishes the adaptability of a well-…

    Submitted 5 August, 2024; originally announced August 2024.

  25. arXiv:2408.01653  [pdf, other]

    cs.CV

    MCPDepth: Omnidirectional Depth Estimation via Stereo Matching from Multi-Cylindrical Panoramas

    Authors: Feng Qiao, Zhexiao Xiong, Xinge Zhu, Yuexin Ma, Qiumeng He, Nathan Jacobs

    Abstract: We introduce Multi-Cylindrical Panoramic Depth Estimation (MCPDepth), a two-stage framework for omnidirectional depth estimation via stereo matching between multiple cylindrical panoramas. MCPDepth uses cylindrical panoramas for initial stereo matching and then fuses the resulting depth maps across views. A circular attention module is employed to overcome the distortion along the vertical axis. M…

    Submitted 2 August, 2024; originally announced August 2024.

  26. arXiv:2408.01218  [pdf, other]

    cs.CV

    S2TD-Face: Reconstruct a Detailed 3D Face with Controllable Texture from a Single Sketch

    Authors: Zidu Wang, Xiangyu Zhu, Jiang Yu, Tianshuo Zhang, Zhen Lei

    Abstract: 3D textured face reconstruction from sketches, applicable in many scenarios such as animation, 3D avatars, artistic design, missing people search, etc., is a highly promising but underdeveloped research topic. On the one hand, the stylistic diversity of sketches leads to existing sketch-to-3D-face methods only being able to handle pose-limited and realistically shaded sketches. On the other hand, t…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: ACM MM 2024

  27. arXiv:2408.00418  [pdf, other]

    cs.CV

    Towards Reliable Advertising Image Generation Using Human Feedback

    Authors: Zhenbang Du, Wei Feng, Haohan Wang, Yaoyu Li, Jingsen Wang, Jian Li, Zheng Zhang, Jingjing Lv, Xin Zhu, Junsheng Jin, Junjie Shen, Zhangang Lin, Jingping Shao

    Abstract: In the e-commerce realm, compelling advertising images are pivotal for attracting customer attention. While generative models automate image generation, they often produce substandard images that may mislead customers and require significant labor costs to inspect. This paper delves into increasing the rate of available generated images. We first introduce a multi-modal Reliable Feedback Network (…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: ECCV 2024

  28. arXiv:2408.00038  [pdf, other]

    cs.IR

    MIMNet: Multi-Interest Meta Network with Multi-Granularity Target-Guided Attention for Cross-domain Recommendation

    Authors: Xiaofei Zhu, Yabo Yin, Li Wang

    Abstract: Cross-domain recommendation (CDR) plays a critical role in alleviating the sparsity and cold-start problem and substantially boosting the performance of recommender systems. Existing CDR methods prefer to either learn a common preference bridge shared by all users or a personalized preference bridge tailored for each user to transfer user preference from the source domain to the target domain. Alt…

    Submitted 31 July, 2024; originally announced August 2024.

  29. arXiv:2407.21497  [pdf, other]

    cs.CV

    Mitral Regurgitation Recogniton based on Unsupervised Out-of-Distribution Detection with Residual Diffusion Amplification

    Authors: Zhe Liu, Xiliang Zhu, Tong Han, Yuhao Huang, Jian Wang, Lian Liu, Fang Wang, Dong Ni, Zhongshan Gou, Xin Yang

    Abstract: Mitral regurgitation (MR) is a serious heart valve disease. Early and accurate diagnosis of MR via ultrasound video is critical for timely clinical decision-making and surgical intervention. However, manual MR diagnosis heavily relies on the operator's experience, which may cause misdiagnosis and inter-observer variability. Since MR data is limited and has large intra-class variability, we propose…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted by MICCAI MLMI 2024, 11 pages, 3 figures

  30. arXiv:2407.21363  [pdf, other]

    cs.CV cs.MM

    ESIQA: Perceptual Quality Assessment of Vision-Pro-based Egocentric Spatial Images

    Authors: Xilei Zhu, Liu Yang, Huiyu Duan, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet

    Abstract: With the development of eXtended Reality (XR), head-mounted shooting and display technology have experienced significant advancement and gained considerable attention. Egocentric spatial images and videos are emerging as a compelling form of stereoscopic XR content. Different from traditional 2D images, egocentric spatial images present challenges for perceptual quality assessment due to their spe…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: 8 pages, 8 figures

  31. arXiv:2407.21320  [pdf]

    cs.AI physics.flu-dyn

    MetaOpenFOAM: an LLM-based multi-agent framework for CFD

    Authors: Yuxuan Chen, Xu Zhu, Hua Zhou, Zhuyin Ren

    Abstract: Remarkable progress has been made in automated problem solving through societies of agents based on large language models (LLMs). Computational fluid dynamics (CFD), as a complex problem, presents unique challenges in automated simulations that require sophisticated solutions. MetaOpenFOAM, as a novel multi-agent collaboration framework, aims to complete CFD simulation tasks with only natural lan…

    Submitted 7 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: 31 pages, 11 figures, 11 tables

  32. arXiv:2407.21033  [pdf, other]

    cs.IR cs.AI cs.CL cs.CV

    Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition

    Authors: Jielong Tang, Zhenxing Wang, Ziyang Gong, Jianxing Yu, Xiangwei Zhu, Jian Yin

    Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task, aiming to simultaneously extract entity spans, types, and corresponding visual regions of entities from given sentence-image pairs. Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task. The former, utili…

    Submitted 21 August, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: 13 pages, 7 figures

  33. arXiv:2407.17900  [pdf, other]

    cs.CL cs.LG

    The Power of Combining Data and Knowledge: GPT-4o is an Effective Interpreter of Machine Learning Models in Predicting Lymph Node Metastasis of Lung Cancer

    Authors: Danqing Hu, Bing Liu, Xiaofeng Zhu, Nan Wu

    Abstract: Lymph node metastasis (LNM) is a crucial factor in determining the initial treatment for patients with lung cancer, yet accurate preoperative diagnosis of LNM remains challenging. Recently, large language models (LLMs) have garnered significant attention due to their remarkable text generation capabilities. Leveraging the extensive medical knowledge learned from vast corpora, LLMs can estimate pro…

    Submitted 14 August, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

  34. arXiv:2407.17731  [pdf, other]

    econ.GN cs.GT cs.LG

    Optimal Trade and Industrial Policies in the Global Economy: A Deep Learning Framework

    Authors: Zi Wang, Xingcheng Xu, Yanqing Yang, Xiaodong Zhu

    Abstract: We propose a deep learning framework, DL-opt, designed to efficiently solve for optimal policies in quantifiable general equilibrium trade models. DL-opt integrates (i) a nested fixed point (NFXP) formulation of the optimization problem, (ii) automatic implicit differentiation to enhance gradient descent for solving unilateral optimal policies, and (iii) a best-response dynamics approach for findi…

    Submitted 24 July, 2024; originally announced July 2024.

  35. arXiv:2407.16977  [pdf, other]

    cs.CV cs.MM

    Selective Vision-Language Subspace Projection for Few-shot CLIP

    Authors: Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang

    Abstract: Vision-language models such as CLIP are capable of mapping the different modality data into a unified feature space, enabling zero/few-shot inference by measuring the similarity of given images and texts. However, most existing methods overlook modality gaps in CLIP's encoded features, which manifest as text and image features lying far apart from each other, resulting in limited classification…

    Submitted 26 July, 2024; v1 submitted 23 July, 2024; originally announced July 2024.

    Comments: Accepted as an Oral Paper at ACM Multimedia 2024

  36. arXiv:2407.15881  [pdf, ps, other]

    cs.GT cs.LG

    Data Sharing for Mean Estimation Among Heterogeneous Strategic Agents

    Authors: Alex Clinton, Yiding Chen, Xiaojin Zhu, Kirthevasan Kandasamy

    Abstract: We study a collaborative learning problem where $m$ agents estimate a vector $μ\in\mathbb{R}^d$ by collecting samples from normal distributions, with each agent $i$ incurring a cost $c_{i,k} \in (0, \infty]$ to sample from the $k^{\text{th}}$ distribution $\mathcal{N}(μ_k, σ^2)$. Instead of working on their own, agents can collect data that is cheap to them, and share it with others in exchange fo…

    Submitted 20 July, 2024; originally announced July 2024.

  37. arXiv:2407.15838  [pdf, other]

    cs.CV

    MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

    Authors: Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai

    Abstract: Vision-language supervised fine-tuning is effective in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: despite existing VLLMs exhibiting strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies, s…

    Submitted 7 August, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

    Comments: 18 pages, 8 figures, technical report

  38. arXiv:2407.13642  [pdf, other]

    cs.CV

    Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

    Authors: Xiaoyu Zhu, Hao Zhou, Pengfei Xing, Long Zhao, Hao Xu, Junwei Liang, Alexander Hauptmann, Ting Liu, Andrew Gallagher

    Abstract: In this paper, we investigate the use of diffusion models which are pre-trained on large-scale image-caption pairs for open-vocabulary 3D semantic understanding. We propose a novel method, namely Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  39. arXiv:2407.12005  [pdf, other]

    cs.MM cs.CV

    VCEval: Rethinking What is a Good Educational Video and How to Automatically Evaluate It

    Authors: Xiaoxuan Zhu, Zhouhong Gu, Sihang Jiang, Zhixu Li, Hongwei Feng, Yanghua Xiao

    Abstract: Online courses have significantly lowered the barrier to accessing education, yet the varying content quality of these videos poses challenges. In this work, we focus on the task of automatically evaluating the quality of video course content. We have constructed a dataset with a substantial collection of video courses and teaching materials. We propose three evaluation principles and design a new…

    Submitted 15 June, 2024; originally announced July 2024.

  40. arXiv:2407.11843  [pdf, other]

    cs.CL cs.AI

    InferAct: Inferring Safe Actions for LLM-Based Agents Through Preemptive Evaluation and Human Feedback

    Authors: Haishuo Fang, Xiaodan Zhu, Iryna Gurevych

    Abstract: A crucial requirement for deploying LLM-based agents in real-life applications is robustness against risky or irreversible mistakes. However, existing research lacks a focus on the preemptive evaluation of reasoning trajectories performed by LLM agents, leading to a gap in ensuring safe and reliable operations. To explore better solutions, this paper introduces InferAct, a novel approach that leve…

    Submitted 16 July, 2024; originally announced July 2024.

  41. arXiv:2407.11840  [pdf, other]

    cs.CV

    MVG-Splatting: Multi-View Guided Gaussian Splatting with Adaptive Quantile-Based Geometric Consistency Densification

    Authors: Zhuoxiao Li, Shanliang Yao, Yijie Chu, Angel F. Garcia-Fernandez, Yong Yue, Eng Gee Lim, Xiaohui Zhu

    Abstract: In the rapidly evolving field of 3D reconstruction, 3D Gaussian Splatting (3DGS) and 2D Gaussian Splatting (2DGS) represent significant advancements. Although 2DGS compresses 3D Gaussian primitives into 2D Gaussian surfels to effectively enhance mesh extraction quality, this compression can potentially lead to a decrease in rendering quality. Additionally, unreliable densification processes and th…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: https://mvgsplatting.github.io

  42. arXiv:2407.11770  [pdf, other]

    cs.CL

    Robust Utility-Preserving Text Anonymization Based on Large Language Models

    Authors: Tianyu Yang, Xiaodan Zhu, Iryna Gurevych

    Abstract: Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face the emerging challenge of re-identification attacks by Large Language Models (LLMs), which have shown advanced capability in memorizing detailed information and patterns as well as connecting disparate pieces of information. In defending against LLM-based re-identification attacks,…

    Submitted 16 July, 2024; originally announced July 2024.

  43. arXiv:2407.11677  [pdf, other]

    cs.CV

    Video-Language Alignment via Spatio-Temporal Graph Transformer

    Authors: Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, Weibo Gu, Tianjin Zhang, Chun Yang, Wei Liu, Xu-Cheng Yin

    Abstract: Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to promote alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships a…

    Submitted 23 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

    Comments: Under review

  44. arXiv:2407.11298  [pdf, other]

    cs.RO

    ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

    Authors: Yaoyao Qian, Xupeng Zhu, Ondrej Biza, Shuo Jiang, Linfeng Zhao, Haojie Huang, Yu Qi, Robert Platt

    Abstract: Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that makes use of GPT-4o's advanced contextual reasoning for heavy clutter environment grasping strategies. ThinkGrasp can effectively identify and generate grasp poses for target objects, even wh…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Project website: https://h-freax.github.io/thinkgrasp_page/

  45. arXiv:2407.11158  [pdf, other]

    cs.LG math.NA

    Physics-embedded Fourier Neural Network for Partial Differential Equations

    Authors: Qingsong Xu, Nils Thuerey, Yilei Shi, Jonathan Bamber, Chaojun Ouyang, Xiao Xiang Zhu

    Abstract: We consider solving complex spatiotemporal dynamical systems governed by partial differential equations (PDEs) using frequency domain-based discrete learning approaches, such as Fourier neural operators. Despite their widespread use for approximating nonlinear PDEs, the majority of these methods neglect fundamental physical laws and lack interpretability. We address these shortcomings by introduci…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 29 pages, 18 figures

  46. arXiv:2407.10167  [pdf, other]

    cs.CL cs.AI

    Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model

    Authors: Xunyu Zhu, Jian Li, Can Ma, Weiping Wang

    Abstract: Large Language Models (LLMs) have demonstrated exceptional proficiency in mathematical reasoning tasks due to their extensive parameter counts and training on vast datasets. Despite these capabilities, deploying LLMs is hindered by their computational demands. Distilling LLM mathematical reasoning into Smaller Language Models (SLMs) has emerged as a solution to this challenge, although these small…

    Submitted 30 July, 2024; v1 submitted 14 July, 2024; originally announced July 2024.

    Comments: Corrects a description error in the experiment settings, i.e., the teacher LLM is changed from GPT-4 to deepseek-v2

  47. arXiv:2407.09751  [pdf, other]

    cs.CV

    TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation

    Authors: Xiaopei Wu, Yuenan Hou, Xiaoshui Huang, Binbin Lin, Tong He, Xinge Zhu, Yuexin Ma, Boxi Wu, Haifeng Liu, Deng Cai, Wanli Ouyang

    Abstract: Training deep models for LiDAR semantic segmentation is challenging due to the inherent sparsity of point clouds. Utilizing temporal data is a natural remedy against the sparsity problem as it makes the input signal denser. However, previous multi-frame fusion algorithms fall short in utilizing sufficient temporal information due to the memory constraint, and they also ignore the informative tempo…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: Accepted by CVPR 2024

  48. arXiv:2407.08153  [pdf, other]

    cs.CV

    Lifelong Histopathology Whole Slide Image Retrieval via Distance Consistency Rehearsal

    Authors: Xinyu Zhu, Zhiguo Jiang, Kun Wu, Jun Shi, Yushan Zheng

    Abstract: Content-based histopathological image retrieval (CBHIR) has gained attention in recent years, offering the capability to return histopathology images that are content-wise similar to the query one from an established database. However, in clinical practice, the continuously expanding size of WSI databases limits the practical application of the current CBHIR methods. In this paper, we propose a Li…

    Submitted 12 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted for MICCAI 2024

  49. arXiv:2407.07958  [pdf, other]

    cs.CV

    Bayesian Detector Combination for Object Detection with Crowdsourced Annotations

    Authors: Zhi Qin Tan, Olga Isupova, Gustavo Carneiro, Xiatian Zhu, Yunpeng Li

    Abstract: Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise, especially in crowdsourcing scenarios. Most prior object detection methods assume accurate annotations; a few recent works have studied object detection with noisy crowdsourced annotations, with evaluation on distinct synthetic crowdsourced datasets of varying setups under…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  50. arXiv:2407.07457  [pdf, other]

    cs.LG cs.CL

    GLBench: A Comprehensive Benchmark for Graph with Large Language Models

    Authors: Yuhan Li, Peisong Wang, Xiao Zhu, Aochuan Chen, Haiyun Jiang, Deng Cai, Victor Wai Kin Chan, Jia Li

    Abstract: The emergence of large language models (LLMs) has revolutionized the way we interact with graphs, leading to a new paradigm called GraphLLM. Despite the rapid development of GraphLLM methods in recent years, the progress and understanding of this field remain unclear due to the lack of a benchmark with consistent experimental protocols. To bridge this gap, we introduce GLBench, the first comprehen…

    Submitted 11 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2306.10280 by other authors