Search | arXiv e-print repository

Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang

Abstract: Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understandi… ▽ More Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks. △ Less

Submitted 5 September, 2024; originally announced September 2024.

Comments: Project page: https://yunzeman.github.io/lexicon3d , Github: https://github.com/YunzeMan/Lexicon3D

arXiv:2409.03752 [pdf, other]

Attention Heads of Large Language Models: A Survey

Authors: Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li

Abstract: Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in various tasks but remain largely as black-box systems. Consequently, their development relies heavily on data-driven approaches, limiting performance enhancement through changes in internal architecture and reasoning pathways. As a result, many researchers have begun exploring the potential internal mechanisms of LLMs, aimi… ▽ More Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in various tasks but remain largely as black-box systems. Consequently, their development relies heavily on data-driven approaches, limiting performance enhancement through changes in internal architecture and reasoning pathways. As a result, many researchers have begun exploring the potential internal mechanisms of LLMs, aiming to identify the essence of their reasoning bottlenecks, with most studies focusing on attention heads. Our survey aims to shed light on the internal reasoning processes of LLMs by concentrating on the interpretability and underlying mechanisms of attention heads. We first distill the human thought process into a four-stage framework: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. Using this framework, we systematically review existing research to identify and categorize the functions of specific attention heads. Furthermore, we summarize the experimental methodologies used to discover these special heads, dividing them into two categories: Modeling-Free methods and Modeling-Required methods. Also, we outline relevant evaluation methods and benchmarks. Finally, we discuss the limitations of current research and propose several potential future directions. Our reference list is open-sourced at \url{https://github.com/IAAR-Shanghai/Awesome-Attention-Heads}. △ Less

Submitted 5 September, 2024; originally announced September 2024.

Comments: 20 pages, 11 figures, 4 tables

arXiv:2409.03512 [pdf, other]

From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents

Authors: Jifan Yu, Zheyuan Zhang, Daniel Zhang-li, Shangqing Tu, Zhanxin Hao, Rui Miao Li, Haoxuan Li, Yuanchun Wang, Hanming Li, Linlu Gong, Jie Cao, Jiayin Lin, Jinchang Zhou, Fei Qin, Haohua Wang, Jianxiao Jiang, Lijun Deng, Yisi Zhan, Chaojun Xiao, Xusheng Dai, Xuan Yan, Nianyi Lin, Nan Zhang, Ruixin Ni, Yang Dang , et al. (8 additional authors not shown)

Abstract: Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integ… ▽ More Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integrated into this learning format, resulting in a variety of educational AI applications such as educational recommendation and intelligent tutoring. The emergence of intelligence in large language models (LLMs) has allowed for these educational enhancements to be built upon a unified foundational model, enabling deeper integration. In this context, we propose MAIC (Massive AI-empowered Course), a new form of online education that leverages LLM-driven multi-agent systems to construct an AI-augmented classroom, balancing scalability with adaptivity. Beyond exploring the conceptual framework and technical innovations, we conduct preliminary experiments at Tsinghua University, one of China's leading universities. Drawing from over 100,000 learning records of more than 500 students, we obtain a series of valuable observations and initial analyses. This project will continue to evolve, ultimately aiming to establish a comprehensive open platform that supports and unifies research, technology, and applications in exploring the possibilities of online education in the era of large model AI. We envision this platform as a collaborative hub, bringing together educators, researchers, and innovators to collectively explore the future of AI-driven online education. △ Less

Submitted 5 September, 2024; originally announced September 2024.

arXiv:2409.03393 [pdf, other]

VQ-DeepVSC: A Dual-Stage Vector Quantization Framework for Video Semantic Communication

Authors: Yongyi Miao, Zhongdang Li, Yang Wang, Die Hu, Jun Yan, Youfang Wang

Abstract: In response to the rapid growth of global videomtraffic and the limitations of traditional wireless transmission systems, we propose a novel dual-stage vector quantization framework, VQ-DeepVSC, tailored to enhance video transmission over wireless channels. In the first stage, we design the adaptive keyframe extractor and interpolator, deployed respectively at the transmitter and receiver, which i… ▽ More In response to the rapid growth of global videomtraffic and the limitations of traditional wireless transmission systems, we propose a novel dual-stage vector quantization framework, VQ-DeepVSC, tailored to enhance video transmission over wireless channels. In the first stage, we design the adaptive keyframe extractor and interpolator, deployed respectively at the transmitter and receiver, which intelligently select key frames to minimize inter-frame redundancy and mitigate the cliff-effect under challenging channel conditions. In the second stage, we propose the semantic vector quantization encoder and decoder, placed respectively at the transmitter and receiver, which efficiently compress key frames using advanced indexing and spatial normalization modules to reduce redundancy. Additionally, we propose adjustable index selection and recovery modules, enhancing compression efficiency and enabling flexible compression ratio adjustment. Compared to the joint source-channel coding (JSCC) framework, the proposed framework exhibits superior compatibility with current digital communication systems. Experimental results demonstrate that VQ-DeepVSC achieves substantial improvements in both Multi-Scale Structural Similarity (MS-SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) metrics than the H.265 standard, particularly under low channel signal-to-noise ratio (SNR) or multi-path channels, highlighting the significantly enhanced transmission capabilities of our approach. △ Less

Submitted 5 September, 2024; originally announced September 2024.

arXiv:2409.03386 [pdf, other]

Movable Antennas: Channel Measurement, Modeling, and Performance Evaluation

Authors: Yiqin Wang, Heyin Shen, Chong Han, Meixia Tao

Abstract: Since decades ago, multi-antenna has become a key enabling technology in the evolution of wireless communication systems. In contrast to conventional multi-antenna systems that contain antennas at fixed positions, position-flexible antenna systems have been proposed to fully utilize the spatial variation of wireless channels. In this paper, movable antenna (MA) systems are analyzed from channel me… ▽ More Since decades ago, multi-antenna has become a key enabling technology in the evolution of wireless communication systems. In contrast to conventional multi-antenna systems that contain antennas at fixed positions, position-flexible antenna systems have been proposed to fully utilize the spatial variation of wireless channels. In this paper, movable antenna (MA) systems are analyzed from channel measurement, modeling, position optimization to performance evaluation. First, a broadband channel measurement system with physical MAs is developed, for which the extremely high movable resolution reaches 0.02 mm. A practical two-ray model is constructed based on the channel measurement for a two-dimensional movable antenna system across 32$\times$32 planar port positions at 300 GHz. In light of the measurement results, spatial-correlated channel models for the two-dimensional MA system are proposed, which are statistically parameterized by the covariance matrix of measured channels. Finally, the signal-to-interference-and-noise ratio (SINR)-maximized position selection algorithm is proposed, which achieves 99% of the optimal performance. The performance of different MA systems in terms of spectral efficiency are evaluated and compared for both planar and linear MA systems. Extensive results demonstrate the advantage of MAs over fixed-position antennas in coping with the multi-path fading and improving the spectral efficiency by 10% in a 300 GHz measured channel. △ Less

Submitted 5 September, 2024; originally announced September 2024.

Comments: 12 pages, 31 figures

arXiv:2409.03365 [pdf, other]

Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management

Authors: Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui

Abstract: Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with the unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the hete… ▽ More Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with the unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with speedup ratio up to 71% compared to state-of-the-art training systems. △ Less

Submitted 5 September, 2024; originally announced September 2024.

arXiv:2409.03363 [pdf, other]

Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding

Authors: Cheng Wang, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang

Abstract: The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member a… ▽ More The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member and non-member contexts. While previous work suggested that member contexts provide little information due to the minor distributional shift they induce, our analysis reveals that these subtle shifts can be effectively leveraged when contrasted with non-member contexts. In this paper, we propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts through contrastive decoding, amplifying subtle differences to enhance membership inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves state-of-the-art performance on the WikiMIA benchmark and is robust against various text manipulation techniques. △ Less

Submitted 5 September, 2024; originally announced September 2024.

arXiv:2409.03346 [pdf, other]

Sketch: A Toolkit for Streamlining LLM Operations

Authors: Xin Jiang, Xiang Li, Wenjia Ma, Xuezhi Fang, Yiqun Yao, Naitong Yu, Xuying Meng, Peng Han, Jing Li, Aixin Sun, Yequan Wang

Abstract: Large language models (LLMs) represented by GPT family have achieved remarkable success. The characteristics of LLMs lie in their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model's outputs, thereby constraining the application of LLMs in various domains. In this work,… ▽ More Large language models (LLMs) represented by GPT family have achieved remarkable success. The characteristics of LLMs lie in their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model's outputs, thereby constraining the application of LLMs in various domains. In this work, we present Sketch, an innovative toolkit designed to streamline LLM operations across diverse fields. Sketch comprises the following components: (1) a suite of task description schemas and prompt templates encompassing various NLP tasks; (2) a user-friendly, interactive process for building structured output LLM services tailored to various NLP tasks; (3) an open-source dataset for output format control, along with tools for dataset construction; and (4) an open-source model based on LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting instructions. We anticipate this initiative to bring considerable convenience to LLM users, achieving the goal of ''plug-and-play'' for various applications. The components of Sketch will be progressively open-sourced at https://github.com/cofe-ai/Sketch. △ Less

Submitted 5 September, 2024; originally announced September 2024.

arXiv:2409.03271 [pdf, other]

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

Authors: Yu Wang, Shiwan Zhao, Zhihu Wang, Heyuan Huang, Ming Fan, Yubo Zhang, Zhixing Wang, Haijun Wang, Ting Liu

Abstract: The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to sub-optimal reasoning performance. To address this challenge,… ▽ More The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to sub-optimal reasoning performance. To address this challenge, we propose the \textbf{Strategic Chain-of-Thought} (SCoT), a novel methodology designed to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps. SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers. Our experiments across eight challenging reasoning datasets demonstrate significant improvements, including a 21.05\% increase on the GSM8K dataset and 24.13\% on the Tracking\_Objects dataset, respectively, using the Llama3-8b model. Additionally, we extend the SCoT framework to develop a few-shot method with automatically matched demonstrations, yielding even stronger results. These findings underscore the efficacy of SCoT, highlighting its potential to substantially enhance LLM performance in complex reasoning tasks. △ Less

Submitted 5 September, 2024; originally announced September 2024.

arXiv:2409.03223 [pdf, other]

Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion

Authors: Chenguang Zhu, Shan Gao, Huafeng Chen, Guangqian Guo, Chaowei Wang, Yaoxing Wang, Chen Shu Lei, Quanjiang Fan

Abstract: Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNN) or limited by quadratic computational complexity (Transformers), and cannot effectively extract and fuse features.… ▽ More Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNN) or limited by quadratic computational complexity (Transformers), and cannot effectively extract and fuse features. To solve this problem, we propose a dual-branch image fusion network called Tmamba. It consists of linear Transformer and Mamba, which has global modeling capabilities while maintaining linear complexity. Due to the difference between the Transformer and Mamba structures, the features extracted by the two branches carry channel and position information respectively. T-M interaction structure is designed between the two branches, using global learnable parameters and convolutional layers to transfer position and channel information respectively. We further propose cross-modal interaction at the attention level to obtain cross-modal attention. Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. Code with checkpoints will be available after the peer-review process. △ Less