-
Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge
Authors:
Shuiyun Liu,
Yuxiang Kong,
Pengcheng Guo,
Weiji Zhuang,
Peng Gao,
Yujun Wang,
Lei Xie
Abstract:
Speech has emerged as a widely embraced user interface across diverse applications. However, for individuals with dysarthria, the inherent variability in their speech poses significant challenges. This paper presents an end-to-end Pretrain-based Dual-filter Dysarthria Wake-up word Spotting (PD-DWS) system for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge. Specifically, our system improves performance from two key perspectives: audio modeling and dual-filter strategy. For audio modeling, we propose an innovative 2branch-d2v2 model based on the pre-trained data2vec2 (d2v2), which can simultaneously model automatic speech recognition (ASR) and wake-up word spotting (WWS) tasks through a unified multi-task finetuning paradigm. Additionally, a dual-filter strategy is introduced to reduce the false accept rate (FAR) while maintaining the same false reject rate (FRR). Experimental results demonstrate that our PD-DWS system achieves an FAR of 0.00321 and an FRR of 0.005, with a total score of 0.00821 on the test-B eval set, securing first place in the challenge.
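The reported total is consistent with a plain sum of the two error rates. The sketch below checks that arithmetic. It assumes the challenge score is simply FAR + FRR, which is inferred from the numbers above rather than stated in the abstract, and `total_score` is a hypothetical helper name.

```python
def total_score(far: float, frr: float) -> float:
    """Combine the false accept rate and false reject rate by addition.

    Assumption: the challenge score is FAR + FRR (inferred, not confirmed).
    """
    return far + frr


# Values reported for the test-B eval set in the abstract.
score = total_score(far=0.00321, frr=0.005)
print(f"{score:.5f}")
```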
Submitted 16 September, 2024;
originally announced September 2024.
-
Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs
Authors:
Jian Zhao,
Shenao Wang,
Yanjie Zhao,
Xinyi Hou,
Kailong Wang,
Peiming Gao,
Yuanchao Zhang,
Chen Wei,
Haoyu Wang
Abstract:
The proliferation of pre-trained models (PTMs) and datasets has led to the emergence of centralized model hubs like Hugging Face, which facilitate collaborative development and reuse. However, recent security reports have uncovered vulnerabilities and instances of malicious attacks within these platforms, highlighting growing security concerns. This paper presents the first systematic study of malicious code poisoning attacks on pre-trained model hubs, focusing on the Hugging Face platform. We conduct a comprehensive threat analysis, develop a taxonomy of model formats, and perform root cause analysis of vulnerable formats. While existing tools like Fickling and ModelScan offer some protection, they face limitations in semantic-level analysis and comprehensive threat detection. To address these challenges, we propose MalHug, an end-to-end pipeline tailored for Hugging Face that combines dataset loading script extraction, model deserialization, in-depth taint analysis, and heuristic pattern matching to detect and classify malicious code poisoning attacks in datasets and models. In collaboration with Ant Group, a leading financial technology company, we have implemented and deployed MalHug on a mirrored Hugging Face instance within their infrastructure, where it has been operational for over three months. During this period, MalHug has monitored more than 705K models and 176K datasets, uncovering 91 malicious models and 9 malicious dataset loading scripts. These findings reveal a range of security threats, including reverse shell, browser credential theft, and system reconnaissance. This work not only bridges a critical gap in understanding the security of the PTM supply chain but also provides a practical, industry-tested solution for enhancing the security of pre-trained model hubs.
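To illustrate why serialized model files can carry code, the sketch below inspects a pickle stream for opcodes that import or call objects during loading, without ever unpickling it. It is not MalHug's pipeline. It is a minimal illustration using Python's standard `pickletools` module, and the names `scan_pickle`, `SUSPICIOUS_OPCODES`, and `CallsOnLoad` are hypothetical. The demo payload is deliberately benign.

```python
import pickle
import pickletools
import textwrap
from typing import List

# Opcodes that import a callable or invoke one while a pickle is loaded.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def scan_pickle(data: bytes) -> List[str]:
    """Return the suspicious opcode names found in a pickle stream.

    The stream is only disassembled, never loaded, so nothing is executed.
    """
    return [
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(data)
        if opcode.name in SUSPICIOUS_OPCODES
    ]


class CallsOnLoad:
    """Benign stand-in: loading this object would call textwrap.dedent."""

    def __reduce__(self):
        return (textwrap.dedent, ("  hello",))


plain = pickle.dumps([1, 2, 3])        # pure data, no callables
active = pickle.dumps(CallsOnLoad())   # references and calls a function

print(scan_pickle(plain))
print(scan_pickle(active))
```

A scan like this only flags that code may run on load. Deciding whether that code is malicious needs the kind of semantic analysis the abstract describes.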
Submitted 14 September, 2024;
originally announced September 2024.
-
Towards Robust Detection of Open Source Software Supply Chain Poisoning Attacks in Industry Environments
Authors:
Xinyi Zheng,
Chen Wei,
Shenao Wang,
Yanjie Zhao,
Peiming Gao,
Yuanchao Zhang,
Kailong Wang,
Haoyu Wang
Abstract:
The exponential growth of open-source package ecosystems, particularly NPM and PyPI, has led to an alarming increase in software supply chain poisoning attacks. Existing static analysis methods struggle with high false positive rates and are easily thwarted by obfuscation and dynamic code execution techniques. While dynamic analysis approaches offer improvements, they often suffer from capturing non-package behaviors and employing simplistic testing strategies that fail to trigger sophisticated malicious behaviors. To address these challenges, we present OSCAR, a robust dynamic code poisoning detection pipeline for NPM and PyPI ecosystems. OSCAR fully executes packages in a sandbox environment, employs fuzz testing on exported functions and classes, and implements aspect-based behavior monitoring with tailored API hook points. We evaluate OSCAR against six existing tools using a comprehensive benchmark dataset of real-world malicious and benign packages. OSCAR achieves an F1 score of 0.95 in NPM and 0.91 in PyPI, confirming that OSCAR is as effective as the current state-of-the-art technologies. Furthermore, for benign packages exhibiting characteristics typical of malicious packages, OSCAR reduces the false positive rate by an average of 32.06% in NPM (from 34.63% to 2.57%) and 39.87% in PyPI (from 41.10% to 1.23%), compared to other tools, significantly reducing the workload of manual reviews in real-world deployments. In cooperation with Ant Group, a leading financial technology company, we have deployed OSCAR on its NPM and PyPI mirrors since January 2023, identifying 10,404 malicious NPM packages and 1,235 malicious PyPI packages over 18 months. This work not only bridges the gap between academic research and industrial application in code poisoning detection but also provides a robust and practical solution that has been thoroughly tested in a real-world industrial setting.
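The false positive reductions quoted above are absolute differences between the before and after rates, that is, percentage points rather than relative changes. The short check below reproduces them from the figures in the abstract. `fpr_reduction_points` is a hypothetical helper name.

```python
def fpr_reduction_points(before_pct: float, after_pct: float) -> float:
    """Absolute drop in false positive rate, in percentage points."""
    return round(before_pct - after_pct, 2)


# Figures taken from the abstract.
npm_drop = fpr_reduction_points(34.63, 2.57)
pypi_drop = fpr_reduction_points(41.10, 1.23)

print(f"NPM:  {npm_drop:.2f} points")
print(f"PyPI: {pypi_drop:.2f} points")
```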
Submitted 14 September, 2024;
originally announced September 2024.
-
Bridging Domain Gap of Point Cloud Representations via Self-Supervised Geometric Augmentation
Authors:
Li Yu,
Hongchao Zhong,
Longkun Zou,
Ke Chen,
Pan Gao
Abstract:
Recent progress in semantic point cloud analysis is largely driven by synthetic data (e.g., ModelNet and ShapeNet), which are typically complete, well-aligned, and noise-free. Therefore, representations of those ideal synthetic point clouds have limited variations in the geometric perspective and can gain good performance on a number of 3D vision tasks such as point cloud classification. In the context of unsupervised domain adaptation (UDA), representation learning designed for synthetic point clouds can hardly capture domain-invariant geometric patterns from incomplete and noisy point clouds. To address this problem, we introduce a novel scheme for inducing geometric invariance of point cloud representations across domains by regularizing representation learning with two self-supervised geometric augmentation tasks. On one hand, a novel pretext task of predicting translation distances of augmented samples is proposed to alleviate the centroid shift of point clouds caused by occlusion and noise. On the other hand, we pioneer an integration of relational self-supervised learning on geometrically augmented point clouds in a cascade manner, utilizing the intrinsic relationship of augmented variants and other samples as extra constraints on cross-domain geometric features. Experiments on the PointDA-10 dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
Submitted 10 September, 2024;
originally announced September 2024.
-
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Authors:
Qi Yang,
Binjie Mao,
Zili Wang,
Xing Nie,
Pengfei Gao,
Ying Guo,
Cheng Zhen,
Pengfei Yan,
Shiming Xiang
Abstract:
Foley is a term commonly used in filmmaking, referring to the addition of everyday sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges encompass maintaining the content consistency between the input video and the generated audio, as well as the alignment of temporal and loudness properties within the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and target video, we introduce the Mask-Attention Module (MAM), which employs masked video instruction to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure the synthesis of sound that aligns with the video in both loudness and temporal dimensions. Furthermore, we have extended a large-scale V2A dataset, named VGGSound-Caption, by annotating caption prompts. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves state-of-the-art performance. Project page: https://yannqi.github.io/Draw-an-Audio/.
Submitted 9 September, 2024;
originally announced September 2024.
-
Real-Time Human Action Recognition on Embedded Platforms
Authors:
Ruiqi Wang,
Zichen Wang,
Peiqi Gao,
Mingzhen Li,
Jaehwan Jeong,
Yihang Xu,
Yejin Lee,
Carolyn M. Baum,
Lisa Tabor Connor,
Chenyang Lu
Abstract:
With advancements in computer vision and deep learning, video-based human action recognition (HAR) has become practical. However, due to the complexity of the computation pipeline, running HAR on live video streams incurs excessive delays on embedded platforms. This work tackles the real-time performance challenges of HAR with four contributions: 1) an experimental study identifying a standard Optical Flow (OF) extraction technique as the latency bottleneck in a state-of-the-art HAR pipeline, 2) an exploration of the latency-accuracy tradeoff between the standard and deep learning approaches to OF extraction, which highlights the need for a novel, efficient motion feature extractor, 3) the design of Integrated Motion Feature Extractor (IMFE), a novel single-shot neural network architecture for motion feature extraction with drastic improvement in latency, 4) the development of RT-HARE, a real-time HAR system tailored for embedded platforms. Experimental results on an Nvidia Jetson Xavier NX platform demonstrated that RT-HARE realizes real-time HAR at a video frame rate of 30 frames per second while delivering high levels of recognition accuracy.
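A 30 frames-per-second target implies a per-frame time budget of roughly 33 ms. The sketch below makes that arithmetic explicit. It assumes frames are processed one at a time with no pipelining or batching, which the abstract does not specify, and the two latencies are made-up values for illustration only. `frame_budget_ms` and `meets_real_time` are hypothetical helper names.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def meets_real_time(latency_ms: float, fps: float = 30.0) -> bool:
    """True if a per-frame latency fits within the frame budget.

    Assumption: strictly sequential per-frame processing.
    """
    return latency_ms <= frame_budget_ms(fps)


print(f"{frame_budget_ms(30):.2f} ms per frame")

# Hypothetical per-frame latencies, not measurements from the paper.
for latency in (25.0, 40.0):
    print(latency, meets_real_time(latency))
```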
Submitted 11 September, 2024; v1 submitted 9 September, 2024;
originally announced September 2024.
-
MarsCode Agent: AI-native Automated Bug Fixing
Authors:
Yizhou Liu,
Pengfei Gao,
Xinchen Wang,
Jie Liu,
Yexuan Shi,
Zhao Zhang,
Chao Peng
Abstract:
Recent advances in large language models (LLMs) have shown significant potential to automate various software development tasks, including code completion, test generation, and bug fixing. However, the application of LLMs for automated bug fixing remains challenging due to the complexity and diversity of real-world software systems. In this paper, we introduce MarsCode Agent, a novel framework that leverages LLMs to automatically identify and repair bugs in software code. MarsCode Agent combines the power of LLMs with advanced code analysis techniques to accurately localize faults and generate patches. Our approach follows a systematic process of planning, bug reproduction, fault localization, candidate patch generation, and validation to ensure high-quality bug fixes. We evaluated MarsCode Agent on SWE-bench, a comprehensive benchmark of real-world software projects, and our results show that MarsCode Agent achieves a high success rate in bug fixing compared to most of the existing automated approaches.
Submitted 4 September, 2024; v1 submitted 1 September, 2024;
originally announced September 2024.
-
Learning linear acyclic causal model including Gaussian noise using ancestral relationships
Authors:
Ming Cai,
Penggang Gao,
Hisayuki Hara
Abstract:
This paper discusses algorithms for learning causal DAGs. The PC algorithm makes no assumptions other than the faithfulness to the causal model and can identify only up to the Markov equivalence class. LiNGAM assumes linearity and continuous non-Gaussian disturbances for the causal model, and the causal DAG defining LiNGAM is shown to be fully identifiable. The PC-LiNGAM, a hybrid of the PC algorithm and LiNGAM, can identify up to the distribution-equivalence pattern of a linear causal model, even in the presence of Gaussian disturbances. However, in the worst case, the PC-LiNGAM has factorial time complexity for the number of variables. In this paper, we propose an algorithm for learning the distribution-equivalence patterns of a linear causal model with a lower time complexity than PC-LiNGAM, using the causal ancestor finding algorithm in Maeda and Shimizu, which is generalized to account for Gaussian disturbances.
Submitted 31 August, 2024;
originally announced September 2024.
-
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
Authors:
Ziyu Guo,
Renrui Zhang,
Xiangyang Zhu,
Chengzhuo Tong,
Peng Gao,
Chunyuan Li,
Pheng-Ann Heng
Abstract:
We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. SAM2Point interprets any 3D data as a series of multi-directional videos, and leverages SAM 2 for 3D-space segmentation, without further training or 2D-3D projection. Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, outdoor environments, and raw sparse LiDAR. Demonstrations on multiple 3D datasets, e.g., Objaverse, S3DIS, ScanNet, Semantic3D, and KITTI, highlight the robust generalization capabilities of SAM2Point. To the best of our knowledge, we present the most faithful implementation of SAM in 3D, which may serve as a starting point for future research in promptable 3D segmentation. Online Demo: https://huggingface.co/spaces/ZiyuG/SAM2Point. Code: https://github.com/ZiyuGuo99/SAM2Point.
Submitted 29 August, 2024;
originally announced August 2024.
-
Diff-PCC: Diffusion-based Neural Compression for 3D Point Clouds
Authors:
Kai Liu,
Kang You,
Pan Gao
Abstract:
Stable diffusion networks have emerged as a groundbreaking development for their ability to produce realistic and detailed visual content. This characteristic renders them ideal decoders, capable of producing high-quality and aesthetically pleasing reconstructions. In this paper, we introduce the first diffusion-based point cloud compression method, dubbed Diff-PCC, to leverage the expressive power of the diffusion model for generative and aesthetically superior decoding. Different from the conventional autoencoder fashion, a dual-space latent representation is devised in this paper, in which a compressor composed of two independent encoding backbones is considered to extract expressive shape latents from distinct latent spaces. At the decoding side, a diffusion-based generator is devised to produce high-quality reconstructions by considering the shape latents as guidance to stochastically denoise the noisy point clouds. Experiments demonstrate that the proposed Diff-PCC achieves state-of-the-art compression performance (e.g., 7.711 dB BD-PSNR gains against the latest G-PCC standard at ultra-low bitrate) while attaining superior subjective quality. Source code will be made publicly available.
Submitted 20 August, 2024;
originally announced August 2024.
-
BernGraph: Probabilistic Graph Neural Networks for EHR-based Medication Recommendations
Authors:
Xihao Piao,
Pei Gao,
Zheng Chen,
Lingwei Zhu,
Yasuko Matsubara,
Yasushi Sakurai,
Jimeng Sun
Abstract:
The medical community believes binary medical event outcomes in EHR data contain sufficient information for making a sensible recommendation. However, there are two challenges to effectively utilizing such data: (1) modeling the relationship between massive 0,1 event outcomes is difficult, even with expert knowledge; (2) in practice, learning can be stalled by the binary values since the equally important 0 entries propagate no learning signals. Currently, there is a large gap between the assumed sufficient information and the reality that no promising results have been shown by utilizing solely the binary data: visiting or secondary information is often necessary to reach acceptable performance. In this paper, we attempt to build the first successful binary EHR data-oriented drug recommendation system by tackling the two difficulties, making sensible drug recommendations solely using the binary EHR medical records. To this end, we take a statistical perspective to view the EHR data as a sample from its cohorts and transform them into continuous Bernoulli probabilities. The transformed entries not only model a deterministic binary event with a distribution but also allow reflecting event-event relationship by conditional probability. A graph neural network is learned on top of the transformation. It captures event-event correlations while emphasizing event-to-patient features. Extensive results demonstrate that the proposed method achieves state-of-the-art performance on large-scale databases, outperforming baseline methods that use secondary information by a large margin. The source code is available at https://github.com/chenzRG/BEHRMecom
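As a rough illustration of how binary event records can yield event-event relationships through conditional probability, the sketch below estimates P(target = 1 | given = 1) by counting co-occurrences. This is a plain empirical frequency, not the continuous Bernoulli transformation the authors describe, and the function name and toy records are hypothetical.

```python
from typing import List


def conditional_probability(records: List[List[int]], given: int, target: int) -> float:
    """Estimate P(target = 1 | given = 1) from rows of 0/1 event outcomes."""
    rows_with_given = [row for row in records if row[given] == 1]
    if not rows_with_given:
        raise ValueError("conditioning event never occurs in the records")
    # Count how often the target event co-occurs with the conditioning event.
    both = sum(row[target] for row in rows_with_given)
    return both / len(rows_with_given)


# Toy data: each row is one patient record, each column one binary event.
records = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

print(conditional_probability(records, given=0, target=1))
```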
Submitted 10 September, 2024; v1 submitted 18 August, 2024;
originally announced August 2024.
-
RepoMasterEval: Evaluating Code Completion via Real-World Repositories
Authors:
Qinyun Wu,
Chao Peng,
Pengfei Gao,
Ruida Hu,
Haoyu Gan,
Bo Jiang,
Jinhe Tang,
Zhiwen Deng,
Zhanming Guan,
Cuiyun Gao,
Xia Liu,
Ping Yang
Abstract:
With the growing reliance on automated code completion tools in software development, the need for robust evaluation benchmarks has become critical. However, existing benchmarks focus more on code generation tasks at the function and class level and provide rich text descriptions to prompt the model. By contrast, such descriptive prompts are commonly unavailable in real development, and code completion can occur in a wider range of situations, such as in the middle of a function or a code block. These limitations make the evaluation align poorly with the practical scenarios of code completion tools. In this paper, we propose RepoMasterEval, a novel benchmark for evaluating code completion models constructed from real-world Python and TypeScript repositories. Each benchmark datum is generated by masking a code snippet (ground truth) from one source code file with existing test suites. To improve the test accuracy of model-generated code, we employ mutation testing to measure the effectiveness of the test cases, and we manually crafted new test cases for those test suites with a low mutation score. Our empirical evaluation on 6 state-of-the-art models shows that test augmentation is critical in improving the accuracy of the benchmark and that RepoMasterEval is able to report differences in model performance in real-world scenarios. The deployment of RepoMasterEval in a collaborating company for one month also revealed that the benchmark is useful for giving accurate feedback during model training and that the score is highly correlated with the model's performance in practice. Based on our findings, we call for the software engineering community to build more LLM benchmarks tailored for code generation tools, taking the practical and complex development environment into consideration.
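The mutation score mentioned above is conventionally the fraction of injected mutants that a test suite detects ("kills"). The sketch below computes it and flags suites that fall under a cutoff. The 0.8 threshold, the suite names, and the counts are hypothetical placeholders, since the abstract does not give the values the authors used.

```python
def mutation_score(killed: int, total: int) -> float:
    """Fraction of mutants killed by a test suite."""
    if total <= 0:
        raise ValueError("total number of mutants must be positive")
    return killed / total


def needs_more_tests(killed: int, total: int, threshold: float = 0.8) -> bool:
    """True if the suite's mutation score is below the (assumed) threshold."""
    return mutation_score(killed, total) < threshold


# Hypothetical suites: name -> (mutants killed, mutants generated).
suites = {"suite_a": (18, 20), "suite_b": (9, 20)}

for name, (killed, total) in suites.items():
    print(name, mutation_score(killed, total), needs_more_tests(killed, total))
```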
Submitted 6 August, 2024;
originally announced August 2024.
-
A Differential Smoothness-based Compact-Dynamic Graph Convolutional Network for Spatiotemporal Signal Recovery
Authors:
Pengcheng Gao,
Zicheng Gao,
Ye Yuan
Abstract:
High-quality spatiotemporal signals are vitally important for real application scenarios like energy management, traffic planning, and cyber security. Due to uncontrollable factors like abrupt sensor breakdowns or communication faults, the spatiotemporal signal collected by sensors is always incomplete. A dynamic graph convolutional network (DGCN) is effective for spatiotemporal signal recovery. However, it adopts a static GCN and a sequence neural network to explore the spatial and temporal patterns separately. Such separated two-step processing couples the spatial and temporal dimensions only loosely, thereby failing to capture the complex inner spatiotemporal correlation. To address this issue, this paper proposes a Compact-Dynamic Graph Convolutional Network (CDGCN) for spatiotemporal signal recovery with the following two-fold ideas: a) leveraging the tensor M-product to build a unified tensor graph convolution framework, which considers both spatial and temporal patterns simultaneously; and b) constructing a differential smoothness-based objective function to reduce the noise interference in the spatiotemporal signal, thereby further improving the recovery accuracy. Experiments on real-world spatiotemporal datasets demonstrate that the proposed CDGCN significantly outperforms the state-of-the-art models in terms of recovery accuracy.
Submitted 6 August, 2024;
originally announced August 2024.
-
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Authors:
Dongyang Liu,
Shitian Zhao,
Le Zhuo,
Weifeng Lin,
Yu Qiao,
Hongsheng Li,
Peng Gao
Abstract:
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key insight is that a simple decoder-only transformer with multimodal Generative PreTraining (mGPT), utilizing the next-token prediction objective on massive interleaved text-image sequences, can learn broad and general multimodal capabilities, thereby illuminating photorealistic text-to-image generation. Building on these pretrained models, we propose Flexible Progressive Supervised Finetuning (FP-SFT) on high-quality image-text pairs to fully unlock their potential for high-aesthetic image synthesis at any resolution while maintaining their general multimodal capabilities. Furthermore, we introduce Ominiponent Supervised Finetuning (Omni-SFT), transforming Lumina-mGPT into a foundation model that seamlessly achieves omnipotent task unification. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like flexible text-to-image generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multiturn visual question answering. Additionally, we analyze the differences and similarities between diffusion-based and autoregressive methods in a direct comparison.
Submitted 5 August, 2024;
originally announced August 2024.
-
Non-Overlapping Placement of Macro Cells based on Reinforcement Learning in Chip Design
Authors:
Tao Yu,
Peng Gao,
Fei Wang,
Ru-Yue Yuan
Abstract:
Due to the increasing complexity of chip design, existing placement methods still have many shortcomings in dealing with macro cell coverage and optimization efficiency. Aiming at the problems of layout overlap, inferior performance, and low optimization efficiency in existing chip design methods, this paper proposes an end-to-end placement method, SRLPlacer, based on reinforcement learning. First, the placement problem is transformed into a Markov decision process by establishing a coupling relationship graph model between macro cells to learn the strategy for optimizing layouts. Second, the whole placement process is optimized after integrating the standard cell layout. Evaluated on the public benchmark ISPD2005, the proposed SRLPlacer can effectively solve the overlap problem between macro cells while considering routing congestion and shortening the total wire length to ensure routability.
Submitted 30 July, 2024; v1 submitted 26 July, 2024;
originally announced July 2024.
-
CSWin-UNet: Transformer UNet with Cross-Shaped Windows for Medical Image Segmentation
Authors:
Xiao Liu,
Peng Gao,
Tao Yu,
Fei Wang,
Ru-Yue Yuan
Abstract:
Deep learning, especially convolutional neural networks (CNNs) and Transformer architectures, has become the focus of extensive research in medical image segmentation, achieving impressive results. However, CNNs come with inductive biases that limit their effectiveness in more complex, varied segmentation scenarios. Conversely, while Transformer-based methods excel at capturing global and long-range semantic details, they suffer from high computational demands. In this study, we propose CSWin-UNet, a novel U-shaped segmentation method that incorporates the CSWin self-attention mechanism into the UNet to facilitate horizontal and vertical stripe self-attention. This method significantly enhances both computational efficiency and receptive field interactions. Additionally, our innovative decoder utilizes a content-aware reassembly operator that strategically reassembles features, guided by predicted kernels, for precise image resolution restoration. Our extensive empirical evaluations on diverse datasets, including synapse multi-organ CT, cardiac MRI, and skin lesions, demonstrate that CSWin-UNet maintains low model complexity while delivering high segmentation accuracy.
Submitted 12 August, 2024; v1 submitted 25 July, 2024;
originally announced July 2024.
-
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Authors:
Yuxiang Chai,
Siyuan Huang,
Yazhe Niu,
Han Xiao,
Liang Liu,
Dingyu Zhang,
Peng Gao,
Shuai Ren,
Hongsheng Li
Abstract:
AI agents have drawn increasing attention, mostly for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents. The proposed dataset is used to train and evaluate their ability to complete complex tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels. Unlike existing mobile device-control datasets, e.g., MoTIF and AitW, AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions, each averaging 13 steps with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we develop a baseline model, SPHINX Agent, and compare its performance against state-of-the-art agents trained on other datasets. To facilitate further research, we open-source our dataset, models, and relevant evaluation tools. The project is available at https://yuxiangchai.github.io/AMEX/
Submitted 3 July, 2024;
originally announced July 2024.
-
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Authors:
Mengzhao Chen,
Wenqi Shao,
Peng Xu,
Jiahao Wang,
Peng Gao,
Kaipeng Zhang,
Yu Qiao,
Ping Luo
Abstract:
Large language models (LLMs) are integral to modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it demands substantial training resources to optimize model weights and quantization parameters. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a novel quantization technique for compressing LLMs. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). Block-AP sequentially conducts quantization-aware training for all parameters in each transformer block with block-wise reconstruction, maintaining efficiency by avoiding training the entire LLM. Initialized with the quantized model, E2E-QP then trains only the quantization parameters (step sizes) end-to-end, enhancing efficiency with a fixed quantized backbone and a reduced trainable parameter count. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3% accuracy degradation compared to the full-precision model (69.48 vs. 72.41). Notably, this INT2 quantized 70B model obtains a 1.67 accuracy gain over the Llama-2-13B model (69.48 vs. 67.81) while requiring less memory (19.2GB vs. 24.2GB). Code is available at https://github.com/OpenGVLab/EfficientQAT.
Submitted 10 July, 2024;
originally announced July 2024.
-
OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting
Authors:
Penglei Gao,
Kai Yao,
Tiandi Ye,
Steven Wang,
Yuan Yao,
Xiaofeng Wang
Abstract:
In this paper, we tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images that could be taken from a single camera or cellphone. This task aims to predict the reasonable and consistent surroundings from the NFoV images. Existing methods for feature extraction and fusion, often built with transformer-based architectures, incur substantial memory usage and computational expense. They also have limitations in maintaining visual continuity across the entire 360-degree images, which could cause inconsistent texture and style generation. To solve the aforementioned issues, we propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba to utilize its long-sequence modelling and spatial continuity. Furthermore, incorporating textual information is an effective strategy for guiding image generation, enriching the process with detailed context and increasing diversity. Efficiently extracting textual features and integrating them with image attributes presents a significant challenge for 360-degree image out-painting. To address this, we develop two modules, Visual-textual Consistency Refiner (VCR) and Global-local Mamba Adapter (GMA). VCR enhances contextual richness by fusing the modified text features with the image features, while GMA provides adaptive state-selective conditions by capturing the information flow from global to local representations. Our proposed method achieves state-of-the-art performance with extensive experiments on two broadly used 360-degree image datasets, including indoor and outdoor settings.
Submitted 15 July, 2024;
originally announced July 2024.
-
Global Attention-Guided Dual-Domain Point Cloud Feature Learning for Classification and Segmentation
Authors:
Zihao Li,
Pan Gao,
Kang You,
Chuan Yan,
Manoranjan Paul
Abstract:
Previous studies have demonstrated the effectiveness of point-based neural models on the point cloud analysis task. However, producing an efficient input embedding for raw point coordinates remains a crucial issue. Another issue lies in the limited efficiency of neighboring aggregations, which are a critical component of the network stem. In this paper, we propose a Global Attention-guided Dual-domain Feature Learning network (GAD) to address these issues. We first devise the Contextual Position-enhanced Transformer (CPT) module, which is armed with an improved global attention mechanism, to produce a global-aware input embedding that guides subsequent aggregations. Then, the Dual-domain K-nearest neighbor Feature Fusion (DKFF) is cascaded to conduct effective feature aggregation through novel dual-domain feature learning that captures both local geometric relations and long-distance semantic connections. Extensive experiments on multiple point cloud analysis tasks (e.g., classification, part segmentation, and scene semantic segmentation) demonstrate the superior performance of the proposed method and the efficacy of the devised modules.
Submitted 12 July, 2024;
originally announced July 2024.
-
MAVIS: Mathematical Visual Instruction Tuning
Authors:
Renrui Zhang,
Xinyu Wei,
Dongzhi Jiang,
Yichi Zhang,
Ziyu Guo,
Chengzhuo Tong,
Jiaming Liu,
Aojun Zhou,
Bin Wei,
Shanghang Zhang,
Peng Gao,
Hongsheng Li
Abstract:
Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, the mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This draws forth an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, involving a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Second, we utilize MAVIS-Caption to align the CLIP-Math with a large language model (LLM) by a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, including 900K meticulously collected and annotated visual math problems, which is adopted to finally instruct-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem, and minimize textual redundancy, thereby concentrating the model towards the visual elements. Data and Models are released at https://github.com/ZrrSkywalker/MAVIS
Submitted 11 July, 2024;
originally announced July 2024.
-
Multi-User Localization and Tracking with Spatiotemporal Correlation in Multi-RIS-Assisted Systems
Authors:
Ronghua Peng,
Peng Gao,
Jing You,
Lixiang Lian
Abstract:
As a promising technique, reconfigurable intelligent surfaces (RISs) exhibit tremendous potential for high-accuracy positioning. In this paper, we investigate the multi-user localization and tracking problem in multi-RIS-assisted systems. In particular, we incorporate the statistical spatiotemporal correlation of multi-user locations and develop a general spatiotemporal Markov random field model (ST-MRF) to capture multi-user dynamic motion states. To achieve superior performance, a novel multi-user tracking algorithm based on Bayesian inference is proposed to effectively utilize the correlation among users. In addition, because RIS configuration is necessary for tracking performance, we further propose a predictive RIS beamforming optimization scheme via semidefinite relaxation (SDR). Finally, we confirm that the proposed strategy, which alternates between the tracking algorithm and RIS optimization, achieves significant performance gains over benchmark schemes.
Submitted 14 June, 2024;
originally announced July 2024.
-
VEnhancer: Generative Space-Time Enhancement for Video Generation
Authors:
Jingwen He,
Tianfan Xue,
Dongyang Liu,
Xinqi Lin,
Peng Gao,
Dahua Lin,
Yu Qiao,
Wanli Ouyang,
Ziwei Liu
Abstract:
We present VEnhancer, a generative space-time enhancement framework that improves existing text-to-video results by adding more detail in the spatial domain and synthetic detailed motion in the temporal domain. Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously with arbitrary up-sampling space and time scales through a unified video diffusion model. Furthermore, VEnhancer effectively removes spatial artifacts and temporal flickering from generated videos. To achieve this, building on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low-frame-rate and low-resolution videos. To effectively train this video ControlNet, we design space-time data augmentation as well as video-aware conditioning. Benefiting from these designs, VEnhancer is stable during training and can be trained end to end. Extensive experiments show that VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos. Moreover, with VEnhancer, the existing open-source state-of-the-art text-to-video method, VideoCrafter-2, reaches first place on the video generation benchmark VBench.
Submitted 10 July, 2024;
originally announced July 2024.
-
Invisible sweat sensor: ultrathin membrane mimics skin for stress monitoring
Authors:
Yuchen Feng,
Andreas Kenny Oktavius,
Reno Adley Prawoto,
Hing Ni Ko,
Qiao Gu,
Ping Gao
Abstract:
Epidermal skin sensors have emerged as a promising approach for continuous and noninvasive monitoring of vital health signals, but to maximize their performance, these sensors must integrate seamlessly with the skin, minimizing impedance while maintaining the skin's natural protective and regulatory functions. In this study, we introduce an imperceptible sweat sensor that achieves this seamless skin integration through interpenetrating networks formed by a porous, ultra-thin, ultra-high molecular weight polyethylene (UHMWPE) nanomembrane. Upon attachment to the skin by van der Waals force, the amphiphilic sweat extrudates infuse into the interconnected nanopores inside the hydrophobic UHMWPE nanomembrane, forming "pseudo skin" nanochannels for continuous sweat perspiration. This integration is further enhanced by the osmotic pressure generated during water evaporation. Leveraging the efficient transport of biomarkers through the "skin" channels within the porous membrane, we developed an organic electrochemical transducer (OECT) cortisol sensor via in-situ synthesis of a molecularly imprinted polymer (MIP) and poly(3,4-ethylenedioxythiophene) (PEDOT) within the nanomembrane. This demonstrates the capability to detect cortisol concentrations from 0.05 to 0.5 μM for seamless monitoring of stress levels. This work represents a significant advancement in self-adhesive sweat sensors that offer imperceptible and real-time non-invasive health monitoring capabilities.
Submitted 10 July, 2024;
originally announced July 2024.
-
DSMix: Distortion-Induced Sensitivity Map Based Pre-training for No-Reference Image Quality Assessment
Authors:
Jinsong Shi,
Pan Gao,
Xiaojiang Peng,
Jie Qin
Abstract:
Image quality assessment (IQA) has long been a fundamental challenge in image understanding. In recent years, deep learning-based IQA methods have shown promising performance. However, the lack of large amounts of labeled data in the IQA field has hindered further advancements in these methods. This paper introduces DSMix, a novel data augmentation technique specifically designed for IQA tasks, aiming to overcome this limitation. DSMix leverages the distortion-induced sensitivity map (DSM) of an image as prior knowledge. It applies cut and mix operations to diverse categories of synthetic distorted images, assigning confidence scores to class labels based on the aforementioned prior knowledge. In the pre-training phase using DSMix-augmented data, knowledge distillation is employed to enhance the model's ability to extract semantic features. Experimental results on both synthetic and authentic IQA datasets demonstrate the significant predictive and generalization performance achieved by DSMix, without requiring fine-tuning of the full model. Code is available at https://github.com/I2-Multimedia-Lab/DSMix.
Submitted 4 July, 2024;
originally announced July 2024.
-
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
Authors:
Le Zhuo,
Ruoyi Du,
Han Xiao,
Yangguang Li,
Dongyang Liu,
Rongjie Huang,
Wenze Liu,
Lirui Zhao,
Fu-Yun Wang,
Zhanyu Ma,
Xu Luo,
Zehan Wang,
Kaipeng Zhang,
Xiangyang Zhu,
Si Liu,
Xiangyu Yue,
Dingning Liu,
Wanli Ouyang,
Ziwei Liu,
Yu Qiao,
Hongsheng Li,
Peng Gao
Abstract:
Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all code and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.
Submitted 5 June, 2024;
originally announced June 2024.
-
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
Authors:
Jiangshu Du,
Yibo Wang,
Wenting Zhao,
Zhongfen Deng,
Shuaiqi Liu,
Renze Lou,
Henry Peng Zou,
Pranav Narayanan Venkit,
Nan Zhang,
Mukund Srinath,
Haoran Ranran Zhang,
Vipul Gupta,
Yinghui Li,
Tao Li,
Fei Wang,
Qin Liu,
Tianlin Liu,
Pengzhi Gao,
Congying Xia,
Chen Xing,
Jiayang Cheng,
Zhaowei Wang,
Ying Su,
Raj Sanjay Shah,
Ruohao Guo
, et al. (15 additional authors not shown)
Abstract:
This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload?
This study focuses on how LLMs can assist NLP researchers, particularly examining the effectiveness of LLMs in assisting paper (meta-)reviewing and the recognizability of their output. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) "deficiency" labels and corresponding explanations for individual segments of each review, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) "LLMs as Reviewers": how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) "LLMs as Metareviewers": how effectively can LLMs identify potential issues, such as deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
Submitted 25 June, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
Meta Reasoning for Large Language Models
Authors:
Peizhong Gao,
Ao Xie,
Shaoguang Mao,
Wenshan Wu,
Yan Xia,
Haipeng Mi,
Furu Wei
Abstract:
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs) inspired by human meta-reasoning. Traditional in-context learning-based reasoning techniques, such as Tree-of-Thoughts, show promise but lack consistent state-of-the-art performance across diverse tasks due to their specialized nature. MRP addresses this limitation by guiding LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task, optimizing both performance and computational efficiency. With MRP, LLM reasoning operates in two phases. Initially, the LLM identifies the most appropriate reasoning method using task input cues and objective descriptions of available methods. Subsequently, it applies the chosen method to complete the task. This dynamic strategy mirrors human meta-reasoning, allowing the model to excel in a wide range of problem domains. We evaluate the effectiveness of MRP through comprehensive benchmarks. The results demonstrate that MRP achieves or approaches state-of-the-art performance across diverse tasks. MRP represents a significant advancement in enabling LLMs to identify cognitive challenges across problems and leverage benefits across different reasoning approaches, enhancing their ability to handle diverse and complex problem domains efficiently. Every LLM deserves a Meta-Reasoning Prompting to unlock its full potential and ensure adaptability in an ever-evolving landscape of challenges and applications.
Submitted 17 June, 2024;
originally announced June 2024.
-
A3VLM: Actionable Articulation-Aware Vision Language Model
Authors:
Siyuan Huang,
Haonan Chang,
Yuhan Liu,
Yimeng Zhu,
Hao Dong,
Peng Gao,
Abdeslam Boularias,
Hongsheng Li
Abstract:
Vision Language Models (VLMs) have received significant attention in recent years in the robotics community. VLMs are shown to be able to perform complex visual reasoning and scene understanding tasks, which makes them regarded as a potential universal solution for general robotics problems such as manipulation and navigation. However, previous VLMs for robotics such as RT-1, RT-2, and ManipLLM have focused on directly learning robot-centric actions. Such approaches require collecting a significant amount of robot interaction data, which is extremely costly in the real world. Thus, we propose A3VLM, an object-centric, actionable, articulation-aware vision language model. A3VLM focuses on the articulation structure and action affordances of objects. Its representation is robot-agnostic and can be translated into robot actions using simple action primitives. Extensive experiments in both simulation benchmarks and real-world settings demonstrate the effectiveness and stability of A3VLM. We release our code and other materials at https://github.com/changhaonan/A3VLM.
Submitted 13 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network
Authors:
Sizhe Zheng,
Pan Gao,
Peng Zhou,
Jie Qin
Abstract:
Style transfer aims to render an image with the artistic features of a style image, while maintaining the original structure. Various methods have been put forward for this task, but some challenges still exist. For instance, it is difficult for CNN-based methods to handle global information and long-range dependencies between input images, for which transformer-based methods have been proposed. Although transformers can better model the relationship between content and style images, they require high-cost hardware and time-consuming inference. To address these issues, we design a novel transformer model that includes only the encoder, thus significantly reducing the computational cost. In addition, we find that existing style transfer methods may produce images that are under-stylized or missing content. To achieve better stylization, we design a content feature extractor and a style feature extractor, based on which pure content and style images can be fed to the transformer. Finally, we propose a novel network termed Puff-Net, i.e., pure content and style feature fusion network. Through qualitative and quantitative experiments, we demonstrate the advantages of our model compared to state-of-the-art ones in the literature.
Submitted 30 May, 2024;
originally announced May 2024.
-
Phased Consistency Model
Authors:
Fu-Yun Wang,
Zhaoyang Huang,
Alexander William Bergman,
Dazhong Shen,
Peng Gao,
Michael Lingelbach,
Keqiang Sun,
Weikang Bian,
Guanglu Song,
Yu Liu,
Hongsheng Li,
Xiaogang Wang
Abstract:
The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a., LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1--16 step generation settings. While PCM is specifically designed for multi-step refinement, its 1-step generation results are superior or comparable to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/.
Submitted 28 May, 2024;
originally announced May 2024.
-
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
Authors:
Xudong Lu,
Aojun Zhou,
Yuhui Xu,
Renrui Zhang,
Peng Gao,
Hongsheng Li
Abstract:
Large Language Models (LLMs) have become pivotal in advancing the field of artificial intelligence, yet their immense sizes pose significant challenges for both fine-tuning and deployment. Current post-training pruning methods, while reducing the sizes of LLMs, often fail to maintain their original performance. To address these challenges, this paper introduces SPP, a Sparsity-Preserved Parameter-efficient fine-tuning method. Different from existing post-training pruning approaches that struggle with performance retention, SPP employs lightweight learnable column and row matrices to optimize sparse LLM weights, keeping the structure and sparsity of pruned pre-trained models intact. Through element-wise multiplication and residual addition, SPP keeps the model's sparsity pattern and ratio consistent during both training and weight merging. We demonstrate the effectiveness of SPP by applying it to the LLaMA and LLaMA-2 model families with recent post-training pruning methods. Our results show that SPP significantly enhances the performance of models with different sparsity patterns (i.e., unstructured and N:M sparsity), especially for those with high sparsity ratios (e.g., 75%), making it a promising solution for the efficient fine-tuning of sparse LLMs. Code will be made available at https://github.com/Lucky-Lance/SPP.
Submitted 25 May, 2024;
originally announced May 2024.
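The SPP abstract above says that element-wise multiplication and residual addition keep the sparsity pattern of pruned weights intact. The following is a minimal sketch of why that holds, not the authors' implementation: it assumes a single column vector `col` and a single row vector `row` (hypothetical names and shapes), and merges them as W + W * (col outer row). Because every update is multiplied by the original weight, entries pruned to zero stay zero.

```python
def spp_style_merge(weight, col, row):
    """Merge a hypothetical column/row update into a pruned weight matrix.

    Each entry becomes w + w * (col[i] * row[j]): an element-wise product
    with the original weight, added back as a residual. Zero entries
    therefore remain exactly zero after merging.
    """
    merged = []
    for i, weight_row in enumerate(weight):
        merged.append([w + w * (col[i] * row[j]) for j, w in enumerate(weight_row)])
    return merged


def sparsity_mask(matrix):
    """Return True where an entry is nonzero, False where it is pruned."""
    return [[x != 0.0 for x in matrix_row] for matrix_row in matrix]


# Toy pruned weight matrix (2 x 3) with three zeroed entries.
pruned = [[0.5, 0.0, -1.0],
          [0.0, 2.0, 0.0]]
col = [0.1, -0.2]       # assumed learnable column parameters, one per output row
row = [0.3, 0.0, 0.5]   # assumed learnable row parameters, one per input column

merged = spp_style_merge(pruned, col, row)
print(sparsity_mask(merged) == sparsity_mask(pruned))
```

A plain additive low-rank update (W + col outer row) would instead fill in the pruned positions, which is the failure mode this multiplicative form avoids.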
-
P4Control: Line-Rate Cross-Host Attack Prevention via In-Network Information Flow Control Enabled by Programmable Switches and eBPF
Authors:
Osama Bajaber,
Bo Ji,
Peng Gao
Abstract:
Modern targeted attacks such as Advanced Persistent Threats use multiple hosts as stepping stones and move laterally across them to gain deeper access to the network. However, existing defenses lack end-to-end information flow visibility across hosts and cannot block cross-host attack traffic in real time. In this paper, we propose P4Control, a network defense system that precisely confines end-to-end information flows in a network and prevents cross-host attacks at line rate. P4Control introduces a novel in-network decentralized information flow control (DIFC) mechanism and is the first work that enforces DIFC at the network level at network line rate. This is achieved through: (1) an in-network primitive based on programmable switches for tracking inter-host information flows and enforcing line-rate DIFC policies; (2) a lightweight eBPF-based primitive deployed on hosts for tracking intra-host information flows. P4Control also provides an expressive policy framework for specifying DIFC policies against different attack scenarios. We conduct extensive evaluations to show that P4Control can effectively prevent cross-host attacks in real time, while maintaining line-rate network performance and imposing minimal overhead on the network and host machines. It is also noteworthy that P4Control can facilitate the realization of a zero trust architecture through its fine-grained least-privilege network access control.
Submitted 23 May, 2024;
originally announced May 2024.
-
TerDiT: Ternary Diffusion Models with Transformers
Authors:
Xudong Lu,
Aojun Zhou,
Ziyi Lin,
Qi Liu,
Yuhui Xu,
Renrui Zhang,
Yafei Wen,
Shuai Ren,
Peng Gao,
Junchi Yan,
Hongsheng Li
Abstract:
Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion models based on transformer architecture (DiTs). Among these diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boasting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their large parameter counts. Although existing research has explored efficient deployment techniques for diffusion models such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, in this paper, we propose TerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion models with transformers. We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B. Our work contributes to the exploration of efficient deployment strategies for large-scale DiT models, demonstrating the feasibility of training extremely low-bit diffusion transformer models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code will be available at https://github.com/Lucky-Lance/TerDiT.
Submitted 23 May, 2024;
originally announced May 2024.
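The TerDiT abstract above refers to ternarizing network weights. As a rough illustration of what a ternary weight looks like, the sketch below uses a common absmean scheme (scale by the mean absolute value, round, clip to {-1, 0, +1}). This particular quantizer is an assumption chosen for illustration; the abstract does not state which one the paper uses, and quantization-aware training would additionally need a gradient estimator that is not shown here.

```python
def ternarize(weights, eps=1e-8):
    """Map floats to codes in {-1, 0, +1} plus one shared scale.

    Assumed scheme: divide by the mean absolute weight, round to the
    nearest integer, then clip to the range [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from ternary codes."""
    return [c * scale for c in codes]


weights = [0.9, -0.05, 0.4, -1.2]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)
print(codes, scale, approx)
```

Each weight then needs under two bits of storage plus a shared scale, which is where the deployment savings for large models would come from.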
-
Dynamic Identity-Guided Attention Network for Visible-Infrared Person Re-identification
Authors:
Peng Gao,
Yujian Lee,
Hui Zhang,
Xubo Liu,
Yiyang Hu,
Guquan Jing
Abstract:
Visible-infrared person re-identification (VI-ReID) aims to match people with the same identity between visible and infrared modalities. VI-ReID is a challenging task due to the large differences in individual appearance under different modalities. Existing methods generally try to bridge the cross-modal differences at the image or feature level, but they do not explore discriminative embeddings. Effectively minimizing these cross-modal discrepancies relies on obtaining representations that are guided by identity and consistent across modalities, while also filtering out representations that are irrelevant to identity. To address these challenges, we introduce a dynamic identity-guided attention network (DIAN) to mine identity-guided and modality-consistent embeddings, effectively bridging the gap between different modalities. Specifically, in DIAN, to pursue a semantically richer representation, we first use orthogonal projection to fuse the features from two connected coarse and fine layers. Furthermore, we use dynamic convolution kernels to mine identity-guided and modality-consistent representations. More notably, a cross embedding balancing loss is introduced to effectively bridge cross-modal discrepancies using the above embeddings. Experimental results on the SYSU-MM01 and RegDB datasets show that DIAN achieves state-of-the-art performance. Specifically, for indoor search on SYSU-MM01, our method achieves 86.28% rank-1 accuracy and 87.41% mAP, respectively. Our code will be available soon.
Submitted 22 July, 2024; v1 submitted 21 May, 2024;
originally announced May 2024.
-
Detecting Complex Multi-step Attacks with Explainable Graph Neural Network
Authors:
Wei Liu,
Peng Gao,
Haotian Zhang,
Ke Li,
Weiyong Yang,
Xingshen Wei,
Jiwu Shu
Abstract:
Complex multi-step attacks have caused significant damage to numerous critical infrastructures. To detect such attacks, graph neural network-based methods have shown promising results by modeling the system's events as a graph. However, existing methods still face several challenges when deployed in practice. First, there is a lack of sufficient real attack data, especially considering the large volume of normal data. Second, the modeling of event graphs is challenging due to their dynamic and heterogeneous nature. Third, the lack of explanation in learning models undermines the trustworthiness of such methods in production environments. To address the above challenges, in this paper, we propose an attack detection method, Trace2Vec. The approach first designs an erosion function to augment rare attack samples, and integrates them into the event graphs. Next, it models the event graphs via a continuous-time dynamic heterogeneous graph neural network. Finally, it employs the Monte Carlo tree search algorithm to identify events with greater contributions to the attack, thus enhancing the explainability of the detection result. We have implemented a prototype for Trace2Vec, and the experimental evaluations demonstrate its superior detection and explanation performance compared to existing methods.
Submitted 13 June, 2024; v1 submitted 18 May, 2024;
originally announced May 2024.
-
CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution
Authors:
Qingguo Liu,
Chenyi Zhuang,
Pan Gao,
Jie Qin
Abstract:
Existing Blind image Super-Resolution (BSR) methods focus on estimating either kernel or degradation information, but have long overlooked the essential content details. In this paper, we propose a novel BSR approach, Content-aware Degradation-driven Transformer (CDFormer), to capture both degradation and content representations. However, low-resolution images cannot provide enough content details, and thus we introduce a diffusion-based module $CDFormer_{diff}$ to first learn Content Degradation Prior (CDP) in both low- and high-resolution images, and then approximate the real distribution given only low-resolution information. Moreover, we apply an adaptive SR network $CDFormer_{SR}$ that effectively utilizes CDP to refine features. Compared to previous diffusion-based SR methods, we treat the diffusion model as an estimator that can overcome the limitations of expensive sampling time and excessive diversity. Experiments show that CDFormer can outperform existing methods, establishing a new state-of-the-art performance on various benchmarks under blind settings. Codes and models will be available at https://github.com/I2-Multimedia-Lab/CDFormer.
Submitted 30 June, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Sparse Sampling is All You Need for Fast Wrong-way Cycling Detection in CCTV Videos
Authors:
Jing Xu,
Wentao Shi,
Sheng Ren,
Pan Gao,
Peng Zhou,
Jie Qin
Abstract:
In the field of transportation, it is of paramount importance to address and mitigate illegal actions committed by both motor and non-motor vehicles. Among those actions, wrong-way cycling (i.e., riding a bicycle or e-bike in the opposite direction of the designated traffic flow) poses significant risks to both cyclists and other road users. To this end, this paper formulates a problem of detecting wrong-way cycling ratios in CCTV videos. Specifically, we propose a sparse sampling method called WWC-Predictor to efficiently solve this problem, addressing the inefficiencies of direct tracking methods. Our approach leverages both detection-based information, which utilizes the information from bounding boxes, and orientation-based information, which provides insights into the image itself, to enhance instantaneous information capture capability. On our proposed benchmark dataset consisting of 35 minutes of video sequences and minute-level annotation, our method achieves an average error rate of only 1.475% while taking only 19.12% of the GPU time of straightforward tracking methods under the same detection model. This performance demonstrates the effectiveness of our approach in identifying and predicting instances of wrong-way cycling.
Submitted 12 May, 2024;
originally announced May 2024.
-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
Authors:
Peng Gao,
Le Zhuo,
Dongyang Liu,
Ruoyi Du,
Xu Luo,
Longtian Qiu,
Yuhang Zhang,
Chen Lin,
Rongjie Huang,
Shijie Geng,
Renrui Zhang,
Junlin Xi,
Wenqi Shao,
Zhengkai Jiang,
Tianshuo Yang,
Weicai Ye,
He Tong,
Jingwen He,
Yu Qiao,
Hongsheng Li
Abstract:
Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.
Submitted 13 June, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
Authors:
Zehan Wang,
Ziang Zhang,
Xize Cheng,
Rongjie Huang,
Luping Liu,
Zhenhui Ye,
Haifeng Huang,
Yang Zhao,
Tao Jin,
Peng Gao,
Zhou Zhao
Abstract:
Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and catastrophic forgetting problems make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units, and freely augments a pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) Space Displacement Bond and 2) Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB, and InternVL_IB++. These resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, it even surpasses the advanced audio-text and image-text expert spaces.
Submitted 10 May, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape
Authors:
Sifat Muhammad Abdullah,
Aravind Cheruvu,
Shravya Kanchi,
Taejoong Chung,
Peng Gao,
Murtuza Jadliwala,
Bimal Viswanath
Abstract:
Deepfake or synthetic images produced using deep generative models pose serious risks to online platforms. This has triggered several research efforts to accurately detect deepfake images, achieving excellent performance on publicly available deepfake datasets. In this work, we study 8 state-of-the-art detectors and argue that they are far from being ready for deployment due to two recent developments. First, the emergence of lightweight methods to customize large generative models can enable an attacker to create many customized generators (to create deepfakes), thereby substantially increasing the threat surface. We show that existing defenses fail to generalize well to such user-customized generative models that are publicly available today. We discuss new machine learning approaches based on content-agnostic features, and ensemble modeling to improve generalization performance against user-customized models. Second, the emergence of vision foundation models (machine learning models trained on broad data that can be easily adapted to several downstream tasks) can be misused by attackers to craft adversarial deepfakes that can evade existing defenses. We propose a simple adversarial attack that leverages existing foundation models to craft adversarial samples without adding any adversarial noise, through careful semantic manipulation of the image content. We highlight the vulnerabilities of several defenses against our attack, and explore directions leveraging advanced foundation models and adversarial training to defend against this new threat.
Submitted 24 April, 2024;
originally announced April 2024.
-
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Authors:
Kaining Ying,
Fanqing Meng,
Jin Wang,
Zhiqian Li,
Han Lin,
Yue Yang,
Hao Zhang,
Wenbo Zhang,
Yuqi Lin,
Shuo Liu,
Jiayi Lei,
Quanfeng Lu,
Runjian Chen,
Peng Xu,
Renrui Zhang,
Haozhe Zhang,
Peng Gao,
Yali Wang,
Yu Qiao,
Ping Luo,
Kaipeng Zhang,
Wenqi Shao
Abstract:
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 30 LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
Submitted 24 April, 2024;
originally announced April 2024.
-
Unified Unsupervised Salient Object Detection via Knowledge Transfer
Authors:
Yao Yuan,
Wutao Liu,
Pan Gao,
Qun Dai,
Jie Qin
Abstract:
Recently, unsupervised salient object detection (USOD) has gained increasing attention due to its annotation-free nature. However, current methods mainly focus on specific tasks such as RGB and RGB-D, neglecting the potential for task migration. In this paper, we propose a unified USOD framework for generic USOD tasks. Firstly, we propose a Progressive Curriculum Learning-based Saliency Distilling (PCL-SD) mechanism to extract saliency cues from a pre-trained deep network. This mechanism starts with easy samples and progressively moves towards harder ones, to avoid initial interference caused by hard samples. Afterwards, the obtained saliency cues are utilized to train a saliency detector, and we employ a Self-rectify Pseudo-label Refinement (SPR) mechanism to improve the quality of pseudo-labels. Finally, an adapter-tuning method is devised to transfer the acquired saliency knowledge, leveraging shared knowledge to attain superior transferring performance on the target tasks. Extensive experiments on five representative SOD tasks confirm the effectiveness and feasibility of our proposed method. Code and supplementary materials are available at https://github.com/I2-Multimedia-Lab/A2S-v3.
Submitted 13 July, 2024; v1 submitted 23 April, 2024;
originally announced April 2024.
-
Pointsoup: High-Performance and Extremely Low-Decoding-Latency Learned Geometry Codec for Large-Scale Point Cloud Scenes
Authors:
Kang You,
Kai Liu,
Li Yu,
Pan Gao,
Dandan Ding
Abstract:
Despite considerable progress being achieved in point cloud geometry compression, there still remains a challenge in effectively compressing large-scale scenes with sparse surfaces. Another key challenge lies in reducing decoding latency, a crucial requirement in real-world applications. In this paper, we propose Pointsoup, an efficient learning-based geometry codec that attains high performance and extremely low decoding latency simultaneously. Inspired by the conventional Trisoup codec, a point model-based strategy is devised to characterize local surfaces. Specifically, skin features are embedded from local windows via an attention-based encoder, and dilated windows are introduced as cross-scale priors to infer the distribution of quantized features in parallel. During decoding, features undergo fast refinement, followed by a folding-based point generator that reconstructs point coordinates with fairly fast speed. Experiments show that Pointsoup achieves state-of-the-art performance on multiple benchmarks with significantly lower decoding complexity, i.e., up to 90-160× faster than the G-PCCv23 Trisoup decoder on a comparatively low-end platform (e.g., one RTX 2080Ti). Furthermore, it offers variable-rate control with a single neural model (2.9MB), which is attractive for industrial practitioners.
Submitted 21 April, 2024;
originally announced April 2024.
-
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Authors:
Yiwen Tang,
Ray Zhang,
Jiaming Liu,
Zoey Guo,
Dong Wang,
Zhigang Wang,
Bin Zhao,
Shanghang Zhang,
Peng Gao,
Hongsheng Li,
Xuelong Li
Abstract:
Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in widespread scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models, lacking a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token a positional encoding paired with the pre-trained model, which avoids 3D geometry loss caused by the true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method. Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point.
Submitted 30 May, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
Efficient and Generic Point Model for Lossless Point Cloud Attribute Compression
Authors:
Kang You,
Pan Gao,
Zhan Ma
Abstract:
The past several years have witnessed the emergence of learned point cloud compression (PCC) techniques. However, current learning-based lossless point cloud attribute compression (PCAC) methods suffer from either high computational complexity or deteriorated compression performance. Moreover, the significant variations in point cloud scale and sparsity encountered in real-world applications make developing an all-in-one neural model a challenging task. In this paper, we propose PoLoPCAC, an efficient and generic lossless PCAC method that achieves high compression efficiency and strong generalizability simultaneously. We formulate lossless PCAC as the task of inferring explicit distributions of attributes from group-wise autoregressive priors. A progressive random grouping strategy is first devised to efficiently resolve the point cloud into groups, and then the attributes of each group are modeled sequentially from accumulated antecedents. A locality-aware attention mechanism is utilized to exploit prior knowledge from context windows in parallel. Since our method directly operates on points, it naturally avoids distortion caused by voxelization, and can be executed on point clouds with arbitrary scale and density. Experiments show that our method can be instantly deployed once trained on a Synthetic 2k-ShapeNet dataset while enjoying continuous bit-rate reduction over the latest G-PCCv23 on various datasets (ShapeNet, ScanNet, MVUB, 8iVFB). Meanwhile, our method reports shorter coding time than G-PCCv23 on the majority of sequences with a lightweight model size (2.6MB), which is highly attractive for practical applications. Dataset, code and trained model are available at https://github.com/I2-Multimedia-Lab/PoLoPCAC.
Submitted 10 April, 2024;
originally announced April 2024.
-
No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation
Authors:
Xiangyang Zhu,
Renrui Zhang,
Bowei He,
Ziyu Guo,
Jiaming Liu,
Han Xiao,
Chaoyou Fu,
Hao Dong,
Peng Gao
Abstract:
To reduce the reliance on large-scale datasets, recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot segmentation methods first pre-train models on 'seen' classes, and then evaluate their generalization performance on 'unseen' classes. However, the prior pre-training stage not only introduces excessive time overhead but also incurs a significant domain gap on 'unseen' classes. To tackle these issues, we propose a Non-parametric Network for few-shot 3D Segmentation, Seg-NN, and its Parametric variant, Seg-PN. Without training, Seg-NN extracts dense representations by hand-crafted filters and achieves comparable performance to existing parametric models. Due to the elimination of pre-training, Seg-NN can alleviate the domain gap issue and save a substantial amount of time. Based on Seg-NN, Seg-PN only requires training a lightweight QUEry-Support Transferring (QUEST) module, which enhances the interaction between the support set and query set. Experiments suggest that Seg-PN outperforms the previous state-of-the-art method by +4.19% and +7.71% mIoU on the S3DIS and ScanNet datasets respectively, while reducing training time by 90%, indicating its effectiveness and efficiency.
Submitted 5 April, 2024;
originally announced April 2024.
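The Seg-NN/Seg-PN abstract above reports gains in mIoU (mean intersection over union). For readers unfamiliar with the metric, here is a minimal sketch of how it is commonly computed from per-point labels. The function name and the choice to skip classes absent from both prediction and ground truth are assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for per-point labels.

    Classes that appear in neither `pred` nor `target` are skipped
    (an assumption; evaluation protocols differ on this point).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    predicted = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    print(mean_iou(predicted, ground_truth, num_classes=2))
```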
-
Multi-Robot Collaborative Navigation with Formation Adaptation
Authors:
Zihao Deng,
Peng Gao,
Williard Joshua Jose,
Hao Zhang
Abstract:
Multi-robot collaborative navigation is an essential ability where teamwork and synchronization are key. In complex and uncertain environments, adaptive formation is vital, as rigid formations prove to be inadequate. The ability of robots to dynamically adjust their formation enables navigation through unpredictable spaces, maintaining cohesion, and effectively responding to environmental challenges. In this paper, we introduce a novel approach that uses a bi-level learning framework. Specifically, we use graph learning at a high level for group coordination and reinforcement learning for individual navigation. We innovate by integrating a spring-damper model within the reinforcement learning reward mechanism, addressing the rigidity of traditional formation control methods. During execution, our approach enables a team of robots to successfully navigate challenging environments, maintain a desired formation shape, and dynamically adjust their formation scale based on environmental information. We conduct extensive experiments to evaluate our approach across three distinct formation scenarios in multi-robot navigation: circle, line, and wedge. Experimental results show that our approach achieves promising results and scalability on multi-robot navigation with formation adaptation.
Submitted 1 April, 2024;
originally announced April 2024.
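The abstract above mentions a spring-damper model inside the reinforcement learning reward. One plausible way to turn that idea into a reward term is sketched below: a virtual spring and damper connect two robots, and the reward penalizes the magnitude of the resulting force. The function name, the 2D setting, the gains `k` and `c`, and the use of the negative force magnitude as the reward are all assumptions for illustration, not the formulation used in the paper.

```python
import math


def spring_damper_reward(pos_i, pos_j, vel_i, vel_j, rest_length, k=1.0, c=0.5):
    """Reward term from a virtual spring-damper between two robots (2D).

    force = k * (distance - rest_length) + c * d(distance)/dt
    The reward is the negative force magnitude, so it is highest (zero)
    when the pair sits at the rest length with no relative radial motion.
    """
    dx, dy = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    dist = math.hypot(dx, dy)

    if dist > 0.0:
        # Rate of change of the inter-robot distance (relative velocity
        # projected onto the line connecting the two robots).
        rel_vx, rel_vy = vel_j[0] - vel_i[0], vel_j[1] - vel_i[1]
        dist_rate = (rel_vx * dx + rel_vy * dy) / dist
    else:
        dist_rate = 0.0  # direction undefined when the robots coincide

    force = k * (dist - rest_length) + c * dist_rate
    return -abs(force)


if __name__ == "__main__":
    # Robots 3 m apart with a 2 m rest length, both stationary.
    print(spring_damper_reward((0, 0), (3, 0), (0, 0), (0, 0), rest_length=2.0))
```

Because the penalty grows smoothly with deviation instead of enforcing a fixed geometry, a term of this kind lets the formation stretch or compress around obstacles, which matches the abstract's stated motivation of avoiding rigid formation control.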
-
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Authors:
Weifeng Lin,
Xinyu Wei,
Ruichuan An,
Peng Gao,
Bocheng Zou,
Yulin Luo,
Siyuan Huang,
Shanghang Zhang,
Hongsheng Li
Abstract:
The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM for various visual prompts (points, bounding boxes, and free-form shapes) and language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data is a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, including natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions. Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed pixel-level description and question-answering abilities.
Submitted 31 March, 2024; v1 submitted 29 March, 2024;
originally announced March 2024.
-
CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation
Authors:
Yongrui Yu,
Hanyu Chen,
Zitian Zhang,
Qiong Xiao,
Wenhui Lei,
Linrui Dai,
Yu Fu,
Hui Tan,
Guan Wang,
Peng Gao,
Xiaofan Zhang
Abstract:
Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task.
Submitted 26 March, 2024;
originally announced March 2024.