Showing 1–50 of 570 results for author: Yan, S

Searching in archive cs.
  1. arXiv:2408.15777  [pdf, other]

    cs.CV

    A Survey on Facial Expression Recognition of Static and Dynamic Emotions

    Authors: Yan Wang, Shaoqi Yan, Yang Liu, Wei Song, Jing Liu, Yang Chang, Xinji Mai, Xiping Hu, Wenqiang Zhang, Zhongxue Gan

    Abstract: Facial expression recognition (FER) aims to analyze emotional states from static images and dynamic sequences, which is pivotal in enhancing anthropomorphic communication among humans, robots, and digital avatars by leveraging AI technologies. As the FER field evolves from controlled laboratory environments to more complex in-the-wild scenarios, advanced methods have been rapidly developed and new…

    Submitted 28 August, 2024; originally announced August 2024.

  2. arXiv:2408.13574  [pdf, other]

    cs.CV

    PointDGMamba: Domain Generalization of Point Cloud Classification via Generalized State Space Model

    Authors: Hao Yang, Qianyu Zhou, Haijia Sun, Xiangtai Li, Fengqi Liu, Xuequan Lu, Lizhuang Ma, Shuicheng Yan

    Abstract: Domain Generalization (DG) has been recently explored to improve the generalizability of point cloud classification (PCC) models toward unseen domains. However, they often suffer from limited receptive fields or quadratic complexity due to the use of convolutional neural networks or vision Transformers. In this paper, we present the first work that studies the generalizability of state space models…

    Submitted 24 August, 2024; originally announced August 2024.

  3. arXiv:2408.12003  [pdf]

    cs.CL

    RAG-Optimized Tibetan Tourism LLMs: Enhancing Accuracy and Personalization

    Authors: Jinhu Qi, Shuai Yan, Yibo Zhang, Wentao Zhang, Rong Jin, Yuwei Hu, Ke Wang

    Abstract: With the development of the modern social economy, tourism has become an important way to meet people's spiritual needs, bringing development opportunities to the tourism industry. However, existing large language models (LLMs) face challenges in personalized recommendation capabilities and the generation of content that can sometimes produce hallucinations. This study proposes an optimization sch…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: Accepted by AIPR 2024

    ACM Class: I.2.7

  4. arXiv:2408.10947  [pdf, other]

    cs.AI cs.CL cs.CY

    Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

    Authors: Yuyan Chen, Chenwei Wu, Songzhou Yan, Panjun Liu, Haoyu Zhou, Yanghua Xiao

    Abstract: Teachers are important to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Accepted to ACL 2024

  5. arXiv:2408.05159  [pdf, other]

    cs.CV

    EasyInv: Toward Fast and Better DDIM Inversion

    Authors: Ziyue Zhang, Mingbao Lin, Shuicheng Yan, Rongrong Ji

    Abstract: This paper introduces EasyInv, an easy yet novel approach that significantly advances the field of DDIM Inversion by addressing the inherent inefficiencies and performance limitations of traditional iterative optimization methods. At the core of our EasyInv is a refined strategy for approximating inversion noise, which is pivotal for enhancing the accuracy and reliability of the inversion process.…

    Submitted 13 August, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

    Comments: 9 pages, not including references

  6. arXiv:2408.03221  [pdf, other]

    cs.NI eess.SP

    DRL-Assisted Dynamic QoT-Aware Service Provisioning in Multi-Band Elastic Optical Networks

    Authors: Yiran Teng, Carlos Natalino, Farhad Arpanaei, Alfonso Sánchez-Macián, Paolo Monti, Shuangyi Yan, Dimitra Simeonidou

    Abstract: We propose a DRL-assisted approach for service provisioning in multi-band elastic optical networks. Our simulation environment uses an accurate QoT estimator based on the GN/EGN model. Results show that the proposed approach reduces request blocking by 50% compared with heuristics from the literature.

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: This paper has been accepted by the 50th European Conference on Optical Communications (ECOC 2024)

  7. Adversarial Safety-Critical Scenario Generation using Naturalistic Human Driving Priors

    Authors: Kunkun Hao, Yonggang Luo, Wen Cui, Yuqiao Bai, Jucheng Yang, Songyang Yan, Yuxi Pan, Zijiang Yang

    Abstract: Evaluating the decision-making system is indispensable in developing autonomous vehicles, while realistic and challenging safety-critical test scenarios play a crucial role. Obtaining these scenarios is non-trivial, owing to the long-tailed distribution, sparsity, and rarity in real-world data sets. To tackle this problem, in this paper, we introduce a natural adversarial scenario generation solu…

    Submitted 6 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

    Comments: Published in IEEE Transactions on Intelligent Vehicles, 2023

    Journal ref: IEEE Transactions on Intelligent Vehicles (2023)

  8. arXiv:2408.01044  [pdf, other]

    cs.CV

    Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model

    Authors: Yang Jin, Lei Zhang, Shi Yan, Bin Fan, Binglu Wang

    Abstract: Gaze object prediction (GOP) aims to predict the category and location of the object that a human is looking at. Previous methods utilized box-level supervision to identify the object that a person is looking at, but struggled with semantic ambiguity, i.e., a single box may contain several items since objects are close together. The Vision foundation model (VFM) has improved in object segmentation u…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV2024

  9. arXiv:2407.20585  [pdf, other]

    cs.NI eess.SP

    A UAV-Enabled Time-Sensitive Data Collection Scheme for Grassland Monitoring Edge Networks

    Authors: Dongbin Jiao, Zihao Wang, Wen Fan, Weibo Yang, Peng Yang, Zhanhuan Shang, Shi Yan

    Abstract: Grassland monitoring is essential for the sustainable development of grassland resources. Traditional Internet of Things (IoT) devices generate critical ecological data, making data loss unacceptable, but the harsh environment complicates data collection. Unmanned Aerial Vehicle (UAV) and mobile edge computing (MEC) offer efficient data collection solutions, enhancing performance on resource-limit…

    Submitted 10 August, 2024; v1 submitted 30 July, 2024; originally announced July 2024.

  10. arXiv:2407.17152  [pdf, other]

    cs.CV cs.AI

    XMeCap: Meme Caption Generation with Sub-Image Adaptability

    Authors: Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao

    Abstract: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places a particular emphasis on the impact of multi-images on meme captioning. After that, we introduce the XMeCap framewo…

    Submitted 31 July, 2024; v1 submitted 24 July, 2024; originally announced July 2024.

    Comments: Accepted to MM 2024

  11. arXiv:2407.16406  [pdf, other]

    cs.CV cs.LG

    Hi-EF: Benchmarking Emotion Forecasting in Human-interaction

    Authors: Haoran Wang, Xinji Mai, Zeng Tao, Yan Wang, Jiawen Yu, Ziheng Zhou, Xuan Tong, Shaoqi Yan, Qing Zhao, Shuyong Gao, Wenqiang Zhang

    Abstract: Affective Forecasting, a research direction in psychology that predicts individuals' future emotions, is often constrained by numerous external factors like social influence and temporal distance. To address this, we transform Affective Forecasting into a Deep Learning problem by designing an Emotion Forecasting paradigm based on two-party interactions. We propose a novel Emotion Forecasting (EF) t…

    Submitted 23 July, 2024; originally announced July 2024.

  12. arXiv:2407.15590  [pdf, other]

    cs.CV

    All rivers run into the sea: Unified Modality Brain-like Emotional Central Mechanism

    Authors: Xinji Mai, Junxiong Lin, Haoran Wang, Zeng Tao, Yan Wang, Shaoqi Yan, Xuan Tong, Jiawen Yu, Boyang Wang, Ziheng Zhou, Qing Zhao, Shuyong Gao, Wenqiang Zhang

    Abstract: In the field of affective computing, fully leveraging information from a variety of sensory modalities is essential for the comprehensive understanding and processing of human emotions. Inspired by the process through which the human brain handles emotions and the theory of cross-modal plasticity, we propose UMBEnet, a brain-like unified modal affective processing network. The primary design of UM…

    Submitted 22 July, 2024; originally announced July 2024.

  13. arXiv:2407.14710  [pdf, other]

    cs.LG cs.CR

    Universally Harmonizing Differential Privacy Mechanisms for Federated Learning: Boosting Accuracy and Convergence

    Authors: Shuya Feng, Meisam Mohammady, Hanbin Hong, Shenao Yan, Ashish Kundu, Binghui Wang, Yuan Hong

    Abstract: Differentially private federated learning (DP-FL) is a promising technique for collaborative model training while ensuring provable privacy for clients. However, optimizing the tradeoff between privacy and accuracy remains a critical challenge. To the best of our knowledge, we propose the first DP-FL framework (namely UDP-FL), which universally harmonizes any randomization mechanism (e.g., an optimal one…

    Submitted 23 July, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

  14. arXiv:2407.13561  [pdf, other]

    cs.CL

    Research on Tibetan Tourism Viewpoints information generation system based on LLM

    Authors: Jinhu Qi, Shuai Yan, Wentao Zhang, Yibo Zhang, Zirui Liu, Ke Wang

    Abstract: Tibet, ensconced within China's territorial expanse, is distinguished by its labyrinthine and heterogeneous topography, a testament to its profound historical heritage, and the cradle of a unique religious ethos. The very essence of these attributes, however, has impeded the advancement of Tibet's tourism service infrastructure, rendering existing smart tourism services inadequate for the region's…

    Submitted 18 July, 2024; originally announced July 2024.

    Journal ref: ICWOC 2024

  15. arXiv:2407.13431  [pdf, other]

    cs.LG cs.AI

    Improving Out-of-Distribution Generalization of Trajectory Prediction for Autonomous Driving via Polynomial Representations

    Authors: Yue Yao, Shengchao Yan, Daniel Goehring, Wolfram Burgard, Joerg Reichardt

    Abstract: Robustness against Out-of-Distribution (OoD) samples is a key performance indicator of a trajectory prediction model. However, the development and ranking of state-of-the-art (SotA) models are driven by their In-Distribution (ID) performance on individual competition datasets. We present an OoD testing protocol that homogenizes datasets and prediction tasks across two large-scale motion datasets.…

    Submitted 26 August, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

  16. arXiv:2407.11325  [pdf, other]

    cs.CV

    VISA: Reasoning Video Object Segmentation via Large Language Models

    Authors: Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, Efstratios Gavves

    Abstract: Existing Video Object Segmentation (VOS) relies on explicit user instructions, such as categories, masks, or short phrases, restricting their ability to perform complex video segmentation requiring reasoning with world knowledge. In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implic…

    Submitted 15 July, 2024; originally announced July 2024.

  17. arXiv:2407.11096  [pdf, other]

    cs.LG cs.AI

    Static and multivariate-temporal attentive fusion transformer for readmission risk prediction

    Authors: Zhe Sun, Runzhi Li, Jing Wang, Gang Chen, Siyu Yan, Lihong Ma

    Abstract: Background: Accurate short-term readmission prediction of ICU patients is significant in improving the efficiency of resource assignment by assisting physicians in making discharge decisions. Clinically, both individual static and multivariate temporal data collected from ICU monitors play critical roles in short-term readmission prediction. Informative static and multivariate temporal feat…

    Submitted 14 July, 2024; originally announced July 2024.

  18. arXiv:2407.09862  [pdf, other]

    cs.CV

    ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency

    Authors: Shaocheng Yan, Pengcheng Shi, Jiayuan Li

    Abstract: Recent advances in point cloud registration mostly leverage geometric information. Although these methods have yielded promising results, they still struggle with problems of low overlap, thus limiting their practical usage. In this paper, we propose ML-SemReg, a plug-and-play point cloud registration framework that fully exploits semantic information. Our key insight is that mismatches can be cat…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  19. arXiv:2407.08348  [pdf, other]

    cs.AI cs.CL cs.LG

    Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

    Authors: Liang Zeng, Liangjun Zhong, Liang Zhao, Tianwen Wei, Liu Yang, Jujie He, Cheng Cheng, Rui Hu, Yang Liu, Shuicheng Yan, Han Fang, Yahui Zhou

    Abstract: In this paper, we investigate the underlying factors that potentially enhance the mathematical reasoning capabilities of large language models (LLMs). We argue that the data scaling law for math reasoning capabilities in modern LLMs is far from being saturated, highlighting how the model's quality improves with increases in data quantity. To support this claim, we introduce the Skywork-Math model…

    Submitted 17 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

  20. arXiv:2407.05021  [pdf, other]

    cs.CV

    Incremental Multiview Point Cloud Registration

    Authors: Xiaoya Cheng, Yu Liu, Maojun Zhang, Shen Yan

    Abstract: In this paper, we present a novel approach for multiview point cloud registration. Different from previous research that typically employs a global scheme for multiview registration, we propose to adopt an incremental pipeline to progressively align scans into a canonical coordinate system. Specifically, drawing inspiration from image-based 3D reconstruction, our approach first builds a sparse sc…

    Submitted 6 July, 2024; originally announced July 2024.

  21. arXiv:2407.00945  [pdf, other]

    cs.LG

    Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

    Authors: Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

    Abstract: The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster in…

    Submitted 30 June, 2024; originally announced July 2024.

  22. arXiv:2407.00904  [pdf, other]

    cs.CE

    Background-aware Multi-source Fusion Financial Trend Forecasting Mechanism

    Authors: Fengting Mo, Shanshan Yan, Yinhao Xiao

    Abstract: Stock prices, as an economic indicator, reflect changes in economic development and market conditions. Traditional stock price prediction models often only consider time-series data and are limited by the mechanisms of the models themselves. Some deep learning models have high computational costs, depend on a large amount of high-quality data, and have poor interpretability, making it difficult to…

    Submitted 30 June, 2024; originally announced July 2024.

  23. arXiv:2407.00497  [pdf, other]

    cs.CL

    LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

    Authors: Jiahao Ying, Mingbao Lin, Yixin Cao, Wei Tang, Bo Wang, Qianru Sun, Xuanjing Huang, Shuicheng Yan

    Abstract: This paper introduces the innovative "LLMs-as-Instructors" framework, which leverages the advanced Large Language Models (LLMs) to autonomously enhance the training of smaller target models. Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model, facilitating targeted and efficient training cycles…

    Submitted 29 June, 2024; originally announced July 2024.

  24. arXiv:2406.19435  [pdf, other]

    cs.CV

    A Sanity Check for AI-generated Image Detection

    Authors: Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, Weidi Xie

    Abstract: With the rapid development of generative models, discerning AI-generated content has evoked increasing attention from both industry and academia. In this paper, we conduct a sanity check on "whether the task of AI-generated image detection has been solved". To start with, we present the Chameleon dataset, consisting of AI-generated images that are genuinely challenging for human perception. To quantify th…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Project page: https://shilinyan99.github.io/AIDE Code: https://github.com/shilinyan99/AIDE

  25. arXiv:2406.19389  [pdf, other]

    cs.CV

    OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

    Authors: Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan

    Abstract: Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual p…

    Submitted 27 June, 2024; originally announced June 2024.

  26. arXiv:2406.19369  [pdf, other]

    cs.CV

    Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model

    Authors: Haobo Yuan, Xiangtai Li, Lu Qi, Tao Zhang, Ming-Hsuan Yang, Shuicheng Yan, Chen Change Loy

    Abstract: Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifica…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 16 pages; 8 figures

  27. Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment

    Authors: Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, Shuicheng Yan

    Abstract: While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning, under-modeling of temporal dynamics, detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-tempo…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted by IEEE TPAMI 2024

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  28. arXiv:2406.18173  [pdf, other]

    cs.CL

    UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs

    Authors: Wenhao Li, Mingbao Lin, Yunshan Zhong, Shuicheng Yan, Rongrong Ji

    Abstract: Managing long texts is challenging for large language models (LLMs) due to limited context window sizes. This study introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings. We initially conceptualize the process as a streamlined encoder-decoder framework where the weights-shared encoder and decoder respectively encapsulate a c…

    Submitted 26 June, 2024; originally announced June 2024.

  29. arXiv:2406.18074  [pdf, other]

    cs.CV cs.AI

    Few-Shot Medical Image Segmentation with High-Fidelity Prototypes

    Authors: Song Tang, Shaxu Yan, Xiaozhi Qi, Jianxin Gao, Mao Ye, Jianwei Zhang, Xiatian Zhu

    Abstract: Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Although prototype-based approaches have achieved substantial success, existing models are limited to imaging scenarios with considerably distinct objects and not highly complex backgrounds, e.g., natural images. This makes such models suboptimal fo…

    Submitted 26 June, 2024; originally announced June 2024.

  30. arXiv:2406.16473  [pdf, other]

    cs.CV cs.AI

    Seeking Certainty In Uncertainty: Dual-Stage Unified Framework Solving Uncertainty in Dynamic Facial Expression Recognition

    Authors: Haoran Wang, Xinji Mai, Zeng Tao, Xuan Tong, Junxiong Lin, Yan Wang, Jiawen Yu, Boyang Wang, Shaoqi Yan, Qing Zhao, Ziheng Zhou, Shuyong Gao, Wenqiang Zhang

    Abstract: The contemporary state-of-the-art of Dynamic Facial Expression Recognition (DFER) technology facilitates remarkable progress by deriving emotional mappings of facial expressions from video content, underpinned by training on voluminous datasets. Yet, the DFER datasets encompass a substantial volume of noise data. Noise arises from low-quality captures that defy logical labeling, and instances that…

    Submitted 24 June, 2024; originally announced June 2024.

  31. arXiv:2406.16459  [pdf, other]

    cs.CV

    Suppressing Uncertainties in Degradation Estimation for Blind Super-Resolution

    Authors: Junxiong Lin, Zeng Tao, Xuan Tong, Xinji Mai, Haoran Wang, Boyang Wang, Yan Wang, Qing Zhao, Jiawen Yu, Yuxuan Lin, Shaoqi Yan, Shuyong Gao, Wenqiang Zhang

    Abstract: The problem of blind image super-resolution aims to recover high-resolution (HR) images from low-resolution (LR) images with unknown degradation modes. Most existing methods model the image degradation process using blur kernels. However, this explicit modeling approach struggles to cover the complex and varied degradation processes encountered in the real world, such as high-order combinations of…

    Submitted 24 June, 2024; originally announced June 2024.

  32. arXiv:2406.14909  [pdf, other]

    cs.LG cs.AI cs.CL

    MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

    Authors: Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

    Abstract: Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring thei…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: 10 pages

    ACM Class: I.2.7

  33. arXiv:2406.14283  [pdf, other]

    cs.AI

    Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning

    Authors: Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, Bo An

    Abstract: Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate the pathology by introducing…

    Submitted 22 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

  34. arXiv:2406.11836  [pdf, other]

    cs.CV cs.GR

    RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians

    Authors: Bingling Li, Shengyi Chen, Luchao Wang, Kaimin Liao, Sijie Yan, Yuanjun Xiong

    Abstract: In this work, we explore the possibility of training high-parameter 3D Gaussian splatting (3DGS) models on large-scale, high-resolution datasets. We design a general model parallel training method for 3DGS, named RetinaGS, which uses a proper rendering equation and can be applied to any scene and arbitrary distribution of Gaussian primitives. It enables us to explore the scaling behavior of 3DGS i…

    Submitted 22 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  35. arXiv:2406.08552  [pdf, other]

    cs.CV

    DiTFastAttn: Attention Compression for Diffusion Transformer Models

    Authors: Zhihang Yuan, Pu Lu, Hanling Zhang, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

    Abstract: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to self-attention's quadratic complexity. We propose DiTFastAttn, a novel post-training compression method to alleviate DiT's computational bottleneck. We identify three key redundancies in the attention computation during DiT inference: 1. spatial redundancy, where many attention heads focus on…

    Submitted 12 June, 2024; originally announced June 2024.

  36. arXiv:2406.07471  [pdf, other]

    cs.CV

    OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

    Authors: Ming Hu, Peng Xia, Lin Wang, Siyuan Yan, Feilong Tang, Zhongxing Xu, Yimin Luo, Kaimin Song, Jurgen Leitner, Xuelian Cheng, Jun Cheng, Chi Liu, Kaijing Zhou, Zongyuan Ge

    Abstract: Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets face challenges such as small scale, lack of diversity in surgery and phase cate…

    Submitted 19 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by ECCV 2024

  37. arXiv:2406.06822  [pdf, other]

    cs.CR cs.AI cs.SE

    An LLM-Assisted Easy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection

    Authors: Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, Yuan Hong

    Abstract: Large Language Models (LLMs) have transformed code completion tasks, providing context-based suggestions to boost developer productivity in software engineering. As users often fine-tune these models for specific applications, poisoning and backdoor attacks can covertly alter the model outputs. To address this critical security challenge, we introduce CodeBreaker, a pioneering LLM-assisted backdoo…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: To appear in USENIX Security '24

  38. arXiv:2406.06579  [pdf, other]

    cs.CL cs.AI cs.CV

    From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models

    Authors: Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, Jieping Ye

    Abstract: Recently, multimodal large language models have proliferated in endless variety. Most of the popular Large Vision Language Models (LVLMs) depend on sequential visual representation, where images are converted into hundreds or thousands of tokens before being input into the Large Language Model (LLM) along with language prompts. The black-box design hinders the interpretability of visual-language…

    Submitted 13 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  39. arXiv:2406.06563  [pdf, other]

    cs.CL cs.AI

    Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

    Authors: Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou

    Abstract: In this technical report, we introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. It is initialized from the pre-existing dense checkpoints of our Skywork-13B model. We explore the comparative effectiveness of upcycling versus training from scratch initi…

    Submitted 2 June, 2024; originally announced June 2024.

  40. arXiv:2406.06367  [pdf, other]

    cs.CV

    MVGamba: Unify 3D Content Generation as State Space Sequence Modeling

    Authors: Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang

    Abstract: Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-vi…

    Submitted 20 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  41. arXiv:2406.05127  [pdf, other]

    cs.CV

    Towards Semantic Equivalence of Tokenization in Multimodal LLM

    Authors: Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, r…

    Submitted 27 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: Technical Report. The project page: https://chocowu.github.io/SeTok-web/

  42. arXiv:2406.04835  [pdf, other]

    cs.RO

    SLR: Learning Quadruped Locomotion without Privileged Information

    Authors: Shiyi Chen, Zeyu Wan, Shiyang Yan, Chun Zhang, Weiyi Zhang, Qiang Li, Debing Zhang, Fasih Ud Din Farrukh

    Abstract: Traditional reinforcement learning control for quadruped robots often relies on privileged information, demanding meticulous selection and precise estimation, thereby imposing constraints on the development process. This work proposes a Self-learning Latent Representation (SLR) method, which achieves high-performance control policy learning without the need for privileged information. To enhance t…

    Submitted 7 June, 2024; originally announced June 2024.

  43. arXiv:2406.02540  [pdf, other]

    cs.CV

    ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

    Authors: Tianchen Zhao, Tongcheng Fang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang

    Abstract: Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video generation lead to increased computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an ef…

    Submitted 30 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Project Page: https://a-suozhang.xyz/viditq.github.io/

  44. arXiv:2406.00605  [pdf, other]

    cs.CL cs.AI

    LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

    Authors: Liang Zhao, Tianwen Wei, Liang Zeng, Cheng Cheng, Liu Yang, Peng Cheng, Lijie Wang, Chenxia Li, Xuejie Wu, Bo Zhu, Yimeng Gan, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou

    Abstract: We introduce LongSkywork, a long-context Large Language Model (LLM) capable of processing up to 200,000 tokens. We provide a training recipe for efficiently extending context length of LLMs. We identify that the critical element in enhancing long-context processing capability is to incorporate a long-context SFT stage following the standard SFT stage. A mere 200 iterations can convert the standard…

    Submitted 1 June, 2024; originally announced June 2024.

  45. arXiv:2405.20339  [pdf, other]

    cs.CV

    Visual Perception by Large Language Model's Weights

    Authors: Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

    Abstract: Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational eff…

    Submitted 30 May, 2024; originally announced May 2024.

  46. arXiv:2405.19487  [pdf, other]

    cs.CL

    A Full-duplex Speech Dialogue Scheme Based On Large Language Models

    Authors: Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Yuanjun Xiong, Wei Xia

    Abstract: We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate simultaneously, allo…

    Submitted 29 May, 2024; originally announced May 2024.

  47. arXiv:2405.19333  [pdf, other]

    cs.CV

    Multi-Modal Generative Embedding Model

    Authors: Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

    Abstract: Most multi-modal tasks can be formulated into problems of either generation or embedding. Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding. To explore the minimalism of multi-modal paradigms, we attempt to achieve only one model per modality in this work. We propose a Multi-Modal Generativ…

    Submitted 29 May, 2024; originally announced May 2024.

  48. arXiv:2405.18769  [pdf, other]

    cs.CV

    OUS: Scene-Guided Dynamic Facial Expression Recognition

    Authors: Xinji Mai, Haoran Wang, Zeng Tao, Junxiong Lin, Shaoqi Yan, Yan Wang, Jing Liu, Jiawen Yu, Xuan Tong, Yating Li, Wenqiang Zhang

    Abstract: Dynamic Facial Expression Recognition (DFER) is crucial for affective computing but often overlooks the impact of scene context. We have identified a significant issue in current DFER tasks: human annotators typically integrate emotions from various angles, including environmental cues and body language, whereas existing DFER methods tend to consider the scene as noise that needs to be filtered ou…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: 12 pages, 6 figures, 6 tables

    ACM Class: I.4; I.5.1

  49. arXiv:2405.17873  [pdf, other]

    cs.CV cs.AI

    MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

    Authors: Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, Yu Wang

    Abstract: Diffusion models have achieved significant visual generation quality. However, their significant computational and memory costs pose challenges for their application on resource-constrained mobile devices or even desktop GPUs. Recent few-step diffusion models reduce the inference time by reducing the denoising steps. However, their memory consumption is still excessive. The Post Training Quantiz…

    Submitted 29 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Project Page: https://a-suozhang.xyz/mixdq.github.io/

  50. arXiv:2405.17427  [pdf, other]

    cs.CV

    Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model

    Authors: Kuan-Chih Huang, Xiangtai Li, Lu Qi, Shuicheng Yan, Ming-Hsuan Yang

    Abstract: Recent advancements in multimodal large language models (LLMs) have shown their potential in various domains, especially concept reasoning. Despite these developments, applications in understanding 3D environments remain limited. This paper introduces Reason3D, a novel LLM designed for comprehensive 3D understanding. Reason3D takes point cloud data and text prompts as input to produce textual resp…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Project Page: https://KuanchihHuang.github.io/project/reason3d