Showing 1–50 of 1,083 results for author: Liu, D

Searching in archive cs.
  1. arXiv:2409.12191  [pdf, other]

    cs.CV cs.AI cs.CL

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Authors: Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin

    Abstract: We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more eff…

    Submitted 18 September, 2024; originally announced September 2024.

    Comments: Code is available at https://github.com/QwenLM/Qwen2-VL

  2. arXiv:2409.12186  [pdf, other]

    cs.CL

    Qwen2.5-Coder Technical Report

    Authors: Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin

    Abstract: In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretraining on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data genera…

    Submitted 18 September, 2024; originally announced September 2024.

  3. arXiv:2409.12122  [pdf, other]

    cs.CL cs.AI cs.LG

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    Authors: An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, Zhenru Zhang

    Abstract: In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, h…

    Submitted 18 September, 2024; originally announced September 2024.

  4. arXiv:2409.11749  [pdf, other]

    cs.CV cs.RO

    RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework

    Authors: Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui

    Abstract: 3D Multi-Object Tracking (MOT) obtains significant performance improvements with the rapid advancements in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique features of multi-came…

    Submitted 18 September, 2024; originally announced September 2024.

    Comments: RockTrack establishes a new state-of-the-art with 59.1% AMOTA on the nuScenes vision-only test leaderboard with ResNet50-level backbone

  5. arXiv:2409.11593  [pdf, other]

    cs.LG cs.AI cs.CV cs.ET cs.NE

    Self-Contrastive Forward-Forward Algorithm

    Authors: Xing Chen, Dongshu Liu, Jeremie Laydevant, Julie Grollier

    Abstract: The Forward-Forward (FF) algorithm is a recent, purely forward-mode learning method, that updates weights locally and layer-wise and supports supervised as well as unsupervised learning. These features make it ideal for applications such as brain-inspired learning, low-power hardware neural networks, and distributed learning in large models. However, while FF has shown promise on written digit rec…

    Submitted 17 September, 2024; originally announced September 2024.

  6. arXiv:2409.11111  [pdf, other]

    eess.IV cs.CV

    Few-Shot Domain Adaptation for Learned Image Compression

    Authors: Tianyu Zhang, Haotian Zhang, Yuqi Li, Li Li, Dong Liu

    Abstract: Learned image compression (LIC) has achieved state-of-the-art rate-distortion performance, deemed promising for next-generation image compression techniques. However, pre-trained LIC models usually suffer from significant performance degradation when applied to out-of-training-domain images, implying their poor generalization capabilities. To tackle this problem, we propose a few-shot domain adapt…

    Submitted 17 September, 2024; originally announced September 2024.

  7. arXiv:2409.10516  [pdf, other]

    cs.LG cs.CL

    RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

    Authors: Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu

    Abstract: Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation…

    Submitted 18 September, 2024; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: 16 pages

  8. arXiv:2409.09893  [pdf, other]

    cs.CV

    Resolving Inconsistent Semantics in Multi-Dataset Image Segmentation

    Authors: Qilong Zhangli, Di Liu, Abhishek Aich, Dimitris Metaxas, Samuel Schulter

    Abstract: Leveraging multiple training datasets to scale up image segmentation models is beneficial for increasing robustness and semantic understanding. Individual datasets have well-defined ground truth with non-overlapping mask layouts and mutually exclusive semantics. However, merging them for multi-dataset training disrupts this harmony and leads to semantic inconsistencies; for example, the class "per…

    Submitted 15 September, 2024; originally announced September 2024.

  9. arXiv:2409.09591  [pdf, other]

    cs.LG cs.AI

    Open-World Test-Time Training: Self-Training with Contrast Learning

    Authors: Houcheng Su, Mengzhu Wang, Jiao Li, Bingli Wang, Daixian Liu, Zeheng Wang

    Abstract: Traditional test-time training (TTT) methods, while addressing domain shifts, often assume a consistent class set, limiting their applicability in real-world scenarios characterized by infinite variety. Open-World Test-Time Training (OWTTT) addresses the challenge of generalizing deep learning models to unknown target domain distributions, especially in the presence of strong Out-of-Distribution (…

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: 10 pages

  10. arXiv:2409.09289  [pdf, other]

    cs.SD cs.MM eess.AS

    DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training

    Authors: Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li

    Abstract: Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream tasks of IVAs with pre-trained audio models and text models. However, these models are pre-trained independently and usually on tasks different from target domains, resulting in sub-optimal modality re…

    Submitted 13 September, 2024; originally announced September 2024.

  11. arXiv:2409.09284  [pdf, other]

    cs.SD cs.MM eess.AS

    M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection

    Authors: Anna Wang, Da Liu, Zhiyu Zhang, Shengqiang Liu, Jie Gao, Yali Li

    Abstract: With the goal of more natural and human-like interaction with virtual voice assistants, recent research in the field has focused on full duplex interaction mode without relying on repeated wake-up words. This requires that in scenes with complex sound sources, the voice assistant must classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, which is jointly modele…

    Submitted 13 September, 2024; originally announced September 2024.

  12. arXiv:2409.09282  [pdf, other]

    cs.LG cs.MM

    Turbo your multi-modal classification with contrastive learning

    Authors: Zhiyu Zhang, Da Liu, Shengqiang Liu, Anna Wang, Jie Gao, Yali Li

    Abstract: Contrastive learning has become one of the most impressive approaches for multi-modal representation learning. However, previous multi-modal works mainly focused on cross-modal understanding, ignoring in-modal contrastive learning, which limits the representation of each modality. In this paper, we propose a novel contrastive learning strategy, called $Turbo$, to promote multi-modal understanding…

    Submitted 13 September, 2024; originally announced September 2024.

  13. arXiv:2409.09170  [pdf, ps, other]

    cs.CL

    Towards Precision Characterization of Communication Disorders using Models of Perceived Pragmatic Similarity

    Authors: Nigel G. Ward, Andres Segura, Georgina Bugarini, Heike Lehnert-LeHouillier, Dancheng Liu, Jinjun Xiong, Olac Fuentes

    Abstract: The diagnosis and treatment of individuals with communication disorders offers many opportunities for the application of speech technology, but research so far has not adequately considered: the diversity of conditions, the role of pragmatic deficits, and the challenges of limited data. This paper explores how a general-purpose model of perceived pragmatic similarity may overcome these limitations…

    Submitted 13 September, 2024; originally announced September 2024.

    Comments: submitted to IEEE ICASSP 2025

  14. arXiv:2409.09009  [pdf, other]

    cs.CL

    Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach

    Authors: Siqi Li, Danni Liu, Jan Niehues

    Abstract: Direct speech translation (ST) models often struggle with rare words. Incorrect translation of these words can have severe consequences, impacting translation quality and user trust. While rare word translation is inherently challenging for neural models due to sparse learning signals, real-world scenarios often allow access to translations of past recordings on similar topics. To leverage these v…

    Submitted 13 September, 2024; originally announced September 2024.

  15. arXiv:2409.08481  [pdf, other]

    eess.IV cs.CV

    USTC-TD: A Test Dataset and Benchmark for Image and Video Coding in 2020s

    Authors: Zhuoyuan Li, Junqi Liao, Chuanbo Tang, Haotian Zhang, Yuqi Li, Yifan Bian, Xihua Sheng, Xinmin Feng, Yao Li, Changsheng Gao, Li Li, Dong Liu, Feng Wu

    Abstract: Image/video coding has been a remarkable research area for both academia and industry for many years. Testing datasets, especially high-quality image/video datasets are desirable for the justified evaluation of coding-related research, practical applications, and standardization activities. We put forward a test dataset namely USTC-TD, which has been successfully adopted in the practical end-to-en…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: 24 pages. Project Page: https://esakak.github.io/USTC-TD

  16. arXiv:2409.06062  [pdf, other]

    eess.AS cs.SD

    Retrieval Augmented Correction of Named Entity Speech Recognition Errors

    Authors: Ernest Pusateri, Anmol Walia, Anirudh Kashi, Bortik Bandyopadhyay, Nadia Hyder, Sayantan Mahinder, Raviteja Anantha, Daben Liu, Sashank Gondala

    Abstract: In recent years, end-to-end automatic speech recognition (ASR) systems have proven themselves remarkably accurate and performant, but these systems still have a significant error rate for entity names which appear infrequently in their training data. In parallel to the rise of end-to-end ASR systems, large language models (LLMs) have proven to be a versatile tool for various natural language proce…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  17. arXiv:2409.04494  [pdf, other]

    eess.IV cs.CV

    Diff-INR: Generative Regularization for Electrical Impedance Tomography

    Authors: Bowen Tong, Junwu Wang, Dong Liu

    Abstract: Electrical Impedance Tomography (EIT) is a non-invasive imaging technique that reconstructs conductivity distributions within a body from boundary measurements. However, EIT reconstruction is hindered by its ill-posed nonlinear inverse problem, which complicates accurate results. To tackle this, we propose Diff-INR, a novel method that combines generative regularization with Implicit Neural Repres…

    Submitted 10 September, 2024; v1 submitted 6 September, 2024; originally announced September 2024.

  18. arXiv:2409.03514  [pdf, other]

    cs.CV

    Blended Latent Diffusion under Attention Control for Real-World Video Editing

    Authors: Deyin Liu, Lin Yuanbo Wu, Xianghua Xie

    Abstract: Due to the lack of fully publicly available text-to-video models, current video editing methods tend to build on pre-trained text-to-image generation models; however, they still face grand challenges in dealing with the local editing of video with temporal information. First, although existing methods attempt to focus on local area editing by a pre-defined mask, the preservation of the outside-area ba…

    Submitted 5 September, 2024; originally announced September 2024.

  19. arXiv:2409.03332  [pdf, other]

    cs.RO

    Masked Sensory-Temporal Attention for Sensor Generalization in Quadruped Locomotion

    Authors: Dikai Liu, Tianwei Zhang, Jianxiong Yin, Simon See

    Abstract: With the rising focus on quadrupeds, a generalized policy capable of handling different robot models and sensory inputs will be highly beneficial. Although several methods have been proposed to address different morphologies, it remains a challenge for learning-based policies to manage various combinations of proprioceptive information. This paper presents Masked Sensory-Temporal Attention (MSTA),…

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: Project website for video: https://johnliudk.github.io/msta/

  20. arXiv:2409.02897  [pdf, other]

    cs.CL

    LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

    Authors: Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li

    Abstract: Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-graine…

    Submitted 10 September, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

  21. arXiv:2409.02795  [pdf, other]

    cs.CL

    Towards a Unified View of Preference Learning for Large Language Models: A Survey

    Authors: Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, Wen Xiao, Ge Zhang, Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng Liu, Zeyu Cui, Jian Yang, Lei Sha, Houfeng Wang, Zhifang Sui, Peiyi Wang, Tianyu Liu, Baobao Chang

    Abstract: Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to unde…

    Submitted 9 September, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

    Comments: 23 pages, 6 figures

  22. arXiv:2409.01990  [pdf, ps, other]

    cs.DC cs.LG

    Contemporary Model Compression on Large Language Models Inference

    Authors: Dong Liu

    Abstract: Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks. However, the computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications, particularly on resource-constrained devices. Efficient inference is crucial for scali…

    Submitted 3 September, 2024; originally announced September 2024.

  23. arXiv:2409.01315  [pdf, other]

    physics.comp-ph cs.AI cs.LG

    Multi-frequency Neural Born Iterative Method for Solving 2-D Inverse Scattering Problems

    Authors: Daoqi Liu, Tao Shan, Maokun Li, Fan Yang, Shenheng Xu

    Abstract: In this work, we propose a deep learning-based imaging method for addressing the multi-frequency electromagnetic (EM) inverse scattering problem (ISP). By combining deep learning technology with EM physical laws, we have successfully developed a multi-frequency neural Born iterative method (NeuralBIM), guided by the principles of the single-frequency NeuralBIM. This method integrates multitask lea…

    Submitted 2 September, 2024; originally announced September 2024.

    MSC Class: 35Q61 ACM Class: I.2.6; G.1.8; G.1.3

  24. arXiv:2409.01212  [pdf, other]

    cs.CV

    MobileIQA: Exploiting Mobile-level Diverse Opinion Network For No-Reference Image Quality Assessment Using Knowledge Distillation

    Authors: Zewen Chen, Sunhan Xu, Yun Zeng, Haochen Guo, Jian Guo, Shuai Liu, Juan Wang, Bing Li, Weiming Hu, Dehua Liu, Hesong Li

    Abstract: With the rising demand for high-resolution (HR) images, No-Reference Image Quality Assessment (NR-IQA) gains more attention, as it can evaluate image quality in real-time on mobile devices and enhance user experience. However, existing NR-IQA methods often resize or crop the HR images into small resolution, which leads to a loss of important details. And most of them are of high computational comp…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Accepted by ECCV Workshop 2024

  25. arXiv:2409.00917  [pdf, other]

    cs.CV

    Large Scale Unsupervised Brain MRI Image Registration Solution for Learn2Reg 2024

    Authors: Yuxi Zhang, Xiang Chen, Jiazheng Wang, Min Liu, Yaonan Wang, Dongdong Liu, Renjiu Hu, Hang Zhang

    Abstract: In this paper, we summarize the methods and experimental results we proposed for Task 2 in the learn2reg 2024 Challenge. This task focuses on unsupervised registration of anatomical structures in brain MRI images between different patients. The difficulty lies in (1) the absence of segmentation labels and (2) the large amount of data. To address these challenges, we built an efficient backbone network an…

    Submitted 4 September, 2024; v1 submitted 1 September, 2024; originally announced September 2024.

    Comments: MICCAI Learn2Reg 2024 Challenge & WBIR 2024 Workshop on Biomedical Imaging Registration

  26. arXiv:2409.00054  [pdf, other]

    cs.CL cs.AI

    Automating Knowledge Discovery from Scientific Literature via LLMs: A Dual-Agent Approach with Progressive Ontology Prompting

    Authors: Yuting Hu, Dancheng Liu, Qingyun Wang, Charles Yu, Heng Ji, Jinjun Xiong

    Abstract: To address the challenge of automating knowledge discovery from a vast volume of literature, in this paper, we introduce a novel framework based on large language models (LLMs) that combines a progressive ontology prompting (POP) algorithm with a dual-agent system, named LLM-Duo, designed to enhance the automation of knowledge extraction from scientific articles. The POP algorithm utilizes a prior…

    Submitted 20 August, 2024; originally announced September 2024.

    Comments: in submission

  27. arXiv:2408.16500  [pdf, other]

    cs.CV

    CogVLM2: Visual Language Models for Image and Video Understanding

    Authors: Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang

    Abstract: Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2…

    Submitted 29 August, 2024; originally announced August 2024.

  28. arXiv:2408.14789  [pdf, other]

    cs.CV

    Revisiting Surgical Instrument Segmentation Without Human Intervention: A Graph Partitioning View

    Authors: Mingyu Sheng, Jianan Fan, Dongnan Liu, Ron Kikinis, Weidong Cai

    Abstract: Surgical instrument segmentation (SIS) on endoscopic images stands as a long-standing and essential task in the context of computer-assisted interventions for boosting minimally invasive surgery. Given the recent surge of deep learning methodologies and their data-hungry nature, training a neural predictive model based on massive expert-curated annotations has been dominating and served as an off-…

    Submitted 27 August, 2024; originally announced August 2024.

  29. arXiv:2408.12599  [pdf, other]

    cs.CL

    Controllable Text Generation for Large Language Models: A Survey

    Authors: Xun Liang, Hanyu Wang, Yezhaohui Wang, Shichao Song, Jiawei Yang, Simin Niu, Jie Hu, Dan Liu, Shunyu Yao, Feiyu Xiong, Zhiyu Li

    Abstract: In Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated high text generation quality. However, in real-world applications, LLMs must meet increasingly complex requirements. Beyond avoiding misleading or inappropriate content, LLMs are also expected to cater to specific user needs, such as imitating particular writing styles or generating text with poetic richness. Thes…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: 52 pages, 11 figures, 7 tables, 11 equations

    ACM Class: A.2; I.2.7

  30. arXiv:2408.12088  [pdf, other]

    cs.CY

    Mental-Perceiver: Audio-Textual Multimodal Learning for Mental Health Assessment

    Authors: Jinghui Qin, Changsong Liu, Tianchi Tang, Dahuang Liu, Minghao Wang, Qianying Huang, Yang Xu, Rumin Zhang

    Abstract: Mental disorders, such as anxiety and depression, have become a global issue that affects the regular lives of people across different ages. Without proper detection and treatment, anxiety and depression can hinder the sufferer's study, work, and daily life. Fortunately, recent advancements of digital and AI technologies provide new opportunities for better mental health care and many efforts have…

    Submitted 21 August, 2024; originally announced August 2024.

  31. arXiv:2408.11623  [pdf, other]

    cs.IR cs.LG

    End-to-End Cost-Effective Incentive Recommendation under Budget Constraint with Uplift Modeling

    Authors: Zexu Sun, Hao Yang, Dugang Liu, Yunpeng Weng, Xing Tang, Xiuqiang He

    Abstract: In modern online platforms, incentives are essential factors that enhance user engagement and increase platform revenue. Over recent years, uplift modeling has been introduced as a strategic approach to assign incentives to individual customers. Especially in many real-world applications, online platforms can only incentivize customers with specific budget constraints. This problem can be reformul…

    Submitted 24 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: Accepted by RecSys 2024

  32. arXiv:2408.11194  [pdf, other]

    cs.CV

    Compress Guidance in Conditional Diffusion Sampling

    Authors: Anh-Dung Dinh, Daochang Liu, Chang Xu

    Abstract: Enforcing guidance throughout the entire sampling process often proves counterproductive due to the model-fitting issue, where samples are generated to match the classifier's parameters rather than generalizing the expected condition. This work identifies and quantifies the problem, demonstrating that reducing or excluding guidance at numerous timesteps can mitigate this issue. By distributing th…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: 10 pages, 5 figures, Computer Vision and Machine Learning

    ACM Class: I.4

  33. arXiv:2408.08992  [pdf, other]

    cs.HC

    SpreadLine: Visualizing Egocentric Dynamic Influence

    Authors: Yun-Hsin Kuo, Dongyu Liu, Kwan-Liu Ma

    Abstract: Egocentric networks, often visualized as node-link diagrams, portray the complex relationship (link) dynamics between an entity (node) and others. However, common analytics tasks are multifaceted, encompassing interactions among four key aspects: strength, function, structure, and content. Current node-link visualization designs may fall short, focusing narrowly on certain aspects and neglecting t…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: To appear in VIS 2024 and IEEE Transactions on Visualization and Computer Graphics

  34. arXiv:2408.08862  [pdf, other]

    cs.LG

    Visual Agents as Fast and Slow Thinkers

    Authors: Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Ying Nian Wu, Yongfeng Zhang, Dongfang Liu

    Abstract: Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking. While contemporary AI, driven by large language models, demonstrates human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident respon…

    Submitted 6 September, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  35. arXiv:2408.08645  [pdf, other]

    cs.CV

    Extracting polygonal footprints in off-nadir images with Segment Anything Model

    Authors: Kai Li, Jingbo Chen, Yupeng Deng, Yu Meng, Diyou Liu, Junxian Ma, Chenhao Wang

    Abstract: Building Footprint Extraction (BFE) in off-nadir aerial images often relies on roof segmentation and roof-to-footprint offset prediction, then dragging the roof to the footprint via the offset. However, the results from this multi-stage inference are not applicable in data production, because of the low quality of masks given by prediction. To solve this problem, we proposed OBMv2 in this paper, which sup…

    Submitted 16 August, 2024; originally announced August 2024.

  36. arXiv:2408.08604  [pdf, other]

    cs.CV

    Bi-Directional Deep Contextual Video Compression

    Authors: Xihua Sheng, Li Li, Dong Liu, Shiqi Wang

    Abstract: Deep video compression has made remarkable progress in recent years, with the majority of advancements concentrated on P-frame coding. Although efforts to enhance B-frame coding are ongoing, their compression performance is still far behind that of traditional bi-directional video codecs. In this paper, we introduce a bi-directional deep contextual video compression scheme tailored for B-frames, te…

    Submitted 16 August, 2024; originally announced August 2024.

  37. arXiv:2408.08585  [pdf, other]

    cs.IR cs.LG

    OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction

    Authors: Yunpeng Weng, Xing Tang, Zhenhao Xu, Fuyuan Lyu, Dugang Liu, Zexu Sun, Xiuqiang He

    Abstract: Customer Lifetime Value (CLTV) prediction is a critical task in business applications. Accurately predicting CLTV is challenging in real-world business scenarios, as the distribution of CLTV is complex and mutable. Firstly, there is a large number of users without any consumption consisting of a long-tailed part that is too complex to fit. Secondly, the small set of high-value users spent orders o…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: CIKM 2024

  38. arXiv:2408.08105  [pdf, other]

    cs.CV cs.AI

    Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images

    Authors: Zhiyuan Li, Heng Wang, Dongnan Liu, Chaoyi Zhang, Ao Ma, Jieting Long, Weidong Cai

    Abstract: Large Language Models (LLMs) have showcased exceptional ability in causal reasoning from textual information. However, will these causalities remain straightforward for Vision Large Language Models (VLLMs) when only visual hints are provided? Motivated by this, we propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge VLLMs to infer semantic cause-and-effect relationship…

    Submitted 30 August, 2024; v1 submitted 15 August, 2024; originally announced August 2024.

    Comments: 20 pages, 19 figures

  39. arXiv:2408.05533  [pdf, other]

    cs.CV

    Radiance Field Learners As UAV First-Person Viewers

    Authors: Liqi Yan, Qifan Wang, Junhan Zhao, Qiang Guan, Zheng Tang, Jianhui Zhang, Dongfang Liu

    Abstract: First-Person-View (FPV) holds immense potential for revolutionizing the trajectory of Unmanned Aerial Vehicles (UAVs), offering an exhilarating avenue for navigating complex building structures. Yet, traditional Neural Radiance Field (NeRF) methods face challenges such as sampling single points per iteration and requiring an extensive array of views for supervision. UAV videos exacerbate these iss…

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: Accepted to ECCV 2024

    Journal ref: European Conference on Computer Vision (ECCV 2024)

  40. arXiv:2408.03220  [pdf, other]

    cs.LG cs.DC

    Masked Random Noise for Communication Efficient Federated Learning

    Authors: Shiwei Li, Yingyi Cheng, Haozhao Wang, Xing Tang, Shijie Xu, Weihong Luo, Yuhua Li, Dugang Liu, Xiuqiang He, Ruixuan Li

    Abstract: Federated learning is a promising distributed training paradigm that effectively safeguards data privacy. However, it may involve significant communication costs, which hinders training efficiency. In this paper, we aim to enhance communication efficiency from a new perspective. Specifically, we request the distributed clients to find optimal model updates relative to global model parameters withi…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: Accepted by MM 2024

  41. arXiv:2408.03085  [pdf, ps, other]

    quant-ph cs.LG

    Matrix Multiplication on Quantum Computer

    Authors: Jiaqi Yao, Ding Liu

    Abstract: This paper introduces an innovative and practical approach to universal quantum matrix multiplication. We designed optimized quantum adders and multipliers based on Quantum Fourier Transform (QFT), which significantly reduced the number of gates used compared to classical adders and multipliers. Subsequently, we construct a basic universal quantum matrix multiplication and extend it to the Strasse…

    Submitted 6 August, 2024; originally announced August 2024.

  42. arXiv:2408.02693  [pdf, other]

    physics.comp-ph cs.AI

    Diff-PIC: Revolutionizing Particle-In-Cell Simulation for Advancing Nuclear Fusion with Diffusion Models

    Authors: Chuan Liu, Chunshu Wu, Shihui Cao, Mingkai Chen, James Chenhao Liang, Ang Li, Michael Huang, Chuang Ren, Dongfang Liu, Ying Nian Wu, Tong Geng

    Abstract: Sustainable energy is a crucial global challenge, and recent breakthroughs in nuclear fusion ignition underscore the potential of harnessing energy extracted from nuclear fusion in everyday life, thereby drawing significant attention to fusion ignition research, especially Laser-Plasma Interaction (LPI). Unfortunately, the complexity of LPI at ignition scale renders theory-based analysis nearly im…

    Submitted 19 August, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

  43. arXiv:2408.02657  [pdf, other]

    cs.CV

    Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

    Authors: Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao

    Abstract: We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key ins…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Code available at: https://github.com/Alpha-VLLM/Lumina-mGPT

  44. arXiv:2408.02634  [pdf, other]

    cs.GT q-fin.MF q-fin.TR

    CLVR Ordering of Transactions on AMMs

    Authors: Robert McLaughlin, Nir Chemaya, Dingyue Liu, Dahlia Malkhi

    Abstract: Trading on decentralized exchanges via an Automated Market Maker (AMM) mechanism has been massively adopted, with a daily trading volume reaching $1B. This trading method has also received close attention from researchers, central banks, and financial firms, who have the potential to adopt it to traditional financial markets such as foreign exchanges and stock markets. A critical challenge of AMM-…

    Submitted 5 August, 2024; originally announced August 2024.

  45. arXiv:2408.01779  [pdf, other]

    cs.CL

    MathLearner: A Large Language Model Agent Framework for Learning to Solve Mathematical Problems

    Authors: Wenbei Xie, Donglin Liu, Haoran Yan, Wenjie Wu, Zongyang Liu

    Abstract: With the development of artificial intelligence (AI), large language models (LLMs) are widely used in many fields. However, the reasoning ability of LLMs is still very limited when it comes to mathematical reasoning. Mathematics plays an important role in all aspects of human society and is a technical guarantee in the fields of healthcare, transport and aerospace; for this reason, the development o…

    Submitted 3 August, 2024; originally announced August 2024.

  46. arXiv:2408.01701  [pdf, other]

    cs.CV

    Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics

    Authors: Naichuan Zheng, Hailun Xia, Dapeng Liu

    Abstract: In skeletal-based action recognition, Graph Convolutional Networks (GCNs) based methods face limitations due to their complexity and high energy consumption. Spiking Neural Networks (SNNs) have gained attention in recent years for their low energy consumption, but existing methods combining GCNs and SNNs fail to fully utilize the temporal characteristics of skeletal sequences, leading to increased…

    Submitted 3 August, 2024; originally announced August 2024.

  47. arXiv:2408.00790  [pdf, other]

    cs.NE cs.AI

    Improving Air Mobility for Pre-Disaster Planning with Neural Network Accelerated Genetic Algorithm

    Authors: Kamal Acharya, Alvaro Velasquez, Yongxin Liu, Dahai Liu, Liang Sun, Houbing Song

    Abstract: Weather disaster related emergency operations pose a great challenge to air mobility in both aircraft and airport operations, especially when the impact is gradually approaching. We propose an optimized framework for adjusting airport operational schedules for such pre-disaster scenarios. We first aggregate operational data from multiple airports and then determine the optimal count of evacuation…

    Submitted 17 July, 2024; originally announced August 2024.

    Comments: 7 pages, 8 figures, ITSC 2024

  48. FlowGPT: Exploring Domains, Output Modalities, and Goals of Community-Generated AI Chatbots

    Authors: Xian Li, Yuanning Han, Di Liu, Pengcheng An, Shuo Niu

    Abstract: The advent of Generative AI and Large Language Models has not only enhanced the intelligence of interactive applications but also catalyzed the formation of communities passionate about customizing these AI capabilities. FlowGPT, an emerging platform for sharing AI prompts and use cases, exemplifies this trend, attracting many creators who develop and share chatbots with a broader community. Despi…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: To appear at CSCW Companion '24

  49. arXiv:2407.21714  [pdf, other]

    cs.AI q-bio.QM

    UMMAN: Unsupervised Multi-graph Merge Adversarial Network for Disease Prediction Based on Intestinal Flora

    Authors: Dingkun Liu, Hongjie Zhou, Yilu Qu, Huimei Zhang, Yongdong Xu

    Abstract: The abundance of intestinal flora is closely related to human diseases, but diseases are not caused by a single gut microbe. Instead, they result from the complex interplay of numerous microbial entities. This intricate and implicit connection among gut microbes poses a significant challenge for disease prediction using abundance information from OTU data. Recently, several methods have shown pote…

    Submitted 31 July, 2024; originally announced July 2024.

  50. arXiv:2407.21500  [pdf, other]

    cs.RO

    DIABLO: A 6-DoF Wheeled Bipedal Robot Composed Entirely of Direct-Drive Joints

    Authors: Dingchuan Liu, Fangfang Yang, Xuanhong Liao, Ximin Lyu

    Abstract: Wheeled bipedal robots offer the advantages of both wheeled and legged robots, combining the ability to traverse a wide range of terrains and environments with high efficiency. However, the conventional approach in existing wheeled bipedal robots involves motor-driven joints with high-ratio gearboxes. While this approach provides specific benefits, it also presents several challenges, including in…

    Submitted 11 September, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: This paper has already been accepted by IROS 2024