
Showing 1–50 of 1,755 results for author: Xu, H

  1. arXiv:2409.03605  [pdf, other]

    cs.CV cs.MM

    SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

    Authors: Lingyu Xiong, Xize Cheng, Jintao Tan, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihu Hu

    Abstract: Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address the aforementioned challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as intermediate r…

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: 10 pages, 7 figures, 3 tables

  2. arXiv:2409.03420  [pdf, other]

    cs.CV

    mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

    Authors: Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

    Abstract: Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to add…

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: 15 pages, 7 figures

  3. arXiv:2409.02968  [pdf, other]

    cs.DB cs.CR

    A Comprehensive Survey of Blockchain Scalability: Shaping Inner-Chain and Inter-Chain Perspectives

    Authors: Baochao Chen, Liyuan Ma, Hao Xu, Juncheng Ma, Dengcheng Hu, Xiulong Liu, Jie Wu, Jianrong Wang, Keqiu Li

    Abstract: Blockchain is widely applied in logistics, finance, and agriculture. As single blockchain users grow, scalability becomes crucial. However, existing works lack a comprehensive summary of blockchain scalability. They focus on single chains or cross-chain technologies. This survey summarizes scalability across the physical and logical layers, as well as inner-chain, inter-chain, and technology dimen…

    Submitted 4 September, 2024; originally announced September 2024.

  4. arXiv:2409.02657  [pdf, other]

    cs.CV cs.AI cs.MM

    PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

    Authors: Jun Ling, Yiwen Wang, Han Xue, Rong Xie, Li Song

    Abstract: While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is usi…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: 7+5 pages, 15 figures

  5. arXiv:2409.02041  [pdf, other]

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

    Authors: Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao

    Abstract: This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several a…

    Submitted 3 September, 2024; originally announced September 2024.

  6. arXiv:2409.01672  [pdf, other]

    cs.CV cs.AI cs.LG

    Enhancing Fine-Grained Visual Recognition in the Low-Data Regime Through Feature Magnitude Regularization

    Authors: Avraham Chapman, Haiming Xu, Lingqiao Liu

    Abstract: Training a fine-grained image recognition model with limited data presents a significant challenge, as the subtle differences between categories may not be easily discernible amidst distracting noise patterns. One commonly employed strategy is to leverage pretrained neural networks, which can generate effective feature representations for constructing an image classification model with a restricte…

    Submitted 3 September, 2024; originally announced September 2024.

  7. arXiv:2409.01600  [pdf, other]

    cs.SE

    MCBA: A Matroid Constraint-Based Approach for Composite Service Recommendation Considering Compatibility and Diversity

    Authors: Ying Sun, Xiao Wang, Hanchuan Xu, Zhongjie Wang

    Abstract: With the growing popularity of microservices, many companies are encapsulating their business processes as Web APIs for remote invocation. These lightweight Web APIs offer mashup developers an efficient way to achieve complex functionalities without starting from scratch. However, this also presents challenges, such as the concentration of developers' search results on popular APIs limiting diversi…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: 12 pages, 4 figures

  8. arXiv:2409.00494  [pdf, other]

    cs.AI cs.SE

    GenAI-powered Multi-Agent Paradigm for Smart Urban Mobility: Opportunities and Challenges for Integrating Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) with Intelligent Transportation Systems

    Authors: Haowen Xu, Jinghui Yuan, Anye Zhou, Guanhao Xu, Wan Li, Xuegang Ban, Xinyue Ye

    Abstract: Leveraging recent advances in generative AI, multi-agent systems are increasingly being developed to enhance the functionality and efficiency of smart city applications. This paper explores the transformative potential of large language models (LLMs) and emerging Retrieval-Augmented Generation (RAG) technologies in Intelligent Transportation Systems (ITS), paving the way for innovative solutions t…

    Submitted 4 September, 2024; v1 submitted 31 August, 2024; originally announced September 2024.

  9. arXiv:2408.15558  [pdf, ps, other]

    cs.IT

    New quantum codes from constacyclic codes over finite chain rings

    Authors: Yongsheng Tang, Ting Yao, Heqian Xu, Xiaoshan Kai

    Abstract: Let $R$ be the finite chain ring $\mathbb{F}_{p^{2m}}+{u}\mathbb{F}_{p^{2m}}$, where $\mathbb{F}_{p^{2m}}$ is the finite field with $p^{2m}$ elements, $p$ is a prime, $m$ is a non-negative integer and ${u}^{2}=0.$ In this paper, we firstly define a class of Gray maps, which changes the Hermitian self-orthogonal property of linear codes over $\mathbb{F}_{2^{2m}}+{u}\mathbb{F}_{2^{2m}}$ into the H…

    Submitted 28 August, 2024; originally announced August 2024.

  10. arXiv:2408.15034  [pdf, other]

    cs.LG

    MONAS: Efficient Zero-Shot Neural Architecture Search for MCUs

    Authors: Ye Qiao, Haocheng Xu, Yifan Zhang, Sitao Huang

    Abstract: Neural Architecture Search (NAS) has proven effective in discovering new Convolutional Neural Network (CNN) architectures, particularly for scenarios with well-defined accuracy optimization goals. However, previous approaches often involve time-consuming training on super networks or intensive architecture sampling and evaluations. Although various zero-cost proxies correlated with CNN model accur…

    Submitted 26 August, 2024; originally announced August 2024.

  11. arXiv:2408.14969  [pdf, other]

    cs.IT eess.SP

    Secrecy Performance Analysis of RIS-Aided Fluid Antenna Systems

    Authors: Farshad Rostami Ghadi, Kai-Kit Wong, Masoud Kaveh, F. Javier Lopez-Martinez, Wee Kiat New, Hao Xu

    Abstract: This paper examines the impact of emerging fluid antenna systems (FAS) on reconfigurable intelligent surface (RIS)-aided secure communications. Specifically, we consider a classic wiretap channel, where a fixed-antenna transmitter sends confidential information to an FAS-equipped legitimate user with the help of an RIS, while an FAS-equipped eavesdropper attempts to decode the message. To evaluate…

    Submitted 27 August, 2024; originally announced August 2024.

  12. arXiv:2408.14776  [pdf, other]

    cs.CV cs.AI

    MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation

    Authors: Yuanbing Zhu, Bingke Zhu, Zhen Chen, Huan Xu, Ming Tang, Jinqiao Wang

    Abstract: Open-vocabulary semantic segmentation aims to segment and recognize semantically meaningful regions based on text-based descriptions during inference. A typical solution to address this task is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between open- and close-vocabulary recognition. As VLMs are usually pretrained with low-resolution images (e.g.…

    Submitted 27 August, 2024; originally announced August 2024.

    Comments: Technical report

  13. arXiv:2408.14187  [pdf, other]

    cs.CV

    Ensemble Predicate Decoding for Unbiased Scene Graph Generation

    Authors: Jiasong Feng, Lichun Wang, Hongbo Xu, Kai Xu, Baocai Yin

    Abstract: Scene Graph Generation (SGG) aims to generate a comprehensive graphical representation that accurately captures the semantic information of a given scenario. However, the SGG model's performance in predicting more fine-grained predicates is hindered by a significant predicate bias. According to existing works, the long-tail distribution of predicates in training data results in the biased scene gr…

    Submitted 26 August, 2024; originally announced August 2024.

  14. arXiv:2408.13772  [pdf, ps, other]

    cs.OS

    FRAP: A Flexible Resource Accessing Protocol for Multiprocessor Real-Time Systems

    Authors: Shuai Zhao, Hanzhi Xu, Nan Chen, Ruoxian Su, Wanli Chang

    Abstract: Fully-partitioned fixed-priority scheduling (FP-FPS) multiprocessor systems are widely found in real-time applications, where spin-based protocols are often deployed to manage the mutually exclusive access of shared resources. Unfortunately, existing approaches either enforce rigid spin priority rules for resource accessing or carry significant pessimism in the schedulability analysis, imposing su…

    Submitted 27 August, 2024; v1 submitted 25 August, 2024; originally announced August 2024.

  15. arXiv:2408.13005  [pdf, other]

    cs.CV

    EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation

    Authors: Cong Wang, Jiaxi Gu, Panwen Hu, Haoyu Zhao, Yuanfan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

    Abstract: Following the advancements in text-guided image generation technology exemplified by Stable Diffusion, video generation is gaining increased attention in the academic community. However, relying solely on text guidance for video generation has serious limitations, as videos contain much richer content than images, especially in terms of motion. This information can hardly be adequately described w…

    Submitted 23 August, 2024; originally announced August 2024.

  16. arXiv:2408.12321  [pdf, other]

    cs.CL cs.CV cs.MM

    MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

    Authors: Chaoya Jiang, Jia Hongrui, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang

    Abstract: This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete…

    Submitted 26 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

  17. arXiv:2408.11518  [pdf, other]

    cs.CV

    EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

    Authors: Yihong Lin, Liang Peng, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei, Xianjia Wu, Huang Xu

    Abstract: The creation of increasingly vivid 3D virtual digital humans has become a hot topic in recent years. Currently, most speech-driven work focuses on training models to learn the relationship between phonemes and visemes to achieve more realistic lips. However, they fail to capture the correlations between emotions and facial expressions effectively. To solve this problem, we propose a new model, ter…

    Submitted 21 August, 2024; originally announced August 2024.

  18. arXiv:2408.11447  [pdf, other]

    cs.CV

    GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

    Authors: Wanshui Gan, Fang Liu, Hongbin Xu, Ningkai Mo, Naoto Yokoya

    Abstract: We introduce GaussianOcc, a systematic method that investigates the two usages of Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground truth 6D poses from sensors during training. To address this limitation, we propose Gaussian Splatting for Projection (GSP)…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: Project page: https://ganwanshui.github.io/GaussianOcc/

  19. arXiv:2408.11227  [pdf]

    eess.IV cs.AI cs.CV

    OCTCube: A 3D foundation model for optical coherence tomography that improves cross-dataset, cross-disease, cross-device and cross-modality analysis

    Authors: Zixuan Liu, Hanwen Xu, Addie Woicik, Linda G. Shapiro, Marian Blazes, Yue Wu, Cecilia S. Lee, Aaron Y. Lee, Sheng Wang

    Abstract: Optical coherence tomography (OCT) has become critical for diagnosing retinal diseases as it enables 3D images of the retina and optic nerve. OCT acquisition is fast, non-invasive, affordable, and scalable. Due to its broad applicability, massive numbers of OCT images have been accumulated in routine exams, making it possible to train large-scale foundation models that can generalize to various di…

    Submitted 20 August, 2024; originally announced August 2024.

  20. arXiv:2408.10680  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper

    Authors: Tianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie

    Abstract: Pre-trained multilingual speech foundation models, like Whisper, have shown impressive performance across different languages. However, adapting these models to new or specific languages is computationally extensive and faces catastrophic forgetting problems. Addressing these issues, our study investigates strategies to enhance the model on new languages in the absence of original training data, w…

    Submitted 20 August, 2024; originally announced August 2024.

  21. arXiv:2408.10096  [pdf, other]

    cs.SD cs.AI eess.AS

    Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision

    Authors: Zhijun Jia, Huaying Xue, Xiulian Peng, Yan Lu

    Abstract: Low resource of parallel data is the key challenge of accent conversion (AC) problem in which both the pronunciation units and prosody pattern need to be converted. We propose a two-stage generative framework "convert-and-speak" in which the conversion is only operated on the semantic token level and the speech is synthesized conditioned on the converted semantic token with a speech generative mode…

    Submitted 22 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

    Comments: 9 pages, ACM MM2024(accepted)

  22. arXiv:2408.09865  [pdf, other]

    cs.LG cs.CL cs.IR

    MAPLE: Enhancing Review Generation with Multi-Aspect Prompt LEarning in Explainable Recommendation

    Authors: Ching-Wen Yang, Che Wei Chen, Kun-da Wu, Hao Xu, Jui-Feng Yao, Hung-Yu Kao

    Abstract: Explainable Recommendation task is designed to receive a pair of user and item and output explanations to justify why an item is recommended to a user. Many models treat review-generation as a proxy of explainable recommendation. Although they are able to generate fluent and grammatical sentences, they suffer from generality and hallucination issues. We propose a personalized, aspect-controlled mo…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: 8 main pages, 10 pages for appendix. Under review

  23. arXiv:2408.09722  [pdf, other]

    cs.LG stat.ML

    Towards Few-Shot Learning in the Open World: A Review and Beyond

    Authors: Hui Xue, Yuexuan An, Yongchun Qin, Wenqian Li, Yixin Wu, Yongjuan Che, Pengfei Fang, Minling Zhang

    Abstract: Human intelligence is characterized by our ability to absorb and apply knowledge from the world around us, especially in rapidly acquiring new concepts from minimal examples, underpinned by prior knowledge. Few-shot learning (FSL) aims to mimic this capacity by enabling significant generalizations and transferability. However, traditional FSL frameworks often rely on assumptions of clean, complete…

    Submitted 19 August, 2024; originally announced August 2024.

  24. Language-Driven Interactive Shadow Detection

    Authors: Hongqiu Wang, Wei Wang, Haipeng Zhou, Huihui Xu, Shaozhi Wu, Lei Zhu

    Abstract: Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. This novel RVSD not only achieves segmentation of arbitrar…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: ACM MM 2024

  25. arXiv:2408.07532  [pdf, other]

    eess.IV cs.CV

    Improved 3D Whole Heart Geometry from Sparse CMR Slices

    Authors: Yiyang Xu, Hao Xu, Matthew Sinclair, Esther Puyol-Antón, Steven A Niederer, Amedeo Chiribiri, Steven E Williams, Michelle C Williams, Alistair A Young

    Abstract: Cardiac magnetic resonance (CMR) imaging and computed tomography (CT) are two common non-invasive imaging methods for assessing patients with cardiovascular disease. CMR typically acquires multiple sparse 2D slices, with unavoidable respiratory motion artefacts between slices, whereas CT acquires isotropic dense data but uses ionising radiation. In this study, we explore the combination of Slice S…

    Submitted 14 August, 2024; originally announced August 2024.

    Comments: 13 pages, STACOM2024

  26. arXiv:2408.04840  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

    Authors: Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

    Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenario…

    Submitted 13 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

  27. arXiv:2408.03865  [pdf, other]

    cs.LG

    PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training

    Authors: Haoran Xu, Ziqian Liu, Rong Fu, Zhongling Su, Zerui Wang, Zheng Cai, Zhilin Pei, Xingcheng Zhang

    Abstract: With the evolution of large language models, traditional Transformer models become computationally demanding for lengthy sequences due to the quadratic growth in computation with respect to the sequence length. Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences with reduced computational and memory com…

    Submitted 21 August, 2024; v1 submitted 7 August, 2024; originally announced August 2024.

  28. arXiv:2408.03091  [pdf, other]

    cs.IR

    Modeling User Intent Beyond Trigger: Incorporating Uncertainty for Trigger-Induced Recommendation

    Authors: Jianxing Ma, Zhibo Xiao, Luwei Yang, Hansheng Xue, Xuanzhou Liu, Wen Jiang, Wei Ning, Guannan Zhang

    Abstract: To cater to users' desire for an immersive browsing experience, numerous e-commerce platforms provide various recommendation scenarios, with a focus on Trigger-Induced Recommendation (TIR) tasks. However, the majority of current TIR methods heavily rely on the trigger item to understand user intent, lacking a higher-level exploration and exploitation of user intent (e.g., popular items and complem…

    Submitted 7 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

    Comments: Accepted at CIKM 2024

  29. arXiv:2408.02966  [pdf, other]

    cs.CV eess.IV

    Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement

    Authors: Hao Xu, Xi Zhang, Xiaolin Wu

    Abstract: Compressing a set of unordered points is far more challenging than compressing images/videos of regular sample grids, because of the difficulties in characterizing neighboring relations in an irregular layout of points. Many researchers resort to voxelization to introduce regularity, but this approach suffers from quantization loss. In this research, we use the KNN method to determine the neighbor…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  30. arXiv:2408.01826  [pdf, other]

    cs.CV

    GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

    Authors: Yihong Lin, Zhaoxin Fan, Lingyu Xiong, Liang Peng, Xiandong Li, Wenxiong Kang, Xianjia Wu, Songju Lei, Huang Xu

    Abstract: Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in…

    Submitted 16 August, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

    Comments: 9 pages, 5 figures

  31. arXiv:2408.01077  [pdf, other]

    cs.CV

    PhysMamba: State Space Duality Model for Remote Physiological Measurement

    Authors: Zhixin Yan, Yan Zhong, Wenjun Zhang, Lin Shu, Hongbin Xu, Wenxiong Kang

    Abstract: Remote Photoplethysmography (rPPG) is a non-contact technique for extracting physiological signals from facial videos, used in applications like emotion monitoring, medical assistance, and anti-face spoofing. Unlike controlled laboratory settings, real-world environments often contain motion artifacts and noise, affecting the performance of existing rPPG methods. To address this, we propose PhysMa…

    Submitted 17 August, 2024; v1 submitted 2 August, 2024; originally announced August 2024.

  32. arXiv:2408.00438  [pdf, other]

    cs.CV

    MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

    Authors: Youjia Fu, Zihao Xu, Junsong Fu, Huixia Xue, Shuqiu Tan, Lei Li

    Abstract: Recent advancements in transformer-based monocular 3D object detection techniques have exhibited exceptional performance in inferring 3D attributes from single 2D images. However, most existing methods rely on resource-intensive transformer architectures, which often lead to significant drops in computational efficiency and performance when handling long sequence data. To address these challenges…

    Submitted 1 August, 2024; originally announced August 2024.

  33. arXiv:2408.00310  [pdf, other]

    cs.LG math.OC

    Online Linear Programming with Batching

    Authors: Haoran Xu, Peter W. Glynn, Yinyu Ye

    Abstract: We study Online Linear Programming (OLP) with batching. The planning horizon is cut into $K$ batches, and the decisions on customers arriving within a batch can be delayed to the end of their associated batch. Compared with OLP without batching, the ability to delay decisions brings better operational performance, as measured by regret. Two research questions of interest are: (1) What is a lower b…

    Submitted 1 August, 2024; originally announced August 2024.

  34. arXiv:2407.21783  [pdf, other]

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (510 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…

    Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  35. RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining

    Authors: Hongtao Wu, Yijun Yang, Huihui Xu, Weiming Wang, Jinni Zhou, Lei Zhu

    Abstract: The outdoor vision systems are frequently contaminated by rain streaks and raindrops, which significantly degenerate the performance of visual tasks and multimedia applications. The nature of videos exhibits redundant temporal cues for rain removal with higher stability. Traditional video deraining methods heavily rely on optical flow estimation and kernel-based manners, which have a limited recep…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: ACM Multimedia 2024

  36. arXiv:2407.20981  [pdf, other]

    cs.GT

    Escape Sensing Games: Detection-vs-Evasion in Security Applications

    Authors: Niclas Boehmer, Minbiao Han, Haifeng Xu, Milind Tambe

    Abstract: Traditional game-theoretic research for security applications primarily focuses on the allocation of external protection resources to defend targets. This work puts forward the study of a new class of games centered around strategically arranging targets to protect them against a constrained adversary, with motivations from varied domains such as peacekeeping resource transit and cybersecurity. Sp…

    Submitted 30 July, 2024; originally announced July 2024.

  37. arXiv:2407.20893  [pdf, other]

    cs.LG cs.AI eess.SP

    MambaCapsule: Towards Transparent Cardiac Disease Diagnosis with Electrocardiography Using Mamba Capsule Network

    Authors: Yinlong Xu, Xiaoqiang Liu, Zitai Kong, Yixuan Wu, Yue Wang, Yingzhou Lu, Honghao Gao, Jian Wu, Hongxia Xu

    Abstract: Cardiac arrhythmia, a condition characterized by irregular heartbeats, often serves as an early indication of various heart ailments. With the advent of deep learning, numerous innovative models have been introduced for diagnosing arrhythmias using Electrocardiogram (ECG) signals. However, recent studies solely focus on the performance of models, neglecting the interpretation of their results. Thi…

    Submitted 30 July, 2024; originally announced July 2024.

  38. arXiv:2407.20251  [pdf, other]

    eess.SP cond-mat.mtrl-sci cs.LG

    An Uncertainty-aware Deep Learning Framework-based Robust Design Optimization of Metamaterial Units

    Authors: Zihan Wang, Anindya Bhaduri, Hongyi Xu, Liping Wang

    Abstract: Mechanical metamaterials represent an innovative class of artificial structures, distinguished by their extraordinary mechanical characteristics, which are beyond the scope of traditional natural materials. The use of deep generative models has become increasingly popular in the design of metamaterial units. The effectiveness of using deep generative models lies in their capacity to compress compl…

    Submitted 19 July, 2024; originally announced July 2024.

  39. arXiv:2407.20109  [pdf, other]

    cs.LG cs.AI

    Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning

    Authors: Liyuan Mao, Haoran Xu, Weinan Zhang, Xianyuan Zhan, Amy Zhang

    Abstract: One important property of DIstribution Correction Estimation (DICE) methods is that the solution is the optimal stationary distribution ratio between the optimized and data collection policy. In this work, we show that DICE-based methods can be viewed as a transformation from the behavior distribution to the optimal policy distribution. Based on this, we propose a novel approach, Diffusion-DICE, t…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Preprint, under review

  40. arXiv:2407.19852  [pdf]

    quant-ph cs.LG q-bio.BM

    Quantum Long Short-Term Memory for Drug Discovery

    Authors: Liang Zhang, Yin Xu, Mohan Wu, Liang Wang, Hua Xu

    Abstract: Quantum computing combined with machine learning (ML) is an extremely promising research area, with numerous studies demonstrating that quantum machine learning (QML) is expected to solve scientific problems more effectively than classical ML. In this work, we successfully apply QML to drug discovery, showing that QML can significantly improve model performance and achieve faster convergence compa…

    Submitted 29 July, 2024; originally announced July 2024.

  41. arXiv:2407.19456  [pdf, other]

    cs.MM

    An Inverse Partial Optimal Transport Framework for Music-guided Movie Trailer Generation

    Authors: Yutong Wang, Sidan Zhu, Hongteng Xu, Dixin Luo

    Abstract: Trailer generation is a challenging video clipping task that aims to select highlighting shots from long videos like movies and re-organize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based…

    Submitted 30 July, 2024; v1 submitted 28 July, 2024; originally announced July 2024.

    Comments: acmmm2024

  42. arXiv:2407.19302  [pdf, other]

    cs.CL cs.MM

    IBMEA: Exploring Variational Information Bottleneck for Multi-modal Entity Alignment

    Authors: Taoyu Su, Jiawei Sheng, Shicheng Wang, Xinghua Zhang, Hongbo Xu, Tingwen Liu

    Abstract: Multi-modal entity alignment (MMEA) aims to identify equivalent entities between multi-modal knowledge graphs (MMKGs), where the entities can be associated with related images. Most existing studies integrate multi-modal information heavily relying on the automatically-learned fusion module, rarely suppressing the redundant information for MMEA explicitly. To this end, we explore variational infor…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: Accepted by ACM MM 2024

  43. arXiv:2407.19296  [pdf, other]

    cs.AI

    Multi-Modal CLIP-Informed Protein Editing

    Authors: Mingze Yin, Hanjing Zhou, Yiheng Zhu, Miao Lin, Yixuan Wu, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jintai Chen, Jian Wu

    Abstract: Proteins govern most biological functions essential for life, but achieving controllable protein discovery and optimization remains challenging. Recently, machine learning-assisted protein editing (MLPE) has shown promise in accelerating optimization cycles and reducing experimental workloads. However, current methods struggle with the vast combinatorial space of potential protein edits and cannot…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: 13 pages, 7 figures, 5 tables

  44. arXiv:2407.19256  [pdf]

    cs.AI cs.CL cs.LG

    Stochastic Parrots or ICU Experts? Large Language Models in Critical Care Medicine: A Scoping Review

    Authors: Tongyue Shi, Jun Ma, Zihan Yu, Haowei Xu, Minqi Xiong, Meirong Xiao, Yilin Li, Huiying Zhao, Guilan Kong

    Abstract: With the rapid development of artificial intelligence (AI), large language models (LLMs) have shown strong capabilities in natural language understanding, reasoning, and generation, attracting amounts of research interest in applying LLMs to health and medicine. Critical care medicine (CCM) provides diagnosis and treatment for critically ill patients who often require intensive monitoring and inte…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: 28 pages, 5 figures

  45. WorkR: Occupation Inference for Intelligent Task Assistance

    Authors: Yonchanok Khaokaew, Hao Xue, Mohammad Saiedur Rahaman, Flora D. Salim

    Abstract: Occupation information can be utilized by digital assistants to provide occupation-specific personalized task support, including interruption management, task planning, and recommendations. Prior research in the digital workplace assistant domain requires users to input their occupation information for effective support. However, as many individuals switch between multiple occupations daily, curre…

    Submitted 26 July, 2024; originally announced July 2024.

  46. arXiv:2407.18148  [pdf, other]

    cs.DC cs.LG

    StraightLine: An End-to-End Resource-Aware Scheduler for Machine Learning Application Requests

    Authors: Cheng-Wei Ching, Boyuan Guan, Hailu Xu, Liting Hu

    Abstract: The life cycle of machine learning (ML) applications consists of two stages: model development and model deployment. However, traditional ML systems (e.g., training-specific or inference-specific systems) focus on one particular stage or phase of the life cycle of ML applications. These systems often aim at optimizing model training or accelerating model inference, and they frequently assume homog…

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: 6 pages, 8 figures, to appear in AIoTC'24

  47. arXiv:2407.16697  [pdf, other]

    cs.CV

    AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking

    Authors: Wenxuan Li, Chongyu Qu, Xiaoxi Chen, Pedro R. A. S. Bassi, Yijia Shi, Yuxiang Lai, Qian Yu, Huimin Xue, Yixiong Chen, Xiaorui Lin, Yutong Tang, Yining Cao, Haoqi Han, Zheyuan Zhang, Jiawei Liu, Tiezheng Zhang, Yujiu Ma, Jincheng Wang, Guang Zhang, Alan Yuille, Zongwei Zhou

    Abstract: We introduce the largest abdominal CT dataset (termed AbdomenAtlas) of 20,460 three-dimensional CT volumes sourced from 112 hospitals across diverse populations, geographies, and facilities. AbdomenAtlas provides 673K high-quality masks of anatomical structures in the abdominal region annotated by a team of 10 radiologists with the help of AI algorithms. We start by having expert radiologists manu…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: Published in Medical Image Analysis

  48. arXiv:2407.16667  [pdf, other]

    cs.CR cs.AI cs.CL

    RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

    Authors: Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren

    Abstract: Recently, advanced Large Language Models (LLMs) such as GPT-4 have been integrated into many real-world applications like Code Copilot. These applications have significantly expanded the attack surface of LLMs, exposing them to a variety of threats. Among them, jailbreak attacks that induce toxic responses through jailbreak prompts have raised critical safety concerns. To identify these threats, a…

    Submitted 23 July, 2024; originally announced July 2024.

  49. arXiv:2407.15840  [pdf, other]

    cs.RO

    QueST: Self-Supervised Skill Abstractions for Learning Continuous Control

    Authors: Atharva Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, Animesh Garg

    Abstract: Generalization capabilities, or rather a lack thereof, is one of the most important unsolved problems in the field of robot learning, and while several large scale efforts have set out to tackle this problem, unsolved it remains. In this paper, we hypothesize that learning temporal action abstractions using latent variable models (LVMs), which learn to map data to a compressed latent space and bac…

    Submitted 22 July, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

    Comments: Keywords: Behavior Cloning, Action Quantization, Self Supervised Skill Abstraction, Few-shot Imitation Learning

  50. arXiv:2407.15815  [pdf, other]

    cs.RO cs.AI cs.CV

    Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

    Authors: Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, Huazhe Xu

    Abstract: Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning app…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Webpage: https://gemcollector.github.io/maniwhere/