Skip to main content

Showing 1–50 of 296 results for author: He, C

Searching in archive cs. Search in all archives.
.
  1. arXiv:2409.03643  [pdf, other

    cs.CV cs.CL

    CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation

    Authors: Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, Conghui He

    Abstract: Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and is highly… ▽ More

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: Project Website: https://github.com/opendatalab/UniMERNet/tree/main/cdm

  2. arXiv:2408.17267  [pdf, other

    cs.CV cs.AI

    UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

    Authors: Baichuan Zhou, Haote Yang, Dairong Chen, Junyan Ye, Tianyi Bai, Jinhua Yu, Songyang Zhang, Dahua Lin, Conghui He, Weijia Li

    Abstract: Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these… ▽ More

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: 9 pages, 6 figures

  3. arXiv:2408.14765  [pdf, other

    cs.CV

    CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

    Authors: Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, Conghui He

    Abstract: Satellite-to-street view synthesis aims at generating a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibit remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis tas… ▽ More

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: 21 pages, 11 figures

  4. arXiv:2408.14562  [pdf, other

    cs.CV cs.AI

    A Survey of Camouflaged Object Detection and Beyond

    Authors: Fengyang Xiao, Sujie Hu, Yuqi Shen, Chengyu Fang, Jinfa Huang, Chunming He, Longxiang Tang, Ziyun Yang, Xiu Li

    Abstract: Camouflaged Object Detection (COD) refers to the task of identifying and segmenting objects that blend seamlessly into their surroundings, posing a significant challenge for computer vision systems. In recent years, COD has garnered widespread attention due to its potential applications in surveillance, wildlife conservation, autonomous systems, and more. While several surveys on COD exist, they o… ▽ More

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: 26 pages, 10 figures, 8 tables

  5. arXiv:2408.13567  [pdf, other

    cs.LG cs.MA

    Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning

    Authors: Mingliang Zhang, Sichang Su, Chengyang He, Guillaume Sartoretti

    Abstract: In multi-agent reinforcement learning (MARL), achieving multi-task generalization to diverse agents and objectives presents significant challenges. Existing online MARL algorithms primarily focus on single-task performance, but their lack of multi-task generalization capabilities typically results in substantial computational waste and limited real-life applicability. Meanwhile, existing offline m… ▽ More

    Submitted 24 August, 2024; originally announced August 2024.

  6. arXiv:2408.13483  [pdf, other

    eess.SP cs.IT

    Transmissive RIS Enabled Transceiver Systems:Architecture, Design Issues and Opportunities

    Authors: Zhendong Li, Wen Chen, Qingqing Wu, Ziwei Liu, Chong He, Xudong Bai, Jun Li

    Abstract: Reconfigurable intelligent surface (RIS) is anticipated to augment the performance of beyond fifth-generation (B5G) and sixth-generation (6G) networks by intelligently manipulating the state of its components. Rather than employing reflective RIS for aided communications, this paper proposes an innovative transmissive RIS-enabled transceiver (TRTC) architecture that can accomplish the functions of… ▽ More

    Submitted 24 August, 2024; originally announced August 2024.

    Journal ref: IEEE VTM, 2024

  7. arXiv:2408.12320  [pdf, other

    cs.AI cs.LG

    PolyRouter: A Multi-LLM Querying System

    Authors: Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He

    Abstract: With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet, no single LLM exists to efficiently balance this trilemma. Some models are powerful but extremely costly, while others are fas… ▽ More

    Submitted 26 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

    Comments: 14 pages, 7 figures, 2 tables

    ACM Class: I.2; I.5

  8. arXiv:2408.11982  [pdf, other

    eess.IV cs.CV cs.MM

    AIM 2024 Challenge on Compressed Video Quality Assessment: Methods and Results

    Authors: Maksim Smirnov, Aleksandr Gushchin, Anastasia Antsiferova, Dmitry Vatolin, Radu Timofte, Ziheng Jia, Zicheng Zhang, Wei Sun, Jiaying Qian, Yuqin Cao, Yinan Sun, Yuxin Zhu, Xiongkuo Min, Guangtao Zhai, Kanjar De, Qing Luo, Ao-Xiang Zhang, Peng Zhang, Haibo Lei, Linyan Jiang, Yaqing Li, Wenhui Meng, Xiaoheng Tan, Haiqiang Wang, Xiaozhong Xu , et al. (11 additional authors not shown)

    Abstract: Video quality assessment (VQA) is a crucial task in the development of video compression standards, as it directly impacts the viewer experience. This paper presents the results of the Compressed Video Quality Assessment challenge, held in conjunction with the Advances in Image Manipulation (AIM) workshop at ECCV 2024. The challenge aimed to evaluate the performance of VQA methods on a diverse dat… ▽ More

    Submitted 28 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

  9. arXiv:2408.09460  [pdf, other

    cs.CV

    Fine-Grained Building Function Recognition from Street-View Images via Geometry-Aware Semi-Supervised Learning

    Authors: Weijia Li, Jinhua Yu, Dairong Chen, Yi Lin, Runmin Dong, Xiang Zhang, Conghui He, Haohuan Fu

    Abstract: In this work, we propose a geometry-aware semi-supervised method for fine-grained building function recognition. This method leverages the geometric relationships between multi-source data to improve the accuracy of pseudo labels in semi-supervised learning, extending the task's scope and making it applicable to cross-categorization systems of building function recognition. Firstly, we design an o… ▽ More

    Submitted 27 August, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

    Comments: This paper is currently under review

  10. arXiv:2408.05475  [pdf, other

    cs.CV

    Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

    Authors: Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, Conghui He

    Abstract: Cross-view geolocalization identifies the geographic location of street view images by matching them with a georeferenced satellite database. Significant challenges arise due to the drastic appearance and geometry differences between views. In this paper, we propose a new approach for cross-view image geo-localization, i.e., the Panorama-BEV Co-Retrieval Network. Specifically, by utilizing the gro… ▽ More

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  11. arXiv:2408.05404  [pdf, other

    cs.CL

    LaiDA: Linguistics-aware In-context Learning with Data Augmentation for Metaphor Components Identification

    Authors: Hongde Liu, Chenyuan He, Feiyang Meng, Changyong Niu, Yuxiang Jia

    Abstract: Metaphor Components Identification (MCI) contributes to enhancing machine understanding of metaphors, thereby advancing downstream natural language processing tasks. However, the complexity, diversity, and dependency on context and background knowledge pose significant challenges for MCI. Large language models (LLMs) offer new avenues for accurate comprehension of complex natural language texts du… ▽ More

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: This paper has been accepted by NLPCC 2024 Shared Tasks

  12. arXiv:2408.03063  [pdf, other

    cs.RO

    Social Behavior as a Key to Learning-based Multi-Agent Pathfinding Dilemmas

    Authors: Chengyang He, Tanishq Duhan, Parth Tulsyan, Patrick Kim, Guillaume Sartoretti

    Abstract: The Multi-agent Path Finding (MAPF) problem involves finding collision-free paths for a team of agents in a known, static environment, with important applications in warehouse automation, logistics, or last-mile delivery. To meet the needs of these large-scale applications, current learning-based methods often deploy the same fully trained, decentralized network to all agents to improve scalabilit… ▽ More

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: Submitted to Springer's Artificial Intelligence Journal

  13. arXiv:2408.02695  [pdf, other

    cs.LG cs.AI

    Distribution-Level Memory Recall for Continual Learning: Preserving Knowledge and Avoiding Confusion

    Authors: Shaoxu Cheng, Kanglei Geng, Chiyuan He, Zihuan Qiu, Linfeng Xu, Heqian Qiu, Lanxiao Wang, Qingbo Wu, Fanman Meng, Hongliang Li

    Abstract: Continual Learning (CL) aims to enable Deep Neural Networks (DNNs) to learn new data without forgetting previously learned knowledge. The key to achieving this goal is to avoid confusion at the feature level, i.e., avoiding confusion within old tasks and between new and old tasks. Previous prototype-based CL methods generate pseudo features for old knowledge replay by adding Gaussian noise to the… ▽ More

    Submitted 4 August, 2024; originally announced August 2024.

  14. arXiv:2408.01812  [pdf, other

    cs.CV

    SkyDiffusion: Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm

    Authors: Junyan Ye, Jun He, Weijia Li, Zhutao Lv, Jinhua Yu, Haote Yang, Conghui He

    Abstract: Street-to-satellite image synthesis focuses on generating realistic satellite images from corresponding ground street-view images while maintaining a consistent content layout, similar to looking down from the sky. The significant differences in perspectives create a substantial domain gap between the views, making this cross-view generation task particularly challenging. In this paper, we introdu… ▽ More

    Submitted 17 August, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

    Comments: 9 pages, 6 figures

  15. arXiv:2408.00008  [pdf, other

    cs.DC cs.LG

    ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

    Authors: Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He

    Abstract: Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual sub-procedures, e.g. local inference and communication, however, there is no comprehensive framework that provides a holistic system view for optimizing LLM servin… ▽ More

    Submitted 23 July, 2024; originally announced August 2024.

  16. arXiv:2407.21669  [pdf, other

    cs.CL cs.LG

    Synth-Empathy: Towards High-Quality Synthetic Empathy Data

    Authors: Hao Liang, Linzhuang Sun, Jingxuan Wei, Xijie Huang, Linkun Sun, Bihui Yu, Conghui He, Wentao Zhang

    Abstract: In recent years, with the rapid advancements in large language models (LLMs), achieving excellent empathetic response capabilities has become a crucial prerequisite. Consequently, managing and understanding empathetic datasets have gained increasing significance. However, empathetic data are typically human-labeled, leading to insufficient datasets and wasted human labor. In this work, we present… ▽ More

    Submitted 10 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2407.01937

  17. arXiv:2407.20772  [pdf, other

    eess.SP cs.NI

    Edge Learning Based Collaborative Automatic Modulation Classification for Hierarchical Cognitive Radio Networks

    Authors: Peihao Dong, Chaowei He, Shen Gao, Fuhui Zhou, Qihui Wu

    Abstract: In hierarchical cognitive radio networks, edge or cloud servers utilize the data collected by edge devices for modulation classification, which, however, is faced with problems of the computation load, transmission overhead, and data privacy. In this article, an edge learning (EL) based framework jointly mobilizing the edge device and the edge server for intelligent co-inference is proposed to rea… ▽ More

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: Accepted by IEEE Internet of Things Journal

  18. arXiv:2407.20756  [pdf, other

    cs.CV cs.CL

    SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

    Authors: Zheng Liu, Hao Liang, Xijie Huang, Wentao Xiong, Qinhan Yu, Linzhuang Sun, Chong Chen, Conghui He, Bin Cui, Wentao Zhang

    Abstract: Recently, with the rise of web images, managing and understanding large-scale image datasets has become increasingly important. Vision Large Language Models (VLLMs) have recently emerged due to their robust vision-understanding capabilities. However, training these models requires vast amounts of data, posing challenges to efficiency, effectiveness, data quality, and privacy. In this paper, we int… ▽ More

    Submitted 10 August, 2024; v1 submitted 30 July, 2024; originally announced July 2024.

  19. arXiv:2407.19451  [pdf, other

    cs.CV cs.GR

    Perm: A Parametric Representation for Multi-Style 3D Hair Modeling

    Authors: Chengan He, Xin Sun, Zhixin Shu, Fujun Luan, Sören Pirk, Jorge Alejandro Amador Herrera, Dominik L. Michels, Tuanfeng Y. Wang, Meng Zhang, Holly Rushmeier, Yi Zhou

    Abstract: We present Perm, a learned parametric model of human 3D hair designed to facilitate various hair-related applications. Unlike previous work that jointly models the global hair shape and local strand details, we propose to disentangle them using a PCA-based strand representation in the frequency domain, thereby allowing more precise editing and output control. Specifically, we leverage our strand r… ▽ More

    Submitted 8 August, 2024; v1 submitted 28 July, 2024; originally announced July 2024.

    Comments: Project page: https://cs.yale.edu/homes/che/projects/perm/

  20. arXiv:2407.13773  [pdf, other

    cs.DL cs.AI

    OpenDataLab: Empowering General Artificial Intelligence with Open Datasets

    Authors: Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, Bin Wang, Dahua Lin

    Abstract: The advancement of artificial intelligence (AI) hinges on the quality and accessibility of data, yet the current fragmentation and variability of data sources hinder efficient data utilization. The dispersion of data sources and diversity of data formats often lead to inefficiencies in data retrieval and processing, significantly impeding the progress of AI research and applications. To address th… ▽ More

    Submitted 4 June, 2024; originally announced July 2024.

  21. arXiv:2407.11466  [pdf, other

    cs.CY

    Navigating the Data Trading Crossroads: An Interdisciplinary Survey

    Authors: Yi Yu, Jingru Yu, Xuhong Wang, Juanjuan Li, Yilun Lin, Conghui He, Yanqing Yang, Yu Qiao, Li Li, Fei-Yue Wang

    Abstract: Data has been increasingly recognized as a critical factor in the future economy. However, constructing an efficient data trading market faces challenges such as privacy breaches, data monopolies, and misuse. Despite numerous studies proposing algorithms to protect privacy and methods for pricing data, a comprehensive understanding of these issues and systemic solutions remain elusive. This paper… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

  22. arXiv:2407.09781  [pdf, other

    cs.CV

    Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

    Authors: Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M. Patel, Lei Zhang

    Abstract: Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. Previous open-vocabulary 3D scene understanding methods mostly focus on training 3D models using either image or text supervision while neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework to densely co-em… ▽ More

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  23. arXiv:2407.08966  [pdf, other

    cs.CV cs.AI cs.LG

    LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models

    Authors: Yabin Zhang, Wenjie Zhu, Chenhang He, Lei Zhang

    Abstract: Out-of-distribution (OOD) detection is crucial for model reliability, as it identifies samples from unknown classes and reduces errors due to unexpected inputs. Vision-Language Models (VLMs) such as CLIP are emerging as powerful tools for OOD detection by integrating multi-modal information. However, the practical application of such systems is challenged by manual prompt engineering, which demand… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: ECCV2024; Codes and Supp. are available at: https://github.com/YBZh/LAPT

  24. arXiv:2407.08517  [pdf, other

    cs.CV

    Generalized Low-Rank Matrix Completion Model with Overlapping Group Error Representation

    Authors: Wenjing Lu, Zhuang Fang, Liang Wu, Liming Tang, Hanxin Liu, Chuanjiang He

    Abstract: The low-rank matrix completion (LRMC) technology has achieved remarkable results in low-level visual tasks. There is an underlying assumption that the real-world matrix data is low-rank in LRMC. However, the real matrix data does not satisfy the strict low-rank property, which undoubtedly present serious challenges for the above-mentioned matrix recovery methods. Fortunately, there are feasible sc… ▽ More

    Submitted 20 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

  25. arXiv:2407.05638  [pdf, other

    cs.CV

    HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion

    Authors: Junhao Su, Chenghao He, Feiyu Zhu, Xiaojie Xu, Dongzhi Guan, Chenyang Si

    Abstract: Traditional deep learning relies on end-to-end backpropagation for training, but it suffers from drawbacks such as high memory consumption and not aligning with biological neural networks. Recent advancements have introduced locally supervised learning, which divides networks into modules with isolated gradients and trains them locally. However, this approach can lead to performance lag due to lim… ▽ More

    Submitted 8 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  26. arXiv:2407.05623  [pdf, other

    cs.CV

    Momentum Auxiliary Network for Supervised Local Learning

    Authors: Junhao Su, Changpeng Cai, Feiyu Zhu, Chenghao He, Xiaojie Xu, Dongzhi Guan, Chenyang Si

    Abstract: Deep neural networks conventionally employ end-to-end backpropagation for their training process, which lacks biological credibility and triggers a locking dilemma during network parameter updates, leading to significant GPU memory use. Supervised local learning, which segments the network into multiple local blocks updated by independent auxiliary networks. However, these methods cannot replace e… ▽ More

    Submitted 12 August, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024(Oral)

  27. arXiv:2407.05342  [pdf, other

    cs.CV

    Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

    Authors: Longxiang Tang, Zhuotao Tian, Kai Li, Chunming He, Hantao Zhou, Hengshuang Zhao, Xiu Li, Jiaya Jia

    Abstract: This study addresses the Domain-Class Incremental Learning problem, a realistic but challenging continual learning scenario where both the domain distribution and target classes vary across tasks. To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability. However, this incurs a new problem: the knowledge encoded in the pre-trained VLM… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  28. arXiv:2407.03320  [pdf, other

    cs.CV cs.CL

    InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    Authors: Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao , et al. (2 additional authors not shown)

    Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. Th… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Technical Report. https://github.com/InternLM/InternLM-XComposer

  29. arXiv:2407.03104  [pdf, other

    cs.CV cs.CL cs.MM

    KeyVideoLLM: Towards Large-scale Video Keyframe Selection

    Authors: Hao Liang, Jiapeng Li, Tianyi Bai, Xijie Huang, Linzhuang Sun, Zhengren Wang, Conghui He, Bin Cui, Chong Chen, Wentao Zhang

    Abstract: Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particular… ▽ More

    Submitted 10 August, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  30. arXiv:2406.18536  [pdf, other

    eess.SY cs.AI cs.AR

    Reliable Interval Prediction of Minimum Operating Voltage Based on On-chip Monitors via Conformalized Quantile Regression

    Authors: Yuxuan Yin, Xiaoxiao Wang, Rebecca Chen, Chen He, Peng Li

    Abstract: Predicting the minimum operating voltage ($V_{min}$) of chips is one of the important techniques for improving the manufacturing testing flow, as well as ensuring the long-term reliability and safety of in-field systems. Current $V_{min}$ prediction methods often provide only point estimates, necessitating additional techniques for constructing prediction confidence intervals to cover uncertaintie… ▽ More

    Submitted 3 May, 2024; originally announced June 2024.

    Comments: Accepted by DATE 2024. Camera-ready version

  31. arXiv:2406.16554  [pdf, other

    cs.CL

    LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

    Authors: Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, Yu Cheng

    Abstract: Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B mod… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

  32. Interpretable modulated differentiable STFT and physics-informed balanced spectrum metric for freight train wheelset bearing cross-machine transfer fault diagnosis under speed fluctuations

    Authors: Chao He, Hongmei Shi, Ruixin Li, Jianbo Li, ZuJun Yu

    Abstract: The service conditions of wheelset bearings has a direct impact on the safe operation of railway heavy haul freight trains as the key components. However, speed fluctuation of the trains and few fault samples are the two main problems that restrict the accuracy of bearing fault diagnosis. Therefore, a cross-machine transfer diagnosis (pyDSN) network coupled with interpretable modulated differentia… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

    Journal ref: Advanced Engineering Informatics, 2024

  33. arXiv:2406.11633  [pdf, other

    cs.CV

    DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

    Authors: Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, Shiyang Feng, Bin Wang, Chao Xu, Conghui He, Pinlong Cai, Min Dou, Botian Shi, Sheng Zhou, Yongwei Wang, Bin Wang, Junchi Yan, Fei Wu, Yu Qiao

    Abstract: Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extract… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Homepage of DocGenome: https://unimodal4reasoning.github.io/DocGenome_page 22 pages, 11 figures

  34. arXiv:2406.11138  [pdf, other

    cs.CV cs.AI

    Diffusion Models in Low-Level Vision: A Survey

    Authors: Chunming He, Yuqi Shen, Chengyu Fang, Fengyang Xiao, Longxiang Tang, Yulun Zhang, Wangmeng Zuo, Zhenhua Guo, Xiu Li

    Abstract: Deep generative models have garnered significant attention in low-level vision tasks due to their generative capabilities. Among them, diffusion model-based solutions, characterized by a forward diffusion process and a reverse denoising process, have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. This ensures the generation of visually compellin… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 20 pages, 23 figures, 4 tables

  35. arXiv:2406.10847  [pdf, other

    cs.AI cs.CE cs.CL cs.MA

    TorchOpera: A Compound AI System for LLM Safety

    Authors: Shanshan Han, Yuhang Yao, Zijian Hu, Dimitris Stripelis, Zhaozhuo Xu, Chaoyang He

    Abstract: We introduce TorchOpera, a compound AI system for enhancing the safety and quality of prompts and responses for Large Language Models. TorchOpera ensures that all user prompts are safe, contextually grounded, and effectively processed, while enhancing LLM responses to be relevant and high quality. TorchOpera utilizes the vector database for contextual grounding, rule-based wrappers for flexible mo… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

  36. arXiv:2406.10700  [pdf, other

    cs.CV cs.RO

    Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection

    Authors: Guowen Zhang, Lue Fan, Chenhang He, Zhen Lei, Zhaoxiang Zhang, Lei Zhang

    Abstract: Serialization-based methods, which serialize the 3D voxels and group them into multiple sequences before inputting to Transformers, have demonstrated their effectiveness in 3D object detection. However, serializing 3D voxels into 1D sequences will inevitably sacrifice the voxel spatial proximity. Such an issue is hard to be addressed by enlarging the group size with existing serialization-based me… ▽ More

    Submitted 18 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

    Comments: 10 pages, 4 figures

  37. arXiv:2406.08418  [pdf, other

    cs.CV cs.AI

    OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang , et al. (15 additional authors not shown)

    Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an… ▽ More

    Submitted 12 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  38. arXiv:2406.07966  [pdf, other

    cs.CV

    Real-world Image Dehazing with Coherence-based Label Generator and Cooperative Unfolding Network

    Authors: Chengyu Fang, Chunming He, Fengyang Xiao, Yulun Zhang, Longxiang Tang, Yuelin Zhang, Kai Li, Xiu Li

    Abstract: Real-world Image Dehazing (RID) aims to alleviate haze-induced degradation in real-world settings. This task remains challenging due to the complexities in accurately modeling real haze distributions and the scarcity of paired real-world data. To address these challenges, we first introduce a cooperative unfolding network that jointly models atmospheric scattering and image scenes, effectively int… ▽ More

    Submitted 22 August, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: 10 pages, 7 figures, 6 tables

  39. arXiv:2406.00446  [pdf, other

    cs.CV cs.AI

    GLCAN: Global-Local Collaborative Auxiliary Network for Local Learning

    Authors: Feiyu Zhu, Yuming Zhang, Changpeng Cai, Guinan Guo, Jiao Li, Xiuyuan Guo, Quanwei Zhang, Peizhe Wang, Chenghao He, Junhao Su

    Abstract: Traditional deep neural networks typically use end-to-end backpropagation, which often places a big burden on GPU memory. Another promising training method is local learning, which involves splitting the network into blocks and training them in parallel with the help of an auxiliary network. Local learning has been widely studied and applied to image classification tasks, and its performance is co… ▽ More

    Submitted 1 June, 2024; originally announced June 2024.

  40. arXiv:2405.19669  [pdf, other

    cs.CV

    Texture-guided Coding for Deep Features

    Authors: Lei Xiong, Xin Luo, Zihao Wang, Chaofan He, Shuyuan Zhu, Bing Zeng

    Abstract: With the rapid development of machine vision technology in recent years, many researchers have begun to focus on feature compression that is better suited for machine vision tasks. The target of feature compression is deep features, which arise from convolution in the middle layer of a pre-trained convolutional neural network. However, due to the large volume of data and high level of abstraction… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

  41. arXiv:2405.18315  [pdf, other

    cs.AI cs.PL

    DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data

    Authors: Bin Wang, Linke Ouyang, Fan Wu, Wenchang Ning, Xiao Han, Zhiyuan Zhao, Jiahui Peng, Yiying Jiang, Dahua Lin, Conghui He

    Abstract: In the era of artificial intelligence, the diversity of data modalities and annotation formats often renders data unusable directly, requiring understanding and format conversion before it can be used by researchers or developers with different needs. To tackle this problem, this article introduces a framework called Dataset Description Language (DSDL) that aims to simplify dataset processing by p… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

  42. arXiv:2405.16640  [pdf, other

    cs.AI cs.CL cs.CV cs.MM

    A Survey of Multimodal Large Language Model from A Data-centric Perspective

    Authors: Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, Wentao Zhang

    Abstract: Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities, including text, vision, audio, video, and 3D environments. Data plays a pivotal role in the development and refinement of these models. In this survey, we comprehensively review the literature on MLLMs from a data-centric perspective. Spec… ▽ More

    Submitted 18 July, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

  43. arXiv:2405.07524  [pdf, other

    cs.CV

    HybridHash: Hybrid Convolutional and Self-Attention Deep Hashing for Image Retrieval

    Authors: Chao He, Hongxi Wei

    Abstract: Deep image hashing aims to map input images into simple binary hash codes via deep neural networks and thus enable effective large-scale image retrieval. Recently, hybrid networks that combine convolution and Transformer have achieved superior performance on various computer tasks and have attracted extensive attention from researchers. Nevertheless, the potential benefits of such hybrid networks… ▽ More

    Submitted 14 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: Accepted by ICMR 2024

  44. arXiv:2405.07257   

    cs.CV

    Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

    Authors: Changpeng Cai, Guinan Guo, Jiao Li, Junhao Su, Chenghao He, Jing Xiao, Yuanxu Chen, Lei Dai, Feiyu Zhu

    Abstract: Most earlier investigations on talking face generation have focused on the synchronization of lip motion and speech content. However, human head pose and facial emotions are equally important characteristics of natural human faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and ca… ▽ More

    Submitted 27 August, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

    Comments: Due to our negligence, there are factual errors in the experimental results, so we are considering resubmitting the paper after an overhaul

    ACM Class: I.4.5; I.4.9

  45. arXiv:2405.06659  [pdf, other

    q-bio.BM cs.AI cs.LG physics.chem-ph

    ControlMol: Adding Substruture Control To Molecule Diffusion Models

    Authors: Qi Zhengyang, Liu Zijing, Zhang Jiying, Cao He, Li Yu

    Abstract: Designing new molecules is an important task in the field of pharmaceuticals. Due to the vast design space of molecules, generating molecules conditioned on a specific sub-structure relevant to a particular function or therapeutic target is a crucial task in computer-aided drug design. In this paper, we present ControlMol, which adds sub-structure control to molecule generation with diffusion mode… ▽ More

    Submitted 22 April, 2024; originally announced May 2024.

    Comments: 9 pages,7 figures

  46. arXiv:2404.19738  [pdf, other

    cs.HC

    DiaryHelper: Exploring the Use of an Automatic Contextual Information Recording Agent for Elicitation Diary Study

    Authors: Junze Li, Changyang He, Jiaxiong Hu, Boyang Jia, Alon Halevy, Xiaojuan Ma

    Abstract: Elicitation diary studies, a type of qualitative, longitudinal research method, involve participants to self-report aspects of events of interest at their occurrences as memory cues for providing details and insights during post-study interviews. However, due to time constraints and lack of motivation, participants' diary entries may be vague or incomplete, impairing their later recall. To address… ▽ More

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: CHI 2024

  47. arXiv:2404.18359  [pdf, other

    cs.CL cs.AI

    FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

    Authors: Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, Jinyang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He

    Abstract: In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choi… ▽ More

    Submitted 28 April, 2024; originally announced April 2024.

  48. arXiv:2404.16821  [pdf, other

    cs.CV

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai , et al. (10 additional authors not shown)

    Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual… ▽ More

    Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Technical report

  49. arXiv:2404.16205  [pdf, other

    cs.CV cs.MM

    AIS 2024 Challenge on Video Quality Assessment of User-Generated Content: Methods and Results

    Authors: Marcos V. Conde, Saman Zadtootaghaj, Nabajeet Barman, Radu Timofte, Chenlong He, Qi Zheng, Ruoxi Zhu, Zhengzhong Tu, Haiqiang Wang, Xiangguang Chen, Wenhui Meng, Xiang Pan, Huiying Shi, Han Zhu, Xiaozhong Xu, Lei Sun, Zhenzhong Chen, Shan Liu, Zicheng Zhang, Haoning Wu, Yingjie Zhou, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai , et al. (11 additional authors not shown)

    Abstract: This paper reviews the AIS 2024 Video Quality Assessment (VQA) Challenge, focused on User-Generated Content (UGC). The aim of this challenge is to gather deep learning-based methods capable of estimating the perceptual quality of UGC videos. The user-generated videos from the YouTube UGC Dataset include diverse content (sports, games, lyrics, anime, etc.), quality and resolutions. The proposed met… ▽ More

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Workshop -- AI for Streaming (AIS) Video Quality Assessment Challenge

  50. arXiv:2404.15341  [pdf, other

    eess.SP cs.LG

    Classifier-guided neural blind deconvolution: a physics-informed denoising module for bearing fault diagnosis under heavy noise

    Authors: Jing-Xiao Liao, Chao He, Jipu Li, Jinwei Sun, Shiping Zhang, Xiaoge Zhang

    Abstract: Blind deconvolution (BD) has been demonstrated as an efficacious approach for extracting bearing fault-specific features from vibration signals under strong background noise. Despite BD's desirable feature in adaptability and mathematical interpretability, a significant challenge persists: How to effectively integrate BD with fault-diagnosing classifiers? This issue arises because the traditional… ▽ More

    Submitted 10 April, 2024; originally announced April 2024.