Showing 1–50 of 1,399 results for author: Yu, J

Searching in archive cs. Search in all archives.
  1. arXiv:2409.03512  [pdf, other]

    cs.CY cs.CL

    From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents

    Authors: Jifan Yu, Zheyuan Zhang, Daniel Zhang-li, Shangqing Tu, Zhanxin Hao, Rui Miao Li, Haoxuan Li, Yuanchun Wang, Hanming Li, Linlu Gong, Jie Cao, Jiayin Lin, Jinchang Zhou, Fei Qin, Haohua Wang, Jianxiao Jiang, Lijun Deng, Yisi Zhan, Chaojun Xiao, Xusheng Dai, Xuan Yan, Nianyi Lin, Nan Zhang, Ruixin Ni, Yang Dang, et al. (8 additional authors not shown)

    Abstract: Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integ…

    Submitted 5 September, 2024; originally announced September 2024.

  2. arXiv:2409.01557  [pdf, other]

    cs.CV

    TASL-Net: Tri-Attention Selective Learning Network for Intelligent Diagnosis of Bimodal Ultrasound Video

    Authors: Chengqian Zhao, Zhao Yao, Zhaoyu Hu, Yuanxin Xie, Yafang Zhang, Yuanyuan Wang, Shuo Li, Jianhua Zhou, Jianqiao Zhou, Yin Wang, Jinhua Yu

    Abstract: In the intelligent diagnosis of bimodal (gray-scale and contrast-enhanced) ultrasound videos, medical domain knowledge such as the way sonographers browse videos, the particular areas they emphasize, and the features they pay special attention to, plays a decisive role in facilitating precise diagnosis. Embedding medical knowledge into the deep learning network can not only enhance performance but…

    Submitted 2 September, 2024; originally announced September 2024.

  3. arXiv:2409.01345  [pdf, ps, other]

    cs.CL cs.AI

    Language Models Benefit from Preparation with Elicited Knowledge

    Authors: Jiacan Yu, Hannah An, Lenhart K. Schubert

    Abstract: The zero-shot chain of thought (CoT) approach is often used in question answering (QA) by language models (LMs) for tasks that require multiple reasoning steps, typically enhanced by the prompt "Let's think step by step." However, some QA tasks hinge more on accessing relevant knowledge than on chaining reasoning steps. We introduce a simple general prompting technique, called PREP, that involves…

    Submitted 2 September, 2024; originally announced September 2024.

  4. arXiv:2409.01235  [pdf, other]

    q-bio.QM cs.LG

    MRI-based and metabolomics-based age scores act synergetically for mortality prediction shown by multi-cohort federated learning

    Authors: Pedro Mateus, Swier Garst, Jing Yu, Davy Cats, Alexander G. J. Harms, Mahlet Birhanu, Marian Beekman, P. Eline Slagboom, Marcel Reinders, Jeroen van der Grond, Andre Dekker, Jacobus F. A. Jansen, Magdalena Beran, Miranda T. Schram, Pieter Jelle Visser, Justine Moonen, Mohsen Ghanbari, Gennady Roshchupkin, Dina Vojinovic, Inigo Bermejo, Hailiang Mei, Esther E. Bron

    Abstract: Biological age scores are an emerging tool to characterize aging by estimating chronological age based on physiological biomarkers. Various scores have shown associations with aging-related outcomes. This study assessed the relation between an age score based on brain MRI images (BrainAge) and an age score based on metabolomic biomarkers (MetaboAge). We trained a federated deep learning model to e…

    Submitted 2 September, 2024; originally announced September 2024.

    ACM Class: I.2.1

  5. arXiv:2409.01081  [pdf, other]

    cs.LG cs.AI q-bio.BM

    Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

    Authors: Dingshuo Chen, Zhixun Li, Yuyan Ni, Guibin Zhang, Ding Wang, Qiang Liu, Shu Wu, Jeffrey Xu Yu, Liang Wang

    Abstract: With the emergence of various molecular tasks and massive datasets, how to perform efficient training has become an urgent yet under-explored issue in the area. Data pruning (DP), as an oft-stated approach to saving training burdens, filters out less influential samples to form a coreset for training. However, the increasing reliance on pretrained models for molecular tasks renders traditional in-…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: 20 pages, under review

  6. arXiv:2409.01073  [pdf, other]

    cs.CV cs.AI cs.CL

    SCOPE: Sign Language Contextual Processing with Embedding from LLMs

    Authors: Yuqi Liu, Wenqian Zhang, Sihan Ren, Chengyu Huang, Jingyi Yu, Lan Xu

    Abstract: Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information. Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information. To address these challenges, we introduce SCOPE (Sign langua…

    Submitted 2 September, 2024; originally announced September 2024.

  7. arXiv:2409.00988  [pdf, other]

    cs.CV

    Self-Supervised Multi-Scale Network for Blind Image Deblurring via Alternating Optimization

    Authors: Lening Guo, Jing Yu, Ning Zhang, Chuangbai Xiao

    Abstract: Blind image deblurring is a challenging low-level vision task that involves estimating the unblurred image when the blur kernel is unknown. In this paper, we present a self-supervised multi-scale blind image deblurring method to jointly estimate the latent image and the blur kernel via alternating optimization. In the image estimation step, we construct a multi-scale generator network with multipl…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: 21 pages, 17 figures, 94 references

  8. arXiv:2409.00985  [pdf, other]

    cs.SE cs.AI cs.CL

    Co-Learning: Code Learning for Multi-Agent Reinforcement Collaborative Framework with Conversational Natural Language Interfaces

    Authors: Jiapeng Yu, Yuqian Wu, Yajing Zhan, Wenhao Guo, Zhou Xu, Raymond Lee

    Abstract: Online question-and-answer (Q&A) systems based on the Large Language Model (LLM) have progressively diverged from recreational to professional use. This paper proposed a Multi-Agent framework with environmentally reinforcement learning (E-RL) for code correction called Code Learning (Co-Learning) community, assisting beginners to correct code errors independently. It evaluates the performance of…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: 12 pages, 8 figures

  9. arXiv:2409.00800  [pdf, other]

    cs.CL

    Comparing Discrete and Continuous Space LLMs for Speech Recognition

    Authors: Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu

    Abstract: This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. We further classify LLMs based on their input and autoregressive feedback into continuous and discrete-spac…

    Submitted 1 September, 2024; originally announced September 2024.

    Comments: InterSpeech 2024

  10. arXiv:2409.00438  [pdf, other]

    cs.LG cs.AI

    Breaking Down Financial News Impact: A Novel AI Approach with Geometric Hypergraphs

    Authors: Anoushka Harit, Zhongtian Sun, Jongmin Yu, Noura Al Moubayed

    Abstract: In the fast-paced and volatile financial markets, accurately predicting stock movements based on financial news is critical for investors and analysts. Traditional models often struggle to capture the intricate and dynamic relationships between news events and market reactions, limiting their ability to provide actionable insights. This paper introduces a novel approach leveraging Explainable Arti…

    Submitted 31 August, 2024; originally announced September 2024.

    Comments: 16 pages, conference

  11. arXiv:2408.17267  [pdf, other]

    cs.CV cs.AI

    UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

    Authors: Baichuan Zhou, Haote Yang, Dairong Chen, Junyan Ye, Tianyi Bai, Jinhua Yu, Songyang Zhang, Dahua Lin, Conghui He, Weijia Li

    Abstract: Recent evaluations of Large Multimodal Models (LMMs) have explored their capabilities in various domains, with only a few benchmarks specifically focusing on urban environments. Moreover, existing urban benchmarks have been limited to evaluating LMMs with basic region-level urban tasks under singular views, leading to incomplete evaluations of LMMs' abilities in urban environments. To address these…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: 9 pages, 6 figures

  12. arXiv:2408.16235  [pdf, other]

    cs.CV

    LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement

    Authors: Ye Yu, Fengxin Chen, Jun Yu, Zhen Kan

    Abstract: While recent low-light image enhancement (LLIE) methods have made significant advancements, they still face challenges in terms of low visual quality and weak generalization ability when applied to complex scenarios. To address these issues, we propose a semi-supervised method based on latent mean-teacher and Gaussian process, named LMT-GP. We first design a latent mean-teacher framework that inte…

    Submitted 28 August, 2024; originally announced August 2024.

  13. arXiv:2408.15538  [pdf, other]

    cs.AI cs.MA

    TrafficGamer: Reliable and Flexible Traffic Simulation for Safety-Critical Scenarios with Game-Theoretic Oracles

    Authors: Guanren Qiao, Guorui Quan, Jiawei Yu, Shujun Jia, Guiliang Liu

    Abstract: While modern Autonomous Vehicle (AV) systems can develop reliable driving policies under regular traffic conditions, they frequently struggle with safety-critical traffic scenarios. This difficulty primarily arises from the rarity of such scenarios in driving datasets and the complexities associated with predictive modeling among multiple vehicles. To support the testing and refinement of AV polic…

    Submitted 28 August, 2024; originally announced August 2024.

  14. arXiv:2408.15270  [pdf, other]

    cs.CV cs.GR cs.LG cs.RO

    SkillMimic: Learning Reusable Basketball Skills from Demonstrations

    Authors: Yinhuai Wang, Qihan Zhao, Runyi Yu, Ailing Zeng, Jing Lin, Zhengyi Luo, Hok Wai Tsui, Jiwen Yu, Xiu Li, Qifeng Chen, Jian Zhang, Lei Zhang, Ping Tan

    Abstract: Mastering basketball skills such as diverse layups and dribbling involves complex interactions with the ball and requires real-time adjustments. Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose SkillMimic, a data-drive…

    Submitted 12 August, 2024; originally announced August 2024.

  15. arXiv:2408.14972  [pdf, other]

    cs.CL

    AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems

    Authors: Chi-Min Chan, Jianxuan Yu, Weize Chen, Chunyang Jiang, Xinyu Liu, Weijie Shi, Zhiyuan Liu, Wei Xue, Yike Guo

    Abstract: The rapid advancement of large language models (LLMs) has led to the rise of LLM-based agents. Recent research shows that multi-agent systems (MAS), where each agent plays a specific role, can outperform individual LLMs. However, configuring an MAS for a task remains challenging, with performance only observable post-execution. Inspired by scaling laws in LLM development, we investigate whether MA…

    Submitted 27 August, 2024; originally announced August 2024.

  16. arXiv:2408.14018  [pdf, ps, other]

    cs.DS

    Quantum Speedups for Approximating the John Ellipsoid

    Authors: Xiaoyu Li, Zhao Song, Junwei Yu

    Abstract: In 1948, Fritz John proposed a theorem stating that every convex body has a unique maximal volume inscribed ellipsoid, known as the John ellipsoid. The John ellipsoid has become fundamental in mathematics, with extensive applications in high-dimensional sampling, linear programming, and machine learning. Designing faster algorithms to compute the John ellipsoid is therefore an important and emergi…

    Submitted 26 August, 2024; originally announced August 2024.

  17. arXiv:2408.11727  [pdf, other]

    cs.CR cs.AI cs.CL cs.SE

    Efficient Detection of Toxic Prompts in Large Language Models

    Authors: Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu

    Abstract: Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechani…

    Submitted 21 August, 2024; originally announced August 2024.

    Comments: Accepted by the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

  18. T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval

    Authors: Yili Li, Jing Yu, Keke Gai, Bang Liu, Gang Xiong, Qi Wu

    Abstract: Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This method considers the matching between each candidate video and the query, but it incurs a significant time cost and will increase notably with the increase of candidates. Generative models are common in nat…

    Submitted 21 August, 2024; originally announced August 2024.

  19. arXiv:2408.10653  [pdf, other]

    cs.CV

    UIE-UnFold: Deep Unfolding Network with Color Priors and Vision Transformer for Underwater Image Enhancement

    Authors: Yingtie Lei, Jia Yu, Yihang Dong, Changwei Gong, Ziyang Zhou, Chi-Man Pun

    Abstract: Underwater image enhancement (UIE) plays a crucial role in various marine applications, but it remains challenging due to the complex underwater environment. Current learning-based approaches frequently lack explicit incorporation of prior knowledge about the physical processes involved in underwater image formation, resulting in limited optimization despite their impressive enhancement results. T…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Accepted by DSAA CIVIL 2024

  20. arXiv:2408.09667  [pdf, other]

    cs.CL

    BLADE: Benchmarking Language Model Agents for Data-Driven Science

    Authors: Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A. Merrill, Jeffrey Heer, Tim Althoff

    Abstract: Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-dri…

    Submitted 20 August, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

  21. arXiv:2408.09460  [pdf, other]

    cs.CV

    Fine-Grained Building Function Recognition from Street-View Images via Geometry-Aware Semi-Supervised Learning

    Authors: Weijia Li, Jinhua Yu, Dairong Chen, Yi Lin, Runmin Dong, Xiang Zhang, Conghui He, Haohuan Fu

    Abstract: In this work, we propose a geometry-aware semi-supervised method for fine-grained building function recognition. This method leverages the geometric relationships between multi-source data to improve the accuracy of pseudo labels in semi-supervised learning, extending the task's scope and making it applicable to cross-categorization systems of building function recognition. Firstly, we design an o…

    Submitted 27 August, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

    Comments: This paper is currently under review

  22. arXiv:2408.08474  [pdf, other]

    hep-ex astro-ph.IM cs.LG

    Enhancing Events in Neutrino Telescopes through Deep Learning-Driven Super-Resolution

    Authors: Felix J. Yu, Nicholas Kamp, Carlos A. Argüelles

    Abstract: Recent discoveries by neutrino telescopes, such as the IceCube Neutrino Observatory, relied extensively on machine learning (ML) tools to infer physical quantities from the raw photon hits detected. Neutrino telescope reconstruction algorithms are limited by the sparse sampling of photons by the optical modules due to the relatively large spacing ($10-100\,{\rm m}$) between them. In this letter, w…

    Submitted 15 August, 2024; originally announced August 2024.

    Comments: 5+1 pages, 4+1 figures

  23. arXiv:2408.07989  [pdf, other]

    cs.CV cs.AI

    IIU: Independent Inference Units for Knowledge-based Visual Question Answering

    Authors: Yili Li, Jing Yu, Keke Gai, Gang Xiong

    Abstract: Knowledge-based visual question answering requires external knowledge beyond visible content to answer the question correctly. One limitation of existing methods is that they focus more on modeling the inter-modal and intra-modal correlations, which entangles complex multimodal clues by implicit embeddings and lacks interpretability and generalization ability. The key challenge to solve the above…

    Submitted 15 August, 2024; originally announced August 2024.

  24. arXiv:2408.07975  [pdf, other]

    cs.RO cs.CL cs.CV

    Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models

    Authors: Tianyu Wang, Haitao Lin, Junqiu Yu, Yanwei Fu

    Abstract: This paper investigates the task of the open-ended interactive robotic manipulation on table-top scenarios. While recent Large Language Models (LLMs) enhance robots' comprehension of user instructions, their lack of visual grounding constrains their ability to physically interact with the environment. This is because the robot needs to locate the target object for manipulation within the physical…

    Submitted 15 August, 2024; originally announced August 2024.

    Comments: Accepted by IROS 2024. 8 pages, 5 figures. See https://star-uu-wang.github.io/Polaris/

  25. GeoFormer: Learning Point Cloud Completion with Tri-Plane Integrated Transformer

    Authors: Jinpeng Yu, Binbin Huang, Yuxuan Zhang, Huaxia Li, Xu Tang, Shenghua Gao

    Abstract: Point cloud completion aims to recover accurate global geometry and preserve fine-grained local details from partial point clouds. Conventional methods typically predict unseen points directly from 3D point cloud coordinates or use self-projected multi-view depth maps to ease this task. However, these gray-scale depth maps cannot reach multi-view consistency, consequently restricting the performan…

    Submitted 12 August, 2024; originally announced August 2024.

    Comments: accepted by the 32nd ACM International Conference on Multimedia (MM'24)

  26. arXiv:2408.06395  [pdf, ps, other]

    cs.DS cs.CR cs.LG

    Fast John Ellipsoid Computation with Differential Privacy Optimization

    Authors: Jiuxiang Gu, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Junwei Yu

    Abstract: Determining the John ellipsoid - the largest volume ellipsoid contained within a convex polytope - is a fundamental problem with applications in machine learning, optimization, and data analytics. Recent work has developed fast algorithms for approximating the John ellipsoid using sketching and leverage score sampling techniques. However, these algorithms do not provide privacy guarantees for sens…

    Submitted 11 August, 2024; originally announced August 2024.

  27. arXiv:2408.05475  [pdf, other]

    cs.CV

    Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

    Authors: Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, Conghui He

    Abstract: Cross-view geolocalization identifies the geographic location of street view images by matching them with a georeferenced satellite database. Significant challenges arise due to the drastic appearance and geometry differences between views. In this paper, we propose a new approach for cross-view image geo-localization, i.e., the Panorama-BEV Co-Retrieval Network. Specifically, by utilizing the gro…

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV 2024

  28. arXiv:2408.05385  [pdf, other]

    cs.RO

    Expected $1.x$-Makespan-Optimal MAPF on Grids in Low-Poly Time

    Authors: Teng Guo, Jingjin Yu

    Abstract: Multi-Agent Path Finding (MAPF) is NP-hard to solve optimally, even on graphs, suggesting no polynomial-time algorithms can compute exact optimal solutions for them. This raises a natural question: How optimal can polynomial-time algorithms reach? Whereas algorithms for computing constant-factor optimal solutions have been developed, the constant factor is generally very large, limiting their appl…

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2201.08976

  29. arXiv:2408.04593  [pdf, other]

    cs.CV cs.RO eess.IV

    SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation

    Authors: Jieming Yu, An Wang, Wenzhen Dong, Mengya Xu, Mobarakol Islam, Jie Wang, Long Bai, Hongliang Ren

    Abstract: The recent Segment Anything Model (SAM) 2 has demonstrated remarkable foundational competence in semantic segmentation, with its memory mechanism and mask decoder further addressing challenges in video tracking and object occlusion, thereby achieving superior results in interactive segmentation for both images and videos. Building upon our previous empirical studies, we further explore the zero-sh…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Empirical study. Previous work "SAM Meets Robotic Surgery" is accessible at: arXiv:2308.07156

  30. arXiv:2408.04547  [pdf, other]

    cs.MM

    Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation

    Authors: Haoxiang Shi, Ziqi Liang, Jun Yu

    Abstract: Emotion Prediction in Conversation (EPC) aims to forecast the emotions of forthcoming utterances by utilizing preceding dialogues. Previous EPC approaches relied on simple context modeling for emotion extraction, overlooking fine-grained emotion cues at the word level. Additionally, prior works failed to account for the intrinsic differences between modalities, resulting in redundant information.…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Accepted by INTERSPEECH 2024

  31. arXiv:2408.04249  [pdf, other]

    cs.CV

    InstantStyleGaussian: Efficient Art Style Transfer with 3D Gaussian Splatting

    Authors: Xin-Yi Yu, Jun-Xin Yu, Li-Bo Zhou, Yan Wei, Lin-Lin Ou

    Abstract: We present InstantStyleGaussian, an innovative 3D style transfer method based on the 3D Gaussian Splatting (3DGS) scene representation. By inputting a target-style image, it quickly generates new 3D GS scenes. Our method operates on pre-reconstructed GS scenes, combining diffusion models with an improved iterative dataset update strategy. It utilizes diffusion models to generate target style image…

    Submitted 26 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

  32. arXiv:2408.04223  [pdf, other]

    cs.CV cs.AI

    VideoQA in the Era of LLMs: An Empirical Study

    Authors: Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, Angela Yao

    Abstract: Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA, aiming to elucidate their success and failure modes, and provide insights towards more human-like video underst…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Preprint. Under Review

  33. arXiv:2408.04102  [pdf, other]

    cs.CV cs.AI

    ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

    Authors: William Y. Zhu, Keren Ye, Junjie Ke, Jiahui Yu, Leonidas Guibas, Peyman Milanfar, Feng Yang

    Abstract: Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively-learned vision-language representation cannot effectively capture object-attribu…

    Submitted 7 August, 2024; originally announced August 2024.

    Comments: Accepted at ECCV 2024

  34. arXiv:2408.03195  [pdf, other]

    cs.LG

    RELIEF: Reinforcement Learning Empowered Graph Feature Prompt Tuning

    Authors: Jiapeng Zhu, Zichen Ding, Jianxiang Yu, Jiaqi Tan, Xiang Li, Weining Qian

    Abstract: The advent of the "pre-train, prompt" paradigm has recently extended its generalization ability and data efficiency to graph representation learning, following its achievements in Natural Language Processing (NLP). Initial graph prompt tuning approaches tailored specialized prompting functions for Graph Neural Network (GNN) models pre-trained with specific strategies, such as edge prediction, thus…

    Submitted 6 August, 2024; originally announced August 2024.

  35. arXiv:2408.02679  [pdf, other]

    cs.LG cs.GR cs.HC stat.ME

    Visual Analysis of Multi-outcome Causal Graphs

    Authors: Mengjie Fan, Jinlu Yu, Daniel Weiskopf, Nan Cao, Huai-Yu Wang, Liang Zhou

    Abstract: We introduce a visual analysis method for multiple causal graphs with different outcome variables, namely, multi-outcome causal graphs. Multi-outcome causal graphs are important in healthcare for understanding multimorbidity and comorbidity. To support the visual analysis, we collaborated with medical experts to devise two comparative visualization techniques at different stages of the analysis pr…

    Submitted 25 August, 2024; v1 submitted 31 July, 2024; originally announced August 2024.

  36. arXiv:2408.02268  [pdf, other]

    cs.HC

    CHORDination: Evaluating Visual Design Choices in Chord Diagrams for Network Data

    Authors: Kai Wang, Shuqi He, Wenlu Wang, Jinbei Yu, Yu Liu, Lingyun Yu

    Abstract: Chord diagrams are widely used for visualizing data connectivity and flow between nodes in a network. They are effective for representing complex structures through an intuitive and visually appealing circular layout. While previous work has focused on improving aesthetics and interactivity, the influence of fundamental design elements on user perception and information retrieval remains under-exp…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: 12 pages, 4 pages of appendix, 8 figures, VINCI 2024

  37. arXiv:2408.02054  [pdf, other]

    cs.CV

    Step Saver: Predicting Minimum Denoising Steps for Diffusion Model Image Generation

    Authors: Jean Yu, Haim Barad

    Abstract: In this paper, we introduce an innovative NLP model specifically fine-tuned to determine the minimal number of denoising steps required for any given text prompt. This advanced model serves as a real-time tool that recommends the ideal denoise steps for generating high-quality images efficiently. It is designed to work seamlessly with the Diffusion model, ensuring that images are produced with sup…

    Submitted 4 August, 2024; originally announced August 2024.

  38. arXiv:2408.01812  [pdf, other]

    cs.CV

    SkyDiffusion: Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm

    Authors: Junyan Ye, Jun He, Weijia Li, Zhutao Lv, Jinhua Yu, Haote Yang, Conghui He

    Abstract: Street-to-satellite image synthesis focuses on generating realistic satellite images from corresponding ground street-view images while maintaining a consistent content layout, similar to looking down from the sky. The significant differences in perspectives create a substantial domain gap between the views, making this cross-view generation task particularly challenging. In this paper, we introdu…

    Submitted 17 August, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

    Comments: 9 pages, 6 figures

  39. arXiv:2408.01708  [pdf, other]

    cs.CV

    AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

    Authors: Zili Wang, Qi Yang, Linsu Shi, Jiazhong Yu, Qinghua Liang, Fei Li, Shiming Xiang

    Abstract: Recently, transformer-based models have demonstrated remarkable performance on audio-visual segmentation (AVS) tasks. However, their expensive computational cost makes real-time inference impractical. By characterizing attention maps of the network, we identify two key obstacles in AVS models: 1) attention dissipation, corresponding to the over-concentrated attention weights by Softmax within rest…

    Submitted 3 August, 2024; originally announced August 2024.

  40. arXiv:2408.01253  [pdf, other]

    cs.AI eess.SY q-bio.NC

    Metareasoning in uncertain environments: a meta-BAMDP framework

    Authors: Prakhar Godara, Tilman Diego Aléman, Angela J. Yu

    Abstract: In decision-making scenarios, \textit{reasoning} can be viewed as an algorithm $P$ that makes a choice of an action $a^* \in \mathcal{A}$, aiming to optimize some outcome such as maximizing the value function of a Markov decision process (MDP). However, executing $P$ itself may bear some costs (time, energy, limited capacity, etc.) and needs to be considered alongside explicit utility obtained by…

    Submitted 2 August, 2024; originally announced August 2024.

  41. arXiv:2408.01218  [pdf, other]

    cs.CV

    S2TD-Face: Reconstruct a Detailed 3D Face with Controllable Texture from a Single Sketch

    Authors: Zidu Wang, Xiangyu Zhu, Jiang Yu, Tianshuo Zhang, Zhen Lei

    Abstract: 3D textured face reconstruction from sketches, applicable in many scenarios such as animation, 3D avatars, artistic design, missing people search, etc., is a highly promising but underdeveloped research topic. On the one hand, the stylistic diversity of sketches leads to existing sketch-to-3D-face methods only being able to handle pose-limited and realistically shaded sketches. On the other hand, t…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: ACM MM 2024

  42. arXiv:2407.21783  [pdf, other]

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (510 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…

    Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  43. arXiv:2407.21490  [pdf, other]

    eess.IV cs.AI cs.CV cs.LG

    Explainable and Controllable Motion Curve Guided Cardiac Ultrasound Video Generation

    Authors: Junxuan Yu, Rusi Chen, Yongsong Zhou, Yanlin Chen, Yaofei Duan, Yuhao Huang, Han Zhou, Tan Tao, Xin Yang, Dong Ni

    Abstract: Echocardiography video is a primary modality for diagnosing heart diseases, but the limited data poses challenges for both clinical teaching and machine learning training. Recently, video generative models have emerged as a promising strategy to alleviate this issue. However, previous methods often relied on holistic conditions during generation, hindering the flexible movement control over specif…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted by MICCAI MLMI 2024

  44. arXiv:2407.21033  [pdf, other]

    cs.IR cs.AI cs.CL cs.CV

    Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition

    Authors: Jielong Tang, Zhenxing Wang, Ziyang Gong, Jianxing Yu, Xiangwei Zhu, Jian Yin

    Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task, aiming to simultaneously extract entity spans, types, and corresponding visual regions of entities from given sentence-image pairs data. Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task. The former, utili…

    Submitted 21 August, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: 13 pages, 7 figures

  45. arXiv:2407.20730  [pdf, other]

    cs.CV

    Autogenic Language Embedding for Coherent Point Tracking

    Authors: Zikai Song, Ying Tang, Run Luo, Lintao Ma, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang

    Abstract: Point tracking is a challenging task in computer vision, aiming to establish point-wise correspondence across long video sequences. Recent advancements have primarily focused on temporal modeling techniques to improve local feature similarity, often overlooking the valuable semantic consistency inherent in tracked points. In this paper, we introduce a novel approach leveraging language embeddings…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: accepted by ACM MM 2024

  46. arXiv:2407.20679  [pdf, other]

    cs.CE

    Online Prediction-Assisted Safe Reinforcement Learning for Electric Vehicle Charging Station Recommendation in Dynamically Coupled Transportation-Power Systems

    Authors: Qionghua Liao, Guilong Li, Jiajie Yu, Ziyuan Gu, Wei Ma

    Abstract: With the proliferation of electric vehicles (EVs), the transportation network and power grid become increasingly interdependent and coupled via charging stations. The concomitant growth in charging demand has posed challenges for both networks, highlighting the importance of charging coordination. Existing literature largely overlooks the interactions between power grid security and traffic effici…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: 33 pages, 31 figures

  47. StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset

    Authors: Chaofan Huo, Ye Shi, Yuexin Ma, Lan Xu, Jingyi Yu, Jingya Wang

    Abstract: Modeling and capturing the 3D spatial arrangement of the human and the object is the key to perceiving 3D human-object interaction from monocular images. In this work, we propose to use the Human-Object Offset between anchors which are densely sampled from the surface of human mesh and object mesh to represent human-object spatial relation. Compared with previous works which use contact map or imp…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: Accepted by IJCAI-23

  48. arXiv:2407.19941  [pdf, other]

    cs.LG

    Boosting Graph Foundation Model from Structural Perspective

    Authors: Yao Cheng, Yige Zhao, Jianxiang Yu, Xiang Li

    Abstract: Graph foundation models have recently attracted significant attention due to their strong generalizability. Although existing methods resort to language models to learn unified semantic representations across domains, they disregard the unique structural characteristics of graphs from different domains. To address the problem, in this paper, we boost graph foundation model from structural perspectiv…

    Submitted 29 July, 2024; originally announced July 2024.

  49. arXiv:2407.19468  [pdf, other]

    cs.CV cs.MM

    MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability

    Authors: Buyu Liu, Kai Wang, Yansong Liu, Jun Bao, Tingting Han, Jun Yu

    Abstract: This work aims to address multi-view perspective RGB generation from text prompts given Bird-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, lack the ability to handle detailed text prompts, or are incapable of generalizing to unseen view points, MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design, a…

    Submitted 28 July, 2024; originally announced July 2024.

    Comments: Accepted by ACM MM24

  50. arXiv:2407.18170  [pdf, other]

    cs.LG

    RIDA: A Robust Attack Framework on Incomplete Graphs

    Authors: Jianke Yu, Hanchen Wang, Chen Chen, Xiaoyang Wang, Wenjie Zhang, Ying Zhang

    Abstract: Graph Neural Networks (GNNs) are vital in data science but are increasingly susceptible to adversarial attacks. To help researchers develop more robust GNN models, it's essential to focus on designing strong attack models as foundational benchmarks and guiding references. Among adversarial attacks, gray-box poisoning attacks are noteworthy due to their effectiveness and fewer constraints. These at…

    Submitted 25 July, 2024; originally announced July 2024.