Showing 1–50 of 146 results for author: Yang, D

Searching in archive eess.
  1. arXiv:2409.17342  [pdf]

    physics.med-ph eess.IV

    Small metal artifact detection and inpainting in cardiac CT images

    Authors: Trevor McKeown, H. Michael Gach, Yao Hao, Hongyu An, Clifford G. Robinson, Phillip S. Cuculich, Deshan Yang

    Abstract: Background: Quantification of cardiac motion on pre-treatment CT imaging for stereotactic arrhythmia radiotherapy patients is difficult due to the presence of image artifacts caused by metal leads of implantable cardioverter-defibrillators (ICDs). New methods are needed to accurately reduce the metal artifacts in already reconstructed CTs to recover the otherwise lost anatomical information. Purpo…

    Submitted 25 September, 2024; originally announced September 2024.

  2. arXiv:2409.16661  [pdf, ps, other]

    eess.IV

    Morphological-consistent Diffusion Network for Ultrasound Coronal Image Enhancement

    Authors: Yihao Zhou, Zixun Huang, Timothy Tin-Yan Lee, Chonglin Wu, Kelly Ka-Lee Lai, De Yang, Alec Lik-hang Hung, Jack Chun-Yiu Cheng, Tsz-Ping Lam, Yong-ping Zheng

    Abstract: Ultrasound curve angle (UCA) measurement provides a radiation-free and reliable evaluation for scoliosis based on ultrasound imaging. However, degraded image quality, especially in difficult-to-image patients, can prevent clinical experts from making confident measurements, even leading to misdiagnosis. In this paper, we propose a multi-stage image enhancement framework that models high-quality im…

    Submitted 25 September, 2024; originally announced September 2024.

  3. arXiv:2409.14085  [pdf, other]

    eess.AS cs.SD

    Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

    Authors: Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kaiwei Chang, Jiawei Du, Ke-Han Lu, Alexander H. Liu, Ho-Lam Chung, Yuan-Kuei Wu, Dongchao Yang, Songxiang Liu, Yi-Chiao Wu, Xu Tan, James Glass, Shinji Watanabe, Hung-yi Lee

    Abstract: Neural audio codec models are becoming increasingly important as they serve as tokenizers for audio, enabling efficient transmission or facilitating speech language modeling. The ideal neural audio codec should maintain content, paralinguistics, speaker characteristics, and audio information even at low bitrates. Recently, numerous advanced neural codec models have been proposed. However, codec mo…

    Submitted 21 September, 2024; originally announced September 2024.

  4. arXiv:2409.12560  [pdf, other]

    eess.AS cs.SD

    AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

    Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Zhiyong Wu, Helen Meng, Xixin Wu

    Abstract: Current Text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style. Some studies try to improve the granularity by incorporating additional frame-level conditions or control networks. However, this usually leads to complex system design and difficulties due to the requirement…

    Submitted 19 September, 2024; originally announced September 2024.

  5. arXiv:2409.11630  [pdf, other]

    cs.SD eess.AS

    Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation

    Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Xixin Wu, Helen Meng

    Abstract: The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding…

    Submitted 17 September, 2024; originally announced September 2024.

  6. arXiv:2409.11169  [pdf, other]

    eess.IV cs.AI cs.CV

    MAISI: Medical AI for Synthetic Imaging

    Authors: Pengfei Guo, Can Zhao, Dong Yang, Ziyue Xu, Vishwesh Nath, Yucheng Tang, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu

    Abstract: Medical imaging analysis faces challenges such as data scarcity, high annotation costs, and privacy concerns. This paper introduces the Medical AI for Synthetic Imaging (MAISI), an innovative approach using the diffusion model to generate synthetic 3D computed tomography (CT) images to address those challenges. MAISI leverages the foundation volume compression network and the latent diffusion mode…

    Submitted 13 September, 2024; originally announced September 2024.

  7. arXiv:2409.08702  [pdf, other]

    eess.AS cs.AI

    DM: Dual-path Magnitude Network for General Speech Restoration

    Authors: Da-Hee Yang, Dail Kim, Joon-Hyuk Chang, Jeonghwan Choi, Han-gil Moon

    Abstract: In this paper, we introduce a novel general speech restoration model: the Dual-path Magnitude (DM) network, designed to address multiple distortions including noise, reverberation, and bandwidth degradation effectively. The DM network employs dual parallel magnitude decoders that share parameters: one uses a masking-based algorithm for distortion removal and the other employs a mapping-based appro…

    Submitted 13 September, 2024; originally announced September 2024.

  8. arXiv:2409.07020  [pdf, other]

    eess.IV cs.CV

    EVENet: Evidence-based Ensemble Learning for Uncertainty-aware Brain Parcellation Using Diffusion MRI

    Authors: Chenjun Li, Dian Yang, Shun Yao, Shuyue Wang, Ye Wu, Le Zhang, Qiannuo Li, Kang Ik Kevin Cho, Johanna Seitz-Holland, Lipeng Ning, Jon Haitz Legarreta, Yogesh Rathi, Carl-Fredrik Westin, Lauren J. O'Donnell, Nir A. Sochen, Ofer Pasternak, Fan Zhang

    Abstract: In this study, we developed an Evidence-based Ensemble Neural Network, namely EVENet, for anatomical brain parcellation using diffusion MRI. The key innovation of EVENet is the design of an evidential deep learning framework to quantify predictive uncertainty at each voxel during a single inference. Using EVENet, we obtained accurate parcellation and uncertainty estimates across different datasets…

    Submitted 11 September, 2024; originally announced September 2024.

    Comments: 15 pages, 5 figures

  9. arXiv:2409.01957  [pdf, ps, other]

    cs.IT eess.SP

    Power Control and Random Serving Mode Allocation for CJT-NCJT Hybrid Mode Enabled Cell-Free Massive MIMO With Limited Fronthauls

    Authors: Hangyu Zhang, Rui Zhang, Yongzhao Li, Yuhan Ruan, Tao Li, Dong Yang

    Abstract: With great potential to improve service fairness and quality for user equipments (UEs), cell-free massive multiple-input multiple-output (mMIMO) has been regarded as an emerging candidate for 6G network architectures. Under ideal assumptions, the coherent joint transmission (CJT) serving mode has been considered an optimal option for cell-free mMIMO systems, since it can achieve coheren…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: 6 pages, 2 figures, accepted by GLOBECOM 2024

  10. arXiv:2409.00933  [pdf, other]

    cs.SD eess.AS

    SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

    Authors: Haohan Guo, Fenglong Xie, Kun Xie, Dongchao Yang, Dake Guo, Xixin Wu, Helen Meng

    Abstract: Long speech sequences have been troubling language model (LM) based TTS approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a semantic-ordered multi-stream speech codec, to address this issue. It compresses speech into a shorter, multi-stream discrete semantic sequence with multiple tokens at each frame. Meanwhile, the ordered product quantization is proposed…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: Accepted by SLT 2024

  11. arXiv:2408.13893  [pdf, other]

    cs.SD cs.CL eess.AS

    SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

    Authors: Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng

    Abstract: Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (e.g., VALL-E) or Non-auto-regressive (NAR) based models (e.g., NaturalSpeech 2/3). Although these works dem…

    Submitted 28 August, 2024; v1 submitted 25 August, 2024; originally announced August 2024.

    Comments: Submitted to TASLP

  12. arXiv:2408.13782  [pdf]

    eess.IV cs.CV physics.optics

    Batch-FPM: Random batch-update multi-parameter physical Fourier ptychography neural network

    Authors: Ruiqing Sun, Delong Yang, Yiyan Su, Shaohui Zhang, Qun Hao

    Abstract: Fourier Ptychographic Microscopy (FPM) is a computational imaging technique that enables high-resolution imaging over a large field of view. However, its application in the biomedical field has been limited due to the long image reconstruction time and poor noise robustness. In this paper, we propose a fast and robust FPM reconstruction method based on physical neural networks with batch update st…

    Submitted 25 August, 2024; originally announced August 2024.

  13. arXiv:2406.17897  [pdf, other]

    eess.IV

    Pixel-weighted Multi-pose Fusion for Metal Artifact Reduction in X-ray Computed Tomography

    Authors: Diyu Yang, Craig A. J. Kemp, Soumendu Majee, Gregery T. Buzzard, Charles A. Bouman

    Abstract: X-ray computed tomography (CT) reconstructs the internal morphology of a three-dimensional object from a collection of projection images, most commonly using a single rotation axis. However, for objects containing dense materials like metal, the use of a single rotation axis may leave some regions of the object obscured by the metal, even though projections from other rotation axes (or poses) migh…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Submitted to IEEE MMSP 2024. arXiv admin note: substantial text overlap with arXiv:2209.07561

  14. arXiv:2406.10056  [pdf, other]

    cs.SD eess.AS

    UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

    Authors: Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

    Abstract: Large language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-dr…

    Submitted 14 June, 2024; originally announced June 2024.

  15. arXiv:2406.08336  [pdf, other]

    cs.SD cs.CV eess.AS

    CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

    Authors: Xueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng

    Abstract: Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (…

    Submitted 24 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  16. arXiv:2406.06329  [pdf, other]

    cs.CL eess.AS

    A Parameter-efficient Language Extension Framework for Multilingual ASR

    Authors: Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

    Abstract: Covering all languages with a multilingual speech recognition model (MASR) is very difficult. Performing language extension on top of an existing MASR is a desirable choice. In this study, the MASR continual learning problem is probabilistically decomposed into language identity prediction (LP) and cross-lingual adaptation (XLA) sub-problems. Based on this, we propose an architecture-based framewo…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  17. arXiv:2406.02940  [pdf, other]

    cs.SD eess.AS

    Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

    Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng

    Abstract: VQ-VAE, a mainstream approach to speech tokenization, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into c…

    Submitted 5 June, 2024; originally announced June 2024.

  18. arXiv:2406.02328  [pdf, other]

    cs.SD eess.AS

    SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

    Authors: Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

    Abstract: In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) It can be trained on a speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech in an NAR manner; (3) It tries to model speech in a finite and compac…

    Submitted 14 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024

  19. arXiv:2405.03141  [pdf, other]

    eess.IV cs.AI cs.CV physics.med-ph

    Automatic Ultrasound Curve Angle Measurement via Affinity Clustering for Adolescent Idiopathic Scoliosis Evaluation

    Authors: Yihao Zhou, Timothy Tin-Yan Lee, Kelly Ka-Lee Lai, Chonglin Wu, Hin Ting Lau, De Yang, Chui-Yi Chan, Winnie Chiu-Wing Chu, Jack Chun-Yiu Cheng, Tsz-Ping Lam, Yong-Ping Zheng

    Abstract: The current clinical gold standard for evaluating adolescent idiopathic scoliosis (AIS) is X-ray radiography, using Cobb angle measurement. However, the frequent monitoring of the AIS progression using X-rays poses a challenge due to the cumulative radiation exposure. Although 3D ultrasound has been validated as a reliable and radiation-free alternative for scoliosis assessment, the process of mea…

    Submitted 6 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

  20. arXiv:2404.04427  [pdf]

    physics.med-ph eess.IV

    A comprehensive liver CT landmark pair dataset for evaluating deformable image registration algorithms

    Authors: Zhendong Zhang, Edward Robert Criscuolo, Yao Hao, Deshan Yang

    Abstract: Purpose: Evaluating deformable image registration (DIR) algorithms is vital for enhancing algorithm performance and gaining clinical acceptance. However, there's a notable lack of dependable DIR benchmark datasets for assessing DIR performance except for lung images. To address this gap, we aim to introduce our comprehensive liver computed tomography (CT) DIR landmark dataset library. Acquisitio…

    Submitted 5 April, 2024; originally announced April 2024.

    Comments: 17 pages, 6 figures

  21. arXiv:2404.03204  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

    Authors: Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

    Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. Th…

    Submitted 19 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  22. arXiv:2403.13720  [pdf, other]

    cs.SD eess.AS

    UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

    Authors: Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito

    Abstract: We present UTDUSS, the UTokyo-SaruLab system submitted to the Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech units learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates a neural audio codec (NAC) pre-trained on only speech c…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: 5 pages, 3 figures

  23. arXiv:2403.12970  [pdf]

    eess.IV cs.CV physics.bio-ph physics.optics

    Hybrid deep learning and physics-based neural network for programmable illumination computational microscopy

    Authors: Ruiqing Sun, Delong Yang, Shaohui Zhang, Qun Hao

    Abstract: Deep models and physical models are the two mainstream approaches to solving inverse sample reconstruction problems in programmable illumination computational microscopy. Solutions based on physical models possess strong generalization capabilities but struggle with global optimization of inverse problems due to insufficient physical constraints. In contrast, deep learn…

    Submitted 17 January, 2024; originally announced March 2024.

  24. arXiv:2403.11974  [pdf, other]

    eess.IV cs.CV

    OUCopula: Bi-Channel Multi-Label Copula-Enhanced Adapter-Based CNN for Myopia Screening Based on OU-UWF Images

    Authors: Yang Li, Qiuyi Huang, Chong Zhong, Danjuan Yang, Meiyan Li, A. H. Welsh, Aiyi Liu, Bo Fu, Catherien C. Liu, Xingtao Zhou

    Abstract: Myopia screening using cutting-edge ultra-widefield (UWF) fundus imaging is potentially significant for ophthalmic outcomes. Current multidisciplinary research between ophthalmology and deep learning (DL) concentrates primarily on disease classification and diagnosis using single-eye images, largely ignoring joint modeling and prediction for Oculus Uterque (OU, both eyes). Inspired by the complex…

    Submitted 18 March, 2024; originally announced March 2024.

  25. arXiv:2403.03100  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

    Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing di…

    Submitted 23 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

  26. arXiv:2402.00288  [pdf, other]

    eess.AS cs.SD

    Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech

    Authors: Dong Yang, Tomoki Koriyama, Yuki Saito

    Abstract: Developing Text-to-Speech (TTS) systems that can synthesize natural breath is essential for human-like voice agents but requires extensive manual annotation of breath positions in training data. To this end, we propose a self-training method for training a breath detection model that can automatically detect breath positions in speech. Our method trains the model using a large speech corpus and in…

    Submitted 14 June, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

    Comments: Accepted by INTERSPEECH2024

  27. arXiv:2401.03800  [pdf, other]

    cs.CV eess.IV

    MvKSR: Multi-view Knowledge-guided Scene Recovery for Hazy and Rainy Degradation

    Authors: Dong Yang, Wenyu Xu, Yuan Gao, Yuxu Lu, Jingming Zhang, Yu Guo

    Abstract: High-quality imaging is crucial for ensuring safety supervision and intelligent deployment in fields like transportation and industry. It enables precise and detailed monitoring of operations, facilitating timely detection of potential hazards and efficient management. However, adverse weather conditions, such as atmospheric haziness and precipitation, can have a significant impact on image qualit…

    Submitted 8 January, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

  28. arXiv:2401.03689  [pdf, other]

    eess.AS cs.SD

    LUPET: Incorporating Hierarchical Information Path into Multilingual ASR

    Authors: Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

    Abstract: Toward high-performance multilingual automatic speech recognition (ASR), various types of linguistic information and model design have demonstrated their effectiveness independently. They include language identity (LID), phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representation. It is expected that leveraging their benefits synergistically i…

    Submitted 10 June, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech 2024

  29. arXiv:2401.02118  [pdf, other]

    cs.IT eess.SP

    Radio Map-Based Spectrum Sharing for Joint Communication and Sensing

    Authors: Xionran Fang, Wei Feng, Yunfei Chen, Dingxi Yang, Ning Ge, Zhiyong Feng, Yue Gao

    Abstract: The sixth-generation (6G) network is expected to provide both communication and sensing (C&S) services. However, spectrum scarcity poses a major challenge to the harmonious coexistence of C&S systems. Without effective cooperation, the interference resulting from spectrum sharing impairs the performance of both systems. This paper addresses C&S interference within a distributed network. Different…

    Submitted 27 June, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

  30. arXiv:2312.15463  [pdf, other]

    eess.AS cs.SD

    Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

    Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng

    Abstract: Query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain the query embedding. In this way, the separation model is optimized to be adapted to the distribution of the query embedding. However, the query embedding may exhibit mismatches with separation models due to in…

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  31. arXiv:2311.18168  [pdf, other]

    cs.CV cs.LG eess.AS

    Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

    Authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel

    Abstract: We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D f…

    Submitted 29 November, 2023; originally announced November 2023.

  32. arXiv:2311.17790  [pdf, other]

    cs.SD eess.AS

    FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition

    Authors: Dongning Yang, Wei Wang, Yanmin Qian

    Abstract: Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the introduction of distortions during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages…

    Submitted 29 November, 2023; originally announced November 2023.

  33. arXiv:2311.04685  [pdf, other]

    eess.IV

    An End-Cloud Computing Enabled Surveillance Video Transmission System

    Authors: Dingxi Yang, Zhijin Qin, Liting Wang, Xiaoming Tao, Fang Cui, Hengjiang Wang

    Abstract: The enormous data volume of video poses a significant burden on the network. Particularly, transferring high-definition surveillance videos to the cloud consumes a significant amount of spectrum resources. To address these issues, we propose a surveillance video transmission system enabled by end-cloud computing. Specifically, the cameras actively down-sample the original video and then a redundan…

    Submitted 8 November, 2023; originally announced November 2023.

  34. arXiv:2310.04567  [pdf, other]

    eess.AS cs.SD

    DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

    Authors: Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali

    Abstract: Common target sound extraction (TSE) approaches have primarily relied on discriminative methods to separate the target sound while minimizing interference from unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, a first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, to achieve…

    Submitted 9 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  35. arXiv:2310.04114  [pdf, other]

    eess.IV cs.CV

    Aorta Segmentation from 3D CT in MICCAI SEG.A. 2023 Challenge

    Authors: Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu

    Abstract: The aorta provides the main blood supply of the body. Screening of the aorta with imaging helps with early aortic disease detection and monitoring. In this work, we describe our solution to the Segmentation of the Aorta (SEG.A.231) from 3D CT challenge. We use the automated segmentation method Auto3DSeg available in MONAI. Our solution achieves an average Dice score of 0.920 and 95th percentile of the Hausdorf…

    Submitted 6 October, 2023; originally announced October 2023.

    Comments: MICCAI 2023, SEG.A. 2023 challenge 1st place

  36. arXiv:2310.02862  [pdf, other]

    cs.LG cs.AI eess.SP

    A novel asymmetrical autoencoder with a sparsifying discrete cosine Stockwell transform layer for gearbox sensor data compression

    Authors: Xin Zhu, Daoguang Yang, Hongyi Pan, Hamid Reza Karimi, Didem Ozevin, Ahmet Enis Cetin

    Abstract: The lack of an efficient compression model remains a challenge for the wireless transmission of gearbox data in non-contact gear fault diagnosis problems. In this paper, we present a signal-adaptive asymmetrical autoencoder with a transform domain layer to compress sensor signals. First, a new discrete cosine Stockwell transform (DCST) layer is introduced to replace linear layers in a multi-layer…

    Submitted 4 October, 2023; originally announced October 2023.

  37. arXiv:2310.00704  [pdf, other]

    cs.SD eess.AS

    UniAudio: An Audio Foundation Model Toward Universal Audio Generation

    Authors: Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, Helen Meng

    Abstract: Large Language models (LLM) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) with given input conditions. UniAudio 1) first tokenizes all types of target audio along with other con…

    Submitted 11 December, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

  38. arXiv:2309.17269  [pdf, ps, other]

    eess.IV cs.CV

    Unpaired Optical Coherence Tomography Angiography Image Super-Resolution via Frequency-Aware Inverse-Consistency GAN

    Authors: Weiwen Zhang, Dawei Yang, Haoxuan Che, An Ran Ran, Carol Y. Cheung, Hao Chen

    Abstract: For optical coherence tomography angiography (OCTA) images, a limited scanning rate leads to a trade-off between field-of-view (FOV) and imaging resolution. Although larger FOV images may reveal more parafoveal vascular lesions, their application is greatly hampered due to lower resolution. To increase the resolution, previous works only achieved satisfactory performance by using paired data for t…

    Submitted 29 September, 2023; originally announced September 2023.

    Comments: 10 pages, 9 figures

  39. arXiv:2309.02285  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS 2: Describing and Generating Voices with Text Prompt

    Authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

    Abstract: Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text…

    Submitted 11 October, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Demo page: https://speechresearch.github.io/prompttts2

  40. arXiv:2309.01212  [pdf, other]

    cs.SD eess.AS

    NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement

    Authors: Wen Wang, Dongchao Yang, Qichen Ye, Bowen Cao, Yuexian Zou

    Abstract: The goal of speech enhancement (SE) is to eliminate the background interference from the noisy speech signal. Generative models such as diffusion models (DM) have been applied to the task of SE because of better generalization in unseen noisy scenes. Technical routes for the DM-based SE methods can be summarized into three types: task-adapted diffusion process formulation, generator-plus-condition…

    Submitted 3 September, 2023; originally announced September 2023.

  41. arXiv:2308.06891  [pdf]

    cs.RO eess.SY

    Viia-hand: a Reach-and-grasp Restoration System Integrating Voice interaction, Computer vision and Auditory feedback for Blind Amputees

    Authors: Chunhao Peng, Dapeng Yang, Ming Cheng, Jinghui Dai, Deyu Zhao, Li Jiang

    Abstract: Visual feedback plays a crucial role in the process of amputation patients completing grasping in the field of prosthesis control. However, for blind and visually impaired (BVI) amputees, the loss of both visual and grasping abilities makes the "easy" reach-and-grasp task a feasible challenge. In this paper, we propose a novel multi-sensory prosthesis system helping BVI amputees with sensing, navi…

    Submitted 13 August, 2023; originally announced August 2023.

  42. Cooperative Decision-Making in Shared Spaces: Making Urban Traffic Safer through Human-Machine Cooperation

    Authors: Balint Varga, Dongxu Yang, Sören Hohmann

    Abstract: In this paper, a cooperative decision-making is presented, which is suitable for intention-aware automated vehicle functions. With an increasing number of highly automated and autonomous vehicles on public roads, trust is a very important issue regarding their acceptance in our society. The most challenging scenarios arise at low driving speeds of these highly automated and autonomous vehicles, wh…

    Submitted 26 June, 2023; originally announced June 2023.

    Journal ref: 2023 IEEE 21st Jubilee International Symposium on Intelligent Systems and Informatics (SISY)

  43. arXiv:2305.19269  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Make-A-Voice: Unified Voice Synthesis With Discrete Representation

    Authors: Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

    Abstract: Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in human voice, including speak…

    Submitted 30 May, 2023; originally announced May 2023.

  44. arXiv:2305.18474  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

    Authors: Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao

    Abstract: Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples since…

    Submitted 29 May, 2023; originally announced May 2023.

  45. arXiv:2305.05835  [pdf, other]

    eess.IV cs.CV cs.LG

    Reference-based OCT Angiogram Super-resolution with Learnable Texture Generation

    Authors: Yuyan Ruan, Dawei Yang, Ziqi Tang, An Ran Ran, Carol Y. Cheung, Hao Chen

    Abstract: Optical coherence tomography angiography (OCTA) is a new imaging modality to visualize retinal microvasculature and has been readily adopted in clinics. High-resolution OCT angiograms are important to qualitatively and quantitatively identify potential biomarkers for different retinal diseases accurately. However, one significant problem of OCTA is the inevitable decrease in resolution when increa…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: 12 pages, 11 figures

    MSC Class: 68T07

    ACM Class: I.2; I.4

  46. arXiv:2305.02765  [pdf, other]

    cs.SD eess.AS

    HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

    Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, Yuexian Zou

    Abstract: Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encode…

    Submitted 7 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: The second version of HiFi-Codec

  47. arXiv:2304.12995  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

    Authors: Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

    Abstract: Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements…

    Submitted 25 April, 2023; originally announced April 2023.

  48. arXiv:2304.10167  [pdf]

    physics.optics cs.ET eess.IV

    Adaptive coded illumination Fourier ptychography microscopy based on physical neural network

    Authors: Ruiqing Sun, Delong Yang, Yao Hu, Qun Hao, Xin Li, Shaohui Zhang

    Abstract: Fourier Ptychographic Microscopy (FPM) is a computational technique that achieves a large space-bandwidth product imaging. It addresses the challenge of balancing a large field of view and high resolution by fusing information from multiple images taken with varying illumination angles. Nevertheless, conventional FPM framework always suffers from long acquisition time and a heavy computational bur…

    Submitted 20 April, 2023; originally announced April 2023.

  49. arXiv:2304.03966  [pdf, other]

    eess.SY

    A Smart Switch Configuration and Reliability Assessment Method for Large-Scale Offshore Wind Farm Electrical Collector System

    Authors: Xiaochi Ding, Xinwei Shen, Qiuwei Wu, Liming Wang, Dechang Yang

    Abstract: With the development of offshore wind farms (OWFs) in far-offshore and deep-sea areas, each OWF could contain more and more wind turbines and cables, making it imperative to study high-reliability electrical collector system (ECS) for OWF. Enlightened by active distribution network, for OWF, we propose an ECS switch configuration that enables post-fault network recovery, along with a reliability a…

    Submitted 8 April, 2023; originally announced April 2023.

    Comments: 10 pages

  50. Intention-Aware Decision-Making for Mixed Intersection Scenarios

    Authors: Balint Varga, Dongxu Yang, Soeren Hohmann

    Abstract: This paper presents a white-box intention-aware decision-making for the handling of interactions between a pedestrian and an automated vehicle (AV) in an unsignalized street crossing scenario. Moreover, a design framework has been developed, which enables automated parameterization of the decision-making. This decision-making is designed in such a manner that it can understand pedestrians in urban…

    Submitted 29 March, 2023; originally announced March 2023.