DOI: 10.1145/3637528.3671836
research-article

Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models

Published: 24 August 2024

Abstract

Synthesizing electronic health records (EHR) data has become a preferred strategy for addressing data scarcity and improving data quality and model fairness in healthcare. However, existing approaches to EHR data generation rely predominantly on generative techniques such as generative adversarial networks, variational autoencoders, and language models. These methods typically replicate input visits, so they model temporal dependencies between visits inadequately and overlook the generation of time information, a crucial element of EHR data. Moreover, their ability to learn visit representations is limited because they use simple linear mapping functions, which compromises generation quality. To address these limitations, we propose EHRPD, a novel diffusion-based EHR data generation model that predicts the next visit from the current one while also estimating time intervals. To enhance generation quality and diversity, we introduce a novel time-aware visit embedding module and a predictive denoising diffusion probabilistic model (P-DDPM). Additionally, we devise a predictive U-Net (PU-Net) to optimize P-DDPM. We conduct experiments on two public datasets and evaluate EHRPD in terms of fidelity, privacy, and utility. The experimental results demonstrate the efficacy and utility of EHRPD in addressing these limitations and advancing EHR data generation.
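
The abstract does not specify how P-DDPM is constructed. As background only, the following Python sketch shows the forward (noising) step of a standard denoising diffusion probabilistic model, the family of models that P-DDPM builds on. The linear noise schedule, the step count of 1000, and the four-dimensional `visit_embedding` vector are illustrative assumptions and are not taken from the paper; the predictive conditioning on the current visit and the time interval estimation are not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).

    Returns the noised vector and the Gaussian noise that produced it; a
    denoising network would typically be trained to recover that noise.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


rng = random.Random(0)                   # fixed seed for repeatability
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Hypothetical stand-in for a learned visit embedding.
visit_embedding = [0.5, -1.2, 0.3, 0.8]

# At the last step almost no signal remains: x_t is close to pure noise.
x_t, eps = forward_noise(visit_embedding, t=999, alpha_bars=alpha_bars, rng=rng)
```

Under this schedule `alpha_bars` shrinks toward zero as the step index grows, so early steps keep most of the input and late steps are dominated by noise.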

Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024
6901 pages
ISBN: 9798400704901
DOI: 10.1145/3637528

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. diffusion models
2. electronic health records
3. medical data synthesis
4. multimodal data mining

Conference

KDD '24
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
