DOI: 10.1145/3700666.3700693
research-article

Evaluation of Synthetic Data Generation Models for Balancing Multiclass Metabolomic Profiles

Published: 13 January 2025

Abstract

Balancing highly imbalanced multiclass datasets is a crucial step for effective classification in health and well-being analysis, yet it remains challenging, especially for conditions such as Metabolic Syndrome (MetS). This study evaluates Synthetic Data Generation (SDG) techniques for balancing a dataset of metabolomic profiles associated with MetS. An evaluation of models such as Gaussian Copula, FAST ML, CTGAN, TVAE, TabDDPM, and Tabsyn shows that they preserve multivariate relationships effectively but differ in how well they replicate the distributions of individual variables. Classification performance improves significantly when the data are balanced, highlighting the potential of SDG to enhance real-world data analysis and decision-making in health research.

Published In

ICBRA '24: Proceedings of the 11th International Conference on Bioinformatics Research and Applications
September 2024
144 pages
ISBN: 9798400717536
DOI: 10.1145/3700666

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data imbalance
  2. metabolic syndrome
  3. metabolomic
  4. multiclass
  5. synthetic data generation

Qualifiers

  • Research-article

Funding Sources

  • Industry, Energy Transition and Sustainability department of the Basque country: ELKARTEK 2024 call

Conference

ICBRA '24
