DOI: 10.1145/3535508.3545536
research-article
Public Access

A comparison of dimensionality reduction methods for large biological data

Published: 07 August 2022

Abstract

Large-scale data often suffer from the curse of dimensionality and its associated constraints; dimensionality reduction is therefore commonly applied as a first step in machine learning pipelines. In this paper, we directly compare the performance of autoencoders as a dimensionality reduction technique (via the latent space) with that of three established methods: PCA, LASSO, and t-SNE. To make the comparison robust, we use four distinct datasets that vary in feature types, metadata, labels, and size. We test prediction capability using both Support Vector Machines (SVM) and Random Forests (RF). We conclude that autoencoders are equivalent to the previously established methods as a dimensionality reduction architecture, and that they often outperform those methods in both prediction accuracy and run time when condensing large, sparse datasets.
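
The evaluation design described in the abstract (reduce the features, then score a downstream classifier) can be illustrated with a minimal sketch. This is not the authors' code: it assumes scikit-learn, uses a synthetic dataset in place of the four datasets in the paper, and every hyperparameter below (`n_components=10`, `alpha=0.01`, and so on) is a hypothetical choice made only for illustration. It covers PCA and LASSO only; the autoencoder is left out for brevity, and t-SNE is omitted because scikit-learn's `TSNE` has no `transform` method for held-out samples.

```python
import time

from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a wide dataset (assumption: not the paper's data).
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=15, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Two of the reduction methods named in the abstract (settings are illustrative).
reducers = {
    "PCA": PCA(n_components=10, random_state=0),
    # LASSO as feature selection: keep features with non-zero coefficients.
    "LASSO": SelectFromModel(Lasso(alpha=0.01, max_iter=10000)),
}

# The two downstream classifiers named in the abstract.
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each reducer/classifier pair; record test accuracy and fit time.
results = {}
for r_name, reducer in reducers.items():
    for c_name, clf in classifiers.items():
        pipe = make_pipeline(StandardScaler(), clone(reducer), clone(clf))
        start = time.perf_counter()
        pipe.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        results[(r_name, c_name)] = (pipe.score(X_test, y_test), elapsed)

for (r_name, c_name), (acc, secs) in results.items():
    print(f"{r_name:>5} + {c_name:<3}  accuracy={acc:.3f}  fit_time={secs:.3f}s")
```

Placing the reducer inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the reduced representation. The accuracies and timings printed here depend on the synthetic data and say nothing about the results reported in the paper.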


Published In

BCB '22: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
August 2022
549 pages
ISBN: 9781450393867
DOI: 10.1145/3535508

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. autoencoders
  2. classification
  3. dimensionality reduction

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%
