Background/Objectives: This study presents a comparative analysis of the multistage diagnosis of Alzheimer’s disease (AD), including mild cognitive impairment (MCI), using two distinct types of biomarkers: blood gene expression and clinical biomarker samples. Both types of samples, obtained from participants in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), were analyzed independently using machine learning (ML)-based multiclassifiers. We applied novel ML-based data augmentation techniques to the gene expression profile data, which are high-dimensional, low-sample-size (HDLSS) and inherently highly imbalanced. This is the only investigation to date in which multiclassification-based AD stage diagnosis was conducted using blood gene expression data, and it obtained the highest multiclassification performance to date for multistage AD diagnosis from the blood gene expression profiles of ADNI participants. Based on these performance results and other factors such as early prediction capability, we compare the efficacy of the two types of biomarkers for multistage diagnosis. We obtained the best multiclassification result in both modalities of the ADNI data in terms of F1 score and identified new genetic biomarkers.
Methods: Features were selected using a combination of XGBoost and Sequential Floating Backward Selection (SFBS). For the gene expression data, the 95 most effective gene probe sets were selected out of 49,386. For the clinical study data, the eight most effective biomarkers were selected using SFBS. A deep learning (DL) classifier was used to identify the stages: cognitively normal (CN), MCI, and AD/dementia. DL, support vector machine (SVM), gradient boosting (GB), and random forest (RF) classifiers were used for AD stage detection from the gene expression profile data. Because the genomic data are highly imbalanced, borderline oversampling (data augmentation) was applied only during model training, and the original samples were used for validation.
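The train-only augmentation described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: simple random oversampling stands in for borderline oversampling, the class counts and one-dimensional features are invented toy values, and the helper names `stratified_split` and `oversample_to_balance` are hypothetical. The point it shows is the ordering: split first, then balance the training portion only, so validation contains original samples exclusively.

```python
import random
from collections import Counter


def stratified_split(labels, val_fraction=0.25, seed=0):
    """Return (train_idx, val_idx), holding out val_fraction of each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = max(1, round(len(idxs) * val_fraction))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx


def oversample_to_balance(x, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest.

    Assumption: plain random oversampling, used here only as a stand-in
    for the borderline oversampling named in the abstract.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    x_out, y_out = list(x), list(y)
    for cls, n in counts.items():
        pool = [xi for xi, yi in zip(x, y) if yi == cls]
        for _ in range(target - n):
            x_out.append(rng.choice(pool))
            y_out.append(cls)
    return x_out, y_out


# Toy, deliberately imbalanced data (invented counts, not ADNI figures).
labels = ["CN"] * 40 + ["MCI"] * 60 + ["AD"] * 12
samples = [[float(i)] for i in range(len(labels))]

# 1) Split the ORIGINAL samples first.
train_idx, val_idx = stratified_split(labels)
x_train = [samples[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
x_val = [samples[i] for i in val_idx]
y_val = [labels[i] for i in val_idx]

# 2) Balance the training portion only; validation stays untouched.
x_bal, y_bal = oversample_to_balance(x_train, y_train)

print("train before:", dict(Counter(y_train)))
print("train after: ", dict(Counter(y_bal)))
print("validation:  ", dict(Counter(y_val)))
```

Balancing before the split would let copies of a validation sample leak into training and inflate the reported scores, which is why the order matters.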
Results: Using clinical data, the highest ROC AUC scores attained were 0.989, 0.927, and 0.907 for the identification of the CN, MCI, and dementia stages, respectively. The highest F1 scores achieved were 0.971, 0.939, and 0.886. Using gene expression data, we obtained ROC AUC scores of 0.763, 0.761, and 0.706 and F1 scores of 0.71, 0.77, and 0.53 for the CN, MCI, and dementia stages, respectively.
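Per-stage scores such as those reported in the Results are typically computed one-vs-rest, treating each stage in turn as the positive class. The sketch below assumes that convention (the abstract does not state how the scores were computed) and uses invented toy labels and scores. The function names `per_class_f1` and `one_vs_rest_auc` are hypothetical.

```python
def per_class_f1(y_true, y_pred, classes):
    """F1 for each class, treating that class as positive and the rest as negative."""
    scores = {}
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def one_vs_rest_auc(y_true, class_scores, cls):
    """ROC AUC for one class as the fraction of (positive, negative) pairs
    ranked correctly by the class score; ties count as half."""
    pos = [s for t, s in zip(y_true, class_scores) if t == cls]
    neg = [s for t, s in zip(y_true, class_scores) if t != cls]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predictions (invented for illustration only).
y_true = ["CN", "CN", "CN", "MCI", "MCI", "MCI", "MCI", "AD", "AD", "AD"]
y_pred = ["CN", "CN", "MCI", "MCI", "MCI", "MCI", "CN", "AD", "MCI", "AD"]
# Toy predicted scores for the "AD" class, one per sample.
ad_scores = [0.1, 0.2, 0.5, 0.3, 0.1, 0.2, 0.3, 0.9, 0.4, 0.8]

f1 = per_class_f1(y_true, y_pred, ["CN", "MCI", "AD"])
auc_ad = one_vs_rest_auc(y_true, ad_scores, "AD")
print({k: round(v, 3) for k, v in f1.items()})
print(round(auc_ad, 3))
```

Reporting the score per stage rather than a single average makes weak performance on a minority class visible, which matters for imbalanced data.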
Conclusions: These results represent the best outcome to date for AD stage diagnosis from ADNI blood gene expression profile data using multiclassification techniques. They indicate that our multiclassification model effectively handles imbalanced HDLSS data and can identify samples of the minority class. MAPK14, PLG, FZD2, FXYD6, and TEP1 are among the novel genes identified as being associated with AD risk.