Background: Stroke is a leading cause of disability and mortality worldwide, and ischemic stroke accounts for the majority of cases. Despite advances in neuroimaging, supplementary diagnostic tools are still needed to improve diagnostic accuracy. This study explores the use of machine learning (ML) techniques to predict ischemic stroke from RNA-seq data in the GEO database (GSE22255).
Methods: We developed several machine learning models, including Random Forest, K-Nearest Neighbors (KNN), and Chi-squared Automatic Interaction Detection (CHAID), and evaluated them on accuracy, precision, specificity, and sensitivity. The dataset comprised 54,676 genes across 40 samples (20 cases and 20 controls). All modeling was conducted in IBM SPSS Modeler version 18.
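The four evaluation measures named above can be illustrated with a minimal Python sketch. It is not the SPSS Modeler workflow used in the study. The function name `classification_metrics` and the ten-sample label vectors are hypothetical and serve only to show how each measure is derived from the confusion-matrix counts.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, sensitivity, and specificity
    for a binary classifier from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Made-up labels for illustration only (1 = case, 0 = control);
# these are NOT the study's predictions.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With a balanced design such as 20 cases and 20 controls, accuracy weights both classes equally, while sensitivity and specificity show separately how well cases and controls are recognized.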
Results: The models were compared on classification accuracy, performance evaluation scores, and AUC and Gini metrics. The Random Forest model achieved the highest accuracy (96.67% in training, 80% in testing), while the CHAID algorithm yielded interpretable results, identifying TP53, CYP1A1, and CYP2D6 as key variables. The KNN model also performed well, with high confidence in its predictions.
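For readers unfamiliar with how AUC and the Gini coefficient relate, the following minimal Python sketch may help. It assumes the common convention Gini = 2 × AUC − 1 and computes AUC as the fraction of case-control pairs in which the case receives the higher score. The function names and the six example scores are hypothetical and are not taken from the study.

```python
def auc_from_scores(y_true, scores, positive=1):
    """AUC as the probability that a randomly chosen positive sample
    is scored higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Gini coefficient under the assumed convention Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0


# Made-up scores for illustration only (1 = case, 0 = control).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_from_scores(y_true, scores)
gini = gini_from_auc(auc)
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```

Under this convention a classifier no better than chance (AUC = 0.5) has a Gini of 0, and a perfect ranking (AUC = 1) has a Gini of 1.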
Conclusion: This study demonstrates the potential of ML techniques, particularly Random Forest, to support stroke diagnosis and provide insights into stroke pathology, offering a novel approach to improving clinical decision-making. However, the study is limited by its small sample size (40 samples); future work should validate the models on larger datasets and integrate other omics data before clinical application.