Predicting stroke occurrence
A machine learning approach
February 25, 2025
Key finding
This project identified a pervasive methodological flaw in the literature that applies machine learning to an open-access dataset to predict stroke occurrence. More than nine published scientific papers, some in prestigious peer-reviewed journals, oversampled the minority class before splitting the dataset into training and test sets. This causes data leakage: synthetic observations that land in the test set can be derived directly from records that land in the training set, so the two sets are correlated and estimates of model performance are overoptimistic. When oversampling was applied correctly in this project, model performance degraded, which highlights the danger this flaw poses when such models are implemented for real-world imbalanced classification problems.
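The leakage described above can be illustrated with a minimal sketch. It is not the project's code (the project used R), and it makes several assumptions: the data are hypothetical (1,000 records, 50 positives, a row id standing in for the features), and simple random duplication stands in for whichever oversampling method the affected papers used. All function names are illustrative.

```python
import random

def make_data(n=1000, n_positive=50):
    """Hypothetical records: (row_id, label). The id stands in for the features."""
    return [(i, 1 if i < n_positive else 0) for i in range(n)]

def oversample(rows, rng):
    """Random oversampling: duplicate minority rows until the classes balance."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra

def split(rows, rng, test_fraction=0.25):
    """Shuffle and cut into (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def leaked_test_rows(train, test):
    """Count test rows whose source record also appears in the training set."""
    train_ids = {row_id for row_id, _ in train}
    return sum(1 for row_id, _ in test if row_id in train_ids)

data = make_data()

# Flawed order: oversample first, then split.
rng = random.Random(42)
train_bad, test_bad = split(oversample(data, rng), rng)
leak_bad = leaked_test_rows(train_bad, test_bad)

# Correct order: split first, then oversample the training set only.
rng = random.Random(42)
train_raw, test_good = split(data, rng)
train_good = oversample(train_raw, rng)
leak_good = leaked_test_rows(train_good, test_good)

print("leaked test rows, oversample then split:", leak_bad)
print("leaked test rows, split then oversample:", leak_good)
```

In the flawed order, copies of the same minority record fall on both sides of the split, so the model is partly scored on records it has already seen. In the correct order the test set keeps its original class balance and shares no records with the training set.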
Background
An open-source dataset of routinely collected electronic health records was used to predict stroke occurrence, framed as a binary classification problem. Four machine learning algorithms (neural network, support vector machine, gradient boosted decision trees and random forest) were compared individually and in a stacked ensemble, using the tidymodels and stacks packages in R. Because this was an imbalanced classification problem, with far fewer stroke events than non-events, consideration was given to subsampling methods and to evaluation metrics that avoid the “accuracy paradox” and adequately assess model performance.
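The accuracy paradox mentioned above can be shown with a short worked example. The class counts are hypothetical (50 strokes among 1,000 patients, not the dataset's actual figures), and the "model" is a trivial baseline that always predicts no stroke.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical imbalanced outcome: 50 strokes, 950 non-events.
y_true = [1] * 50 + [0] * 950
# Trivial baseline: predict "no stroke" for every patient.
y_pred = [0] * len(y_true)

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)                 # 950 / 1000
sensitivity = tp / (tp + fn)                       # no strokes detected
specificity = tn / (tn + fp)                       # every non-event is correct
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy:          {accuracy:.2f}")
print(f"sensitivity:       {sensitivity:.2f}")
print(f"specificity:       {specificity:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```

The baseline scores 95% accuracy while detecting no strokes at all, which is why metrics that account for the minority class, such as sensitivity or balanced accuracy, are more informative here than accuracy alone.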
You can read the full report HERE