Predicting stroke occurrence

A machine learning approach

February 25, 2025

Key finding

This project identified a pervasive methodological flaw in the literature that applies machine learning to an open-access dataset to predict stroke occurrence. More than nine published papers, some in prestigious peer-reviewed journals, oversampled the minority class before splitting the dataset into training and test sets. This data leakage produced overoptimistic estimates of model performance because the training and test sets were no longer independent: synthetic oversampled observations could appear in the test set while being directly derived from observations in the training set. Applying oversampling correctly in this project, to the training data only and after the split, revealed a degradation in model performance, highlighting the danger this flaw poses for the real-world implementation of such models in imbalanced classification problems.
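The leakage can be illustrated with a minimal Python sketch. This is not the code from the report, which used R. It assumes a hypothetical dataset of 950 non-events and 50 events, and it uses plain random duplication of minority rows as a simplified stand-in for synthetic oversampling methods. All function names here are invented for illustration.

```python
import random

def make_data(n_major=950, n_minor=50, seed=0):
    """Hypothetical imbalanced dataset: each row is (row_id, label)."""
    rng = random.Random(seed)
    rows = [(i, 0) for i in range(n_major)]
    rows += [(n_major + i, 1) for i in range(n_minor)]
    rng.shuffle(rows)
    return rows

def oversample_minority(rows, rng):
    """Random oversampling: duplicate minority rows until classes balance.
    A simplified stand-in for synthetic methods, used only for illustration."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

def split(rows, rng, test_frac=0.25):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def leaked_test_rows(train, test):
    """Count test rows whose source observation also appears in training."""
    train_ids = {row_id for row_id, _ in train}
    return sum(1 for row_id, _ in test if row_id in train_ids)

rng = random.Random(42)
data = make_data()

# Flawed order: oversample first, then split.
train_bad, test_bad = split(oversample_minority(data, rng), rng)
leak_bad = leaked_test_rows(train_bad, test_bad)

# Correct order: split first, then oversample the training set only.
train_raw, test_ok = split(data, rng)
train_ok = oversample_minority(train_raw, rng)
leak_ok = leaked_test_rows(train_ok, test_ok)

print("leaked test rows, oversample then split:", leak_bad)
print("leaked test rows, split then oversample:", leak_ok)
```

With the flawed order, copies of the same minority observation land on both sides of the split, so the model is partly tested on data it has already seen. With the correct order, the test set stays untouched and keeps the original class imbalance.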

Background

An open-access dataset of routinely collected electronic health records was used to predict stroke occurrence, framed as a binary classification problem. Four machine learning algorithms (neural network, support vector machine, gradient boosted decision trees and random forest) were compared individually and in a stacked ensemble, using the tidymodels and stacks packages in R. Because the dataset contained far fewer stroke events than non-events, this was an imbalanced classification problem. Consideration was therefore given to subsampling methods and to evaluation metrics that avoid the “accuracy paradox” and adequately assess model performance.
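The accuracy paradox mentioned above can be shown with a small worked example in Python. The 5% event rate below is a hypothetical figure chosen for illustration, not the prevalence in the stroke dataset, and the metrics shown are common choices for imbalanced problems rather than necessarily those used in the report.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def metrics(y_true, y_pred):
    """Accuracy alongside metrics that account for class imbalance."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical data: 50 events and 950 non-events.
y_true = [1] * 50 + [0] * 950

# A useless "model" that never predicts an event.
y_pred = [0] * 1000

result = metrics(y_true, y_pred)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

A model that never predicts an event scores high accuracy here simply because non-events dominate, yet it detects no events at all. Sensitivity and balanced accuracy expose that failure where accuracy hides it.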

You can read the full report HERE
