
Teaching myself to code by building a stroke prediction app without AI

Author: bnhagy

TL;DR: The app works great.

Recently, I obtained a dataset of 5110 rows containing health data of anonymised patients. It was a good opportunity to train some models to predict stroke from health data. I happen to know some R, and the caret package happens to be perfect for this.

Below, I detail how I trained the models and how they predict stroke. I’ve uploaded the files to my Github and deployed the app using Shiny. It’s a bit wonky right now, but that’s the result of avoiding any AI use while writing my code.

How it went
#

As someone with no formal education in data science, computer science, or the like, I taught myself to code, starting with HTML on internet forums in the early 2000s. I’ve progressed since then, and learned R at university to process data. This project was a challenge for me because I had never done machine learning before, and I particularly wanted to challenge myself with absolutely no AI use.

Luckily, there were old StackOverflow posts from before the AI era, and there were useful blog posts online that helped me. One of the most useful resources I found was a post from the University of Virginia that taught me about multiple imputation. Originally, I thought of replacing missing values with averages, but I learned there was a package called mice that could do the calculations and impute the missing data for me. It was amazing how many packages are out there and how useful they are.

I really have to thank my friends who, despite not understanding what I was doing, really tried to help me. I had friends with computer science backgrounds and zero experience in R reading .txt files I sent them to find errors. They also told me nobody uses R in 2026. I wonder if it’s true…it’s difficult for me to abandon R (despite knowing Python) because R was the first programming language I learnt.

I value community so much that it upsets me when people abandon good connections to speak with AI. Despite struggling, I really appreciate how much Bima and Kelvin tried to help me through Discord calls. It made me think about how people who rely completely on AI could go through an entire learning journey without fostering any connections.

Not relying on AI was a challenge that made me more serious about hunting for resources. I had to interpret what I found, which was good practice for my technical literacy.

How I did it
#

Setup
#

Load the libraries: tidyverse to clean the data, caret for the machine learning models, and mice for multiple imputation of missing values. Note that some machine learning models will need additional packages installed, but let’s proceed with these for now.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(mice)

## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

After downloading the data, it’s important to take a look at it to get a feel for it.

file = "D://R Files//healthcare-dataset-stroke-data.csv"
dataset_raw = read.csv(file = file, na.strings="N/A")

head(dataset_raw)

##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21   NA    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12 24.0    never smoked      1
## 6          Urban            186.21 29.0 formerly smoked      1

summary(dataset_raw)

##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##                                                                      
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##                                                                            
##  avg_glucose_level      bmi        smoking_status         stroke       
##  Min.   : 55.12    Min.   :10.30   Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    1st Qu.:23.50   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Median :28.10   Mode  :character   Median :0.00000  
##  Mean   :106.15    Mean   :28.89                      Mean   :0.04873  
##  3rd Qu.:114.09    3rd Qu.:33.10                      3rd Qu.:0.00000  
##  Max.   :271.74    Max.   :97.60                      Max.   :1.00000  
##                    NA's   :201

Cleaning and tidying up the data
#

Ensure that the data has the correct data type.

dataset <- transform(dataset_raw, 
          bmi = as.numeric(bmi), 
          hypertension = as.logical(hypertension), 
          heart_disease = as.logical(heart_disease),
          gender = as.factor(gender),
          work_type = as.factor(work_type),
          Residence_type = as.factor(Residence_type),
          ever_married = as.factor(ever_married),
          stroke = as.factor(stroke),
          smoking_status = as.factor(smoking_status)
          )

dataset <- dataset %>%
  mutate(ever_married = ever_married == "Yes")

Let’s check:

summary(dataset)

##        id           gender          age        hypertension    heart_disease  
##  Min.   :   67   Female:2994   Min.   : 0.08   Mode :logical   Mode :logical  
##  1st Qu.:17741   Male  :2115   1st Qu.:25.00   FALSE:4612      FALSE:4834     
##  Median :36932   Other :   1   Median :45.00   TRUE :498       TRUE :276      
##  Mean   :36518                 Mean   :43.23                                  
##  3rd Qu.:54682                 3rd Qu.:61.00                                  
##  Max.   :72940                 Max.   :82.00                                  
##                                                                               
##  ever_married            work_type    Residence_type avg_glucose_level
##  Mode :logical   children     : 687   Rural:2514     Min.   : 55.12   
##  FALSE:1757      Govt_job     : 657   Urban:2596     1st Qu.: 77.25   
##  TRUE :3353      Never_worked :  22                  Median : 91.89   
##                  Private      :2925                  Mean   :106.15   
##                  Self-employed: 819                  3rd Qu.:114.09   
##                                                      Max.   :271.74   
##                                                                       
##       bmi                smoking_status stroke  
##  Min.   :10.30   formerly smoked: 885   0:4861  
##  1st Qu.:23.50   never smoked   :1892   1: 249  
##  Median :28.10   smokes         : 789           
##  Mean   :28.89   Unknown        :1544           
##  3rd Qu.:33.10                                  
##  Max.   :97.60                                  
##  NA's   :201

head(dataset)

##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67        FALSE          TRUE         TRUE       Private
## 2 51676 Female  61        FALSE         FALSE         TRUE Self-employed
## 3 31112   Male  80        FALSE          TRUE         TRUE       Private
## 4 60182 Female  49        FALSE         FALSE         TRUE       Private
## 5  1665 Female  79         TRUE         FALSE         TRUE Self-employed
## 6 56669   Male  81        FALSE         FALSE         TRUE       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21   NA    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12 24.0    never smoked      1
## 6          Urban            186.21 29.0 formerly smoked      1

nrow(dataset)

## [1] 5110

Now it’s time to check for missing values.

(sum(is.na(dataset$bmi)) / nrow(dataset)) * 100

## [1] 3.933464

# About 4% of the bmi data is missing.
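To see missing values across all columns at once, `colSums` over `is.na` works well. A minimal sketch on a toy data frame; on the real data you would pass `dataset` instead of `toy`:

```r
# Count and percentage of NAs per column, demonstrated on a toy frame
toy <- data.frame(bmi = c(36.6, NA, 32.5, NA), age = c(67, 61, 80, 49))
colSums(is.na(toy))                   # bmi: 2, age: 0
round(colMeans(is.na(toy)) * 100, 1)  # bmi: 50, age: 0
```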

Impute missing values with MICE
#

This is where the mice package becomes useful. Create a mice object to define the imputation methods.

set.seed(7)

mice_1 <- mice(dataset, maxit=0) # Set maxit to zero first because we don't want to predict yet. We are only creating a mice object.

predM <- mice_1$predictorMatrix

id shouldn’t be used to predict anything, so make sure to leave it out.

predM[, c("id")] <- 0 
meth <- mice_1$method

Check the methods. Only bmi has NA values and needs to be imputed; it gets "pmm" (predictive mean matching).

meth

##                id            gender               age      hypertension 
##                ""                ""                ""                "" 
##     heart_disease      ever_married         work_type    Residence_type 
##                ""                ""                ""                "" 
## avg_glucose_level               bmi    smoking_status            stroke 
##                ""             "pmm"                ""                ""

Run mice to create 5 imputed datasets, each with different plausible values for the missing entries.

mice_results <- mice(dataset, maxit = 1, 
             predictorMatrix = predM, 
             method = meth, print =  FALSE) 

Check the imputed values.

head(mice_results$imp$bmi) 

##       1    2    3    4    5
## 2  21.4 26.7 28.7 21.9 33.3
## 9  27.3 19.4 24.2 28.6 32.2
## 14 37.9 40.9 35.5 27.0 23.2
## 20 26.7 27.8 22.1 20.5 27.0
## 28 31.5 23.2 31.2 31.9 39.1
## 30 45.2 32.8 37.3 29.1 28.6

Extract the first set and use it to complete the data.

dataset <- mice::complete(mice_results, 1)
head(dataset)

##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67        FALSE          TRUE         TRUE       Private
## 2 51676 Female  61        FALSE         FALSE         TRUE Self-employed
## 3 31112   Male  80        FALSE          TRUE         TRUE       Private
## 4 60182 Female  49        FALSE         FALSE         TRUE       Private
## 5  1665 Female  79         TRUE         FALSE         TRUE Self-employed
## 6 56669   Male  81        FALSE         FALSE         TRUE       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21 21.4    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12 24.0    never smoked      1
## 6          Urban            186.21 29.0 formerly smoked      1
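It’s worth sanity-checking that no NAs remain after completion. On the real data the check is simply `sum(is.na(dataset$bmi))`; here is a self-contained sketch of the same round trip on toy data:

```r
library(mice)

# Toy frame with one missing value, imputed the same way as above
toy <- data.frame(x = c(1, 2, NA, 4, 5, 6), y = c(2, 4, 5, 8, 10, 12))
imp <- mice(toy, m = 2, maxit = 1, print = FALSE)
completed <- mice::complete(imp, 1)
sum(is.na(completed))  # 0 -- every gap is filled
```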

Build prediction models
#

From here on, we use the caret package. Start by setting up 10-fold cross-validation.

control = trainControl(method="cv", number = 10)

Let’s train using 6 different machine learning models.

  1. Boosted logistic regression
  2. Naive Bayesian
  3. K-nearest Neighbors
  4. Linear Support Vector Machine (SVM)
  5. Random forest (method="ranger")
  6. CART

Each of these can be plugged into the method= argument of the train() function. Some (e.g. ranger) prompt additional package installs; simply say yes to all of them.

# boosted logistic regression
set.seed(7)
fit.logreg <- train(stroke ~ 
                      gender + age + hypertension + ever_married + 
                      work_type + Residence_type + avg_glucose_level + bmi + 
                      smoking_status, data=dataset, method="LogitBoost", metric="Accuracy", trControl=control)

# naive Bayes
set.seed(7)
fit.naivebayes <- train(stroke ~ 
                      gender + age + hypertension + ever_married + 
                      work_type + Residence_type + avg_glucose_level + bmi + 
                      smoking_status, data=dataset, method="naive_bayes", metric="Accuracy", trControl=control)

# K-nearest neighbors 
set.seed(7)
fit.knn <- train(stroke ~ 
                   gender + age + hypertension + ever_married + 
                   work_type + Residence_type + avg_glucose_level + bmi + 
                   smoking_status, data=dataset, method="knn", metric="Accuracy", trControl=control)

# Linear support vector machine
set.seed(7)
fit.svm <- train(stroke ~ 
                   gender + age + hypertension + ever_married + 
                   work_type + Residence_type + avg_glucose_level + bmi + 
                   smoking_status, data=dataset, method="svmLinear", metric="Accuracy", trControl=control)

# random forest
set.seed(7)
fit.rf <- train(stroke ~ 
                  gender + age + hypertension + ever_married + 
                  work_type + Residence_type + avg_glucose_level + bmi + 
                  smoking_status, data=dataset, method="ranger", metric="Accuracy", trControl=control)

# CART
set.seed(7)
fit.cart <- train(stroke ~ 
                    gender + age + hypertension + ever_married + 
                    work_type + Residence_type + avg_glucose_level + bmi + 
                    smoking_status, data=dataset, method="rpart", metric="Accuracy", trControl=control)

From all the models, check which is the most accurate.

results <- resamples(list(logreg=fit.logreg, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf, naivebayes=fit.naivebayes))

summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: logreg, cart, knn, svm, rf, naivebayes 
## Number of resamples: 10 
## 
## Accuracy 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## logreg     0.9432485 0.9476517 0.9510284 0.9491193 0.9510763 0.9511719    0
## cart       0.9452055 0.9510763 0.9510763 0.9504896 0.9510763 0.9529412    0
## knn        0.9412916 0.9471624 0.9510763 0.9489240 0.9511480 0.9530333    0
## svm        0.9510763 0.9510763 0.9510763 0.9512724 0.9510763 0.9529412    0
## rf         0.9510763 0.9510763 0.9510763 0.9512724 0.9510763 0.9529412    0
## naivebayes 0.9510763 0.9510763 0.9510763 0.9512724 0.9510763 0.9529412    0
## 
## Kappa 
##                   Min.      1st Qu.       Median        Mean    3rd Qu.
## logreg     -0.01368083 -0.006420385 -0.001888788 0.002818109 0.00000000
## cart       -0.01059472  0.000000000  0.000000000 0.024640625 0.05048551
## knn        -0.01657825 -0.007300869  0.000000000 0.009876670 0.00000000
## svm         0.00000000  0.000000000  0.000000000 0.000000000 0.00000000
## rf          0.00000000  0.000000000  0.000000000 0.000000000 0.00000000
## naivebayes  0.00000000  0.000000000  0.000000000 0.000000000 0.00000000
##                  Max. NA's
## logreg     0.06731401    0
## cart       0.12613722    0
## knn        0.07343608    0
## svm        0.00000000    0
## rf         0.00000000    0
## naivebayes 0.00000000    0
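One caveat worth noting: the near-zero Kappa values suggest most models are largely just predicting the majority class. With 4861 non-stroke and 249 stroke cases (from the summary earlier), the no-information baseline is already about 95%:

```r
# Accuracy of always predicting "no stroke", given the class counts above
n_no_stroke <- 4861
n_stroke    <- 249
baseline <- n_no_stroke / (n_no_stroke + n_stroke)
round(baseline, 4)  # 0.9513 -- essentially the CV accuracies above
```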

Let’s test with a sample patient who is 81 years old and smokes.

patient = tribble(~gender, ~age, ~hypertension, ~heart_disease, 
                  ~ever_married, ~work_type, ~Residence_type, 
                  ~avg_glucose_level, ~bmi, ~smoking_status, 
                  "Male", 81, TRUE, TRUE, TRUE, "Private", "Urban", 300, 28.6, "smokes")

predict(fit.rf, patient)

## [1] 0
## Levels: 0 1

The patient is unlikely to have a stroke! What a surprise. Let’s try a different model:

predict(fit.logreg, patient)

## [1] 1
## Levels: 0 1

The logistic regression model says he might have a stroke.

Putting it all together as a web app
#

Models are trained and everything’s working well. But how do we make this accessible to people who aren’t familiar with R? We can’t expect them to run predict() every single time.

Enter shiny, a package that lets you turn R files into web apps.

First, let’s save our models into RDS files (this saves us from retraining the models repeatedly).

saveRDS(fit.logreg, "logreg.rds")
# Do this for every model
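The repeated saveRDS calls can be written as a loop over a named list. A self-contained sketch with stand-in lm models; in the real app, the list would hold fit.logreg, fit.cart, and so on:

```r
# Save/load round trip for several models at once
# (lm fits on mtcars stand in for the caret fits here)
fits <- list(logreg = lm(mpg ~ wt, data = mtcars),
             cart   = lm(mpg ~ hp, data = mtcars))
out_dir <- tempdir()
for (name in names(fits)) {
  saveRDS(fits[[name]], file.path(out_dir, paste0(name, ".rds")))
}

reloaded <- readRDS(file.path(out_dir, "logreg.rds"))
identical(coef(reloaded), coef(fits$logreg))  # TRUE
```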

Then let’s create an App.R for shiny. This file will later be deployed on shinyapps.io.

After saving each model into an RDS file, time to load them.

cart_model <- readRDS("./data/cart.rds")
knn_model <- readRDS("./data/knn.rds")
logreg_model <- readRDS("./data/logreg.rds")
naivebayes_model <- readRDS("./data/naivebayes.rds")
rf_model <- readRDS("./data/rf.rds")
svm_model <- readRDS("./data/svm.rds")

The libraries need to be loaded; make sure the model-specific ones are included as well.

library(shiny)
library(dplyr)
library(bslib)
library(caret)
library(kernlab)
library(ranger)
library(caTools)
library(naivebayes)

Define the UI. Let’s use a simple two-column layout, with inputs on the left and the output on the right in the form of accordions you can click to reveal hidden text.

ui <- page_fillable(
"Stroke Prediction",
layout_columns(
    card(helpText("Select below:"),
        selectInput(
        "gender", label = "Gender:",
        choices = list("Male","Female","Other"),multiple = FALSE),
        numericInput(
        "age",label="Age:",value=20),
        checkboxInput("hypertension", "I have hypertension", value = FALSE),
        checkboxInput("heart_disease", "I have heart disease", value = FALSE),
        checkboxInput("ever_married", "I'm married/have been married", value = FALSE),
        selectInput(
        "work_type", label = "Type of employment: (Select children if you are below 18)",
        choices = list("children","Govt_job","Never_worked","Private","Self-employed"), multiple=FALSE),
        selectInput(
        "Residence_type", label = "Type of residence:",
        choices = list("Rural","Urban"),multiple=FALSE),
        numericInput(
        "avg_glucose_level", label= "Average glucose level:", value=106.1477),
        # Use mean as default.
        numericInput(
        "bmi", label="BMI:", value=28.92663),
        # Use mean as default
        selectInput(
        "smoking_status", label="Smoking status:",
        choices = list("formerly smoked","never smoked","smokes","Unknown"))),
    card(
    accordion(  
        accordion_panel( 
        title = "Logistic Regression", 
        textOutput("logreg")
        ),  
        accordion_panel(
        title = "CART",
        textOutput("cart")
        ),  
        accordion_panel(
        title = "K-nearest Neighbors",
        textOutput("knn")
        ),  
        accordion_panel(
        title = "Naive Bayesian",
        textOutput("naivebayes") 
        ),
        accordion_panel(
        title = "Random forest",
        textOutput("rf") 
        ),
        accordion_panel(
        title = "Support vector machine (linear)",
        textOutput("svm") 
        )
    )
    ),
    col_widths = c(3,9) 
)   
)

It should look something like this:

Functions that should rerun every time someone modifies the input go inside server.

server <- function(input, output) {

patient_data <- reactive({
    data.frame(
    gender = input$gender,
    age = input$age,
    hypertension = input$hypertension,
    heart_disease = input$heart_disease,
    ever_married = input$ever_married, 
    work_type = input$work_type,
    Residence_type = input$Residence_type,
    avg_glucose_level = input$avg_glucose_level,
    bmi = input$bmi,
    smoking_status = input$smoking_status,
    stringsAsFactors = TRUE 
    )
})

generate_prediction <- function(model, model_name) {
    renderText({
    prediction <- predict(model, patient_data())
    
    is_high_risk <- if(is.factor(prediction)) prediction == "1" else prediction == 1
    
    if (is_high_risk) {
        paste("According to the", model_name, "model, you have a high chance for stroke.")
    } else {
        paste("According to the", model_name, "model, you have a low chance for stroke.")
    }
    })
}

# Assign the outputs
output$logreg     <- generate_prediction(logreg_model, "Logistic Regression")
output$cart       <- generate_prediction(cart_model, "CART")
output$knn        <- generate_prediction(knn_model, "K-nearest Neighbors")
output$naivebayes <- generate_prediction(naivebayes_model, "Naive Bayesian")
output$rf         <- generate_prediction(rf_model, "Random Forest")
output$svm        <- generate_prediction(svm_model, "SVM")
}

Finally, let’s run the application:

shinyApp(ui = ui, server = server)

All done! Deploy it to shinyapps.io with the rsconnect package. You can see the final product here.
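For reference, deployment with rsconnect looks roughly like this. The account name, token, secret, and app name are placeholders; the real values come from your shinyapps.io dashboard:

```r
library(rsconnect)

# One-time account setup; fill in values from the shinyapps.io dashboard
rsconnect::setAccountInfo(name   = "<account>",
                          token  = "<token>",
                          secret = "<secret>")

# Deploy the directory containing app.R
rsconnect::deployApp(appDir = ".", appName = "stroke-prediction")
```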