Exercise 5B - Solutions: K-means and Random Forest

Part 1: K-Means Clustering

In this part you will run K-means clustering. You will work with a dataset from patients with kidney disease. The dataset contains approximately 20 biological measures (variables) collected across 400 patients. The outcome is the classification variable, which denotes whether a person suffers from chronic kidney disease (ckd) or not (notckd).

Below is a description of the variables in the dataset:

        age     -   age
        bp      -   blood pressure
        rbc     -   red blood cells
        pc      -   pus cell
        pcc     -   pus cell clumps
        ba      -   bacteria
        bgr     -   blood glucose random
        bu      -   blood urea
        sc      -   serum creatinine
        sod     -   sodium
        pot     -   potassium
        hemo    -   hemoglobin
        pcv     -   packed cell volume
        wc      -   white blood cell count
        rc      -   red blood cell count
        htn     -   hypertension
        dm      -   diabetes mellitus
        cad     -   coronary artery disease
        appet   -   appetite
        pe      -   pedal edema
        ane     -   anemia
        class   -   classification  
  1. Load the R packages needed for analysis. If you are in doubt about what packages you need, look at Presentation 5B. You will need to do some data-wrangling and will also need a package for K-means clustering and visualization.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra)
Welcome to factoextra!
Want to learn more? See two factoextra-related books at https://www.datanovia.com/en/product/practical-guide-to-principal-component-methods-in-r/
library(ggplot2)
  1. Load in the dataset named kidney_disease.Rdata.
#in case
load("../data/kidney_disease.Rdata")
  1. Before running K-means clustering, please remove rows with any missing values across all variables in your dataset - yes, you will lose quite a lot of rows. Consider which columns you can use and whether you have to do anything to them before clustering.
# Set seed to ensure reproducibility
set.seed(123)  

# remove any missing values
kidney <- kidney %>%
  drop_na() 

# scale numeric values and only use these
kidney_num <- kidney %>%
  mutate(across(where(is.numeric), scale)) %>%
  dplyr::select(where(is.numeric))
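To see how much data the complete-case filter removes, a quick before/after row count can help (a minimal sketch; it assumes it is run right after loading, before `kidney` is overwritten):

```r
# Count rows before and after dropping incomplete cases
n_before <- nrow(kidney)
n_after  <- kidney %>% drop_na() %>% nrow()
cat("Rows kept:", n_after, "of", n_before, "\n")
```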
  1. Run the k-means clustering algorithm with 4 centers on the data. Look at the clusters you have generated.
# Set seed to ensure reproducibility
set.seed(123)  

# run kmeans
kmeans_res <- kidney_num %>% 
  kmeans(centers = 4, nstart = 25)

kmeans_res
K-means clustering with 4 clusters of sizes 119, 26, 1, 13

Cluster means:
         age         bp        bgr         bu         sc        sod         pot
1 -0.1606270 -0.1812766 -0.3421452 -0.3945903 -0.4097699  0.3696236 -0.09212654
2  0.6371017  0.2644634  1.2990025  0.3971316  0.3647090 -0.9188501 -0.08119564
3  0.1637295  1.4325099  1.4314972  2.4018426  1.1140015 -0.7849388 12.22513695
4  0.1835566  1.0202582  0.4238246  2.6329989  2.9358598 -1.4853976  0.06530832
        hemo        pcv         wc         rc
1  0.4923705  0.4918043 -0.2332733  0.4495784
2 -1.2670571 -1.2727402  0.9780463 -1.1415414
3 -1.9462630 -2.0832161 -1.1005490 -1.9608496
4 -1.8232572 -1.7961726  0.2639126 -1.6814542

Clustering vector:
  [1] 2 4 2 2 2 4 2 2 2 2 4 4 4 2 1 2 2 2 3 4 1 2 2 4 1 2 2 2 2 2 2 4 2 4 4 2 2
 [38] 2 4 2 4 2 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 409.4247 267.2635   0.0000 123.4990
 (between_SS / total_SS =  54.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
kmeans_res$centers
         age         bp        bgr         bu         sc        sod         pot
1 -0.1606270 -0.1812766 -0.3421452 -0.3945903 -0.4097699  0.3696236 -0.09212654
2  0.6371017  0.2644634  1.2990025  0.3971316  0.3647090 -0.9188501 -0.08119564
3  0.1637295  1.4325099  1.4314972  2.4018426  1.1140015 -0.7849388 12.22513695
4  0.1835566  1.0202582  0.4238246  2.6329989  2.9358598 -1.4853976  0.06530832
        hemo        pcv         wc         rc
1  0.4923705  0.4918043 -0.2332733  0.4495784
2 -1.2670571 -1.2727402  0.9780463 -1.1415414
3 -1.9462630 -2.0832161 -1.1005490 -1.9608496
4 -1.8232572 -1.7961726  0.2639126 -1.6814542
  1. Visualize the results of your clustering. Do 4 clusters seem like a good fit for our data in the first two dimensions (Dim1 and Dim2)? How about if you have a look at Dim3 or Dim4?
fviz_cluster(kmeans_res, data = kidney_num, axes=c(1,2),
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "orchid3"), 
             geom = "point",
             ellipse.type = "norm", 
             ggtheme = theme_bw())
Too few points to calculate an ellipse
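To answer the second half of the question, the same clustering can be projected onto the third and fourth dimensions by changing the `axes` argument:

```r
# Same clusters, now shown in Dim3 vs Dim4
fviz_cluster(kmeans_res, data = kidney_num, axes = c(3, 4),
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "orchid3"),
             geom = "point",
             ellipse.type = "norm",
             ggtheme = theme_bw())
```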

  1. Investigate the best number of clusters for this dataset. Use the silhouette metric.
kidney_num %>%
  fviz_nbclust(kmeans, method = "silhouette")
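The silhouette suggestion can be cross-checked with the elbow (within sum of squares) method, which fviz_nbclust also supports:

```r
# Elbow method as a sanity check of the silhouette result
kidney_num %>%
  fviz_nbclust(kmeans, method = "wss")
```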

  1. Re-do the clustering (plus visualization) with the optimal number of clusters.
# Set seed to ensure reproducibility
set.seed(123)  

#run kmeans

kmeans_res <- kidney_num %>% 
  kmeans(centers = 2, nstart = 25)

kmeans_res
K-means clustering with 2 clusters of sizes 120, 39

Cluster means:
         age         bp       bgr         bu        sc        sod        pot
1 -0.1826253 -0.1901587 -0.344100 -0.3887455 -0.409560  0.3622341 -0.0907221
2  0.5619239  0.5851036  1.058769  1.1961401  1.260185 -1.1145665  0.2791449
       hemo        pcv         wc         rc
1  0.477263  0.4767479 -0.2093952  0.4450345
2 -1.468501 -1.4669168  0.6442931 -1.3693368

Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2
 [38] 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 433.7458 647.1223
 (between_SS / total_SS =  37.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
fviz_cluster(kmeans_res, data = kidney_num,
             palette = c("#00AFBB", "#E7B800"), 
             geom = "point",
             ellipse.type = "norm", 
             ggtheme = theme_bw())

  1. Now, try to figure out what the two clusters might represent. There are different ways to do this, but one easy way would be to simply compare the cluster IDs from the K-means output with one or more of the categorical variables from the dataset. You could use count() or table() for this.
table(kidney$htn, kmeans_res$cluster)
     
        1   2
  no  118   7
  yes   2  32
table(kidney$dm, kmeans_res$cluster)
     
        1   2
  no  119  12
  yes   1  27
table(kidney$classification, kmeans_res$cluster)
        
           1   2
  ckd      4  39
  notckd 116   0
  1. The withinss measure (within-cluster variance/spread) is much larger for one cluster than the other. What biological reason could there be for that?
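One way to approach this question is to normalize each cluster's within sum of squares by its size, which shows how spread out the average point is in each cluster (a sketch using the kmeans_res object fitted above):

```r
# Average within-cluster spread per point
data.frame(cluster   = seq_along(kmeans_res$size),
           size      = kmeans_res$size,
           withinss  = kmeans_res$withinss,
           per_point = kmeans_res$withinss / kmeans_res$size)
```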

Part 2: Random Forest

For our Random Forest model we will use the Obstetrics and Periodontal Therapy dataset, which was collected in a randomized clinical trial. The trial investigated whether treatment of maternal periodontal disease may reduce the risk of preterm birth.

This dataset has a total of 823 observations and 171 variables. As is often the case with clinical data, there are many missing values, and for most categorical variables the number of observations per level is unbalanced.

So that you do not spend too much time on data cleaning and wrangling, we have selected 33 variables (plus a patient ID column, PID) for which we have just enough observations per group level.

Summary Statistics

  1. Load the R packages needed for analysis. As for Part 1 above, have a look at the accompanying presentation for this exercise to get an idea of what packages you might need.

N.B. you may have some clashes of function names between these packages. Because of this, you will need to specify which package you are calling some of the functions from. You do this with the package name followed by two colons and the name of the function, e.g. dplyr::select().

library(tidyverse)
library(caret)
library(randomForest)
  1. Load in the dataset Obt_Perio_ML.Rdata and inspect it.
load(file = "../data/Obt_Perio_ML.Rdata")
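A quick structural inspection of the loaded object (a sketch; glimpse() comes with the tidyverse):

```r
# Dimensions and column types of the loaded data
dim(optML)
glimpse(optML)
```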
  1. Do some basic summary statistics and distributional plots to get a feel for the data. Which types of variables do we have?
# Reshape data to long format for ggplot2
long_data <- optML %>% 
  dplyr::select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "value")

# Plot histograms for each numeric variable in one grid
ggplot(long_data, aes(x = value)) +
  geom_histogram(binwidth = 0.5, fill = "#9395D3", color ='grey30') +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal()

  1. Some of the numeric variables are actually categorical. We have identified them in the facCols vector. Here, we change their type from numeric to character (since the other categorical variables are of this type). This code is slightly different from changing the type to factor, which is why we have written the code for you. Try to understand what is going on.
facCols <- c("Race", 
             "ETXU_CAT5", 
             "BL.Anti.inf", 
             "BL.Antibio", 
             "V3.Anti.inf", 
             "V3.Antibio", 
             "V3.Bac.vag", 
             "V5.Anti.inf",
             "V5.Antibio",
             "V5.Bac.vag",
             "X..Vis.Att")


optML <- optML %>%
  mutate(across(all_of(facCols), as.character))

head(optML)
# A tibble: 6 × 89
  PID   Apgar1 Apgar5 Birthweight GA.at.outcome Any.SAE. Clinic Group   Age
  <chr>  <dbl>  <dbl>       <dbl>         <dbl> <chr>    <chr>  <chr> <dbl>
1 P10        9      9        3107           278 No       NY     C        30
2 P170       9      9        3040           286 No       MN     C        20
3 P280       8      9        3370           282 No       MN     T        29
4 P348       9      9        3180           275 No       KY     C        18
5 P402       8      9        2615           267 No       KY     C        18
6 P209       8      9        3330           284 No       MN     C        18
# ℹ 80 more variables: Race <chr>, Education <chr>, Public.Asstce <chr>,
#   BMI <dbl>, Use.Tob <chr>, N.prev.preg <dbl>, Live.PTB <chr>,
#   Any.stillbirth <chr>, Any.live.ptb.sb.sp.ab.in.ab <chr>,
#   EDC.necessary. <chr>, N.qualifying.teeth <dbl>, BL.GE <dbl>, BL..BOP <dbl>,
#   BL.PD.avg <dbl>, BL..PD.4 <dbl>, BL..PD.5 <dbl>, BL.CAL.avg <dbl>,
#   BL..CAL.2 <dbl>, BL..CAL.3 <dbl>, BL.Calc.I <dbl>, BL.Pl.I <dbl>,
#   V3.GE <dbl>, V3..BOP <dbl>, V3.PD.avg <dbl>, V3..PD.4 <dbl>, …
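To see why as.character behaves differently from a conversion to factor, a minimal illustration:

```r
# A numeric 0/1 column converted two ways
x <- c(0, 1, 1)
as.character(x)  # "0" "1" "1" - plain text labels
as.factor(x)     # factor with levels "0" and "1", stored as integer codes
```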
  1. Make count tables of your categorical/factor variables. Are they balanced?
# Count observations per level/group for each categorical variable

factor_counts <- optML[,-1] %>%
  dplyr::select(where(is.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "level") %>%
  arrange(variable) %>%
  dplyr::count(variable, level, sort = FALSE)

factor_counts
# A tibble: 66 × 3
   variable                    level     n
   <chr>                       <chr> <int>
 1 Any.SAE.                    No      865
 2 Any.SAE.                    Yes     135
 3 Any.live.ptb.sb.sp.ab.in.ab No      432
 4 Any.live.ptb.sb.sp.ab.in.ab Yes     568
 5 Any.stillbirth              No      894
 6 Any.stillbirth              Yes     106
 7 BL.Anti.inf                 0       853
 8 BL.Anti.inf                 1       147
 9 BL.Antibio                  0       924
10 BL.Antibio                  1        76
# ℹ 56 more rows

Random Forest Model

Now, let’s try a Random Forest classifier. As we cannot model multiple outcome variables at once, we will select Preg.ended...37.wk as our outcome of interest. This variable denotes whether a woman gave birth prematurely (1 = yes, 0 = no).

  1. Remove the five other possible outcome measures from the dataset, i.e.: GA.at.outcome, Birthweight, Apgar1, Apgar5 and Any.SAE.. You will need to take out one more variable from the dataset before modelling; figure out which one and remove it.
optML <- optML %>% 
  dplyr::select(-c(Apgar1, Apgar5, GA.at.outcome, Birthweight, Any.SAE., PID))
  1. The R package we are using to make an RF classifier requires the outcome variable to be categorical. As Preg.ended...37.wk is encoded as numeric (0 and 1), you will need to change these to categories!
optML <- optML %>%
  mutate(Preg.ended...37.wk = factor(Preg.ended...37.wk, levels = c(0, 1), labels = c("No", "Yes")))
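A quick check that the recoding worked as intended:

```r
# The outcome should now be a factor with levels "No" and "Yes"
levels(optML$Preg.ended...37.wk)
table(optML$Preg.ended...37.wk)
```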
  1. Split your dataset into a train and a test set; you should have 70% of the data in the training set and 30% in the test set. How you choose to split is up to you, BUT afterwards you should ensure that for the categorical/factor variables all levels are represented in both sets.
set.seed(123)

idx <- createDataPartition(optML$Preg.ended...37.wk, p = 0.7, list = FALSE)
train <- optML[idx, ]
test  <- optML[-idx, ]
train %>%
  dplyr::select(where(is.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "level") %>%
  arrange(variable) %>%
  dplyr::count(variable, level, sort = FALSE)
# A tibble: 64 × 3
   variable                    level     n
   <chr>                       <chr> <int>
 1 Any.live.ptb.sb.sp.ab.in.ab No      304
 2 Any.live.ptb.sb.sp.ab.in.ab Yes     397
 3 Any.stillbirth              No      620
 4 Any.stillbirth              Yes      81
 5 BL.Anti.inf                 0       601
 6 BL.Anti.inf                 1       100
 7 BL.Antibio                  0       643
 8 BL.Antibio                  1        58
 9 Bact.vag                    No      606
10 Bact.vag                    Yes      95
# ℹ 54 more rows
test %>%
  dplyr::select(where(is.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "level") %>%
  arrange(variable) %>%
  dplyr::count(variable, level, sort = FALSE)
# A tibble: 64 × 3
   variable                    level     n
   <chr>                       <chr> <int>
 1 Any.live.ptb.sb.sp.ab.in.ab No      128
 2 Any.live.ptb.sb.sp.ab.in.ab Yes     171
 3 Any.stillbirth              No      274
 4 Any.stillbirth              Yes      25
 5 BL.Anti.inf                 0       252
 6 BL.Anti.inf                 1        47
 7 BL.Antibio                  0       281
 8 BL.Antibio                  1        18
 9 Bact.vag                    No      271
10 Bact.vag                    Yes      28
# ℹ 54 more rows
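Instead of eyeballing the two count tables, the level check can be done programmatically (a sketch over the character columns of optML):

```r
# TRUE if every categorical level occurs in both train and test
chr_cols <- names(optML)[sapply(optML, is.character)]
all(sapply(chr_cols, function(v)
  setequal(unique(train[[v]]), unique(test[[v]]))))
```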
  1. Set up a Random Forest model with cross-validation. See pseudo-code below. Remember to set a seed.
set.seed(123)

# Set up cross-validation: 5-fold CV
RFcv <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)

# Train Random Forest
set.seed(123)
rf_model <- caret::train(
  Preg.ended...37.wk ~ .,  # formula interface
  data = train,
  method = "rf",           # random forest
  trControl = RFcv,
  metric = "ROC",          # optimize AUC
  tuneLength = 5           # try 5 different mtry values
)


# Model summary
print(rf_model)
Random Forest 

701 samples
 82 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 561, 561, 560, 562, 560 
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec     
   2    0.9647253  1.0000000  0.9317299
  25    0.9717093  0.9978495  0.9359852
  49    0.9717448  0.9914436  0.9444958
  72    0.9732898  0.9849920  0.9444958
  96    0.9721122  0.9700297  0.9488437

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 72.
  1. Plot your model fit. How does your model improve when you add 10, 20, 30, etc. predictors?
# Best parameters
rf_model$bestTune
  mtry
4   72
# Plot performance
plot(rf_model)

  1. Use your test set to evaluate your model performance. How does the random forest compare to the elastic net regression?
# Predict class probabilities
y_pred <- predict(rf_model, newdata = test, type = "prob")

y_pred <- as.factor(ifelse(y_pred$Yes > 0.5, "Yes", "No"))

caret::confusionMatrix(y_pred, test$Preg.ended...37.wk)
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  199   3
       Yes   1  96
                                          
               Accuracy : 0.9866          
                 95% CI : (0.9661, 0.9963)
    No Information Rate : 0.6689          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9696          
                                          
 Mcnemar's Test P-Value : 0.6171          
                                          
            Sensitivity : 0.9950          
            Specificity : 0.9697          
         Pos Pred Value : 0.9851          
         Neg Pred Value : 0.9897          
             Prevalence : 0.6689          
         Detection Rate : 0.6656          
   Detection Prevalence : 0.6756          
      Balanced Accuracy : 0.9823          
                                          
       'Positive' Class : No              
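The confusion matrix uses a 0.5 cut-off; a threshold-free test-set AUC can also be computed, assuming the pROC package is installed:

```r
# Test-set ROC curve and AUC from the predicted class probabilities
library(pROC)
prob_yes <- predict(rf_model, newdata = test, type = "prob")$Yes
roc_obj  <- roc(test$Preg.ended...37.wk, prob_yes, levels = c("No", "Yes"))
auc(roc_obj)
```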
                                          
  1. Extract the predictor variables with the greatest importance from your fit.
varImpOut <- varImp(rf_model)

varImpOut$importance
                                    Overall
ClinicMN                         0.01804294
ClinicMS                         0.52153545
ClinicNY                         0.82039418
GroupT                           0.74870413
Age                             30.00331934
Race2                            0.32925998
Race3                            2.50423631
Race4                            0.66503953
Race5                            2.02115492
EducationLT 8 yrs                0.42252921
EducationMT 12 yrs               2.15366958
Public.AsstceYes                 1.08279810
BMI                             19.72889567
Use.TobYes                       2.53370014
N.prev.preg                     15.41112225
Live.PTBYes                      3.65578373
Any.stillbirthYes                1.42092446
Any.live.ptb.sb.sp.ab.in.abYes   2.46535539
EDC.necessary.Yes                1.63419920
N.qualifying.teeth               7.63472505
BL.GE                           29.99229874
BL..BOP                         24.96374646
BL.PD.avg                       12.48239462
BL..PD.4                        21.15470971
BL..PD.5                        10.11665732
BL.CAL.avg                      10.01080983
BL..CAL.2                       13.13299658
BL..CAL.3                       11.06044312
BL.Calc.I                       25.76543226
BL.Pl.I                         15.80928083
V3.GE                           13.24739001
V3..BOP                         14.43030715
V3.PD.avg                       10.77002882
V3..PD.4                        13.94132133
V3..PD.5                        10.18609049
V3.CAL.avg                       9.58816248
V3..CAL.2                       12.29150544
V3..CAL.3                        9.53175611
V3.Calc.I                       17.13619012
V3.Pl.I                         12.21630158
V5.GE                           19.21166534
V5..BOP                         20.91370713
V5.PD.avg                       16.25564466
V5..PD.4                        31.50998650
V5..PD.5                        13.32420686
V5.CAL.avg                      15.25301243
V5..CAL.2                       11.31672985
V5..CAL.3                       23.95342489
V5.Calc.I                       13.82039233
V5.Pl.I                         23.10719150
N.PAL.sites1                     0.80519543
N.PAL.sites3-33                  0.48853267
Bact.vagYes                      0.80228500
Gest.diabYes                    23.95864119
UTIYes                           4.80936693
Pre.eclampYes                   80.16903727
BL.Anti.inf1                     2.36829721
BL.Antibio1                      1.58554731
V3.Anti.inf1                    32.45795477
V3.Antibio1                      3.17081591
V3.Bac.vag1                      3.19642494
V5.Anti.inf1                     0.75135748
V5.Antibio1                      2.66044954
V5.Bac.vag1                      2.38427066
X..Vis.Att1                      9.68593806
X..Vis.Att2                      1.64170697
X..Vis.Att3                      2.10550103
X..Vis.Att4                      3.83135432
X..Vis.Att5                    100.00000000
X1st.Miss.Vis                   22.12128966
OAA1                            16.00591323
OCR1                            26.35622873
OFN1                            16.86907504
OPG1                            34.91413510
OPI1                            33.88105531
OTD1                            34.17513325
OTF1                            18.84530025
OCRP1                           40.11798855
OPGE21                          37.92758656
OMMP91                          43.15886495
OFIBRIN1                        34.73916659
OAA5                            39.68053119
OCR5                            76.67181698
OFN5                            52.77596051
OPG5                            41.95370038
OPI5                            77.96525817
OTD5                            16.21791451
OTF5                            46.71822447
OCRP5                           28.23787971
OPGE25                          28.35777371
OMMP95                          47.46932166
ETXU_CAT52                       2.10873809
ETXU_CAT53                       1.80774399
ETXU_CAT54                       0.00000000
ETXU_CAT55                       0.46491255
OFIBRIN5                        35.69157845
varImportance <- as.data.frame(as.matrix(varImpOut$importance)) %>% 
  rownames_to_column(var = 'VarName') %>%
  arrange(desc(Overall))

varImportance
                          VarName      Overall
1                     X..Vis.Att5 100.00000000
2                   Pre.eclampYes  80.16903727
3                            OPI5  77.96525817
4                            OCR5  76.67181698
5                            OFN5  52.77596051
6                          OMMP95  47.46932166
7                            OTF5  46.71822447
8                          OMMP91  43.15886495
9                            OPG5  41.95370038
10                          OCRP1  40.11798855
11                           OAA5  39.68053119
12                         OPGE21  37.92758656
13                       OFIBRIN5  35.69157845
14                           OPG1  34.91413510
15                       OFIBRIN1  34.73916659
16                           OTD1  34.17513325
17                           OPI1  33.88105531
18                   V3.Anti.inf1  32.45795477
19                       V5..PD.4  31.50998650
20                            Age  30.00331934
21                          BL.GE  29.99229874
22                         OPGE25  28.35777371
23                          OCRP5  28.23787971
24                           OCR1  26.35622873
25                      BL.Calc.I  25.76543226
26                        BL..BOP  24.96374646
27                   Gest.diabYes  23.95864119
28                      V5..CAL.3  23.95342489
29                        V5.Pl.I  23.10719150
30                  X1st.Miss.Vis  22.12128966
31                       BL..PD.4  21.15470971
32                        V5..BOP  20.91370713
33                            BMI  19.72889567
34                          V5.GE  19.21166534
35                           OTF1  18.84530025
36                      V3.Calc.I  17.13619012
37                           OFN1  16.86907504
38                      V5.PD.avg  16.25564466
39                           OTD5  16.21791451
40                           OAA1  16.00591323
41                        BL.Pl.I  15.80928083
42                    N.prev.preg  15.41112225
43                     V5.CAL.avg  15.25301243
44                        V3..BOP  14.43030715
45                       V3..PD.4  13.94132133
46                      V5.Calc.I  13.82039233
47                       V5..PD.5  13.32420686
48                          V3.GE  13.24739001
49                      BL..CAL.2  13.13299658
50                      BL.PD.avg  12.48239462
51                      V3..CAL.2  12.29150544
52                        V3.Pl.I  12.21630158
53                      V5..CAL.2  11.31672985
54                      BL..CAL.3  11.06044312
55                      V3.PD.avg  10.77002882
56                       V3..PD.5  10.18609049
57                       BL..PD.5  10.11665732
58                     BL.CAL.avg  10.01080983
59                    X..Vis.Att1   9.68593806
60                     V3.CAL.avg   9.58816248
61                      V3..CAL.3   9.53175611
62             N.qualifying.teeth   7.63472505
63                         UTIYes   4.80936693
64                    X..Vis.Att4   3.83135432
65                    Live.PTBYes   3.65578373
66                    V3.Bac.vag1   3.19642494
67                    V3.Antibio1   3.17081591
68                    V5.Antibio1   2.66044954
69                     Use.TobYes   2.53370014
70                          Race3   2.50423631
71 Any.live.ptb.sb.sp.ab.in.abYes   2.46535539
72                    V5.Bac.vag1   2.38427066
73                   BL.Anti.inf1   2.36829721
74             EducationMT 12 yrs   2.15366958
75                     ETXU_CAT52   2.10873809
76                    X..Vis.Att3   2.10550103
77                          Race5   2.02115492
78                     ETXU_CAT53   1.80774399
79                    X..Vis.Att2   1.64170697
80              EDC.necessary.Yes   1.63419920
81                    BL.Antibio1   1.58554731
82              Any.stillbirthYes   1.42092446
83               Public.AsstceYes   1.08279810
84                       ClinicNY   0.82039418
85                   N.PAL.sites1   0.80519543
86                    Bact.vagYes   0.80228500
87                   V5.Anti.inf1   0.75135748
88                         GroupT   0.74870413
89                          Race4   0.66503953
90                       ClinicMS   0.52153545
91                N.PAL.sites3-33   0.48853267
92                     ETXU_CAT55   0.46491255
93              EducationLT 8 yrs   0.42252921
94                          Race2   0.32925998
95                       ClinicMN   0.01804294
96                     ETXU_CAT54   0.00000000
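caret also provides a plot method for varImp objects, which can be more readable than the full table:

```r
# Plot the 20 most important predictors
plot(varImpOut, top = 20)
```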
  1. Make a logistic regression using the same dataset (you already have your train and test data). How do the results of the Elastic Net regression and the Random Forest compare to the output of your glm?
# Model
model1 <- glm(Preg.ended...37.wk ~ ., data = train, family = 'binomial')
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Filter for significant p-values and convert to tibble
model1out <- coef(summary(model1)) %>% 
  as.data.frame() %>%
  rownames_to_column(var = 'VarName') %>% 
  filter(`Pr(>|z|)` <= 0.05 & VarName != "(Intercept)")

model1out
                          VarName      Estimate   Std. Error   z value
1                        ClinicMN   3.542937389 1.202938e+00  2.945237
2                        ClinicMS   4.722466336 1.539047e+00  3.068436
3                             Age   0.099066121 3.963720e-02  2.499322
4                           Race2   1.556170358 7.570134e-01  2.055671
5                           Race3  -1.831413796 8.199202e-01 -2.233649
6                           Race4   2.037549168 8.451256e-01  2.410942
7                             BMI  -0.051294157 2.395974e-02 -2.140848
8               Any.stillbirthYes  -1.163635266 5.766514e-01 -2.017918
9  Any.live.ptb.sb.sp.ab.in.abYes   1.244548045 4.387567e-01  2.836533
10                          BL.GE  -2.873528165 1.128779e+00 -2.545697
11                     BL.CAL.avg  -3.271711925 1.608976e+00 -2.033413
12                      BL.Calc.I   1.860831814 5.992122e-01  3.105464
13                      V5.PD.avg   6.300599065 2.112232e+00  2.982910
14                       V5..PD.5  -0.224079493 7.518646e-02 -2.980317
15                      V5..CAL.2  -0.082018978 3.439168e-02 -2.384849
16                      V5..CAL.3   0.082747672 4.111225e-02  2.012725
17                        V5.Pl.I  -1.646774971 6.603526e-01 -2.493781
18                N.PAL.sites3-33  -2.641029317 8.778559e-01 -3.008500
19                    Bact.vagYes  -1.150697985 5.670383e-01 -2.029313
20                   Gest.diabYes   3.333025702 6.332455e-01  5.263402
21                  Pre.eclampYes   3.710328604 6.124567e-01  6.058108
22                   BL.Anti.inf1   1.143911202 5.663997e-01  2.019618
23                   V3.Anti.inf1  -2.156092211 5.678189e-01 -3.797148
24                    V3.Bac.vag1   1.777882908 8.502466e-01  2.091020
25                    V5.Bac.vag1  -1.583298315 6.419565e-01 -2.466364
26                    X..Vis.Att2  -3.259322250 9.778397e-01 -3.333187
27                    X..Vis.Att3  -2.314608482 1.064351e+00 -2.174667
28                    X..Vis.Att4  -4.194052276 1.280769e+00 -3.274637
29                    X..Vis.Att5 -50.849219237 1.879170e+01 -2.705941
30                  X1st.Miss.Vis   0.475763272 1.901160e-01  2.502490
31                           OTD1   0.197791299 8.705262e-02  2.272089
32                           OTF1   0.109275890 5.365919e-02  2.036480
33                          OCRP1  -0.040626188 1.889370e-02 -2.150250
34                         OPGE21   0.001765246 5.477228e-04  3.222883
35                         OMMP91  -0.978945284 4.452503e-01 -2.198640
36                       OFIBRIN1   0.029454320 1.362220e-02  2.162229
37                           OCR5  -0.206539912 8.392606e-02 -2.460975
38                           OTD5  -0.312131962 1.162515e-01 -2.684972
39                         OPGE25  -0.002963085 7.410702e-04 -3.998386
40                         OMMP95   1.323888206 4.564577e-01  2.900353
41                     ETXU_CAT53   1.043541524 4.641852e-01  2.248115
42                     ETXU_CAT54  -4.001036721 1.515164e+00 -2.640663
       Pr(>|z|)
1  3.227070e-03
2  2.151826e-03
3  1.244312e-02
4  3.981427e-02
5  2.550619e-02
6  1.591136e-02
7  3.228631e-02
8  4.359980e-02
9  4.560620e-03
10 1.090599e-02
11 4.201082e-02
12 1.899809e-03
13 2.855220e-03
14 2.879503e-03
15 1.708612e-02
16 4.414354e-02
17 1.263904e-02
18 2.625409e-03
19 4.242646e-02
20 1.414136e-07
21 1.377321e-09
22 4.342299e-02
23 1.463705e-04
24 3.652625e-02
25 1.364925e-02
26 8.585731e-04
27 2.965507e-02
28 1.057978e-03
29 6.811116e-03
30 1.233232e-02
31 2.308114e-02
32 4.170215e-02
33 3.153542e-02
34 1.269074e-03
35 2.790350e-02
36 3.060050e-02
37 1.385601e-02
38 7.253590e-03
39 6.377584e-05
40 3.727430e-03
41 2.456887e-02
42 8.274401e-03
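Beyond comparing which variables are picked out, the glm's predictive performance can be evaluated on the same test set (a sketch, thresholding the predicted probabilities at 0.5 as was done for the random forest):

```r
# Test-set predictions from the logistic model
glm_prob <- predict(model1, newdata = test, type = "response")
glm_pred <- factor(ifelse(glm_prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))
caret::confusionMatrix(glm_pred, test$Preg.ended...37.wk)
```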
# Compare output from Elastic Net with output from glm model
intersect(as.character(varImportance$VarName), model1out$VarName) %>%
  sort()
 [1] "Age"                            "Any.live.ptb.sb.sp.ab.in.abYes"
 [3] "Any.stillbirthYes"              "Bact.vagYes"                   
 [5] "BL.Anti.inf1"                   "BL.CAL.avg"                    
 [7] "BL.Calc.I"                      "BL.GE"                         
 [9] "BMI"                            "ClinicMN"                      
[11] "ClinicMS"                       "ETXU_CAT53"                    
[13] "ETXU_CAT54"                     "Gest.diabYes"                  
[15] "N.PAL.sites3-33"                "OCR5"                          
[17] "OCRP1"                          "OFIBRIN1"                      
[19] "OMMP91"                         "OMMP95"                        
[21] "OPGE21"                         "OPGE25"                        
[23] "OTD1"                           "OTD5"                          
[25] "OTF1"                           "Pre.eclampYes"                 
[27] "Race2"                          "Race3"                         
[29] "Race4"                          "V3.Anti.inf1"                  
[31] "V3.Bac.vag1"                    "V5..CAL.2"                     
[33] "V5..CAL.3"                      "V5..PD.5"                      
[35] "V5.Bac.vag1"                    "V5.PD.avg"                     
[37] "V5.Pl.I"                        "X..Vis.Att2"                   
[39] "X..Vis.Att3"                    "X..Vis.Att4"                   
[41] "X..Vis.Att5"                    "X1st.Miss.Vis"