library(tidyverse)
library(caret)
library(glmnet)
library(MASS)
Exercise 5B - Models and Model Evaluation in R
Introduction
In this exercise you will try to implement a Penalized Regression model and a Random Forest in R.
For modelling we will use the Obstetrics and Periodontal Therapy Dataset, which was collected in a randomized clinical trial. The trial investigated whether treatment of maternal periodontal disease may reduce the risk of preterm birth.
This dataset has a total of 823 observations and 171 variables. As is often the case with clinical data, the data has many missing values, and for most categorical variables the number of observations per level is unbalanced.
In order for you to not spend too much time on data cleaning and wrangling, we have selected 33 variables (plus a patient ID column, PID) for which we have just enough observations per group level. As Lasso cannot handle NAs, we have imputed missing values and removed a few outlier samples.
- Load the R packages needed for analysis:
Note that you may have some clashes of function names between these packages. Because of this, you will need to specify which package you are calling some of the functions from. You do this with the package name, followed by two colons, followed by the name of the function, e.g. dplyr::select()
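For instance, MASS also exports a select() function, so with both packages loaded an unprefixed select() may not be the one you expect. A small illustration (using the built-in mtcars data, just as an example):

# MASS::select() masks dplyr::select() when MASS is loaded after dplyr,
# so prefix the one you mean:
dplyr::select(mtcars, mpg, cyl)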
Summary Statistics
- Load in the dataset Obt_Perio_ML.Rdata and inspect it.
Navigate to the data/Obt_Perio_ML.Rdata file in the file panel of your RStudio and click the file. Confirm that you want to load the data into your environment by clicking “Yes”.
For reproducibility, copy the load() function call printed in the console and paste it into the code chunk.
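The pasted call will look something like this (the exact path depends on where your project lives; this assumes the file sits in a data/ folder, as above):

load("data/Obt_Perio_ML.Rdata")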
You can look at what the individual variables contain here. The curated dataset contains six variables which may be regarded as outcomes, these are:
- GA.at.outcome: Gestational age at end of pregnancy (days)
- Birthweight: Weight of baby at birth (grams)
- Apgar1: Apgar score, a summary of a newborn infant’s appearance at birth, range: 0-10
- Apgar5: Apgar score at 5 minutes after birth, range: 0-10
- Preg.ended...37.wk: Pregnancy ended before week 37, categorical (0 = no, 1 = yes)
- Any.SAE.: Whether the participant experienced any serious adverse events (Yes, No)
The remaining 27 variables we will consider as potential explanatory variables for these outcomes.
Do some basic summary statistics. How many categorical variables and how many numeric variables do you have? Try to make distributional plots for a couple of your numeric variables (or all of them, if you like) to get a feel for the data distributions you have.
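A minimal sketch of the type count and one distribution plot (assuming the loaded object is called optML, as in the code further down, and using Birthweight as an example column):

# How many columns of each type?
table(sapply(optML, class))

# Distribution of one numeric variable
ggplot(optML, aes(x = Birthweight)) +
  geom_histogram(bins = 30)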
Some of the numeric variables are actually categorical. We have identified them in the facCols vector. Here, we change their type from numeric to character (since the other categorical variables are of this type). This code is slightly different from changing the type to factor, which is why we have written it for you. Try to understand what is going on.
<- c("Race",
facCols "ETXU_CAT5",
"BL.Anti.inf",
"BL.Antibio",
"V3.Anti.inf",
"V3.Antibio",
"V3.Bac.vag",
"V5.Anti.inf",
"V5.Antibio",
"V5.Bac.vag",
"X..Vis.Att")
<- optML %>%
optML mutate(across(all_of(facCols), as.character))
head(optML)
- Make count tables for all your categorical/factor variables. Are they balanced?
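One way to get all the count tables at once (a sketch; after the conversion above the categorical columns are of type character, and we drop the PID identifier column if present):

optML %>%
  dplyr::select(where(is.character), -any_of("PID")) %>%
  purrr::map(table)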
Part 1: Elastic Net Regression
As described above, we have six variables which could be considered outcomes, as these were all measured at the end of pregnancy. We can only work with one outcome at a time, and we will pick Preg.ended...37.wk for now. This variable is a factor variable which denotes whether a woman gave birth prematurely (1 = yes, 0 = no).
- As you will use the outcome Preg.ended...37.wk, you should remove the other five possible outcome variables from your dataset.
- Elastic net regression can be sensitive to large differences in the range of numeric/integer variables, so these variables should be scaled. Scale all numeric/integer variables in your dataset. Hint: mutate(across(where(…))).
- Split your dataset into train and test sets; you should have 70% of the data in the training set and 30% in the test set. How you choose to split is up to you, BUT afterwards you should ensure that, for the categorical/factor variables, all levels are represented in both sets.
- After dividing into train and test sets, pull out the outcome variable, Preg.ended...37.wk, into its own vector for both datasets; name these y_train and y_test.
- Remove the outcome variable, Preg.ended...37.wk, from the train and test set, as well as PID (if you have not already done so), as we should obviously not use these for training or testing. A sketch of all these steps is shown after this list.
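One possible way through these steps (a sketch, assuming the data frame is called optML and using caret::createDataPartition() for a stratified split; any split that keeps all factor levels in both sets is fine):

# Keep Preg.ended...37.wk and drop the other five outcome variables
optML <- optML %>%
  dplyr::select(-GA.at.outcome, -Birthweight, -Apgar1, -Apgar5, -Any.SAE.)

# Scale all numeric variables (scale() returns a one-column matrix, so unwrap it)
optML <- optML %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))

# Stratified 70/30 split on the outcome
set.seed(123)
idx   <- createDataPartition(optML$Preg.ended...37.wk, p = 0.7, list = FALSE)
train <- optML[idx, ]
test  <- optML[-idx, ]

# Pull out the outcome, then drop it (and PID) from the predictors
y_train <- train$Preg.ended...37.wk
y_test  <- test$Preg.ended...37.wk
train   <- train %>% dplyr::select(-Preg.ended...37.wk, -PID)
test    <- test %>% dplyr::select(-Preg.ended...37.wk, -PID)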
You will employ the package glmnet to perform Elastic Net Regression. The main function from this package is glmnet(), which we will use to fit the model. Additionally, you will perform cross-validation with cv.glmnet() to obtain the best value of the model hyperparameter, lambda (\(λ\)).
As we are working with a mix of categorical and numerical predictors, it is advisable to dummy-code the variables. You can easily do this by creating a model matrix for both the test and train set.
- Create the model matrix needed for input to
glmnet()
andcv.glmnet()
. See pseudo-code below:
modTrain <- model.matrix(~ . - 1, data = train)
modTest  <- model.matrix(~ . - 1, data = test)
- Create your Elastic Net Regression model with glmnet().
As you are performing Elastic Net Regression, you set the mixing parameter alpha = 0.5. This gives you a 50/50 balance of the L1 (LASSO) and L2 (Ridge) penalties.
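A sketch of the call (enetFit is just a suggested name; family = "binomial" because the outcome has two classes):

enetFit <- glmnet(modTrain, y_train, family = "binomial", alpha = 0.5)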
- Use cv.glmnet() to attain the best value of the hyperparameter lambda (\(λ\)). Remember to set a seed for reproducible results.
- Plot all the values of lambda tested during cross-validation by calling plot() on the output of your cv.glmnet().
- Extract the best lambda value from the cv.glmnet() model and save it as an object.
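For example (a sketch; cvFit and bestLambda are suggested names):

set.seed(123)
cvFit <- cv.glmnet(modTrain, y_train, family = "binomial", alpha = 0.5)

plot(cvFit)                     # CV error across the tested lambda values
bestLambda <- cvFit$lambda.min  # lambda with the lowest CV error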
Now, let’s see how well your model performed.
- Predict if an individual is likely to give birth before the 37th week using your model and your test set. See pseudo-code below.
y_pred <- predict(model, newx = modTest, s = bestLambda, type = 'class')
- Just like for the logistic regression model, you can calculate the accuracy of the prediction by comparing it to y_test with confusionMatrix(). Do you have a good accuracy? N.B. look at the 2x2 contingency table: what does it tell you?
- Lastly, let’s extract the variables which were retained in the model (i.e. not penalized out). We do this by calling coef() on our model. See pseudo-code below.
coeffs <- coef(model, s = bestLambda)

# Convert coefficients to a data frame for easier viewing
coeffsDat <- as.data.frame(as.matrix(coeffs))
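For the accuracy check mentioned above, a minimal sketch (predict() on a glmnet model returns a character matrix, so coerce it to a factor with the same levels as y_test first):

y_pred <- factor(as.vector(y_pred), levels = levels(y_test))
confusionMatrix(y_pred, y_test)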
Make a plot that shows the absolute importance of the variables retained in your model. This could be a barplot with variable names on the y-axis and the length of the bars denoting absolute size of coefficient.
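One way to draw it, building on coeffsDat from above (a sketch; the coefficient column is renamed by position, as its name depends on the lambda used):

coeffsDat %>%
  rownames_to_column("variable") %>%
  rename(coefficient = 2) %>%
  filter(variable != "(Intercept)", coefficient != 0) %>%
  ggplot(aes(x = abs(coefficient),
             y = reorder(variable, abs(coefficient)))) +
  geom_col() +
  labs(x = "Absolute coefficient size", y = NULL)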
Now repeat what you just did above, but this time, instead of using Preg.ended...37.wk as outcome, try using a continuous variable such as GA.at.outcome. N.B. remember this means that you should evaluate the model using the RMSE and a scatter plot instead of the accuracy!
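With a continuous outcome the model is fitted with family = "gaussian", and the evaluation could look like this (a sketch, reusing the object names from above):

y_pred <- as.vector(predict(model, newx = modTest, s = bestLambda))
rmse   <- sqrt(mean((y_test - y_pred)^2))
rmse

plot(y_test, y_pred)  # observed vs. predicted gestational age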
Part 2: Random Forest
Now, let’s try a Random Forest classifier. We will continue using the Obt_Perio_ML.Rdata
dataset with Preg.ended...37.wk
as outcome.
Just like in the section on Elastic Net above:
- Load the dataset (if you have not already).
- Remove the possible outcome variables you will not be using.
- Split the dataset into test and train sets; this time keep the outcome variable Preg.ended...37.wk in the dataset.
- Remember to remove the PID column before training!
Set up a Random Forest model with cross-validation. See pseudo-code below. Remember to set a seed.
First the cross-validation parameters:
set.seed(123)

# Set up cross-validation: 5-fold CV
# N.B. classProbs = TRUE requires the outcome factor levels to be valid R
# names (e.g. "No"/"Yes" rather than "0"/"1"), so you may need to recode them.
RFcv <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)
Next we train the model:
# Train Random Forest
set.seed(123)
rf_model <- train(
  Outcome ~ .,            # replace Outcome with your outcome variable
  data = Trainingdata,    # replace Trainingdata with your training set
  method = "rf",
  trControl = RFcv,
  metric = "ROC",
  tuneLength = 5
)

# Model summary
print(rf_model)
- Plot your model fit. How does your model improve when you add 10, 20, 30, etc. predictors?
# Best parameters
rf_model$bestTune

# Plot performance
plot(rf_model)
Use your test set to evaluate your model performance. How does the Random Forest compare to the Elastic Net regression?
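A sketch of the evaluation (assuming your test set still contains the outcome column and rf_model was trained as above):

rf_pred <- predict(rf_model, newdata = test)
confusionMatrix(rf_pred, test$Preg.ended...37.wk)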
Extract the predictive variables with the greatest importance from your fit.
varImpOut <- varImp(rf_model)
varImpOut$importance
- Make a logistic regression using the same dataset (you already have your train data, test data, y_train and y_test). How do the results of Elastic Net regression and Random Forest compare to the output of your glm()?
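A sketch of such a baseline model (reattaching the outcome to the training predictors first; the object names are just examples):

# Rebuild a training frame with the outcome attached
glmTrain <- train %>% mutate(Preg.ended...37.wk = y_train)

glmFit <- glm(Preg.ended...37.wk ~ ., data = glmTrain, family = binomial)
summary(glmFit)

# Predicted probabilities on the test set, thresholded at 0.5
glmProb <- predict(glmFit, newdata = test, type = "response")
glmPred <- factor(ifelse(glmProb > 0.5, levels(y_test)[2], levels(y_test)[1]),
                  levels = levels(y_test))
confusionMatrix(glmPred, y_test)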