Exercise 5C - Elastic Net Regression
Introduction
In this exercise you will perform Elastic Net Regression in R.
Just like in Exercise 5B, where you built a random forest classifier, you will use the Obstetrics and Periodontal Therapy dataset. This dataset comes from a randomized clinical trial investigating whether treatment of maternal periodontal disease reduces the risk of preterm birth.
You can look at what the individual variables contain here. The curated dataset contains six variables which may be regarded as outcomes, these are:
- GA.at.outcome: Gestational age at end of pregnancy (days)
- Birthweight: Weight of baby at birth (grams)
- Apgar1: Apgar score, a summary of a newborn infant's condition, at 1 minute after birth, range: 0-10
- Apgar5: Apgar score at 5 minutes after birth, range: 0-10
- Preg.ended...37.wk: Pregnancy ended before week 37, categorical (0 = no, 1 = yes)
- Any.SAE.: Whether the participant experienced any serious adverse events (Yes, No)
We will consider the remaining 28 variables as potential explanatory variables for these outcomes.
Just as in Exercise 5B, we will use Preg.ended...37.wk as our outcome variable. This factor variable denotes whether a woman gave birth prematurely (1 = yes, 0 = no).
Elastic Net Regression
Load the R packages needed for analysis:
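The exercise does not list the packages explicitly; a minimal sketch, assuming the tidyverse is used for data wrangling, glmnet for the elastic net model, and caret for confusionMatrix():

```r
library(tidyverse)  # data wrangling and plotting (dplyr, ggplot2, ...)
library(glmnet)     # elastic net regression: glmnet(), cv.glmnet()
library(caret)      # confusionMatrix() for evaluating class predictions
```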
- Load in the dataset Obt_Perio_ML.Rdata and inspect it.
- Do some basic summary statistics and distributional plots to get a feel for the data. Which types of variables do we have?
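In the spirit of the exercise's other pseudo-code, loading and inspecting could look like this (the object name optDat is an assumption; check ls() to see what the .Rdata file actually contains):

```
load("Obt_Perio_ML.Rdata")  # loads the stored object(s) into the workspace
ls()                        # see what was loaded; 'optDat' below is an assumed name
str(optDat)                 # variable types
summary(optDat)             # basic summary statistics
```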
Some of the numeric variables are actually categorical. We have identified them in the facCols vector below. Change their type from numeric to character.

facCols <- c("Race",
             "ETXU_CAT5",
             "BL.Anti.inf",
             "BL.Antibio",
             "V3.Anti.inf",
             "V3.Antibio",
             "V3.Bac.vag",
             "V5.Anti.inf",
             "V5.Antibio",
             "V5.Bac.vag",
             "X..Vis.Att")
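One way to do the conversion with dplyr, shown here on an invented stand-in for the real data (only the across() pattern matters; in the exercise you would use the loaded dataset and the full facCols vector):

```r
library(dplyr)

# Invented stand-in data for illustration only
dat <- data.frame(Race = c(1, 2, 1), BL.Antibio = c(0, 1, 0), Birthweight = c(3100, 2450, 3300))
facColsDemo <- c("Race", "BL.Antibio")  # subset of the real facCols

# Convert the identified columns from numeric to character
dat <- dat %>% mutate(across(all_of(facColsDemo), as.character))
str(dat)
```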
- As you will only use the outcome Preg.ended...37.wk, you should remove the other five possible outcome variables from your dataset. You will need to take out one more variable before modelling; figure out which one and remove it.
- Elastic net regression can be sensitive to large differences in the range of numeric/integer variables, so these variables should be scaled. Scale all numeric/integer variables in your dataset. Hint: mutate(across(where(…)))
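One possible completion of the hint above, again on invented stand-in data (scale() returns a one-column matrix, hence the as.numeric() wrapper):

```r
library(dplyr)

# Invented stand-in data: two numeric columns and one character column
dat <- data.frame(GA = c(259, 280, 241), wt = c(3100, 2450, 3300), Race = c("1", "2", "1"))

# Scale every numeric column; character/factor columns are left untouched
datScaled <- dat %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))
```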
Split your dataset into a train and a test set; you should have 70% of the data in the training set and 30% in the test set. How you choose to split is up to you, BUT afterwards you should ensure that, for the categorical/factor variables, all levels are represented in both sets.
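A simple random 70/30 split in base R, using the built-in mtcars data as a stand-in; the level check at the end is the kind of sanity check the exercise asks for:

```r
set.seed(123)  # reproducible split

dat <- mtcars
dat$cyl <- factor(dat$cyl)  # treat cylinders as a stand-in factor variable

trainIdx <- sample(seq_len(nrow(dat)), size = round(0.7 * nrow(dat)))
train <- dat[trainIdx, ]
test  <- dat[-trainIdx, ]

# Check that every level of the factor occurs in both sets
all(levels(dat$cyl) %in% unique(train$cyl)) &&
  all(levels(dat$cyl) %in% unique(test$cyl))
```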
- After dividing into train and test set, pull out the outcome variable, Preg.ended...37.wk, into its own vector for both datasets; name these y_train and y_test.
- Remove the outcome variable, Preg.ended...37.wk, from the train and test set, as we should obviously not use this for training.
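In base R, the pull-out-and-remove step can look like this, shown on a tiny invented frame that reuses the real column name (do the same for the test set and y_test):

```r
# Invented stand-in for the training set
train <- data.frame(Preg.ended...37.wk = factor(c(0, 1, 0, 1)), BMI = c(22, 31, 27, 25))

y_train <- train$Preg.ended...37.wk  # keep the outcome in its own vector
train$Preg.ended...37.wk <- NULL     # drop it from the predictors
```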
You will employ the package glmnet to perform Elastic Net Regression. The main function from this package is glmnet(), which we will use to fit the model. You will also perform cross validation with cv.glmnet() to obtain the best value of the model hyper-parameter, lambda (λ).
As we are working with a mix of categorical and numerical predictors, it is advisable to dummy-code the variables. You can easily do this by creating a model matrix for both the test and train set.
- Create the model matrix needed for input to glmnet() and cv.glmnet(). See pseudo-code below:

modTrain <- model.matrix(~ . - 1, data = train)
modTest  <- model.matrix(~ . - 1, data = test)

- Create your Elastic Net Regression model with glmnet().
As you are performing Elastic Net Regression, you set the mixing parameter alpha = 0.5. This gives you a 50/50 balance of the L1 (LASSO) and L2 (Ridge) penalties.
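A minimal fit on simulated data (glmnet expects a numeric matrix x, and a binary outcome requires family = "binomial"; the variable names are invented):

```r
library(glmnet)

set.seed(1)
# Simulated stand-in for the dummy-coded model matrix and binary outcome
x <- matrix(rnorm(200 * 10), nrow = 200,
            dimnames = list(NULL, paste0("var", 1:10)))
y <- factor(rbinom(200, 1, plogis(x[, 1] - x[, 2])))

# alpha = 0.5 gives the 50/50 LASSO/Ridge mix described above
enetFit <- glmnet(x, y, family = "binomial", alpha = 0.5)
```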
- Use cv.glmnet() to attain the best value of the hyperparameter lambda (λ). Remember to set a seed for reproducible results.
- Plot all the values of lambda tested during cross validation by calling plot() on the output of your cv.glmnet(). Extract the best lambda value from the cv.glmnet() model and save it as an object.
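Continuing with simulated data: lambda.min is the λ with the lowest cross-validated error (lambda.1se is a common, more conservative alternative):

```r
library(glmnet)

set.seed(42)  # cross-validation folds are assigned randomly
x <- matrix(rnorm(200 * 10), nrow = 200)
y <- rbinom(200, 1, 0.5)

cvFit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
plot(cvFit)                      # CV error across the lambda sequence
bestLambda <- cvFit$lambda.min   # save the best lambda for later use
```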
Now, let’s see how well your model performed.
- Predict if an individual is likely to give birth before the 37th week using your model and your test set. See pseudo-code below:

y_pred <- predict(model, newx = modTest, s = bestLambda, type = "class")

- Just like for the logistic regression model, you can calculate the accuracy of the prediction by comparing it to y_test with confusionMatrix(). Do you have a good accuracy? N.B. look at the 2x2 contingency table; what does it tell you?
- Lastly, let's extract the variables which were retained in the model (i.e. not penalized out). We do this by calling coef() on our model. See pseudo-code below:
coeffs <- coef(model, s = bestLambda)
# Convert coefficients to a data frame for easier viewing
coeffsDat <- as.data.frame(as.matrix(coeffs))

- Make a plot that shows the absolute importance of the variables retained in your model. This could be a barplot with variable names on the y-axis and bar lengths denoting the absolute size of the coefficients.
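Putting these pieces together on simulated data: the dimnames give the "variables" readable names, and the barplot keeps only the non-zero, non-intercept coefficients:

```r
library(glmnet)

set.seed(7)
x <- matrix(rnorm(300 * 8), nrow = 300,
            dimnames = list(NULL, paste0("var", 1:8)))
y <- rbinom(300, 1, plogis(2 * x[, 1] - 2 * x[, 2]))

cvFit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
coeffs <- coef(cvFit, s = cvFit$lambda.min)
coeffsDat <- as.data.frame(as.matrix(coeffs))
names(coeffsDat) <- "coef"

# Keep variables that survived the penalty (non-zero), drop the intercept
kept <- coeffsDat[coeffsDat$coef != 0 & rownames(coeffsDat) != "(Intercept)", , drop = FALSE]

barplot(abs(kept$coef), names.arg = rownames(kept),
        horiz = TRUE, las = 1, xlab = "Absolute coefficient size")
```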
Now repeat what you just did above, but this time, instead of using Preg.ended...37.wk as the outcome, try using a continuous variable, such as GA.at.outcome. N.B. remember this means that you should evaluate the model using the RMSE and a scatter plot instead of the accuracy!
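For a continuous outcome, glmnet defaults to family = "gaussian", and the evaluation could look like this sketch on simulated data (RMSE is the square root of the mean squared prediction error):

```r
library(glmnet)

set.seed(11)
x <- matrix(rnorm(200 * 6), nrow = 200)
y <- 2 * x[, 1] - x[, 2] + rnorm(200)  # continuous stand-in for GA.at.outcome

cvFit <- cv.glmnet(x, y, alpha = 0.5)  # family = "gaussian" is the default
pred  <- as.numeric(predict(cvFit, newx = x, s = "lambda.min"))

rmse <- sqrt(mean((y - pred)^2))       # root mean squared error
plot(y, pred, xlab = "Observed", ylab = "Predicted")  # scatter plot of fit
abline(0, 1, lty = 2)                  # perfect-prediction reference line
```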