facCols <- c("Race",
"ETXU_CAT5",
"BL.Anti.inf",
"BL.Antibio",
"V3.Anti.inf",
"V3.Antibio",
"V3.Bac.vag",
"V5.Anti.inf",
"V5.Antibio",
"V5.Bac.vag",
"X..Vis.Att")
library(dplyr)  # for %>%, mutate() and across()
optML <- optML %>%
mutate(across(all_of(facCols), as.character))
head(optML)
Exercise 5B - Kmeans and Random Forest
Part 1: K-Means Clustering
In this part you will run a K-means clustering. We will work with a dataset from patients with kidney disease. The dataset contains approximately 20 biological measures (variables) collected across 400 patients. The outcome is the classification variable, which denotes whether a person suffers from chronic kidney disease (ckd) or not (notckd).
Below is a description of the variables in the dataset:
age - age
bp - blood pressure
rbc - red blood cells
pc - pus cell
pcc - pus cell clumps
ba - bacteria
bgr - blood glucose random
bu - blood urea
sc - serum creatinine
sod - sodium
pot - potassium
hemo - hemoglobin
pcv - packed cell volume
wc - white blood cell count
rc - red blood cell count
htn - hypertension
dm - diabetes mellitus
cad - coronary artery disease
appet - appetite
pe - pedal edema
ane - anemia
class - classification
Load the R packages needed for analysis. If you are in doubt about what packages you need, look at Presentation 5B. You will need to do some data wrangling and will also need a package for K-means clustering and visualization.
Load in the dataset named kidney_disease.Rdata.
Navigate to the data/kidney_disease.Rdata file in the file panel of your RStudio and click the file. Confirm that you want to load the data into your environment by clicking “Yes”.
For reproducibility, copy the load function call printed in the console and paste it to the code chunk.
Before running K-means clustering, remove rows with any missing values across all variables in your dataset - yes, you will lose quite a lot of rows. Consider which columns you can use, and whether you have to do anything to them before clustering.
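A minimal sketch of this cleaning step: drop incomplete rows, keep only the numeric columns, and scale them so no single measure dominates the distance calculation. The frame `kidney` below is a tiny simulated stand-in for the loaded kidney data, just so the chunk runs on its own.

```r
# Simulated stand-in for the kidney data frame (the real one comes from kidney_disease.Rdata)
kidney <- data.frame(
  age   = c(48, NA, 62, 51, 70),
  bp    = c(80, 70, NA, 90, 85),
  hemo  = c(15.4, 11.3, 9.6, NA, 12.1),
  class = c("ckd", "ckd", "ckd", "notckd", "notckd")
)

kidney_clean <- na.omit(kidney)                   # remove rows with any missing value
num_cols     <- sapply(kidney_clean, is.numeric)  # K-means needs numeric input only
kidney_num   <- scale(kidney_clean[, num_cols])   # center and scale each variable
```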
Run the k-means clustering algorithm with 4 centers on the data. Look at the clusters you have generated.
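A sketch of the call, assuming the scaled complete-case matrix is named `kidney_num` (a random matrix stands in for it here):

```r
set.seed(123)
kidney_num <- matrix(rnorm(200), ncol = 4)  # stand-in for the scaled kidney data

km <- kmeans(kidney_num, centers = 4, nstart = 25)  # nstart > 1 avoids poor local optima
km$size     # number of observations per cluster
km$centers  # cluster centroids on the scaled variables
```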
Visualize the results of your clustering. Do 4 clusters seem like a good fit for our data in the first two dimensions (Dim1 and Dim2)? What about if you have a look at Dim3 or Dim4?
Investigate the best number of clusters for this dataset. Use the silhouette metric. Re-do the clustering (plus visualization) with the optimal number of clusters.
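One way to apply the silhouette metric, sketched with the cluster package that ships with R (factoextra's fviz_nbclust with method = "silhouette" plots the same quantity); a random matrix again stands in for the scaled data:

```r
library(cluster)  # for silhouette(); a recommended package shipped with R

set.seed(123)
x <- matrix(rnorm(200), ncol = 4)  # stand-in for the scaled kidney data

avg_sil <- sapply(2:6, function(k) {
  km  <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(x))
  mean(sil[, "sil_width"])          # average silhouette width for this k
})
best_k <- (2:6)[which.max(avg_sil)]  # k with the highest average silhouette width
```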
Now, try to figure out what the two clusters might represent. There are different ways to do this, but one easy way would be to simply compare the cluster IDs from the K-means output with one or more of the categorical variables from the dataset. You could use count() or table() for this. The withinss measure (within-cluster variance/spread) is much larger for one cluster than the other; what biological reason could there be for that?
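A sketch of the comparison, with simulated stand-ins for the clustering input and the class variable:

```r
set.seed(123)
x     <- matrix(rnorm(200), ncol = 4)                    # stand-in scaled data
class <- sample(c("ckd", "notckd"), 50, replace = TRUE)  # stand-in diagnosis column

km <- kmeans(x, centers = 2, nstart = 25)
table(km$cluster, class)  # do cluster IDs line up with the diagnosis?
km$withinss               # within-cluster sum of squares, one value per cluster
```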
Part 2: Random Forest
For your Random Forest model you will use the Obstetrics and Periodontal Therapy Dataset, which was collected through a randomized clinical trial. The trial was concerned with the effects of treatment of maternal periodontal disease and if it may reduce preterm birth risk.
This dataset has a total of 823 observations and 171 variables. As is often the case with clinical data, the data has many missing values, and for most categorical variables there is an imbalance in the number of observations per level.
In order for you to not spend too much time on data cleaning and wrangling we have selected 33 variables (plus a patient ID column, PID) for which we have just enough observations per group level.
Summary Statistics
- Load the R packages needed for analysis. As for Part 1 above, have a look at the accompanying presentation for this exercise to get an idea of what packages you might need.
N.B. you may have some clashes of function names between these packages. Because of this, you will need to specify the package you are calling some of the functions from. You do this with the package name followed by two colons and the name of the function, e.g. dplyr::select().
- Load in the dataset Obt_Perio_ML.Rdata and inspect it.
You can look at what the individual variables contain here. The curated dataset contains six variables which may be regarded as outcomes, these are:
GA.at.outcome: Gestational age at end of pregnancy (days)
Birthweight: Weight of baby at birth (grams)
Apgar1: Apgar score, a summary of a newborn infant’s appearance at birth, range: 0-10
Apgar5: Apgar score at 5 minutes after birth, range: 0-10
Preg.ended...37.wk: Pregnancy ended before week 37, categorical (0 = no, 1 = yes)
Any.SAE.: Whether participant experienced any serious adverse events (Yes, No)
The remaining 28 variables we will consider as potential explanatory variables for these outcomes.
Do some basic summary statistics. How many categorical variables and how many numeric variables do you have? Try to make distributional plots for a couple of your numeric variables (or all if you would like) to get a feel for some of the data distributions you have.
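A minimal sketch of the type counts and one distributional plot; the three-column frame below is a hypothetical stand-in for the loaded OPT data:

```r
# Hypothetical stand-in for the curated OPT data frame
optML <- data.frame(
  Age  = c(25, 31, 28, 35),
  Race = c("A", "B", "A", "B"),
  BMI  = c(22.1, 27.5, 24.3, 30.2)
)

sum(sapply(optML, is.numeric))    # how many numeric variables
sum(sapply(optML, is.character))  # how many categorical variables
hist(optML$Age, main = "Distribution of Age")  # base-R distributional plot
```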
Some of the numeric variables are actually categorical. We have identified them in the facCols vector. Here, we change their type from numeric to character (since the other categorical variables are of this type). This code is slightly different from changing the type to factor, which is why we have written the code for you. Try to understand what is going on.
- Make count tables for all your categorical/factor variables. Are they balanced?
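One way to get a count table per categorical column, sketched on a stand-in frame using two of the facCols names:

```r
optML <- data.frame(                    # stand-in for the curated OPT data
  Race       = c("A", "A", "B", "B", "B"),
  V3.Antibio = c("Yes", "No", "No", "Yes", "No")
)

char_cols <- names(optML)[sapply(optML, is.character)]
lapply(optML[char_cols], table)  # one count table per categorical variable
```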
Random Forest Model
Now, let’s try a Random Forest classifier. We cannot model with multiple outcome variables, so we will select Preg.ended...37.wk as our outcome of interest. This variable denotes whether a woman gave birth prematurely (1 = yes, 0 = no).
Remove the other five possible outcome variables from the dataset, i.e.: GA.at.outcome, Birthweight, Apgar1, Apgar5 and Any.SAE.. You will need to take out one more variable from the dataset before modelling; figure out which one and remove it.
The R package we are using to make a RF classifier requires the outcome variable to be categorical. As Preg.ended...37.wk is encoded as numeric (0 and 1), you will have to change these to categories.
Split your dataset into a train and a test set; you should have 70% of the data in the training set and 30% in the test set. How you choose to split is up to you, BUT afterwards you should ensure that for the categorical/factor variables all levels are represented in both sets.
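A base-R sketch of the 70/30 split (caret's createDataPartition() would give you a stratified split directly); the frame below is a stand-in with the outcome already recoded. Note that caret's twoClassSummary needs factor levels that are valid R names, which is why "No"/"Yes" are used rather than 0/1.

```r
set.seed(123)
n  <- 100
df <- data.frame(                          # stand-in for the cleaned OPT data
  Preg.ended...37.wk = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  x                  = rnorm(n)
)

idx          <- sample(seq_len(n), size = round(0.7 * n))
Trainingdata <- df[idx, ]   # 70% training set
Testdata     <- df[-idx, ]  # 30% test set

# Check that every factor level is represented in both sets
all(levels(df$Preg.ended...37.wk) %in% Trainingdata$Preg.ended...37.wk)
all(levels(df$Preg.ended...37.wk) %in% Testdata$Preg.ended...37.wk)
```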
Set up a Random Forest model with cross-validation. See pseudo-code below. Remember to set a seed.
First the cross-validation parameters:
set.seed(123)
# Set up cross-validation: 5-fold CV
RFcv <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = "final"
)
Next we train the model:
# Train Random Forest
set.seed(123)
rf_model <- train(
Outcome ~ .,
data = Trainingdata,
method = "rf",
trControl = RFcv,
metric = "ROC",
tuneLength = 5
)
# Model summary
print(rf_model)
- Plot your model fit. How does your model improve when you add 10, 20, 30, etc. predictors?
# Best parameters
rf_model$bestTune
# Plot performance
plot(rf_model)
Use your test set to evaluate your model performance. How does the Random Forest compare to the Elastic Net regression?
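With caret you would call predict(rf_model, newdata = Testdata) and pass the result to caret::confusionMatrix(). The base-R sketch below shows the same idea on stand-in predictions:

```r
truth <- factor(c("No", "No", "Yes", "No", "Yes", "No"))  # stand-in test labels
pred  <- factor(c("No", "Yes", "Yes", "No", "No", "No"),
                levels = levels(truth))                   # stand-in predictions

cm <- table(Predicted = pred, Truth = truth)  # confusion matrix
accuracy <- sum(diag(cm)) / sum(cm)           # proportion correctly classified
```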
Extract the predictive variables with the greatest importance from your fit.
Make a logistic regression using the same dataset (you already have your train data, test data, y_train and y_test). How do the results of Elastic Net regression and Random Forest compare to the output of your glm?
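A sketch of the glm fit on stand-in data; with the real data you would fit Preg.ended...37.wk ~ . on your training set and evaluate on the test set, as for the Random Forest:

```r
set.seed(123)
df <- data.frame(                 # stand-in for the training data
  y  = factor(sample(c("No", "Yes"), 80, replace = TRUE)),
  x1 = rnorm(80),
  x2 = rnorm(80)
)

fit  <- glm(y ~ x1 + x2, data = df, family = binomial)  # logistic regression
prob <- predict(fit, type = "response")                 # P(y == "Yes")
pred <- ifelse(prob > 0.5, "Yes", "No")                 # classify at 0.5
acc  <- mean(pred == df$y)                              # accuracy on these data
```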