Exercise 5B - K-means and Random Forest

Part 1: K-means Clustering

In this part you will run K-means clustering. We will work with a dataset from patients with kidney disease, containing approximately 20 biological measures (variables) collected from 400 patients. The outcome is the classification variable, which denotes whether a person suffers from chronic kidney disease (ckd) or not (notckd).

Below is a description of the variables in the dataset:

        age     -   age
        bp      -   blood pressure
        rbc     -   red blood cells
        pc      -   pus cell
        pcc     -   pus cell clumps
        ba      -   bacteria
        bgr     -   blood glucose random
        bu      -   blood urea
        sc      -   serum creatinine
        sod     -   sodium
        pot     -   potassium
        hemo    -   hemoglobin
        pcv     -   packed cell volume
        wc      -   white blood cell count
        rc      -   red blood cell count
        htn     -   hypertension
        dm      -   diabetes mellitus
        cad     -   coronary artery disease
        appet   -   appetite
        pe      -   pedal edema
        ane     -   anemia
        class   -   classification  
  1. Load the R packages needed for analysis. If you are in doubt about what packages you need, look at Presentation 5B. You will need to do some data-wrangling and will also need a package for K-means clustering and visualization.

  2. Load in the dataset named kidney_disease.Rdata.

Navigate to the data/kidney_disease.Rdata file in the file panel of your RStudio and click the file. Confirm that you want to load the data into your environment by clicking “Yes”.

For reproducibility, copy the load function call printed in the console and paste it to the code chunk.

  3. Before running K-means clustering, remove rows with any missing values across all variables in your dataset (yes, you will lose quite a lot of rows). Consider which columns you can use and whether you have to do anything to them before clustering.
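A minimal sketch of this step, using a small made-up data frame in place of the kidney data (the column names here are just a few of those listed above; adapt to your own dataset):

```r
# Toy stand-in for the kidney data (assumed column names)
df <- data.frame(
  age   = c(48, 62, NA, 51, 60),
  bp    = c(80, 70, 90, NA, 100),
  hemo  = c(15.4, 11.3, 9.6, 11.2, 12.2),
  class = c("ckd", "ckd", "ckd", "notckd", "notckd")
)

# Drop rows with any missing value
df_complete <- na.omit(df)

# K-means needs numeric input, so keep only the numeric columns,
# and scale them so that no single variable dominates the distances
num_cols  <- sapply(df_complete, is.numeric)
df_scaled <- scale(df_complete[, num_cols])
```

Scaling matters here because the kidney variables are on very different scales (e.g. blood pressure vs. serum creatinine).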

  4. Run the K-means clustering algorithm with 4 centers on the data. Look at the clusters you have generated.
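A sketch with base R's kmeans() on a toy scaled matrix standing in for your prepared data:

```r
set.seed(42)
# Toy numeric matrix in place of the scaled kidney data
X <- scale(matrix(rnorm(200), ncol = 4))

# nstart > 1 reruns the algorithm from several random starts
# and keeps the best solution
km <- kmeans(X, centers = 4, nstart = 25)

km$size     # observations per cluster
km$centers  # cluster centroids
km$cluster  # cluster assignment for each row
```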

  5. Visualize the results of your clustering. Do 4 clusters seem like a good fit for the data in the first two dimensions (Dim1 and Dim2)? How about if you have a look at Dim3 or Dim4?
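fviz_cluster() from factoextra is the usual tool here; as a base-R alternative, the same idea can be sketched by projecting the data onto principal components and colouring points by cluster (toy data again):

```r
set.seed(42)
X  <- scale(matrix(rnorm(200), ncol = 4))  # stand-in for the scaled data
km <- kmeans(X, centers = 4, nstart = 25)

# Project onto principal components; swap columns 1:2 for 3:4
# to inspect Dim3 and Dim4 instead
pca <- prcomp(X)
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "Dim1", ylab = "Dim2")
```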

  6. Investigate the best number of clusters for this dataset. Use the silhouette metric.
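factoextra's fviz_nbclust() gives this as a plot; the underlying computation can be sketched with the cluster package (shipped with R as a "recommended" package), again on toy data:

```r
library(cluster)  # for silhouette()

set.seed(42)
X <- scale(matrix(rnorm(200), ncol = 4))

# Mean silhouette width for k = 2..6; higher is better
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(X))[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil
```

The k with the largest average silhouette width is the suggested number of clusters.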

  7. Re-do the clustering (plus visualization) with the optimal number of clusters.

  8. Now, try to figure out what the two clusters might represent. There are different ways to do this, but one easy way is to compare the cluster IDs from the K-means output with one or more of the categorical variables from the dataset. You could use count() or table() for this.
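The table() route can be sketched like this, with made-up cluster assignments and class labels:

```r
# Toy example: cluster assignments vs. a categorical outcome
clusters <- c(1, 1, 2, 2, 2, 1, 2, 1)
class    <- c("ckd", "ckd", "notckd", "notckd",
              "notckd", "ckd", "notckd", "ckd")

# Cross-tabulate cluster ID against the class variable
tab <- table(cluster = clusters, class = class)
tab
```

If one cluster lines up mostly with ckd and the other with notckd, that suggests what the clusters represent.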

  9. The withinss measure (within-cluster variance/spread) is much larger for one cluster than for the other. What biological reason could there be for that?

Part 2: Random Forest

For your Random Forest model you will use the Obstetrics and Periodontal Therapy Dataset, which was collected through a randomized clinical trial. The trial was concerned with the effects of treating maternal periodontal disease and whether this may reduce the risk of preterm birth.

This dataset has a total of 823 observations and 171 variables. As is often the case with clinical data, there are many missing values, and for most categorical variables the number of observations per level is unbalanced.

So that you do not spend too much time on data cleaning and wrangling, we have selected 33 variables (plus a patient ID column, PID) for which we have just enough observations per group level.

Summary Statistics

  1. Load the R packages needed for analysis. As for Part 1 above, have a look at the accompanying presentation for this exercise to get an idea of what packages you might need.

N.B. you may have some clashes of function names between these packages. Because of this, you will need to specify which package you are calling some of the functions from. You do this with the package name followed by two colons and the name of the function, e.g. dplyr::select().

  2. Load in the dataset Obt_Perio_ML.Rdata and inspect it.

You can look at what the individual variables contain here. The curated dataset contains six variables which may be regarded as outcomes, these are:

  • GA.at.outcome: Gestational age at end of pregnancy (days)
  • Birthweight: Weight of baby at birth (grams)
  • Apgar1: Apgar score, a summary of a newborn infant’s appearance at birth, range: 0-10
  • Apgar5: Apgar score at 5 minutes after birth, range: 0-10
  • Preg.ended...37.wk: Pregnancy ended before week 37, categorical (0 = no, 1 = yes)
  • Any.SAE.: Whether participant experienced any serious adverse events (Yes, No)

The remaining 28 variables we will consider as potential explanatory variables for these outcomes.

  3. Do some basic summary statistics. How many categorical variables and how many numeric variables do you have? Try to make distribution plots for a couple of your numeric variables (or all of them, if you would like) to get a feel for the data distributions.
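A sketch of counting variable types and plotting one distribution, using a tiny toy frame in place of the loaded dataset (column names assumed from the outcome list above):

```r
# Toy stand-in for the curated dataset
df <- data.frame(
  Birthweight = c(3200, 2900, 3500, 1800),
  Apgar5      = c(9, 8, 10, 6),
  Race        = c("1", "2", "1", "3")
)

# How many numeric vs. non-numeric variables?
table(sapply(df, is.numeric))

# Distribution of one numeric variable
hist(df$Birthweight, main = "Birthweight", xlab = "grams")
summary(df$Birthweight)
```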

  4. Some of the numeric variables are actually categorical. We have identified them in the facCols vector. Here, we change their type from numeric to character (since the other categorical variables are of this type). This code is slightly different from changing the type to factor, which is why we have written it for you. Try to understand what is going on.

facCols <- c("Race", 
             "ETXU_CAT5", 
             "BL.Anti.inf", 
             "BL.Antibio", 
             "V3.Anti.inf", 
             "V3.Antibio", 
             "V3.Bac.vag", 
             "V5.Anti.inf",
             "V5.Antibio",
             "V5.Bac.vag",
             "X..Vis.Att")


optML <- optML %>%
  mutate(across(all_of(facCols), as.character))

head(optML)

  5. Make count tables for all your categorical/factor variables. Are they balanced?
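One way to get a count table per categorical column in a single call, sketched on a toy frame (column names assumed from facCols above):

```r
# Toy stand-in with two character columns
df <- data.frame(
  Race       = c("1", "2", "1", "1", "3"),
  BL.Antibio = c("Yes", "No", "No", "Yes", "No")
)

# One count table per character column
counts <- lapply(df[sapply(df, is.character)], table)
counts
```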

Random Forest Model

Now, let’s try a Random Forest classifier. As we cannot model multiple outcome variables at once, we will select Preg.ended...37.wk as our outcome of interest. This variable denotes whether a woman gave birth prematurely (1 = yes, 0 = no).

  1. Remove the other five possible outcome variables from the dataset, i.e.: GA.at.outcome, Birthweight, Apgar1, Apgar5 and Any.SAE.. You will need to take out one more variable from the dataset before modelling; figure out which one and remove it.
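Dropping the five named outcomes can be sketched like this (toy frame with just a couple of predictor columns; the extra variable you need to find is left for you):

```r
outcome_cols <- c("GA.at.outcome", "Birthweight", "Apgar1",
                  "Apgar5", "Any.SAE.")

# Toy stand-in for the dataset
df <- data.frame(GA.at.outcome = 1, Birthweight = 1, Apgar1 = 1,
                 Apgar5 = 1, Any.SAE. = "No",
                 Preg.ended...37.wk = 0, Race = "1")

# Keep everything except the five other outcomes
df_model <- df[, !(names(df) %in% outcome_cols)]
names(df_model)
```

dplyr::select(df, -all_of(outcome_cols)) is an equivalent tidyverse route.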

  2. The R-package we are using to make a RF classifier requires the outcome variable to be categorical. As Preg.ended...37.wk is encoded as numeric (0 and 1), you will have to change these to categories.
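A sketch of the recoding; note that caret's twoClassSummary needs factor levels that are valid R names, so bare "0"/"1" will not do (the "No"/"Yes" labels are our choice):

```r
# Toy outcome vector encoded as 0/1
y <- c(0, 1, 0, 0, 1)

# Recode to a factor with valid-name levels
y_fac <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
levels(y_fac)
```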

  3. Split your dataset into a training set and a test set; you should have 70% of the data in the training set and 30% in the test set. How you choose to split is up to you, BUT afterwards you should ensure that, for the categorical/factor variables, all levels are represented in both sets.
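A plain base-R split, sketched on toy data (caret::createDataPartition would additionally stratify on the outcome, which helps with unbalanced levels):

```r
set.seed(123)
n  <- 100
df <- data.frame(
  x = rnorm(n),
  Preg.ended...37.wk = factor(rep(c("No", "Yes"), each = 50))
)

# Random 70/30 split
train_idx    <- sample(seq_len(n), size = 0.7 * n)
Trainingdata <- df[train_idx, ]
Testdata     <- df[-train_idx, ]

# Check that all outcome levels appear in both sets
table(Trainingdata$Preg.ended...37.wk)
table(Testdata$Preg.ended...37.wk)
```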

  4. Set up a Random Forest model with cross-validation. See pseudo-code below. Remember to set a seed.

First the cross-validation parameters:

set.seed(123)

# Set up cross-validation: 5-fold CV
RFcv <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)

Next we train the model:

# Train Random Forest
set.seed(123)
rf_model <- train(
  Outcome ~ .,
  data = Trainingdata,
  method = "rf",
  trControl = RFcv,
  metric = "ROC",
  tuneLength = 5           
)


# Model summary
print(rf_model)

  5. Plot your model fit. How does your model improve when you add 10, 20, 30, etc. predictors?
# Best parameters
rf_model$bestTune

# Plot performance
plot(rf_model)

  6. Use your test set to evaluate your model performance. How does the Random Forest compare to the Elastic Net regression?
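With a caret model you would call predict(rf_model, newdata = Testdata) and pass the result to caret::confusionMatrix() together with the true labels. The core of that evaluation can be sketched in base R with made-up prediction vectors:

```r
# Hypothetical truth and predictions on a tiny test set
truth <- factor(c("No", "No", "Yes", "Yes", "No", "Yes"))
preds <- factor(c("No", "No", "Yes", "No",  "No", "Yes"))

# Confusion matrix and accuracy
cm <- table(predicted = preds, truth = truth)
cm
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

confusionMatrix() additionally reports sensitivity, specificity and related metrics, which matter more than raw accuracy for an unbalanced outcome like this one.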

  7. Extract the predictor variables with the greatest importance from your fit.
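With caret, varImp(rf_model) gives the ranking directly. To show the idea behind permutation importance, here is a hand-rolled base-R sketch on a toy logistic model: shuffle one predictor at a time and measure how much accuracy drops.

```r
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# y depends strongly on x1, barely on x2
y  <- factor(ifelse(x1 + 0.1 * x2 + rnorm(n, sd = 0.5) > 0, "Yes", "No"))
fit <- glm(y ~ x1 + x2, family = binomial)

acc <- function(d) {
  p <- predict(fit, newdata = d, type = "response")
  mean(ifelse(p > 0.5, "Yes", "No") == y)
}

base_acc <- acc(data.frame(x1, x2))
drop_x1  <- base_acc - acc(data.frame(x1 = sample(x1), x2))
drop_x2  <- base_acc - acc(data.frame(x1, x2 = sample(x2)))
c(x1 = drop_x1, x2 = drop_x2)  # x1 should matter far more
```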

  8. Make a logistic regression using the same dataset (you already have your train data, test data, y_train and y_test). How do the results of the Elastic Net regression and the Random Forest compare to the output of your glm?
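A sketch of the glm step on toy data standing in for your train/test split (one predictor here; with your data you would use the same formula and columns as for the Random Forest):

```r
set.seed(123)
n <- 150
train   <- data.frame(x = rnorm(n))
train$y <- factor(ifelse(train$x + rnorm(n) > 0, "Yes", "No"))

# Logistic regression with base R
glm_fit <- glm(y ~ x, data = train, family = binomial)
summary(glm_fit)

# Predicted probabilities -> class labels, as for the RF model
probs <- predict(glm_fit, type = "response")
preds <- factor(ifelse(probs > 0.5, "Yes", "No"))
table(predicted = preds, truth = train$y)
```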