Exercise 5A: Intro to Regression in R
Introduction
In this exercise you will fit and interpret simple models.
- Load packages
library(tidyverse)
library(readxl)
library(ggfortify)
library(factoextra)
Part 1: Linear regression
We will use the dataset described below to fit a linear regression model.
- Load the data boston.csv and inspect it.
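A minimal way to do this with the tidyverse, assuming boston.csv sits in a ../data folder (adjust the path to your own setup):
boston <- read_csv("../data/boston.csv")
head(boston)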
This dataset describes conditions surrounding the Boston housing market in the 1970s. Each row describes a zone in the Boston area (so there is more than one house in each row).
The columns are:
crim - per capita crime rate
indus - proportion of non-retail businesses
nox - Nitrogen oxides concentration (air pollution)
rm - average number of rooms
neighborhood - the type of neighborhood the zone is in
medv - median value per house in 1000s
Explore the data
- Does the datatype of each column fit what it describes? Do you need to change any data types?
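One way to check, assuming the data frame is called boston: glimpse() shows the type of each column, and neighborhood will likely need to become a factor before modelling.
boston <- boston %>%
  mutate(neighborhood = as.factor(neighborhood))   # character -> factor
glimpse(boston)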
Making a model
- Split the dataset into test and training data. N.B.: for any categorical variables in the boston dataset, ensure that all levels are represented in both training and test sets. There are many ways to split a dataset - an easy way is to use the function sample_frac().
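A sketch of an 80/20 split along those lines; the row id makes the anti_join unambiguous, and grouping by neighborhood keeps all levels in both sets (object names are placeholders):
set.seed(42)                          # make the random split reproducible
boston <- boston %>% mutate(id = row_number())
train_data <- boston %>%
  group_by(neighborhood) %>%          # sample within each neighborhood level
  sample_frac(0.8) %>%
  ungroup()
test_data <- anti_join(boston, train_data, by = "id")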
- Using your training data, fit a linear model with number of rooms (rm), crime rate (crim) and neighborhood type (neighborhood) as the predictors and the value of a house (medv) as the response variable (y). Describe what information you get from the model summary.
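Assuming the split above, the model fit could look like this:
model <- lm(medv ~ rm + crim + neighborhood, data = train_data)
summary(model)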
- If you wanted to know if there is a difference in the value of houses between the Suburban and Urban neighborhoods, what could you do to the variable neighborhood before modelling? (One option is sketched after the diagnostic plot code below.)
- For linear regression there is an assumption that the model residuals (errors) are normally distributed. An easy way to visualize this is by simply calling plot() on your model (see below). What do you think based on the plots?
# Diagnostic plots of the model residuals
par(mfrow=c(2,2))
plot(model)
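For the Suburban-vs-Urban question above, one option is to relevel the factor so that Suburban becomes the reference level; the Urban coefficient in the model summary is then a direct Suburban-vs-Urban contrast. A minimal sketch, assuming neighborhood has a level called "Suburban":
train_data <- train_data %>%
  mutate(neighborhood = relevel(neighborhood, ref = "Suburban"))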
- Now, use your test set to predict the response medv (median value per house in 1000s).
- Evaluate how well your model performs. There are different ways of doing this, but let's use the classic measure of RMSE (Root Mean Square Error). The pseudo-code below shows how to calculate the RMSE. A small RMSE (close to zero) indicates a good model.
# RMSE
rmse <- sqrt(mean((y_test - y_pred)^2))
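Filled in with the object names used above (a sketch, assuming model and test_data from earlier):
y_pred <- predict(model, newdata = test_data)   # predicted medv
y_test <- test_data$medv                        # observed medv
rmse <- sqrt(mean((y_test - y_pred)^2))
rmse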
- Make a scatter plot to visualize how well the predicted values fit the observed values. Plot y_test against y_pred.
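For example with ggplot2, reusing y_test and y_pred from above; the dashed line marks a perfect prediction:
tibble(observed = y_test, predicted = y_pred) %>%
  ggplot(aes(x = observed, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")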
Part 2: Logistic regression
For this part we will use the joined diabetes data since it has a categorical outcome (Diabetes yes or no). We will not use the oral glucose measurements as predictors since this is literally how you define diabetes, so we're loading the joined dataset we created in exercise 1, e.g. 'diabetes_join.xlsx' or whatever you have named it. N.B. if you did not manage to finish making this dataset or forgot to save it, you can find a copy here: ../out/diabetes_join.Rdata. Navigate to the file in the file window of RStudio and click on it. Click "Yes" to confirm that the file can be loaded into your environment and check that this has happened.
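If you would rather load the file by code than by clicking, either of these works (the paths are assumptions, adjust to where your file lives; load() restores the object under whatever name it was saved with):
diabetes <- read_xlsx("../out/diabetes_join.xlsx")   # if you saved your own xlsx
load("../out/diabetes_join.Rdata")                   # the provided copy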
As the outcome we are studying, Diabetes, is a categorical variable we will perform logistic regression. We select serum calcium levels (Serum_ca2), BMI and smoking habits (Smoker) as the predictive variables.
- Read in the Diabetes dataset.
- Logistic regression does not allow for any missing values, so first ensure you do not have NAs in your dataframe. Ensure that your outcome variable Diabetes is a factor.
- Split your data into training and test data. Take care that the two classes of the outcome variable are represented in both training and test data, and at similar ratios.
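A sketch of these steps, assuming the data frame is called diabetes and using the same grouped-sampling trick as in Part 1 to stratify by class:
diabetes <- diabetes %>%
  drop_na() %>%                        # logistic regression cannot handle NAs
  mutate(Diabetes = as.factor(Diabetes),
         id = row_number())
set.seed(42)
train_diab <- diabetes %>%
  group_by(Diabetes) %>%               # keep class ratios similar in both sets
  sample_frac(0.8) %>%
  ungroup()
test_diab <- anti_join(diabetes, train_diab, by = "id")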
- Fit a logistic regression model with Serum_ca2, BMI and Smoker as predictors and Diabetes as outcome, using your training data.
glm(..., family = 'binomial')
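Filled in with the variables above (assuming the training split is called train_diab):
log_model <- glm(Diabetes ~ Serum_ca2 + BMI + Smoker,
                 data = train_diab, family = "binomial")
summary(log_model)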
- Check the model summary and try to determine whether you could potentially drop one or more of your variables. If so, make this alternative model (model2) and compare it to the original model. Is there a significant loss/gain, i.e. a better fit, when including the serum calcium levels as a predictor?
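One way to compare nested models is a likelihood-ratio test via anova() — a sketch, assuming Serum_ca2 is the candidate for dropping:
log_model2 <- glm(Diabetes ~ BMI + Smoker,
                  data = train_diab, family = "binomial")
anova(log_model2, log_model, test = "Chisq")   # significant = calcium improves fit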
- Now, use your model to predict Diabetes class based on your test set. What does the output of the prediction mean?
predict(..., type = 'response')
- Let's evaluate the performance of our model. As we are performing classification, measures such as MSE/RMSE will not work; instead we will calculate the accuracy. In order to get the accuracy you must first convert your predictions into Diabetes class labels (e.g. 0 or 1).
caret::confusionMatrix(y_pred, y_test)
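Putting the last two steps together — a sketch that assumes Diabetes is coded 0/1, using a 0.5 cutoff to turn predicted probabilities into labels:
y_pred_prob <- predict(log_model, newdata = test_diab, type = "response")
y_pred <- factor(ifelse(y_pred_prob > 0.5, 1, 0),
                 levels = levels(test_diab$Diabetes))   # match the factor levels
y_test <- test_diab$Diabetes
caret::confusionMatrix(y_pred, y_test)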
Part 3: K-Means Clustering
In this part we will run K-means clustering. To mix it up a bit we will work with a new dataset from patients with kidney disease. The dataset contains approximately 20 biological measures (variables) collected across 400 patients. The outcome is the classification variable, which denotes whether a person suffers from chronic kidney disease (ckd) or not (notckd).
age - age
bp - blood pressure
rbc - red blood cells
pc - pus cell
pcc - pus cell clumps
ba - bacteria
bgr - blood glucose random
bu - blood urea
sc - serum creatinine
sod - sodium
pot - potassium
hemo - hemoglobin
pcv - packed cell volume
wc - white blood cell count
rc - red blood cell count
htn - hypertension
dm - diabetes mellitus
cad - coronary artery disease
appet - appetite
pe - pedal edema
ane - anemia
class - classification
- Load in the dataset named kidney_disease.Rdata.
- Before running K-means clustering, please remove rows with any missing values across all variables in your dataset - yes, you will lose quite a lot of rows. Consider which columns you can use and whether you have to do anything to them before clustering (see the sketch below).
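A sketch of the preparation, assuming the loaded object is called kidney. K-means needs numeric input, and scaling stops variables measured on large scales from dominating the distance calculation.
kidney_complete <- drop_na(kidney)
kidney_num <- kidney_complete %>%
  select(where(is.numeric)) %>%   # k-means can only use numeric columns
  scale()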
- Run the k-means clustering algorithm with 4 centers on the data. Look at the clusters you have generated.
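For example (nstart runs the algorithm from several random starts and keeps the best solution):
set.seed(42)
km4 <- kmeans(kidney_num, centers = 4, nstart = 25)
km4$cluster   # cluster assignment for each patient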
- Visualize the results of your clustering. Do 4 clusters seem like a good fit for our data in the first two dimensions (Dim1 and Dim2)? How about if you have a look at Dim3 or Dim4?
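fviz_cluster() from factoextra (loaded at the top) plots the clusters on principal-component dimensions, and its axes argument lets you switch dimensions:
fviz_cluster(km4, data = kidney_num, axes = c(1, 2))   # Dim1 vs Dim2
fviz_cluster(km4, data = kidney_num, axes = c(3, 4))   # Dim3 vs Dim4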
- Investigate the best number of clusters for this dataset. Use the silhouette metric.
- Re-do the clustering (plus visualization) with the optimal number of clusters.
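factoextra has a helper for the silhouette check; the re-clustering then reuses kmeans() with the suggested k (shown here for k = 2, assuming that is what the silhouette plot suggests):
fviz_nbclust(kidney_num, kmeans, method = "silhouette")
km2 <- kmeans(kidney_num, centers = 2, nstart = 25)
fviz_cluster(km2, data = kidney_num)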
- Now, try to figure out what the two clusters might represent. There are different ways to do this, but one easy way would be to simply compare the cluster IDs from the K-means output with one or more of the categorical variables from the dataset. You could use count() or table() for this.
- The withinss measure (within-cluster variance/spread) is much larger for one cluster than the other. What biological reason could there be for that?
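A sketch of both checks, assuming the two-cluster solution km2 and the complete-cases data from above:
table(km2$cluster, kidney_complete$class)   # cluster IDs vs the disease label
km2$withinss                                # within-cluster sum of squares, one value per cluster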