Exercise 5A: Intro to Regression in R
Introduction
In this exercise you will fit and interpret simple models.
- Load packages
library(tidyverse)
library(readxl)
library(ggfortify)
library(factoextra)
Part 1: Linear regression
We will use the dataset described below to fit a linear regression model.
- Load the data boston.csv and inspect it.
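A minimal way to do this with the tidyverse, assuming boston.csv sits in a ../data folder (adjust the path to your own setup):
boston <- read_csv("../data/boston.csv")
head(boston)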
This dataset describes conditions surrounding the Boston housing market in the 1970s. Each row describes a zone in the Boston area (so there is more than one house in each row).
The columns are:
crim - per capita crime rate
indus - proportion of non-retail businesses
nox - Nitrogen oxides concentration (air pollution)
rm - average number of rooms
neighborhood - the type of neighborhood the zone is in
medv - median value per house in 1000s
Explore the data
- Does the datatype of each column fit what it describes? Do you need to change any data types?
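One way to check, assuming the data frame is called boston: glimpse() shows the type of each column, and neighborhood will likely need to become a factor before modelling.
boston <- boston %>%
  mutate(neighborhood = as.factor(neighborhood))   # character -> factor
glimpse(boston)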
Making a model
- Split the dataset into test and training data. N.B.: for any categorical variables in the boston dataset, ensure that all levels are represented in both training and test sets. There are many ways to split a dataset - an easy way is to use the function sample_frac().
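A sketch of an 80/20 split along those lines; the row id makes the anti_join unambiguous, and grouping by neighborhood keeps all levels in both sets (object names are placeholders):
set.seed(42)                          # make the random split reproducible
boston <- boston %>% mutate(id = row_number())
train_data <- boston %>%
  group_by(neighborhood) %>%          # sample within each neighborhood level
  sample_frac(0.8) %>%
  ungroup()
test_data <- anti_join(boston, train_data, by = "id")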
- Using your training data, fit a linear model with number of rooms (rm), crime rate (crim) and neighborhood type (neighborhood) as the predictors and the value of a house (medv) as the response variable (y). Describe what information you get from the model summary.
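Assuming the split above, the model fit could look like this:
model <- lm(medv ~ rm + crim + neighborhood, data = train_data)
summary(model)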
- If you wanted to know if there is a difference in the value of houses between the Suburban and Urban neighborhoods, what could you do to the variable neighborhood before modelling? (One option is sketched after the diagnostic plot code below.)
- For linear regression there is an assumption that the model residuals (errors) are normally distributed. An easy way to visualize this is by simply calling plot() on your model (see below). What do you think based on the plots?
# Diagnostic plots of the model residuals
par(mfrow=c(2,2))
plot(model)
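For the Suburban-vs-Urban question above, one option is to relevel the factor so that Suburban becomes the reference level; the Urban coefficient in the model summary is then a direct Suburban-vs-Urban contrast. A minimal sketch, assuming neighborhood has a level called "Suburban":
train_data <- train_data %>%
  mutate(neighborhood = relevel(neighborhood, ref = "Suburban"))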
- Now, use your test set to predict the response medv (median value per house in 1000s).
- Evaluate how well your model performs. There are different ways of doing this, but let's use the classic measure of RMSE (Root Mean Square Error). The pseudo-code below shows how to calculate the RMSE. A small RMSE (close to zero) indicates a good model.
# RMSE
rmse <- sqrt(mean((y_test - y_pred)^2))
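Filled in with the object names used above (a sketch, assuming model and test_data from earlier):
y_pred <- predict(model, newdata = test_data)   # predicted medv
y_test <- test_data$medv                        # observed medv
rmse <- sqrt(mean((y_test - y_pred)^2))
rmse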
- Make a scatter plot to visualize how well the predicted values fit the observed values. Plot y_test against y_pred.
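For example with ggplot2, reusing y_test and y_pred from above; the dashed line marks a perfect prediction:
tibble(observed = y_test, predicted = y_pred) %>%
  ggplot(aes(x = observed, y = predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")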
Part 2: Logistic regression
For this part we will use the joined diabetes data since it has a categorical outcome (Diabetes yes or no). We will not use the oral glucose measurements as predictors since this is literally how you define diabetes, so we're loading the joined dataset we created in exercise 1, e.g. 'diabetes_join.xlsx' or whatever you have named it. N.B. if you did not manage to finish making this dataset or forgot to save it, you can find a copy here: ../out/diabetes_join.Rdata. Navigate to the file in the file window of RStudio and click on it. Click "Yes" to confirm that the file can be loaded into your environment and check that this has happened.
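If you would rather load the file by code than by clicking, either of these works (the paths are assumptions, adjust to where your file lives; load() restores the object under whatever name it was saved with):
diabetes <- read_xlsx("../out/diabetes_join.xlsx")   # if you saved your own xlsx
load("../out/diabetes_join.Rdata")                   # the provided copy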
As the outcome we are studying, Diabetes, is a categorical variable we will perform logistic regression. We select serum calcium levels (Serum_ca2), BMI and smoking habits (Smoker) as the predictive variables.
- Read in the Diabetes dataset.
- Logistic regression does not allow for any missing values, so first ensure you do not have NAs in your dataframe. Ensure that your outcome variable Diabetes is a factor.
- Split your data into training and test data. Take care that the two classes of the outcome variable are represented in both training and test data, and at similar ratios.
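A sketch of these steps, assuming the data frame is called diabetes and using the same grouped-sampling trick as in Part 1 to stratify by class:
diabetes <- diabetes %>%
  drop_na() %>%                        # logistic regression cannot handle NAs
  mutate(Diabetes = as.factor(Diabetes),
         id = row_number())
set.seed(42)
train_diab <- diabetes %>%
  group_by(Diabetes) %>%               # keep class ratios similar in both sets
  sample_frac(0.8) %>%
  ungroup()
test_diab <- anti_join(diabetes, train_diab, by = "id")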
- Fit a logistic regression model with Serum_ca2, BMI and Smoker as predictors and Diabetes as outcome, using your training data.
glm(..., family = 'binomial')
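Filled in with the variables above (assuming the training split is called train_diab):
log_model <- glm(Diabetes ~ Serum_ca2 + BMI + Smoker,
                 data = train_diab, family = "binomial")
summary(log_model)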
- Check the model summary and try to determine whether you could potentially drop one or more of your variables. If so, make this alternative model (model2) and compare it to the original model. Is there a significant loss/gain, i.e. a better fit, when including the serum calcium levels as a predictor?
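One way to compare nested models is a likelihood-ratio test via anova() — a sketch, assuming Serum_ca2 is the candidate for dropping:
log_model2 <- glm(Diabetes ~ BMI + Smoker,
                  data = train_diab, family = "binomial")
anova(log_model2, log_model, test = "Chisq")   # significant = calcium improves fit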
- Now, use your model to predict Diabetes class based on your test set. What does the output of the prediction mean?
predict(..., type = 'response')
- Let's evaluate the performance of our model. As we are performing classification, measures such as MSE/RMSE will not work; instead we will calculate the accuracy. In order to get the accuracy you must first convert your predictions into Diabetes class labels (e.g. 0 or 1).
caret::confusionMatrix(y_pred, y_test)
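Putting the last two steps together — a sketch that assumes Diabetes is coded 0/1, using a 0.5 cutoff to turn predicted probabilities into labels:
y_pred_prob <- predict(log_model, newdata = test_diab, type = "response")
y_pred <- factor(ifelse(y_pred_prob > 0.5, 1, 0),
                 levels = levels(test_diab$Diabetes))   # match the factor levels
y_test <- test_diab$Diabetes
caret::confusionMatrix(y_pred, y_test)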
Part 3: K-Means Clustering
In this part we will run K-means clustering. To mix it up a bit we will work with a new dataset from patients with kidney disease. The dataset contains approximately 20 biological measures (variables) collected across 400 patients. The outcome is the classification variable, which denotes whether a person suffers from chronic kidney disease (ckd) or not (notckd).
age - age
bp - blood pressure
rbc - red blood cells
pc - pus cell
pcc - pus cell clumps
ba - bacteria
bgr - blood glucose random
bu - blood urea
sc - serum creatinine
sod - sodium
pot - potassium
hemo - hemoglobin
pcv - packed cell volume
wc - white blood cell count
rc - red blood cell count
htn - hypertension
dm - diabetes mellitus
cad - coronary artery disease
appet - appetite
pe - pedal edema
ane - anemia
class - classification
- Load in the dataset named kidney_disease.Rdata.
- Before running K-means clustering, please remove rows with any missing values across all variables in your dataset - yes, you will lose quite a lot of rows. Consider which columns you can use and whether you have to do anything to them before clustering (see the sketch below).
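A sketch of the preparation, assuming the loaded object is called kidney. K-means needs numeric input, and scaling stops variables measured on large scales from dominating the distance calculation.
kidney_complete <- drop_na(kidney)
kidney_num <- kidney_complete %>%
  select(where(is.numeric)) %>%   # k-means can only use numeric columns
  scale()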
- Run the k-means clustering algorithm with 4 centers on the data. Look at the clusters you have generated.
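For example (nstart runs the algorithm from several random starts and keeps the best solution):
set.seed(42)
km4 <- kmeans(kidney_num, centers = 4, nstart = 25)
km4$cluster   # cluster assignment for each patient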
- Visualize the results of your clustering. Do 4 clusters seem like a good fit for our data in the first two dimensions (Dim1 and Dim2)? How about if you have a look at Dim3 or Dim4?
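fviz_cluster() from factoextra (loaded at the top) plots the clusters on principal-component dimensions, and its axes argument lets you switch dimensions:
fviz_cluster(km4, data = kidney_num, axes = c(1, 2))   # Dim1 vs Dim2
fviz_cluster(km4, data = kidney_num, axes = c(3, 4))   # Dim3 vs Dim4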
- Investigate the best number of clusters for this dataset. Use the silhouette metric.
- Re-do the clustering (plus visualization) with the optimal number of clusters.
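factoextra has a helper for the silhouette check; the re-clustering then reuses kmeans() with the suggested k (shown here for k = 2, assuming that is what the silhouette plot suggests):
fviz_nbclust(kidney_num, kmeans, method = "silhouette")
km2 <- kmeans(kidney_num, centers = 2, nstart = 25)
fviz_cluster(km2, data = kidney_num)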
- Now, try to figure out what the two clusters might represent. There are different ways to do this, but one easy way would be to simply compare the cluster IDs from the K-means output with one or more of the categorical variables from the dataset. You could use count() or table() for this.
- The withinss measure (within-cluster variance/spread) is much larger for one cluster than the other. What biological reason could there be for that?
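A sketch of both checks, assuming the two-cluster solution km2 and the complete-cases data from above:
table(km2$cluster, kidney_complete$class)   # cluster IDs vs the disease label
km2$withinss                                # within-cluster sum of squares, one value per cluster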