Exercise 5 - Modelling in R

Introduction

In this exercise you will fit and interpret simple models.

  1. Load packages

Part 1: Linear regression

We will use the dataset described below to fit a linear regression model.

  1. Load the data boston.csv and inspect it.

This dataset describes conditions surrounding the Boston housing market in the 1970s. Each row describes a zone in the Boston area (so there is more than one house in each row).

The columns are:

crim - per capita crime rate
indus - proportion of non-retail businesses
nox - Nitrogen oxides concentration (air pollution)
rm - average number of rooms
neighborhood - the type of neighborhood the zone is in
medv - median value per house in 1000s

Explore the data

  1. Does the datatype of each column fit to it what it describes? Do you need to change any data types?

Making a model

  1. Split the dataset into test and training data.

  2. Fit a model of how well the number of rooms (rm), crime rate (crim) and neighborhood type (neighborhood) predict the value of the houses (medv).

  3. Describe what information you get from the model summary.

  4. Scale the numeric predictor columns and redo the modelling. What has changed?

There is a scale function, see ?scale().

Part 2: Logistic regression

For this part we will use the joined diabetes data since it has a categorical outcome (Diabetes yes or no). We will not use the oral Glucose measurements as predictors since this is literally how you define diabetes, so we’re loading the joined dataset we created in exercise 1, e.g. ‘diabetes_join.xlsx’ or what you have named it.

We choose to make a regression model of Diabetes as predicted by serum calcium levels (Serum_ca2), BMI and smoking habits (Smoker).

  1. We cannot have NA values in our predictors so remove all rows with NAs and save the result into a new dataframe diabetes_nona.

  2. Make the outcome variable into a factor if it is not already.

  3. Scale all numeric predictors. Check your result.

  4. Split your data into training and test data. Take care that the two classes of the outcome variable are in the same ratio in both training and test data.

  5. Fit a regression model with Serum_ca2, BMI and Smoker as predictors. Check the model summary.

  6. Create a second model with only BMI and Smoker as predictors. Compare the fit of your second model to the first one (including Serum_ca2). Is there a significant gain, i.e. better fit when including the serum calcium levels as predictor? Which model do you think is better?

Part 3: Clustering

In this part we will run clustering on the joined diabetes dataset from exercise 1. Load it here if you don’t have it already from Part 2.

  1. Run the k-means clustering algorithm with 4 centers on the data. Consider which columns you can use and if you have to manipulate them before. If you get an error, check whether you have values that might not be admissible, such as NA.

  2. Check whether the data you have run k-means on has the same number of rows as the dataframe with meta information, e.g. whether the person had diabetes. If they are not aligned, create a dataframe with Diabetes info that matches the dataframe you ran clustering on.

  3. Visualize the results of your clustering.

  4. Investigate the best number of clusters.

  5. Re-do the clustering (plus visualization) with that number.


Extra exercises

e1. Find the best single predictor in the Diabetes dataset. This is done by comparing the null model (no predictors) to all possible models with one predictor, i.e. outcome ~ predictor, outcome ~ predictor2, ect. The null model can be formulated like so: outcome ~ 1 (only the intercept). Fit all possible one predictor models and compare their fit to the null model with a likelihood ratio test. Find the predictor with the lowest p-value in the likelihood ratio test. This can be done in a loop in order to avoid writing out all models.

To use a formula with a variable you will need to combine the literal part and the variable with paste, e.g. paste("Outcome ~", my_pred).

e2. Write a function that handles visualization of k-means clustering results. Think about which information you need to pass and what it should return.