Presentation 5: Intro to Modelling in R

In this section we’ll look at how to define and fit a model in R.

Load packages

library(tidyverse)
library(caret)
library(ModelMetrics)

Load data

In order to focus on the technical aspects we’ll use a very simple toy dataset. It contains the number of cigarettes smoked per day and how long the person lived. It is inspired by this paper if you want to take a look.

df_smoke <- read.csv('../data/smoking.csv')
head(df_smoke)
  daily_cigarettes life
1                7   76
2               11   73
3               27   72
4               23   71
5               13   74
6               11   76

We will use this to perform a linear regression.

Linear Regression

Split Data into Training and Test Set

First, we will split our data into a training and a test set. There are numerous ways to do this; here we use sample_frac from dplyr:

# Set seed to ensure reproducibility
set.seed(123)  

#add an ID column to keep track of observations
df_smoke$ID <- 1:nrow(df_smoke)

train <- df_smoke %>% sample_frac(.75)
nrow(train)
[1] 75
head(train)
  daily_cigarettes life ID
1               29   72 31
2               16   73 79
3                5   78 51
4                3   77 14
5                4   79 67
6               23   71 42

As you can see, the IDs in train are shuffled and it only has 75 rows, since we asked for 75% of the data. Now all we have to do is identify the other 25%, i.e. the observations not in train. dplyr has a neat function called anti_join for that:

#from df_smoke remove what is in train by checking the ID column
test  <- anti_join(df_smoke, train, by = 'ID') 
nrow(test)
[1] 25
head(test)
  daily_cigarettes life ID
1                7   76  1
2               11   73  2
3               27   72  3
4               32   70 10
5                8   75 11
6               16   75 24
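As an aside, caret (which we loaded above) offers createDataPartition() as another common way to make this kind of split. A minimal sketch of a similar 75/25 division with that function (not used further below):

# alternative split using caret; returns row indices for the training portion
idx <- createDataPartition(df_smoke$life, p = 0.75, list = FALSE)
train_alt <- df_smoke[idx, ]
test_alt  <- df_smoke[-idx, ]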

Defining the model

As stated above, a linear regression model generally has the form:

\[y = b_0 + b_1*x_1 + b_2*x_2 + \dots + b_n*x_n\]

where we refer to \(b_0\) as the intercept and to \(b_1, \dots, b_n\) as the coefficients; there is typically one coefficient for each predictor \(x_i\). The goal of modelling is to estimate the values of \(b_0\) and all \(b_i\).

We need to tell R which of our variables is the outcome \(y\) and which predictors \(x_i\) we want to include in the model. This is referred to in the documentation as the model’s formula. Have a look:

#the formula is written like so:
lm(y ~ x_1 + x_2 + ...)
#see the help
?lm

In our case, \(y\) is the number of years lived and we have a single predictor \(x_1\), the number of cigarettes smoked per day. So our model formula is:

#remember to select the training data subset we defined above! 
model <- lm(life ~ daily_cigarettes, data = train)

Modelling results

By calling lm we have already trained our model! Let’s have a look at the results. The summary gives us a lot of information about the model we trained:

# View model summary
summary(model)

Call:
lm(formula = life ~ daily_cigarettes, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.71479 -1.03035 -0.06517  0.82928  2.84669 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      78.71479    0.26847  293.20   <2e-16 ***
daily_cigarettes -0.28074    0.01351  -20.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.251 on 73 degrees of freedom
Multiple R-squared:  0.8553,    Adjusted R-squared:  0.8533 
F-statistic: 431.5 on 1 and 73 DF,  p-value: < 2.2e-16

It begins with Call, which displays the formula used to fit the model.

The Residuals section summarizes the distribution of the residuals, which are the differences between the observed \(y\) values and the fitted \(y\) values.
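We can check this relationship directly on the fitted model object:

# residuals should equal the observed outcome minus the fitted values
head(train$life - fitted(model))
head(residuals(model))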

The Coefficients table shows the estimated values for each coefficient including the intercept, along with their standard errors, t-values, and p-values. These help to determine the significance of each predictor. Smaller p-values indicate stronger evidence against the null hypothesis that the true coefficient is zero.

In the bottom section we have some information about how well the model fits the training data. The Residual Standard Error (RSE) provides a measure of accuracy, as it represents the typical size of the residuals. The R-squared value indicates the proportion of variance explained by the model, with the Adjusted R-squared additionally penalizing the number of predictors to discourage overfitting. Finally, the F-statistic and its p-value test whether the model as a whole explains a significant portion of the variance in the response variable (the outcome \(y\)).

Overall, the summary helps us to assess the model fit and identify significant predictors and their effect size (size of the coefficient).
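If you also want confidence intervals for the estimated coefficients, base R’s confint() provides them for a fitted lm object:

# 95% confidence intervals for the intercept and the slope
confint(model)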

We can extract the model object’s components with $:

model$coefficients
     (Intercept) daily_cigarettes 
      78.7147894       -0.2807398 
hist(model$residuals, breaks = 30, main = 'Histogram of residuals', 
     xlab = 'Residual')

Model interpretation

What do these results mean? Our model formulation is:

\[life=b_0+b_1*cigarettes\]

And we estimated these values:

model$coefficients
     (Intercept) daily_cigarettes 
      78.7147894       -0.2807398 

Therefore:

  • The intercept \(b_0\) is the number of years we estimate a person in this dataset will live if they smoke 0 cigarettes per day. It is 78.7 years.

  • The coefficient of cigarettes per day is -0.28. This means that for every one-unit increase (one additional cigarette per day), the predicted life expectancy decreases by 0.28 years (see the worked example below).
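To make this concrete, we can plug a value into the fitted equation by hand. The snippet below is just a sketch using the coefficients estimated above, for someone smoking 20 cigarettes per day:

# manual prediction from the fitted coefficients
b <- coef(model)
b[1] + b[2] * 20
# roughly 78.71 - 0.28 * 20 ≈ 73.1 years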

Model performance

We now use our held-out test data to evaluate the model’s performance. For that, we predict life expectancy for the 25 observations in test and compare the predictions with the actual values.

#use the fitted model to make predictions for the test data
y_pred <- predict(model, test)
y_pred
       1        2        3        4        5        6        7        8 
76.74961 75.62665 71.13481 69.73112 76.46887 74.22295 77.59183 70.29260 
       9       10       11       12       13       14       15       16 
73.38073 70.01186 76.46887 75.62665 75.34591 76.46887 73.38073 75.34591 
      17       18       19       20       21       22       23       24 
76.18813 71.41555 73.66147 71.13481 71.97703 75.34591 72.25777 75.90739 
      25 
78.71479 

Let’s see how that fits with the known values.

pred <- tibble(pred = y_pred, real = test$life)

ggplot(pred, aes(x=real, y=pred)) +
  geom_point()
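To make the comparison easier to judge, we could also add a dashed y = x reference line (points on it would be perfectly predicted); a small sketch:

# same scatter plot with a y = x reference line
ggplot(pred, aes(x = real, y = pred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed')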

Not too bad! Formally, we usually calculate the mean squared error (MSE) between the predictions and the known true values.

mse(pred$real,pred$pred)
[1] 1.742902

Note that the MSE is in squared units (years squared); taking its square root gives a root mean squared error of about 1.3 years, i.e. our predictions are off by roughly 1.3 years on average.
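If you want that number directly in R, take the square root of the MSE (ModelMetrics also ships an rmse() helper that should give the same result):

# root mean squared error, back on the original scale of years
sqrt(mse(pred$real, pred$pred))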

Regression with categorical features

Now that we know how to make a simple linear model, how can we include categorical variables and what is the interpretation of their coefficients? To investigate this we will load a second version of the dataset that contains an additional variable: Exercise level.
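As a preview of the mechanics, a categorical column enters the formula just like any other predictor, and R expands it into dummy variables for us. The sketch below is hypothetical: it assumes the second dataset is loaded as df_smoke2 with an extra column called exercise, which may not match the actual file:

# hypothetical: df_smoke2 and the exercise column are assumed names, not the actual dataset
model_cat <- lm(life ~ daily_cigarettes + factor(exercise), data = df_smoke2)
summary(model_cat)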

Decision tree

Clustering: K-means