library(tidyverse)
library(caret)
library(glmnet)
library(MASS)
library(randomForest)
Presentation 5B: Modelling in R
Part 1: Penalized Regression
Elastic Net regression is part of the family of penalized regressions, which also includes Ridge regression and LASSO regression. Penalized regressions are useful when dealing with many predictors, as they help eliminate less informative ones while retaining the important predictors, making them ideal for high-dimensional datasets.
If you are interested in knowing more about penalized regressions, you can have a look at this excellent tutorial from Datacamp.
In linear regression, we estimate the relationship between predictors \(X\) and an outcome \(y\) using parameters \(β\), chosen to minimize the residual sum of squares (RSS). Two key properties of these \(β\) estimates are bias (the difference between the true parameter and the estimate) and variance (how much the estimates vary across different samples). While OLS (Ordinary Least Squares) gives unbiased estimates, it can suffer from high variance, especially when predictors are numerous or highly correlated, leading to poor predictions.
To address this, we can introduce regularization, which slightly biases the estimates in exchange for reduced variance and improved predictive performance. This trade-off is essential: as model complexity grows, variance increases and bias decreases, so regularization helps find a better balance between the two.
Ridge Regression = L2 penalty (adds the sum of the squares of the coefficients to the loss function). It discourages large coefficients by penalizing their squared magnitudes, shrinking them towards zero. This reduces overfitting while keeping all variables in the model.
Lasso Regression = L1 penalty (adds the sum of the absolute values of the coefficients to the loss function). This penalty encourages sparsity, causing some coefficients to become exactly zero for large \(λ\), thereby performing variable selection.
Elastic Net combines L1 and L2 penalties to balance variable selection and coefficient shrinkage. One of the key advantages of Elastic Net over other types of penalized regression is its ability to handle multicollinearity and situations where the number of predictors exceeds the number of observations.
Lambda \((λ)\) controls the strength of the penalty. As \(λ\) increases, variance decreases and bias increases, raising the key question: how much bias are we willing to accept to reduce variance?
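To make the penalties concrete, the objective that glmnet (the package we use below) minimizes can be written, for a loss \(\ell\) (squared error for a gaussian outcome, negative log-likelihood for a binomial one), as:

\[
\min_{\beta_0, \beta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\!\left(y_i, \beta_0 + x_i^\top \beta\right) + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
\]

Here \(α\) is the mixing parameter (\(α = 1\) gives LASSO, \(α = 0\) gives Ridge; we return to it when fitting the model below) and \(λ\) is the overall penalty strength.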
Elastic Net regression
The R packages needed for the analysis (tidyverse, caret, glmnet, MASS, and randomForest) are loaded at the top of this document.
For this exercise we will use a dataset from patients with Heart Disease. Information on the columns in the dataset can be found here.
HD <- read_csv("../data/HeartDisease.csv")

head(HD)
# A tibble: 6 × 14
age sex chestPainType restBP chol fastingBP restElecCardio maxHR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 52 1 0 125 212 0 1 168
2 53 1 0 140 203 1 0 155
3 70 1 0 145 174 0 1 125
4 61 1 0 148 203 0 1 161
5 62 0 0 138 294 1 1 106
6 58 0 0 100 248 0 0 122
# ℹ 6 more variables: exerciseAngina <dbl>, STdepEKG <dbl>,
# slopePeakExST <dbl>, nMajorVessels <dbl>, DefectType <dbl>,
# heartDisease <dbl>
Firstly, let’s convert some of the variables that are encoded as numeric datatypes but should be factors:
<- c("sex",
facCols "chestPainType",
"fastingBP",
"restElecCardio",
"exerciseAngina",
"slopePeakExST",
"DefectType",
"heartDisease")
<- HD %>%
HD mutate_if(names(.) %in% facCols, as.factor)
head(HD)
# A tibble: 6 × 14
age sex chestPainType restBP chol fastingBP restElecCardio maxHR
<dbl> <fct> <fct> <dbl> <dbl> <fct> <fct> <dbl>
1 52 1 0 125 212 0 1 168
2 53 1 0 140 203 1 0 155
3 70 1 0 145 174 0 1 125
4 61 1 0 148 203 0 1 161
5 62 0 0 138 294 1 1 106
6 58 0 0 100 248 0 0 122
# ℹ 6 more variables: exerciseAngina <fct>, STdepEKG <dbl>,
# slopePeakExST <fct>, nMajorVessels <dbl>, DefectType <fct>,
# heartDisease <fct>
Let’s do some summary statistics to have a look at the variables we have in our dataset. Firstly, the numeric columns. We can get a quick overview of variable distributions and ranges with some histograms.
# Reshape data to long format for ggplot2
long_data <- HD %>%
  dplyr::select(where(is.numeric)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value")
head(long_data)
# A tibble: 6 × 2
variable value
<chr> <dbl>
1 age 52
2 restBP 125
3 chol 212
4 maxHR 168
5 STdepEKG 1
6 nMajorVessels 2
# Plot histograms for each numeric variable in one grid
ggplot(long_data,
aes(x = value)) +
geom_histogram(binwidth = 0.5, fill = "#9395D3", color ='grey30') +
facet_wrap(vars(variable), scales = "free") +
theme_minimal()
Unlike hypothesis tests and classic linear regression, penalized regression does not assume that predictors or model residuals are normally distributed, so we do not have to test for that, yeah! However, it does still assume that the relationship between predictors and the outcome is linear and that observations are independent of one another.
Importantly, penalized regression can be sensitive to large differences in the range of numeric/integer variables, and it does not handle missing values, so let's remove missing values (if any) and scale our numeric variables.
Question: What happens to the data when it is scaled?
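For intuition, here is a minimal sketch (toy data, not part of the analysis) of what scale() does: it centers a numeric variable to mean 0 and divides by its standard deviation, putting all variables on a comparable scale.

x <- c(2, 4, 6, 8)
scale(x)                    # same as (x - mean(x)) / sd(x)
mean(scale(x))              # ~0 (up to floating point error)
sd(as.numeric(scale(x)))    # 1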
HD_EN <- HD %>%
  drop_na(.)

HD_EN <- HD_EN %>%
  mutate(across(where(is.numeric), scale))
head(HD_EN)
# A tibble: 6 × 14
age[,1] sex chestPainType restBP[,1] chol[,1] fastingBP restElecCardio
<dbl> <fct> <fct> <dbl> <dbl> <fct> <fct>
1 -0.268 1 0 -0.377 -0.659 0 1
2 -0.158 1 0 0.479 -0.833 1 0
3 1.72 1 0 0.764 -1.40 0 1
4 0.724 1 0 0.936 -0.833 0 1
5 0.834 0 0 0.365 0.930 1 1
6 0.393 0 0 -1.80 0.0388 0 0
# ℹ 7 more variables: maxHR <dbl[,1]>, exerciseAngina <fct>,
# STdepEKG <dbl[,1]>, slopePeakExST <fct>, nMajorVessels <dbl[,1]>,
# DefectType <fct>, heartDisease <fct>
Now, let’s check the balance of the categorical/factor variables.
HD_EN %>%
  dplyr::select(where(is.factor)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Level") %>%
  count(Variable, Level, name = "Count")
# A tibble: 22 × 3
Variable Level Count
<chr> <fct> <int>
1 DefectType 0 7
2 DefectType 1 64
3 DefectType 2 544
4 DefectType 3 410
5 chestPainType 0 497
6 chestPainType 1 167
7 chestPainType 2 284
8 chestPainType 3 77
9 exerciseAngina 0 680
10 exerciseAngina 1 345
# ℹ 12 more rows
# OR
# cat_cols <- HD_EN %>% dplyr::select(where(is.factor)) %>% colnames()
#
# for (col in cat_cols){
#   print(col)
#   print(table(HD_EN[[col]]))
# }
From our count table above we see that the variables DefectType, chestPainType, restElecCardio, and slopePeakExST are unbalanced. DefectType and restElecCardio are especially problematic, with only 7 and 15 observations, respectively, for one of their factor levels.
To avoid issues when modelling, we will filter out these observations and re-level these two variables.
HD_EN <- HD_EN %>%
  filter(DefectType != 0 & restElecCardio != 2) %>%
  mutate(DefectType = as.factor(as.character(DefectType)),
         restElecCardio = as.factor(as.character(restElecCardio)))
We will use heartDisease (0 = no, 1 = yes) as the outcome.
We split our dataset into a train and a test set: we will keep 70% of the data in the training set and hold out 30% for the test set. To keep track of our train and test samples, we will make an ID variable. Importantly, afterwards we must again ensure that all levels of each factor variable are represented in both sets.
# Add ID column
HD_EN <- HD_EN %>%
  mutate(ID = paste0("ID", 1:nrow(HD_EN)))

# Set seed
set.seed(123)

# Training set
train <- HD_EN %>%
  sample_frac(0.70)

# Check group levels
train %>%
  dplyr::select(where(is.factor)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Level") %>%
  count(Variable, Level, name = "Count")
# A tibble: 20 × 3
Variable Level Count
<chr> <fct> <int>
1 DefectType 1 42
2 DefectType 2 374
3 DefectType 3 286
4 chestPainType 0 330
5 chestPainType 1 120
6 chestPainType 2 196
7 chestPainType 3 56
8 exerciseAngina 0 478
9 exerciseAngina 1 224
10 fastingBP 0 601
11 fastingBP 1 101
12 heartDisease 0 333
13 heartDisease 1 369
14 restElecCardio 0 346
15 restElecCardio 1 356
16 sex 0 209
17 sex 1 493
18 slopePeakExST 0 50
19 slopePeakExST 1 323
20 slopePeakExST 2 329
# Test set
test <- anti_join(HD_EN, train, by = 'ID')

# Check group levels
test %>%
  dplyr::select(where(is.factor)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Level") %>%
  count(Variable, Level, name = "Count")
# A tibble: 20 × 3
Variable Level Count
<chr> <fct> <int>
1 DefectType 1 18
2 DefectType 2 163
3 DefectType 3 120
4 chestPainType 0 151
5 chestPainType 1 47
6 chestPainType 2 82
7 chestPainType 3 21
8 exerciseAngina 0 192
9 exerciseAngina 1 109
10 fastingBP 0 253
11 fastingBP 1 48
12 heartDisease 0 150
13 heartDisease 1 151
14 restElecCardio 0 148
15 restElecCardio 1 153
16 sex 0 89
17 sex 1 212
18 slopePeakExST 0 20
19 slopePeakExST 1 144
20 slopePeakExST 2 137
After dividing into train and test sets, we pull the outcome variable heartDisease into its own vector for each dataset; we name these y_train and y_test.
y_train <- train %>%
  pull(heartDisease)

y_test <- test %>%
  pull(heartDisease)
Next, we remove the outcome variable heartDisease from the train and test sets, as well as ID, as we should obviously not use these for training or testing.
train <- train %>%
  dplyr::select(!c(ID, heartDisease))

test <- test %>%
  dplyr::select(!c(ID, heartDisease))
Another benefit of a regularized regression, such as Elastic Net, is that this type of model can accommodate a categorical or a numeric variable as outcome, and handle a mix of these types as predictors, making it super flexible.
However, if you do use categorical variables as your predictors within the glmnet() function, you need to ensure that they are dummy-coded (one-hot encoded). Dummy-coding means that categorical levels are converted into binary numeric indicators. You can do this 'manually', but there is a super easy way to do it with model.matrix(), as shown below.
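For intuition, here is a toy sketch (illustrative data) of what model.matrix() does to a factor: each level becomes its own 0/1 indicator column, and the - 1 in the formula drops the intercept so all levels appear.

toy <- data.frame(f = factor(c("a", "b", "c", "a")))
model.matrix(~ f - 1, data = toy)   # columns fa, fb, fc of 0/1 indicators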
Let's create the model matrix needed as input to the glmnet() and cv.glmnet() functions:
modTrain <- model.matrix(~ . - 1, data = train)

head(modTrain)
age sex0 sex1 chestPainType1 chestPainType2 chestPainType3
1 0.28282315 1 0 0 0 0
2 0.61350040 1 0 0 0 1
3 -1.37056311 0 1 0 1 0
4 -0.04785411 1 0 0 1 0
5 -0.48875711 0 1 0 0 0
6 0.83395190 0 1 0 1 0
restBP chol fastingBP1 restElecCardio1 maxHR exerciseAngina1
1 -0.66289286 2.0933271 0 1 0.60358256 1
2 1.04975673 -0.1162960 0 1 0.95132211 0
3 -0.09200966 -1.2792555 0 1 0.03850579 0
4 -1.34795270 0.4070358 0 0 0.77745234 0
5 0.70722681 -0.8916023 0 0 -1.00471285 1
6 -0.09200966 -0.2907399 0 1 -0.13536398 0
STdepEKG slopePeakExST1 slopePeakExST2 nMajorVessels DefectType2
1 -0.4012688 0 1 -0.7316143 1
2 -0.1459612 0 1 -0.7316143 1
3 -0.9118839 0 1 -0.7316143 1
4 -0.9118839 0 1 -0.7316143 1
5 -0.1459612 1 0 -0.7316143 0
6 0.6199615 1 0 2.1787531 0
DefectType3
1 0
2 0
3 0
4 0
5 1
6 1
modTest <- model.matrix(~ . - 1, data = test)
Let's create our Elastic Net regression model with glmnet().
The parameter \(α\) essentially tells glmnet() whether we are performing Ridge regression (\(α\) = 0), LASSO regression (\(α\) = 1), or Elastic Net regression (0 < \(α\) < 1). Furthermore, as for logistic regression, we must specify whether our outcome is binomial, multinomial, gaussian, etc.
EN_model <- glmnet(modTrain, y_train, alpha = 0.5, family = "binomial")
As you may recall from the first part of this presentation, penalized regressions have a hyperparameter, \(λ\), which determines the strength of shrinkage of the coefficients. We can use k-fold cross-validation to find the value of lambda that minimizes the cross-validated prediction error. For classification problems such as the one we have here, the prediction error is usually the binomial deviance, which is related to log-loss (a measure of how well predicted probabilities match the true class labels).
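For reference, the (mean) binomial deviance minimized during cross-validation is, up to the saturated-model constant, twice the average negative log-likelihood of the predicted probabilities \(\hat{p}_i\):

\[
\text{deviance} = -\frac{2}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \right]
\]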
Let's use cv.glmnet() to obtain the best value of the hyperparameter lambda (\(λ\)). We should remember to set a seed for reproducible results.
set.seed(123)
cv_model <- cv.glmnet(modTrain, y_train, alpha = 0.5, family = "binomial")
We can plot all the values of lambda tested during cross-validation by calling plot() on the output of cv.glmnet(), and we can extract the best lambda value from the cv.glmnet() model and save it as an object.
plot(cv_model)

bestLambda <- cv_model$lambda.min
The plot shows how the model's prediction error changes with different values of lambda, \(λ\). It helps to identify the largest \(λ\) we can choose before the penalty starts to hurt performance: too much shrinkage can remove important predictors, increasing prediction error.
bestLambda
[1] 0.003892802
log(bestLambda)
[1] -5.548626
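As an aside, cv.glmnet() also stores a second candidate, lambda.1se: the largest \(λ\) whose cross-validated error is within one standard error of the minimum. It is a common, more conservative choice that yields a sparser model.

cv_model$lambda.1se   # largest lambda within 1 SE of the minimum CV error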
Time to see how well our model performs. Let's predict whether an individual is likely to have heart disease using our model and our test set.
y_pred <- predict(EN_model, s = bestLambda, newx = modTest, type = 'class')
Just like for the logistic regression model, we can calculate the accuracy of the prediction by comparing it to y_test with confusionMatrix().
y_pred <- as.factor(y_pred)

caret::confusionMatrix(y_pred, y_test)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 124 21
1 26 130
Accuracy : 0.8439
95% CI : (0.7978, 0.883)
No Information Rate : 0.5017
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6877
Mcnemar's Test P-Value : 0.5596
Sensitivity : 0.8267
Specificity : 0.8609
Pos Pred Value : 0.8552
Neg Pred Value : 0.8333
Prevalence : 0.4983
Detection Rate : 0.4120
Detection Prevalence : 0.4817
Balanced Accuracy : 0.8438
'Positive' Class : 0
Our model performs relatively well with a Balanced Accuracy of 0.84.
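If you also want a threshold-free performance measure, a sketch like the following could be used, assuming the pROC package is installed (it is not loaded above): we predict probabilities instead of classes and compute the area under the ROC curve (AUC).

library(pROC)

# Predicted probabilities of class 1 from the Elastic Net model
y_prob <- predict(EN_model, s = bestLambda, newx = modTest, type = "response")

# AUC: probability that a random case is ranked above a random control
roc_obj <- roc(response = y_test, predictor = as.numeric(y_prob))
auc(roc_obj)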
Just like with linear or logistic regression, we can pull out the coefficients (weights) from our model to assess which variables are the most explanatory for heart disease. We use the function coef() for this.
coeffs <- coef(EN_model, s = bestLambda)

coeffs
19 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 0.22782401
age -0.05823254
sex0 0.75259034
sex1 -0.70972944
chestPainType1 1.26049789
chestPainType2 1.78481019
chestPainType3 1.55254904
restBP -0.39726763
chol -0.25270564
fastingBP1 0.26701756
restElecCardio1 0.48761187
maxHR 0.46555870
exerciseAngina1 -0.53368076
STdepEKG -0.48693148
slopePeakExST1 -0.94379010
slopePeakExST2 0.05576804
nMajorVessels -0.75202020
DefectType2 0.00917183
DefectType3 -1.23099327
First of all, we see that none of our explanatory variables have been penalized so much that they have been removed, although some, like age, contribute very little to the model.
Let’s order the coefficients by size and plot them to get an easy overview. First we do a little data wrangling to set up the dataset.
coeffsDat <- as.data.frame(as.matrix(coeffs)) %>%
  rownames_to_column(var = 'VarName') %>%
  arrange(desc(s1)) %>%
  filter(!str_detect(VarName, "(Intercept)")) %>%
  mutate(VarName = factor(VarName, levels = VarName))
Now we can make a bar plot to visualize our results.
# Plot
ggplot(coeffsDat, aes(x = VarName, y = s1)) +
  geom_bar(stat = "identity", fill = "#9395D3") +
  coord_flip() +
  labs(title = "Feature Importance for Elastic Net",
       x = "Features",
       y = "Coefficient") +
  theme_classic()
From the coefficients above, it seems like chest pain of any type (0 vs. 1, 2, or 3) is a strong predictor of the outcome, i.e. heart disease. In contrast, having DefectType3 significantly lowers the predicted probability of belonging to the event class (1 = heart disease); specifically, it decreases the log-odds of belonging to class 1.
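Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which can be easier to communicate; for example (illustrative):

exp(coeffs["DefectType3", 1])   # odds ratio; values < 1 mean lower odds of heart disease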
Part 2: Random Forest
In this section, we will train a Random Forest (RF) model using the same dataset and outcome as above. Random Forest is a simple ensemble machine learning method that builds multiple decision trees and combines their predictions to improve accuracy and robustness. By averaging the results of many trees, it reduces overfitting and increases generalization, making it particularly effective for complex, non-linear relationships. One of its key strengths is its ability to handle large datasets with many features, while also providing insights into feature importance.
Why do we want to try a RF? Unlike linear, logistic, or elastic net regression, RF does not assume a linear relationship between predictors and the outcome — it can naturally capture non-linear patterns and complex interactions between variables.
Another advantage is that RF considers one predictor at a time when splitting, making it robust to differences in variable scales and allowing it to handle categorical variables directly, without requiring dummy coding.
The downside of an RF model is that it typically requires a reasonably large sample size to perform well and can be less interpretable compared to regression-based approaches.
RF Model
Luckily, we already have a good understanding of our dataset so we won’t spend time on exploratory data analysis. We have the data loaded already:
head(HD)
# A tibble: 6 × 14
age sex chestPainType restBP chol fastingBP restElecCardio maxHR
<dbl> <fct> <fct> <dbl> <dbl> <fct> <fct> <dbl>
1 52 1 0 125 212 0 1 168
2 53 1 0 140 203 1 0 155
3 70 1 0 145 174 0 1 125
4 61 1 0 148 203 0 1 161
5 62 0 0 138 294 1 1 106
6 58 0 0 100 248 0 0 122
# ℹ 6 more variables: exerciseAngina <fct>, STdepEKG <dbl>,
# slopePeakExST <fct>, nMajorVessels <dbl>, DefectType <fct>,
# heartDisease <fct>
For RF there is no need to scale numeric predictors or dummy-code categorical predictors. However, we do need to convert the outcome variable heartDisease from binary (0 or 1) to a category name (this is required by the function we will use for random forest later).
# Mutate outcome to category and add ID column for splitting
HD_RF <- HD %>%
  mutate(heartDisease = as.factor(as.character(ifelse(heartDisease == 1, "yesHD", "noHD")))) %>%
  mutate(ID = paste0("W", 1:nrow(HD)))

head(HD_RF$heartDisease)
[1] noHD noHD noHD noHD noHD yesHD
Levels: noHD yesHD
Then, we can split our dataset into training and test sets. You may notice that, unlike for the elastic net regression above, we keep the outcome variable in the dataset. Whether you need to keep it in or must pull it out depends on the R package (and functions) you are using.
# Set seed
set.seed(123)

# Training set
train <- HD_RF %>%
  sample_frac(0.70)

# Test set
test <- anti_join(HD_RF, train, by = 'ID')

# Remove the ID, which we do not want to use for training.
train <- train %>%
  dplyr::select(-ID)

test <- test %>%
  dplyr::select(-ID)
Now let's set up a Random Forest model with cross-validation, so that we do not overfit our model. The R package caret has a very versatile function, trainControl(), which can be used with a range of resampling methods, including bootstrapping, out-of-bag error, and leave-one-out cross-validation.
set.seed(123)

# Set up cross-validation: 5-fold CV
RFcv <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)
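For reference, a couple of the other resampling methods mentioned above can be requested in the same way; a sketch (the numbers are arbitrary choices, not recommendations):

# Bootstrap resampling with 25 resamples
trainControl(method = "boot", number = 25)

# 5-fold cross-validation repeated 3 times
trainControl(method = "repeatedcv", number = 5, repeats = 3)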
Now that we have set up the parameters for cross-validation in the RFcv object above, we can feed it to the train() function from the caret package. We also specify the training data, the name of the outcome variable, and, importantly, that we want to perform random forest (method = "rf"), as the train() function can be used for many different models.
# Train Random Forest
set.seed(123)

rf_model <- train(
  heartDisease ~ .,
  data = train,
  method = "rf",
  trControl = RFcv,
  metric = "ROC",
  tuneLength = 5
)
# Model summary
print(rf_model)
Random Forest
718 samples
13 predictor
2 classes: 'noHD', 'yesHD'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 575, 574, 574, 575, 574
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.9916503 0.9464386 0.9587900
6 0.9978648 0.9802817 0.9670852
10 0.9976344 0.9830986 0.9670852
14 0.9972088 0.9830986 0.9670852
19 0.9966214 0.9830986 0.9616058
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 6.
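As a sanity check on the tuned value: mtry is the number of predictors randomly sampled at each split, and randomForest's default for classification is floor(sqrt(p)). With the 19 dummy-expanded predictor columns seen in the tuning grid and the importance table below, that would be:

floor(sqrt(19))   # = 4, so the tuned mtry = 6 is in the same neighbourhood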
Next, we can inspect the best tuning parameter and plot the model fit to see how cross-validated performance changes with the number of randomly sampled predictors (mtry).
# Best parameters
rf_model$bestTune
mtry
2 6
# Plot performance
plot(rf_model)
Now, we use the test set to evaluate our model performance.
# Predict class probabilities
y_pred <- predict(rf_model, newdata = test, type = "prob")

y_pred
noHD yesHD
1 0.914 0.086
2 0.980 0.020
3 0.928 0.072
4 0.922 0.078
5 0.950 0.050
6 1.000 0.000
7 0.782 0.218
8 0.998 0.002
9 0.882 0.118
10 0.038 0.962
11 0.002 0.998
12 0.014 0.986
13 0.972 0.028
14 0.006 0.994
15 0.000 1.000
16 0.126 0.874
17 0.866 0.134
18 0.950 0.050
19 0.066 0.934
20 0.982 0.018
21 0.048 0.952
22 0.918 0.082
23 0.008 0.992
24 0.912 0.088
25 0.852 0.148
26 0.916 0.084
27 0.990 0.010
28 0.976 0.024
29 0.022 0.978
30 0.706 0.294
31 0.000 1.000
32 0.026 0.974
33 0.998 0.002
34 0.206 0.794
35 0.890 0.110
36 0.002 0.998
37 0.098 0.902
38 0.076 0.924
39 0.932 0.068
40 0.942 0.058
41 0.302 0.698
42 0.996 0.004
43 1.000 0.000
44 0.024 0.976
45 0.000 1.000
46 0.066 0.934
47 0.320 0.680
48 0.730 0.270
49 0.044 0.956
50 0.044 0.956
51 0.990 0.010
52 0.002 0.998
53 0.082 0.918
54 0.022 0.978
55 0.126 0.874
56 0.076 0.924
57 0.112 0.888
58 0.010 0.990
59 0.998 0.002
60 0.084 0.916
61 0.998 0.002
62 0.054 0.946
63 0.286 0.714
64 0.006 0.994
65 0.044 0.956
66 0.060 0.940
67 0.964 0.036
68 0.022 0.978
69 0.002 0.998
70 0.000 1.000
71 0.706 0.294
72 1.000 0.000
73 0.318 0.682
74 0.960 0.040
75 1.000 0.000
76 0.006 0.994
77 0.998 0.002
78 0.988 0.012
79 0.000 1.000
80 0.942 0.058
81 0.260 0.740
82 0.000 1.000
83 0.942 0.058
84 0.036 0.964
85 0.096 0.904
86 0.912 0.088
87 0.006 0.994
88 0.034 0.966
89 0.142 0.858
90 0.976 0.024
91 0.988 0.012
92 0.474 0.526
93 0.034 0.966
94 0.024 0.976
95 0.048 0.952
96 0.984 0.016
97 0.416 0.584
98 0.002 0.998
99 0.068 0.932
100 0.996 0.004
101 0.060 0.940
102 0.902 0.098
103 0.024 0.976
104 0.000 1.000
105 0.002 0.998
106 0.014 0.986
107 0.882 0.118
108 0.202 0.798
109 0.998 0.002
110 0.064 0.936
111 0.028 0.972
112 0.016 0.984
113 0.932 0.068
114 0.958 0.042
115 0.002 0.998
116 0.992 0.008
117 0.968 0.032
118 0.898 0.102
119 0.068 0.932
120 0.026 0.974
121 0.072 0.928
122 0.980 0.020
123 0.038 0.962
124 1.000 0.000
125 0.416 0.584
126 0.072 0.928
127 0.004 0.996
128 0.036 0.964
129 0.990 0.010
130 0.984 0.016
131 0.002 0.998
132 0.008 0.992
133 0.000 1.000
134 0.990 0.010
135 0.024 0.976
136 0.076 0.924
137 0.990 0.010
138 0.000 1.000
139 0.994 0.006
140 0.868 0.132
141 1.000 0.000
142 0.084 0.916
143 0.206 0.794
144 0.036 0.964
145 0.318 0.682
146 0.230 0.770
147 0.964 0.036
148 0.008 0.992
149 0.064 0.936
150 0.002 0.998
151 0.942 0.058
152 0.142 0.858
153 1.000 0.000
154 0.956 0.044
155 0.868 0.132
156 0.980 0.020
157 0.000 1.000
158 0.928 0.072
159 0.982 0.018
160 0.882 0.118
161 0.916 0.084
162 0.286 0.714
163 0.988 0.012
164 0.012 0.988
165 0.024 0.976
166 0.014 0.986
167 0.970 0.030
168 0.230 0.770
169 0.926 0.074
170 1.000 0.000
171 0.942 0.058
172 0.000 1.000
173 0.064 0.936
174 0.976 0.024
175 0.072 0.928
176 0.002 0.998
177 1.000 0.000
178 0.868 0.132
179 0.004 0.996
180 1.000 0.000
181 0.008 0.992
182 0.006 0.994
183 1.000 0.000
184 0.962 0.038
185 0.474 0.526
186 0.938 0.062
187 0.938 0.062
188 0.978 0.022
189 0.928 0.072
190 0.942 0.058
191 0.064 0.936
192 0.964 0.036
193 1.000 0.000
194 0.002 0.998
195 0.890 0.110
196 0.854 0.146
197 0.052 0.948
198 0.000 1.000
199 0.914 0.086
200 0.024 0.976
201 0.064 0.936
202 0.060 0.940
203 0.190 0.810
204 0.998 0.002
205 0.320 0.680
206 0.962 0.038
207 0.922 0.078
208 0.000 1.000
209 0.986 0.014
210 0.002 0.998
211 0.782 0.218
212 0.998 0.002
213 0.002 0.998
214 0.994 0.006
215 0.962 0.038
216 0.004 0.996
217 0.148 0.852
218 1.000 0.000
219 0.008 0.992
220 0.474 0.526
221 0.054 0.946
222 0.016 0.984
223 0.016 0.984
224 0.932 0.068
225 0.982 0.018
226 0.014 0.986
227 0.998 0.002
228 0.982 0.018
229 0.114 0.886
230 0.048 0.952
231 0.866 0.134
232 0.026 0.974
233 0.992 0.008
234 1.000 0.000
235 0.998 0.002
236 0.994 0.006
237 0.854 0.146
238 1.000 0.000
239 0.036 0.964
240 0.730 0.270
241 0.064 0.936
242 0.302 0.698
243 0.992 0.008
244 0.014 0.986
245 0.022 0.978
246 0.002 0.998
247 0.964 0.036
248 0.026 0.974
249 0.202 0.798
250 0.148 0.852
251 0.076 0.924
252 0.988 0.012
253 0.114 0.886
254 0.230 0.770
255 0.706 0.294
256 0.980 0.020
257 0.998 0.002
258 0.976 0.024
259 0.988 0.012
260 0.026 0.974
261 0.416 0.584
262 0.852 0.148
263 0.914 0.086
264 0.060 0.940
265 0.320 0.680
266 0.040 0.960
267 0.880 0.120
268 0.084 0.916
269 0.934 0.066
270 0.988 0.012
271 0.988 0.012
272 0.782 0.218
273 0.230 0.770
274 0.854 0.146
275 0.994 0.006
276 0.004 0.996
277 0.994 0.006
278 0.004 0.996
279 0.902 0.098
280 0.978 0.022
281 1.000 0.000
282 0.014 0.986
283 0.010 0.990
284 0.954 0.046
285 0.034 0.966
286 0.190 0.810
287 0.038 0.962
288 0.974 0.026
289 0.012 0.988
290 0.034 0.966
291 0.976 0.024
292 0.318 0.682
293 0.082 0.918
294 0.032 0.968
295 0.124 0.876
296 0.956 0.044
297 0.918 0.082
298 0.006 0.994
299 0.932 0.068
300 0.260 0.740
301 0.004 0.996
302 0.054 0.946
303 0.932 0.068
304 0.922 0.078
305 1.000 0.000
306 1.000 0.000
307 0.988 0.012
y_pred <- as.factor(ifelse(y_pred$yesHD > 0.5, "yesHD", "noHD"))

caret::confusionMatrix(y_pred, test$heartDisease)
Confusion Matrix and Statistics
Reference
Prediction noHD yesHD
noHD 145 0
yesHD 0 162
Accuracy : 1
95% CI : (0.9881, 1)
No Information Rate : 0.5277
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.4723
Detection Rate : 0.4723
Detection Prevalence : 0.4723
Balanced Accuracy : 1.0000
'Positive' Class : noHD
Lastly, we can extract the predictive variables with the greatest importance from our fit.
varImpOut <- varImp(rf_model)

varImpOut$importance
Overall
age 70.849045
sex1 23.642161
chestPainType1 8.087142
chestPainType2 33.997946
chestPainType3 16.629339
restBP 61.528727
chol 67.870275
fastingBP1 7.750244
restElecCardio1 15.653588
restElecCardio2 0.000000
maxHR 95.828681
exerciseAngina1 58.691710
STdepEKG 76.039553
slopePeakExST1 20.352056
slopePeakExST2 33.839912
nMajorVessels 91.600454
DefectType1 3.171842
DefectType2 100.000000
DefectType3 67.485474
varImportance <- as.data.frame(as.matrix(varImpOut$importance)) %>%
  rownames_to_column(var = 'VarName') %>%
  arrange(desc(Overall))

varImportance
VarName Overall
1 DefectType2 100.000000
2 maxHR 95.828681
3 nMajorVessels 91.600454
4 STdepEKG 76.039553
5 age 70.849045
6 chol 67.870275
7 DefectType3 67.485474
8 restBP 61.528727
9 exerciseAngina1 58.691710
10 chestPainType2 33.997946
11 slopePeakExST2 33.839912
12 sex1 23.642161
13 slopePeakExST1 20.352056
14 chestPainType3 16.629339
15 restElecCardio1 15.653588
16 chestPainType1 8.087142
17 fastingBP1 7.750244
18 DefectType1 3.171842
19 restElecCardio2 0.000000
Variable importance is based on how much each variable improves the model's accuracy across splits. DefectType2 might be involved in important interactions, or might split the data in a very informative way early in the trees.
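caret also provides a plot method for varImp() output, which gives a quick visual ranking, e.g. of the top 10 variables:

plot(varImpOut, top = 10)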
In terms of comparing the outcome of the Random Forest model with the Elastic Net Regression, don’t expect identical top features — they reflect different model assumptions.
Use both models as complementary tools:
EN: for interpretable, linear relationships
RF: for capturing complex patterns and variable interactions
If a feature ranks high in both models, it’s a strong signal that the feature is important.
If a feature ranks high in one but not the other — explore further: interaction? non-linearity? collinearity?
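As a final sketch of such an exploration (assuming the objects created above are still in the workspace), we can rank the features in each model and join the rankings. Note that the two models were fit on slightly different versions of the data, so some dummy-coded names will not match and are dropped by the inner join.

comparison <- coeffsDat %>%
  mutate(VarName = as.character(VarName),
         EN_rank = rank(-abs(s1))) %>%
  dplyr::select(VarName, EN_rank) %>%
  inner_join(varImportance %>% mutate(RF_rank = rank(-Overall)),
             by = "VarName")

comparison %>% arrange(EN_rank)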