Exercise 3: Exploratory Data Analysis
Getting started
Load the packages you think you need for this exercise. You will be doing PCA, so have a look at which packages we used in Presentation 3. No worries if you forget any; you can always load them later on.
Read in the joined diabetes data set you created in Exercise 2. If you did not make it all the way through Exercise 2, you can find the dataset you need in ../data/exercise2_diabetes_glucose.xlsx.
Have a look at the data type (numeric, categorical, factor) of each column to ensure these make sense. If needed, convert variables to the correct (most logical) type.
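As a minimal sketch of the type check and conversion, the block below uses a small made-up tibble rather than the real file. The column ID and all values are hypothetical; only Diabetes, Age and BMI are names taken from this exercise. The commented-out read_excel() line assumes the readxl package is installed.

```r
library(tidyverse)

# With the real file you would start from something like:
# diabetes <- readxl::read_excel("../data/exercise2_diabetes_glucose.xlsx")

# Toy stand-in (made-up values, hypothetical ID column)
diabetes_toy <- tibble(
  ID = c(1, 2, 3, 4),
  Diabetes = c("0", "1", "0", "1"),
  Age = c("34", "51", "47", "62"),   # numbers accidentally stored as text
  BMI = c(24.1, 31.7, 27.3, 29.8)
)

# Overview of the current type of each column
glimpse(diabetes_toy)

# Convert each column to the type that matches its meaning
diabetes_toy <- diabetes_toy %>%
  mutate(
    ID = as.character(ID),           # an identifier, not a quantity
    Age = as.numeric(Age),           # text to number
    Diabetes = as.factor(Diabetes)   # category with a fixed set of levels
  )

# Named vector with the class of each column after conversion
sapply(diabetes_toy, class)
```

Which columns need converting in your own data depends on how the file was read in, so treat the conversions here as examples rather than a checklist.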
Check Data Distributions
Let’s have a look at the distributions of the numerical variables.
Make histograms of the three Glucose (mmol/L) measurements in your dataset. What do you observe? Are the three groups of values normally distributed?
Just as in the previous question, make histograms of the three Glucose (mmol/L) measurement variables, BUT this time stratify your dataset by the variable Diabetes. How do your distributions look now?
Try: facet_wrap(Var1 ~ Var2, scales = "free").
Make a qqnorm plot for the other numeric variables: Age, Bloodpressure, BMI, PhysicalActivity and Serum_ca2. What do the plots tell you?
From the qqnorm plots above you will see that one of the variables in particular seems to be far from normally distributed. What type of transformation could you potentially apply to this variable to make it closer to normal? Transform it and make a histogram or qqnorm plot. Did the transformation help?
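The plotting pattern for these questions can be sketched on simulated data. Everything below is illustrative: the values are random, the column name Measurement is an assumption about how the time points are stored, and the log transform is only one common option for right-skewed data, not necessarily the right choice for your variable.

```r
library(tidyverse)

set.seed(1)

# Simulated long-format data: one row per person per time point
toy_long <- tibble(
  Diabetes = rep(c("0", "1"), each = 150),
  Measurement = rep(c("0", "60", "120"), times = 100),
  `Glucose (mmol/L)` = rnorm(300, mean = ifelse(Diabetes == "1", 9, 6), sd = 1)
)

# One histogram panel per combination of time point and diabetes status
p_hist <- ggplot(toy_long, aes(x = `Glucose (mmol/L)`)) +
  geom_histogram(bins = 20) +
  facet_wrap(Measurement ~ Diabetes, scales = "free")
p_hist

# A right-skewed toy variable and a candidate transformation
skewed <- rlnorm(200, meanlog = 1, sdlog = 0.8)

# Correlation between theoretical and sample quantiles:
# the closer to 1, the straighter the QQ line
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}
qq_cor(skewed)
qq_cor(log(skewed))

# QQ plot of the transformed values with a reference line
p_qq <- ggplot(tibble(value = log(skewed)), aes(sample = value)) +
  geom_qq() +
  geom_qq_line()
p_qq
```

The helper qq_cor() is not part of any package; it is defined here only to put a number on what the QQ plot shows visually.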
Luckily for us, dimensionality reduction methods like PCA do not require that variables or their residuals are normally distributed. Requirements for normality become important when performing statistical tests and modelling, including t-tests, ANOVA and regression models (more on this in part 5).
Data Cleaning
While PCA is very forgiving in terms of variable distributions, there are some things it does not handle well, including missing values and varying ranges of numeric variables. So, before you go on you need to do a little data management.
The Glucose (mmol/L) variable in the dataset, which denotes the result of the Oral Glucose Tolerance Test with measurements at the times 0, 60 and 120 min, should be separated out into three columns, one for each time point.
pivot_wider(names_from = ..., values_from = ..., names_prefix = 'Measurement_')
Remove any missing values from your dataset.
Extract all numeric variables AND scale these so they have comparable ranges. This numeric dataset you will use for PCA.
Extract all categorical variables from your dataset and save them to a new object. You will use these as labels for the PCA.
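The four cleaning steps above can be sketched end to end on a tiny made-up dataset. The columns ID and Measurement and all values are assumptions for illustration; map the arguments onto the column names in your own data.

```r
library(tidyverse)

# Toy long-format data: 4 people x 3 time points, one person with missing Age
toy_long <- tibble(
  ID = rep(1:4, each = 3),
  Diabetes = rep(c("0", "1", "0", "1"), each = 3),
  Age = rep(c(34, 51, 47, NA), each = 3),
  Measurement = rep(c(0, 60, 120), times = 4),
  `Glucose (mmol/L)` = c(5.1, 7.9, 6.0, 6.8, 11.2, 9.5,
                         4.9, 7.1, 5.8, 7.2, 12.0, 10.1)
)

# 1) One column per time point, 2) drop rows with any missing value
toy_wide <- toy_long %>%
  pivot_wider(
    names_from = Measurement,
    values_from = `Glucose (mmol/L)`,
    names_prefix = "Measurement_"
  ) %>%
  drop_na()

# 3) Numeric variables (excluding the identifier), centred and scaled to sd 1
numeric_scaled <- toy_wide %>%
  select(where(is.numeric), -ID) %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))

# 4) Categorical variables, kept separately to use as labels later
categorical <- toy_wide %>%
  select(where(~ is.character(.x) | is.factor(.x)))

toy_wide
numeric_scaled
categorical
```

Dropping missing values before splitting the data keeps the rows of numeric_scaled and categorical aligned, which matters when the labels are later matched to the PCA coordinates.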
Principal Component Analysis
Calculate the PCs on your numeric dataset using the function PCA() from the FactoMineR package. Note that you should set scale.unit = FALSE, as you have already scaled your dataset.
Make a plot that shows how much variance is captured by each component (in each dimension). There are two ways of making this plot: either you can use the function fviz_screeplot() from the factoextra package as we did in the exercise OR you can use ggplot2, by extracting res.pca$eig and plotting the percentage of variance column.
Now, make a biplot (the round one with the arrows we showed in the presentation). This type of plot is a little complicated with ggplot alone, so either use the fviz_pca_var() function from the factoextra package we showed in the presentation or, if you want to challenge yourself, try autoplot() from the ggfortify package.
Have a look at $var$contrib from your PCA object and compare it to the biplot. What do they tell you, i.e. which variables contribute the most to PC1 (dim1) and PC2 (dim2)? Also, look at the correlations between the variables and each component ($var$cor). What can you conclude from this?
Plot your samples in the new dimensions, i.e. PC1 (dim1) vs PC2 (dim2), with the function fviz_pca_ind(). Add color and/or labels to the points using the categorical variables you extracted above. What pattern of clustering (if any) do you observe?
Try to call $var$cos2 from your PCA object. What does it tell you about the first 5 components? Are there any of these, in addition to Dim 1 and Dim 2, which seem to capture some variance? If so, try to plot these components and overlay the plot with the categorical variables from above to figure out what they might be capturing.
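To show the shape of the objects involved, here is a minimal sketch on simulated, pre-scaled data. The data are random and the column names are placeholders, so the numbers say nothing about the diabetes dataset; the sketch assumes FactoMineR is installed and takes the ggplot2 route to the scree plot.

```r
library(tidyverse)
library(FactoMineR)

set.seed(42)
n <- 100
latent <- rnorm(n)  # shared signal so three columns are correlated

# Simulated numeric data, centred and scaled as in the cleaning step
toy_scaled <- tibble(
  Measurement_0 = latent + rnorm(n, sd = 0.5),
  Measurement_60 = latent + rnorm(n, sd = 0.5),
  Measurement_120 = latent + rnorm(n, sd = 0.5),
  Age = rnorm(n),
  BMI = rnorm(n)
) %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))

# scale.unit = FALSE because the data are already scaled;
# graph = FALSE suppresses the default plots
res.pca <- PCA(as.data.frame(toy_scaled), scale.unit = FALSE, graph = FALSE)

# Eigenvalue table: one row per component
eig <- as_tibble(res.pca$eig, rownames = "component") %>%
  mutate(component = fct_inorder(component))

# Scree plot built by hand from the percentage of variance column
p_scree <- ggplot(eig, aes(x = component, y = `percentage of variance`)) +
  geom_col() +
  labs(x = "Component", y = "Variance explained (%)")
p_scree

# Per-variable summaries for the first two dimensions
round(res.pca$var$contrib[, 1:2], 1)  # contribution (%) to each dimension
round(res.pca$var$cor[, 1:2], 2)      # correlation of variable with dimension
round(res.pca$var$cos2[, 1:2], 2)     # quality of representation
```

For the sample plot, fviz_pca_ind() from factoextra accepts a grouping factor through its habillage argument, which is one way to color points by a categorical variable such as Diabetes.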
The Oral Glucose Tolerance Test is the one used to produce the OGTT measurements (0, 60, 120). As these measurements are directly used to diagnose diabetes, we are not surprised that they separate the dataset well: we are being completely biased.
Omit the Glucose measurement columns, calculate a PCA and create the plot above. Do the remaining variables still capture the Diabetes pattern?
From the exploratory analysis you have performed, which variables (numeric and/or categorical) would you as a minimum include in a model for prediction of Diabetes?
N.B. Again, you should not include the actual glucose measurement variables, as they were used to define the outcome you are looking at.