Exercise 3 A: Exploratory Data Analysis (EDA) - Plotting
In this exercise you will do a lot of plotting with ggplot. For a reminder on how ggplot works you can have a look the ggplot material covered in our previous course, From Excel to R.
Getting started
Load packages.
Load data from the
.rds
file you created in Exercise 2. Have a guess at what the function is called.
Plotting - Part 1
You will first do some basic plots to get started with ggplot again.
If it has been a while since you worked with ggplot, have a look at the ggplot material from the FromExceltoR course.
Create a scatter plot of
Age
andBlood Pressure
. Do you notice a trend?Create a scatter plot of
PhysicalActivity
andBMI.
Do you notice a trend?Now, create the same two plots as before, but this time stratify them by
Diabetes
. Do you notice any trends?
You can stratify a plot by a categorical variable in several ways, depending on the type of plot. The purpose of stratification is to distinguish samples based on their categorical values, making patterns or differences easier to identify. This can be done using aesthetics like color
, fill
, shape
.
Create a boxplot of
BMI
stratified byDiabetes.
Give the plot a meaningful title.Create a boxplot of
PhysicalActivity
stratified bySmoker
. Give the plot a meaningful title.
Plotting - Part 2
In order to plot the data inside the nested variable, the data needs to be unnested.
Create a boxplot of the glucose measurements at time 0 stratified by
Diabetes
. Give the plot a meaningful title.Create these boxplots for each time point (0, 60, 120) by using faceting by
Measurement
. Give the plot a meaningful title.
Faceting allows you to create multiple plots based on the values of a categorical variable, making it easier to compare patterns across groups. In ggplot2, you can use facet_wrap
for a single variable or facet_grid
for multiple variables.
- Calculate the mean glucose levels for each time point.
You will need to use unnest()
, group_by()
, and summerise()
.
- Make the same calculation as above, but additionally group the results by
Diabetes
. Save the data frame in a variable. Compare your results to the boxplots you made above.
Group by several variables: group_by(var1, var2)
.
- Create a plot that visualizes glucose measurements across time points, with one line for each patient ID. Then color the lines by their diabetes status. In summary, each patient’s glucose measurements should be connected with a line, grouped by their ID, and color-coded by
Diabetes
. Give the plot a meaningful title.
If your time points are strangely ordered have a look at the levels
of your Measurement
variable (the one that specifies which time point the measurement was taken at) and if necessary fix their order.
Extra exercises
This exercise might be a bit more challenging. It requires multiple operations and might involve some techniques that were not explicitly shown in the presentations.
e1. Recreate the plot you made in Exercise 12 and include the mean value for each glucose measurement for the two diabetes statuses (0 and 1) you calculated in Exercise 11. This plot should look like this:
There are several ways to solve this task. Here is a workflow suggestion:
You want to show both the raw data and the means in the same plot. Data from another dataset can be added to a plot by pointing the geom to a specific dataset like so:
+ geom_point(data = different_data, aes(x = VAR1, y = VAR2, group = VAR3))
geom_line
needs a variable to group by in order to know which dots should be connected. In exercise 10, this grouping variable was the patient ID. Which variable inglucose_mean
tells which points are connected? Pass it as group aesthetic.geom_line
has alinetype
aestethic to define the kind of line (dashed or solid).