Exercise 3 A: Exploratory Data Analysis (EDA) - Plotting

In this exercise you will do a lot of plotting with ggplot. For a reminder on how ggplot works you can have a look the ggplot material covered in our previous course, From Excel to R.

Getting started

  1. Load packages.

  2. Load data from the .rds file you created in Exercise 2. Have a guess at what the function is called.

Plotting - Part 1

You will first do some basic plots to get started with ggplot again.

If it has been a while since you worked with ggplot, have a look at the ggplot material from the FromExceltoR course.

  1. Create a scatter plot of Age and Blood Pressure. Do you notice a trend?

  2. Create a scatter plot of PhysicalActivity and BMI. Do you notice a trend?

  3. Now, create the same two plots as before, but this time stratify them by Diabetes. Do you notice any trends?

You can stratify a plot by a categorical variable in several ways, depending on the type of plot. The purpose of stratification is to distinguish samples based on their categorical values, making patterns or differences easier to identify. This can be done using aesthetics like color, fill, shape.

  1. Create a boxplot of BMI stratified by Diabetes. Give the plot a meaningful title.

  2. Create a boxplot of PhysicalActivity stratified by Smoker. Give the plot a meaningful title.

Plotting - Part 2

In order to plot the data inside the nested variable, the data needs to be unnested.

  1. Create a boxplot of the glucose measurements at time 0 stratified by Diabetes. Give the plot a meaningful title.

  2. Create these boxplots for each time point (0, 60, 120) by using faceting by Measurement. Give the plot a meaningful title.

Faceting allows you to create multiple plots based on the values of a categorical variable, making it easier to compare patterns across groups. In ggplot2, you can use facet_wrap for a single variable or facet_grid for multiple variables.

  1. Calculate the mean glucose levels for each time point.

You will need to use unnest(), group_by(), and summerise().

  1. Make the same calculation as above, but additionally group the results by Diabetes. Save the data frame in a variable. Compare your results to the boxplots you made above.

Group by several variables: group_by(var1, var2).

  1. Create a plot that visualizes glucose measurements across time points, with one line for each patient ID. Then color the lines by their diabetes status. In summary, each patient’s glucose measurements should be connected with a line, grouped by their ID, and color-coded by Diabetes. Give the plot a meaningful title.

If your time points are strangely ordered have a look at the levels of your Measurement variable (the one that specifies which time point the measurement was taken at) and if necessary fix their order.


Extra exercises

This exercise might be a bit more challenging. It requires multiple operations and might involve some techniques that were not explicitly shown in the presentations.

e1. Recreate the plot you made in Exercise 12 and include the mean value for each glucose measurement for the two diabetes statuses (0 and 1) you calculated in Exercise 11. This plot should look like this:

There are several ways to solve this task. Here is a workflow suggestion:

  • You want to show both the raw data and the means in the same plot. Data from another dataset can be added to a plot by pointing the geom to a specific dataset like so: + geom_point(data = different_data, aes(x = VAR1, y = VAR2, group = VAR3))

  • geom_line needs a variable to group by in order to know which dots should be connected. In exercise 10, this grouping variable was the patient ID. Which variable in glucose_mean tells which points are connected? Pass it as group aesthetic.

  • geom_line has a linetype aestethic to define the kind of line (dashed or solid).