pivot_longer(cols = LIST_WITH_COLUMNS_TO_PIVOT,
names_to = "NEW_COLUMN_CONTAINING_COLUMN_NAMES",
values_to = "NEW_COLUMN_CONTAINING_COLUMN_VALUES")
Exercise 2: Advanced Wrangling
In this exercise you will do some more advance tidyverse operations such as pivoting and nesting.
Getting started
Load packages.
Load the joined diabetes data set you created in exercise 1 (e.g. “diabetes_join.xlsx”) and the glucose dataset
df_glucose.xlsx
from the data folder.
Wrangling
Have a look at the glucose dataset. It has three columns with measurements from a Oral Glucose Tolerance Test where blood glucose is measured at fasting (Glucose_0), 1 hour/60 mins after glucose intake (Glucose_6), and 2 hours/120 mins after (Glucose_120). The last columns is an ID column. Change the data type of the ID column to
factor
in bothdiabetes_join
anddf_glucose
.Restructure the glucose dataset into a long format. Name the column that describes which measurement the row refers to, i.e. Glucose_0, Glucose_60 or Glucose_120,
Measurement
. How many rows are there per ID? Does that make sense?
Remember the flow:
Have a look at slide 16 for a visual overview.
- In your long format dataframe you should have one column that described which measurement the row refers to, i.e. Glucose_0, Glucose_60 or Glucose_120. Transform this column so that you only have the numerical part, i.e. only 0, 60 or 120. Then change the data type of that column to
factor
. Check the order of the factor levels and if necessary change them to the proper order.
The stringr
packages is a part of tidyverse and has many functions for manipulating strings. Find a function that can split the string so you can extract the numbers on the other side of the underscore.
Have a look at the help for factors ?factors
to see how to influence the levels.
Merge the glucose dataset with the joined diabetes dataset.
Pull the glucose measurements from your favorite ID.
First filter
for your favorite ID and then pull
the columns.
- Calculate the mean glucose measure for each measurement timepoint.
You will need to use group_by()
, and summerise()
.
- Calculate mean and standard deviation for all numeric columns.
You will need to use summarise()
and across()
, selecting numeric columns.
- Nest the glucose measurements and values such that there is only one row per ID and call the nested column OGTT (Oral Glucose Tolerance Test). Display the resulting tibble to confirm that you have succeeded.
Remember the flow:
group_by() %>%
nest() %>%
ungroup()
- Export the final dataset. Since the dataset is nested, you cannot export it as an excel file. Export the dataset as an
.rds
file. Have a guess at what the function is called.