library(tidyverse)
Exercise 2 - Solutions: Advanced Data Wrangling
Introduction
In this exercise you will do some more advance tidyverse operations such as pivoting and nesting.
First steps
- Load packages.
- Load the joined diabetes data set you created in exercise 1 (e.g. “diabetes_join.xlsx”) and the glucose dataset
df_glucose.xlsx
from the data folder.
<- readxl::read_excel('../out/diabetes_join.xlsx')
diabetes_join <- readxl::read_excel('../data/df_glucose.xlsx') df_glucose
Wrangling
- Have a look at the glucose dataset. It has three columns with measurements from a Oral Glucose Tolerance Test where blood glucose is measured at fasting (Glucose_0), 1 hour/60 mins after glucose intake (Glucose_6), and 2 hours/120 mins after (Glucose_120). The last columns is an ID column. Change the data type of the ID column to
factor
in bothdiabetes_join
anddf_glucose
.
head(df_glucose)
# A tibble: 6 × 4
Glucose_0 Glucose_60 Glucose_120 ID
<dbl> <dbl> <dbl> <dbl>
1 6.65 8.04 10.0 9046
2 4.49 5.40 6.22 51676
3 5.76 6.52 7.22 1665
4 6.13 6.94 8.09 12095
5 6.84 6.92 7.01 12175
6 6.84 7.62 8.42 8213
$ID <- as.factor(df_glucose$ID)
df_glucose$ID <- as.factor(diabetes_join$ID) diabetes_join
- Restructure the glucose dataset into a long format. Name the column that describes which measurement the row refers to, i.e. Glucose_0, Glucose_60 or Glucose_120,
Measurement
. How many rows are there per ID? Does that make sense?
Remember the flow:
pivot_longer(cols = LIST_WITH_COLUMNS_TO_PIVOT,
names_to = "NEW_COLUMN_CONTAINING_COLUMN_NAMES",
values_to = "NEW_COLUMN_CONTAINING_COLUMN_VALUES")
Have a look at slide 16 for a visual overview.
<- df_glucose %>%
df_glucose_long pivot_longer(cols = starts_with("Glucose"),
names_to = "Measurement",
values_to = "Glucose (mmol/L)"
)
head(df_glucose_long)
# A tibble: 6 × 3
ID Measurement `Glucose (mmol/L)`
<fct> <chr> <dbl>
1 9046 Glucose_0 6.65
2 9046 Glucose_60 8.04
3 9046 Glucose_120 10.0
4 51676 Glucose_0 4.49
5 51676 Glucose_60 5.40
6 51676 Glucose_120 6.22
There are three rows for each ID, corresponding to the three glucose measurements
%>%
df_glucose_long count(ID) %>%
head()
# A tibble: 6 × 2
ID n
<fct> <int>
1 129 3
2 210 3
3 491 3
4 530 3
5 621 3
6 712 3
- In your long format dataframe you should have one column that described which measurement the row refers to, i.e. Glucose_0, Glucose_60 or Glucose_120. Transform this column so that you only have the numerical part, i.e. only 0, 60 or 120. Then change the data type of that column to
factor
. Check the order of the factor levels and if necessary change them to the proper order.
The stringr
packages is a part of tidyverse and has many functions for manipulating strings. Find a function that can split the string so you can extract the numbers on the other side of the underscore.
Have a look at the help for factors ?factors
to see how to influence the levels.
<- df_glucose_long %>%
df_glucose_long mutate(Measurement = str_split_i(Measurement, '_', 2) %>% as.factor())
head(df_glucose_long)
# A tibble: 6 × 3
ID Measurement `Glucose (mmol/L)`
<fct> <fct> <dbl>
1 9046 0 6.65
2 9046 60 8.04
3 9046 120 10.0
4 51676 0 4.49
5 51676 60 5.40
6 51676 120 6.22
Check factor levels:
levels(df_glucose_long$Measurement)
[1] "0" "120" "60"
Adjust levels to proper order:
$Measurement <- factor(df_glucose_long$Measurement, levels = c('0', '60','120')) df_glucose_long
- Merge the glucose dataset with the joined diabetes dataset.
<- diabetes_join %>%
diabetes_glucose left_join(df_glucose_long, by = 'ID')
head(diabetes_glucose)
# A tibble: 6 × 13
ID Sex Age BloodPressure BMI PhysicalActivity Smoker Diabetes
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 9046 Male 34 84 24.7 93 Unknown 0
2 9046 Male 34 84 24.7 93 Unknown 0
3 9046 Male 34 84 24.7 93 Unknown 0
4 51676 Male 25 74 22.5 102 Unknown 0
5 51676 Male 25 74 22.5 102 Unknown 0
6 51676 Male 25 74 22.5 102 Unknown 0
# ℹ 5 more variables: Serum_ca2 <dbl>, Married <chr>, Work <chr>,
# Measurement <fct>, `Glucose (mmol/L)` <dbl>
- Pull the glucose measurements from your favorite ID.
First filter
for your favorite ID and then pull
the columns.
%>%
diabetes_glucose filter(ID == 9046) %>%
pull(Measurement,`Glucose (mmol/L)`)
6.652593325614 8.04416787019778 10.016298115626
0 60 120
Levels: 0 60 120
- Calculate the mean glucose measure for each measurement timepoint.
You will need to use group_by()
, and summarise()
.
%>%
diabetes_glucose group_by(Measurement) %>%
summarise(mean = mean(`Glucose (mmol/L)`))
# A tibble: 3 × 2
Measurement mean
<fct> <dbl>
1 0 8.06
2 60 9.73
3 120 11.1
- Calculate mean and standard deviation for all numeric columns.
You will need to use summarise()
and across()
, selecting numeric columns.
%>%
diabetes_glucose summarise(across(where(is.numeric), list(mean=mean, sd=sd),
.names = "{.col}-{.fn}"))
# A tibble: 1 × 12
`Age-mean` `Age-sd` `BloodPressure-mean` `BloodPressure-sd` `BMI-mean`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA 72.7 12.9 30.2
# ℹ 7 more variables: `BMI-sd` <dbl>, `PhysicalActivity-mean` <dbl>,
# `PhysicalActivity-sd` <dbl>, `Serum_ca2-mean` <dbl>, `Serum_ca2-sd` <dbl>,
# `Glucose (mmol/L)-mean` <dbl>, `Glucose (mmol/L)-sd` <dbl>
- Nest the glucose measurements and values such that there is only one row per ID and call the nested column OGTT (Oral Glucose Tolerance Test). Display the resulting tibble to confirm that you have succeeded.
Remember the flow:
group_by() %>%
nest() %>%
ungroup()
<- diabetes_glucose %>%
diabetes_glucose group_by(ID) %>%
nest(OGTT = c(Measurement, `Glucose (mmol/L)`)) %>%
ungroup()
#relocate the new nested column to after BMI so we can actually see it
head(relocate(diabetes_glucose, OGTT, .after = BMI))
# A tibble: 6 × 12
ID Sex Age BloodPressure BMI OGTT PhysicalActivity Smoker
<fct> <chr> <dbl> <dbl> <dbl> <list> <dbl> <chr>
1 9046 Male 34 84 24.7 <tibble> 93 Unknown
2 51676 Male 25 74 22.5 <tibble> 102 Unknown
3 60182 Male 50 80 34.5 <tibble> 98 Unknown
4 1665 Female 27 60 26.3 <tibble> 82 Never
5 56669 Male 35 84 35 <tibble> 58 Smoker
6 53882 Female 31 78 43.3 <tibble> 59 Smoker
# ℹ 4 more variables: Diabetes <chr>, Serum_ca2 <dbl>, Married <chr>,
# Work <chr>
- Export the final dataset. Since the dataset is nested, you cannot export it as an excel file. Export the dataset as an
.rds
file. Have a guess at what the function is called.
write_rds(diabetes_glucose, '../out/diabetes_glucose.rds')