Exercise 2 - Solutions: Data Clean Up

Introduction

In this exercise you will do some more advance tidyverse operations such as pivoting and nesting, as well as create plots to brush up on your ggplot skills.

First steps

  1. Load packages.
library(tidyverse)
Warning: pakke 'ggplot2' blev bygget under R version 4.2.3
Warning: pakke 'tibble' blev bygget under R version 4.2.3
Warning: pakke 'dplyr' blev bygget under R version 4.2.3
  1. Load the joined diabetes data set you created in exercise 1 and the glucose dataset from the data folder.
diabetes_join <- readxl::read_excel('../out/diabetes_join.xlsx')
df_glucose <- readxl::read_excel('../data/df_glucose.xlsx')
  1. Have a look at the glucose dataset. The OGTT column contains measurements from a Oral Glucose Tolerance Test where blood glucose is measured at fasting (Glucose_0), 6 hours after glucose intake (Glucose_6), and 12 hours after (Glucose_12).
head(df_glucose)
# A tibble: 6 × 4
  Glucose_0 Glucose_60 Glucose_120    ID
      <dbl>      <dbl>       <dbl> <dbl>
1      6.65       8.04       10.0   9046
2      4.49       5.40        6.22 51676
3      5.76       6.52        7.22  1665
4      6.13       6.94        8.09 12095
5      6.84       6.92        7.01 12175
6      6.84       7.62        8.42  8213
  1. Restructure the glucose dataset into a long format. How many rows are there per ID? Does that make sense?

Remember the flow:

pivot_longer(cols = LIST_WITH_COLUMNS_TO_PIVOT,
             names_to = "NEW_COLUMN_CONTAINING_COLUMN_NAMES",
             values_to = "NEW_COLUMN_CONTAINING_COLUMN_VALUES")

Have a look at slide 16 for a visual overview.

df_glucose_long <- df_glucose %>% 
  pivot_longer(cols = starts_with("Glucose"),
               names_to = "Measurement",
               values_to = "Glucose (mmol/L)"
               )

head(df_glucose_long)
# A tibble: 6 × 3
     ID Measurement `Glucose (mmol/L)`
  <dbl> <chr>                    <dbl>
1  9046 Glucose_0                 6.65
2  9046 Glucose_60                8.04
3  9046 Glucose_120              10.0 
4 51676 Glucose_0                 4.49
5 51676 Glucose_60                5.40
6 51676 Glucose_120               6.22
  1. Change the glucose measurements to numeric variable.

The stringr packages is a part of tidyverse and has many functions for manipulating strings. Find a function that can split the string so you can extract the numbers on the other side of the underscore.

df_glucose_long <- df_glucose_long %>% 
  mutate(Measurement = str_split_i(Measurement, '_', 2) %>% as.numeric())

head(df_glucose_long)
# A tibble: 6 × 3
     ID Measurement `Glucose (mmol/L)`
  <dbl>       <dbl>              <dbl>
1  9046           0               6.65
2  9046          60               8.04
3  9046         120              10.0 
4 51676           0               4.49
5 51676          60               5.40
6 51676         120               6.22
  1. Nest the glucose measurements and values such that there is only one row per ID.

Remember the flow:

group_by() %>% 
  nest() %>% 
  ungroup()
df_glucose_nested <- df_glucose_long %>% 
  group_by(ID) %>% 
  nest(OGTT = c(Measurement, `Glucose (mmol/L)`)) %>% 
  ungroup()
  
head(df_glucose_nested)
# A tibble: 6 × 2
     ID OGTT            
  <dbl> <list>          
1  9046 <tibble [3 × 2]>
2 51676 <tibble [3 × 2]>
3  1665 <tibble [3 × 2]>
4 12095 <tibble [3 × 2]>
5 12175 <tibble [3 × 2]>
6  8213 <tibble [3 × 2]>
  1. Merge the nested glucose dataset with the joined diabetes.
diabetes_glucose <- diabetes_join %>% 
  left_join(df_glucose_nested, by = 'ID')

head(diabetes_glucose)
# A tibble: 6 × 12
     ID Sex      Age BloodPressure GeneticRisk   BMI PhysicalActivity Smoker 
  <dbl> <chr>  <dbl>         <dbl>       <dbl> <dbl>            <dbl> <chr>  
1  9046 Male      34            84       0.619  24.7               93 Unknown
2 51676 Male      25            74       0.591  22.5              102 Unknown
3 60182 Male      50            80       0.178  34.5               98 Unknown
4  1665 Female    27            60       0.206  26.3               82 Never  
5 56669 Male      35            84       0.286  35                 58 Smoker 
6 53882 Female    31            78       1.22   43.3               59 Smoker 
# ℹ 4 more variables: Diabetes <chr>, Married <chr>, Work <chr>, OGTT <list>
  1. Pull the glucose measurements (OGTT) from your favorite ID.

First filter for your favorite ID and then pull the nested column.

diabetes_glucose %>% 
  filter(ID == 9046) %>% 
  pull(OGTT)
[[1]]
# A tibble: 3 × 2
  Measurement `Glucose (mmol/L)`
        <dbl>              <dbl>
1           0               6.65
2          60               8.04
3         120              10.0 
  1. Export the final dataset. Since the dataset is nested, you cannot export it as an excel file. Export the dataset as an .rds file. Have a guess at what the function is called.
write_rds(diabetes_glucose, '../out/diabetes_glucose.rds')