Exercise 2 - Solutions: Advanced Data Wrangling

Introduction

In this exercise you will do some more advance tidyverse operations such as pivoting and nesting.

First steps

Load packages.

library(tidyverse)

Load the joined diabetes data set you created in exercise 1 (e.g. “diabetes_join.xlsx”) and the glucose dataset df_glucose.xlsx from the data folder.

diabetes_join <- readxl::read_excel('../out/diabetes_join.xlsx')
df_glucose <- readxl::read_excel('../data/df_glucose.xlsx')

Wrangling

Have a look at the glucose dataset. It has three columns with measurements from a Oral Glucose Tolerance Test where blood glucose is measured at fasting (Glucose_0), 1 hour/60 mins after glucose intake (Glucose_6), and 2 hours/120 mins after (Glucose_120). The last columns is an ID column. Change the data type of the ID column to factor in both diabetes_join and df_glucose.

head(df_glucose)

# A tibble: 6 × 4
  Glucose_0 Glucose_60 Glucose_120    ID
      <dbl>      <dbl>       <dbl> <dbl>
1      6.65       8.04       10.0   9046
2      4.49       5.40        6.22 51676
3      5.76       6.52        7.22  1665
4      6.13       6.94        8.09 12095
5      6.84       6.92        7.01 12175
6      6.84       7.62        8.42  8213

df_glucose$ID <- as.factor(df_glucose$ID)
diabetes_join$ID <- as.factor(diabetes_join$ID)

Restructure the glucose dataset into a long format. Name the column that describes which measurement the row refers to, i.e. Glucose_0, Glucose_60 or Glucose_120, Measurement. How many rows are there per ID? Does that make sense?

Hint

Remember the flow:

pivot_longer(cols = LIST_WITH_COLUMNS_TO_PIVOT,
             names_to = "NEW_COLUMN_CONTAINING_COLUMN_NAMES",
             values_to = "NEW_COLUMN_CONTAINING_COLUMN_VALUES")

Have a look at slide 16 for a visual overview.

df_glucose_long <- df_glucose %>% 
  pivot_longer(cols = starts_with("Glucose"),
               names_to = "Measurement",
               values_to = "Glucose (mmol/L)"
               )

head(df_glucose_long)

# A tibble: 6 × 3
  ID    Measurement `Glucose (mmol/L)`
  <fct> <chr>                    <dbl>
1 9046  Glucose_0                 6.65
2 9046  Glucose_60                8.04
3 9046  Glucose_120              10.0 
4 51676 Glucose_0                 4.49
5 51676 Glucose_60                5.40
6 51676 Glucose_120               6.22

There are three rows for each ID, corresponding to the three glucose measurements

df_glucose_long %>%
  count(ID) %>%
  head()

# A tibble: 6 × 2
  ID        n
  <fct> <int>
1 129       3
2 210       3
3 491       3
4 530       3
5 621       3
6 712       3

In your long format dataframe you should have one column that described which measurement the row refers to, i.e. Glucose_0, Glucose_60 or Glucose_120. Transform this column so that you only have the numerical part, i.e. only 0, 60 or 120. Then change the data type of that column to factor. Check the order of the factor levels and if necessary change them to the proper order.

Hint

The stringr packages is a part of tidyverse and has many functions for manipulating strings. Find a function that can split the string so you can extract the numbers on the other side of the underscore.

Have a look at the help for factors ?factors to see how to influence the levels.

df_glucose_long <- df_glucose_long %>% 
  mutate(Measurement = str_split_i(Measurement, '_', 2) %>% as.factor())

head(df_glucose_long)

# A tibble: 6 × 3
  ID    Measurement `Glucose (mmol/L)`
  <fct> <fct>                    <dbl>
1 9046  0                         6.65
2 9046  60                        8.04
3 9046  120                      10.0 
4 51676 0                         4.49
5 51676 60                        5.40
6 51676 120                       6.22

Check factor levels:

levels(df_glucose_long$Measurement)

[1] "0"   "120" "60"

Adjust levels to proper order:

df_glucose_long$Measurement <- factor(df_glucose_long$Measurement, levels = c('0', '60','120'))

Merge the glucose dataset with the joined diabetes dataset.

diabetes_glucose <- diabetes_join %>% 
  left_join(df_glucose_long, by = 'ID')

head(diabetes_glucose)

# A tibble: 6 × 13
  ID    Sex     Age BloodPressure   BMI PhysicalActivity Smoker  Diabetes
  <fct> <chr> <dbl>         <dbl> <dbl>            <dbl> <chr>   <chr>   
1 9046  Male     34            84  24.7               93 Unknown 0       
2 9046  Male     34            84  24.7               93 Unknown 0       
3 9046  Male     34            84  24.7               93 Unknown 0       
4 51676 Male     25            74  22.5              102 Unknown 0       
5 51676 Male     25            74  22.5              102 Unknown 0       
6 51676 Male     25            74  22.5              102 Unknown 0       
# ℹ 5 more variables: Serum_ca2 <dbl>, Married <chr>, Work <chr>,
#   Measurement <fct>, `Glucose (mmol/L)` <dbl>

Pull the glucose measurements from your favorite ID.

Hint

First filter for your favorite ID and then pull the columns.

diabetes_glucose %>% 
  filter(ID == 9046) %>% 
  pull(Measurement,`Glucose (mmol/L)`)

  6.652593325614 8.04416787019778  10.016298115626 
               0               60              120 
Levels: 0 60 120

Calculate the mean glucose measure for each measurement timepoint.

Hint

You will need to use group_by(), and summarise().

diabetes_glucose %>%
  group_by(Measurement) %>%
  summarise(mean = mean(`Glucose (mmol/L)`))

# A tibble: 3 × 2
  Measurement  mean
  <fct>       <dbl>
1 0            8.06
2 60           9.73
3 120         11.1

Calculate mean and standard deviation for all numeric columns.

Hint

You will need to use summarise() and across(), selecting numeric columns.

diabetes_glucose %>%
  summarise(across(where(is.numeric), list(mean=mean, sd=sd), 
                   .names = "{.col}-{.fn}"))

# A tibble: 1 × 12
  `Age-mean` `Age-sd` `BloodPressure-mean` `BloodPressure-sd` `BMI-mean`
       <dbl>    <dbl>                <dbl>              <dbl>      <dbl>
1         NA       NA                 72.7               12.9       30.2
# ℹ 7 more variables: `BMI-sd` <dbl>, `PhysicalActivity-mean` <dbl>,
#   `PhysicalActivity-sd` <dbl>, `Serum_ca2-mean` <dbl>, `Serum_ca2-sd` <dbl>,
#   `Glucose (mmol/L)-mean` <dbl>, `Glucose (mmol/L)-sd` <dbl>

Nest the glucose measurements and values such that there is only one row per ID and call the nested column OGTT (Oral Glucose Tolerance Test). Display the resulting tibble to confirm that you have succeeded.

Hint

Remember the flow:

group_by() %>% 
  nest() %>% 
  ungroup()

diabetes_glucose <- diabetes_glucose %>% 
  group_by(ID) %>% 
  nest(OGTT = c(Measurement, `Glucose (mmol/L)`)) %>% 
  ungroup()
  
#relocate the new nested column to after BMI so we can actually see it
head(relocate(diabetes_glucose, OGTT, .after = BMI))

# A tibble: 6 × 12
  ID    Sex      Age BloodPressure   BMI OGTT     PhysicalActivity Smoker 
  <fct> <chr>  <dbl>         <dbl> <dbl> <list>              <dbl> <chr>  
1 9046  Male      34            84  24.7 <tibble>               93 Unknown
2 51676 Male      25            74  22.5 <tibble>              102 Unknown
3 60182 Male      50            80  34.5 <tibble>               98 Unknown
4 1665  Female    27            60  26.3 <tibble>               82 Never  
5 56669 Male      35            84  35   <tibble>               58 Smoker 
6 53882 Female    31            78  43.3 <tibble>               59 Smoker 
# ℹ 4 more variables: Diabetes <chr>, Serum_ca2 <dbl>, Married <chr>,
#   Work <chr>

Export the final dataset. Since the dataset is nested, you cannot export it as an excel file. Export the dataset as an .rds file. Have a guess at what the function is called.

write_rds(diabetes_glucose, '../out/diabetes_glucose.rds')