Presentation 1 - Solutions: Base R and Tidyverse / Data Clean-up and Wrangling

Getting started

  1. Load packages.
library(tidyverse)
Warning: pakke 'ggplot2' blev bygget under R version 4.2.3
Warning: pakke 'tibble' blev bygget under R version 4.2.3
Warning: pakke 'dplyr' blev bygget under R version 4.2.3
library(readxl)
Warning: pakke 'readxl' blev bygget under R version 4.2.3
  1. Load in the diabetes_clinical_toy_messy.xlsx data set.
diabetes_clinical <- read_excel('../data/diabetes_clinical_toy_messy.xlsx')
head(diabetes_clinical)
# A tibble: 6 × 9
     ID Sex      Age BloodPressure GeneticRisk   BMI PhysicalActivity Smoker 
  <dbl> <chr>  <dbl>         <dbl>       <dbl> <dbl>            <dbl> <chr>  
1  9046 Male      34            84       0.619  24.7               93 Unknown
2 51676 Male      25            74       0.591  22.5              102 Unknown
3 31112 Male      30             0       0.839  32.3               75 Former 
4 60182 Male      50            80       0.178  34.5               98 Unknown
5  1665 Female    27            60       0.206  26.3               82 Never  
6 56669 Male      35            84       0.286  35                 58 Smoker 
# ℹ 1 more variable: Diabetes <dbl>

Explore the data

Use can you either base R or/and tidyverse to solve the exercises.

  1. How many missing values (NA’s) are there in each column.
colSums(is.na(diabetes_clinical))
              ID              Sex              Age    BloodPressure 
               0                0                0                0 
     GeneticRisk              BMI PhysicalActivity           Smoker 
               5                0                0                0 
        Diabetes 
               0 
  1. Check the distribution of each of the variables. Consider that the variables are of different classes. Do any of the distributions seem odd to you?

For the categorical variables:

The Sex values are not consistent.

table(diabetes_clinical$Sex)

Female FEMALE   male   Male 
   291      2      2    237 
table(diabetes_clinical$Smoker)

 Former   Never  Smoker Unknown 
    132     159     162      79 
table(diabetes_clinical$Diabetes)

  0   1 
267 265 

For the numerical variables:

diabetes_clinical %>% 
  ggplot(aes(y = Age)) + 
  geom_boxplot()

Odd: Some BloodPressure values are 0.

diabetes_clinical %>% 
  ggplot(aes(y = BloodPressure)) + 
  geom_boxplot()

diabetes_clinical %>% 
  ggplot(aes(y = GeneticRisk)) + 
  geom_boxplot()
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Odd: Some BMI values are 0.

diabetes_clinical %>% 
  ggplot(aes(y = BMI)) + 
  geom_boxplot()

diabetes_clinical %>% 
  ggplot(aes(y = PhysicalActivity)) + 
  geom_boxplot()

Clean up the data

Now that we have had a look at the data, it is time to correct fixable mistakes and remove observations that cannot be corrected.

Consider the following:

  • What should we do with the rows that contain NA’s? Do we remove them or keep them?

  • Which odd things in the data can we correct with confidence and which cannot?

  • Are there zeros in the data? Are they true zeros or errors?

  • Do you want to change any of the classes of the variables?

  1. Clean the data according to your considerations.

Have a look at BloodPressure, BMI, Sex, and Diabetes.

My considerations:

  • When modelling, rows with NA’s in the variables we want to model should be removed as we cannot model on NAs. Since there are only NA’s in GeneticRisk, the rows can be left until we need to do a model with GeneticRisk.

  • The uppercase/lowercase mistakes in Sex does not influence the interpretability of the variables, so they are simply changes such that the first letter is a capital letter and the remaining letter are lowercase.

  • There are zeros in BMI and BloodPressure. These are considered false zeros as is does not make sense that these variables have a value of 0.

  • Diabetes is changed to factor.

Access number of rows before cleaning.

nrow(diabetes_clinical)
[1] 532

Cleaning data according to considerations.

diabetes_clinical_clean <- diabetes_clinical %>% 
  mutate(Sex = str_to_title(Sex),
         Diabetes = Diabetes %>% factor()) %>% 
  filter(BMI != 0, BloodPressure != 0) 

Check the unique sexes now.

diabetes_clinical_clean$Sex %>% unique()
[1] "Male"   "Female"

Access number of rows after cleaning.

nrow(diabetes_clinical_clean)
[1] 493

Meta Data

  1. Perform step 2-5 from above and do data exploration and cleaning workflow for the diabetes_meta_toy_messy.csv data set. Use the read_delim function to load in the dataset.

6.2. Load the data set.

diabetes_meta <- read_delim('../data/diabetes_meta_toy_messy.csv')
head(diabetes_meta)
# A tibble: 6 × 3
     ID Married Work         
  <dbl> <chr>   <chr>        
1 33879 Yes     Self-employed
2 52800 Yes     Private      
3 16817 Yes     Private      
4 70676 Yes     Self-employed
5  6319 No      Public       
6 71379 No      Public       

Use can you either base R or/and tidyverse to solve the exercises. For now, we just explore the data.

6.3. How many missing values (NA’s) are there in each column.

colSums(is.na(diabetes_meta))
     ID Married    Work 
      0       0       0 

6.4. Check the distribution of each of the variables. Consider that the variables are of different classes. Do any of the distributions seam odd to you?

For the categorical variables:

table(diabetes_meta$Married)

  No  No   Yes Yes  
 183    3  345    1 
table(diabetes_meta$Work)

      Private        Public       Retired Self-employed 
          283           154             6            89 

By investigating the unique values of the Married variable we see that some of the values have whitespace.

unique(diabetes_meta$Married)
[1] "Yes"  "No"   "Yes " "No " 
    1. Clean the data according to your considerations.

My considerations:

  • The Married variable has whitespace in the some of the values. The values “Yes” and “Yes” will be interpreted as different values. We can confidently remove all the whitespaces in this variable.

  • ID is changed to numerical to match the diabetes_clean dataset.

Access number of rows before cleaning.

nrow(diabetes_meta)
[1] 532
diabetes_meta_clean <- diabetes_meta %>% 
  mutate(Married = str_trim(Married),
         ID = as.numeric(ID))

Check the unique marital status now.

unique(diabetes_meta_clean$Married)
[1] "Yes" "No" 

Access number of rows after cleaning.

nrow(diabetes_meta_clean)
[1] 532

Join the datasets

  1. Consider what variable the datasets should be joined on.

The joining variable must be the same type in both datasets.

  1. Join the datasets by the variable you selected above.
diabetes_join <- diabetes_clinical_clean %>% 
  left_join(diabetes_meta_clean, by = 'ID')
  1. How many rows does the joined dataset have? Explain why.

Because we used left_join, only the IDs that are in diabetes_clinical_clean are kept.

nrow(diabetes_join)
[1] 493
  1. Export the joined dataset. Think about which directory you want to save the file in.
writexl::write_xlsx(diabetes_join, '../out/diabetes_join.xlsx')