Exercise 1: Data Cleaning

Getting started

  1. Load packages.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
  1. Load in the diabetes_clinical_toy_messy.xlsx data set.
diabetes <- read_xlsx("../data/diabetes_clinical_toy_messy.xlsx")
head(diabetes)
# A tibble: 6 × 9
     ID Sex      Age BloodPressure   BMI PhysicalActivity Smoker  Diabetes
  <dbl> <chr>  <dbl>         <dbl> <dbl>            <dbl> <chr>      <dbl>
1  9046 Male      34            84  24.7               93 Unknown        0
2 51676 Male      25            74  22.5              102 Unknown        0
3 31112 Male      30             0  32.3               75 Former         1
4 60182 Male      50            80  34.5               98 Unknown        1
5  1665 Female    27            60  26.3               82 Never          0
6 56669 Male      35            84  35                 58 Smoker         1
# ℹ 1 more variable: Serum_ca2 <dbl>

Explore the data

  1. How many missing values (NA’s) are there in each column.

  2. Check the ranges and distribution of each of the numeric variables in the dataset. Do any values seem weird or unexpected? Extract summary statistics on these, e.g. means and standard deviation.

  3. Some variables in the dataset are categorical or factor variables. Figure out what levels these have and how many observations there are for each level.

Clean up the data

Now that we have had a look at the data, it is time to correct fixable mistakes and remove observations that cannot be corrected.

Consider the following:

  • What should we do with the rows that contain NA’s? Do we remove them or keep them?

  • Which odd things in the data can we correct with confidence and which cannot?

  • Are there zeros in the data? Are they true zeros or errors?

  • Do you want to change any of the classes of the variables?

  1. Make a clean version of the dataset according to your considerations.

Have a look at BloodPressure, BMI, and Sex.

Metadata

  1. There is some metadata to accompany the dataset you have just cleaned in diabetes_meta_toy_messy.csv. This is a csv file, not an excel sheet, so you need to use the read_delim() function to load it. Load in the dataset and inspect it.

7.1. How many missing values (NA’s) are there in each column.

7.2. Check the distribution of each of the variables. Consider that the variables are of different classes. Do any of the distributions seam odd to you?

7.3. Make a clean version of the dataset according to your considerations.

Join the datasets

  1. Consider which variable the datasets should be joined on.

  2. Consider how you want to join the datasets. Do you want to use full_join, inner_join, left_join and rigth_join?

The joining variable must be the same type in both datasets.

  1. Join the cleaned versions of the clinical and metadataset by the variable and with the function you considered above.

  2. How many rows does the joined dataset have? Explain how the join-function you used resulted in the given number of rows.

Manipulate some strings

When we look at the column Work, we can see that we have three levels Public, Private and Self-employed. Some special characters, including: - / \[\] (), etc. can sometimes prove problematic in variable names, as they are also used for operations in R. For example, are dashes (-) used to indicate a contrast in some functions.

  1. To avoid potential issues in downstream analysis, change the Work variable so that Self-employed becomes SelfEmployed.

Additionally, we are not so happy with the fact that the ID is simply denoted by a number. The reason for this is that if we write out our dataset and read it in again (which we will do later), R will think it is a numeric or integer variable.

  1. Add the string ID_ in front of the number and convert it to a factor variable.

  2. Export the joined dataset. Think about which directory you want to save the file in.