Now that we have had a look at the data, it is time to correct fixable mistakes and remove observations that cannot be corrected.
Consider the following:
What should we do with the rows that contain NA’s? Do we remove them or keep them?
Which odd things in the data can we correct with confidence and which cannot?
Are there zeros in the data? Are they true zeros or errors?
Do you want to change any of the classes of the variables?
Clean the data according to your considerations.
Hint
Have a look at BloodPressure, BMI, Sex, and Diabetes.
My considerations:
Rows with NAs in the variables we want to model must be removed before modelling, since a model cannot be fitted on missing values. As the only NAs are in GeneticRisk, these rows can stay until we fit a model that includes GeneticRisk.
The uppercase/lowercase inconsistencies in Sex do not affect the interpretability of the variable, so the values are simply changed such that the first letter is uppercase and the remaining letters are lowercase.
There are zeros in BMI and BloodPressure. These are considered false zeros, as it does not make sense for these variables to have a value of 0.
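The considerations above could be turned into code roughly as follows. This is a minimal base R sketch on a small made-up data frame: the name `diabetes_toy` and all of its values are hypothetical, and only the column names come from the exercise. Dropping the rows with false zeros is one option; recoding the zeros to NA would be another.

```r
# Hypothetical toy data standing in for the real data set
diabetes_toy <- data.frame(
  Sex           = c("male", "FEMALE", "Female", "Male", "fEMALE"),
  BMI           = c(24.3, 0, 31.2, 27.8, 22.1),
  BloodPressure = c(72, 80, 0, 66, 74),
  GeneticRisk   = c(0.5, NA, 0.7, 0.2, NA),
  stringsAsFactors = FALSE
)

# Standardise Sex: first letter uppercase, remaining letters lowercase
sex_lower <- tolower(diabetes_toy$Sex)
diabetes_toy$Sex <- paste0(toupper(substr(sex_lower, 1, 1)), substring(sex_lower, 2))

# Drop rows with false zeros in BMI or BloodPressure
# (assumes these two columns contain no NAs, as in the toy data)
keep <- diabetes_toy$BMI != 0 & diabetes_toy$BloodPressure != 0
diabetes_nonzero <- diabetes_toy[keep, ]

# Rows with NA in GeneticRisk are kept for now and
# only filtered out when a model actually uses GeneticRisk
diabetes_for_genetic_model <- diabetes_nonzero[!is.na(diabetes_nonzero$GeneticRisk), ]

table(diabetes_nonzero$Sex)
```

With tidyverse, `stringr::str_to_title()` and `dplyr::filter()` would do the same job in fewer lines.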
Perform steps 2-5 from above, i.e. the data exploration and cleaning workflow, for the diabetes_meta_toy_messy.csv data set. Use the read_delim function to load the data set.
# A tibble: 6 × 3
ID Married Work
<dbl> <chr> <chr>
1 33879 Yes Self-employed
2 52800 Yes Private
3 16817 Yes Private
4 70676 Yes Self-employed
5 6319 No Public
6 71379 No Public
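Loading the file and printing the first rows, as shown above, might look like the sketch below. To keep it self-contained, it first writes three of the rows shown above to a temporary file. The comma delimiter and the object name `diabetes_meta_toy` are assumptions, so check the real file before choosing `delim`.

```r
library(readr)

# Write a few rows to a temporary file so the sketch runs on its own.
# The comma delimiter is an assumption about the file format.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("ID,Married,Work",
    "33879,Yes,Self-employed",
    "52800,Yes,Private",
    "6319,No,Public"),
  toy_path
)

# Read the delimited file into a tibble and look at the first rows
diabetes_meta_toy <- read_delim(toy_path, delim = ",")
head(diabetes_meta_toy)
```

For the exercise itself, `toy_path` would be replaced by the path to diabetes_meta_toy_messy.csv on your machine.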
You can use base R, tidyverse, or both to solve the exercises. For now, we just explore the data.
6.3. How many missing values (NAs) are there in each column?
colSums(is.na(diabetes_meta))
ID Married Work
0 0 0
6.4. Check the distribution of each of the variables. Consider that the variables are of different classes. Do any of the distributions seem odd to you?
For the categorical variables:
table(diabetes_meta$Married)
No No Yes Yes
183 3 345 1
table(diabetes_meta$Work)
Private Public Retired Self-employed
283 154 6 89
By investigating the unique values of the Married variable, we see that some of the values have trailing whitespace.
unique(diabetes_meta$Married)
[1] "Yes" "No" "Yes " "No "
Clean the data according to your considerations.
My considerations:
The Married variable has whitespace in some of the values. The values “Yes” and “Yes ” will be interpreted as different values. We can confidently remove the whitespace in this variable.
ID is changed to numerical to match the diabetes_clean dataset.
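These two cleaning steps could be sketched in base R as below. The data frame `diabetes_meta_toy` is a hypothetical stand-in for the loaded metadata, and ID is stored as character here purely to illustrate the conversion. Note that `trimws()` only removes leading and trailing whitespace, which is all that is needed for the values seen above.

```r
# Hypothetical toy version of the metadata, including trailing whitespace
diabetes_meta_toy <- data.frame(
  ID      = c("33879", "52800", "6319", "71379"),
  Married = c("Yes", "Yes ", "No", "No "),
  stringsAsFactors = FALSE
)

# Remove leading and trailing whitespace so "Yes " and "Yes" become one value
diabetes_meta_toy$Married <- trimws(diabetes_meta_toy$Married)

# Make sure ID is numeric (this changes nothing if it already is)
diabetes_meta_toy$ID <- as.numeric(diabetes_meta_toy$ID)

# Verify that only the two expected categories remain
table(diabetes_meta_toy$Married)
```

A tidyverse alternative for the first step would be `stringr::str_trim()` inside `dplyr::mutate()`.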