Exercise 6: Understanding and improving an R script
In this exercise you go through the script analysis.R together with your neighbors. Find out what data is being worked with, which analysis is done, and how. There are also some potential issues, so keep an eye out and think about how you could improve the script.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Libraries and data are loaded. The data is stored as an R data file (rds). The user then looks at what the data contains with head and str. We can see that there are 7 columns. The last one, bp_readings, is nested. str is probably not a good choice for this type of data, since the nesting makes the output very long.
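For a nested column, a more compact overview might come from glimpse() or from limiting the depth of str(). The following is a minimal sketch on a made-up toy tibble. The layout of bp_readings as a list column of numeric vectors is an assumption; the real data may be nested differently.

```r
library(tibble)

# Toy data: bp_readings as a list column is an assumption for illustration
toy <- tibble(
  patient_id = 1:3,
  age = c(54, 61, 47),
  bp_readings = list(c(118, 121, 119), c(135, 138, 140), c(110, 112, 109))
)

# One line per column; the nested column is shown in abbreviated form
glimpse(toy)

# str() restricted to the top level, so the nested values are not expanded
str(toy, max.level = 1)
```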
colSums(is.na(df))
patient_id age sex hdl ldl tg
0 0 0 0 3 2
bp_readings
0
df <- na.omit(df)
colSums(is.na(df))
patient_id age sex hdl ldl tg
0 0 0 0 0 0
bp_readings
0
The author of the script is checking for NA values in the data, then omitting them and checking again.
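When dropping rows, it can help to record how many rows are lost and to restrict the check to the columns that actually matter. A minimal sketch with made-up values, using tidyr's drop_na() as one possible alternative to na.omit() (the column selection here is an assumption, not what the script does):

```r
library(tibble)
library(tidyr)

# Toy data with missing values in ldl and tg (values are made up)
toy <- tibble(
  patient_id = 1:5,
  ldl = c(100, NA, 130, 120, 110),
  tg  = c(150, 140, NA, 160, 155)
)

n_before <- nrow(toy)

# Drop only rows that are incomplete in the named columns
toy_complete <- drop_na(toy, ldl, tg)

# Number of rows removed, worth reporting alongside the analysis
n_dropped <- n_before - nrow(toy_complete)
n_dropped
```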
df$total_chol <- df$hdl + df$ldl + (df$tg/5)
The author adds a new column with the total cholesterol, calculated according to the Friedewald equation (rearranged for total cholesterol; the division by 5 assumes values in mg/dL).
\[ \text{Total Cholesterol} = \text{HDL} + \text{LDL} + \frac{\text{Triglycerides}}{5} \]
A subset of patients with high total cholesterol is made. The author investigates the distribution of sexes in the subset and discovers that in some rows female is coded as ‘f’ instead of ‘F’.
The author fixes the lower case ‘f’ but only in the subset, not the whole dataset. They also do not investigate whether the same problem is present for ‘m’/‘M’ in the whole dataset.
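One way to handle this would be to inspect all coding variants first and then normalise the column once in the full dataset, so that every later subset inherits the fix. A minimal sketch on made-up data (the toy values and the name toy_clean are hypothetical):

```r
library(dplyr)

# Toy data with inconsistent coding of sex (values are made up)
toy <- tibble(
  patient_id = 1:6,
  sex = c("F", "f", "M", "m", "F", "M")
)

# Inspect every coding variant before changing anything
count(toy, sex)

# Normalise the whole column once, in the full dataset
toy_clean <- toy %>%
  mutate(sex = toupper(sex))

# Check that only the expected codes remain
count(toy_clean, sex)
```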
The author unnests the original dataset into a new dataframe (which is fine). They then create another new dataframe that holds the same information in wide format. This could have been done in one step to avoid having too many very similar dataframes. It is also not clear from the naming what the mean column contains. In general, column names that are also function names should be avoided, since especially in tidyverse code it is not always obvious whether mean refers to the column or the function.
A mean value is calculated across the 5 blood pressure measurements and added to the newest dataframe.
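Unnesting, widening and averaging could be combined into a single pipeline. The sketch below assumes that bp_readings is a list column holding five unnamed numeric readings per patient, which may not match the real data. The names bp_wide and bp_mean are hypothetical, and pick() requires dplyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

# Toy data: five readings per patient in a list column (an assumption)
toy <- tibble(
  patient_id = 1:2,
  bp_readings = list(
    c(118, 121, 119, 120, 122),
    c(135, 138, 140, 137, 135)
  )
)

bp_wide <- toy %>%
  # Spread the readings into columns bp_readings_1 ... bp_readings_5
  unnest_wider(bp_readings, names_sep = "_") %>%
  # Row-wise mean under a name that does not clash with the function mean()
  mutate(bp_mean = rowMeans(pick(starts_with("bp_readings_"))))

bp_wide
```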
Warning: Unknown or uninitialised column: `bp_category`.
The author now iterates over the new mean column and creates a new column that contains an evaluation of the blood pressure. This could have been done more elegantly with mutate and case_when. The more severe problem is that the second condition is the wrong way around: bp2$mean[i] < 120. Blood pressure levels are considered elevated above 120, not below. As a consequence, the elevated and normal labels are switched. Lastly, the elevated label is spelled with a lower case letter, whereas the other categories begin with an upper case letter (inconsistent naming).
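A vectorised version with mutate and case_when might look like the sketch below. Only the 120 boundary comes from the discussion above; the 130 cutoff, the label "High" and the column name bp_mean are assumptions for illustration and not the script's actual categories.

```r
library(dplyr)

# Toy data (values are made up); bp_mean is a hypothetical column name
bp2 <- tibble(
  patient_id = 1:4,
  bp_mean = c(112, 124, 136, NA)
)

bp2 <- bp2 %>%
  mutate(
    bp_category = case_when(
      is.na(bp_mean) ~ NA_character_,  # keep missing values missing
      bp_mean < 120  ~ "Normal",       # below 120 counts as normal
      bp_mean < 130  ~ "Elevated",     # assumed upper bound of 130
      TRUE           ~ "High"          # assumed label for everything else
    )
  )

bp2
```

Because case_when evaluates its conditions in order, each reading receives the first label whose condition is true, and all labels are defined in one place with consistent capitalisation.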
merge_df <- merge(df, bp2, by = 'patient_id')
The author now chooses to merge the blood pressure dataframe back into the original dataset. This is messy because the blood pressure measurements now exist twice. They should at least have dropped the nested column.
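Dropping the nested column before joining keeps each measurement in the merged data only once. A minimal sketch on toy data, using dplyr's left_join() in place of merge() (the toy values and the bp_mean column are made up):

```r
library(dplyr)

# Toy stand-ins for the original data and the blood pressure summary
df <- tibble(
  patient_id = 1:3,
  sex = c("F", "M", "F"),
  bp_readings = list(c(118, 121), c(135, 138), c(110, 112))
)
bp2 <- tibble(
  patient_id = 1:3,
  bp_mean = c(119.5, 136.5, 111)
)

merge_df <- df %>%
  select(-bp_readings) %>%           # remove the nested duplicate first
  left_join(bp2, by = "patient_id")  # keep every patient from df

merge_df
```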
men <- merge_df %>% filter(sex == 'M')
women <- merge_df %>% filter(sex == 'F')
The author now divides the merged data, including the blood pressure category, into two more dataframes for men and women. They lose some of the data because they have not investigated and fixed misspellings of ‘M’ and ‘F’ in the original dataframe.
The plots are fine, though they are missing data points as discussed above. The same could have been achieved without two extra dataframes by using filter on the merged dataframe and then piping into ggplot.
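Filtering inside the pipeline could look like the sketch below. The plot type (a bar chart of blood pressure categories) and the toy data are assumptions, since the script's actual plots may differ.

```r
library(dplyr)
library(ggplot2)

# Toy merged data (values are made up)
merge_df <- tibble(
  patient_id = 1:6,
  sex = c("F", "F", "M", "M", "F", "M"),
  bp_category = c("Normal", "Elevated", "High", "Normal", "High", "Elevated")
)

# Filter and plot in one pipeline, without a separate dataframe for women
p_women <- merge_df %>%
  filter(sex == "F") %>%
  ggplot(aes(x = bp_category)) +
  geom_bar() +
  labs(title = "Blood pressure categories, women")

# Alternative: one plot with a panel per sex, no filtering needed
p_both <- ggplot(merge_df, aes(x = bp_category)) +
  geom_bar() +
  facet_wrap(~sex)
```

The faceted variant has the side benefit that unexpected codes such as ‘f’ or ‘m’ would show up as their own panels instead of being silently dropped.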