Exercise 6: Understanding and improving an R script

In this exercise you got through the script analysis.R together with your neighbors, find out what the data is that is being worked and which analysis is done and how. There are also some things that could be an issue, so have an eye out and try to think how you could improve the script.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

df <- read_rds('../data/patient_data.rds')
head(df)
# A tibble: 6 × 7
  patient_id   age sex     hdl   ldl    tg bp_readings     
       <int> <int> <chr> <dbl> <dbl> <dbl> <list>          
1       2463    51 M        60   166   186 <tibble [5 × 2]>
2       2511    51 F        55    NA   118 <tibble [5 × 2]>
3       8718    68 F        61    81    93 <tibble [5 × 2]>
4       2986    60 F        58   154   257 <tibble [5 × 2]>
5       1842    77 M        77    60   101 <tibble [5 × 2]>
6       9334    46 F        56   146   159 <tibble [5 × 2]>
str(df)
tibble [50 × 7] (S3: tbl_df/tbl/data.frame)
 $ patient_id : int [1:50] 2463 2511 8718 2986 1842 9334 3371 4761 6746 9819 ...
 $ age        : int [1:50] 51 51 68 60 77 46 79 78 63 33 ...
 $ sex        : chr [1:50] "M" "F" "F" "F" ...
 $ hdl        : num [1:50] 60 55 61 58 77 56 70 49 55 48 ...
 $ ldl        : num [1:50] 166 NA 81 154 60 146 123 62 124 122 ...
 $ tg         : num [1:50] 186 118 93 257 101 159 144 88 171 193 ...
 $ bp_readings:List of 50
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 107 109 109 108 105
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 130 130 129 131 129
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 118 120 117 120 116
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 135 132 141 134 136
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 144 145 143 145 145
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 119 119 119 119 123
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 112 111 110 112 113
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 140 141 139 138 143
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 107 106 105 107 107
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 108 110 108 110 107
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 147 147 146 147 145
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 112 109 116 113 109
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 129 128 127 133 132
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 111 110 112 110 110
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 107 105 106 110 107
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 152 152 152 154 151
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 112 110 115 111 111
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 100 98 97 99 101
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 114 116 115 113 114
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 144 143 143 146 142
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 128 132 128 128 127
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 121 120 118 121 122
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 134 135 132 132 133
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 148 151 146 148 152
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 127 127 124 126 128
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 134 133 133 133 134
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 114 117 114 116 115
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 111 111 108 110 110
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 104 104 107 109 107
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 116 116 112 115 116
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 159 161 161 160 156
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 153 155 152 153 153
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 129 130 129 126 130
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 133 134 132 133 133
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 114 114 117 114 114
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 113 115 115 115 112
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 134 138 134 138 131
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 110 110 112 109 108
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 108 106 106 107 109
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 108 104 108 110 112
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 107 110 109 104 106
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 109 108 110 109 106
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 104 107 106 104 106
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 121 118 122 120 122
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 116 116 117 119 116
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 104 106 102 103 107
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 108 109 104 105 108
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 108 110 108 109 110
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 114 117 114 114 110
  ..$ : tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  .. ..$ time_point    : chr [1:5] "bp1" "bp2" "bp3" "bp4" ...
  .. ..$ blood_pressure: num [1:5] 126 126 125 124 126

Libraries and data are being loaded. The data is an R dataset (rds). The user then has a look at what the data contains with head and str. We can see that there are 7 columns. The last one, bp_readings is nested. Perhaps str is not such a good idea for this type of data since the output becomes very long due to the nesting.

colSums(is.na(df))
 patient_id         age         sex         hdl         ldl          tg 
          0           0           0           0           3           2 
bp_readings 
          0 
df <- na.omit(df)
colSums(is.na(df))
 patient_id         age         sex         hdl         ldl          tg 
          0           0           0           0           0           0 
bp_readings 
          0 

The author of the script is checking for NA values in the data, then omitting them and checking again.

df$total_chol <- df$hdl + df$ldl + (df$tg/5) 

The author adds a new column which is the total cholesterol after the Friedewald equation.

\[ Total Cholesterol = HDL + LDL + \frac{Triglycerides}{5} \]

high_chol_patients <- df %>%
  filter(total_chol > 240) %>%
  select(patient_id, age, total_chol, sex)

table(high_chol_patients$sex)

F f M 
4 2 6 

A subset of patients with high total cholesterol is made. The author investigates the distribution of sexes in the subset and discovers that in some rows female is coded as ‘f’ instead of ‘F’.

high_chol_patients <- high_chol_patients %>%
  mutate(sex_fixed = ifelse(sex == 'f', 'F', sex))

The author fixes the lower case ‘f’ but only in the subset, not the whole dataset. They also do not investigate whether the same problem is present for ‘m’/‘M’ in the whole dataset.

bp1 <- df %>%
  unnest(bp_readings) %>%
  select(patient_id, time_point, blood_pressure)

bp2 <- bp1 %>% pivot_wider(names_from = time_point, values_from = blood_pressure)
mean_bp <- rowMeans(bp2[,2:6])
bp2$mean <- mean_bp

The author unnests the original dataset into a new dataframe (which is fine). They then create another new dataframe which has the same information in wide format. This could have been done in one step to avoid having too many very similar dataframes. Also, it is not very clear from the naming what is in the mean column, and generally names that are also function names should be avoided since especially in tidyverse it is not always clear whether code refers to the column mean or the function mean.

A mean value is calculated across the 5 blood pressure measurements and added to the newest dataframe.

for (i in 1:nrow(bp2)) {
  if (bp2$mean[i] > 140) {
    bp2$bp_category[i] <- "Hypertension"
  } else if (bp2$mean[i] < 120) {
    bp2$bp_category[i] <- "elevated BP"
  } else {
    bp2$bp_category[i] <- "Normal BP"
  }
}
Warning: Unknown or uninitialised column: `bp_category`.

The author new iterates over the new mean column and creates a new column that contains an evaluation of the blood pressure. This could have been done more elegantly with mutate and case_when. The more severe problem is that the second condition is the wrong way around: bp2$mean[i] < 120. Blood pressure levels are considered elevated at above 120, not below. This is also has a consequence that the elevated and normal labels are switched. Lastly, the elevated label is spelled with a lower case letter whereas the other categories begin with an upper case letter (inconsistent naming).

merge_df <- merge(df, bp2, by = 'patient_id')

The author now chooses to merge the blood pressure dataframe back into the original dataset. This is messy because the blood pressure measurements now exist twice. They should at least have dropped the nested column.

men <- merge_df %>%
  filter(sex == 'M')
women <- merge_df %>%
  filter(sex == 'F')

The author now divides the merged data including the blood pressure category into two more dataframes for men and women. They loose some of the data because they have not investigated and fixed misspellings of ‘M’ and ‘F’ in the original dataframe.

ggplot(men, aes(x = age, fill = bp_category)) +
  geom_histogram(position = 'dodge', binwidth = 10)

ggplot(women, aes(x = age, fill = bp_category)) +
  geom_histogram(position = 'dodge', binwidth = 10)

The plots are fine, though they are missing data points as discussed above. The same could have been achived without two extra dataframes by using filter on the merged dataframe and then piping into ggplot.