Presentation 2: Tidyverse

Want to code along? If you haven’t already, go to the Data tab of the website and press the DOWNLOAD PRESENTATIONS button. This is presentation2.

Load packages

# Load tidyverse package
library(tidyverse)
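# Load a package that can read Excel files
library(readxl)
# readr reads delimited text files (e.g. .tsv); tidyverse already attaches it, so this line is optional
library(readr)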

Check working directory

Check the working directory so you know which folder your relative file paths start from.

getwd()

[1] "/Users/srz223/Desktop/DataLab/FromExceltoR/Teachers/Presentations"

Importing data

Often we will work with large datasets that already exist in, e.g., an Excel sheet or a tab-separated file (.tsv). We can easily load that data into R, either by clicking on ‘Import Dataset’ in the Environment tab (right) or with a command such as the read_excel function. Let’s use the command now. Navigate to the data from your working directory. Press the Tab key inside the quotes to see what your options are.

crohns <- read_excel("../../Data/crohns_disease.xlsx")
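The same pattern works for the tab-separated files mentioned above. Here is a minimal sketch using read_tsv() from readr. The file is a small made-up example written to a temporary location (its name and three columns are hypothetical, not part of the course data):

```r
library(readr)

# Hypothetical example file: write a tiny tab-separated table to a temp file
tsv_path <- tempfile(fileext = ".tsv")
writeLines(
  c("ID\tage\ttreat",
    "19908\t47\tplacebo",
    "19909\t53\td1"),
  tsv_path
)

# read_tsv() parses a tab-separated file into a tibble
toy <- read_tsv(tsv_path, show_col_types = FALSE)
toy
```

For your own data you would replace tsv_path with the relative path to your file, just like in the read_excel() call.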

A first look at the data

Print first few lines of your dataset

head(crohns)

# A tibble: 6 × 9
     ID nrAdvE   BMI height country sex     age weight treat  
  <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
1 19908      4  25.2    163 c1      F        47     67 placebo
2 19909      4  23.8    164 c1      F        53     64 d1     
3 19910      1  23.0    164 c1      F        68     62 placebo
4 20908      1  25.7    165 c1      F        48     70 d2     
5 20909      2  26.0    170 c1      F        67     75 placebo
6 20910      2  28.7    168 c1      F        54     81 d1

Get the dimension of your dataset

dim(crohns)

[1] 117   9

How many observations (rows) do we have?

nrow(crohns)

[1] 117

How many data columns are there and what are their types? Both ‘str’ and ‘summary’ will tell you what column types you have. ‘summary’ also gives some summary statistics for numeric columns.

summary(crohns)

       ID            nrAdvE            BMI            height     
 Min.   :19908   Min.   : 0.000   Min.   :16.00   Min.   :124.0  
 1st Qu.:23909   1st Qu.: 0.000   1st Qu.:23.05   1st Qu.:157.0  
 Median :25919   Median : 1.000   Median :25.15   Median :162.0  
 Mean   :34103   Mean   : 2.034   Mean   :26.06   Mean   :162.7  
 3rd Qu.:51909   3rd Qu.: 3.000   3rd Qu.:28.40   3rd Qu.:166.0  
 Max.   :54937   Max.   :12.000   Max.   :44.06   Max.   :182.0  
   country              sex                 age            weight      
 Length:117         Length:117         Min.   :19.00   Min.   : 36.00  
 Class :character   Class :character   1st Qu.:48.00   1st Qu.: 59.00  
 Mode  :character   Mode  :character   Median :56.00   Median : 68.00  
                                       Mean   :54.66   Mean   : 69.03  
                                       3rd Qu.:62.00   3rd Qu.: 76.00  
                                       Max.   :75.00   Max.   :117.00  
    treat          
 Length:117        
 Class :character  
 Mode  :character

str(crohns)

tibble [117 × 9] (S3: tbl_df/tbl/data.frame)
 $ ID     : num [1:117] 19908 19909 19910 20908 20909 ...
 $ nrAdvE : num [1:117] 4 4 1 1 2 2 3 0 1 0 ...
 $ BMI    : num [1:117] 25.2 23.8 23.1 25.7 25.9 ...
 $ height : num [1:117] 163 164 164 165 170 168 161 168 154 157 ...
 $ country: chr [1:117] "c1" "c1" "c1" "c1" ...
 $ sex    : chr [1:117] "F" "F" "F" "F" ...
 $ age    : num [1:117] 47 53 68 48 67 54 53 53 47 58 ...
 $ weight : num [1:117] 67 64 62 70 75 81 69 74 76 82 ...
 $ treat  : chr [1:117] "placebo" "d1" "placebo" "d2" ...

The anatomy of tidyverse

Tidyverse is a collection of R packages that are great for data wrangling and visualization. Data wrangling code written with Tidyverse functions usually follows a specific syntax:

The name of the variable you are creating. Can omit if you don’t want to save the result.
The name of the dataset we are working on.
The function you want to apply on the dataset (and whatever arguments must be provided to the function).

In tidyverse we use the pipe symbol %>% to chain multiple functions together. The term pipe comes from the fact that we pipe the output from one function into another function as the input.
It is a good idea to make a new line after each pipe symbol.

# new_object <- dataset %>%
#   function1(arguments...) %>% 
#   function2(arguments...)
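To make the template concrete, here is a small sketch that compares a piped chain with the equivalent nested call. The toy tibble is an assumption for illustration: it holds two columns from the first six rows printed by head(crohns), so the example runs without the Excel file.

```r
library(dplyr)

# Toy data (assumption): age and treat from the first six rows of head(crohns)
toy <- tibble(
  age   = c(47, 53, 68, 48, 67, 54),
  treat = c("placebo", "d1", "placebo", "d2", "placebo", "d1")
)

# Piped version: reads top to bottom, in the order the steps happen
piped <- toy %>%
  filter(age > 50) %>%
  count(treat)

# Nested version: the same steps, but read from the inside out
nested <- count(filter(toy, age > 50), treat)

# Both give the same table
all.equal(piped, nested)
```

The pipe simply passes the result on its left as the first argument of the function on its right, which is why both versions agree.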

Count, distinct, sort

Count and distinct are very useful to get information about your dataset!

Variables (columns) can be numeric or categorical (characters, factors). Use the str() function to see the structure of your dataset.

crohns %>%
  str()

tibble [117 × 9] (S3: tbl_df/tbl/data.frame)
 $ ID     : num [1:117] 19908 19909 19910 20908 20909 ...
 $ nrAdvE : num [1:117] 4 4 1 1 2 2 3 0 1 0 ...
 $ BMI    : num [1:117] 25.2 23.8 23.1 25.7 25.9 ...
 $ height : num [1:117] 163 164 164 165 170 168 161 168 154 157 ...
 $ country: chr [1:117] "c1" "c1" "c1" "c1" ...
 $ sex    : chr [1:117] "F" "F" "F" "F" ...
 $ age    : num [1:117] 47 53 68 48 67 54 53 53 47 58 ...
 $ weight : num [1:117] 67 64 62 70 75 81 69 74 76 82 ...
 $ treat  : chr [1:117] "placebo" "d1" "placebo" "d2" ...

distinct() returns the unique values of a variable, which shows us how many different levels a categorical variable has.

# How many different treatments do we have? 
crohns %>% 
  distinct(treat)

# A tibble: 3 × 1
  treat  
  <chr>  
1 placebo
2 d1     
3 d2

# From how many different countries do we have data?
crohns %>% 
  distinct(country)

# A tibble: 2 × 1
  country
  <chr>  
1 c1     
2 c2

count() tabulates categorical variables. Called without arguments, it returns the total number of lines, i.e. patients, in the current dataset. Note that this matches the number of observations shown in the Environment tab.

crohns %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1   117

# How many lines, i.e. patients do we have per treatment?
crohns %>% 
  count(treat)

# A tibble: 3 × 2
  treat       n
  <chr>   <int>
1 d1         39
2 d2         39
3 placebo    39

# Is our dataset balanced?

# How many patients do we have for each age?
crohns %>% 
  count(age)

# A tibble: 43 × 2
     age     n
   <dbl> <int>
 1    19     1
 2    28     1
 3    29     1
 4    30     1
 5    33     1
 6    35     1
 7    36     1
 8    38     1
 9    39     3
10    40     2
# ℹ 33 more rows

# Perhaps this is more useful: How many patients are older than 65?
crohns %>% 
  count(age > 65)

# A tibble: 2 × 2
  `age > 65`     n
  <lgl>      <int>
1 FALSE         96
2 TRUE          21

Note that we haven’t saved anything here; we just get the tabulated output printed to the console. This helps us check whether the data looks correct and get a first impression of it.
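Two small variations on count() are often handy: sorting the table by frequency and giving a counted condition a readable name. A sketch on toy data (an assumption for illustration: age and treat from the first six rows of head(crohns)):

```r
library(dplyr)

# Toy data (assumption): first six rows of head(crohns), two columns only
toy <- tibble(
  age   = c(47, 53, 68, 48, 67, 54),
  treat = c("placebo", "d1", "placebo", "d2", "placebo", "d1")
)

# sort = TRUE puts the most frequent level first
by_treat <- toy %>%
  count(treat, sort = TRUE)
by_treat

# Naming the condition gives a cleaner column name than `age > 65`
seniors_tab <- toy %>%
  count(senior = age > 65)
seniors_tab
```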

arrange() sorts the rows by the column(s) passed to it.

crohns %>% 
  arrange(age)

# A tibble: 117 × 9
      ID nrAdvE   BMI height country sex     age weight treat  
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1 28912      0  24.0    166 c1      F        19     66 d1     
 2 23912      0  16      150 c1      F        28     36 d1     
 3 22915      3  25.3    164 c1      F        29     68 d2     
 4 54924      0  22.8    161 c2      F        30     59 d2     
 5 28909      0  24.3    172 c1      F        33     72 d1     
 6 53909      3  23.7    172 c2      M        35     70 placebo
 7 22912      2  21.0    162 c1      F        36     55 placebo
 8 51908      0  23.7    172 c2      M        38     70 d2     
 9 24908      1  27.1    173 c1      F        39     81 placebo
10 25911      8  21.1    154 c1      F        39     50 d2     
# ℹ 107 more rows

# reverse sort
crohns %>% 
  arrange(desc(age))

# A tibble: 117 × 9
      ID nrAdvE   BMI height country sex     age weight treat  
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1 22914      3  25.2    157 c1      F        75     62 d2     
 2 54926      0  20.5    153 c2      F        74     48 placebo
 3 26908      0  25.8    150 c1      F        73     58 d1     
 4 54933      2  26.4    165 c2      F        73     72 placebo
 5 22909      8  30.8    156 c1      M        71     75 placebo
 6 53910      6  24.8    162 c2      F        70     65 d1     
 7 54929      1  25.5    152 c2      F        70     59 d2     
 8 24912      0  25.2    162 c1      F        69     66 placebo
 9 25920      1  27.9    164 c1      F        69     75 d1     
10 53908      1  31.2    152 c2      F        69     72 placebo
# ℹ 107 more rows

# sort by two (or more!) columns
crohns %>% 
  arrange(sex, desc(age))

# A tibble: 117 × 9
      ID nrAdvE   BMI height country sex     age weight treat  
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1 22914      3  25.2    157 c1      F        75     62 d2     
 2 54926      0  20.5    153 c2      F        74     48 placebo
 3 26908      0  25.8    150 c1      F        73     58 d1     
 4 54933      2  26.4    165 c2      F        73     72 placebo
 5 53910      6  24.8    162 c2      F        70     65 d1     
 6 54929      1  25.5    152 c2      F        70     59 d2     
 7 24912      0  25.2    162 c1      F        69     66 placebo
 8 25920      1  27.9    164 c1      F        69     75 d1     
 9 53908      1  31.2    152 c2      F        69     72 placebo
10 19910      1  23.0    164 c1      F        68     62 placebo
# ℹ 107 more rows

Just like with count earlier, this is not a permanent sort and does not change the order of rows in the original tibble, crohns. Without assignment (<-), tidyverse commands only display the result; they do not save it.
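A quick way to convince yourself of this is to sort a small tibble with and without assignment. The toy tibble below is an assumption for illustration (ID and age from the first six rows of head(crohns)):

```r
library(dplyr)

# Toy data (assumption): ID and age from the first six rows of head(crohns)
toy <- tibble(
  ID  = c(19908, 19909, 19910, 20908, 20909, 20910),
  age = c(47, 53, 68, 48, 67, 54)
)

# Without assignment: the sorted table is printed, toy itself is untouched
toy %>%
  arrange(desc(age))
toy$age

# With assignment: the sorted table is stored under a new name
by_age <- toy %>%
  arrange(desc(age))
by_age$age
```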

Filtering data (selecting rows) with `filter()`

filter() is how we split a dataset into subsets we find interesting. For example, only female patients:

crohns %>% 
  filter(sex == 'F') # processed from left to right

# A tibble: 100 × 9
      ID nrAdvE   BMI height country sex     age weight treat  
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1 19908      4  25.2    163 c1      F        47     67 placebo
 2 19909      4  23.8    164 c1      F        53     64 d1     
 3 19910      1  23.0    164 c1      F        68     62 placebo
 4 20908      1  25.7    165 c1      F        48     70 d2     
 5 20909      2  26.0    170 c1      F        67     75 placebo
 6 20910      2  28.7    168 c1      F        54     81 d1     
 7 21908      3  26.6    161 c1      F        53     69 d1     
 8 21909      0  26.2    168 c1      F        53     74 placebo
 9 21910      1  32.0    154 c1      F        47     76 d2     
10 21911      0  33.3    157 c1      F        58     82 placebo
# ℹ 90 more rows

A great thing about tidyverse: you write code the way you think. You always filter by defining conditions. If the condition evaluates to ‘TRUE’ the line is included. See only data lines for patients over 65:

crohns %>% 
  filter(age > 65)

# A tibble: 21 × 9
      ID nrAdvE   BMI height country sex     age weight treat  
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1 19910      1  23.0    164 c1      F        68     62 placebo
 2 20909      2  26.0    170 c1      F        67     75 placebo
 3 22908      5  18.2    159 c1      F        66     46 d2     
 4 22909      8  30.8    156 c1      M        71     75 placebo
 5 22911      3  24.8    182 c1      M        68     82 d2     
 6 22914      3  25.2    157 c1      F        75     62 d2     
 7 24912      0  25.2    162 c1      F        69     66 placebo
 8 25920      1  27.9    164 c1      F        69     75 d1     
 9 26908      0  25.8    150 c1      F        73     58 d1     
10 26910      0  19.1    165 c1      F        66     52 d1     
# ℹ 11 more rows

The commands above only print the result to the console. This is useful for checking something. To save the result, we need to assign it to a variable:

seniors <- crohns %>% 
  filter(age > 65)

View the newly created data frame:

seniors

# A tibble: 21 × 9
      ID nrAdvE   BMI height country sex     age weight treat  
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1 19910      1  23.0    164 c1      F        68     62 placebo
 2 20909      2  26.0    170 c1      F        67     75 placebo
 3 22908      5  18.2    159 c1      F        66     46 d2     
 4 22909      8  30.8    156 c1      M        71     75 placebo
 5 22911      3  24.8    182 c1      M        68     82 d2     
 6 22914      3  25.2    157 c1      F        75     62 d2     
 7 24912      0  25.2    162 c1      F        69     66 placebo
 8 25920      1  27.9    164 c1      F        69     75 d1     
 9 26908      0  25.8    150 c1      F        73     58 d1     
10 26910      0  19.1    165 c1      F        66     52 d1     
# ℹ 11 more rows

Do we still have all three treatment groups in our subset?

seniors %>%
  count(treat)

# A tibble: 3 × 2
  treat       n
  <chr>   <int>
1 d1          6
2 d2          5
3 placebo    10

The world of conditional operators

Now we can get the lines that fit a certain condition, but what if we want to filter on more than one condition? Enter conditional operators!

The ‘and’ operator: &

We can subset on several conditions at once. Here are the younger patients (65 or under) who received drug 1:

crohns %>% 
  filter(age <= 65 & treat == 'd1')

# A tibble: 33 × 9
      ID nrAdvE   BMI height country sex     age weight treat
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>
 1 19909      4  23.8    164 c1      F        53     64 d1   
 2 20910      2  28.7    168 c1      F        54     81 d1   
 3 21908      3  26.6    161 c1      F        53     69 d1   
 4 21916      0  23.9    177 c1      M        56     75 d1   
 5 22916      2  30.9    163 c1      F        53     82 d1   
 6 23908      0  30.4    158 c1      F        55     76 d1   
 7 23909      0  23.4    156 c1      F        44     57 d1   
 8 23910      0  26.7    156 c1      F        59     65 d1   
 9 23912      0  16      150 c1      F        28     36 d1   
10 24909      0  22.5    155 c1      F        52     54 d1   
# ℹ 23 more rows
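filter() also accepts several conditions separated by commas, which it combines with ‘and’. A small sketch on toy data (an assumption for illustration: age and treat from the first six rows of head(crohns)):

```r
library(dplyr)

# Toy data (assumption): first six rows of head(crohns), two columns only
toy <- tibble(
  age   = c(47, 53, 68, 48, 67, 54),
  treat = c("placebo", "d1", "placebo", "d2", "placebo", "d1")
)

# Explicit 'and' operator
with_and <- toy %>%
  filter(age <= 65 & treat == "d1")

# Comma-separated conditions: every condition must be TRUE for a row to be kept
with_comma <- toy %>%
  filter(age <= 65, treat == "d1")

all.equal(with_and, with_comma)
```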

The ‘or’ operator: |

Get patients that were treated with either drug 1 or the placebo:

crohns %>% 
  filter(treat == 'placebo' | treat == 'd1')

# A tibble: 78 × 9
      ID nrAdvE   BMI height country sex     age weight treat  
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1 19908      4  25.2    163 c1      F        47     67 placebo
 2 19909      4  23.8    164 c1      F        53     64 d1     
 3 19910      1  23.0    164 c1      F        68     62 placebo
 4 20909      2  26.0    170 c1      F        67     75 placebo
 5 20910      2  28.7    168 c1      F        54     81 d1     
 6 21908      3  26.6    161 c1      F        53     69 d1     
 7 21909      0  26.2    168 c1      F        53     74 placebo
 8 21911      0  33.3    157 c1      F        58     82 placebo
 9 21914      6  28.4    170 c1      M        58     82 placebo
10 21916      0  23.9    177 c1      M        56     75 d1     
# ℹ 68 more rows

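The ‘not’ operator: !

Get patients that did not receive the placebo. Here ! is combined with = into the ‘not equal’ operator, !=:
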
crohns %>% 
  filter(treat != 'placebo')

# A tibble: 78 × 9
      ID nrAdvE   BMI height country sex     age weight treat
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>
 1 19909      4  23.8    164 c1      F        53     64 d1   
 2 20908      1  25.7    165 c1      F        48     70 d2   
 3 20910      2  28.7    168 c1      F        54     81 d1   
 4 21908      3  26.6    161 c1      F        53     69 d1   
 5 21910      1  32.0    154 c1      F        47     76 d2   
 6 21912      5  32.5    152 c1      F        63     75 d2   
 7 21913      2  37.6    159 c1      F        54     95 d2   
 8 21915      0  23.0    160 c1      F        54     59 d2   
 9 21916      0  23.9    177 c1      M        56     75 d1   
10 21917      0  36.4    164 c1      F        51     98 d2   
# ℹ 68 more rows
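The ! operator can also negate a whole condition, not just an equality test. A sketch on toy data (an assumption for illustration: age and treat from the first six rows of head(crohns)):

```r
library(dplyr)

# Toy data (assumption): first six rows of head(crohns), two columns only
toy <- tibble(
  age   = c(47, 53, 68, 48, 67, 54),
  treat = c("placebo", "d1", "placebo", "d2", "placebo", "d1")
)

# 'Not equal' written with !=
not_equal <- toy %>%
  filter(treat != "placebo")

# The same filter, written by negating the equality test
negated <- toy %>%
  filter(!(treat == "placebo"))

all.equal(not_equal, negated)

# Negating a membership test: everyone who got neither d1 nor d2
not_in <- toy %>%
  filter(!(treat %in% c("d1", "d2")))
not_in
```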

Other conditional operators can be found in the first presentation or in the help page of the function (or just google it).

?dplyr::filter

You can also match a variable against several allowed values with %in%. Here are the younger patients who were treated with either drug 1 or drug 2:

crohns %>% 
  filter(age <= 65 & treat %in% c("d1","d2"))

# A tibble: 67 × 9
      ID nrAdvE   BMI height country sex     age weight treat
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>
 1 19909      4  23.8    164 c1      F        53     64 d1   
 2 20908      1  25.7    165 c1      F        48     70 d2   
 3 20910      2  28.7    168 c1      F        54     81 d1   
 4 21908      3  26.6    161 c1      F        53     69 d1   
 5 21910      1  32.0    154 c1      F        47     76 d2   
 6 21912      5  32.5    152 c1      F        63     75 d2   
 7 21913      2  37.6    159 c1      F        54     95 d2   
 8 21915      0  23.0    160 c1      F        54     59 d2   
 9 21916      0  23.9    177 c1      M        56     75 d1   
10 21917      0  36.4    164 c1      F        51     98 d2   
# ℹ 57 more rows

Selecting variables (columns) with `select()`

We can choose to only include certain columns. Here, we select only BMI, age and the number of adverse events:

crohns %>% 
  select(nrAdvE, BMI, age)

# A tibble: 117 × 3
   nrAdvE   BMI   age
    <dbl> <dbl> <dbl>
 1      4  25.2    47
 2      4  23.8    53
 3      1  23.0    68
 4      1  25.7    48
 5      2  26.0    67
 6      2  28.7    54
 7      3  26.6    53
 8      0  26.2    53
 9      1  32.0    47
10      0  33.3    58
# ℹ 107 more rows

We can also make a negative selection that excludes the named column(s). The ID doesn’t give us any information since the data is anonymized:

without_id <- crohns %>% 
  select(-ID)

We have saved the dataset without the ID column in a new variable. Let’s have a look at it:

without_id

# A tibble: 117 × 8
   nrAdvE   BMI height country sex     age weight treat  
    <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1      4  25.2    163 c1      F        47     67 placebo
 2      4  23.8    164 c1      F        53     64 d1     
 3      1  23.0    164 c1      F        68     62 placebo
 4      1  25.7    165 c1      F        48     70 d2     
 5      2  26.0    170 c1      F        67     75 placebo
 6      2  28.7    168 c1      F        54     81 d1     
 7      3  26.6    161 c1      F        53     69 d1     
 8      0  26.2    168 c1      F        53     74 placebo
 9      1  32.0    154 c1      F        47     76 d2     
10      0  33.3    157 c1      F        58     82 placebo
# ℹ 107 more rows
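Besides naming columns one by one, select() understands ranges and helper functions. A sketch of a few common ones on toy data (an assumption for illustration: six columns from the first six rows of head(crohns)):

```r
library(dplyr)

# Toy data (assumption): first six rows of head(crohns), six columns only
toy <- tibble(
  ID     = c(19908, 19909, 19910, 20908, 20909, 20910),
  nrAdvE = c(4, 4, 1, 1, 2, 2),
  BMI    = c(25.2, 23.8, 23.0, 25.7, 26.0, 28.7),
  height = c(163, 164, 164, 165, 170, 168),
  sex    = c("F", "F", "F", "F", "F", "F"),
  treat  = c("placebo", "d1", "placebo", "d2", "placebo", "d1")
)

# A range of neighbouring columns
first_three <- toy %>% select(ID:BMI)

# Columns whose name starts with a given prefix
h_cols <- toy %>% select(starts_with("h"))

# Columns of a given type
char_cols <- toy %>% select(where(is.character))

# Drop several columns at once
dropped <- toy %>% select(-c(ID, height))

names(first_three)
names(h_cols)
names(char_cols)
names(dropped)
```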

Transformation of data with `mutate()`

We can create new columns based on other columns with the mutate() function.

This is our original tibble:

crohns

# A tibble: 117 × 9
      ID nrAdvE   BMI height country sex     age weight treat  
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>  
 1 19908      4  25.2    163 c1      F        47     67 placebo
 2 19909      4  23.8    164 c1      F        53     64 d1     
 3 19910      1  23.0    164 c1      F        68     62 placebo
 4 20908      1  25.7    165 c1      F        48     70 d2     
 5 20909      2  26.0    170 c1      F        67     75 placebo
 6 20910      2  28.7    168 c1      F        54     81 d1     
 7 21908      3  26.6    161 c1      F        53     69 d1     
 8 21909      0  26.2    168 c1      F        53     74 placebo
 9 21910      1  32.0    154 c1      F        47     76 d2     
10 21911      0  33.3    157 c1      F        58     82 placebo
# ℹ 107 more rows

We want to add height in meters as a new column. It is important to reassign the data frame if you want to keep the new column.

crohns <- crohns %>% 
  mutate(height_m = height/100)

crohns

# A tibble: 117 × 10
      ID nrAdvE   BMI height country sex     age weight treat   height_m
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>      <dbl>
 1 19908      4  25.2    163 c1      F        47     67 placebo     1.63
 2 19909      4  23.8    164 c1      F        53     64 d1          1.64
 3 19910      1  23.0    164 c1      F        68     62 placebo     1.64
 4 20908      1  25.7    165 c1      F        48     70 d2          1.65
 5 20909      2  26.0    170 c1      F        67     75 placebo     1.7 
 6 20910      2  28.7    168 c1      F        54     81 d1          1.68
 7 21908      3  26.6    161 c1      F        53     69 d1          1.61
 8 21909      0  26.2    168 c1      F        53     74 placebo     1.68
 9 21910      1  32.0    154 c1      F        47     76 d2          1.54
10 21911      0  33.3    157 c1      F        58     82 placebo     1.57
# ℹ 107 more rows

We can also create columns based on TRUE/FALSE conditions. According to the CDC, a person with a BMI < 18.5 is underweight:

crohns <- crohns %>% 
  mutate(underweight = ifelse(BMI < 18.5, "Yes", "No"))

crohns

# A tibble: 117 × 11
      ID nrAdvE   BMI height country sex     age weight treat   height_m
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>      <dbl>
 1 19908      4  25.2    163 c1      F        47     67 placebo     1.63
 2 19909      4  23.8    164 c1      F        53     64 d1          1.64
 3 19910      1  23.0    164 c1      F        68     62 placebo     1.64
 4 20908      1  25.7    165 c1      F        48     70 d2          1.65
 5 20909      2  26.0    170 c1      F        67     75 placebo     1.7 
 6 20910      2  28.7    168 c1      F        54     81 d1          1.68
 7 21908      3  26.6    161 c1      F        53     69 d1          1.61
 8 21909      0  26.2    168 c1      F        53     74 placebo     1.68
 9 21910      1  32.0    154 c1      F        47     76 d2          1.54
10 21911      0  33.3    157 c1      F        58     82 placebo     1.57
# ℹ 107 more rows
# ℹ 1 more variable: underweight <chr>

How many patients are underweight?

crohns %>%
  count(underweight)

# A tibble: 2 × 2
  underweight     n
  <chr>       <int>
1 No            113
2 Yes             4
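ifelse() handles one condition with two outcomes. For more than two categories, case_when() from dplyr is one option. The sketch below uses toy data (ID and rounded BMI from the first six rows of head(crohns)). The cutoffs of 18.5, 25 and 30 are an assumption based on the commonly used adult BMI categories, so check them against your own reference before relying on them.

```r
library(dplyr)

# Toy data (assumption): ID and rounded BMI from the first six rows of head(crohns)
toy <- tibble(
  ID  = c(19908, 19909, 19910, 20908, 20909, 20910),
  BMI = c(25.2, 23.8, 23.0, 25.7, 26.0, 28.7)
)

# Conditions are checked from top to bottom; the first one that is TRUE wins
toy <- toy %>%
  mutate(bmi_class = case_when(
    BMI < 18.5 ~ "underweight",
    BMI < 25   ~ "healthy",
    BMI < 30   ~ "overweight",
    TRUE       ~ "obese"        # everything else (a missing BMI would also land here)
  ))

toy %>%
  count(bmi_class)
```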

Have a look at the mutate() function:

?mutate

Summary statistics, revisited with `summarize()`

Methods from before:

mean(crohns$age)

[1] 54.65812

max(crohns$age)

[1] 75

summary(crohns)

       ID            nrAdvE            BMI            height     
 Min.   :19908   Min.   : 0.000   Min.   :16.00   Min.   :124.0  
 1st Qu.:23909   1st Qu.: 0.000   1st Qu.:23.05   1st Qu.:157.0  
 Median :25919   Median : 1.000   Median :25.15   Median :162.0  
 Mean   :34103   Mean   : 2.034   Mean   :26.06   Mean   :162.7  
 3rd Qu.:51909   3rd Qu.: 3.000   3rd Qu.:28.40   3rd Qu.:166.0  
 Max.   :54937   Max.   :12.000   Max.   :44.06   Max.   :182.0  
   country              sex                 age            weight      
 Length:117         Length:117         Min.   :19.00   Min.   : 36.00  
 Class :character   Class :character   1st Qu.:48.00   1st Qu.: 59.00  
 Mode  :character   Mode  :character   Median :56.00   Median : 68.00  
                                       Mean   :54.66   Mean   : 69.03  
                                       3rd Qu.:62.00   3rd Qu.: 76.00  
                                       Max.   :75.00   Max.   :117.00  
    treat              height_m     underweight       
 Length:117         Min.   :1.240   Length:117        
 Class :character   1st Qu.:1.570   Class :character  
 Mode  :character   Median :1.620   Mode  :character  
                    Mean   :1.627                     
                    3rd Qu.:1.660                     
                    Max.   :1.820

The summarize() function does the same but in a tidyverse way and gives the result in a table which you can export and send to your colleagues.

crohns %>% 
  summarize(mean(age),
            max(age))

# A tibble: 1 × 2
  `mean(age)` `max(age)`
        <dbl>      <dbl>
1        54.7         75

We can also specify names for the new columns:

crohns %>% 
  summarize(mean_age = mean(age),
            max_age = max(age))

# A tibble: 1 × 2
  mean_age max_age
     <dbl>   <dbl>
1     54.7      75

What kind of things can you summarize? Have a look at the help by typing ?summarize into the console, or ‘summarize’ into the help panel and scroll down to ‘Useful functions’.

A useful function inside summarize() is n(), which counts the number of lines.

crohns %>% 
  summarize(mean_age = mean(age),
            max_age = max(age),
            number_lines = n())

# A tibble: 1 × 3
  mean_age max_age number_lines
     <dbl>   <dbl>        <int>
1     54.7      75          117
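When the same statistic is needed for several columns, across() saves typing. A sketch on toy data (an assumption for illustration: age and weight from the first six rows of head(crohns)):

```r
library(dplyr)

# Toy data (assumption): age and weight from the first six rows of head(crohns)
toy <- tibble(
  age    = c(47, 53, 68, 48, 67, 54),
  weight = c(67, 64, 62, 70, 75, 81)
)

# One function applied to several columns; result columns keep the original names
means <- toy %>%
  summarize(across(c(age, weight), mean))
means

# Several functions at once; result columns are named <column>_<function>
stats <- toy %>%
  summarize(across(c(age, weight), list(mean = mean, max = max)))
names(stats)
```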

Note that the tidyverse is tolerant of British/American spelling differences: summarise() and summarize() are the same function, and likewise for color and colour.

Grouping with `group_by()`

The function group_by() imposes a grouping on a tibble. Group according to sex:

crohns %>%
  group_by(sex)

# A tibble: 117 × 11
# Groups:   sex [2]
      ID nrAdvE   BMI height country sex     age weight treat   height_m
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>      <dbl>
 1 19908      4  25.2    163 c1      F        47     67 placebo     1.63
 2 19909      4  23.8    164 c1      F        53     64 d1          1.64
 3 19910      1  23.0    164 c1      F        68     62 placebo     1.64
 4 20908      1  25.7    165 c1      F        48     70 d2          1.65
 5 20909      2  26.0    170 c1      F        67     75 placebo     1.7 
 6 20910      2  28.7    168 c1      F        54     81 d1          1.68
 7 21908      3  26.6    161 c1      F        53     69 d1          1.61
 8 21909      0  26.2    168 c1      F        53     74 placebo     1.68
 9 21910      1  32.0    154 c1      F        47     76 d2          1.54
10 21911      0  33.3    157 c1      F        58     82 placebo     1.57
# ℹ 107 more rows
# ℹ 1 more variable: underweight <chr>

We can also group according to several variables. How many groups will we get?

crohns %>%
  group_by(sex, treat)

# A tibble: 117 × 11
# Groups:   sex, treat [6]
      ID nrAdvE   BMI height country sex     age weight treat   height_m
   <dbl>  <dbl> <dbl>  <dbl> <chr>   <chr> <dbl>  <dbl> <chr>      <dbl>
 1 19908      4  25.2    163 c1      F        47     67 placebo     1.63
 2 19909      4  23.8    164 c1      F        53     64 d1          1.64
 3 19910      1  23.0    164 c1      F        68     62 placebo     1.64
 4 20908      1  25.7    165 c1      F        48     70 d2          1.65
 5 20909      2  26.0    170 c1      F        67     75 placebo     1.7 
 6 20910      2  28.7    168 c1      F        54     81 d1          1.68
 7 21908      3  26.6    161 c1      F        53     69 d1          1.61
 8 21909      0  26.2    168 c1      F        53     74 placebo     1.68
 9 21910      1  32.0    154 c1      F        47     76 d2          1.54
10 21911      0  33.3    157 c1      F        58     82 placebo     1.57
# ℹ 107 more rows
# ℹ 1 more variable: underweight <chr>

By itself, group_by() does not change the data; we still get the same dataset returned, now with a grouping attached. But it is very useful in combination with other commands! We can first impose a grouping with group_by() and then pipe, %>%, the resulting tibble into summarize(), which will respect our grouping. So smart!

crohns %>%                      # the dataset
  group_by(sex) %>%             # grouped by sex
  summarise(avg = mean(age),    # calculate mean of the age
            med = median(age),  # calc median
            stdev = sd(age),    # calc standard dev.
            n = n())            # get the number of observations

# A tibble: 2 × 5
  sex     avg   med stdev     n
  <chr> <dbl> <dbl> <dbl> <int>
1 F      54.7    55  10.8   100
2 M      54.3    56  10.4    17

Now we see why n() is useful: It tells us how many lines, i.e. patients are in each group.

Group by sex and treatment, and calculate stats for the number of adverse events.

crohns %>%                              # the dataset
  group_by(sex, treat) %>%              # grouped by sex and treatment
  summarise(avg = mean(nrAdvE),         # calculate mean number of adverse events
            med = median(nrAdvE),       # calc median
            max = max(nrAdvE),          # calc max 
            stdev = sd(nrAdvE),         # calc standard dev.
            total_events = sum(nrAdvE), # calc total number of events
            n = n())                    # get the number of observations

# A tibble: 6 × 8
# Groups:   sex [2]
  sex   treat     avg   med   max stdev total_events     n
  <chr> <chr>   <dbl> <dbl> <dbl> <dbl>        <dbl> <int>
1 F     d1       1.5      0     7  2.08           51    34
2 F     d2       2.12     1     9  2.71           72    34
3 F     placebo  2.16     1    12  3.09           69    32
4 M     d1       2        0     9  3.94           10     5
5 M     d2       2.2      2     6  2.49           11     5
6 M     placebo  3.57     3     8  3.41           25     7
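Notice the line ‘# Groups: sex [2]’ in the output: after grouping by two variables, summarise() removes only the last grouping level, so the result is still grouped by sex. If you want a plain, ungrouped table, one option is the .groups argument. A sketch on toy data (an assumption for illustration: three columns from the first six rows of head(crohns)):

```r
library(dplyr)

# Toy data (assumption): first six rows of head(crohns), three columns only
toy <- tibble(
  sex    = c("F", "F", "F", "F", "F", "F"),
  treat  = c("placebo", "d1", "placebo", "d2", "placebo", "d1"),
  nrAdvE = c(4, 4, 1, 1, 2, 2)
)

# Default behaviour: the result keeps the first grouping variable
still_grouped <- toy %>%
  group_by(sex, treat) %>%
  summarise(avg = mean(nrAdvE))
group_vars(still_grouped)

# .groups = "drop" returns an ungrouped tibble
dropped <- toy %>%
  group_by(sex, treat) %>%
  summarise(avg = mean(nrAdvE), .groups = "drop")
group_vars(dropped)
```

This matters when you keep piping: a later summarise() or mutate() on a tibble that is still grouped would work per group rather than on the whole table.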

The might of the pipe operator: `%>%`

Many commands can be combined with the pipe operator to pipe data through an analysis workflow.

crohns %>%                              # the dataset
  filter(age > 65) %>%                  # filtered to only people over 65
  group_by(sex, treat) %>%              # Grouping 
  summarise(avg = mean(nrAdvE),         # calculate mean number of adverse events
            med = median(nrAdvE),       # calc median
            max = max(nrAdvE),          # calc max 
            stdev = sd(nrAdvE),         # calc standard dev.
            total_events = sum(nrAdvE), # calc total number of events
            n = n()) %>%                # get the number of observations
  arrange(avg)                          # Sort output by the mean

# A tibble: 5 × 8
# Groups:   sex [2]
  sex   treat     avg   med   max stdev total_events     n
  <chr> <chr>   <dbl> <dbl> <dbl> <dbl>        <dbl> <int>
1 F     d1       1.83   0.5     6  2.56           11     6
2 F     placebo  2.22   1      12  3.73           20     9
3 F     d2       3      3       5  2               9     3
4 M     d2       4.5    4.5     6  2.12            9     2
5 M     placebo  8      8       8 NA               8     1

What if we want to do the same analysis but with only obese patients? The CDC classifies a BMI of 30 or higher as obese.
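One possible way to approach this is to swap the age filter for a BMI filter and leave the rest of the pipeline as it is. The sketch below assumes the cutoff means a BMI of 30 or higher (use BMI > 30 for the stricter reading) and runs on a toy tibble of nine rows copied from the outputs shown earlier, so the numbers are not the result for the full dataset.

```r
library(dplyr)

# Toy data (assumption): nine patients copied from the printed outputs above
toy <- tibble(
  ID     = c(19908, 21910, 21911, 21912, 21913, 21917, 22909, 22916, 23908),
  nrAdvE = c(4, 1, 0, 5, 2, 0, 8, 2, 0),
  BMI    = c(25.2, 32.0, 33.3, 32.5, 37.6, 36.4, 30.8, 30.9, 30.4),
  sex    = c("F", "F", "F", "F", "F", "F", "M", "F", "F"),
  treat  = c("placebo", "d2", "placebo", "d2", "d2", "d2", "placebo", "d1", "d1")
)

obese_stats <- toy %>%
  filter(BMI >= 30) %>%                      # keep only obese patients (assumed cutoff)
  group_by(sex, treat) %>%                   # grouped by sex and treatment
  summarise(avg = mean(nrAdvE),              # mean number of adverse events
            med = median(nrAdvE),            # median
            total_events = sum(nrAdvE),      # total number of events
            n = n(),                         # number of patients per group
            .groups = "drop") %>%            # return an ungrouped table
  arrange(avg)                               # sort output by the mean

obese_stats
```

On the full dataset you would replace toy with crohns. Groups with a single patient will give NA if you also ask for sd(), as in the last row of the table above.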
