# Load tidyverse package
library(tidyverse)
# Load a package that can read excel files
library(readxl)
library(readr)
Presentation 2: Tidyverse
Want to code along?
Load packages
Check working directory
Check working directory so you know from where you work.
getwd()
[1] "/Users/srz223/Desktop/DataLab/FromExceltoR/Teachers/Presentations"
Importing data
Often we will work with large datasets that already exist in i.e. an excel sheet or a tab separated file (.tsv). We can easily load that data into R, either with the read_excel
function or by clicking on ‘Import Dataset’ in the Environment tab (right). We can also load data in via a command. Let’s do this now. Navigate to the data from your working directory. Use the tap-button to check what your options are.
<- read_excel("../../Data/crohns_disease.xlsx") crohns
A first look at the data
Print first few lines of your dataset
head(crohns)
# A tibble: 6 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19908 4 25.2 163 c1 F 47 67 placebo
2 19909 4 23.8 164 c1 F 53 64 d1
3 19910 1 23.0 164 c1 F 68 62 placebo
4 20908 1 25.7 165 c1 F 48 70 d2
5 20909 2 26.0 170 c1 F 67 75 placebo
6 20910 2 28.7 168 c1 F 54 81 d1
Get the dimension of your dataset
dim(crohns)
[1] 117 9
How many observations (rows) do have?
nrow(crohns)
[1] 117
How many data columns are there and what are their types? Both ‘str’ and ‘summary’ will you what column types you have. Summary has some extra summary stats on numeric columns.
summary(crohns)
ID nrAdvE BMI height
Min. :19908 Min. : 0.000 Min. :16.00 Min. :124.0
1st Qu.:23909 1st Qu.: 0.000 1st Qu.:23.05 1st Qu.:157.0
Median :25919 Median : 1.000 Median :25.15 Median :162.0
Mean :34103 Mean : 2.034 Mean :26.06 Mean :162.7
3rd Qu.:51909 3rd Qu.: 3.000 3rd Qu.:28.40 3rd Qu.:166.0
Max. :54937 Max. :12.000 Max. :44.06 Max. :182.0
country sex age weight
Length:117 Length:117 Min. :19.00 Min. : 36.00
Class :character Class :character 1st Qu.:48.00 1st Qu.: 59.00
Mode :character Mode :character Median :56.00 Median : 68.00
Mean :54.66 Mean : 69.03
3rd Qu.:62.00 3rd Qu.: 76.00
Max. :75.00 Max. :117.00
treat
Length:117
Class :character
Mode :character
str(crohns)
tibble [117 × 9] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:117] 19908 19909 19910 20908 20909 ...
$ nrAdvE : num [1:117] 4 4 1 1 2 2 3 0 1 0 ...
$ BMI : num [1:117] 25.2 23.8 23.1 25.7 25.9 ...
$ height : num [1:117] 163 164 164 165 170 168 161 168 154 157 ...
$ country: chr [1:117] "c1" "c1" "c1" "c1" ...
$ sex : chr [1:117] "F" "F" "F" "F" ...
$ age : num [1:117] 47 53 68 48 67 54 53 53 47 58 ...
$ weight : num [1:117] 67 64 62 70 75 81 69 74 76 82 ...
$ treat : chr [1:117] "placebo" "d1" "placebo" "d2" ...
The anatomy of tidyverse
Tidyverse is a collection of R packages that are great for data wrangling and visualizations. Data wrangling with functions from the Tidyverse are often used with a specific syntax:
The name of the variable you are creating. Can omit if you don’t want to save the result.
The name of the dataset we are working on.
The function you want to apply on the dataset (and whatever arguments must be provided to the function).
In tidyverse we use the pipe symbol %>%
to chain multiple functions together. The term pipe
comes from the fact that we pipe the output from one function into another function as the input.
It is a good idea to make a new line after each pipe symbol.
# new_object <- dataset %>%
# function1(arguments...) %>%
# function2(arguments...)
Count, distinct, sort
Count and distinct are very useful to get information about your dataset!
Variables (columns) can be numeric or categorical (characters, factors). Use the str()
function to see the structure of your dataset.
%>%
crohns str()
tibble [117 × 9] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:117] 19908 19909 19910 20908 20909 ...
$ nrAdvE : num [1:117] 4 4 1 1 2 2 3 0 1 0 ...
$ BMI : num [1:117] 25.2 23.8 23.1 25.7 25.9 ...
$ height : num [1:117] 163 164 164 165 170 168 161 168 154 157 ...
$ country: chr [1:117] "c1" "c1" "c1" "c1" ...
$ sex : chr [1:117] "F" "F" "F" "F" ...
$ age : num [1:117] 47 53 68 48 67 54 53 53 47 58 ...
$ weight : num [1:117] 67 64 62 70 75 81 69 74 76 82 ...
$ treat : chr [1:117] "placebo" "d1" "placebo" "d2" ...
distinct()
tells us how many different levels a categorical variable has.
# How many different treatments do we have?
%>%
crohns distinct(treat)
# A tibble: 3 × 1
treat
<chr>
1 placebo
2 d1
3 d2
#From how many different countries do we have data?
%>%
crohns distinct(country)
# A tibble: 2 × 1
country
<chr>
1 c1
2 c2
count()
does tabulation of categorical variables. Total number of lines, i.e. patients in the current dataset. Observe, this matches with the number of lines you can see in the Environment tab.
%>%
crohns count()
# A tibble: 1 × 1
n
<int>
1 117
# How many lines, i.e. patients do we have per treatment?
%>%
crohns count(treat)
# A tibble: 3 × 2
treat n
<chr> <int>
1 d1 39
2 d2 39
3 placebo 39
# Is our dataset balanced?
# How many patients do we have for each age?
%>%
crohns count(age)
# A tibble: 43 × 2
age n
<dbl> <int>
1 19 1
2 28 1
3 29 1
4 30 1
5 33 1
6 35 1
7 36 1
8 38 1
9 39 3
10 40 2
# ℹ 33 more rows
# Perhaps this is more useful: How many patients are older than 65?
%>%
crohns count(age > 65)
# A tibble: 2 × 2
`age > 65` n
<lgl> <int>
1 FALSE 96
2 TRUE 21
Note we haven’t saved anything here, we just get output to the console sorted in a certain way. This helps us to check if the data looks correct and get an impression.
Filtering data (selecting rows) with filter()
How we subset dataset into subsets we find interesting. For example only female patients:
%>%
crohns filter(sex == 'F') # processed from left to right
# A tibble: 100 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19908 4 25.2 163 c1 F 47 67 placebo
2 19909 4 23.8 164 c1 F 53 64 d1
3 19910 1 23.0 164 c1 F 68 62 placebo
4 20908 1 25.7 165 c1 F 48 70 d2
5 20909 2 26.0 170 c1 F 67 75 placebo
6 20910 2 28.7 168 c1 F 54 81 d1
7 21908 3 26.6 161 c1 F 53 69 d1
8 21909 0 26.2 168 c1 F 53 74 placebo
9 21910 1 32.0 154 c1 F 47 76 d2
10 21911 0 33.3 157 c1 F 58 82 placebo
# ℹ 90 more rows
A great about tidyverse: write code the way you think. You always filter by defining conditions. If the condition evaluates to ‘TRUE’ the line is included. See only data lines for patients over 65:
%>%
crohns filter(age > 65)
# A tibble: 21 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19910 1 23.0 164 c1 F 68 62 placebo
2 20909 2 26.0 170 c1 F 67 75 placebo
3 22908 5 18.2 159 c1 F 66 46 d2
4 22909 8 30.8 156 c1 M 71 75 placebo
5 22911 3 24.8 182 c1 M 68 82 d2
6 22914 3 25.2 157 c1 F 75 62 d2
7 24912 0 25.2 162 c1 F 69 66 placebo
8 25920 1 27.9 164 c1 F 69 75 d1
9 26908 0 25.8 150 c1 F 73 58 d1
10 26910 0 19.1 165 c1 F 66 52 d1
# ℹ 11 more rows
From the above commands we are getting the result printed to the console. This is useful to check something. To save the result, we need to re-assign:
<- crohns %>%
seniors filter(age > 65)
View newly created data frame:
seniors
# A tibble: 21 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19910 1 23.0 164 c1 F 68 62 placebo
2 20909 2 26.0 170 c1 F 67 75 placebo
3 22908 5 18.2 159 c1 F 66 46 d2
4 22909 8 30.8 156 c1 M 71 75 placebo
5 22911 3 24.8 182 c1 M 68 82 d2
6 22914 3 25.2 157 c1 F 75 62 d2
7 24912 0 25.2 162 c1 F 69 66 placebo
8 25920 1 27.9 164 c1 F 69 75 d1
9 26908 0 25.8 150 c1 F 73 58 d1
10 26910 0 19.1 165 c1 F 66 52 d1
# ℹ 11 more rows
Do we still have all three treatment groups in our subset?
%>%
seniors count(treat)
# A tibble: 3 × 2
treat n
<chr> <int>
1 d1 6
2 d2 5
3 placebo 10
The world of conditional operators
Now we get lines that fit certain conditions but what if I want to filter on more than one condition? Enter conditional operators!
The ‘and’ operator: &
We can also subset on several conditions. Here are younger patients who received drug 1:
%>%
crohns filter(age <= 65 & treat == 'd1')
# A tibble: 33 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19909 4 23.8 164 c1 F 53 64 d1
2 20910 2 28.7 168 c1 F 54 81 d1
3 21908 3 26.6 161 c1 F 53 69 d1
4 21916 0 23.9 177 c1 M 56 75 d1
5 22916 2 30.9 163 c1 F 53 82 d1
6 23908 0 30.4 158 c1 F 55 76 d1
7 23909 0 23.4 156 c1 F 44 57 d1
8 23910 0 26.7 156 c1 F 59 65 d1
9 23912 0 16 150 c1 F 28 36 d1
10 24909 0 22.5 155 c1 F 52 54 d1
# ℹ 23 more rows
The ‘or’ operator: |
Get patients that were treated with either drug 1 or the placebo:
%>%
crohns filter(treat == 'placebo' | treat == 'd1')
# A tibble: 78 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19908 4 25.2 163 c1 F 47 67 placebo
2 19909 4 23.8 164 c1 F 53 64 d1
3 19910 1 23.0 164 c1 F 68 62 placebo
4 20909 2 26.0 170 c1 F 67 75 placebo
5 20910 2 28.7 168 c1 F 54 81 d1
6 21908 3 26.6 161 c1 F 53 69 d1
7 21909 0 26.2 168 c1 F 53 74 placebo
8 21911 0 33.3 157 c1 F 58 82 placebo
9 21914 6 28.4 170 c1 M 58 82 placebo
10 21916 0 23.9 177 c1 M 56 75 d1
# ℹ 68 more rows
The ‘not’ operator: !
%>%
crohns filter(treat != 'placebo')
# A tibble: 78 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19909 4 23.8 164 c1 F 53 64 d1
2 20908 1 25.7 165 c1 F 48 70 d2
3 20910 2 28.7 168 c1 F 54 81 d1
4 21908 3 26.6 161 c1 F 53 69 d1
5 21910 1 32.0 154 c1 F 47 76 d2
6 21912 5 32.5 152 c1 F 63 75 d2
7 21913 2 37.6 159 c1 F 54 95 d2
8 21915 0 23.0 160 c1 F 54 59 d2
9 21916 0 23.9 177 c1 M 56 75 d1
10 21917 0 36.4 164 c1 F 51 98 d2
# ℹ 68 more rows
Other conditional operators can be found in the first presentation or when querying the function (or just google it).
::filter ?dplyr
You can also allow multiple arguments in a variable. Here are the young patients who got treatment with either drug 1 or 2:
%>%
crohns filter(age <= 65 & treat %in% c("d1","d2"))
# A tibble: 67 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19909 4 23.8 164 c1 F 53 64 d1
2 20908 1 25.7 165 c1 F 48 70 d2
3 20910 2 28.7 168 c1 F 54 81 d1
4 21908 3 26.6 161 c1 F 53 69 d1
5 21910 1 32.0 154 c1 F 47 76 d2
6 21912 5 32.5 152 c1 F 63 75 d2
7 21913 2 37.6 159 c1 F 54 95 d2
8 21915 0 23.0 160 c1 F 54 59 d2
9 21916 0 23.9 177 c1 M 56 75 d1
10 21917 0 36.4 164 c1 F 51 98 d2
# ℹ 57 more rows
Selecting variables (columns) with select()
We can choose to only include certain columns. Here, we select only BMI, age and the number of adverse events:
%>%
crohns select(nrAdvE, BMI, age)
# A tibble: 117 × 3
nrAdvE BMI age
<dbl> <dbl> <dbl>
1 4 25.2 47
2 4 23.8 53
3 1 23.0 68
4 1 25.7 48
5 2 26.0 67
6 2 28.7 54
7 3 26.6 53
8 0 26.2 53
9 1 32.0 47
10 0 33.3 58
# ℹ 107 more rows
We can also make a negative selection that excludes the named column(s). The ID doesn’t give us any information since the data is anonymized:
<- crohns %>%
without_id select(-ID)
We have saved the dataset without the ID column in at new variable. Let’s have a look at this:
without_id
# A tibble: 117 × 8
nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 4 25.2 163 c1 F 47 67 placebo
2 4 23.8 164 c1 F 53 64 d1
3 1 23.0 164 c1 F 68 62 placebo
4 1 25.7 165 c1 F 48 70 d2
5 2 26.0 170 c1 F 67 75 placebo
6 2 28.7 168 c1 F 54 81 d1
7 3 26.6 161 c1 F 53 69 d1
8 0 26.2 168 c1 F 53 74 placebo
9 1 32.0 154 c1 F 47 76 d2
10 0 33.3 157 c1 F 58 82 placebo
# ℹ 107 more rows
Transformation of data with mutate()
We can create new columns based on other columns with the mutate()
function.
This is our original tibble:
crohns
# A tibble: 117 × 9
ID nrAdvE BMI height country sex age weight treat
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
1 19908 4 25.2 163 c1 F 47 67 placebo
2 19909 4 23.8 164 c1 F 53 64 d1
3 19910 1 23.0 164 c1 F 68 62 placebo
4 20908 1 25.7 165 c1 F 48 70 d2
5 20909 2 26.0 170 c1 F 67 75 placebo
6 20910 2 28.7 168 c1 F 54 81 d1
7 21908 3 26.6 161 c1 F 53 69 d1
8 21909 0 26.2 168 c1 F 53 74 placebo
9 21910 1 32.0 154 c1 F 47 76 d2
10 21911 0 33.3 157 c1 F 58 82 placebo
# ℹ 107 more rows
We want to add height in meters in a new column. It is important to reassign the dataframe if you want to save the new column.
<- crohns %>%
crohns mutate(height_m = height/100)
crohns
# A tibble: 117 × 10
ID nrAdvE BMI height country sex age weight treat height_m
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 19908 4 25.2 163 c1 F 47 67 placebo 1.63
2 19909 4 23.8 164 c1 F 53 64 d1 1.64
3 19910 1 23.0 164 c1 F 68 62 placebo 1.64
4 20908 1 25.7 165 c1 F 48 70 d2 1.65
5 20909 2 26.0 170 c1 F 67 75 placebo 1.7
6 20910 2 28.7 168 c1 F 54 81 d1 1.68
7 21908 3 26.6 161 c1 F 53 69 d1 1.61
8 21909 0 26.2 168 c1 F 53 74 placebo 1.68
9 21910 1 32.0 154 c1 F 47 76 d2 1.54
10 21911 0 33.3 157 c1 F 58 82 placebo 1.57
# ℹ 107 more rows
We can also create columns based on TRUE/FALSE conditions. According to the CDC, a person with a BMI < 18.5 is underweight:
<- crohns %>%
crohns mutate(underweight = ifelse(BMI < 18.5, "Yes", "No"))
crohns
# A tibble: 117 × 11
ID nrAdvE BMI height country sex age weight treat height_m
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 19908 4 25.2 163 c1 F 47 67 placebo 1.63
2 19909 4 23.8 164 c1 F 53 64 d1 1.64
3 19910 1 23.0 164 c1 F 68 62 placebo 1.64
4 20908 1 25.7 165 c1 F 48 70 d2 1.65
5 20909 2 26.0 170 c1 F 67 75 placebo 1.7
6 20910 2 28.7 168 c1 F 54 81 d1 1.68
7 21908 3 26.6 161 c1 F 53 69 d1 1.61
8 21909 0 26.2 168 c1 F 53 74 placebo 1.68
9 21910 1 32.0 154 c1 F 47 76 d2 1.54
10 21911 0 33.3 157 c1 F 58 82 placebo 1.57
# ℹ 107 more rows
# ℹ 1 more variable: underweight <chr>
How many patients are underweight?
%>%
crohns count(underweight)
# A tibble: 2 × 2
underweight n
<chr> <int>
1 No 113
2 Yes 4
Have a look at the mutate()
function:
?mutate
Summary statistics, revisited with summarize()
Methods from before:
mean(crohns$age)
[1] 54.65812
max(crohns$age)
[1] 75
summary(crohns)
ID nrAdvE BMI height
Min. :19908 Min. : 0.000 Min. :16.00 Min. :124.0
1st Qu.:23909 1st Qu.: 0.000 1st Qu.:23.05 1st Qu.:157.0
Median :25919 Median : 1.000 Median :25.15 Median :162.0
Mean :34103 Mean : 2.034 Mean :26.06 Mean :162.7
3rd Qu.:51909 3rd Qu.: 3.000 3rd Qu.:28.40 3rd Qu.:166.0
Max. :54937 Max. :12.000 Max. :44.06 Max. :182.0
country sex age weight
Length:117 Length:117 Min. :19.00 Min. : 36.00
Class :character Class :character 1st Qu.:48.00 1st Qu.: 59.00
Mode :character Mode :character Median :56.00 Median : 68.00
Mean :54.66 Mean : 69.03
3rd Qu.:62.00 3rd Qu.: 76.00
Max. :75.00 Max. :117.00
treat height_m underweight
Length:117 Min. :1.240 Length:117
Class :character 1st Qu.:1.570 Class :character
Mode :character Median :1.620 Mode :character
Mean :1.627
3rd Qu.:1.660
Max. :1.820
The summarize()
function does the same but in a tidyverse way and gives the result in a table which you can export and send to your colleagues.
%>%
crohns summarize(mean(age),
max(age))
# A tibble: 1 × 2
`mean(age)` `max(age)`
<dbl> <dbl>
1 54.7 75
We can also specify names for the new columns:
%>%
crohns summarize(mean_age = mean(age),
max_age = max(age))
# A tibble: 1 × 2
mean_age max_age
<dbl> <dbl>
1 54.7 75
What kind of things can you summarize? Have a look at the help by typing ?summarize
into the console, or ‘summarize’ into the help panel and scroll down to ‘Useful functions’.
A useful summarize function is n()
which counts the number of lines.
%>%
crohns summarize(mean_age = mean(age),
max_age = max(age),
number_lines = n())
# A tibble: 1 × 3
mean_age max_age number_lines
<dbl> <dbl> <int>
1 54.7 75 117
Note that R is tolerant of BE/AE spelling differences. summarise()
and summarize()
are the same function, likewise with color
and colour
.
Grouping with group_by()
The function group_by()
imposes a grouping on a tibble. Group according to sex:
%>%
crohns group_by(sex)
# A tibble: 117 × 11
# Groups: sex [2]
ID nrAdvE BMI height country sex age weight treat height_m
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 19908 4 25.2 163 c1 F 47 67 placebo 1.63
2 19909 4 23.8 164 c1 F 53 64 d1 1.64
3 19910 1 23.0 164 c1 F 68 62 placebo 1.64
4 20908 1 25.7 165 c1 F 48 70 d2 1.65
5 20909 2 26.0 170 c1 F 67 75 placebo 1.7
6 20910 2 28.7 168 c1 F 54 81 d1 1.68
7 21908 3 26.6 161 c1 F 53 69 d1 1.61
8 21909 0 26.2 168 c1 F 53 74 placebo 1.68
9 21910 1 32.0 154 c1 F 47 76 d2 1.54
10 21911 0 33.3 157 c1 F 58 82 placebo 1.57
# ℹ 107 more rows
# ℹ 1 more variable: underweight <chr>
We can also group according to several variables. How many groups will we get?
%>%
crohns group_by(sex, treat)
# A tibble: 117 × 11
# Groups: sex, treat [6]
ID nrAdvE BMI height country sex age weight treat height_m
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 19908 4 25.2 163 c1 F 47 67 placebo 1.63
2 19909 4 23.8 164 c1 F 53 64 d1 1.64
3 19910 1 23.0 164 c1 F 68 62 placebo 1.64
4 20908 1 25.7 165 c1 F 48 70 d2 1.65
5 20909 2 26.0 170 c1 F 67 75 placebo 1.7
6 20910 2 28.7 168 c1 F 54 81 d1 1.68
7 21908 3 26.6 161 c1 F 53 69 d1 1.61
8 21909 0 26.2 168 c1 F 53 74 placebo 1.68
9 21910 1 32.0 154 c1 F 47 76 d2 1.54
10 21911 0 33.3 157 c1 F 58 82 placebo 1.57
# ℹ 107 more rows
# ℹ 1 more variable: underweight <chr>
By itself, group_by does nothing, we still get the same dataset returned. But it is very useful in combination with other commands! The reason we want to do it this is way is that we can first impose grouping with group_by()
and then pipe, %>%
, the resulting tibble into summarize which will respect our grouping. So smart!
%>% # the dataset
crohns group_by(sex) %>% # grouped by sex
summarise(avg = mean(age), # calculate mean of the age
med = median(age), # calc median
stdev = sd(age), # calc standard dev.
n = n()) # get the number of observations
# A tibble: 2 × 5
sex avg med stdev n
<chr> <dbl> <dbl> <dbl> <int>
1 F 54.7 55 10.8 100
2 M 54.3 56 10.4 17
Now we see why n()
is useful: It tells us how many lines, i.e. patients are in each group.
Group by sex and treatment, and calculate stats for the number of adverse events.
%>% # the dataset
crohns group_by(sex, treat) %>% # grouped by sex
summarise(avg = mean(nrAdvE), # calculate mean number of adverse events
med = median(nrAdvE), # calc median
max = max(nrAdvE), # calc max
stdev = sd(nrAdvE), # calc standard dev.
total_events = sum(nrAdvE), # calc cumulative sum
n = n()) # get the number of observations
# A tibble: 6 × 8
# Groups: sex [2]
sex treat avg med max stdev total_events n
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 F d1 1.5 0 7 2.08 51 34
2 F d2 2.12 1 9 2.71 72 34
3 F placebo 2.16 1 12 3.09 69 32
4 M d1 2 0 9 3.94 10 5
5 M d2 2.2 2 6 2.49 11 5
6 M placebo 3.57 3 8 3.41 25 7
The might of the pipe operator: %>%
Many commands can be combined with the pipe operator to pipe data through an analysis workflow.
%>% # the dataset
crohns filter(age > 65) %>% # filtered to only people over 65
group_by(sex, treat) %>% # Grouping
summarise(avg = mean(nrAdvE), # calculate mean number of adverse events
med = median(nrAdvE), # calc median
max = max(nrAdvE), # calc max
stdev = sd(nrAdvE), # calc standard dev.
total_events = sum(nrAdvE), # calc cumulative sum
n = n()) %>% # get the number of observations
arrange(avg) # Sort output by the mean
# A tibble: 5 × 8
# Groups: sex [2]
sex treat avg med max stdev total_events n
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 F d1 1.83 0.5 6 2.56 11 6
2 F placebo 2.22 1 12 3.73 20 9
3 F d2 3 3 5 2 9 3
4 M d2 4.5 4.5 6 2.12 9 2
5 M placebo 8 8 8 NA 8 1
What if I want to do the same analysis but with only obese patients? The CDC lists a BMI of > 30 as obese.