library(tidyverse)
library(ggforce) # extension of ggplot2
# unload MASS since it has select function we don't want to use.
Presentation 1: Data Cleaning
In this section, we’ll focus on cleaning a dataset. We’ll primarily use functions from the tidyverse
package, but in some cases base R offers quick and effective alternatives.
Load Packages
Load data
We’ll work with two datasets:
df_ov
: clinical information for ovarian cancer patients.df_exp
: gene expression data for collagen genes.
We use the load()
function, which automatically imports an R object - let’s have a look at it.
# .RData can contain several objects
load("../data/Ovarian_data_Col.RData")
class(df_ov)
[1] "tbl_df" "tbl" "data.frame"
dim(df_ov)
[1] 578 23
%>% head() df_ov
# A tibble: 6 × 23
alt_sample_name unique_patient_ID sample_type histological_type primarysite
<chr> <chr> <chr> <chr> <chr>
1 TCGA-20-0987-01A-… TCGA-20-0987 tumor ser ov
2 TCGA-23-1031-01A-… TCGA-23-1031 unknown ser ov
3 TCGA-24-0979-01A-… TCGA-24-0979 tumor ser ov
4 TCGA-23-1117-01A-… TCGA-23-1117 unknown ser ov
5 TCGA-23-1021-01B-… TCGA-23-1021 unknown ser ov
6 TCGA-04-1337-01A-… TCGA-04-1337 tumor ser ov
# ℹ 18 more variables: arrayedsite <lgl>, summarygrade <chr>, tumorstage <int>,
# substage <chr>, grade <int>, age_at_initial_path_diagn <int>,
# days_to_tumor_recurrence <int>, recurrence_status <chr>,
# days_to_death <int>, vital_status <chr>, os_binary <lgl>,
# relapse_binary <lgl>, site_of_tumor_first_recurrence <chr>,
# primary_therapy_outcome_success <chr>, percent_normal_cells <int>,
# percent_stromal_cells <int>, percent_tumor_cells <int>, batch <dbl>
class(df_exp)
[1] "tbl_df" "tbl" "data.frame"
dim(df_exp)
[1] 578 35
%>% head() df_exp
# A tibble: 6 × 35
unique_patient_ID COL10A1 COL11A1 COL11A2 COL13A1 COL14A1 COL15A1 COL16A1
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TCGA-20-0987 3.66 3.20 3.65 4.07 4.05 4.98 3.57
2 TCGA-23-1031 3.49 6.82 3.83 5.32 5.11 4.32 4.57
3 TCGA-24-0979 3.43 3.43 3.59 3.92 5.86 5.89 4.82
4 TCGA-23-1117 7.17 10.7 3.48 7.18 7.31 7.07 7.48
5 TCGA-23-1021 6.41 4.06 3.63 8.60 5.42 4.27 5.66
6 TCGA-04-1337 7.63 10.6 3.58 7.78 5.62 7.97 7.53
# ℹ 27 more variables: COL17A1 <dbl>, COL18A1 <dbl>, COL19A1 <dbl>,
# COL1A1 <dbl>, COL1A2 <dbl>, COL21A1 <dbl>, COL2A1 <dbl>, COL3A1 <dbl>,
# COL4A1 <dbl>, COL4A2 <dbl>, COL4A3 <dbl>, COL4A3BP <dbl>, COL4A4 <dbl>,
# COL4A5 <dbl>, COL4A6 <dbl>, COL5A1 <dbl>, COL5A2 <dbl>, COL5A3 <dbl>,
# COL6A1 <dbl>, COL6A2 <dbl>, COL6A3 <dbl>, COL7A1 <dbl>, COL8A1 <dbl>,
# COL8A2 <dbl>, COL9A1 <dbl>, COL9A2 <dbl>, COL9A3 <dbl>
Basics
We can get a little reminder of base R and tidyverse functions and explore some basic data characteristics.
Take a look at categorical and numerical variables, access elements in list.
# output: vector
$vital_status %>%
df_ovhead()
[1] "deceased" "deceased" "deceased" "deceased" "deceased" "deceased"
# output: one-column-tibble
%>%
df_ov select(vital_status) %>%
head()
# A tibble: 6 × 1
vital_status
<chr>
1 deceased
2 deceased
3 deceased
4 deceased
5 deceased
6 deceased
Get a vector the tidyverse way:
%>%
df_ov select(vital_status) %>%
pull() %>%
head()
[1] "deceased" "deceased" "deceased" "deceased" "deceased" "deceased"
Check for Missing Values
Check which columns contain NA values and how many:
%>%
df_ov is.na() %>%
colSums() %>%
barplot(las=2, cex.names=0.6) # baseR barplot since we are plotting a vector
Remove rows with missing data (careful - this can reduce sample size too much):
%>%
df_ov drop_na()
# A tibble: 0 × 23
# ℹ 23 variables: alt_sample_name <chr>, unique_patient_ID <chr>,
# sample_type <chr>, histological_type <chr>, primarysite <chr>,
# arrayedsite <lgl>, summarygrade <chr>, tumorstage <int>, substage <chr>,
# grade <int>, age_at_initial_path_diagn <int>,
# days_to_tumor_recurrence <int>, recurrence_status <chr>,
# days_to_death <int>, vital_status <chr>, os_binary <lgl>,
# relapse_binary <lgl>, site_of_tumor_first_recurrence <chr>, …
Data wrangling and cleaning
Let’s do some cleanup
Drop Non-Informative Variables
Drop Variables with Too Many Missing Values
Standardize Categorical Data
Fix Variable Types
Recode and Reorder Factors
Create New Variables from existing ones
Step 1: Drop Non-Informative Variables
Let’s check a few categorical variables and remove any that are redundant or irrelevant. If they don’t provide useful or well-curated information, we can drop them:
$histological_type %>% table() df_ov
.
ser
568
<- df_ov %>%
df_ov select(!histological_type)
Step 2: Drop Variables with Too Many Missing Values
If a variable has more than ~20% missing data, it probably won’t be useful. Let’s remove those:
~ mean(is.na(.)) < threshold
explained: ~
= “Define a function” .
is a placeholder for the input (in this case, a column of the data frame) mean(is.na(.))
calculates the proportion of missing values < threshold
returns TRUE or FALSE — which where()
uses to decide if a column should be kept
<- 0.2 # 20% threshold
threshold
<- df_ov %>%
df_ov select(where(~mean(is.na(.)) < threshold))
dim(df_ov)
[1] 578 17
This keeps only the variables that are mostly complete.
Step 3: Standardize Categorical Variables
Counting the number of occurrences for each level of a categorical variable.
Here we count how many dead or alive individuals we have. This is useful for checking balance across groups or categories.
%>%
df_ov select(vital_status) %>%
table(useNA = 'ifany') #%>% barplot()
vital_status
deceased deceased living living NA <NA>
280 10 261 9 1 17
Looks like we have two groups of deceased and living. Let’s have a closer look at the unique values of this variable.
$vital_status %>%
df_ovunique()
[1] "deceased" "living" "deceased " "living " NA "NA "
Whitespace includes spaces, newlines, and other blank characters in text. It can cause errors or inconsistencies in data, so removing unnecessary whitespace is an important step in cleaning data. Sometimes, hidden characters or spaces cause misgrouped levels.
Use the str_trim()
function to remove whitespace.
str_trim(" Hello World! ")
[1] "Hello World!"
Trim whitespace and standardize:
Like other function, the str_trim()
function can also be used inside the mutate()
function to alter the dataframe.
<- df_ov %>%
df_ov mutate(vital_status = str_trim(vital_status))
Accessing the unique values of the vital_status
column.
$vital_status %>%
df_ovunique()
[1] "deceased" "living" NA "NA"
Change “NA” to actual NA.
<- df_ov %>%
df_ov mutate(vital_status = na_if(vital_status, "NA"))
After clean up we only have one group of deceased, living, and NA - perfect!
$vital_status %>%
df_ovunique()
[1] "deceased" "living" NA
Step 4: Fix Variable Types
Some variables are stored as character or numeric when they should actually be factors (i.e., categorical variables). Convert character/numeric columns to factors:
# One by one...
$sample_type <- as.factor(df_ov$sample_type)
df_ov$primarysite <- as.factor(df_ov$primarysite)
df_ov
# In one go:
<- c('sample_type', 'primarysite', 'summarygrade',
cols_to_factor 'recurrence_status', 'vital_status',
'tumorstage', 'grade', 'batch')
<- df_ov %>%
df_ov mutate(across(.cols = cols_to_factor,
.fns = as.factor))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(.cols = cols_to_factor, .fns = as.factor)`.
Caused by warning:
! Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(cols_to_factor)
# Now:
data %>% select(all_of(cols_to_factor))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
str(df_ov)
tibble [578 × 17] (S3: tbl_df/tbl/data.frame)
$ alt_sample_name : chr [1:578] "TCGA-20-0987-01A-02R-0434-01" "TCGA-23-1031-01A-01R-0434-01" "TCGA-24-0979-01A-01R-0434-01" "TCGA-23-1117-01A-02R-0434-01" ...
$ unique_patient_ID : chr [1:578] "TCGA-20-0987" "TCGA-23-1031" "TCGA-24-0979" "TCGA-23-1117" ...
$ sample_type : Factor w/ 3 levels "healthy","tumor",..: 2 3 2 3 3 2 2 2 2 2 ...
$ primarysite : Factor w/ 2 levels "other","ov": 2 2 2 2 2 2 2 2 2 2 ...
$ summarygrade : Factor w/ 4 levels "HIGH","high",..: 2 1 2 2 2 4 2 2 2 2 ...
$ tumorstage : Factor w/ 4 levels "1","2","3","4": 3 4 4 3 4 3 3 4 3 3 ...
$ substage : chr [1:578] "c" NA NA "c" ...
$ grade : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 2 3 3 3 3 ...
$ age_at_initial_path_diagn: int [1:578] 61 60 53 42 45 78 74 73 45 45 ...
$ days_to_tumor_recurrence : int [1:578] 442 574 428 418 1446 61 870 68 2616 797 ...
$ recurrence_status : Factor w/ 2 levels "norecurrence",..: 2 1 2 2 1 1 2 1 1 2 ...
$ days_to_death : int [1:578] 701 574 1264 1013 1446 61 789 84 2616 816 ...
$ vital_status : Factor w/ 2 levels "deceased","living": 1 1 1 1 1 1 2 1 2 2 ...
$ percent_normal_cells : int [1:578] 16 8 0 0 0 0 12 15 NA 0 ...
$ percent_stromal_cells : int [1:578] 1 5 25 NA 0 36 2 5 NA NA ...
$ percent_tumor_cells : int [1:578] 72 80 72 NA NA 64 84 72 NA NA ...
$ batch : Factor w/ 7 levels "11","13","15",..: 2 2 2 2 2 2 2 2 2 2 ...
Step 5: Re-code and Reorder Factors
Sometimes we want to set the order of factor levels manually, especially if there’s a natural or meaningful order (e.g., “low” → “high”): This helps when plotting or interpreting regression coefficients.
# Recode factor
$summarygrade %>% table() df_ov
.
HIGH high LOW low
24 456 4 71
<- df_ov %>%
df_ov mutate(summarygrade = fct_recode(summarygrade, "high" = "HIGH"),
summarygrade = fct_recode(summarygrade, "low" = "LOW"))
$summarygrade %>% table() df_ov
.
high low
480 75
# Reorder levels
levels(df_ov$summarygrade) # high, low
[1] "high" "low"
$summarygrade <- factor(df_ov$summarygrade, levels = c('low', 'high'))
df_ov
levels(df_ov$summarygrade) # low, high
[1] "low" "high"
Step 6: Creating New Variables
Now, we would like to create a new data column by extracting specific information or combining other columns.
Add a column combining other columns into one:
<- df_ov %>%
df_ov mutate(percent_not_cancer_cells = percent_stromal_cells + percent_normal_cells)
For character vectors you can use functions paste
and paste0
. The paste
function concatenates two strings into one. This one is from baseR.
paste('TCGA', 'Data', sep = ' ')
[1] "TCGA Data"
<- df_ov %>%
df_ov mutate(stage = paste(tumorstage, substage, sep = '-'))
unique(df_ov$stage)
[1] "3-c" "4-NA" "2-b" "NA-NA" "1-b" "3-b" "2-c" "1-c"
Create logical variables with ifelse
. Adding a column with content based on a condition with ifelse
.
<- df_ov %>%
df_ov mutate(dominant_cell_type = ifelse(percent_not_cancer_cells > percent_tumor_cells, "not_cancer_cells", "cancer_cells"))
table(df_ov$dominant_cell_type)
cancer_cells not_cancer_cells
536 9
Extension of ifelse
: case_when
<- df_ov %>%
df_ov mutate(group = case_when(age_at_initial_path_diagn < 60 & summarygrade == "low" ~ "Low grade - Young",
< 60 & summarygrade == "high" ~ "High grade - Young",
age_at_initial_path_diagn >= 60 & summarygrade == "low" ~ "Low grade - Mature",
age_at_initial_path_diagn >= 60 & summarygrade == "high" ~ "High grade - Mature",
age_at_initial_path_diagn .default = NA))
table(df_ov$group)
High grade - Mature High grade - Young Low grade - Mature Low grade - Young
233 247 31 44
String manipulation
Sometimes the information you need is hidden inside longer text strings or encoded in specific patterns — especially in genomics or clinical datasets. In this example we notice that sample type information is unknown for a lot of patients. Since we are TCGA experts, we know that this information is actually encoded in the full sample name TCGA-20-0987-01A-02R-0434-01
. But How do we get it out.
To extract and work with this kind of information, we can use what is called a Regular Expression (Regex) and string manipulation tools.
Regex: Regular expression
Regular expressions (regex) are used to search for patterns in text strings. Regular expressions are super flexible and very powerful, BUT they can be slightly confusing at times…
Instead of looking for an exact word like cat, a regex lets you search for things like:
“Any word that starts with ‘c’ and ends with ‘t’”
“Any sequence of digits”
“A word that may or may not have a certain letter”
“A word containing a specific delimiter, e.g. a dash, as comma, etc.”
It might look a little weird (lots of slashes, dots, and symbols), but once you learn a few basics, it’s incredibly useful! And the good thing is that ChatCPT is very good at making regular expressions for you!
We will do string manipulation the tidyverse way: using the stringr
package where all the functions starts with str_
. These functions often has a parameter called pattern
which can take a regular expression as an argument. Let’s start with some simple examples.
Examples
$alt_sample_name %>% head(n = 1) df_ov
[1] "TCGA-20-0987-01A-02R-0434-01"
Let’s say we want to extract part of the sample name. We can use the str_split_i
function to split a string on a given character and access the i’th element.
str_split_i('TCGA-20-0987-01A-02R-0434-01', pattern = '-', i = 2)
[1] "20"
This grabs the 4th segment in the name string, which might contain the sample code We can apply this to a whole column using mutate()
and create a new variable with the extracted values.
<- df_ov %>%
df_ov mutate(sample_code = str_split_i(alt_sample_name, pattern = '-', i = 4))
table(df_ov$sample_code)
01A 01B 01C 01D 11A
539 27 2 1 8
We know that 01 encodes primary tumor while 11 encodes healthy tissue. We can use str_detect()
to see if a sub-string contains the 01 or 11.
str_detect(df_ov$sample_code, '01') %>% table(useNA = 'ifany')
.
FALSE TRUE <NA>
8 569 1
str_detect(df_ov$sample_code, '11') %>% table(useNA = 'ifany')
.
FALSE TRUE <NA>
569 8 1
Add a column that records TRUE if sample is a primary tumor and place it after the Sample_type
column.
'01.{0,1}'
explained:
01
— matches the characters ‘0’ followed immediately by ‘1’.
.
— matches any single character (A, B, C, X, !, except a newline by default).
{0,1}
— is a quantifier, meaning: match the preceding element (the dot .
) zero or one time.
<- df_ov %>%
df_ov mutate(sample_code_manual = str_detect(sample_code, '01.{0,1}'),
.after = sample_type)
%>% select(sample_code, sample_code_manual) %>% slice(100:110) df_ov
# A tibble: 11 × 2
sample_code sample_code_manual
<chr> <lgl>
1 01A TRUE
2 01A TRUE
3 01A TRUE
4 01A TRUE
5 01A TRUE
6 11A FALSE
7 11A FALSE
8 01A TRUE
9 01A TRUE
10 01A TRUE
11 11A FALSE
We can also replace text patterns:
str_replace_all(string = 'TCGA', pattern = 'A', replacement = 'aaaaaaaaaa')
[1] "TCGaaaaaaaaaa"
<- df_ov %>%
df_ov mutate(sample_recode = str_replace_all(sample_code, '01.', 'Tumor'),
sample_recode = str_replace_all(sample_recode, '11A', 'Healthy'),
.after = sample_type)
%>% select(alt_sample_name, sample_code, sample_code_manual, sample_recode) %>% tail() df_ov
# A tibble: 6 × 4
alt_sample_name sample_code sample_code_manual sample_recode
<chr> <chr> <lgl> <chr>
1 TCGA-24-1852-01A-01R-0808-01 01A TRUE Tumor
2 TCGA-29-1692-01B-01R-0808-01 01B TRUE Tumor
3 TCGA-13-1817-01A-01R-0808-01 01A TRUE Tumor
4 TCGA-61-1916-01A-01R-0808-01 01A TRUE Tumor
5 TCGA-29-1704-01B-01R-0808-01 01B TRUE Tumor
6 TCGA-13-1819-01A-01R-0808-01 01A TRUE Tumor
<- df_ov %>%
df_ov select(!c(sample_code_manual, sample_code, alt_sample_name, sample_type))
%>% dim() df_ov
[1] 578 20
Nicely done!
At this point, we have:
Dropped Non-Informative Variables
Dropped Variables with Too Many Missing Values
Standardized Categorical Data
Fixed Variable Types
Recoded and Reordered Factors
Created New Variables from existing ones
Joining Dataframes
Often we have information stored in more than one table, and we want to merge these data together into a single dataset. For example, here we would like to join our df_ov
and df_exp
data table by IDs.
Let’s make a subset of each of the datasets to have a better overview of what is going on.
<- df_exp %>% arrange(unique_patient_ID) %>% slice(1:5) %>% select(1:5)
df_exp_subset <- df_ov %>% arrange(unique_patient_ID) %>% slice(3:7) %>% select(1:3, vital_status)
df_ov_subset
$unique_patient_ID %in% df_ov_subset$unique_patient_ID %>% table() df_exp_subset
.
FALSE TRUE
2 3
A quick recap of join types from dplyr
:
full_join()
: all rows from bothinner_join()
: only matched rowsleft_join()
: all from left, matched from rightright_join()
: all from right, matched from left
Full Join
A full join keeps everything — all rows from both df_ov_subset
and df_exp_subset
. If there’s no match, missing values (NA
) are filled in.
%>%
df_ov_subset full_join(df_exp_subset, by = "unique_patient_ID")
# A tibble: 7 × 8
unique_patient_ID sample_recode primarysite vital_status COL10A1 COL11A1
<chr> <chr> <fct> <fct> <dbl> <dbl>
1 TCGA-01-0631 Healthy <NA> <NA> 3.64 4.90
2 TCGA-01-0633 Healthy <NA> <NA> 3.50 2.97
3 TCGA-01-0636 Healthy <NA> <NA> 3.28 2.93
4 TCGA-01-0637 Healthy <NA> <NA> NA NA
5 TCGA-01-0639 Healthy <NA> <NA> NA NA
6 TCGA-01-0628 <NA> <NA> <NA> 2.99 3.00
7 TCGA-01-0630 <NA> <NA> <NA> 3.63 2.76
# ℹ 2 more variables: COL11A2 <dbl>, COL13A1 <dbl>
# N row:
# union(df_exp_subset$unique_patient_ID, df_ov_subset$unique_patient_ID) %>% length()
Inner Join
An inner join keeps only the rows that appear in both data frames. So if an ID exists in one but not the other, it’s dropped.
%>%
df_ov_subset inner_join(df_exp_subset, by = "unique_patient_ID")
# A tibble: 3 × 8
unique_patient_ID sample_recode primarysite vital_status COL10A1 COL11A1
<chr> <chr> <fct> <fct> <dbl> <dbl>
1 TCGA-01-0631 Healthy <NA> <NA> 3.64 4.90
2 TCGA-01-0633 Healthy <NA> <NA> 3.50 2.97
3 TCGA-01-0636 Healthy <NA> <NA> 3.28 2.93
# ℹ 2 more variables: COL11A2 <dbl>, COL13A1 <dbl>
# N row:
# intersect(df_exp_subset$unique_patient_ID, df_ov_subset$unique_patient_ID) %>% length()
Left Join
A left join keeps all rows from df_ov_subset
, and matches info from df_exp_subset
wherever possible. Unmatched rows from df_ov_subset
get NA
for the new columns.
%>%
df_ov_subset left_join(df_exp_subset, by = "unique_patient_ID")
# A tibble: 5 × 8
unique_patient_ID sample_recode primarysite vital_status COL10A1 COL11A1
<chr> <chr> <fct> <fct> <dbl> <dbl>
1 TCGA-01-0631 Healthy <NA> <NA> 3.64 4.90
2 TCGA-01-0633 Healthy <NA> <NA> 3.50 2.97
3 TCGA-01-0636 Healthy <NA> <NA> 3.28 2.93
4 TCGA-01-0637 Healthy <NA> <NA> NA NA
5 TCGA-01-0639 Healthy <NA> <NA> NA NA
# ℹ 2 more variables: COL11A2 <dbl>, COL13A1 <dbl>
# N row:
# df_ov_subset$unique_patient_ID %>% length()
Right Join
A right join is the opposite: it keeps all rows from df_ov_subset
and adds matching data from df_exp_subset
wherever it can.
%>%
df_ov_subset right_join(df_exp_subset, by = "unique_patient_ID")
# A tibble: 5 × 8
unique_patient_ID sample_recode primarysite vital_status COL10A1 COL11A1
<chr> <chr> <fct> <fct> <dbl> <dbl>
1 TCGA-01-0631 Healthy <NA> <NA> 3.64 4.90
2 TCGA-01-0633 Healthy <NA> <NA> 3.50 2.97
3 TCGA-01-0636 Healthy <NA> <NA> 3.28 2.93
4 TCGA-01-0628 <NA> <NA> <NA> 2.99 3.00
5 TCGA-01-0630 <NA> <NA> <NA> 3.63 2.76
# ℹ 2 more variables: COL11A2 <dbl>, COL13A1 <dbl>
# N row:
# df_exp_subset$unique_patient_ID %>% length()
Let’s join our datasets
We will join the two full datasets.
<- df_ov %>%
df_comb left_join(df_exp, by = "unique_patient_ID")
Recap
By now, we have:
Loaded and explored clinical and expression datasets
Cleaned whitespace, fixed types, dropped unnecessary variables
Created new columns through logic and string operations
Merged the clinical and expression data for downstream analysis
Your data is now prepped and ready for modeling, visualization, or more advanced statistical analysis.
Your dataset df_comb
is now ready for summary statistics and exploratory data analysis (EDA).
# save(df_comb, file="../data/Ovarian_comb_clean.RData")