library(SummarizedExperiment)
library(airway)
library(tidyverse)
#load the airway dataset
data(airway, package = 'airway')
#You can read about the dataset by doing:
#?airway
R Objects: S4 Objects
S4 objects
Bioconductor packages often use S4 objects to store their data. Some examples:
(Ranged)SummarizedExperiment
is an object that holds rectangular matrices of high-throughput omics data (RNA-Seq, ChIP-Seq, ATAC-Seq)pyholoseq
is an objects that stores phylogenetic data such as OTU tables, taxonomy ect for e.g. microbiome and general metagenomicsDESeqDataSet
is an object that stores counts, transformations and metadata for RNAseq analysis. It is subclass of(Ranged)SummarizedExperiment
Unlike the S3 objects (named lists with a class) we looked at before, S4 objects have clearly defined elements called ‘slots’.
A slot must have:
- a name
- a defined data type
- it may have validation rules
They are usually accessed by specialized functions. Let’s have a look at a SummarizedExperiment
objects.
An example: The airway dataset
typeof(airway)
[1] "S4"
class(airway)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
airway
is a RangedSummarizedExperiment object. We can get a good overview of the content of Bioconductor S4 objects by just accessing them:
airway
class: RangedSummarizedExperiment
dim: 63677 8
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
There is only one assay and it is count data. We might have several assays in the same RangedSummarizedExperiment object, i.e. one for raw counts and one for normalized counts or rlog transformed counts.
Our rows are genes with Ensembl Gene Identifiers (ENSG00000000003, ect) and our columns are samples. There have been 64102 genes measured (look at the dimensions).
Accessing slots
We generally access slots by functions that have the same name as the slot. The (raw) counts for example are in the assay slot:
<- assay(airway, 'counts')
count_matrix head(count_matrix)
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003 679 448 873 408 1138
ENSG00000000005 0 0 0 0 0
ENSG00000000419 467 515 621 365 587
ENSG00000000457 260 211 263 164 245
ENSG00000000460 60 55 40 35 78
ENSG00000000938 0 0 2 0 1
SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 1047 770 572
ENSG00000000005 0 0 0
ENSG00000000419 799 417 508
ENSG00000000457 331 233 229
ENSG00000000460 63 76 60
ENSG00000000938 0 0 0
class(count_matrix)
[1] "matrix" "array"
The count data is in an array/matrix where each row is a gene and each column is a sample and the gene names are in the row names. They don’t have their own column because unlike dataframes matrices can only have numeric values.
We haven’t explicitly worked with this data type before, but many functions in R can be applied to different types of objects so let’s try getting the row sums:
rowSums(count_matrix) %>% head()
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
5935 0 4279 1936 467
ENSG00000000938
3
We can see that gene ENSG00000000005 does not have counts in any of our sample (the row sum is 0). How many of such genes are there?
sum(rowSums(count_matrix) == 0)
[1] 30208
About half of the genes in the catalog have not been seen.
Remember these are raw counts so we’re not going to do too much with them. If we are going to compare counts across samples or genes we will have to normalize them first. This is a topic for the RNAseq course! (insert shameless self promotion).
Metadata
The metadata slot contains the experimental data underlying the counts:
metadata(airway)
[[1]]
Experiment data
Experimenter name: Himes BE
Laboratory: NA
Contact information:
Title: RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.
URL: http://www.ncbi.nlm.nih.gov/pubmed/24926665
PMIDs: 24926665
Abstract: A 226 word abstract is available. Use 'abstract' method.
What we would traditionally describe as metadata in for example an RNAseq experiment is in the colData
slot instead:
colData(airway)
DataFrame with 8 rows and 9 columns
SampleName cell dex albut Run avgLength
<factor> <factor> <factor> <factor> <factor> <integer>
SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
Experiment Sample BioSample
<factor> <factor> <factor>
SRR1039508 SRX384345 SRS508568 SAMN02422669
SRR1039509 SRX384346 SRS508567 SAMN02422675
SRR1039512 SRX384349 SRS508571 SAMN02422678
SRR1039513 SRX384350 SRS508572 SAMN02422670
SRR1039516 SRX384353 SRS508575 SAMN02422682
SRR1039517 SRX384354 SRS508576 SAMN02422673
SRR1039520 SRX384357 SRS508579 SAMN02422683
SRR1039521 SRX384358 SRS508580 SAMN02422677
This dataframe tells us for each sample whether it was treated with dexamethasone (dex
) or albuterol (albut
), as well the Run ID, Experiment II and some other identifiers.
A cool thing about (Ranged)SummarizedExperiment
objects is that the $
syntax directly indexes the colData
. For example we can get a vector of sample names or whether the samples were treated with dexamethasone like so:
#the $ syntax directly indexes the metadata, i.e.
$SampleName airway
[1] GSM1275862 GSM1275863 GSM1275866 GSM1275867 GSM1275870 GSM1275871 GSM1275874
[8] GSM1275875
8 Levels: GSM1275862 GSM1275863 GSM1275866 GSM1275867 ... GSM1275875
$dex airway
[1] untrt trt untrt trt untrt trt untrt trt
Levels: trt untrt
Subsetting (Ranged)SummarizedExperiment
objects
Another neat thing about (Ranged)SummarizedExperiment
objects is that they can be subset like matrices.
Let’s have brief example of a matrix:
<- matrix(1:12, nrow = 3, ncol = 4)
mat
# Assign row and column names
rownames(mat) <- c("Gene1", "Gene2", "Gene3")
colnames(mat) <- c("Sample1", "Sample2", "Sample3", "Sample4")
mat
Sample1 Sample2 Sample3 Sample4
Gene1 1 4 7 10
Gene2 2 5 8 11
Gene3 3 6 9 12
Matrices are subset with this syntax:
mat[rows,columns]
Lets say we only want to look at Genes 1 and 2 we can indicate that by:
- rownames
- rowindices
- a boolean vector
#rownames
c("Gene1", "Gene2"), ] mat[
Sample1 Sample2 Sample3 Sample4
Gene1 1 4 7 10
Gene2 2 5 8 11
#rowindices
c(1,3),] mat[
Sample1 Sample2 Sample3 Sample4
Gene1 1 4 7 10
Gene3 3 6 9 12
#boolean vector
<- c(T,F,T)
keep mat[keep,]
Sample1 Sample2 Sample3 Sample4
Gene1 1 4 7 10
Gene3 3 6 9 12
And similarly for columns in the second field:
# Select Sample1 & Sample3 (all rows)
c("Sample1", "Sample3")] mat[,
Sample1 Sample3
Gene1 1 7
Gene2 2 8
Gene3 3 9
We can use that syntax to subset the entire (Ranged)SummarizedExperiment
object (not just the count matrix in the assay
slot)!
For example we can make a new (Ranged)SummarizedExperiment
object with only the samples treated with dexamethasone.
#create a boolean vector of which samples to select
$dex airway
[1] untrt trt untrt trt untrt trt untrt trt
Levels: trt untrt
<- airway$dex == 'trt'
keep keep
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
#use the vector to subset the object as if were a matrix!
<- airway[, airway$dex == "trt"]
dex_treated_samples class(dex_treated_samples)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
dex_treated_samples
class: RangedSummarizedExperiment
dim: 63677 4
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
That’s all for now!