R Objects: S4 Objects

S4 objects

Bioconductor packages often use S4 objects to store their data. Some examples:

  • (Ranged)SummarizedExperiment is an object that holds rectangular matrices of high-throughput omics data (RNA-Seq, ChIP-Seq, ATAC-Seq)

  • pyholoseq is an objects that stores phylogenetic data such as OTU tables, taxonomy ect for e.g. microbiome and general metagenomics

  • DESeqDataSet is an object that stores counts, transformations and metadata for RNAseq analysis. It is subclass of (Ranged)SummarizedExperiment

Unlike the S3 objects (named lists with a class) we looked at before, S4 objects have clearly defined elements called ‘slots’.

A slot must have:

  • a name
  • a defined data type
  • it may have validation rules

They are usually accessed by specialized functions. Let’s have a look at a SummarizedExperiment objects.

An example: The airway dataset

library(SummarizedExperiment)
library(airway)
library(tidyverse)

#load the airway dataset
data(airway, package = 'airway')
#You can read about the dataset by doing:
#?airway
typeof(airway)
[1] "S4"
class(airway)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"

airway is a RangedSummarizedExperiment object. We can get a good overview of the content of Bioconductor S4 objects by just accessing them:

airway
class: RangedSummarizedExperiment 
dim: 63677 8 
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
  ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

There is only one assay and it is count data. We might have several assays in the same RangedSummarizedExperiment object, i.e. one for raw counts and one for normalized counts or rlog transformed counts.

Our rows are genes with Ensembl Gene Identifiers (ENSG00000000003, ect) and our columns are samples. There have been 64102 genes measured (look at the dimensions).

Accessing slots

We generally access slots by functions that have the same name as the slot. The (raw) counts for example are in the assay slot:

count_matrix <- assay(airway, 'counts')
head(count_matrix)
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003        679        448        873        408       1138
ENSG00000000005          0          0          0          0          0
ENSG00000000419        467        515        621        365        587
ENSG00000000457        260        211        263        164        245
ENSG00000000460         60         55         40         35         78
ENSG00000000938          0          0          2          0          1
                SRR1039517 SRR1039520 SRR1039521
ENSG00000000003       1047        770        572
ENSG00000000005          0          0          0
ENSG00000000419        799        417        508
ENSG00000000457        331        233        229
ENSG00000000460         63         76         60
ENSG00000000938          0          0          0
class(count_matrix)
[1] "matrix" "array" 

The count data is in an array/matrix where each row is a gene and each column is a sample and the gene names are in the row names. They don’t have their own column because unlike dataframes matrices can only have numeric values.

We haven’t explicitly worked with this data type before, but many functions in R can be applied to different types of objects so let’s try getting the row sums:

rowSums(count_matrix) %>% head()
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
           5935               0            4279            1936             467 
ENSG00000000938 
              3 

We can see that gene ENSG00000000005 does not have counts in any of our sample (the row sum is 0). How many of such genes are there?

sum(rowSums(count_matrix) == 0)
[1] 30208

About half of the genes in the catalog have not been seen.

Remember these are raw counts so we’re not going to do too much with them. If we are going to compare counts across samples or genes we will have to normalize them first. This is a topic for the RNAseq course! (insert shameless self promotion).

Metadata

The metadata slot contains the experimental data underlying the counts:

metadata(airway)
[[1]]
Experiment data
  Experimenter name: Himes BE 
  Laboratory: NA 
  Contact information:  
  Title: RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. 
  URL: http://www.ncbi.nlm.nih.gov/pubmed/24926665 
  PMIDs: 24926665 

  Abstract: A 226 word abstract is available. Use 'abstract' method.

What we would traditionally describe as metadata in for example an RNAseq experiment is in the colData slot instead:

colData(airway)
DataFrame with 8 rows and 9 columns
           SampleName     cell      dex    albut        Run avgLength
             <factor> <factor> <factor> <factor>   <factor> <integer>
SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126
SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126
SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87
SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126
SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98
           Experiment    Sample    BioSample
             <factor>  <factor>     <factor>
SRR1039508  SRX384345 SRS508568 SAMN02422669
SRR1039509  SRX384346 SRS508567 SAMN02422675
SRR1039512  SRX384349 SRS508571 SAMN02422678
SRR1039513  SRX384350 SRS508572 SAMN02422670
SRR1039516  SRX384353 SRS508575 SAMN02422682
SRR1039517  SRX384354 SRS508576 SAMN02422673
SRR1039520  SRX384357 SRS508579 SAMN02422683
SRR1039521  SRX384358 SRS508580 SAMN02422677

This dataframe tells us for each sample whether it was treated with dexamethasone (dex) or albuterol (albut), as well the Run ID, Experiment II and some other identifiers.

A cool thing about (Ranged)SummarizedExperiment objects is that the $ syntax directly indexes the colData. For example we can get a vector of sample names or whether the samples were treated with dexamethasone like so:

#the $ syntax directly indexes the metadata, i.e.
airway$SampleName
[1] GSM1275862 GSM1275863 GSM1275866 GSM1275867 GSM1275870 GSM1275871 GSM1275874
[8] GSM1275875
8 Levels: GSM1275862 GSM1275863 GSM1275866 GSM1275867 ... GSM1275875
airway$dex
[1] untrt trt   untrt trt   untrt trt   untrt trt  
Levels: trt untrt

Subsetting (Ranged)SummarizedExperiment objects

Another neat thing about (Ranged)SummarizedExperiment objects is that they can be subset like matrices.

Let’s have brief example of a matrix:

mat <- matrix(1:12, nrow = 3, ncol = 4)

# Assign row and column names
rownames(mat) <- c("Gene1", "Gene2", "Gene3")
colnames(mat) <- c("Sample1", "Sample2", "Sample3", "Sample4")

mat
      Sample1 Sample2 Sample3 Sample4
Gene1       1       4       7      10
Gene2       2       5       8      11
Gene3       3       6       9      12

Matrices are subset with this syntax:

mat[rows,columns]

Lets say we only want to look at Genes 1 and 2 we can indicate that by:

  • rownames
  • rowindices
  • a boolean vector
#rownames
mat[c("Gene1", "Gene2"), ]
      Sample1 Sample2 Sample3 Sample4
Gene1       1       4       7      10
Gene2       2       5       8      11
#rowindices
mat[c(1,3),]
      Sample1 Sample2 Sample3 Sample4
Gene1       1       4       7      10
Gene3       3       6       9      12
#boolean vector
keep <- c(T,F,T)
mat[keep,]
      Sample1 Sample2 Sample3 Sample4
Gene1       1       4       7      10
Gene3       3       6       9      12

And similarly for columns in the second field:

# Select Sample1 & Sample3 (all rows)
mat[, c("Sample1", "Sample3")]  
      Sample1 Sample3
Gene1       1       7
Gene2       2       8
Gene3       3       9

We can use that syntax to subset the entire (Ranged)SummarizedExperiment object (not just the count matrix in the assay slot)!

For example we can make a new (Ranged)SummarizedExperiment object with only the samples treated with dexamethasone.

#create a boolean vector of which samples to select
airway$dex
[1] untrt trt   untrt trt   untrt trt   untrt trt  
Levels: trt untrt
keep <- airway$dex == 'trt'
keep
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
#use the vector to subset the object as if were a matrix!
dex_treated_samples <- airway[, airway$dex == "trt"]
class(dex_treated_samples)
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
dex_treated_samples
class: RangedSummarizedExperiment 
dim: 63677 4 
metadata(1): ''
assays(1): counts
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
  ENSG00000273493
rowData names(10): gene_id gene_name ... seq_coord_system symbol
colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

That’s all for now!