The SummarizedExperiment class

Last updated on 2024-11-19 | Edit this page

Estimated time: NA minutes


Overview

Questions

  • How is information organized in SummarizedExperiment objects?
  • How can that information be added, edited, and accessed?

Objectives

  • Describe how both experimental data and metadata can be stored in a single object.
  • Explain why this is crucial to keep data and metadata synchronised throughout analyses.

Install packages


Before we can proceed into the following sections, we install some Bioconductor packages that we will need. First, we check that the BiocManager package is installed before trying to use it; otherwise we install it. Then we use the BiocManager::install() function to install the necessary packages.

R

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("SummarizedExperiment"))

Motivation


Experiments are multifaceted data sets typically composed of at least two key pieces of information necessary for any analysis:

  • Assay data, typically a matrix representing measurements of a set of features in a set of samples (e.g., RNA-sequencing).
  • Sample metadata, typically a data.frame of metadata representing information about samples (e.g., treatment group).

All those pieces of information must be kept synchronised – same samples, same order – for downstream analyses to accurately process the information and produce reliable results.

It is also very common for analytical workflows to analyse subsets of samples or identify outliers that need to be removed to allow for more accurate downstream analyses. In such cases, all aspects of the experiments must be subsetted to the same set of samples – in the same order – to preserve consistency in the data set and correct results.

The SummarizedExperiment – implemented in the SummarizedExperiment package – provides a container that accommodates those essential aspects of individual experiments into a single object coordinates data and metadata during subsetting and reordering operations. Its flexibility accommodating many biological data types and comprehensive set of features make it a popular data structure re-used throughout the Bioconductor and a key part of the Bioconductor ecosystem. For instance, familiarity with the SummarizedExperiment is a prerequisite for working with the DESeq2 package for differential expression analysis, and the SingleCellExperiment extension class for single-cell analyses.

Class structure


SummarizedExperiment is a matrix-like container where rows represent features of interest (e.g. genes, transcripts, exons, etc.) and columns represent samples.

The objects can contain one or more assays, each represented by a matrix-like object, as long as they be of the same dimensions.

Information about the features is stored in a DataFrame object, nested within the SummarizedExperiment object, and accessible using the function rowData(). Each row of the DataFrame provides information on the feature in the corresponding row of the SummarizedExperiment object. That information may include annotations independent of the experiment (e.g., gene identifier) as well as quality control metrics computed from assay data during workflows.

Similarly, information about the samples is stored in another DataFrame object, also nested within the SummarizedExperiment object, and accessible using the function colData().

The following graphic displays the class geometry and highlights the vertical (column) and horizontal (row) relationships. It was obtained from the vignette of the SummarizedExperiment package.

Schematic representation of the SummarizedExperiment class.

Creating a SummarizedExperiment object


Let us first load the package:

R

library(SummarizedExperiment)

Then, let us import assay data from a file that we downloaded during the lesson setup.

The file is a simple text file in which the first column contains made-up feature identifiers and all other columns contain simulated data for made-up samples. As such, we can use the base R function read.csv to parse the file into a data.frame object.

In the example below, we indicate that the row names can be found in the first column, so that the function immediately sets the row names accordingly in the output object. Hadn’t we specified it, the function would have parsed it as a regular column and left the row names to the default integer indexing.

R

count_data <- read.csv("data/counts.csv", row.names = 1)
count_data

OUTPUT

        sample_1 sample_2 sample_3 sample_4
gene_1       109       84       91      105
gene_2       111       97       98      108
gene_3        89      121      105       99
gene_4       105      109      122      101
gene_5        82       97      112       83
gene_6        89       96       90      116
gene_7       121       95       88      106
gene_8       101      101       86      103
gene_9        91      119       89       87
gene_10       81      111       81      118
gene_11       93      118       93       99
gene_12      103      111      116      103
gene_13       89      126      103      100
gene_14      101      107      111       79
gene_15       96       91      103      108
gene_16      110      102      128      103
gene_17       95      106      118      100
gene_18       99      115      114      102
gene_19      114      105       94      118
gene_20      110       88       99      102
gene_21      116       95       94      105
gene_22      114       96      107       91
gene_23       97      120       93       90
gene_24       91       84      118       97
gene_25       99      106       97      110

One assay data matrix is enough to create a SummarizedExperiment object, although without sample metadata, only unsupervised analyses – that do not require information about the samples – are possible.

In the example below, we create a SummarizedExperiment object in which we store the matrix of count data under the name ‘counts’. Note that the argument ‘assays=’ (plural) can accept more than one assay – as discussed above – which is why we encapsulate our only assay matrix in a named list that also gives us the opportunity to assign a name to the assay. Naming assays becomes crucial during workflows that contain multiple assays, in order to identify and retrieve individual assays unambiguously.

R

se <- SummarizedExperiment(
  assays = list(counts = count_data)
)
se

OUTPUT

class: SummarizedExperiment
dim: 25 4
metadata(0):
assays(1): counts
rownames(25): gene_1 gene_2 ... gene_24 gene_25
rowData names(0):
colnames(4): sample_1 sample_2 sample_3 sample_4
colData names(0):

In the output above, the summary view of the object reminds us that the assay – and thus the overall SummarizedExperiment object – contains information for 25 features in 4 samples, it contains a single assay named ‘counts’, the features seem to be named from ‘gene_1’ to ‘gene_25’ (only the first and last ones are shown), and the samples are named from sample_1 to sample_4. The object does not contain any row metadata nor column metadata.

To create a more comprehensive SummarizedExperiment object, let us import gene metadata and sample metadata for another two files that we downloaded during the lesson setup.

The files are formatted similarly to the count data, so we use again the base R function read.csv() to parse them into data.frame objects.

R

sample_metadata <- read.csv("data/sample_metadata.csv", row.names = 1)
sample_metadata

OUTPUT

         condition batch
sample_1         A     1
sample_2         A     2
sample_3         B     1
sample_4         B     2

R

gene_metadata <- read.csv("data/gene_metadata.csv", row.names = 1)
gene_metadata

OUTPUT

        chromosome
gene_1           4
gene_2           4
gene_3           5
gene_4           4
gene_5           5
gene_6           1
gene_7           2
gene_8           1
gene_9           3
gene_10          1
gene_11          1
gene_12          5
gene_13          5
gene_14          1
gene_15          3
gene_16          4
gene_17          2
gene_18          5
gene_19          1
gene_20          3
gene_21          5
gene_22          5
gene_23          1
gene_24          4
gene_25          5

We can re-create the SummarizedExperiment object, this time including the gene and sample metadata:

R

se <- SummarizedExperiment(
  assays = list(counts = count_data),
  colData = sample_metadata,
  rowData = gene_metadata
)
se

OUTPUT

class: SummarizedExperiment
dim: 25 4
metadata(0):
assays(1): counts
rownames(25): gene_1 gene_2 ... gene_24 gene_25
rowData names(1): chromosome
colnames(4): sample_1 sample_2 sample_3 sample_4
colData names(2): condition batch

Comparing the output above with the previous ‘assay-only’ version of the SummarizedExperiment object, we can see that the rowData and colData components now contain 1 and 4 metadata, respectively.

Accessing information


A number of functions give access to the various components of SummarizedExperiment objects.

The assays() function returns the list of assays stored in the object. The output is always a List, event if the object contains a single assay.

R

assays(se)

OUTPUT

List of length 1
names(1): counts

The assayNames() function returns a character vector of the assay names. This is most useful when the object contains larger numbers of assays, as the assays() function (see above) may not display all of them. Knowing the names of the various assays is key to accessing any individual assay.

R

assayNames(se)

OUTPUT

[1] "counts"

The assay() function can be used to retrieve a single assay from the object. For this, the function should be given the name or the integer position of the desired assay. If unspecified, the function automatically returns the first assay in the object.

R

head(assay(se, "counts"))

OUTPUT

       sample_1 sample_2 sample_3 sample_4
gene_1      109       84       91      105
gene_2      111       97       98      108
gene_3       89      121      105       99
gene_4      105      109      122      101
gene_5       82       97      112       83
gene_6       89       96       90      116

The colData() and rowData() functions can be used to retrieve sample metadata and row metadata, respectively.

R

colData(se)

OUTPUT

DataFrame with 4 rows and 2 columns
           condition     batch
         <character> <integer>
sample_1           A         1
sample_2           A         2
sample_3           B         1
sample_4           B         2

R

rowData(se)

OUTPUT

DataFrame with 25 rows and 1 column
        chromosome
         <integer>
gene_1           4
gene_2           4
gene_3           5
gene_4           4
gene_5           5
...            ...
gene_21          5
gene_22          5
gene_23          1
gene_24          4
gene_25          5

Separately, the $ operator can be used to access a single column of sample metadata. A useful feature of this operator is the autocompletion that is triggered automatically in RStudio or using the tabulation key in terminal applications.

R

se$batch

OUTPUT

[1] 1 2 1 2

Notably, there is no operator for accessing a single column of feature metadata. For this, users need to first access the full DataFrame returned by rowData() before accessing a column using the standard $ or [[ operators, e.g.

R

rowData(se)[["chromosome"]]

OUTPUT

 [1] 4 4 5 4 5 1 2 1 3 1 1 5 5 1 3 4 2 5 1 3 5 5 1 4 5

Adding and editing information

Information can be added to SummarizedExperiment after their creation. In fact, this is the basis for workflows that compute normalised assay values – adding those to the list of assays –, and quality control metrics for either features or samples – adding those to the rowData and colData components, as appropriate – progressively growing the amount of information stored within the overall object.

Most of the functions for accessing information, described in the previous section, have a counterpart function for adding new values or editing existing ones. Note that editing is merely the result of adding values under a name already in use, which has the effect of replacing existing values.

In the example below, we add an assay named ‘logcounts’ which is the result of applying a log-transformation to the ‘counts’ assay after adding a pseucocount of one:

R

assay(se, "logcounts") <- log1p(assay(se, "counts"))
se

OUTPUT

class: SummarizedExperiment
dim: 25 4
metadata(0):
assays(2): counts logcounts
rownames(25): gene_1 gene_2 ... gene_24 gene_25
rowData names(1): chromosome
colnames(4): sample_1 sample_2 sample_3 sample_4
colData names(2): condition batch

In the output above, we see that the object now contains two assays: the ‘counts’ assay included in the object when it was first created, and the ‘logcounts’ assay added just now.

Similarly, the colData() and rowData() functions – as well as the $ operator – can be used to add and edit values in the corresponding components.

In the example below, we compute the sum of counts for each sample, and store the result in the sample metadata table under the new name ‘sum_counts’.

R

colData(se)[["sum_counts"]] <- colSums(assay(se, "counts"))
colData(se)

OUTPUT

DataFrame with 4 rows and 3 columns
           condition     batch sum_counts
         <character> <integer>  <numeric>
sample_1           A         1       2506
sample_2           A         2       2600
sample_3           B         1       2550
sample_4           B         2       2533

In this next example, we compute the average count for each feature, and store the result in the feature metadata table under the new name ‘mean_counts’.

R

rowData(se)[["mean_counts"]] <- rowSums(assay(se, "counts"))
rowData(se)

OUTPUT

DataFrame with 25 rows and 2 columns
        chromosome mean_counts
         <integer>   <numeric>
gene_1           4         389
gene_2           4         414
gene_3           5         414
gene_4           4         437
gene_5           5         374
...            ...         ...
gene_21          5         410
gene_22          5         408
gene_23          1         400
gene_24          4         390
gene_25          5         412

Key Points

  • The SummarizedExperiment class provides a single container for storing both assay data and metadata.
  • Assay data and metadata are kept synchronised through subsetting and reordering operations.
  • A comprehensive set of functions are available to access, add, and edit information stored in the various components of the SummarizedExperiment objects.