Importing and annotating quantified data into R

Last updated on 2023-05-12 | Edit this page

Overview

Questions

How can one import quantified gene expression data into an object suitable for downstream statistical analysis in R?
What types of gene identifiers are typically used, and how are mappings between them done?

Objectives

Learn how to import the quantifications into a SummarizedExperiment object.
Learn how to add additional gene annotations to the object.

Contribute!

This episode is intended to show how we can assemble a SummarizedExperiment starting from individual count, rowdata and coldata files. Moreover, we will practice adding annotations for the genes, and discuss related concepts and things to keep in mind (annotation sources, versions, ‘helper’ packages like tximeta).

Read the data

Counts

R

counts <- read.csv("data/GSE96870_counts_cerebellum.csv", 
                   row.names = 1)

Sample annotations

R

coldata <- read.csv("data/GSE96870_coldata_cerebellum.csv",
                    row.names = 1)

Gene annotations

Need to be careful - the descriptions contain both commas and ’ (e.g., 5’)

R

rowranges <- read.delim("data/GSE96870_rowranges.tsv", sep = "\t", 
                        colClasses = c(ENTREZID = "character"),
                        header = TRUE, quote = "", row.names = 5)

Mention other ways of getting annotations, and practice querying org package. Important to use the right annotation source/version.

R

suppressPackageStartupMessages({
    library(org.Mm.eg.db)
})
mapIds(org.Mm.eg.db, keys = "497097", column = "SYMBOL", keytype = "ENTREZID")

OUTPUT

'select()' returned 1:1 mapping between keys and columns

OUTPUT

497097 
"Xkr4"

Check feature types

R

table(rowranges$gbkey)

OUTPUT


     C_region     D_segment          exon     J_segment      misc_RNA 
           20            23          4008            94          1988 
         mRNA         ncRNA precursor_RNA          rRNA          tRNA 
        21198         12285          1187            35           413 
    V_segment 
          535

Assemble SummarizedExperiment

R

stopifnot(rownames(rowranges) == rownames(counts),
          rownames(coldata) == colnames(counts))

se <- SummarizedExperiment(
    assays = list(counts = as.matrix(counts)),
    rowRanges = as(rowranges, "GRanges"),
    colData = coldata
)

Save SummarizedExperiment

R

saveRDS(se, "data/GSE96870_se.rds")

Session info

R

sessionInfo()

OUTPUT

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] org.Mm.eg.db_3.17.0         AnnotationDbi_1.62.1       
 [3] SummarizedExperiment_1.30.1 Biobase_2.60.0             
 [5] MatrixGenerics_1.12.0       matrixStats_0.63.0         
 [7] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0        
 [9] IRanges_2.34.0              S4Vectors_0.38.1           
[11] BiocGenerics_0.46.0        

loaded via a namespace (and not attached):
 [1] Matrix_1.5-4            bit_4.0.5               compiler_4.3.0         
 [4] BiocManager_1.30.20     renv_0.17.3             crayon_1.5.2           
 [7] blob_1.2.4              Biostrings_2.68.0       bitops_1.0-7           
[10] png_0.1-8               fastmap_1.1.1           lattice_0.21-8         
[13] R6_2.5.1                XVector_0.40.0          S4Arrays_1.0.1         
[16] knitr_1.42              DelayedArray_0.26.2     GenomeInfoDbData_1.2.10
[19] DBI_1.1.3               rlang_1.1.1             KEGGREST_1.40.0        
[22] cachem_1.0.8            xfun_0.39               bit64_4.0.5            
[25] RSQLite_2.3.1           memoise_2.0.1           cli_3.6.1              
[28] zlibbioc_1.46.0         grid_4.3.0              vctrs_0.6.2            
[31] evaluate_0.21           RCurl_1.98-1.12         httr_1.4.6             
[34] pkgconfig_2.0.3         tools_4.3.0

Keypoints

Depending on the gene expression quantification tool used, there are different ways (often distributed in Bioconductor packages) to read the output into a SummarizedExperiment or DGEList object for further processing in R.
Stable gene identifiers such as Ensembl or Entrez IDs should preferably be used as the main identifiers throughout an RNA-seq analysis, with gene symbols added for easier interpretation.