Importing and annotating quantified data into R

Last updated on 2024-03-05 | Edit this page

Overview

Questions

  • How can one import quantified gene expression data into an object suitable for downstream statistical analysis in R?
  • What types of gene identifiers are typically used, and how are mappings between them done?

Objectives

  • Learn how to import the quantifications into a SummarizedExperiment object.
  • Learn how to add additional gene annotations to the object.

Load packages


In this episode we will use some functions from add-on R packages. In order to use them, we need to load them from our library:

R

suppressPackageStartupMessages({
    library(AnnotationDbi)
    library(org.Mm.eg.db)
    library(hgu95av2.db)
    library(SummarizedExperiment)
})

If you get any error messages about there is no package called 'XXXX' it means you have not installed the package/s yet for this version of R. See the bottom of the Summary and Setup to install all the necessary packages for this workshop. If you have to install, remember to re-run the library commands above to load them.

Load data


In the last episode, we used R to download 4 files from the internet and saved them on our computer. But we do not have these files loaded into R yet so that we can work with them. The original experimental design in Blackmore et al. 2017 was fairly complex: 8 week old male and female C57BL/6 mice were collected at Day 0 (before influenza infection), Day 4 and Day 8 after influenza infection. From each mouse, cerebellum and spinal cord tissues were taken for RNA-seq. There were originally 4 mice per ‘Sex x Time x Tissue’ group, but a few were lost along the way resulting in a total of 45 samples. For this workshop, we are going to simplify the analysis by only using the 22 cerebellum samples. Expression quantification was done using STAR to align to the mouse genome and then counting reads that map to genes. In addition to the counts per gene per sample, we also need information on which sample belongs to which Sex/Time point/Replicate. And for the genes, it is helpful to have extra information called annotation. Let’s read in the data files that we downloaded in the last episode and start to explore them:

Counts

R

counts <- read.csv("data/GSE96870_counts_cerebellum.csv", 
                   row.names = 1)
dim(counts)

OUTPUT

[1] 41786    22

R

# View(counts)

Genes are in rows and samples are in columns, so we have counts for 41,786 genes and 22 samples. The View() command has been commented out for the website, but running it will open a tab in RStudio that lets us look at the data and even sort the table by a particular column. However, the viewer cannot change the data inside the counts object, so we can only look, not permanently sort nor edit the entries. When finished, close the viewer using the X in the tab. Looks like the rownames are gene symbols and the column names are the GEO sample IDs, which are not very informative for telling us which sample is what.

Sample annotations

Next read in the sample annotations. Because samples are in columns in the count matrix, we will name the object coldata:

R

coldata <- read.csv("data/GSE96870_coldata_cerebellum.csv",
                    row.names = 1)
dim(coldata)

OUTPUT

[1] 22 10

R

# View(coldata)

Now samples are in rows with the GEO sample IDs as the rownames, and we have 10 columns of information. The columns that are the most useful for this workshop are geo_accession (GEO sample IDs again), sex and time.

Gene annotations

The counts only have gene symbols, which while short and somewhat recognizable to the human brain, are not always good absolute identifiers for exactly what gene was measured. For this we need additional gene annotations that were provided by the authors. The count and coldata files were in comma separated value (.csv) format, but we cannot use that for our gene annotation file because the descriptions can contain commas that would prevent a .csv file from being read in correctly. Instead the gene annotation file is in tab separated value (.tsv) format. Likewise, the descriptions can contain the single quote ' (e.g., 5’), which by default R assumes indicates a character entry. So we have to use a more generic function read.delim() with extra arguments to specify that we have tab-separated data (sep = "\t") with no quotes used (quote = ""). We also put in other arguments to specify that the first row contains our column names (header = TRUE), the gene symbols that should be our row.names are in the 5th column (row.names = 5), and that NCBI’s species-specific gene ID (i.e., ENTREZID) should be read in as character data even though they look like numbers (colClasses argument). You can look up this details on available arguments by simply entering the function name starting with question mark. (e.g., ?read.delim)

R

rowranges <- read.delim("data/GSE96870_rowranges.tsv", 
                        sep = "\t", 
                        colClasses = c(ENTREZID = "character"),
                        header = TRUE, 
                        quote = "", 
                        row.names = 5)
dim(rowranges)

OUTPUT

[1] 41786     7

R

# View(rowranges)

For each of the 41,786 genes, we have the seqnames (e.g., chromosome number), start and end positions, strand, ENTREZID, gene product description (product) and the feature type (gbkey). These gene-level metadata are useful for the downstream analysis. For example, from the gbkey column, we can check what types of genes and how many of them are in our dataset:

R

table(rowranges$gbkey)

OUTPUT


     C_region     D_segment          exon     J_segment      misc_RNA 
           20            23          4008            94          1988 
         mRNA         ncRNA precursor_RNA          rRNA          tRNA 
        21198         12285          1187            35           413 
    V_segment 
          535 

Challenge: Discuss the following points with your neighbor

  1. How are the 3 objects counts, coldata and rowranges related to each other in terms of their rows and columns?
  2. If you only wanted to analyse the mRNA genes, what would you have to do keep just those (generally speaking, not exact codes)?
  3. If you decided the first two samples were outliers, what would you have to do to remove those (generally speaking, not exact codes)?
  1. In counts, the rows are genes just like the rows in rowranges. The columns in counts are the samples, but this corresponds to the rows in coldata.
  2. I would have to remember subset both the rows of counts and the rows of rowranges to just the mRNA genes.
  3. I would have to remember to subset both the columns of counts but the rows of coldata to exclude the first two samples.

You can see how keeping related information in separate objects could easily lead to mis-matches between our counts, gene annotations and sample annotations. This is why Bioconductor has created a specialized S4 class called a SummarizedExperiment. The details of a SummarizedExperiment were covered extensively at the end of the Introduction to data analysis with R and Bioconductor workshop. As a reminder, let’s take a look at the figure below representing the anatomy of the SummarizedExperiment class:

It is designed to hold any type of quantitative ’omics data (assays) along with linked sample annotations (colData) and feature annotations with (rowRanges) or without (rowData) chromosome, start and stop positions. Once these three tables are (correctly!) linked, subsetting either samples and/or features will correctly subset the assay, colData and rowRanges. Additionally, most Bioconductor packages are built around the same core data infrastructure so they will recognize and be able to manipulate SummarizedExperiment objects. Two of the most popular RNA-seq statistical analysis packages have their own extended S4 classes similar to a SummarizedExperiment with the additional slots for statistical results: DESeq2’s DESeqDataSet and edgeR’s DGEList. No matter which one you end up using for statistical analysis, you can start by putting your data in a SummarizedExperiment.

Assemble SummarizedExperiment


We will create a SummarizedExperiment from these objects:

  • The count object will be saved in assays slot
  • The coldata object with sample information will be stored in colData slot (sample metadata)
  • The rowranges object describing the genes will be stored in rowRanges slot (features metadata)

Before we put them together, you ABSOLUTELY MUST MAKE SURE THE SAMPLES AND GENES ARE IN THE SAME ORDER! Even though we saw that count and coldata had the same number of samples and count and rowranges had the same number of genes, we never explicitly checked to see if they were in the same order. One quick way to check:

R

all.equal(colnames(counts), rownames(coldata)) # samples

OUTPUT

[1] TRUE

R

all.equal(rownames(counts), rownames(rowranges)) # genes

OUTPUT

[1] TRUE

R

# If the first is not TRUE, you can match up the samples/columns in
# counts with the samples/rows in coldata like this (which is fine
# to run even if the first was TRUE):

tempindex <- match(colnames(counts), rownames(coldata))
coldata <- coldata[tempindex, ]

# Check again:
all.equal(colnames(counts), rownames(coldata)) 

OUTPUT

[1] TRUE

Challenge

If the features (i.e., genes) in the assay (e.g., counts) and the gene annotation table (e.g., rowranges) are different, how can we fix them? Write the codes.

R

tempindex <- match(rownames(counts), rownames(rowranges))
rowranges <- rowranges[tempindex, ]

all.equal(rownames(counts), rownames(rowranges)) 

Once we have verified that samples and genes are in the same order, we can then create our SummarizedExperiment object.

R

# One final check:
stopifnot(rownames(rowranges) == rownames(counts), # features
          rownames(coldata) == colnames(counts)) # samples

se <- SummarizedExperiment(
    assays = list(counts = as.matrix(counts)),
    rowRanges = as(rowranges, "GRanges"),
    colData = coldata
)

Because matching the genes and samples is so important, the SummarizedExperiment() constructor does some internal check to make sure they contain the same number of genes/samples and the sample/row names match. If not, you will get some error messages:

R

# wrong number of samples:

bad1 <- SummarizedExperiment(
    assays = list(counts = as.matrix(counts)),
    rowRanges = as(rowranges, "GRanges"),
    colData = coldata[1:3,]
)

ERROR

Error in validObject(.Object): invalid class "SummarizedExperiment" object: 
    nb of cols in 'assay' (22) must equal nb of rows in 'colData' (3)

R

# same number of genes but in different order:

bad2 <- SummarizedExperiment(
  assays = list(counts = as.matrix(counts)),
  rowRanges = as(rowranges[c(2:nrow(rowranges), 1),], "GRanges"),
  colData = coldata
)

ERROR

Error in SummarizedExperiment(assays = list(counts = as.matrix(counts)), : the rownames and colnames of the supplied assay(s) must be NULL or
  identical to those of the RangedSummarizedExperiment object (or
  derivative) to construct

A brief recap of how to access the various data slots in a SummarizedExperiment and how to make some manipulations:

R

# Access the counts
head(assay(se))

OUTPUT

             GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Xkr4               1891       2410       2159       1980       1977       1945
LOC105243853          0          0          1          4          0          0
LOC105242387        204        121        110        120        172        173
LOC105242467         12          5          5          5          2          6
Rp1                   2          2          0          3          2          1
Sox17               251        239        218        220        261        232
             GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Xkr4               1757       2235       1779       1528       1644       1585
LOC105243853          1          3          3          0          1          3
LOC105242387        177        130        131        160        180        176
LOC105242467          3          2          2          2          1          2
Rp1                   3          1          1          2          2          2
Sox17               179        296        233        271        205        230
             GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Xkr4               2275       1881       2584       1837       1890       1910
LOC105243853          1          0          0          1          1          0
LOC105242387        161        154        124        221        272        214
LOC105242467          2          4          7          1          3          1
Rp1                   3          6          5          3          5          1
Sox17               302        286        325        201        267        322
             GSM2545354 GSM2545362 GSM2545363 GSM2545380
Xkr4               1771       2315       1645       1723
LOC105243853          0          1          0          1
LOC105242387        124        189        223        251
LOC105242467          4          2          1          4
Rp1                   3          3          1          0
Sox17               273        197        310        246

R

dim(assay(se))

OUTPUT

[1] 41786    22

R

# The above works now because we only have one assay, "counts"
# But if there were more than one assay, we would have to specify
# which one like so:

head(assay(se, "counts"))

OUTPUT

             GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Xkr4               1891       2410       2159       1980       1977       1945
LOC105243853          0          0          1          4          0          0
LOC105242387        204        121        110        120        172        173
LOC105242467         12          5          5          5          2          6
Rp1                   2          2          0          3          2          1
Sox17               251        239        218        220        261        232
             GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Xkr4               1757       2235       1779       1528       1644       1585
LOC105243853          1          3          3          0          1          3
LOC105242387        177        130        131        160        180        176
LOC105242467          3          2          2          2          1          2
Rp1                   3          1          1          2          2          2
Sox17               179        296        233        271        205        230
             GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Xkr4               2275       1881       2584       1837       1890       1910
LOC105243853          1          0          0          1          1          0
LOC105242387        161        154        124        221        272        214
LOC105242467          2          4          7          1          3          1
Rp1                   3          6          5          3          5          1
Sox17               302        286        325        201        267        322
             GSM2545354 GSM2545362 GSM2545363 GSM2545380
Xkr4               1771       2315       1645       1723
LOC105243853          0          1          0          1
LOC105242387        124        189        223        251
LOC105242467          4          2          1          4
Rp1                   3          3          1          0
Sox17               273        197        310        246

R

# Access the sample annotations
colData(se)

OUTPUT

DataFrame with 22 rows and 10 columns
                     title geo_accession     organism         age         sex
               <character>   <character>  <character> <character> <character>
GSM2545336 CNS_RNA-seq_10C    GSM2545336 Mus musculus     8 weeks      Female
GSM2545337 CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks      Female
GSM2545338 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks      Female
GSM2545339 CNS_RNA-seq_13C    GSM2545339 Mus musculus     8 weeks      Female
GSM2545340 CNS_RNA-seq_14C    GSM2545340 Mus musculus     8 weeks        Male
...                    ...           ...          ...         ...         ...
GSM2545353  CNS_RNA-seq_3C    GSM2545353 Mus musculus     8 weeks      Female
GSM2545354  CNS_RNA-seq_4C    GSM2545354 Mus musculus     8 weeks        Male
GSM2545362  CNS_RNA-seq_5C    GSM2545362 Mus musculus     8 weeks      Female
GSM2545363  CNS_RNA-seq_6C    GSM2545363 Mus musculus     8 weeks        Male
GSM2545380  CNS_RNA-seq_9C    GSM2545380 Mus musculus     8 weeks      Female
             infection      strain        time      tissue     mouse
           <character> <character> <character> <character> <integer>
GSM2545336  InfluenzaA     C57BL/6        Day8  Cerebellum        14
GSM2545337 NonInfected     C57BL/6        Day0  Cerebellum         9
GSM2545338 NonInfected     C57BL/6        Day0  Cerebellum        10
GSM2545339  InfluenzaA     C57BL/6        Day4  Cerebellum        15
GSM2545340  InfluenzaA     C57BL/6        Day4  Cerebellum        18
...                ...         ...         ...         ...       ...
GSM2545353 NonInfected     C57BL/6        Day0  Cerebellum         4
GSM2545354 NonInfected     C57BL/6        Day0  Cerebellum         2
GSM2545362  InfluenzaA     C57BL/6        Day4  Cerebellum        20
GSM2545363  InfluenzaA     C57BL/6        Day4  Cerebellum        12
GSM2545380  InfluenzaA     C57BL/6        Day8  Cerebellum        19

R

dim(colData(se))

OUTPUT

[1] 22 10

R

# Access the gene annotations
head(rowData(se))

OUTPUT

DataFrame with 6 rows and 3 columns
                ENTREZID                product       gbkey
             <character>            <character> <character>
Xkr4              497097 X Kell blood group p..        mRNA
LOC105243853   105243853 uncharacterized LOC1..       ncRNA
LOC105242387   105242387 uncharacterized LOC1..       ncRNA
LOC105242467   105242467 lipoxygenase homolog..        mRNA
Rp1                19888 retinitis pigmentosa..        mRNA
Sox17              20671 SRY (sex determining..        mRNA

R

dim(rowData(se))

OUTPUT

[1] 41786     3

R

# Make better sample IDs that show sex, time and mouse ID:

se$Label <- paste(se$sex, se$time, se$mouse, sep = "_")
se$Label

OUTPUT

 [1] "Female_Day8_14" "Female_Day0_9"  "Female_Day0_10" "Female_Day4_15"
 [5] "Male_Day4_18"   "Male_Day8_6"    "Female_Day8_5"  "Male_Day0_11"  
 [9] "Female_Day4_22" "Male_Day4_13"   "Male_Day8_23"   "Male_Day8_24"  
[13] "Female_Day0_8"  "Male_Day0_7"    "Male_Day4_1"    "Female_Day8_16"
[17] "Female_Day4_21" "Female_Day0_4"  "Male_Day0_2"    "Female_Day4_20"
[21] "Male_Day4_12"   "Female_Day8_19"

R

colnames(se) <- se$Label

# Our samples are not in order based on sex and time
se$Group <- paste(se$sex, se$time, sep = "_")
se$Group

OUTPUT

 [1] "Female_Day8" "Female_Day0" "Female_Day0" "Female_Day4" "Male_Day4"  
 [6] "Male_Day8"   "Female_Day8" "Male_Day0"   "Female_Day4" "Male_Day4"  
[11] "Male_Day8"   "Male_Day8"   "Female_Day0" "Male_Day0"   "Male_Day4"  
[16] "Female_Day8" "Female_Day4" "Female_Day0" "Male_Day0"   "Female_Day4"
[21] "Male_Day4"   "Female_Day8"

R

# change this to factor data with the levels in order 
# that we want, then rearrange the se object:

se$Group <- factor(se$Group, levels = c("Female_Day0","Male_Day0", 
                                        "Female_Day4","Male_Day4",
                                        "Female_Day8","Male_Day8"))
se <- se[, order(se$Group)]
colData(se)

OUTPUT

DataFrame with 22 rows and 12 columns
                         title geo_accession     organism         age
                   <character>   <character>  <character> <character>
Female_Day0_9  CNS_RNA-seq_11C    GSM2545337 Mus musculus     8 weeks
Female_Day0_10 CNS_RNA-seq_12C    GSM2545338 Mus musculus     8 weeks
Female_Day0_8  CNS_RNA-seq_27C    GSM2545348 Mus musculus     8 weeks
Female_Day0_4   CNS_RNA-seq_3C    GSM2545353 Mus musculus     8 weeks
Male_Day0_11   CNS_RNA-seq_20C    GSM2545343 Mus musculus     8 weeks
...                        ...           ...          ...         ...
Female_Day8_16  CNS_RNA-seq_2C    GSM2545351 Mus musculus     8 weeks
Female_Day8_19  CNS_RNA-seq_9C    GSM2545380 Mus musculus     8 weeks
Male_Day8_6    CNS_RNA-seq_17C    GSM2545341 Mus musculus     8 weeks
Male_Day8_23   CNS_RNA-seq_25C    GSM2545346 Mus musculus     8 weeks
Male_Day8_24   CNS_RNA-seq_26C    GSM2545347 Mus musculus     8 weeks
                       sex   infection      strain        time      tissue
               <character> <character> <character> <character> <character>
Female_Day0_9       Female NonInfected     C57BL/6        Day0  Cerebellum
Female_Day0_10      Female NonInfected     C57BL/6        Day0  Cerebellum
Female_Day0_8       Female NonInfected     C57BL/6        Day0  Cerebellum
Female_Day0_4       Female NonInfected     C57BL/6        Day0  Cerebellum
Male_Day0_11          Male NonInfected     C57BL/6        Day0  Cerebellum
...                    ...         ...         ...         ...         ...
Female_Day8_16      Female  InfluenzaA     C57BL/6        Day8  Cerebellum
Female_Day8_19      Female  InfluenzaA     C57BL/6        Day8  Cerebellum
Male_Day8_6           Male  InfluenzaA     C57BL/6        Day8  Cerebellum
Male_Day8_23          Male  InfluenzaA     C57BL/6        Day8  Cerebellum
Male_Day8_24          Male  InfluenzaA     C57BL/6        Day8  Cerebellum
                   mouse          Label       Group
               <integer>    <character>    <factor>
Female_Day0_9          9  Female_Day0_9 Female_Day0
Female_Day0_10        10 Female_Day0_10 Female_Day0
Female_Day0_8          8  Female_Day0_8 Female_Day0
Female_Day0_4          4  Female_Day0_4 Female_Day0
Male_Day0_11          11   Male_Day0_11 Male_Day0  
...                  ...            ...         ...
Female_Day8_16        16 Female_Day8_16 Female_Day8
Female_Day8_19        19 Female_Day8_19 Female_Day8
Male_Day8_6            6    Male_Day8_6 Male_Day8  
Male_Day8_23          23   Male_Day8_23 Male_Day8  
Male_Day8_24          24   Male_Day8_24 Male_Day8  

R

# Finally, also factor the Label column to keep in order in plots:

se$Label <- factor(se$Label, levels = se$Label)

Challenge

  1. How many samples are there for each level of the Infection variable?
  2. Create 2 objects named se_infected and se_noninfected containing a subset of se with only infected and non-infected samples, respectively. Then, calculate the mean expression levels of the first 500 genes for each object, and use the summary() function to explore the distribution of expression levels for infected and non-infected samples based on these genes.
  3. How many samples represent female mice infected with Influenza A on day 8?

R

# 1
table(se$infection)

OUTPUT


 InfluenzaA NonInfected 
         15           7 

R

# 2
se_infected <- se[, se$infection == "InfluenzaA"]
se_noninfected <- se[, se$infection == "NonInfected"]

means_infected <- rowMeans(assay(se_infected)[1:500, ])
means_noninfected <- rowMeans(assay(se_noninfected)[1:500, ])

summary(means_infected)

OUTPUT

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.000     0.133     2.867   764.068   337.417 18896.600 

R

summary(means_noninfected)

OUTPUT

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    0.000     0.143     3.143   771.017   366.571 20010.143 

R

# 3
ncol(se[, se$sex == "Female" & se$infection == "InfluenzaA" & se$time == "Day8"])

OUTPUT

[1] 4

Save SummarizedExperiment


This was a bit of code and time to create our SummarizedExperiment object. We will need to keep using it throughout the workshop, so it can be useful to save it as an actual single file on our computer to read it back in to R’s memory if we have to shut down RStudio. To save an R-specific file we can use the saveRDS() function and later read it back into R using the readRDS() function.

R

saveRDS(se, "data/GSE96870_se.rds")
rm(se) # remove the object!
se <- readRDS("data/GSE96870_se.rds")

Data provenance and reproducibility


We have now created an external .rds file that represents our RNA-Seq data in a format that can be read into R and used by various packages for our analyses. But we should still keep a record of the codes that created the .rds file from the 3 files we downloaded from the internet. But what is the provenance of those files - i.e, where did they come from and how were they made? The original counts and gene information were deposited in the GEO public database, accession number GSE96870. But these counts were generated by running alignment/quantification programs on the also-deposited fastq files that hold the sequence base calls and quality scores, which in turn were generated by a specific sequencing machine using some library preparation method on RNA extracted from samples collected in a particular experiment. Whew!

If you conducted the original experiment ideally you would have the complete record of where and how the data were generated. But you might use publicly-available data sets so the best you can do is to keep track of what original files you got from where and what manipulations you have done to them. Using R codes to keep track of everything is an excellent way to be able to reproduce the entire analysis from the original input files. The exact results you get can differ depending on the R version, add-on package versions and even what operating system you use, so make sure to keep track of all this information as well by running sessionInfo() and recording the output (see example at end of lesson).

Challenge: How to subset to mRNA genes

Before, we conceptually discussed subsetting to only the mRNA genes. Now that we have our SummarizedExperiment object, it becomes much easier to write the codes to subset se to a new object called se_mRNA that contains only the genes/rows where the rowData(se)$gbkey is equal to mRNA. Write the codes and then check you correctly got the 21,198 mRNA genes:

R

se_mRNA <- se[rowData(se)$gbkey == "mRNA" , ]
dim(se_mRNA)

OUTPUT

[1] 21198    22

Gene Annotations


Depending on who generates your count data, you might not have a nice file of additional gene annotations. There may only be the count row names, which could be gene symbols or ENTREZIDs or another database’s ID. Characteristics of gene annotations differ based on their annotation strategies and information sources. For example, RefSeq human gene models (i.e., Entrez from NCBI) are well supported and broadly used in various studies. The UCSC Known Genes dataset is based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank, and serves as a foundation for the UCSC Genome Browser. Ensembl genes contain both automated genome annotation and manual curation.

You can find more information in Bioconductor Annotation Workshop material.

Bioconductor has many packages and functions that can help you to get additional annotation information for your genes. The available resources are covered in more detail in Episode 7 Gene set enrichment analysis.

Here, we will introduce one of the gene ID mapping functions, mapIds:

mapIds(annopkg, keys, column, keytype, ..., multiVals)

Where

  • annopkg is the annotation package
  • keys are the IDs that we know
  • column is the value we want
  • keytype is the type of key used

R

mapIds(org.Mm.eg.db, keys = "497097", column = "SYMBOL", keytype = "ENTREZID")

OUTPUT

'select()' returned 1:1 mapping between keys and columns

OUTPUT

497097 
"Xkr4" 

Different from the select() function, mapIds() function handles 1:many mapping between keys and columns through an additional argument, multiVals. The below example demonstrate this functionality using the hgu95av2.db package, an Affymetrix Human Genome U95 Set annotation data.

R

keys <- head(keys(hgu95av2.db, "ENTREZID"))
last <- function(x){x[[length(x)]]}

mapIds(hgu95av2.db, keys = keys, column = "ALIAS", keytype = "ENTREZID")

OUTPUT

'select()' returned 1:many mapping between keys and columns

OUTPUT

       10       100      1000     10000 100008586     10001 
   "AAC2"    "ADA1"   "ACOGS"    "MPPH"     "AL4"   "ARC33" 

R

# When there is 1:many mapping, the default behavior was 
# to output the first match. This can be changed to a function,
# which we defined above to give us the last match:

mapIds(hgu95av2.db, keys = keys, column = "ALIAS", keytype = "ENTREZID", multiVals = last)

OUTPUT

'select()' returned 1:many mapping between keys and columns

OUTPUT

       10       100      1000     10000 100008586     10001 
   "NAT2"     "ADA"    "CDH2"    "AKT3" "GAGE12F"    "MED6" 

R

# Or we can get back all the many mappings:

mapIds(hgu95av2.db, keys = keys, column = "ALIAS", keytype = "ENTREZID", multiVals = "list")

OUTPUT

'select()' returned 1:many mapping between keys and columns

OUTPUT

$`10`
[1] "AAC2"  "NAT-2" "PNAT"  "NAT2" 

$`100`
[1] "ADA1" "ADA" 

$`1000`
[1] "ACOGS"  "ADHD8"  "ARVD14" "CD325"  "CDHN"   "CDw325" "NCAD"   "CDH2"  

$`10000`
[1] "MPPH"         "MPPH2"        "PKB-GAMMA"    "PKBG"         "PRKBG"       
[6] "RAC-PK-gamma" "RAC-gamma"    "STK-2"        "AKT3"        

$`100008586`
[1] "AL4"     "CT4.7"   "GAGE-7"  "GAGE-7B" "GAGE-8"  "GAGE7"   "GAGE7B" 
[8] "GAGE12F"

$`10001`
[1] "ARC33"     "NY-REN-28" "MED6"     

Session info


R

sessionInfo()

OUTPUT

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] hgu95av2.db_3.13.0          org.Hs.eg.db_3.17.0        
 [3] org.Mm.eg.db_3.17.0         AnnotationDbi_1.62.2       
 [5] SummarizedExperiment_1.30.2 Biobase_2.60.0             
 [7] MatrixGenerics_1.12.3       matrixStats_1.2.0          
 [9] GenomicRanges_1.52.1        GenomeInfoDb_1.36.4        
[11] IRanges_2.34.1              S4Vectors_0.38.2           
[13] BiocGenerics_0.46.0         knitr_1.45                 

loaded via a namespace (and not attached):
 [1] Matrix_1.6-5            bit_4.0.5               highr_0.10             
 [4] compiler_4.3.2          BiocManager_1.30.22     renv_1.0.5             
 [7] crayon_1.5.2            blob_1.2.4              Biostrings_2.68.1      
[10] bitops_1.0-7            png_0.1-8               fastmap_1.1.1          
[13] yaml_2.3.8              lattice_0.22-5          R6_2.5.1               
[16] XVector_0.40.0          S4Arrays_1.0.6          DelayedArray_0.26.7    
[19] GenomeInfoDbData_1.2.10 DBI_1.2.2               rlang_1.1.3            
[22] KEGGREST_1.40.1         cachem_1.0.8            xfun_0.42              
[25] bit64_4.0.5             RSQLite_2.3.5           memoise_2.0.1          
[28] cli_3.6.2               zlibbioc_1.46.0         grid_4.3.2             
[31] vctrs_0.6.5             evaluate_0.23           abind_1.4-5            
[34] RCurl_1.98-1.14         httr_1.4.7              pkgconfig_2.0.3        
[37] tools_4.3.2            

Key Points

  • Depending on the gene expression quantification tool used, there are different ways (often distributed in Bioconductor packages) to read the output into a SummarizedExperiment or DGEList object for further processing in R.
  • Stable gene identifiers such as Ensembl or Entrez IDs should preferably be used as the main identifiers throughout an RNA-seq analysis, with gene symbols added for easier interpretation.