Importing and annotating quantified data into R
Last updated on 2024-12-03 | Edit this page
Overview
Questions
- How can one import quantified gene expression data into an object suitable for downstream statistical analysis in R?
- What types of gene identifiers are typically used, and how are mappings between them done?
Objectives
- Learn how to import the quantifications into a SummarizedExperiment object.
- Learn how to add additional gene annotations to the object.
Load packages
In this episode we will use some functions from add-on R packages. In
order to use them, we need to load them from our
library
:
R
suppressPackageStartupMessages({
library(AnnotationDbi)
library(org.Mm.eg.db)
library(hgu95av2.db)
library(SummarizedExperiment)
})
If you get any error messages about
there is no package called 'XXXX'
it means you have not
installed the package/s yet for this version of R. See the bottom of the
Summary
and Setup to install all the necessary packages for this workshop.
If you have to install, remember to re-run the library
commands above to load them.
Load data
In the last episode, we used R to download 4 files from the internet and saved them on our computer. But we do not have these files loaded into R yet so that we can work with them. The original experimental design in Blackmore et al. 2017 was fairly complex: 8 week old male and female C57BL/6 mice were collected at Day 0 (before influenza infection), Day 4 and Day 8 after influenza infection. From each mouse, cerebellum and spinal cord tissues were taken for RNA-seq. There were originally 4 mice per ‘Sex x Time x Tissue’ group, but a few were lost along the way resulting in a total of 45 samples. For this workshop, we are going to simplify the analysis by only using the 22 cerebellum samples. Expression quantification was done using STAR to align to the mouse genome and then counting reads that map to genes. In addition to the counts per gene per sample, we also need information on which sample belongs to which Sex/Time point/Replicate. And for the genes, it is helpful to have extra information called annotation. Let’s read in the data files that we downloaded in the last episode and start to explore them:
Counts
R
counts <- read.csv("data/GSE96870_counts_cerebellum.csv",
row.names = 1)
dim(counts)
OUTPUT
[1] 41786 22
R
# View(counts)
Genes are in rows and samples are in columns, so we have counts for
41,786 genes and 22 samples. The View()
command has been
commented out for the website, but running it will open a tab in RStudio
that lets us look at the data and even sort the table by a particular
column. However, the viewer cannot change the data inside the
counts
object, so we can only look, not permanently sort
nor edit the entries. When finished, close the viewer using the X in the
tab. Looks like the rownames are gene symbols and the column names are
the GEO sample IDs, which are not very informative for telling us which
sample is what.
Sample annotations
Next read in the sample annotations. Because samples are in columns
in the count matrix, we will name the object coldata
:
R
coldata <- read.csv("data/GSE96870_coldata_cerebellum.csv",
row.names = 1)
dim(coldata)
OUTPUT
[1] 22 10
R
# View(coldata)
Now samples are in rows with the GEO sample IDs as the rownames, and
we have 10 columns of information. The columns that are the most useful
for this workshop are geo_accession
(GEO sample IDs again),
sex
and time
.
Gene annotations
The counts only have gene symbols, which while short and somewhat
recognizable to the human brain, are not always good absolute
identifiers for exactly what gene was measured. For this we need
additional gene annotations that were provided by the authors. The
count
and coldata
files were in comma
separated value (.csv) format, but we cannot use that for our gene
annotation file because the descriptions can contain commas that would
prevent a .csv file from being read in correctly. Instead the gene
annotation file is in tab separated value (.tsv) format. Likewise, the
descriptions can contain the single quote '
(e.g., 5’),
which by default R assumes indicates a character entry. So we have to
use a more generic function read.delim()
with extra
arguments to specify that we have tab-separated data
(sep = "\t"
) with no quotes used (quote = ""
).
We also put in other arguments to specify that the first row contains
our column names (header = TRUE
), the gene symbols that
should be our row.names
are in the 5th column
(row.names = 5
), and that NCBI’s species-specific gene ID
(i.e., ENTREZID) should be read in as character data even though they
look like numbers (colClasses
argument). You can look up
this details on available arguments by simply entering the function name
starting with question mark. (e.g., ?read.delim
)
R
rowranges <- read.delim("data/GSE96870_rowranges.tsv",
sep = "\t",
colClasses = c(ENTREZID = "character"),
header = TRUE,
quote = "",
row.names = 5)
dim(rowranges)
OUTPUT
[1] 41786 7
R
# View(rowranges)
For each of the 41,786 genes, we have the seqnames
(e.g., chromosome number), start
and end
positions, strand
, ENTREZID
, gene product
description (product
) and the feature type
(gbkey
). These gene-level metadata are useful for the
downstream analysis. For example, from the gbkey
column, we
can check what types of genes and how many of them are in our
dataset:
R
table(rowranges$gbkey)
OUTPUT
C_region D_segment exon J_segment misc_RNA
20 23 4008 94 1988
mRNA ncRNA precursor_RNA rRNA tRNA
21198 12285 1187 35 413
V_segment
535
Challenge: Discuss the following points with your neighbor
- How are the 3 objects
counts
,coldata
androwranges
related to each other in terms of their rows and columns? - If you only wanted to analyse the mRNA genes, what would you have to do keep just those (generally speaking, not exact codes)?
- If you decided the first two samples were outliers, what would you have to do to remove those (generally speaking, not exact codes)?
- In
counts
, the rows are genes just like the rows inrowranges
. The columns incounts
are the samples, but this corresponds to the rows incoldata
. - I would have to remember subset both the rows of
counts
and the rows ofrowranges
to just the mRNA genes. - I would have to remember to subset both the columns of
counts
but the rows ofcoldata
to exclude the first two samples.
You can see how keeping related information in separate objects could
easily lead to mis-matches between our counts, gene annotations and
sample annotations. This is why Bioconductor has created a specialized
S4 class called a SummarizedExperiment
. The details of a
SummarizedExperiment
were covered extensively at the end of
the Introduction
to data analysis with R and Bioconductor workshop. As a reminder,
let’s take a look at the figure below representing the anatomy of the
SummarizedExperiment
class:
It is designed to hold any type of quantitative ’omics data
(assays
) along with linked sample annotations
(colData
) and feature annotations with
(rowRanges
) or without (rowData
) chromosome,
start and stop positions. Once these three tables are (correctly!)
linked, subsetting either samples and/or features will correctly subset
the assay
, colData
and rowRanges
.
Additionally, most Bioconductor packages are built around the same core
data infrastructure so they will recognize and be able to manipulate
SummarizedExperiment
objects. Two of the most popular
RNA-seq statistical analysis packages have their own extended S4 classes
similar to a SummarizedExperiment
with the additional slots
for statistical results: DESeq2’s
DESeqDataSet
and edgeR’s
DGEList
. No matter which one you end up using for
statistical analysis, you can start by putting your data in a
SummarizedExperiment
.
Assemble SummarizedExperiment
We will create a SummarizedExperiment
from these
objects:
- The
count
object will be saved inassays
slot - The
coldata
object with sample information will be stored incolData
slot (sample metadata) - The
rowranges
object describing the genes will be stored inrowRanges
slot (features metadata)
Before we put them together, you ABSOLUTELY MUST MAKE SURE THE
SAMPLES AND GENES ARE IN THE SAME ORDER! Even though we saw that
count
and coldata
had the same number of
samples and count
and rowranges
had the same
number of genes, we never explicitly checked to see if they were in the
same order. One quick way to check:
R
all.equal(colnames(counts), rownames(coldata)) # samples
OUTPUT
[1] TRUE
R
all.equal(rownames(counts), rownames(rowranges)) # genes
OUTPUT
[1] TRUE
R
# If the first is not TRUE, you can match up the samples/columns in
# counts with the samples/rows in coldata like this (which is fine
# to run even if the first was TRUE):
tempindex <- match(colnames(counts), rownames(coldata))
coldata <- coldata[tempindex, ]
# Check again:
all.equal(colnames(counts), rownames(coldata))
OUTPUT
[1] TRUE
Challenge
If the features (i.e., genes) in the assay (e.g.,
counts
) and the gene annotation table (e.g.,
rowranges
) are different, how can we fix them? Write the
codes.
R
tempindex <- match(rownames(counts), rownames(rowranges))
rowranges <- rowranges[tempindex, ]
all.equal(rownames(counts), rownames(rowranges))
Once we have verified that samples and genes are in the same order,
we can then create our SummarizedExperiment
object.
R
# One final check:
stopifnot(rownames(rowranges) == rownames(counts), # features
rownames(coldata) == colnames(counts)) # samples
se <- SummarizedExperiment(
assays = list(counts = as.matrix(counts)),
rowRanges = as(rowranges, "GRanges"),
colData = coldata
)
Because matching the genes and samples is so important, the
SummarizedExperiment()
constructor does some internal check
to make sure they contain the same number of genes/samples and the
sample/row names match. If not, you will get some error messages:
R
# wrong number of samples:
bad1 <- SummarizedExperiment(
assays = list(counts = as.matrix(counts)),
rowRanges = as(rowranges, "GRanges"),
colData = coldata[1:3,]
)
ERROR
Error in validObject(.Object): invalid class "SummarizedExperiment" object:
nb of cols in 'assay' (22) must equal nb of rows in 'colData' (3)
R
# same number of genes but in different order:
bad2 <- SummarizedExperiment(
assays = list(counts = as.matrix(counts)),
rowRanges = as(rowranges[c(2:nrow(rowranges), 1),], "GRanges"),
colData = coldata
)
ERROR
Error in SummarizedExperiment(assays = list(counts = as.matrix(counts)), : the rownames and colnames of the supplied assay(s) must be NULL or
identical to those of the RangedSummarizedExperiment object (or
derivative) to construct
A brief recap of how to access the various data slots in a
SummarizedExperiment
and how to make some
manipulations:
R
# Access the counts
head(assay(se))
OUTPUT
GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Xkr4 1891 2410 2159 1980 1977 1945
LOC105243853 0 0 1 4 0 0
LOC105242387 204 121 110 120 172 173
LOC105242467 12 5 5 5 2 6
Rp1 2 2 0 3 2 1
Sox17 251 239 218 220 261 232
GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Xkr4 1757 2235 1779 1528 1644 1585
LOC105243853 1 3 3 0 1 3
LOC105242387 177 130 131 160 180 176
LOC105242467 3 2 2 2 1 2
Rp1 3 1 1 2 2 2
Sox17 179 296 233 271 205 230
GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Xkr4 2275 1881 2584 1837 1890 1910
LOC105243853 1 0 0 1 1 0
LOC105242387 161 154 124 221 272 214
LOC105242467 2 4 7 1 3 1
Rp1 3 6 5 3 5 1
Sox17 302 286 325 201 267 322
GSM2545354 GSM2545362 GSM2545363 GSM2545380
Xkr4 1771 2315 1645 1723
LOC105243853 0 1 0 1
LOC105242387 124 189 223 251
LOC105242467 4 2 1 4
Rp1 3 3 1 0
Sox17 273 197 310 246
R
dim(assay(se))
OUTPUT
[1] 41786 22
R
# The above works now because we only have one assay, "counts"
# But if there were more than one assay, we would have to specify
# which one like so:
head(assay(se, "counts"))
OUTPUT
GSM2545336 GSM2545337 GSM2545338 GSM2545339 GSM2545340 GSM2545341
Xkr4 1891 2410 2159 1980 1977 1945
LOC105243853 0 0 1 4 0 0
LOC105242387 204 121 110 120 172 173
LOC105242467 12 5 5 5 2 6
Rp1 2 2 0 3 2 1
Sox17 251 239 218 220 261 232
GSM2545342 GSM2545343 GSM2545344 GSM2545345 GSM2545346 GSM2545347
Xkr4 1757 2235 1779 1528 1644 1585
LOC105243853 1 3 3 0 1 3
LOC105242387 177 130 131 160 180 176
LOC105242467 3 2 2 2 1 2
Rp1 3 1 1 2 2 2
Sox17 179 296 233 271 205 230
GSM2545348 GSM2545349 GSM2545350 GSM2545351 GSM2545352 GSM2545353
Xkr4 2275 1881 2584 1837 1890 1910
LOC105243853 1 0 0 1 1 0
LOC105242387 161 154 124 221 272 214
LOC105242467 2 4 7 1 3 1
Rp1 3 6 5 3 5 1
Sox17 302 286 325 201 267 322
GSM2545354 GSM2545362 GSM2545363 GSM2545380
Xkr4 1771 2315 1645 1723
LOC105243853 0 1 0 1
LOC105242387 124 189 223 251
LOC105242467 4 2 1 4
Rp1 3 3 1 0
Sox17 273 197 310 246
R
# Access the sample annotations
colData(se)
OUTPUT
DataFrame with 22 rows and 10 columns
title geo_accession organism age sex
<character> <character> <character> <character> <character>
GSM2545336 CNS_RNA-seq_10C GSM2545336 Mus musculus 8 weeks Female
GSM2545337 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks Female
GSM2545338 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks Female
GSM2545339 CNS_RNA-seq_13C GSM2545339 Mus musculus 8 weeks Female
GSM2545340 CNS_RNA-seq_14C GSM2545340 Mus musculus 8 weeks Male
... ... ... ... ... ...
GSM2545353 CNS_RNA-seq_3C GSM2545353 Mus musculus 8 weeks Female
GSM2545354 CNS_RNA-seq_4C GSM2545354 Mus musculus 8 weeks Male
GSM2545362 CNS_RNA-seq_5C GSM2545362 Mus musculus 8 weeks Female
GSM2545363 CNS_RNA-seq_6C GSM2545363 Mus musculus 8 weeks Male
GSM2545380 CNS_RNA-seq_9C GSM2545380 Mus musculus 8 weeks Female
infection strain time tissue mouse
<character> <character> <character> <character> <integer>
GSM2545336 InfluenzaA C57BL/6 Day8 Cerebellum 14
GSM2545337 NonInfected C57BL/6 Day0 Cerebellum 9
GSM2545338 NonInfected C57BL/6 Day0 Cerebellum 10
GSM2545339 InfluenzaA C57BL/6 Day4 Cerebellum 15
GSM2545340 InfluenzaA C57BL/6 Day4 Cerebellum 18
... ... ... ... ... ...
GSM2545353 NonInfected C57BL/6 Day0 Cerebellum 4
GSM2545354 NonInfected C57BL/6 Day0 Cerebellum 2
GSM2545362 InfluenzaA C57BL/6 Day4 Cerebellum 20
GSM2545363 InfluenzaA C57BL/6 Day4 Cerebellum 12
GSM2545380 InfluenzaA C57BL/6 Day8 Cerebellum 19
R
dim(colData(se))
OUTPUT
[1] 22 10
R
# Access the gene annotations
head(rowData(se))
OUTPUT
DataFrame with 6 rows and 3 columns
ENTREZID product gbkey
<character> <character> <character>
Xkr4 497097 X Kell blood group p.. mRNA
LOC105243853 105243853 uncharacterized LOC1.. ncRNA
LOC105242387 105242387 uncharacterized LOC1.. ncRNA
LOC105242467 105242467 lipoxygenase homolog.. mRNA
Rp1 19888 retinitis pigmentosa.. mRNA
Sox17 20671 SRY (sex determining.. mRNA
R
dim(rowData(se))
OUTPUT
[1] 41786 3
R
# Make better sample IDs that show sex, time and mouse ID:
se$Label <- paste(se$sex, se$time, se$mouse, sep = "_")
se$Label
OUTPUT
[1] "Female_Day8_14" "Female_Day0_9" "Female_Day0_10" "Female_Day4_15"
[5] "Male_Day4_18" "Male_Day8_6" "Female_Day8_5" "Male_Day0_11"
[9] "Female_Day4_22" "Male_Day4_13" "Male_Day8_23" "Male_Day8_24"
[13] "Female_Day0_8" "Male_Day0_7" "Male_Day4_1" "Female_Day8_16"
[17] "Female_Day4_21" "Female_Day0_4" "Male_Day0_2" "Female_Day4_20"
[21] "Male_Day4_12" "Female_Day8_19"
R
colnames(se) <- se$Label
# Our samples are not in order based on sex and time
se$Group <- paste(se$sex, se$time, sep = "_")
se$Group
OUTPUT
[1] "Female_Day8" "Female_Day0" "Female_Day0" "Female_Day4" "Male_Day4"
[6] "Male_Day8" "Female_Day8" "Male_Day0" "Female_Day4" "Male_Day4"
[11] "Male_Day8" "Male_Day8" "Female_Day0" "Male_Day0" "Male_Day4"
[16] "Female_Day8" "Female_Day4" "Female_Day0" "Male_Day0" "Female_Day4"
[21] "Male_Day4" "Female_Day8"
R
# change this to factor data with the levels in order
# that we want, then rearrange the se object:
se$Group <- factor(se$Group, levels = c("Female_Day0","Male_Day0",
"Female_Day4","Male_Day4",
"Female_Day8","Male_Day8"))
se <- se[, order(se$Group)]
colData(se)
OUTPUT
DataFrame with 22 rows and 12 columns
title geo_accession organism age
<character> <character> <character> <character>
Female_Day0_9 CNS_RNA-seq_11C GSM2545337 Mus musculus 8 weeks
Female_Day0_10 CNS_RNA-seq_12C GSM2545338 Mus musculus 8 weeks
Female_Day0_8 CNS_RNA-seq_27C GSM2545348 Mus musculus 8 weeks
Female_Day0_4 CNS_RNA-seq_3C GSM2545353 Mus musculus 8 weeks
Male_Day0_11 CNS_RNA-seq_20C GSM2545343 Mus musculus 8 weeks
... ... ... ... ...
Female_Day8_16 CNS_RNA-seq_2C GSM2545351 Mus musculus 8 weeks
Female_Day8_19 CNS_RNA-seq_9C GSM2545380 Mus musculus 8 weeks
Male_Day8_6 CNS_RNA-seq_17C GSM2545341 Mus musculus 8 weeks
Male_Day8_23 CNS_RNA-seq_25C GSM2545346 Mus musculus 8 weeks
Male_Day8_24 CNS_RNA-seq_26C GSM2545347 Mus musculus 8 weeks
sex infection strain time tissue
<character> <character> <character> <character> <character>
Female_Day0_9 Female NonInfected C57BL/6 Day0 Cerebellum
Female_Day0_10 Female NonInfected C57BL/6 Day0 Cerebellum
Female_Day0_8 Female NonInfected C57BL/6 Day0 Cerebellum
Female_Day0_4 Female NonInfected C57BL/6 Day0 Cerebellum
Male_Day0_11 Male NonInfected C57BL/6 Day0 Cerebellum
... ... ... ... ... ...
Female_Day8_16 Female InfluenzaA C57BL/6 Day8 Cerebellum
Female_Day8_19 Female InfluenzaA C57BL/6 Day8 Cerebellum
Male_Day8_6 Male InfluenzaA C57BL/6 Day8 Cerebellum
Male_Day8_23 Male InfluenzaA C57BL/6 Day8 Cerebellum
Male_Day8_24 Male InfluenzaA C57BL/6 Day8 Cerebellum
mouse Label Group
<integer> <character> <factor>
Female_Day0_9 9 Female_Day0_9 Female_Day0
Female_Day0_10 10 Female_Day0_10 Female_Day0
Female_Day0_8 8 Female_Day0_8 Female_Day0
Female_Day0_4 4 Female_Day0_4 Female_Day0
Male_Day0_11 11 Male_Day0_11 Male_Day0
... ... ... ...
Female_Day8_16 16 Female_Day8_16 Female_Day8
Female_Day8_19 19 Female_Day8_19 Female_Day8
Male_Day8_6 6 Male_Day8_6 Male_Day8
Male_Day8_23 23 Male_Day8_23 Male_Day8
Male_Day8_24 24 Male_Day8_24 Male_Day8
R
# Finally, also factor the Label column to keep in order in plots:
se$Label <- factor(se$Label, levels = se$Label)
Challenge
- How many samples are there for each level of the
Infection
variable? - Create 2 objects named
se_infected
andse_noninfected
containing a subset ofse
with only infected and non-infected samples, respectively. Then, calculate the mean expression levels of the first 500 genes for each object, and use thesummary()
function to explore the distribution of expression levels for infected and non-infected samples based on these genes. - How many samples represent female mice infected with Influenza A on day 8?
R
# 1
table(se$infection)
OUTPUT
InfluenzaA NonInfected
15 7
R
# 2
se_infected <- se[, se$infection == "InfluenzaA"]
se_noninfected <- se[, se$infection == "NonInfected"]
means_infected <- rowMeans(assay(se_infected)[1:500, ])
means_noninfected <- rowMeans(assay(se_noninfected)[1:500, ])
summary(means_infected)
OUTPUT
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.133 2.867 764.068 337.417 18896.600
R
summary(means_noninfected)
OUTPUT
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.143 3.143 771.017 366.571 20010.143
R
# 3
ncol(se[, se$sex == "Female" & se$infection == "InfluenzaA" & se$time == "Day8"])
OUTPUT
[1] 4
Save SummarizedExperiment
This was a bit of code and time to create our
SummarizedExperiment
object. We will need to keep using it
throughout the workshop, so it can be useful to save it as an actual
single file on our computer to read it back in to R’s memory if we have
to shut down RStudio. To save an R-specific file we can use the
saveRDS()
function and later read it back into R using the
readRDS()
function.
R
saveRDS(se, "data/GSE96870_se.rds")
rm(se) # remove the object!
se <- readRDS("data/GSE96870_se.rds")
Data provenance and reproducibility
We have now created an external .rds file that represents our RNA-Seq data in a format that can be read into R and used by various packages for our analyses. But we should still keep a record of the codes that created the .rds file from the 3 files we downloaded from the internet. But what is the provenance of those files - i.e, where did they come from and how were they made? The original counts and gene information were deposited in the GEO public database, accession number GSE96870. But these counts were generated by running alignment/quantification programs on the also-deposited fastq files that hold the sequence base calls and quality scores, which in turn were generated by a specific sequencing machine using some library preparation method on RNA extracted from samples collected in a particular experiment. Whew!
If you conducted the original experiment ideally you would have the
complete record of where and how the data were generated. But you might
use publicly-available data sets so the best you can do is to keep track
of what original files you got from where and what manipulations you
have done to them. Using R codes to keep track of everything is an
excellent way to be able to reproduce the entire analysis from the
original input files. The exact results you get can differ depending on
the R version, add-on package versions and even what operating system
you use, so make sure to keep track of all this information as well by
running sessionInfo()
and recording the output (see example
at end of lesson).
Challenge: How to subset to mRNA genes
Before, we conceptually discussed subsetting to only the mRNA genes.
Now that we have our SummarizedExperiment
object, it
becomes much easier to write the codes to subset se
to a
new object called se_mRNA
that contains only the genes/rows
where the rowData(se)$gbkey
is equal to mRNA. Write the
codes and then check you correctly got the 21,198 mRNA genes:
R
se_mRNA <- se[rowData(se)$gbkey == "mRNA" , ]
dim(se_mRNA)
OUTPUT
[1] 21198 22
Gene Annotations
Depending on who generates your count data, you might not have a nice file of additional gene annotations. There may only be the count row names, which could be gene symbols or ENTREZIDs or another database’s ID. Characteristics of gene annotations differ based on their annotation strategies and information sources. For example, RefSeq human gene models (i.e., Entrez from NCBI) are well supported and broadly used in various studies. The UCSC Known Genes dataset is based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank, and serves as a foundation for the UCSC Genome Browser. Ensembl genes contain both automated genome annotation and manual curation.
You can find more information in Bioconductor Annotation Workshop material.
Bioconductor has many packages and functions that can help you to get additional annotation information for your genes. The available resources are covered in more detail in Episode 7 Gene set enrichment analysis.
Here, we will introduce one of the gene ID mapping functions,
mapIds
:
mapIds(annopkg, keys, column, keytype, ..., multiVals)
Where
-
annopkg is the annotation package
-
keys are the IDs that we know
-
column is the value we want
- keytype is the type of key used
R
mapIds(org.Mm.eg.db, keys = "497097", column = "SYMBOL", keytype = "ENTREZID")
OUTPUT
'select()' returned 1:1 mapping between keys and columns
OUTPUT
497097
"Xkr4"
Different from the select()
function,
mapIds()
function handles 1:many mapping between keys and
columns through an additional argument, multiVals
. The
below example demonstrate this functionality using the
hgu95av2.db
package, an Affymetrix Human Genome U95 Set
annotation data.
R
keys <- head(keys(hgu95av2.db, "ENTREZID"))
last <- function(x){x[[length(x)]]}
mapIds(hgu95av2.db, keys = keys, column = "ALIAS", keytype = "ENTREZID")
OUTPUT
'select()' returned 1:many mapping between keys and columns
OUTPUT
10 100 1000 10000 100008586 10001
"AAC2" "ADA1" "ACOGS" "MPPH" "AL4" "ARC33"
R
# When there is 1:many mapping, the default behavior was
# to output the first match. This can be changed to a function,
# which we defined above to give us the last match:
mapIds(hgu95av2.db, keys = keys, column = "ALIAS", keytype = "ENTREZID", multiVals = last)
OUTPUT
'select()' returned 1:many mapping between keys and columns
OUTPUT
10 100 1000 10000 100008586 10001
"NAT2" "ADA" "CDH2" "AKT3" "GAGE12F" "MED6"
R
# Or we can get back all the many mappings:
mapIds(hgu95av2.db, keys = keys, column = "ALIAS", keytype = "ENTREZID", multiVals = "list")
OUTPUT
'select()' returned 1:many mapping between keys and columns
OUTPUT
$`10`
[1] "AAC2" "NAT-2" "PNAT" "NAT2"
$`100`
[1] "ADA1" "ADA"
$`1000`
[1] "ACOGS" "ADHD8" "ARVD14" "CD325" "CDHN" "CDw325" "NCAD" "CDH2"
$`10000`
[1] "MPPH" "MPPH2" "PKB-GAMMA" "PKBG" "PRKBG"
[6] "RAC-PK-gamma" "RAC-gamma" "STK-2" "AKT3"
$`100008586`
[1] "AL4" "CT4.7" "GAGE-7" "GAGE-7B" "GAGE-8" "GAGE7" "GAGE7B"
[8] "GAGE12F"
$`10001`
[1] "ARC33" "NY-REN-28" "MED6"
Session info
R
sessionInfo()
OUTPUT
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] hgu95av2.db_3.13.0 org.Hs.eg.db_3.19.1
[3] org.Mm.eg.db_3.19.1 AnnotationDbi_1.66.0
[5] SummarizedExperiment_1.34.0 Biobase_2.64.0
[7] MatrixGenerics_1.16.0 matrixStats_1.4.1
[9] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
[11] IRanges_2.38.1 S4Vectors_0.42.1
[13] BiocGenerics_0.50.0 knitr_1.49
loaded via a namespace (and not attached):
[1] Matrix_1.7-1 bit_4.5.0 jsonlite_1.8.9
[4] compiler_4.4.2 BiocManager_1.30.25 renv_1.0.11
[7] crayon_1.5.3 blob_1.2.4 Biostrings_2.72.1
[10] png_0.1-8 fastmap_1.2.0 yaml_2.3.10
[13] lattice_0.22-6 R6_2.5.1 XVector_0.44.0
[16] S4Arrays_1.4.1 DelayedArray_0.30.1 GenomeInfoDbData_1.2.12
[19] DBI_1.2.3 rlang_1.1.4 KEGGREST_1.44.1
[22] cachem_1.1.0 xfun_0.49 bit64_4.5.2
[25] memoise_2.0.1 SparseArray_1.4.8 RSQLite_2.3.8
[28] cli_3.6.3 zlibbioc_1.50.0 grid_4.4.2
[31] vctrs_0.6.5 evaluate_1.0.1 abind_1.4-8
[34] httr_1.4.7 pkgconfig_2.0.3 tools_4.4.2
[37] UCSC.utils_1.0.0
Key Points
- Depending on the gene expression quantification tool used, there are
different ways (often distributed in Bioconductor packages) to read the
output into a
SummarizedExperiment
orDGEList
object for further processing in R. - Stable gene identifiers such as Ensembl or Entrez IDs should preferably be used as the main identifiers throughout an RNA-seq analysis, with gene symbols added for easier interpretation.