Working with annotations

Last updated on 2024-09-03 | Edit this page


Overview

Questions

  • What Bioconductor packages provides methods to efficiently fetch and use gene annotations?
  • How can I use gene annotation packages to convert between different gene identifiers?

Objectives

  • Explain how gene annotations are managed in the Bioconductor project.
  • Identify Bioconductor packages and methods available to fetch and use gene annotations.

Install packages


Before we can proceed into the following sections, we install some Bioconductor packages that we will need. First, we check that the BiocManager package is installed before trying to use it; otherwise we install it. Then we use the BiocManager::install() function to install the necessary packages.

R

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("biomaRt", "org.Hs.eg.db"))

Overview


Packages dedicated to query gene annotations exist in the ‘Software’ and ‘Annotation’ categories of the Bioconductor biocViews, according to their nature.

In the ‘Software’ section, we find packages that do not actually contain gene annotations, but rather dynamically query them from online resources (e.g.,Ensembl BioMart). One such Bioconductor package is biomaRt.

Instead, in the ‘Annotation’ section, we find packages that do contain annotations. Examples include org.Hs.eg.db, EnsDb.Hsapiens.v86, and TxDb.Hsapiens.UCSC.hg38.knownGene.

In this episode, we will demonstrate the two approaches:

  • Querying annotations from the Ensembl Biomart API using the biomaRt package.
  • Querying annotations from the org.Hs.eg.db annotation package.

Online resources or Bioconductor annotation packages?


Accessing the latest information

Bioconductor’s 6-month release cycle implies that packages available from the latest stable release branch will not be updated for six months (only bugfixes are allowed, not functional updates). As a result, annotation packages may contain information that is out-of-date by up to six months.

Instead, independent online resources often have different policies driving the release of updated information. Some databases are frequently updated, while others may not have been updated in years.

Accessing up-to-date information must also be balanced with reproducibility. Having downloaded the ‘latest’ information at one point is time is no good if one hasn’t recorded at least when that information was downloaded.

Storage requirements

By nature, Bioconductor annotation packages are larger than software packages. Just like any other R package, annotation packages must be installed on the user’s computer before they can be used. This can rapidly use add up to using an amount of disk space that is not negligible.

Conversely, online resources are generally accessed programmatically, and generally only require users to record code to replicate analyses reproducibly.

Internet connectivity

When using online resources, it is often a good idea to write annotations ownloaded from online resources to a local file, and refer to that local file during analyses.

If online resources were to become unavailable for any reason (e.g., downtime, loss of internet connection), analyses that use local files can carry on while those that rely on those online resources cannot.

In contrast, Bioconductor annotation packages only require internet connectivity at the time of installation. Once installed, they do not require internet connectivity, as they rely on information stored locally.

Reproducibility

Bioconductor annotation packages are naturally versioned, meaning that users can confidently report the version of the package used in their analysis. Just like software packages, users control if and when annotation packages should be updated on their computer.

Online resources have different policies to facilitate reproducible analyses. Some online resources keep archived versions of their annotations, allowing users to consistently access the same information over time. When this is not the case, it may be necessary to download a copy of the annotation at one point in time, and preciously keep that copy throughout the lifetime of the project to ensure the use of a consistent set of annotations.

Consistency

As we will see in the practical examples of this episode, Bioconductor annotation packages generally re-use a consistent set of data structures. This allows users familiar with one annotation package to rapidly get started with others.

Independent online resources often organise their data in different ways, which requires users to write custom code to access, retrieve, and process their respective data.

Querying annotations from Ensembl BioMart


The Ensembl BioMart

Ensembl BioMart is a robust data mining tool designed to facilitate access to the vast array of biological data available through the Ensembl project.

The BioMart web interface enables researchers to efficiently query and retrieve data on genes, proteins, and other genomic features across multiple species. It allows users to filter, sort, and export data based on various attributes such as gene IDs, chromosomal locations, and functional annotations.

The Bioconductor biomaRt package

biomaRt is a Bioconductor software package that enables retrieval of large amounts of data from Ensembl BioMart tables directly from an R session where those annotations can be used.

Let us first load the package:

R

library(biomaRt)

Listing available marts

Ensembl BioMart organises its diverse biological information into four databases also known as ‘marts’ or ‘biomarts’. Each mart focuses on a different type of data.

Users must select the mart corresponds to the type of data they are interested in before they can query any information from it.

The function listMarts() can be used to display the names of those marts. This is convenient as users do not need to memorise the name of the marts, and the function will also return an updated list of names if any mart is renamed, added, or removed.

R

listMarts()

OUTPUT

               biomart                version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 112
2   ENSEMBL_MART_MOUSE      Mouse strains 112
3     ENSEMBL_MART_SNP  Ensembl Variation 112
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 112

In this demonstration, we will use the biomart called ENSEMBL_MART_ENSEMBL, which contains the Ensembl gene set.

Notably, the version columns also indicates the version of the biomart. The Ensembl BioMart is updated regularly (multiple times per year). By default, biomaRt functions access the latest version of each biomart. This is not ideal for reproducibility.

Thankfully, Ensembl BioMart archives past versions of its mars in a way that is accessible both programmatically, and on its website.

The function listEnsemblArchives() can be used to display all the versions of Ensembl Biomart accessible.

R

listEnsemblArchives()

OUTPUT

             name     date                                 url version
1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
2     Ensembl 112 May 2024 https://may2024.archive.ensembl.org     112
3     Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org     111
4     Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org     110
5     Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org     109
6     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
7     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
8     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
9     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
10    Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
11    Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org     103
12    Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org     102
13    Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org     101
14    Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org     100
15     Ensembl 99 Jan 2020 https://jan2020.archive.ensembl.org      99
16     Ensembl 98 Sep 2019 https://sep2019.archive.ensembl.org      98
17     Ensembl 97 Jul 2019 https://jul2019.archive.ensembl.org      97
18     Ensembl 80 May 2015 https://may2015.archive.ensembl.org      80
19     Ensembl 77 Oct 2014 https://oct2014.archive.ensembl.org      77
20     Ensembl 75 Feb 2014 https://feb2014.archive.ensembl.org      75
21     Ensembl 54 May 2009 https://may2009.archive.ensembl.org      54
   current_release
1
2                *
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21                

In the output above, the key piece of information is the url column, which provides the URL that biomaRt functions will need to access data from the corresponding snapshot of the Ensembl BioMart.

At the time of writing, the current release is Ensembl 112, so let us use the corresponding url https://may2024.archive.ensembl.org to ensure reproducible results no matter when this lesson is delivered.

Connecting to a biomart

The two pieces of information collected above – the name of a biomart and the URL of a snapshot – is all that is needed to connect to a BioMart database reproducibly.

The function useMart() can then be used to create a connection. The connection is traditionally stored in an object called mart, to be reused in subsequent steps for querying information from the online mart.

R

mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", host = "https://may2024.archive.ensembl.org")

Listing available data sets

Each biomart contains a number of data sets.

The function listDatasets() can be used to display information about those data sets. This is convenient as users do not need to memorise the name of the data sets, and the information returned by the function includes a short description of each data set, as well as its version.

In the example below, we restrict the output table to the first few rows, as the full table comprises 214 rows.

R

head(listDatasets(mart))

OUTPUT

                       dataset                           description
1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.3)
3   acarolinensis_gene_ensembl       Green anole genes (AnoCar2.0v2)
4    acchrysaetos_gene_ensembl       Golden eagle genes (bAquChr1.2)
5    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)
6    amelanoleuca_gene_ensembl       Giant panda genes (ASM200744v2)
      version
1 ASM259213v1
2  fAstCal1.3
3 AnoCar2.0v2
4  bAquChr1.2
5    Midas_v5
6 ASM200744v2

In the output above, the key piece of information is the dataset column, which provides the identifier that biomaRt functions will need to access data from the corresponding biomart table.

In this demonstration, we will use the Ensembl gene set for Homo sapiens, which is not visible in the output above.

Given the number of data sets available, let us programmatically filter the table of information using pattern matching rather than searching the table manually:

R

subset(listDatasets(mart), grepl("sapiens", dataset))

OUTPUT

                 dataset              description    version
80 hsapiens_gene_ensembl Human genes (GRCh38.p14) GRCh38.p14

From the output above, we identify the desired data set identifier as hsapiens_gene_ensembl.

Connecting to a data set

Having chosen the data set that we want to use, we need to call the function useMart() again, this time specifying the selected data set.

Typically, one would copy paste the previous call to useMart() and edit as needed. It is also common practice to replace the mart object with the new connection.

R

mart <- useMart(
  biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "hsapiens_gene_ensembl",
  host = "https://may2024.archive.ensembl.org")

Listing information available in a data set

BioMart tables contain many pieces of information also known as ‘attributes’. So many, in fact, that they have been grouped into categories also known as ‘pages’.

The function listAttributes() can be used to display information about those attributes. This is convenient as users do not need to memorise the name of the attributes, and the information returned by the function includes a short description of each attribute, as well as its page categorisation.

In the example below, we restrict the output table to the first few rows, as the full table comprises 3157 rows.

R

head(listAttributes(mart))

OUTPUT

                           name                  description         page
1               ensembl_gene_id               Gene stable ID feature_page
2       ensembl_gene_id_version       Gene stable ID version feature_page
3         ensembl_transcript_id         Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5            ensembl_peptide_id            Protein stable ID feature_page
6    ensembl_peptide_id_version    Protein stable ID version feature_page

In the output above, the key piece of information is the name column, which provides the identifier that biomaRt functions will need to query that information from the corresponding biomart data set.

The choice of attributes to query now depends on what it is we wish to achieve.

For instance, let us imagine that we have a set of gene identifiers, for which we wish to query:

  • The gene symbol
  • The name of the chromosome where the gene is located
  • The start and end position of the gene on that chromosome
  • The strand on which the gene is encoded

Users would often manually explore the full table of attributes to identify the ones they wish to include in their query. It is also possible to programmatically filter the table of attribute, based on experience and intuition, to narrow down the search:

R

subset(listAttributes(mart), grepl("position", name) & grepl("feature", page))

OUTPUT

             name     description         page
10 start_position Gene start (bp) feature_page
11   end_position   Gene end (bp) feature_page

Querying information from a BioMart table

We have now all the information that we need to perform the actual query:

  • A connection to a BioMart data set
  • The list of attributes available in that data set

The function getBM() is the main biomaRt query function. Given a set of filters and corresponding values, it retrieves the attributes requested by the user from the BioMart data set it is connected to.

In the example below, we manually create a vector of arbitrary gene identifiers for our query. In practice, the query will often originate from an earlier analysis (e.g., differential gene expression).

The example below also queries attributes that we have not introduced yet. In the previous section, we described how one may search the table of attributes returned by listAttributes() to identify attributes to include in their query.

R

query_gene_ids <- c(
  "ENSG00000133101",
  "ENSG00000145386",
  "ENSG00000134057",
  "ENSG00000157456",
  "ENSG00000147082"
)
getBM(
  attributes = c(
    "ensembl_gene_id",
    "hgnc_symbol",
    "chromosome_name",
    "start_position",
    "end_position",
    "strand"
  ),
  filters = "ensembl_gene_id",
  values = query_gene_ids,
  mart = mart
)

OUTPUT

  ensembl_gene_id hgnc_symbol chromosome_name start_position end_position
1 ENSG00000133101       CCNA1              13       36431520     36442870
2 ENSG00000134057       CCNB1               5       69167135     69178245
3 ENSG00000145386       CCNA2               4      121816444    121823883
4 ENSG00000147082       CCNB3               X       50202713     50351914
5 ENSG00000157456       CCNB2              15       59105126     59125045
  strand
1      1
2      1
3     -1
4      1
5      1

Note that we also included the filtering attribute ensembl_gene_id to the attributes retrieved from the data set. This is key to reliably match the newly retrieved attributes to those used in the query.

Querying annotations from annotation packages


Families of annotation packages

To balance the need for comprehensive information while maintaining reasonable package sizes, Bioconductor annotation packages are organised by release, data type, and species.

The major families of Bioconductor annotation packages are:

  • OrgDb packages provide mapping between various types of gene identifiers and pathway information.
  • EnsDb packages provide individual releases of Ensembl annotations.
  • TxDb packages provide individual releases of UCSC annotations.

All those families of annotations derive from the AnnotationDb base class defined in the AnnotationDbi package. As a result, any of those annotation packages can be accessed using the same set of R functions, as demonstrated in the following sections.

Using an OrgDb package

In this example, we will use the org.Hs.eg.db package to demonstrate the use of gene annotations for the human species.

Let us first load the package:

R

library(org.Hs.eg.db)

Each OrgDb package contains an object named identically to the package itself. That object contains the annotations that the package is meant to disseminate.

Aside from querying information, the whole object can be called to print information about the annotations it contains, including the date at which the snapshots of annotations that it contains were made.

R

org.Hs.eg.db

OUTPUT

OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2024-Mar12
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2024-01-17
| GOEGSOURCEDATE: 2024-Mar12
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL:
| GPSOURCEDATE: 2024-Feb29
| ENSOURCEDATE: 2023-Nov22
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Thu Apr 18 21:39:39 2024

OUTPUT


Please see: help('select') for usage information

That same object is the one that needs to be supplied to AnnotationDbi functions for running queries and retrieving annotations.

Listing information available in an annotation package

The function columns() can be used to display the annotations available in the object.

Here, the word ‘column’ refers to columns of tables used to store information in database, the very same concept as ‘attributes’ in BioMart. In other words, columns represent all the types of annotations that may be retrieved from the object.

This is convenient as users do not need to memorise the names of the columns of annotations available in the package.

R

columns(org.Hs.eg.db)

OUTPUT

 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"
[26] "UNIPROT"     

Listing keys and key types

In database terminology, keys are the values by which information may be queried from a database table.

Information being organised in columns, key types are the names of the columns in which the key values are stored.

Given the variable number of columns in database tables, some tables may allow information to be queried by more than one key. As a result, it is crucial to specify both the keys and the type of key as part of the query.

The function keytypes() can be used to display the names of the columns that may be used to query information from the object.

R

keytypes(org.Hs.eg.db)

OUTPUT

 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"
[26] "UNIPROT"     

The function keys() can be used to display all the possible values for a given key type.

It is generally better practice to specify the type of key being queried (to avoid ambiguity), although database tables typically have a ‘primary key’ used if users do not specify a type themselves.

In the example below, we restrict the list of gene symbol keys to the first few values, as the full set comprises 193279 values.

R

head(keys(org.Hs.eg.db, keytype = "SYMBOL"))

OUTPUT

[1] "A1BG"  "A2M"   "A2MP1" "NAT1"  "NAT2"  "NATP" 

Querying information from an annotation package

The function select() is the main AnnotationDbi query function. Given an AnnotationDb object, key values, and columns (and optionally the type of key supplied if not the primary key), it retrieves the columns requested by the user from the annotation object.

In the example below, we re-use the vector of arbitrary gene identifiers used in the BioMart example a few sections above.

As you can see from the output of the columns() function, the annotation object does not contain some of the attributes that we queried in the Biomart example. In this case, let us query:

  • the gene symbol
  • the gene name
  • the gene type

R

select(
  x = org.Hs.eg.db,
  keys = query_gene_ids,
  columns = c(
    "SYMBOL",
    "GENENAME",
    "GENETYPE"
  ),
  keytype = "ENSEMBL"
)

OUTPUT

'select()' returned 1:1 mapping between keys and columns

OUTPUT

          ENSEMBL SYMBOL  GENENAME       GENETYPE
1 ENSG00000133101  CCNA1 cyclin A1 protein-coding
2 ENSG00000145386  CCNA2 cyclin A2 protein-coding
3 ENSG00000134057  CCNB1 cyclin B1 protein-coding
4 ENSG00000157456  CCNB2 cyclin B2 protein-coding
5 ENSG00000147082  CCNB3 cyclin B3 protein-coding

One small but notable difference with biomaRt is that the output of select() automatically contains the column that correspond to the key type used in the query. In other words, there is no need to specify the key type(s) again in the column(s) to retrieve.

Vectorized 1:1 mapping

It is sometimes possible for annotations to display 1-to-many relationships. For instance, individual genes typically have a unique Ensembl gene identifier, while they may be known under multiple gene name aliases.

The select() function demonstrated in the previous section automatically returns all values in the columns requested, for the key specified. This is possible thanks to the tabular format in which annotations are returned; rows are added, repeating values as necessary to display them on the same row as every other values they are associated with.

In some cases, that behaviour is not desirable. Instead, users may wish to retrieve a single value for each key that they input. One common scenario arises during differential gene expression (DGE), where gene identifiers are used to uniquely identify genes throughout the analysis, while gene symbols are added to the final table of DGE statistics, to provide more readable human-friendly gene identifiers. However, it is not desirable to duplicate rows of DGE statistics, and thus only a single gene symbol is required to annotate each gene.

The function mapIds() can be used for this purpose. A major difference between the functions mapIds() and select() are their arguments column (singular) and columns (plural), respectively. The function mapIds() accepts a single column name and returns a named character vector where names are the input query values, and values are the corresponding values in the requested column.

To deal with 1-to-many relationships, the function mapIds() has an argument multiVals which can be used to specify how the function should handle multiple values. The default is to take the first value and ignore any other value.

In the example below, we query the gene symbol for a set of Ensembl gene identifiers.

R

mapIds(
  x = org.Hs.eg.db,
  keys = query_gene_ids,
  column = "SYMBOL",
  keytype = "ENSEMBL"
)

OUTPUT

'select()' returned 1:1 mapping between keys and columns

OUTPUT

ENSG00000133101 ENSG00000145386 ENSG00000134057 ENSG00000157456 ENSG00000147082
        "CCNA1"         "CCNA2"         "CCNB1"         "CCNB2"         "CCNB3" 

Challenge

Load the packages EnsDb.Hsapiens.v86 and TxDb.Hsapiens.UCSC.hg38.knownGene. Then, display the columns of annotations available in those packages.

R

library(EnsDb.Hsapiens.v86)
columns(EnsDb.Hsapiens.v86)

OUTPUT

 [1] "ENTREZID"            "EXONID"              "EXONIDX"
 [4] "EXONSEQEND"          "EXONSEQSTART"        "GENEBIOTYPE"
 [7] "GENEID"              "GENENAME"            "GENESEQEND"
[10] "GENESEQSTART"        "INTERPROACCESSION"   "ISCIRCULAR"
[13] "PROTDOMEND"          "PROTDOMSTART"        "PROTEINDOMAINID"
[16] "PROTEINDOMAINSOURCE" "PROTEINID"           "PROTEINSEQUENCE"
[19] "SEQCOORDSYSTEM"      "SEQLENGTH"           "SEQNAME"
[22] "SEQSTRAND"           "SYMBOL"              "TXBIOTYPE"
[25] "TXCDSSEQEND"         "TXCDSSEQSTART"       "TXID"
[28] "TXNAME"              "TXSEQEND"            "TXSEQSTART"
[31] "UNIPROTDB"           "UNIPROTID"           "UNIPROTMAPPINGTYPE" 

R

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
columns(TxDb.Hsapiens.UCSC.hg38.knownGene)

OUTPUT

 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"
 [6] "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"
[11] "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"
[16] "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"
[21] "TXSTRAND"   "TXTYPE"    

Key Points

  • Bioconductor provides a wide range annotation packages.
  • Some Bioconductor software packages can be used to programmatically access online resources.
  • Users should carefully choose their source of annotations based on their needs and expectations.