This lesson is still being designed and assembled (Pre-Alpha version)

The Bioconductor project

Introduction and setup

Overview

Teaching: XX min
Exercises: XX min
Questions
  • Am I using the correct version of R for this lesson?

  • Why does my version of R matter?

  • How do I obtain the files that are used in this lesson?

Objectives
  • Ensure that participants are using the correct version of R to reproduce exactly the contents of this lesson.

  • Download the example files for this lesson.

Version of R

This lesson was developed and tested with R version 4.2.0 (2022-04-22).

Take a moment to launch RStudio and verify that you are using R version 4.1.x, with x being any patch version, e.g. 4.1.2.

R.version.string
[1] "R version 4.2.0 (2022-04-22)"

This is important because Bioconductor uses the version of R running in the current session to determine the version of Bioconductor packages that can be installed in the R library associated with the current R session. Using a different version of R while following this lesson may lead to unexpected results.

Download files

Several episodes in this lesson rely on example files that participants need to download.

Run the code below programmatically create a folder called data in the current working directory, and download the lesson files in that folder.

dir.create("data", showWarnings = FALSE)
download.file(
    url = "https://raw.githubusercontent.com/Bioconductor/bioconductor-teaching/master/data/TrimmomaticAdapters/TruSeq3-PE-2.fa", 
    destfile = "data/TruSeq3-PE-2.fa"
)
download.file(
    url = "https://raw.githubusercontent.com/Bioconductor/bioconductor-teaching/master/data/ActbGtf/actb.gtf", 
    destfile = "data/actb.gtf"
)
download.file(
    url = "https://raw.githubusercontent.com/Bioconductor/bioconductor-teaching/master/data/ActbOrf/actb_orfs.fasta", 
    destfile = "data/actb.gtf"
)

Note

Ideally, participants might want to create a new RStudio project and download the lesson files in a sub-directory of that project.

Using an RStudio project sets the working directory to the root directory of that project. As a consequence, code is executed relative to that root directory, often avoiding the need for using absolute file paths to import/export data from/to files.

Key Points

  • Participants will only be able to install the version of Bioconductor packages described in this lesson and reproduce their exact outputs if they use the correct version of R.

  • The files used in this lesson should be downloaded in a local path that is easily accessible from an R session.


Introduction to Bioconductor

Overview

Teaching: XX min
Exercises: XX min
Questions
  • What does the Bioconductor project comprise?

  • How does the Bioconductor project relate to the CRAN repository?

  • How can I learn to use Bioconductor packages effectively?

  • How do I join and communicate with the Bioconductor community?

Objectives
  • Describe the Bioconductor project globally.

  • Gain a global view of the Bioconductor project in the R ecosystem.

  • Identify sources of information to watch for future updates about the Bioconductor project.

What is Bioconductor?

A brief history of Bioconductor

The Bioconductor project was started in the Fall of 2001, as an initiative for the collaborative creation of extensible software for computational biology and bioinformatics (Gentleman, et al., 2004). From the very start, the stated mission of the project was to develop tools for the statistical analysis and comprehension of large datasets and technological artifacts in rigorously and robustly designed experiments. Beyond statistical analyses, the interpretation of statistical results is supported by packages providing biological context, visualization, and reproducibility.

Over the years, software packages contributed to the Bioconductor project have reflected the evolution and emergence of several high-throughput technologies, from microarrays to single-cell genomics, through many variations of sequencing experiments (e.g., RNA-seq, ChIP-seq, DNA-seq), analyses (e.g., sequence variation, copy number variation, single nucleotide polymorphisms), and data modalities (e.g., flow cytometry, proteomics, microscropy and image analysis).

Crucially, the project has not only released software packages implementing novel statistical tests and methodologies, but also produced a diverse range of packages types granting access to databases of molecular annotations and experimental datasets.

The Bioconductor project culminates at an annual conference in North America in the summer, while regional conferences offer great opportunities for networking in Europe, Asia, and North America. The project is commited to promote a diverse and inclusive community, including a Code of Conduct enforced by a Code of Conduct committee.

Timeline of major Bioconductor milestones alongside technological advancements.

Timeline of major Bioconductor milestones alongside technological advancements. Above the timeline, the figure marks the first occurence of major events. Within the timeline, the name of packages providing core infrastructure indicate the release date. Below the timeline, major technological advancements contextualise the evolution of the Bioconductor project over time.

A scientific project

The original publication describes the aims and methods of the project at its inception Gentleman, et al. (2004).

Huber, et al. (2015) illustrates the progression of the project, including descriptions of core infrastructure and case studies, from the perspective of both users and developers.

Amezquita, et al. (2020) reviewed further developments of the project in the wake of single-cell genomics technologies.

Many more publications and book chapters cite the Bioconductor project, with recent example listed on the Bioconductor website.

A package repository

Overview and relationship to CRAN

Undoubtedly, software packages are the best-known aspect of the Bioconductor project,. Since its inception in 2001, the repository has grown over time to host thousands of packages.

The Bioconductor project has extended the preexisting CRAN repository – much larger and general-purpose in scope – to comprise R packages primarily catering for bioinformatics and computational biology analyses.

Going further

The Discussion article of this lesson includes a section discussing the relationship of Bioconductor and CRAN in further details.

The Bioconductor release cycle

The Bioconductor project also extended the packaging infrastructure of the CRAN repository to better support the deployment and management of packages at the user level (Gentleman, et al., 2004). In particular, the Bioconductor projects features a 6-month release cycle (typically around April and October), which sees a snapshot of the current version of all packages in the Bioconductor repository earmarked for a specific version of R. R itself is released on an annual basis (typically around April), meaning that for each release of R, two compatible releases of Bioconductor packages are available.

As such, Bioconductor package developers are required to always use the version of R that will be associated with the next release of the Bioconductor project. This means using the development version of R between October and April, and the release version of R between April and October.

Crucially, the strict Bioconductor release cycle prevents users from installing temporally distant versions of packages that were very likely never tested together. This practice reflects the development cycle of packages of both CRAN and Bioconductor, where contemporaneous packages are regularly tested by automated systems to ensure that the latest software updates in package dependencies do not break downstream packages, or prompts those package maintainers to update their own software as a consequence.

Prior to each Bioconductor release, packages that do not pass the requires suites of automated tests are deprecated and subsequently removed from the repository. This ensures that each Bioconductor release provides a suite of packages that are mutually compatible, traceable, and guaranteed to function for the associated version of R.

Timeline of release dates for selected Bioconductor and R versions.

Timeline of release dates for selected Bioconductor and R versions. The upper section of the timeline indicates versions and approximate release dates for the R project. The lower section of the timeline indicates versions and release dates for the Bioconductor project. Source: Bioconductor.

Package types

Packages are broadly divided in four major categories:

Software packages themselves can be subdivided into packages that provide infrastructure (i.e., classes) to store and access data, and packages that provide methodological tools to process data stored in those data structures. This separation of structure and analysis is at the core of the Bioconductor project, encouraging developers of new methodological software packages to thoughtfully re-use existing data containers where possible, and reducing the cognitive burden imposed on users who can more easily experiment with alternative workflows without the need to learn and convert between different data structures.

Annotation data packages provide self-contained databases of diverse genomic annotations (e.g., gene identifiers, biological pathways). Different collections of annotation packages can be found in the Bioconductor project. They are identifiable by their respective naming pattern, and the information that they contain. For instance, the so-called OrgDb packages (e.g., the org.Hs.eg.db package) provide information mapping different types of gene identifiers and pathway databases; the so-called EnsDb (e.g., EnsDb.Hsapiens.v86) packages encapsulate individual versions of the Ensembl annotations in Bioconductor packages; and the so-called TxDb packages (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene) encapsulate individual versions UCSC gene annotation tables.

Experiment data packages provide self-contained datasets that are often used by software package developers to demonstrate the use of their package on well-known standard datasets in their package vignettes.

Finally, workflow packages exclusively provide collections of vignettes that demonstrate the combined usage of several other packages as a coherent workflow, but do not provide any new source code or functionality themselves.

Challenge: The Bioconductor website

The Bioconductor website is accessible at https://bioconductor.org/.

Browse the website to find information answering the following questions:

  1. How many packages does the current release of the Bioconductor project include?
  2. How many packages of each type does this number include?

Solution

The following solution includes numbers that were valid at the time of writing (Bioconductor release 3.13); numbers will inevitably be different for future releases of the Bioconductor project.

  1. On the page https://bioconductor.org/, in the section “Install”, we can read:

    Discover 2042 software packages available in Bioconductor release 3.13.

  2. On the page https://bioconductor.org/, in the section “News”, click on the link that reads “Bioconductor Bioc X.Y Released” (X.Y being the version of the current Bioconductor release when you go through this exercise yourself). On the linked page, we can read:

    We are pleased to announce Bioconductor 3.13, consisting of 2042 software packages, 406 experiment data packages, 965 annotation packages, and 29 workflows.

    There are 133 new software packages, 22 new data experiment packages, 7 new annotation packages, 1 new workflow, no new books, and many updates and improvements to existing packages; Bioconductor 3.13 is compatible with R 4.1.0, and is supported on Linux, 32- and 64-bit Windows, and macOS 10.14.6 Mojave or higher. This release will include an updated Bioconductor Docker containers.

Package classification using biocViews

The Bioconductor project uses biocViews, a set of terms from a controlled vocabulary, to classify Bioconductor packages and facilitate their discovery by thematic search on the Bioconductor website.

Each Bioconductor package is tagged with a small set of terms chosen from the available controlled vocabulary, to describe the type and functionality of the package. Terms are initially selected by the package authors, and subsequently refined during package review or updates to the controlled vocabulary.

Challenge

Visit the listing of all packages on the Bioconductor biocViews web page. Use the “Autocomplete biocViews search” box in the upper left to filter packages by category and explore the graph of software packages by expanding and contracting individual terms.

  1. What biocView terms can be used to identify packages that have been tagged for RNA sequencing analysis? ChIP-seq? Epigenetics? Variant annotation? Proteomics? Single-cell genomics?
  2. In the RNASeq category, two very popular packages are DESeq2 and edgeR. Which one is more popular in terms of download statistics (i.e., lower rank)?

Solution

  1. RNAseq, ChIPSeq, Epigenetics, VariantAnnotation, Proteomics, SingleCell.
  2. For Bioconductor release 3.14, DESeq2 and edgeR are listed at ranks 23 and 28 respectively. In other words, the two packages are among the most frequently downloaded packages in the Bioconductor project, in this instance with a small advantage in favour of edgeR.

Going further

The Bioconductor package biocViews is used to support and manage the infrastructure of the controlled vocabulary. It can also be used to programmatically inspect and subset the list of terms available using their relationship as a graph.

Furthermore, the BiocPkgTools package can be used to browse packages under different biocViews ( Su, et al., 2019).

Packages interoperability

At the core of the Bioconductor philosophy is the notion of interoperability. Namely, the capacity of packages to operate on the same data structures. Importantly, interoperability benefits both users and developers.

Users can more easily write arbitrarily complex workflows that combine multiple packages. With packages operating on the same data structure, users can maximize their attention to the practical steps of their workflow, and minimize time spent in often complex and error-prone conversions between different data structures specific to each package. Comparative benchmarks are also easier to implement and can be evaluated more fairly when competing software packages offering similar functionality operate on input and outputs stored in the same data structures.

Similarly, developers of new packages can focus on the implementation of novel functionality, borrowing existing data structures that offer robust and trusted infrastructure for storage, verification, and indexing of information.

Ultimately, the figure below illustrates how many different Bioconductor packages - as well as base R packages - can be combined to carry out a diverse range of analyses, from importing sequencing data into an R session, to the annotation, integration and visualization of data and results.

Sequencing Ecosystem.

Sequencing Ecosystem Major data processing steps (blue) and relevant software packages (pink) are listed in the context of archetypal workflows for various types of genomics analyses. The sequential relation of workflow steps and software package illustrates the importance of interoperability between software package in order to assemble complete end-to-end workflows.

Conferences, courses and workshops

The Bioconductor community regularly organizes a number of events throughout the year and across the world. For example:

Course materials are regularly uploaded on the Bioconductor website following each of those events. In particular, online books are being developed and maintained by community members.

The Bioconductor YouTube channel is used to publish video recordings of conference presentations including talks and workshops, as well as editions of the regular Bioconductor developers forum (link needed).

Contribute!

It could be great to illustrate a typical cycle of conferences over a year, e.g.

  • BioC conference in North America around late July
  • EuroBioC conference in Europe around December
  • BioCAsia conference in Asia around November

Online communication channels

Support site

The Bioconductor support site provides a platform where users and developers can communicate freely (following the Bioconductor Code of Conduct) to discuss issues on a range of subjects, ranging from packages to conceptual questions about best practices.

Slack workspace

The Bioconductor Slack workspace is an open space that all community members are welcome to join (for free) and use for rapid interactions. Currently, the “Pro” pricing plan kindly supported by core funding provides:

A wide range of channels have been created to discuss a range of subjects, and community members can freely join the discussion on those channels of create new ones to discuss new subjects.

Important announcements are posted on the #general channel.

Note

Users are encouraged to use the Bioconductor support site to raise issues that are relevant to the wider community. The Slack workspace is often most useful for live discussions, and widely subscribed channels (e.g. #general) should be used with moderation.

Developer Mailing List

The bioc-devel@r-project.org mailing list is used for communication between package developers, and announcements from the Biocondutor core team.

A scientific and technical community

Note

At least 1 member of TAB/CAB sits on both to act at the liason to ensure communication of the board.

References

Su, S. et al. (2019). “BiocPkgTools: Toolkit for mining the Bioconductor package ecosystem [version 1; peer review: 2 approved, 1 approved with reservations] ”. In: F1000Research 8.752. DOI: 10.12688/f1000research.19410.1.

Amezquita, R. A. et al. (2020). “Orchestrating single-cell analysis with Bioconductor”. In: Nat Methods 17.2, pp. 137-145. ISSN: 1548-7105 (Electronic) 1548-7091 (Linking). DOI: 10.1038/s41592-019-0654-x. URL: https://www.ncbi.nlm.nih.gov/pubmed/31792435.

Gentleman, R. C. et al. (2004). “Bioconductor: open software development for computational biology and bioinformatics”. In: Genome Biol 5.10, p. R80. ISSN: 1474-760X (Electronic) 1474-7596 (Linking). DOI: 10.1186/gb-2004-5-10-r80. URL: https://www.ncbi.nlm.nih.gov/pubmed/15461798.

Huber, W. et al. (2015). “Orchestrating high-throughput genomic analysis with Bioconductor”. In: Nat Methods 12.2, pp. 115-21. ISSN: 1548-7105 (Electronic) 1548-7091 (Linking). DOI: 10.1038/nmeth.3252. URL: https://www.ncbi.nlm.nih.gov/pubmed/25633503.

Key Points

  • R packages are but one aspect of the Bioconductor project.

  • The Bioconductor project extends and complements the CRAN repository.

  • Different types of packages provide not only software, but also annotations, experimental data, and demonstrate the use of multiple packages in integrated workflows.

  • Interoperability beteen Bioconductor packages facilitates the writing of integrated workflows and minimizes the cognitive burden on users.

  • Educational materials from courses and conferences are archived and accessible on the Bioconductor website and YouTube channel.

  • Different channels of communication enable community members to converse and help each other, both as users and package developers.

  • The Bioconductor project is governed by scientific, technical, and advisory boards, as well as a Code of Conduct committee.


Installing Bioconductor packages

Overview

Teaching: XX min
Exercises: XX min
Questions
  • How do I install Bioconductor packages?

  • How do I check if newer versions of my installed packages are available?

  • How do I update Bioconductor packages?

  • How do I find out the name of packages available from the Bioconductor repositories?

Objectives
  • Install BiocManager.

  • Install Bioconductor packages.

BiocManager

The BiocManager package is the entry point into the Bioconductor package repository. Technically, this is the only Bioconductor package distributed on the CRAN repository.

It provides functions to safely install Bioconductor packages and check for available updates.

Once the package is installed, the function BiocManager::install() can be used to install packages from the Bioconductor repository. The function is also capable of installing packages from other repositories (e.g., CRAN), if those packages are not found in the Bioconductor repository first.

The package BiocManager is available from the CRAN repository and used to install packages from the Bioconductor repository.

The package BiocManager is available from the CRAN repository and used to install packages from the Bioconductor repository. The function install.packages() from the base R package utils can be used to install the BiocManager package distributed on the CRAN repository. In turn, the function BiocManager::install() can be used to install packages available on the Bioconductor repository. Notably, the BiocManager::install() function will fall back on the CRAN repository if a package cannot be found in the Bioconductor repository.

Install the package using the code below.

install.packages("BiocManager")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.1/BiocManager_1.30.16.tgz'
Content type 'application/x-gzip' length 322920 bytes (315 KB)
==================================================
downloaded 315 KB


The downloaded binary packages are in
	/var/folders/51/nz__2dx10yzd75rylr5cg2zc0000gq/T//RtmpeS8MHV/downloaded_packages

Going further

A number of packages that are not part of the base R installation also provide functions to install packages from various repositories. For instance:

  • devtools::install()
  • remotes::install_bioc()
  • remotes::install_bitbucket()
  • remotes::install_cran()
  • remotes::install_dev()
  • remotes::install_github()
  • remotes::install_gitlab()
  • remotes::install_git()
  • remotes::install_local()
  • remotes::install_svn()
  • remotes::install_url()
  • renv::install()

Those functions are beyond the scope of this lesson, and should be used with caution and adequate knowledge of their specific behaviors. The general recommendation is to use BiocManager::install() over any other installation mechanism because it ensures proper versioning of Bioconductor packages.

Bioconductor releases and current version

Once the BiocManager package is installed, the BiocManager::version() function displays the version (i.e., release) of the Bioconductor project that is currently active in the R session.

BiocManager::version()
[1] '3.15'

Using the correct version of R and Bioconductor packages is a key aspect of reproducibility. The BiocManager packages uses the version of R running in the current session to determine the version of Biocondutor packages that can be installed in the current R library.

The Bioconductor project produces two releases each year, one around April and another one around October. The April release of Bioconductor coincides with the annual release of R. The October release of Bioconductor continues to use the same version of R for that annual cycle (i.e., until the next release, in April).

Timeline of release dates for selected Bioconductor and R versions.

Timeline of release dates for selected Bioconductor and R versions. The upper section of the timeline indicates versions and approximate release dates for the R project. The lower section of the timeline indicates versions and release dates for the Bioconductor project. Source: Bioconductor.

During each 6-month cycle of package development, Bioconductor tests packages for compatibility with the version of R that will be available for the next release cycle. Then, each time a new Bioconductor release is produced, the version of every package in the Bioconductor repository is incremented, including the package BiocVersion which determines the version of the Bioconductor project.

packageVersion("BiocVersion")
[1] '3.15.2'

This is the case for every package, even those which have not been updated at all since the previous release. That new version of each package is earmarked for the corresponding version of R; in other words, that version of the package can only be installed and accessed in an R session that uses the correct version of R. This version increment is essential to associate a each version of a Bioconductor package with a unique release of the Bioconductor project.

Following the April release, this means that users must install the new version of R to access the newly released versions of Bioconductor packages.

Instead, in October, users can continue to use the same version of R to access the newly released version of Bioconductor packages. However, to update an R library from the April release to the October release of Bioconductor, users need to call the function BiocManager::install() specifying the correct version of Bioconductor as the version option, for instance:

BiocManager::install(version = "3.14")

This needs to be done only once, as the BiocVersion package will be updated to the corresponding version, indicating the version of Bioconductor in use in this R library.

Going further

The Discussion article of this lesson includes a section discussing the release cycle of the Bioconductor project.

Check for updates

The BiocManager::valid() function inspects the version of packages currently installed in the user library, and checks whether a new version is available for any of them on the Bioconductor repository.

If everything is up-to-date, the function will simply print TRUE.

BiocManager::valid()
[1] TRUE

Conveniently, if any package can be updated, the function generates and displays the command needed to update those packages. Users simply need to copy-paste and run that command in their R console.

Example of out-of-date package library

In the example below, the BiocManager::valid() function did not return TRUE. Instead, it includes information about the active user session, and displays the exact call to BiocManager::install() that the user should run to replace all the outdated packages detected in the user library with the latest version available in CRAN or Bioconductor.

> BiocManager::valid()

* sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

loaded via a namespace (and not attached):
[1] BiocManager_1.30.16 compiler_4.1.0      tools_4.1.0         renv_0.14.0        

Bioconductor version '3.13'

  * 18 packages out-of-date
  * 0 packages too new

create a valid installation with

  BiocManager::install(c(
    "cpp11", "data.table", "digest", "hms", "knitr", "lifecycle", "matrixStats", "mime", "pillar", "RCurl",
    "readr", "remotes", "S4Vectors", "shiny", "shinyWidgets", "tidyr", "tinytex", "XML"
  ), update = TRUE, ask = FALSE)

more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date

Warning message:
18 packages out-of-date; 0 packages too new 

Specifically, in this example, the message tells the user to run the following command to bring their installation up to date:

  BiocManager::install(c(
    "cpp11", "data.table", "digest", "hms", "knitr", "lifecycle", "matrixStats", "mime", "pillar", "RCurl",
    "readr", "remotes", "S4Vectors", "shiny", "shinyWidgets", "tidyr", "tinytex", "XML"
  ), update = TRUE, ask = FALSE)

Exploring the package repository

The Bioconductor biocViews, demonstrated in the earlier episode Introduction to Bioconductor, are a great way to discover new packages by thematically browsing the hierarchical classification of Bioconductor packages.

In addition, the BiocManager::available() function returns the complete list of package names that are can be intsalled from the Bioconductor and CRAN repositories. For instance the total number of numbers that could be installed using BiocManager

length(BiocManager::available())
[1] 22437

Specifically, the current Bioconductor and CRAN repositories can be displayed as follows.

BiocManager::repositories()
                                                        BioCsoft 
                   "https://bioconductor.org/packages/3.15/bioc" 
                                                         BioCann 
        "https://bioconductor.org/packages/3.15/data/annotation" 
                                                         BioCexp 
        "https://bioconductor.org/packages/3.15/data/experiment" 
                                                   BioCworkflows 
              "https://bioconductor.org/packages/3.15/workflows" 
                                                       BioCbooks 
                  "https://bioconductor.org/packages/3.15/books" 
                                                            RSPM 
"https://packagemanager.rstudio.com/cran/__linux__/focal/latest" 
                                                            CRAN 
                                   "https://cloud.r-project.org" 

Each repository URL can be accessed in a web browser, displaying the full list of packages available from that repository. For instance, navigate to https://bioconductor.org/packages/3.14/bioc.

Going further

The function BiocManager::repositories() can be combined with the base function available.packages() to query packages available specifically from any package repository, e.g. the Bioconductor software package repository.

> db = available.packages(repos = BiocManager::repositories()["BioCsoft"])
> dim(db)
[1] 1948   17
> head(rownames(db))
[1] "a4"          "a4Base"      "a4Classif"   "a4Core"      "a4Preproc"
[6] "a4Reporting"

Conveniently, BiocManager::available() includes a pattern= argument, particularly useful to navigate annotation resources (the original use case motivating it). For instance, a range of Annotation data packages available for the mouse model organism can be listed as follows.

BiocManager::available(pattern = "*Mmusculus")
 [1] "BSgenome.Mmusculus.UCSC.mm10"        "BSgenome.Mmusculus.UCSC.mm10.masked"
 [3] "BSgenome.Mmusculus.UCSC.mm39"        "BSgenome.Mmusculus.UCSC.mm8"        
 [5] "BSgenome.Mmusculus.UCSC.mm8.masked"  "BSgenome.Mmusculus.UCSC.mm9"        
 [7] "BSgenome.Mmusculus.UCSC.mm9.masked"  "EnsDb.Mmusculus.v75"                
 [9] "EnsDb.Mmusculus.v79"                 "PWMEnrich.Mmusculus.background"     
[11] "TxDb.Mmusculus.UCSC.mm10.ensGene"    "TxDb.Mmusculus.UCSC.mm10.knownGene" 
[13] "TxDb.Mmusculus.UCSC.mm39.refGene"    "TxDb.Mmusculus.UCSC.mm9.knownGene"  

Installing packages

The BiocManager::install() function is used to install or update packages.

The function takes a character vector of package names, and attempts to install them from the Bioconductor repository.

BiocManager::install(c("S4Vectors", "BiocGenerics"))

However, if any package cannot be found in the Bioconductor repository, the function also searches for those packages in repositories listed in the global option repos.

Contribute !

Add an example of non-Bioconductor package that can be installed using BioManager. Preferably, a package that will be used later in this lesson.

Uninstalling packages

Bioconductor packages can be removed from the R library like any other R package, using the base R function remove.packages(). In essence, this function simply removes installed packages and updates index information as necessary. As a result, it will not be possible to attach the package to a session or browse the documentation of that package anymore.

remove.packages("S4Vectors")
Removing package from '/home/runner/work/_temp/Library'
(as 'lib' is unspecified)

Key Points

  • The BiocManager package is available from the CRAN repository.

  • BiocManager::install() is used to install and update Bioconductor packages (but also from CRAN and GitHub).

  • BiocManager::valid() is used to check for available package updates.

  • BiocManager::version() reports the version of Bioconductor currently installed.

  • BiocManager::install() can also be used to update an entire R library to a specific version of Bioconductor.


Getting help

Overview

Teaching: XX min
Exercises: XX min
Questions
  • Where can I find help online?

  • Where can I ask questions to package developers and other users?

  • Where can I find documentation for a specific package?

  • Where can I learn best practices to combine multiple package into a coherent workflow?

Objectives
  • Identify online resources for help.

  • Access package documentation.

Getting help with Bioconductor packages

Help about Bioconductor packages and best practices is available in several places. Often, the best source of help depends on the situation at hand:

In the next sections, we describe different sources of help available to Bioconductor users, and situations where each of them are most useful.

The Bioconductor website

The main Bioconductor website provides a host of resources, all freely available without even the need to install R or any Bioconductor package.

In particular, the biocViews page is a great way to thematically explore the collection of packages and identify Bioconductor packages providing a certain functionality.

Furthermore, the website also collects materials from courses and conferences, including presentations, video recordings, and teaching materials.

By nature, individual presentations and training materials are often tied to a specific version of Bioconductor packages. As such, they provide a snapshot of best practices at a particular point in time, and may become outdated over time after successive Bioconductor releases. Thus, it is important check the version of packages demonstrated in teaching materials matches that in the user’s R library. Alternatively, when referring to materials that employ package versions different from those in the user’s library, it is important to carefully interpret any discrepancy between the expected and actual results.

Package landing pages

Each package accepted in the Bioconductor project is granted a landing page on the main Bioconductor website, e.g. S4Vectors.

Package landing pages contains useful information that can be consulted without the need to install the package itself. This is particularly useful while browsing the Bioconductor repository is search of packages suitable for a specific task.

On the landing page, prospective users can find a short description of the package functionality, as well as pre-compiled documentation in the form of links to package vignettes. Vignettes represent a particularly powerful form of documentation as they allow package developers to demonstrate the functionality of their package in the context of an example data set and workflow (sometimes more than one).

In particular, precompiled vignettes available on the Bioconductor website provide that information without the need to install the package itself. Instead, prospective users are presented with documents (PDF or HTML) written by developers to demonstrate how the functions available in the package are meant to be used and combined into a complete workflow. Often, vignettes use standard data sets preprocessed and freely available from public repositories, including ExperimentData packages or the Bioconductor ExperimentHub.

In the “Details” section of the landing page, many packages provide a field labelled “BugReports”. That field provides a URL that users can visit to report bugs to the package developer(s).

Note

It can be difficult to distinguish actual software bugs from unwitting mistakes made by users not fully familiar with the package yet. Later in this episode, we provide advice for reporting bugs and including sufficient information to receive the fastest and most helpful responses.

In doubt, the Biocondutor support site can also be a great place to discuss individual experiences and share knowledge about packages and best practices.

Additionally, the landing page provides many other pieces of information, from daily build reports indicating whether the package passed all tests on a range of operating systems, to software dependencies indicating the number of other Bioconductor packages that must be installed before the package itself can be installed and used on the user’s system.

Going further

Each package has a landing page for each release of Bioconductor since the package was added to the repository, e.g.:

In the URL of a package landing page, we can replace the version number by the word “release” or “devel” to access the landing page of the latest stable release or development version, respectively.

Package vignettes

Each Bioconductor package is required to include at least one vignette. Many packages have more than one vignette, often separating core functionality from specific use cases.

As we noted earlier in this episode, vignettes are available from package landing pages on the Bioconductor website. However, the landing page only links to the documentation of the most recent version of the package for each version of Bioconductor. This may be a different version from the one that is installed in the user’s R library and used in the R session.

When Bioconductor packages are installed in the user’s R library, the vignettes associated with that particular version of the package are also installed on the user’s computer. Those locally installed vignettes are the gold standard reference for the version of the package that is currently installed in the R library and used in the R session. They can be accessed using the function browseVignettes(), for instance:

browseVignettes("BiocManager")

Specifically, the function browseVignettes() opens a local web page in the user’s default web browser, listing all the vignettes available for the requested package. Each vignette is available in three formats:

The precompiled format is often the most comfortable format to read, as the PDF and HTML formats allow the contents of the documents to be preview in one integrated view. This includes plain text explanations, as well as code and their outputs, both figures and console messages.

Package help pages

Bioconductor requires every user-facing package function to documented in one of the package help pages (often referred to as “man pages”, after the name of the package sub-directory where they are stored).

Help pages can be accessed using the help() function or the question mark symbol ?.

help(topic = "install", package = "BiocManager")
?BiocManager::install

Help pages for Bioconductor packages generally follow the same rules as CRAN packages when it comes to formatting and essential contents. However, Bioconductor also requires that most of the man pages documenting exported objects must have runnable examples. Runnable examples are particularly helpful to demonstrate the usage of individual functions on small data sets immediately available to users - either artificially simulated on the fly or programmatically imported from public data repositories.

In particular, runnable examples demonstrate the usage of functions in ideal cases, showcasing how the inputs inputs of the function should be formatted, and what information will be available to users in the outputs of the function. Running those examples and comparing the example inputs and outputs with the user’s own data often provide significant insights into the transformations that are needed before applying the function to the user’s own data, and how the outputs of the function can be interpreted and interacted with.

The Bioconductor support site

The Bioconductor support site provides a platform for the community of users and developers to ask questions, and help each other through the doubts and challenges related to Bioconductor packages and analytical workflows.

The support site can be freely browsed without an account to search and read the many questions that were already asked and answered. However, posting a new question does require an account on the platform. Signing up to the platform is straightforward using an email address or Open Authorization from a number of trusted providers.

A system of upvoting allows the most popular answers to feature more prominently at the top of each page. Furthermore, the original poster retains the right to mark one answer as the one that resolved their issue.

Separately, a system of points granted to each user for providing answers either popular or accepted by the original poster highlights the most active and trusted contributors on the platform.

The Bioconductor support site.

The Bioconductor support site. The Bioconductor support site tracks questions and answers posted by registered users. The platform can be freely browsed and searched by non-registered users.

Workflow packages

Bioconductor workflow packages are special in the way that they are only expected to contain vignettes, without any additional code or functionality of their own. Instead, the vignettes of workflow packages exclusively import functionality from other packages, and demonstrate how to combine functions from those packages into an integrated workflow that users are likely to face in their day-to-day work.

Like regular vignettes, data is typically fetched from publicly available sources, including Bioconductor ExperimentData packages or the Bioconductor ExperimentHub. Those freely available standard data sets allow users to interactively reproduce outputs while they read and follow along the vignette.

Workflow packages can be browsed in a dedicated section of the biocViews page.

Slack workspace

The Bioconductor Slack workspace was created in 2016.

The workspace can be freely joined using this Heroku app to generate invitations for individual email addresses.

The Bioconductor Slack workspace is a lively online platform for official announcements by the Bioconductor Core Team (e.g., Bioconductor release, conferences & events), as well as informal discussions between groups of users subscribed to thematic channels, and direct messages between community members.

The workspace features a large number of channels dedicated to particular topics and areas of interest in the community. Those channels range from active fields of research (e.g., single-cell genomics), to time-limited events (e.g., conferences), but also community outreach (e.g., diversity and representation).

Private channels also exist for governance (e.g., event organisation, advisory boards).

Note

The Slack workspace allows for rapid day-to-day communication and discussion with fellow community members through channels and direct messages.

However, popular channels can reach up to hundreds of users in many different time zones and should be used with parsimony and mindfulness. Conversely, direct messages and private channels are limited to the users invited in the discussion, and any outcome relevant to the community then needs to be re-posted in a public channel.

As a consequence, the Bioconductor support site remains the preferred way to publicly ask questions of interest to the community, in a way that both the question, discussion, and answers are easily searchable and indexed by major search engines.

How to efficiently ask for help

Most community members provide help voluntarily, on their own spare time and without any form of compensation. As such, it is important to ask questions as clearly as possible, providing relevant information, to help those volunteers identify the source of the issue as rapidly as possible, thus saving time for both the helper and the original poster.

Depending on the question, some key information includes:

When the issue relates to code being run and producing unexpected outputs, it is paramount to include sufficient information for others to reproduce the issue on their own computer. Indeed, many issues require a live R session to properly investigate the source of the issue, test fixes, or provide workaround and advice.

Crucially, when providing code as part of your post, it is important that this code be executable by readers, including data that are processed by the code. Often, the code itself may look correct, while the issue relates to the interaction between the code and a particular data set. If sharing sensitive data is not an option, then the issue should be reformulated and presented using a data set publicly available on the internet, or including code to generate simulated data randomly generated in a reproducible way (e.g., set.seed()).

One option is to use the package reprex to collate the code and outputs that describe the issue into formatted text that is easy to post of many online forums, including the Bioconductor support site.

Finally, the Bioconductor support site is the preferred platform to post questions related to Bioconductor packages. This is because questions are visible to the entire community, including many experienced Bioconductor users who regularly answer those questions, and other users who can find answers to questions that were already posted and resolved by the time they run into the issue themselves.

Key Points

  • The browseVignettes() function is recommended to access the vignette(s) installed with each package.

  • Vignettes can also be accessed on the Bioconductor website, but beware of differences between package versions!

  • The Bioconductor main website contains general information, package documentation, and course materials.

  • The Bioconductor support site is the recommended place to contact developers and ask questions.


S4 classes in Bioconductor

Overview

Teaching: XX min
Exercises: XX min
Questions
  • What is the S4 class system?

  • How does Bioconductor use S4 classes?

  • How is the Bioconductor DataFrame different from the base data.frame?

Objectives
  • Explain what S4 classes, generics, and methods are.

  • Identify S4 classes at the core of the Bioconductor package infrastructure.

  • Create various S4 objects and apply relevant S4 methods.

Install packages

Before we can proceed into the following sections, we install some Bioconductor packages that we will need. First, we check that the BiocManager package is installed before trying to use it; otherwise we install it. Then we use the BiocManager::install() function to install the necessary packages.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("S4Vectors")

For Instructors

The first part of this episode may look somewhat heavy on the theory. Do not be tempted to go into excessive details about the inner workings of the S4 class system (e.g., no need to mention the function new(), or to demonstrate a concrete example of code creating a class). Instead, the caption of the first figure demonstrates how to progressively talk through the figure, introducing technical terms in simple sentences, building up to the method dispatch concept that is core to the S4 class system, and the source of much confusion in novice users.

S4 classes and methods

The methods package

The S4 class system is implemented in the base package methods. As such, the concept is not specific to the Bioconductor project and can be found in various independent packages as well. The subject is thoroughly documented in the online book Advanced R, by Hadley Wickham. Most Bioconductor users will never need to get overly familiar with the intricacies of the S4 class system. Rather, the key to an efficient use of packages in the Bioconductor project relies on a sufficient understanding of key motivations for using the S4 class system, as well as best practices for user-facing functionality, including classes, generics, and methods. In the following sections of this episode, we focus on the essential functionality and user experience of S4 classes and methods in the context of the Bioconductor project.

On one side, S4 classes provide data structures capable of storing arbitrarily complex information in computational objects that can be assigned to variable names in an R session. On the other side, S4 generics and methods define functions that may be applied to process those objects.

Over the years, the Bioconductor project has used the S4 class system to develop a number of classes and methods capable of storing and processing data for most biological assays, including raw and processed assay data, experimental metadata for individual features and samples, as well as other assay-specific information as relevant. Gaining familiarity with the standard S4 classes commonly used throughout Bioconductor packages is a key step in building up confidence in users wishing to follow best practices while developing analytical workflows.

S4 classes, generics, and methods.

S4 classes, generics, and methods. On the left, two example classes named S4Class1 and S4Class2 demonstrate the concept of inheritance. The class S4Class1 contains two slots named SlotName1 and SlotName2 for storing data. Those two slots are restricted to store objects of type SlotType1 and SlotType2, respectively. The class also defines validity rules that check the integrity of data each time an object is updated. The class S4Class2 inherits all the slots and validity rules from the class S4Class1, in addition to defining a new slot named SlotName3 and new validity rules. Example code illustrates how objects of each class are typically created using constructor functions named identically to the corresponding class. On the right, one generic function and two methods demonstrate the concept of polymorphism and the process of S4 method dispatch. The generic function S4Generic1() defines the name of the function, as well as its arguments. However, it does not provide any implementation of that function. Instead, two methods are defined, each providing a distinct implementation of the generic function for a particular class of input. Namely, the first method defines an implementation of S4Generic1() if an object of class S4Class1 is given as argument x, while the second method method provides a different implementation of S4Generic1() if an object of class S4Class2 is given as argument x. When the generic function S4Generic1() is called, a process called method dispatch takes place, whereby the appropriate implementation of the S4Generic1() method is called according to the class of the object passed to the argument x.

Slots and validity

In contrast to the S3 class system available directly in base R (not described in this lesson), the S4 class system provides a much stricter definition of classes and methods for object-oriented programming (OOP) in R. Like many programming languages that implement the OOP model, S4 classes are used to represent real-world entities as computational objects that store information inside one or more internal components called slots. The class definition declares the type of data that may be stored in each slot; an error will be thrown if one would attempt to store unsuitable data. Moreover, the class definition can also include code that checks the validity of data stored in an object, beyond their type. For instance, while a slot of type numeric could be used to store a person’s age, but a validity method could check that the value stored is, in fact, positive.

Inheritance

One of the core pillars of the OOP model is the possibility to develop new classes that inherit and extend the functionality of existing classes. The S4 class system implements this paradigm.

The definition of a new S4 classes can declare the name of other classes to inherit from. The new classes will contain all the slots of the parent class, in addition to any new slot added in the definition of the new class itself.

The new class definition can also define new validity checks, which are added to any validity check implement in each of the parent classes.

Generics and methods

While classes define the data structures that store information, generics and methods define the functions that can be applied to objects instantiated from those classes.

S4 generic functions are used to declare the name of functions that are expected to behave differently depending on the class of objects that are given as some of their essential arguments. Instead, S4 methods are used define the distinct implementations of a generic function for each particular combination of inputs.

When a generic function is called and given an S4 object, a process called method dispatch takes place, whereby the class of the object is used to determine the appropriate method to execute.

The S4Vectors package

The S4Vectors package defines the Vector and List virtual classes and a set of generic functions that extend the semantic of ordinary vectors and lists in R. Using the S4 class system, package developers can easily implement vector-like or list-like objects as concrete subclasses of Vector or List.

Virtual classes – such as Vector and List – cannot be instantiated into objects themselves. Instead, those virtual classes provide core functionality inherited by all the concrete classes that are derived from them.

Instead, a few low-level concrete subclasses of general interest (e.g. DataFrame, Rle, and Hits) are implemented in the S4Vectors package itself, and many more are implemented in other packages throughout the Bioconductor project (e.g., IRanges).

Attach the package to the current R session as follows.

library(S4Vectors)
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
    evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':

    expand.grid, I, unname

Note

The package startup messages printed in the console are worth noting that the S4Vectors package masks a number of functions from the base package when the package is attached to the session. This means that the S4Vectors package includes an implementation of those functions, and that – being the latest package attached to the R session – its own implementation of those functions will be found first on the R search path and used instead of their original implementation in the base package.

In many cases, masked functions can be used as before without any issue. Occasionally, it may be necessary to disambiguate calls to masked function using the package name as well as the function name, e.g. base::anyDuplicated().

The DataFrame class

An extension to the concept of rectangular data

The DataFrame class implemented in the S4Vectors package extends the concept of rectangular data familiar to users of the data.frame class in base R, or tibble in the tidyverse. Specifically, the DataFrame supports the storage of any type of object (with length and [ methods) as columns.

On the whole, the DataFrame class provides a formal definition of an S4 class that behaves very similarly to data.frame, in terms of construction, subsetting, splitting, combining, etc.

The DataFrame() constructor function should be used to create new objects, comparably to the data.frame() equivalent in base R. The help page for the function, accessible as ?DataFrame, ca be consulted for more information.

DF1 <- DataFrame(
    Integers = c(1L, 2L, 3L),
    Letters = c("A", "B", "C"),
    Floats = c(1.2, 2.3, 3.4)
)
DF1
DataFrame with 3 rows and 3 columns
   Integers     Letters    Floats
  <integer> <character> <numeric>
1         1           A       1.2
2         2           B       2.3
3         3           C       3.4

In fact, DataFrame objects can be easily converted to equivalent data.frame objects.

df1 <- as.data.frame(DF1)
df1
  Integers Letters Floats
1        1       A    1.2
2        2       B    2.3
3        3       C    3.4

Differences with the base data.frame

The most notable exceptions have to do with handling of row names. First, row names are optional. This means calling rownames(x) will return NULL if there are no row names.

rownames(DF1)
NULL

This is different from data.frame, where rownames(x) returns the equivalent of as.character(seq_len(nrow(x))).

rownames(df1)
[1] "1" "2" "3"

However, returning NULL informs, for example, combination functions that no row names are desired (they are often a luxury when dealing with large data).

Furthermore, row names of DataFrame objects are not required to be unique, in contrast to the data.frame in base R. Row names are a frequent source of controversy in R, as they can be used to uniquely identify and index observations in rectangular format, without storing that information explicitly in a dedicated column. When set, row names can be used to subset rectangular data using the [ operator. Meanwhile, non-unique row names defeat that purpose and can lead to unexpected results, as only the first occurrence of each selected row name is extracted. Instead, the tidyverse tibble removed the ability to set row names altogether, forcing users to store every bit of information explicitly in dedicated columns, while providing functions to dedicated to efficiently filtering rows in rectangular data, without the need for the [ operator.

DF2 <- DataFrame(
    Integers = c(1L, 2L, 3L),
    Letters = c("A", "B", "C"),
    Floats = c(1.2, 2.3, 3.4),
    row.names = c("name1", "name1", "name2")
)
DF2
DataFrame with 3 rows and 3 columns
       Integers     Letters    Floats
      <integer> <character> <numeric>
name1         1           A       1.2
name1         2           B       2.3
name2         3           C       3.4

Challenge

Using the example above, what does DF2["name1", ] return? Why?

Solution

> DF2["name1", ]
DataFrame with 1 row and 3 columns
       Integers     Letters    Floats
      <integer> <character> <numeric>
name1         1           A       1.2

Only the first occurrence of a row matching the row name name1 is returned.

In this case, row names do not have a particular meaning, making it difficult to justify the need for them. Instead, users could extract all the rows that matching the row name name1 more explicitly as follows: DF2[rownames(DF2) == "name1", ].

Users should be mindful of the motivation for using row names in any given situation; what they represent, and how they should be used during the analysis.

Finally, row names in DataFrame do not support partial matching during subsetting, in contrast to data.frame. The stricter behaviour of DataFrame prevents often unexpected results faced by unsuspecting users.

DF3 <- DataFrame(
    Integers = c(1L, 2L, 3L),
    Letters = c("A", "B", "C"),
    Floats = c(1.2, 2.3, 3.4),
    row.names = c("alpha", "beta", "gamma")
)
df3 <- as.data.frame(DF3)

Challenge

Using the examples above, what are the outputs of DF3["a", ] and df3["a", ]? Why are they different?

Solution

> DF3["a", ]
DataFrame with 1 row and 3 columns
      Integers     Letters    Floats
     <integer> <character> <numeric>
<NA>        NA          NA        NA
> df3["a", ]
      Integers Letters Floats
alpha        1       A    1.2

The DataFrame object did not perform partial row name matching, and thus did not match any row and return a DataFrame full of NA values. Instead, the data.frame object performed partial row name matching, matched the requested "a" to the "alpha" row name, and returned the corresponding row as a new data.frame object.

Indexing

Just like a regular data.frame, columns can be accessed using $, [, and [[. Each operator has a different purpose, and the most appropriate one will often depend on what you are trying to achieve.

For example, the dollar operator $ can be used to extract a single column by name. That will often be a vector, but it may depend on the nature of the data in that column. This operator can be quite convenient in an interactive R session, as it will offer autocompletion among available column names.

DF3$Integers
[1] 1 2 3

Similarly, the double bracket operator [[ can also be used to extract a single column. It is more flexible that $ as it can handle both character names and integer indices.

DF3[["Letters"]]
[1] "A" "B" "C"
DF3[[2]]
[1] "A" "B" "C"

The operator [ is most convenient when it comes to selecting simultaneously on rows and columns, or controlling whether a single-column selection should be returned as a DataFrame or a vector.

DF3[2:3, "Letters", drop=FALSE]
DataFrame with 2 rows and 1 column
          Letters
      <character>
beta            B
gamma           C

Metadata columns

One of most notable novel functionality in DataFrame relative to the base data.frame is the capacity to hold metadata on the columns in another DataFrame.

Metadata columns.

Metadata columns. Metadata columns are illustrated in the context of a DataFrame object. On the left, a DataFrame object called DF is created with columns named A and B. On the right, the metadata columns for DF are accessed using mcols(DF). In this example, two metadata columns are created with names meta1 and meta2. Metadata columns are stored as a DataFrame that contains one row for each column in the parent DataFrame.

The metadata columns are accessed using the function mcols(), If no metadata column is defined, mcols() simply returns NULL.

DF4 <- DataFrame(
    Integers = c(1L, 2L, 3L),
    Letters = c("A", "B", "C"),
    Floats = c(1.2, 2.3, 3.4),
    row.names = c("alpha", "beta", "gamma")
)
mcols(DF4)
NULL

The function mcols() can also be used to add, edit, or remove metadata columns. For instance, we can initialise metadata columns as a DataFrame of two columns:

mcols(DF4) <- DataFrame(
    Type = sapply(DF4, typeof),
    Distinct = sapply(DF4, function(x) { length(unique(x)) } )
)
mcols(DF4)
DataFrame with 3 rows and 2 columns
                Type  Distinct
         <character> <integer>
Integers     integer         3
Letters    character         3
Floats        double         3

Note

The row names of the metadata columns are automatically set to match the column names of the parent DataFrame, clearly indicating the pairing between columns and metadata.

Run-length encoding (RLE)

An extension to the concept of vector

Similarly to the DataFrame class implemented in the S4Vectors, the Rle class provides an S4 extension to the rle() function from the base package. Specifically, the Rle class supports the storage of atomic vectors in a run-length encoding format.

Run-length encoding.

Run-length encoding. The concept of run-length encoding is demonstrated here using the example of a sequence of nucleic acids. Before encoding, each nucleotide at each position in the sequence is explicitly stored in memory. During the encoding, consecutive runs of identical nucleotides are collapsed into two bits of information: the identity of the nucleotide and the length of the run.

Run-length encoding can dramatically reduce the memory footprint of vectors that contain frequent runs of identical information. For instance, a compelling application of run-length encoding is the representation of genomic coverage in sequencing experiments, where large genomic regions devoid of any mapped reads result in long runs of 0 values. Storing each individual value would be highly inefficient from the standpoint of memory usage. Instead, the run-length encoding process collapses such runs of redundant information from arbitrarily long runs of identical information to two values: the repeated value itself, and the number of times that it is repeated.

v1 <- c(0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 2, 1, 0, 0, 0, 0, 0)
rle1 <- Rle(v1)
rle1
numeric-Rle of length 17 with 7 runs
  Lengths: 7 1 1 1 1 1 5
  Values : 0 1 2 3 2 1 0

Indexing

Just like a regular vector, Rle objects can be indexed using [.

rle1[2:4]
numeric-Rle of length 3 with 1 run
  Lengths: 3
  Values : 0

Usage

As vector-like objects, Rle objects can also be stored as columns of DataFrame objects, alongside other vector-like objects.

v2 <- c(rep(1, 5), rep(2, 5))
rle2 <- Rle(v2)
DF5 <- DataFrame(
    vector = v2,
    rle = rle2,
    equal = v2 == rle2
)
DF5
DataFrame with 10 rows and 3 columns
      vector   rle equal
   <numeric> <Rle> <Rle>
1          1     1  TRUE
2          1     1  TRUE
3          1     1  TRUE
4          1     1  TRUE
5          1     1  TRUE
6          2     2  TRUE
7          2     2  TRUE
8          2     2  TRUE
9          2     2  TRUE
10         2     2  TRUE

Going further

A number of standard operations with Rle objects are documented in the help page of the Rle class, accessible as ?Rle, and in the vignettes of the S4Vectors package, accessible using browseVignettes("S4Vectors").

Key Points

  • S4 classes store information in slots, and check the validity of the information every an object is updated.

  • To ensure the continued integrity of S4 objects, users should not access slots directly, but using dedicated functions.

  • S4 generics invoke different implementations of the method depending on the class of the object that they are given.

  • The S4 class DataFrame extends the functionality of base data.frame, for instance with the capacity to hold information about each column in metadata columns.

  • The S4 class Rle extends the functionality of the base vector, for instance with the capacity to encode repetitive vectors in a memory-efficient format.


Working with biological sequences

Overview

Teaching: XX min
Exercises: XX min
Questions
  • What is the recommended way to represent biological sequences in Bioconductor?

  • What Bioconductor packages provides methods to efficiently process biological sequences?

Objectives
  • Explain how biological sequences are represented in the Bioconductor project.

  • Identify Bioconductor packages and methods available to process biological sequences.

Install packages

Before we can proceed into the following sections, we install some Bioconductor packages that we will need. First, we check that the BiocManager package is installed before trying to use it; otherwise we install it. Then we use the BiocManager::install() function to install the necessary packages.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")

The Biostrings package and classes

Why do we need classes for biological sequences?

Biological sequences are arguably some of the simplest biological entities to represent computationally. Examples include nucleic acid sequences (e.g., DNA, RNA) and protein sequences composed of amino acids.

That is because alphabets have been designed and agreed upon to represent individual monomers using character symbols.

For instance, using the alphabet for amino acids, the reference protein sequence for the Actin, alpha skeletal muscle protein sequence is represented as follows.

[1] "MCDEDETTALVCDNGSGLVKAGFAGDDAPRAVFPSIVGRPRHQGVMVGMGQKDSYVGDEAQSKRGILTLKYPIEHGIITNWDDMEKIWHHTFYNELRVAPEEHPTLLTEAPLNPKANREKMTQIMFETFNVPAMYVAIQAVLSLYASGRTTGIVLDSGDGVTHNVPIYEGYALPHAIMRLDLAGRDLTDYLMKILTERGYSFVTTAEREIVRDIKEKLCYVALDFENEMATAASSSSLEKSYELPDGQVITIGNERFRCPETLFQPSFIGMESAGIHETTYNSIMKCDIDIRKDLYANNVMSGGTTMYPGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTFQQMWITKQEYDEAGPSIVHRKCF"

However, a major limitation of regular character vectors is that they do not check the validity of the sequences that they contain. Practically, it is possible to store meaningless sequences of symbols in character strings, including symbols that are not part of the official alphabet the relevant type of polymer. In those cases, the burden of checking the validity of sequences falls on the programs that process them, or causing those programs to run into errors when they unexpectedly encounter invalid symbols in a sequence.

Instead, S4 classes – demonstrated in the earlier episode The S4 class system – provide a way to label objects as distinct “DNA”, “RNA”, or “protein” varieties of biological sequences. This label is an extremely powerful way to inform programs on the set of character symbols they can expect in the sequence, but also the range of computational operations that can be applied to those sequences. For instance, a function designed to translate nucleic acid sequences into the corresponding amino acid sequence should only be allowed to run on sequences that represent nucleic acids.

Challenge

Can you tell whether this character string is a valid DNA sequence?

AATTGGCCRGGCCAATT

Solution

Yes, this is a valid DNA sequence using ambiguity codes defined in the IUPAC notation. In this case, A, T, C, and G represents the four standard types of nucleotides, while the R symbol acts as a regular expression representing either of the two purine nucleotide bases, A and G.

The Biostrings package

Overview

In the Bioconductor project, the Biostrings package implements S4 classes to represent biological sequences as S4 objects, e.g. DNAString for sequences of nucleotides in deoxyribonucleic acid polymers, and AAString for sequences of amino acids in protein polymers. Those S4 classes provide memory-efficient containers for character strings, automatic validity-checking functionality for each class of biological molecules, and methods implementing various string matching algorithms and other utilities for fast manipulation and processing of large biological sequences or sets of sequences.

A short presentation of the basic classes defined in the Biostrings package is available in one of the package vignettes, accessible as vignette("Biostrings2Classes"), while more detailed information is provided in the other package vignettes, accessible as browseVignettes("Biostrings").

First steps

To get started, we load the package.

library(Biostrings)

With the package loaded and attached to the session, we have access to the package functions. Those include functions that let us create new objects of the classes defined in the package. For instance, we can create an object that represents a DNA sequence, using the DNAString() constructor function. Without assigning the output to an object, we let the resulting object be printed in the console.

DNAString("ATCG")
4-letter DNAString object
seq: ATCG

Notably, DNA sequences may only contain the symbols A, T, C, and G, to represent the four DNA nucleotide bases, the symbol N as a placeholder for an unknown or unspecified base, and a restricted set of additional symbols with special meaning defined in the IUPAC Extended Genetic Alphabet. Notice that the constructor function does not let us create objects that contain invalid characters, e.g. Z.

DNAString("ATCGZ")
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 90 (char 'Z') not in lookup table

Specifically, the IUPAC Extended Genetic Alphabet defines ambiguity codes that represent sets of nucleotides, in a way similar to regular expressions. The IUPAC_CODE_MAP named character vector contains the mapping from the IUPAC nucleotide ambiguity codes to their meaning.

IUPAC_CODE_MAP
     A      C      G      T      M      R      W      S      Y      K      V      H      D      B      N 
   "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG"  "ACT"  "AGT"  "CGT" "ACGT" 

Any of those nucleotide codes are allowed in the sequence of a DNAString object. For instance, the symbol M represents either of the two nucleotides A or C at a given position in a nucleic acid sequence.

DNAString("ATCGM")
5-letter DNAString object
seq: ATCGM

In particular, pattern matching methods implemented in the Biostrings package recognize the meaning of ambiguity codes for each class of biological sequence, allowing them to efficiently match motifs queried by users without the need to design elaborate regular expressions. For instance, the method matchPattern() takes a pattern= and a subject= argument, and returns a Views object that reports and displays any match of the pattern expression at any position in the subject sequence.

Note that the default option fixed = TRUE instructs the method to match the query exactly – i.e., ignore ambiguity codes – which in this case does report any exact match.

dna1 <- DNAString("ATCGCTTTGA")
matchPattern("GM", dna1, fixed = TRUE)
Views on a 10-letter DNAString subject
subject: ATCGCTTTGA
views: NONE

Instead, to indicate that the pattern includes some ambiguity code, the argument fixed must be set to FALSE.

matchPattern("GM", dna1, fixed = FALSE)
Views on a 10-letter DNAString subject
subject: ATCGCTTTGA
views:
      start end width
  [1]     4   5     2 [GC]
  [2]     9  10     2 [GA]

In this particular example, two views describes matches of the pattern in the subject sequence. Specifically, the pattern GM first matched the sequence GC spanning positions 4 to 5 in the subject sequence, and then also matched the sequence GA from positions 9 to 10.

Similarly to the method matchPattern(), the method countPattern() can be applied to simply count the number of matches of the pattern in the subject sequence. And again, the option fixed controls whether to respect ambiguity codes, or match them exactly.

Challenge

How many hits does the following code return? Why?

dna2 <- DNAString("TGATTGCTTGGTTGMTT")
countPattern("GM", dna2, fixed = FALSE)

Solution

The method countPattern() reports 3 hits, because the option fixed = FALSE allow the pattern GM to match GA, GC, and GM sequences, due to the use of the ambiguity code M in the pattern.

Importing biological strings from files

In practice, users rarely type the strings representing biological sequences themselves. Most of the time, biological strings are imported from files, either downloaded from public repositories or generated locally using bioinformatics programs.

For instance, we can load the set of adapter sequences for the TruSeq™ DNA PCR-Free whole-genome sequencing library preparation kit from a file that we downloaded the file during the lesson setup. Since adapter sequences are nucleic acid sequences, we must use the function readDNAStringSet().

truseq_adapters <- readDNAStringSet(filepath = "data/TruSeq3-PE-2.fa")
truseq_adapters
DNAStringSet object of length 6:
    width seq                                                                                       names               
[1]    34 TACACTCTTTCCCTACACGACGCTCTTCCGATCT                                                        PrefixPE/1
[2]    34 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT                                                        PrefixPE/2
[3]    34 TACACTCTTTCCCTACACGACGCTCTTCCGATCT                                                        PE1
[4]    34 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA                                                        PE1_rc
[5]    34 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT                                                        PE2
[6]    34 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC                                                        PE2_rc

Going further

The help page of the function readDNAStringSet() – accessible using help(readDNAStringSet) – documents related functions designed to import other types of biological sequences, e.g readRNAStringSet(), readAAStringSet().

Operations on biological strings

Computing the frequency of symbols

The Biostrings package provides several functions to process and manipulate classes of biological strings. For example, we have come across matchPattern() and countPattern() earlier in this episode.

Another example of method that can be applied to biological strings is to compute the frequency of letters in a biological sequence, using the method letterFrequency().

letterFrequency(truseq_adapters, letters = DNA_ALPHABET)
      A  C  G  T M R W S Y K V H D B N - + .
[1,]  6 14  3 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,]  5  8 10 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[3,]  6 14  3 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[4,] 11  3 14  6 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[5,]  5  8 10 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[6,] 11 10  8  5 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The output is a matrix with one row for each sequence in the DNAStringSet object, and one column for each symbol in the alphabet of deoxyribonucleic acids, provided by the Biostrings package in a built-in object called DNA_ALPHABET.

Amino acid sequences

Similarly to the DNAString and DNAStringSet classes, the classes AAString and AAStringSet allow efficient storage and manipulation of a long amino acid sequence, or a set thereof.

Similarly to built-in objects for the DNA alphabet, the built-in objects AA_ALPHABET, AA_STANDARD and AA_PROTEINOGENIC describe different subsets of the alphabet of valid symbols for amino acid sequences.

For instance, the AA_ALPHABET object describes the set of symbols in the full amino acid alphabet.

AA_ALPHABET
 [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y" "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+"
[30] "."

Challenge

Use base R code to identify the two symbols present in the AA_PROTEINOGENIC alphabet object that are absent from the AA_STANDARD alphabet object. What do those two symbols represent?

Solution

> setdiff(AA_PROTEINOGENIC, AA_STANDARD)
[1] "U" "O"

The symbols U and O represent selenocysteine and pyrrolysine, respectively. Those two amino acids are in some species coded for by codons that are usually interpreted as stop codons. As such, they are not included in the alphabet of “standard” amino acids, and a alphabet of “proteinogenic” amino acids was defined to acknowledge the special biology of those amino acids. Either of those alphabets may be used to determine the validity of an amino acid sequence, depending on its biological nature.

Translating nucleotide sequences

One of the key motivations for the use of S4 classes and the object-oriented programming (OOP) model relies on the infrastructure of S4 generics and methods. As described in the earlier episode The S4 class system, generics provide a mechanism for defining and applying distinct implementations of the same generic function name, according to the nature of the input object(s) provided to the function call.

For instance, the Biostrings package provide multiple implementations of a generic called translate(), for translating DNA or RNA sequences into amino acid sequences. The set of input objects supported by the generic translate() can be listed using the function showMethods(), from the CRAN package methods.

showMethods("translate")
Function: translate (package Biostrings)
x="DNAString"
x="DNAStringSet"
x="MaskedDNAString"
x="MaskedRNAString"
x="RNAString"
x="RNAStringSet"

In the output above, we see that that the generic function translate() includes methods capable of handling objects representing DNA and RNA sequences in the DNAString and RNAString classes, respectively; but also lists of DNA and RNA sequences in objects of class DNAStringSet and RNAStringSet, as well as other classes capable of storing DNA and RNA sequences.

To demonstrate the use of the translate() method, we first load a set of open reading frame (ORFs) identified by the NIH Open Reading Frame Finder for the Homo sapiens actin beta (ACTB) mRNA (RefSeq: NM_001101), using the standard genetic code, a minimal ORF length of 75 nucleotide, and starting with the “ATG” start codon only.

actb_orf_nih <- readDNAStringSet("data/actb_orfs.fasta")
Error in .Call2("new_input_filexp", filepath, PACKAGE = "XVector"): cannot open file 'data/actb_orfs.fasta'
actb_orf_nih
Error in eval(expr, envir, enclos): object 'actb_orf_nih' not found

Having imported the nucleotide sequences as a DNAStringSet object, we can apply the translate() method to that object, to produce the amino acid sequence that results from the translation process for each nucleotide sequence.

actb_aa <- translate(actb_orf_nih)
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'translate': object 'actb_orf_nih' not found
actb_aa
Error in eval(expr, envir, enclos): object 'actb_aa' not found

In the example above, all amino acid sequences visible start with the typical methionin amino acid encoded by the “ATG” start codon. We also see that all but one of the amini acid sequences visible end with the * symbol, which indicates that the translation process ended on a stop codon. In contrast, the first open reading frame above reached the end of the nucleotide sequence without encoutering a stop codon.

Conveniently, the number of amino acids in each sequence is stated under the header width.

Challenge

Extract the length of each amino acid sequence above as an integer vector. What is the length of the longest amino acid sequence translated from any of those open reading frame?

Compare your result with the sequence information on the UniPro page for ACTB (https://www.uniprot.org/uniprot/P60709#sequences).

Solution

width(actb_aa)
# or
max(width(actb_aa))

The longest translated sequence contains 376 amino acids.

The Uniprot page reports a sequence of 375 amino acids. However, the UniProt amino acid sequence does not comprise any symbol to represent the stop codon. That difference aside, the UniPro amino acid sequence is identical to the sequence that we was produced by the translate() method.

Key Points

  • The Biostrings package defines classes to represent sequences of nucleotides and amino acids.

  • The Biostrings package also defines methods to efficiently process biological sequences.

  • The BSgenome package provides genome sequences for a range of model organisms immediately available as Bioconductor objects.


Working with genomics ranges

Overview

Teaching: XX min
Exercises: XX min
Questions
  • What is the recommended way to represent coordinates on a genomic scale in Bioconductor?

  • What Bioconductor packages provides methods to efficiently process genomic ranges?

  • How can I import/export sets of genomic coordinates from/to various genomic file formats?

Objectives
  • Explain how genomic coordinates and intervals are represented in the Bioconductor project.

  • Identify Bioconductor packages and methods available to process ranges of genomic coordinates.

Install packages

Before we can proceed into the following sections, we install some Bioconductor packages that we will need. First, we check that the BiocManager package is installed before trying to use it; otherwise we install it. Then we use the BiocManager::install() function to install the necessary packages.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GenomicRanges")

The GenomicRanges package and classes

Why do we need classes for genomic ranges?

In the era of genomics, many observations are reported as ranges of coordinates - i.e., intervals - on a genomic scale. Depending on the nature of the assay, those genomic ranges may represent genes, transcripts, exons, single nucleotide polymorphisms (SNPs), transcription factor binding sites, or peaks from next-generation sequencing assays such as ChIP-seq or ATAC-seq.

Genomic ranges tie those observations of assayed values (e.g., gene expression) to a physical location in the genome or an organism. For instance, those genomic ranges can be used to query physical proximity or overlap between assayed features and databases of known regulatory regions.

Often, the final genomic ranges used for reporting measurements are the result of combinations and operations on sets of genomic ranges in databases of known genomic features. For instance, in RNA-sequencing, next-generation sequencing reads are often counted within individual exons, and those counts are subsequently aggregated across all the exons of each gene. Separately, promoters are frequently defined as region of arbitrary width, partly upstream and/or downstream of known transcription start sites (TSS).

Importantly, genomic ranges do not necessarily need to span multiple coordinates. The notion of range is meant in the mathematical way, and single-nucleotide genomic ranges (e.g., SNPs) can be described as opening and closing at the same coordinate (or at the next coordinate, in the case of a right-open interval).

For many organisms, the genetic material is split into a number of separate nucleic acid molecules (e.g., chromosomes, plasmids). As such, genomic ranges are described by the name of the sequence and the numeric interval of coordinates on that sequence.

Example uses of the GenomicRanges algebra.

Example uses of the GenomicRanges algebra. Adapted from Huber, et al. (2015). The figure illustrates the example of a gene model that comprises two transcripts, and the definition of various genomic ranges relative to that gene model. For instance - in this specific illustration - unspliced transcripts summarise the entire range of coordinates from the start of the first exon to the end of the last exon; while the gene region is defined as the set of coordinates included in at least one exon of one transcript.

A brief introduction to intervals

Intervals are described in mathematical terms using a start and an end position on an axis of continuous coordinates. The interval then comprises all the real numbers between those two coordinates, and the width of each interval can be computed from the difference between the coordinates of the start and end positions.

Generally speaking, the start and end position can be any rational number, including floating-point numbers. However, in genomics, integer coordinates are typically used to represent the location of monomers (e.g., nucleotide, amino acid) in the sequence of polymers (e.g., nucleic acid, protein).

You may come across packages, databases, and programming languages that use different rules to define intervals. In R, indexing is 1-based (meaning that the first position in a sequence is 1), which contrasts with Python that is 0-based (the index of the first position in a sequence is 0). Similarly, references files in the UCSC Genome Browser are 0-based, while those of the Ensembl Genome Browser are 1-based.

The definition of intervals in a shared coordinate system allows calculations such as the distance between two intervals - generally calculated as the distance between the two closest edges of those intervals -, and the identification of overlapping intervals.

Example of intervals.

Example of intervals. Three intervals named A, B, and C, are represented. Interval A starts at position 5 and ends at position 9, for a width of 4 units; interval B starts at position 1 and ends at position 3, for a width of 2 units; interval C starts at position 3 and ends at position 6, for a width of 3 units. Intervals A and C overlap, from coordinates 5 to 6; while intervals B and C meet at coordinate 3, but do not strictly overlap each other.

A brief introduction to genomic ranges

Genomic ranges essentially extend the notion of mathematical intervals on sets of biological sequences (e.g., chromosomes). In other words, genomic ranges combine the name of the biological sequence on which they are located with the integer range of coordinates that the genomic ranges span in that sequence. This is key to distinguish genomic features that span an overlapping range of coordinates on different biological sequences.

Furthermore, the double-stranded nature of DNA sequences also adds the notion of strandedness to genomic ranges. If known, the strand information of genomic features is a key piece of information that should be tracked, so that it may be used for downstream analyses. For instance, genomic ranges spanning a common range of coordinates on opposite strands of the same DNA sequence may not be considered to overlap (e.g., for the purpose of strand-specific next-generation sequencing assays).

Genomic ranges are closed intervals - the start and end positions are included in the interval; in the example of nucleic acids, the start position indicates the first nucleotide in the interval, and the end position indicates the last nucleotide in the interval.

Example of genomic intervals.

Example of genomic ranges. Genomic ranges are defined by the name of the biological sequence in which they are located (here, “chr1”), and the positions of start and end in that sequence. Here, numeric positions are not explicitly shown, but implied by the sequence of nucleic acids and the arrow indicating coordinates increasing from the left to the right. In this example, genomic ranges can be used to describe individual exons, with metadata grouping those exons into transcripts and genes. Furthermore, the strandedness of exons, transcripts, and genes is an important piece of information to precisely describe the location of each genomic range in the double-stranded DNA polymer.

The GenomicRanges package

Overview

The GenomicRanges package implements S4 classes to represent genomic ranges as S4 objects.

Specifically, the GRanges class is designed to store a set of intervals including the name of the sequence where features are located as well as the range of integer coordinates spanned by the feature in that sequence.

More generally, the IRanges class is designed to store a set of intervals over a range of integer coordinates, without the notion of sequence names. As such, a GRanges object is merely the combination of an IRanges object and a vector of sequence names.

Those S4 classes provide automatic validity-checking functionality, and a range of methods implementing common operations on integer intervals and genomic ranges, from the calculation of distance between pairs of intervals to the identification of overlapping genomic ranges.

A short presentation of the basic classes defined in the GenomicRanges package is available in one of the package vignettes, accessible as vignette("GenomicRangesIntroduction"), while more detailed information is provided in the other package vignettes, accessible as browseVignettes("GenomicRanges").

First steps

To get started, we load the package.

library(GenomicRanges)

The IRanges class

While the genomic space of many organisms is subdivided into multiple sequences (e.g., chromosomes), many operations on genomic ranges take place within individual sequences, where only integer positions matter. The IRanges class provides a container for such “simple” ranges that are defined by two out of three pieces of information:

The IRanges() constructor function accepts those three pieces of information in the arguments start=, width=, and end=. For instance, we create two integer ranges from their start position and width:

demo_iranges <- IRanges(start = c(10, 15), width = c(10, 5))
demo_iranges
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        19        10
  [2]        15        19         5

We note how the object displays not only the start and width information that we requested for each range, but also the end position that is naturally computed from the other two pieces of information.

Challenge

Create the same two ranges as above, using the arguments start= and end= of the IRanges() constructor function.

Solution

IRanges(start = c(10, 15), end = c(19, 19))
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        19        10
  [2]        15        19         5

The start and end positions as well as the width of every interval can be extracted as numeric vector using the functions start(), end() and width(), respectively.

start(demo_iranges)
[1] 10 15
end(demo_iranges)
[1] 19 19
width(demo_iranges)
[1] 10  5

Objects of the IRanges family extend the Vector class, and are handled as unidimensional vectors in terms of indexing. As such, individual ranges can be extracted by integer index like any regular vector.

demo_iranges[1]
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        10        19        10

Metadata on IRanges

The IRanges class can accommodate metadata information on each range, including names - passed to the names= argument - and miscellaneous metadata passed as named vectors.

For instance, we create two ranges named “A” and “B”. Furthermore, we define metadata fields to store an example of character values and numeric values, respectively. Both the names and the values of the metadata fields are completely arbitrary in this example.

demo_with_metadata <- IRanges(
  start = c(10,  15),
  end   = c(19,  19),
  names = c("A", "B"),
  character_metadata = c("control", "target"),
  numeric_metadata = c(100, 200)
)
demo_with_metadata
IRanges object with 2 ranges and 2 metadata columns:
        start       end     width | character_metadata numeric_metadata
    <integer> <integer> <integer> |        <character>        <numeric>
  A        10        19        10 |            control              100
  B        15        19         5 |             target              200

The metadata columns can be extracted as a DataFrame using the function mcols(), as is “metadata columns”.

mcols(demo_with_metadata)
DataFrame with 2 rows and 2 columns
  character_metadata numeric_metadata
         <character>        <numeric>
A            control              100
B             target              200

The character vector of names can be extracted using the function names().

names(demo_with_metadata)
[1] "A" "B"

Similarly to named vector of base data types, individual ranges can be extracted by name.

demo_with_metadata["A"]
IRanges object with 1 range and 2 metadata columns:
        start       end     width | character_metadata numeric_metadata
    <integer> <integer> <integer> |        <character>        <numeric>
  A        10        19        10 |            control              100

Operations on IRanges

IRanges provide the basis for most operations on ranges of numerical coordinates.

For instance, given two set of ranges - a query set and a subject set - the findOVerlaps() function can be used to find out which pairs of ranges in the two sets overlap with each other.

query_iranges <- IRanges(
  start = c(8, 16),
  end   = c(14, 18)
)
overlaps_iranges <- findOverlaps(query = query_iranges, subject = demo_iranges)
overlaps_iranges
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           1
  [3]         2           2
  -------
  queryLength: 2 / subjectLength: 2

The results are returned in the form of a Hits object, which we has not introduced yet. A Hits object is visualised as a table that comprises two integer columns named queryHits and subjectHits. Each row in the table reports an overlap between one range in the query set and one range in the subject set, with the integer value in each column indicating the index of the range in each set involved in the overlap.

In this example, we confirm that the first range in the query set overlaps the first range in the subject set; while the second range in the query set overlaps both ranges in the subject set.

Going further

For downstream use, the two components can be extracted from Hits objects using their names, respectively:

queryHits(overlaps_iranges)
[1] 1 2 2
subjectHits(overlaps_iranges)
[1] 1 1 2

While displayed as a table, Hits objects are actually handled like vectors. Individual hits between one query range and one subject range can be extracted their index:

overlaps_iranges[1]
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 2 / subjectLength: 2

The GRanges class

Having defined integer ranges, the only additional information necessary to define genomic ranges is the name the genomic sequence on which each range is located.

For instance, we define two genomic ranges, as follows:

To do so, we use the GRanges() constructor function. We provide the sequence names as a character vector to the argument seqnames=, and we provide both the start and end position to the argument ranges= as an IRanges object.

demo_granges <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(
    start = c(10, 20),
    end   = c(25, 35))
)
demo_granges
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     10-25      *
  [2]     chr2     20-35      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

In the console, the object displays the sequence names in the seqnames component, and the ranges in the form start-end in the ranges component. Furthermore, the example above also demonstrate that GRanges objects possess a component called strand; the symbol * indicates unstranded genomic ranges, as we have not provided that information.

The strand information can be supplied to the strand= argument, for instance:

demo_granges2 <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(
    start = c(10, 20),
    end   = c(25, 35)),
  strand  = c("+", "-")
)
demo_granges2
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     10-25      +
  [2]     chr2     20-35      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Finally, the examples above also demonstrate that GRanges objects include a component called seqinfo, which is occasionally used to store information about each sequence that may be represented in the seqnames component. In the latest example above, we have not provide any information about any sequence. As such, the seqinfo component was automatically populated with the names of the sequences that we used to create the object, while the remaining pieces of information were left unspecified, as NA.

seqinfo(demo_granges2)
Seqinfo object with 2 sequences from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chr1             NA         NA   <NA>
  chr2             NA         NA   <NA>

The example above reveals that information about sequences include not only their respective name and length, but also whether they represent a circular polymer (e.g., plasmid), and the name of the genome that they are part of.

This information can be provided directly to the constructor when the object is created, or edited on an existing object using the seqinfo() accessor and the Seqinfo() constructor:

seqinfo(demo_granges2) <-  Seqinfo(
    seqnames = c("chr1", "chr2"),
    seqlengths = c(1234, 5678),
    isCircular = c(FALSE, TRUE),
    genome = c("homo_sapiens", "homo_sapiens")
)
demo_granges2
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     10-25      +
  [2]     chr2     20-35      -
  -------
  seqinfo: 2 sequences (1 circular) from homo_sapiens genome

Metadata on GRanges

Similarly to IRanges, metadata can be passed directly to the GRanges constructor function. For instance:

demo_granges3 <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(
    start = c(10, 20),
    end = c(25, 35)),
  metadata1 = c("control", "target"),
  metadata2 = c(1, 2)
)
demo_granges3
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |   metadata1 metadata2
         <Rle> <IRanges>  <Rle> | <character> <numeric>
  [1]     chr1     10-25      * |     control         1
  [2]     chr2     20-35      * |      target         2
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Importing genomic ranges from files

Frequently, large collections of genomic ranges are imported from files rather than described in manually written code. In particular, genome-wide annotations of known gene features are distributed as files on websites such as the Ensembl FTP and the UCSC Genome Data sites.

Various file formats are commonly used to store genomic ranges in bioinformatics workflows. For instance, the BED (Browser Extensible Data) format is commonly found in Chromatin Immunoprecipitation Sequencing (ChIP-Seq), while GTF (Gene Transfer Format, GTF2.2) is the de facto standard file format to describe genomic features such as exons, transcripts, and genes.

In the following example, we import the gene model for Actin Beta (ACTB) from a small GTF file as a set of genomic ranges. The example file represents a subset of a GTF file for the Homo sapiens species, downloaded from the Ensembl FTP site. The original file contains more than 3 millions lines and 22 metadata fields, from which a subset was extracted into a smaller file for this lesson.

In particular, we use the import() generic defined in the BiocIO package - with methods implemented in the rtracklayer package - as a versatile function that is capable of recognising common file extensions and associating them with the appropriate method for parsing each particular file format.

library(rtracklayer)
actb_gtf_data <- rtracklayer::import("data/actb.gtf")
actb_gtf_data
GRanges object with 0 ranges and 4 metadata columns:
   seqnames    ranges strand |   source     type     score     phase
      <Rle> <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  -------
  seqinfo: no sequences

Going further

Individual methods for parsing specific file formats can be invoked directly. For instance, in this case, the GTF file format being identical to the GFF version 2 file format, we could have directly invoked the function rtracklayer::import.gff2() with the exact same effect.

Refer to the documentation of the rtracklayer package for the full list of methods available.

In the example above, the contents of the GTF file were imported into a GRanges object. For each entry in the file, the sequence name, start and end position, and strand information were used to populated the dedicated components of the object, while all other pieces of information are stored as separate columns of metadata.

From here on, this GRanges object can be manipulated just like any of the other GRanges objects that we have created earlier in this episode.

Operations on GRanges and the GRangesList class

As we have demonstrated so far, GRanges objects can be manually defined or imported from files. Those often represent genomic regions of interest, and databases of known genomic features, respectively. Either way, a number of operations are commonly applied to GRanges objects throughout bioinformatics workflows.

Subset

For instance, the subset() method is extremely convenient to extract a set of genomic ranges matching a condition on any component, including sequence name, start and end position, strand, or any metadata field. In the example below, we extract all the records of type transcript that start at position 5527147.

subset(actb_gtf_data, type == "transcript" & start == 5527147)
GRanges object with 0 ranges and 4 metadata columns:
   seqnames    ranges strand |   source     type     score     phase
      <Rle> <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  -------
  seqinfo: no sequences

Split

Separately, the split() method is useful to divide a set of genomic ranges initially stored in a single GRanges object into groups that are stored in a named list of GRanges objects. Conveniently, the GRangesList class provides a container for efficiently displaying and processing lists of GRanges objects.

In the example below, we first extract the subset of entries that represent exons, before separating those exons by transcript identifier, yielding the result as a GRangesList object.

actb_exons <- subset(actb_gtf_data, type == "exon")
actb_exons_by_transcript <- split(actb_exons, actb_exons$transcript_id)
actb_exons_by_transcript
GRangesList object of length 0:
<0 elements>

When printing the object above in the console, the first line confirms the class of the object as GRrangesList, while each named GRanges in that list is introduced by the dollar sign and the name of that item, just like regular named lists in base R.

Length

By nature, many of the methods applicable to list objects can be directly applied to GRangesList objects. For instance, the lengths() function can be used on GRangesList to display the length of each GRanges object in the list as an integer vector.

In the latest example above, we can compute the number of exons in each transcript as the length of each GRanges object within the GRangesList:

lengths(actb_exons_by_transcript)
named integer(0)

Challenge

Importantly, the function lengths() (with a final s) demonstrated above is different from the function length() (without s). The former is meant to be used on list objects, while the latter is meant to be used on vectors.

What does length(actb_exons_by_transcript) return, and why?

Solution

length(actb_exons_by_transcript)
[1] 0

This code returns the single integer value 23, which is the number of GRanges in the GRangesList object and the number of transcripts for the gene ACTB.

Subset by overlap

Possibly one of the most common operations when working with genomic ranges is to subset arbitrarily large collections of genomic ranges to those located in a specific region of the genome; for instance, when visualising information as tracks in a genome browser.

To demonstrate, we manually define a new GRanges representing a region of interest that we will use to extract all of the genomic ranges imported earlier from the GTF file which overlap that region of interest.

region_of_interest <- GRanges(
    seqnames = "7",
    ranges = IRanges(start = 5525830, end = 5531239)
)
actb_in_region <- subsetByOverlaps(x = actb_gtf_data, ranges = region_of_interest)
actb_in_region
GRanges object with 0 ranges and 4 metadata columns:
   seqnames    ranges strand |   source     type     score     phase
      <Rle> <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  -------
  seqinfo: no sequences

Like the subset() method, the subsetByOverlaps() method returns a new GRanges object. We can visually compare the information printed in the object (256 ranges in the new subsetted object, relative to 267 ranges in the original object), or we can programmatically compare the length of the two objects to check whether the new GRanges object is any smaller than the original GRanges object:

length(actb_in_region) - length(actb_gtf_data)
[1] 0

In the example above, we learn that the new GRanges object has 11 records less than the original GRanges object.

Going further

Many more methods exist to operate on GRanges and GRangesList objects that could be demonstrated here.

You can find the full list of functions defined in the GenomicRanges package on the index page of the package documentation, accessible using help(package="GenomicRanges"). You can also find more examples and use cases in the package vignettes, accessible using browseVignettes("GenomicRanges").

Key Points

  • The GenomicRanges package defines classes to represent ranges of coordinates on a genomic scale.

  • The GenomicRanges package also defines methods to efficiently process genomic ranges.

  • The rtracklayer package provides functions to import and export genomic ranges from and to common genomic file formats.