Working with biological sequences
OverviewTeaching: XX min
Exercises: XX minQuestions
What is the recommended way to represent biological sequences in Bioconductor?
What Bioconductor packages provides methods to efficiently process biological sequences?Objectives
Explain how biological sequences are represented in the Bioconductor project.
Identify Bioconductor packages and methods available to process biological sequences.
Before we can proceed into the following sections, we install some Bioconductor packages that we will need.
First, we check that the BiocManager package is installed before trying to use it; otherwise we install it.
Then we use the
BiocManager::install() function to install the necessary packages.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("Biostrings")
The Biostrings package and classes
Why do we need classes for biological sequences?
Biological sequences are arguably some of the simplest biological entities to represent computationally. Examples include nucleic acid sequences (e.g., DNA, RNA) and protein sequences composed of amino acids.
That is because alphabets have been designed and agreed upon to represent individual monomers using character symbols.
For instance, using the alphabet for amino acids, the reference protein sequence for the Actin, alpha skeletal muscle protein sequence is represented as follows.
However, a major limitation of regular character vectors is that they do not check the validity of the sequences that they contain. Practically, it is possible to store meaningless sequences of symbols in character strings, including symbols that are not part of the official alphabet for the relevant type of polymer. In those cases, the burden of checking the validity of sequences falls on the programs that process them, or causing those programs to run into errors when they unexpectedly encounter invalid symbols in a sequence.
Instead, S4 classes – demonstrated in the earlier episode The S4 class system – provide a way to label objects as distinct “DNA”, “RNA”, or “protein” varieties of biological sequences. This label is an extremely powerful way to inform programs on the set of character symbols they can expect in the sequence, but also the range of computational operations that can be applied to those sequences. For instance, a function designed to translate nucleic acid sequences into the corresponding amino acid sequence should only be allowed to run on sequences that represent nucleic acids.
Can you tell whether this character string is a valid DNA sequence?
Yes, this is a valid DNA sequence using ambiguity codes defined in the IUPAC notation. In this case,
Grepresents the four standard types of nucleotides, while the
Rsymbol acts as a regular expression representing either of the two purine nucleotide bases,
The Biostrings package
In the Bioconductor project, the Biostrings package
implements S4 classes to represent biological sequences as S4 objects, e.g.
DNAString for sequences of nucleotides in deoxyribonucleic acid polymers, and
AAString for sequences of amino acids in protein polymers.
Those S4 classes provide memory-efficient containers for character strings,
automatic validity-checking functionality for each class of biological
molecules, and methods implementing various string matching algorithms and other
utilities for fast manipulation and processing of large biological sequences or
sets of sequences.
A short presentation of the basic classes defined in the
Biostrings package is available in one of the package
vignettes, accessible as
vignette("Biostrings2Classes"), while more detailed
information is provided in the other package vignettes, accessible as
To get started, we load the package.
With the package loaded and attached to the session, we have access to the
Those include functions that let us create new objects of the classes defined
in the package.
For instance, we can create an object that represents a DNA sequence, using the
DNAString() constructor function.
Without assigning the output to an object, we let the resulting object be
printed in the console.
4-letter DNAString object seq: ATCG
Notably, DNA sequences may only contain the symbols
represent the four DNA nucleotide bases, the symbol
N as a placeholder for an
unknown or unspecified base, and a restricted set of additional symbols with
special meaning defined in the
IUPAC Extended Genetic Alphabet.
Notice that the constructor function does not let us create objects that contain
invalid characters, e.g.
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, : key 90 (char 'Z') not in lookup table
Specifically, the IUPAC Extended Genetic Alphabet defines
ambiguity codes that represent sets of nucleotides, in a way similar to regular
IUPAC_CODE_MAP named character vector contains the mapping from the IUPAC
nucleotide ambiguity codes to their meaning.
A C G T M R W S Y K V H D B N "A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG" "ACT" "AGT" "CGT" "ACGT"
Any of those nucleotide codes are allowed in the sequence of a
For instance, the symbol
M represents either of the two nucleotides
at a given position in a nucleic acid sequence.
5-letter DNAString object seq: ATCGM
In particular, pattern matching methods implemented in the
Biostrings package recognize the meaning of ambiguity
codes for each class of biological sequence, allowing them to efficiently match
motifs queried by users without the need to design elaborate regular
For instance, the method
matchPattern() takes a
pattern= and a
argument, and returns a
Views object that reports and displays any match of
pattern expression at any position in the
Note that the default option
fixed = TRUE instructs the method to match the
query exactly – i.e., ignore ambiguity codes – which in this case does not report
any exact match.
dna1 <- DNAString("ATCGCTTTGA") matchPattern("GM", dna1, fixed = TRUE)
Views on a 10-letter DNAString subject subject: ATCGCTTTGA views: NONE
Instead, to indicate that the pattern includes some ambiguity code, the argument
fixed must be set to
matchPattern("GM", dna1, fixed = FALSE)
Views on a 10-letter DNAString subject subject: ATCGCTTTGA views: start end width  4 5 2 [GC]  9 10 2 [GA]
In this particular example, two views describe matches of the pattern in the
Specifically, the pattern
GM first matched the sequence
positions 4 to 5 in the subject sequence, and then also matched the sequence
GA from positions 9 to 10.
Similarly to the method
matchPattern(), the method
countPattern() can be
applied to simply count the number of matches of the
pattern in the
And again, the option
fixed controls whether to respect ambiguity codes, or
match them exactly.
How many hits does the following code return? Why?
dna2 <- DNAString("TGATTGCTTGGTTGMTT") countPattern("GM", dna2, fixed = FALSE)
countPattern()reports 3 hits, because the option
fixed = FALSEallows the pattern
GMsequences, due to the use of the ambiguity code
Importing biological strings from files
In practice, users rarely type the strings representing biological sequences themselves. Most of the time, biological strings are imported from files, either downloaded from public repositories or generated locally using bioinformatics programs.
For instance, we can load the set of adapter sequences for the TruSeq™ DNA PCR-Free whole-genome sequencing library preparation kit from a file that we downloaded during the lesson setup.
Since adapter sequences are nucleic acid sequences, we must use the function
truseq_adapters <- readDNAStringSet(filepath = "data/TruSeq3-PE-2.fa") truseq_adapters
DNAStringSet object of length 6: width seq names  34 TACACTCTTTCCCTACACGACGCTCTTCCGATCT PrefixPE/1  34 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT PrefixPE/2  34 TACACTCTTTCCCTACACGACGCTCTTCCGATCT PE1  34 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA PE1_rc  34 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT PE2  34 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC PE2_rc
The help page of the function
readDNAStringSet()– accessible using
help(readDNAStringSet)– documents related functions designed to import other types of biological sequences, e.g
Operations on biological strings
Computing the frequency of symbols
The Biostrings package provides several functions to
process and manipulate classes of biological strings.
For example, we have come across
in this episode.
Another example of a method that can be applied to biological strings is
letterFrequency(), to compute the frequency of letters in a biological
letterFrequency(truseq_adapters, letters = DNA_ALPHABET)
A C G T M R W S Y K V H D B N - + . [1,] 6 14 3 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [2,] 5 8 10 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [3,] 6 14 3 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [4,] 11 3 14 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [5,] 5 8 10 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [6,] 11 10 8 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The output is a matrix with one row for each sequence in the
object, and one column for each symbol in the alphabet of deoxyribonucleic
acids, provided by the Biostrings package in a
built-in object called
Amino acid sequences
Similarly to the
DNAStringSet classes, the classes
AAStringSet allow efficient storage and manipulation of a long amino acid sequence, or a set thereof.
Similarly to built-in objects for the DNA alphabet, the built-in objects
AA_PROTEINOGENIC describe different subsets of the alphabet of valid symbols for amino acid sequences.
For instance, the
AA_ALPHABET object describes the set of symbols in the full amino acid alphabet.
 "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y" "V" "U" "O" "B" "J" "Z" "X" "*" "-" "+"  "."
Use base R code to identify the two symbols present in the
AA_PROTEINOGENICalphabet object that are absent from the
AA_STANDARDalphabet object. What do those two symbols represent?
> setdiff(AA_PROTEINOGENIC, AA_STANDARD)  "U" "O"
Orepresent selenocysteine and pyrrolysine, respectively. Those two amino acids are in some species coded for by codons that are usually interpreted as stop codons. As such, they are not included in the alphabet of “standard” amino acids, and an alphabet of “proteinogenic” amino acids was defined to acknowledge the special biology of those amino acids. Either of those alphabets may be used to determine the validity of an amino acid sequence, depending on its biological nature.
Translating nucleotide sequences
One of the key motivations for the use of S4 classes and the object-oriented programming (OOP) model relies on the infrastructure of S4 generics and methods. As described in the earlier episode The S4 class system, generics provide a mechanism for defining and applying distinct implementations of the same generic function name, according to the nature of the input object(s) provided to the function call.
For instance, the Biostrings package provides multiple implementations of a generic called
translate(), for translating DNA or RNA sequences into amino acid sequences.
The set of input objects supported by the generic
translate() can be listed using the function
showMethods(), from the CRAN package methods.
Function: translate (package Biostrings) x="DNAString" x="DNAStringSet" x="MaskedDNAString" x="MaskedRNAString" x="RNAString" x="RNAStringSet"
In the output above, we see that that the generic function
translate() includes methods capable of handling objects representing DNA and RNA sequences in the
RNAString classes, respectively;
but also lists of DNA and RNA sequences in objects of class
RNAStringSet, as well as other classes capable of storing DNA and RNA sequences.
To demonstrate the use of the
translate() method, we first load a set of open
reading frames (ORFs) identified by the
NIH Open Reading Frame Finder
for the Homo sapiens actin beta (ACTB) mRNA (RefSeq: NM_001101),
using the standard genetic code, a minimal ORF length of 75 nucleotides,
and starting with the “ATG” start codon only.
actb_orf_nih <- readDNAStringSet("data/actb_orfs.fasta") actb_orf_nih
DNAStringSet object of length 13: width seq names  222 ATGCCCACCATCACGCCCTGGTGCCTGGGGCGCCCCACGATGG...AAGGCGAGGCTCTGTGCTCGCGGGGCGGACGCGGTCTCGGCG gi|1519311456|ref...  1128 ATGGATGATGATATCGCCGCGCTCGTCGTCGACAACGGCTCCG...GACGAGTCCGGCCCCTCCATCGTCCACCGCAAATGCTTCTAG gi|1519311456|ref...  126 ATGATGATATCGCCGCGCTCGTCGTCGACAACGGCTCCGGCAT...TCTTCCCCTCCATCGTGGGGCGCCCCAGGCACCAGGGCGTGA gi|1519311456|ref...  90 ATGTCGTCCCAGTTGGTGACGATGCCGTGCTCGATGGGGTACT...GTGAGGATGCCTCTCTTGCTCTGGGCCTCGTCGCCCACATAG gi|1519311456|ref...  225 ATGGGCACAGTGTGGGTGACCCCGTCACCGGAGTCCATCACGA...AGCACGGGGTGCTCCTCGGGAGCCACACGCAGCTCATTGTAG gi|1519311456|ref... ... ... ...  342 ATGAGATTGGCATGGCTTTATTTGTTTTTTTTGTTTTGTTTTG...GCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAA gi|1519311456|ref...  168 ATGGCTTTATTTGTTTTTTTTGTTTTGTTTTGGTTTTTTTTTT...AATGTGGCCGAGGACTTTGATTGCACATTGTTGTTTTTTTAA gi|1519311456|ref...  111 ATGACTATTAAAAAAACAACAATGTGCAATCAAAGTCCTCGGC...GCTCCAACCGACTGCTGTCACCTTCACCGTTCCAGTTTTTAA gi|1519311456|ref...  105 ATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTT...CCTTCGTGCCCCCCCTTCCCCCTTTTTTGTCCCCCAACTTGA gi|1519311456|ref...  135 ATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTTGTCCCCC...GCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAA gi|1519311456|ref...
Having imported the nucleotide sequences as a
DNAStringSet object, we can
translate() method to that object to produce the amino acid
sequence that results from the translation process for each nucleotide sequence.
actb_aa <- translate(actb_orf_nih) actb_aa
AAStringSet object of length 13: width seq names  74 MPTITPWCLGRPTMEGKTARGASSPAKPALHMPEPLSTTSAAISSSMVSWRRVWTGGGSAKARLCARGADAVSA gi|1519311456|ref...  376 MDDDIAALVVDNGSGMCKAGFAGDDAPRAVFPSIVGRPRHQGV...RKYSVWIGGSILASLSTFQQMWISKQEYDESGPSIVHRKCF* gi|1519311456|ref...  42 MMISPRSSSTTAPACARPASRATMPPGPSSPPSWGAPGTRA* gi|1519311456|ref...  30 MSSQLVTMPCSMGYFRVRMPLLLWASSPT* gi|1519311456|ref...  75 MGTVWVTPSPESITMPVVRPEAYRDSTAWIATYMAGVLKVSNMIWVIFSRLALGFRGASVSSTGCSSGATRSSL* gi|1519311456|ref... ... ... ...  114 MRLAWLYLFFLFCFGFFFFFGLTQDLKTGTVKVTAVGWSEHPP...SLAILKATPLLSKENGPVLSQVHTGEVIALLSCKLCNAKFF* gi|1519311456|ref...  56 MALFVFFVLFWFFFFFWLDSGFKNWNGEGDSSRLERASPKVHNVAEDFDCTLLFF* gi|1519311456|ref...  37 MTIKKTTMCNQSPRPHCELWGMLAPTDCCHLHRSSF* gi|1519311456|ref...  35 MQNFFNLRLNTFLFCFILNDEPSCPPFPLFCPPT* gi|1519311456|ref...  45 MMSLRAPPSPFFVPQLEMYEGFWSPWEWVEAARAYLYTDLRPVE* gi|1519311456|ref...
In the example above, all amino acid sequences visible start with the typical
methionin amino acid encoded by the “ATG” start codon.
We also see that all but one of the amino acid sequences visible end with the
* symbol, which indicates that the translation process ended on a stop codon.
In contrast, the first open reading frame above reached the end of the
nucleotide sequence without encoutering a stop codon.
Conveniently, the number of amino acids in each sequence is stated under the
Extract the length of each amino acid sequence above as an integer vector. What is the length of the longest amino acid sequence translated from any of those open reading frames?
Compare your result with the sequence information on the UniPro page for ACTB (https://www.uniprot.org/uniprot/P60709#sequences).
width(actb_aa) # or max(width(actb_aa))
The longest translated sequence contains 376 amino acids.
The Uniprot page reports a sequence of 375 amino acids. However, the UniProt amino acid sequence does not comprise any symbol to represent the stop codon. That difference aside, the UniPro amino acid sequence is identical to the sequence that was produced by the
Biostringspackage defines classes to represent sequences of nucleotides and amino acids.
Biostringspackage also defines methods to efficiently process biological sequences.
BSgenomepackage provides genome sequences for a range of model organisms immediately available as Bioconductor objects.