
Data Processing and Visualization for Metagenomics

Starting a Metagenomics Project

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do you plan a metagenomics experiment?

  • What does a metagenomics project look like?

Objectives
  • Learn the differences between shotgun and metabarcoding (amplicon metagenomics) techniques.

  • Understand the importance of metadata.

  • Familiarize yourself with the Cuatro Ciénegas experiment.

Metagenomics

Metagenomes are collections of genomic sequences from the various (micro)organisms that coexist in a given space. They are like snapshots that can give us information about the taxonomic, and even the metabolic or functional, composition of the communities we decide to study. Thus, metagenomes are usually employed to investigate the ecology of a defined niche (e.g., the human gut or the ocean floor).

Since metagenomes are mixtures of sequences that belong to different species, a metagenomic workflow is designed to answer two questions:

  1. What species are represented in the sample?
  2. What are they capable of doing?

To find out which species are present in a niche, we have to do a taxonomic assignation of the obtained sequences. To find out their capabilities, we can look at the genes directly encoded in the metagenome or at the genes associated with the species that we found. In order to know which methodology we should use, it is important to know what questions we want to answer.

Shotgun and amplicon

There are two paths to obtain information from a complex sample:

  1. Shotgun Metagenomics
  2. Metabarcoding

Each is named after the sequencing methodology it employs, and each has particular use cases, with inherent advantages and disadvantages.

With Shotgun Metagenomics, we sequence random parts (ideally all of them) of the genomes present in a sample. We can search for the origin of these pieces (i.e., their taxonomy) and also try to find out to what part of the genome they correspond. Given enough pieces, it is even possible to obtain full individual genomes from a shotgun metagenome, which could give us a wealth of information about the species in our study. This, however, requires a lot of genomic sequences from one organism, and since the sequencing is done at random, we usually have to sequence our community deeply (have a high sequencing depth) to make sure that we obtain enough pieces of a given genome. This gets exponentially harder when our species of interest is not very abundant. It also requires enough DNA to work with, which can be difficult to obtain in certain cases. Finally, a lot of sequencing means a lot of expense, and because of this, making technical and biological replicates can be prohibitively costly.

In contrast, Metabarcoding tends to be cheaper, which makes it easier to run experiments in duplicate or even triplicate without taking a big financial hit. This is because metabarcoding sequences only small genomic fragments from the community, amplified through PCR. If the amplified region is present exactly once in every genome, ideally we wouldn’t need to sequence the amplicon metagenome so thoroughly, because one sequence is all we need to get the information about that genome and, by extension, about that species. On the other hand, if a genome in the community lacks the region targeted by the PCR primers, then no amount of sequencing can give us information about that genome. This is why the most popular amplicon used with this methodology is the 16S rRNA gene for Bacteria, since every known bacterium has this particular region. Other regions can be chosen, but they are used for very specific cases. However, even 16S amplicons are limited to, well, the 16S region, so amplicon metagenomes cannot directly tell us much about the metabolic functions found in each genome, although educated guesses can be made by knowing which genes are commonly found in every identified species.

Flow chart that shows the steps of a metagenomics project: Experimental design, Sampling, DNA extraction, Sequencing, Read quality, Assembly, Binning, Bin quality and Data analysis

On Metadata

Once we have chosen an adequate methodology for our study, it is important to take extensive notes on the origin of our samples and how we treated them. These notes constitute the metadata, or data about our data, and they are crucial for understanding and interpreting the results that we will obtain later in our metagenomic analysis. Most of the time, the differences that we observe when comparing metagenomes can be correlated to the metadata, which is why we must devote a whole section of our experimental design to the metadata we expect to collect, and record it carefully.
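As a sketch, a minimal metadata sheet could look like the following table (the sample IDs, column names, and values here are hypothetical, chosen only to illustrate the idea):

sample_id   collection_date   site     treatment      depth_m
sample_01   2012-06-27        pond_A   fertilized     0.15
sample_02   2012-06-27        pond_A   unfertilized   0.15

Each row describes one sample, and each column records one variable that could later explain differences between metagenomes.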

Amplicon or Shotgun?

Suppose you would like to compare the gut microbiome of people affected by a rather nasty bacterial disease against the gut microbiome of healthy people.
Which type of metagenomics would you choose?
Which type of metadata would be useful to record?

Before we continue, we want to introduce you to Chepiche; they are going to be with us during this lesson because they are also interested in learning about metagenomics. In fact, they already have Cuatro Ciénegas data to work on! Let’s see!

Cuatro Ciénegas

Photography of a pond in Cuatro Ciénegas

During this lesson, we will work with actual metagenomic information, so we should become familiar with it. The metagenomes that we will use were collected in Cuatro Ciénegas, a region that has been extensively studied by Valeria Souza. Cuatro Ciénegas is an oasis in the Mexican desert whose environmental conditions are often linked to the ones present in ancient seas, due to a higher-than-average content of sulfur and magnesium but lower concentrations of phosphorus and other nutrients. Because of these particular conditions, the Cuatro Ciénegas basin is a very interesting place to conduct a metagenomic study, to learn more about the bacterial diversity that is capable of surviving and thriving in that environment.

The metagenomes that we are going to work with were collected in a study about the response of the Cuatro Ciénegas bacterial community to nutrient enrichment. In this study, the authors compared the differences between the microbial community in its natural, oligotrophic, phosphorus-deficient environment, a pond from the Cuatro Ciénegas Basin (CCB), and the same microbial community under a fertilization treatment. The comparison between bacterial communities showed that many genomic traits, such as mean bacterial genome size, GC content, total number of tRNA genes, total number of rRNA genes, and codon usage bias, changed significantly when the bacterial community underwent the treatment.

Exercise 1: Reviewing metadata

According to the results described for this CCB study:

  1. What kind of sequencing method do you think they used, and why do you think so?
    A) Metabarcoding B) Shotgun metagenomics
    C) Genomics of axenic cultures

  2. In the samples’ treatment information table, what was the most important piece of metadata that the authors recorded?

Solution

A) Metabarcoding. False. With this technique, usually only one region of the genome is amplified.
B) Shotgun Metagenomics. True. Only shotgun metagenomics could have been used to investigate the total number of tRNA genes.
C) Genomics of axenic cultures. False. Information on the microbial community cannot be fully obtained with axenic cultures.

The most important thing to know about our data is which community was and was not supplemented with fertilizers.
However, any differences in the technical parts of the study, such as the DNA extraction protocol, could have affected the results, so tracking those is also important.

Exercise 2: Differentiate between IDs and sample names

Depending on the database, several IDs can be used for the same sample. Please open Chepiche’s document where the metadata information is stored. Inspect the IDs and find out which of them corresponds to sample JP4110514WATERRESIZE.

Solution

ERS1949771 is the SRA ID corresponding to JP4110514WATERRESIZE.

Exercise 3: Discuss the importance of Metadata

What other data would you recommend Chepiche add to their metadata, and why do you think this information is relevant?

Solution

Metadata will depend on the type of experiment, but some examples are temperature, sampling methodology, date, and place (country, state, region, city, etc.).

Note that throughout the lesson, we will use the first four characters of the file names (alias) to identify the data files corresponding to a sample.

The results of this study, raw sequences, and metadata have been submitted to the NCBI Sequence Read Archive (SRA), and are stored in the BioProject PRJEB22811. There are other metagenomic databases where we can find metagenomics data.
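If you ever need to retrieve raw reads from the SRA yourself, one common route is the NCBI sra-tools package. The sketch below assumes sra-tools is installed and uses a hypothetical run accession (ERRXXXXXX is a placeholder; the real run accessions are listed under BioProject PRJEB22811):

$ prefetch ERRXXXXXX                   # download the .sra archive for this run
$ fastq-dump --split-files ERRXXXXXX   # convert it into paired FASTQ files (_1 and _2)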

Other metagenomic databases

The NCBI SRA is not the only repository for metagenomic information. There are other public metagenomic databases such as MG-RAST, MGnify, Marine Metagenomics Portal, Terrestrial Metagenome DB and the GM Repo.

Each database requires certain metadata linked with the data. As an example, when JP4D.fasta is uploaded to MG-RAST, the associated metadata looks like this:

Field                Value
file_name            JP4D.fasta
investigation_type   metagenome
seq_meth             illumina
project_description  This project is a teaching project and uses data from Okie et al., eLife 2020
collection_date      2012-06-27
country              Mexico
feature              pond water
latitude             26.8717055555556
longitude            -102.14
env_package          water
depth                0.165

Key Points

  • Shotgun metagenomics can be used for taxonomic and functional studies.

  • Metabarcoding can be used for taxonomic studies.

  • Collecting metadata beforehand is fundamental for downstream analysis.

  • We will use data from a Cuatro Ciénegas project to learn about shotgun metagenomics.


Assessing Read Quality

Overview

Teaching: 30 min
Exercises: 20 min
Questions
  • How can I describe the quality of my data?

Objectives
  • Explain how a FASTQ file encodes per-base quality scores.

  • Interpret a FastQC plot summarizing per-base quality across all reads.

  • Use for loops to automate operations on multiple files.

Bioinformatic workflows

When working with high-throughput sequencing data, the raw reads you get off the sequencer need to pass through a number of different tools in order to generate your final desired output. The execution of this set of tools in a specified order is commonly referred to as a workflow or a pipeline.

An example of the workflow we will be using for our analysis is provided below with a brief description of each step.

Flow diagram that shows the steps: Sequence reads, Quality control, Assembly, Binning and Taxonomy

  1. Quality control - Assessing quality using FastQC and Trimming and/or filtering reads (if necessary)
  2. Assembly of metagenome
  3. Binning
  4. Taxonomic assignation

These workflows in bioinformatics adopt a plug-and-play approach in that the output of one tool can be easily used as input to another tool without any extensive configuration. Having standards for data formats is what makes this feasible. Standards ensure that data is stored in a way that is generally accepted and agreed upon within the community. The tools that are used to analyze data at different stages of the workflow are therefore built under the assumption that the data will be provided in a specific format.

Quality control

We will now assess the quality of the sequence reads contained in our FASTQ files.

Flow diagram that shows the steps: Sequence reads and Quality control.

Details on the FASTQ format

Although it looks complicated (and it is), we can understand the FASTQ format with a little decoding. Some rules about the format include…

Line  Description
1     Always begins with ‘@’, followed by the information about the read
2     The actual DNA sequence
3     Always begins with a ‘+’ and sometimes contains the same info as line 1
4     Has a string of characters which represent the quality scores; must have the same number of characters as line 2

We can view the first complete read in one of the files from our dataset by using head to look at the first four lines. But we have to decompress one of the files first.

$ cd /dc_workshop/data/untrimmed_fastq/

$ gunzip JP4D_R1.fastq.gz

$ head -n 4 JP4D_R1.fastq
@MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:12622:2006 1:N:0:CTCAGA
CCCGTTCCTCGGGCGTGCAGTCGGGCTTGCGGTCTGCCATGTCGTGTTCGGCGTCGGTGGTGCCGATCAGGGTGAAATCCGTCTCGTAGGGGATCGCGAAGATGATCCGCCCGTCCGTGCCCTGAAAGAAATAGCACTTGTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTCAGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAGCAAACCTCTCACTCCCTCTACTCTACTCCCTT                                        
+                                                                                                
A>>1AFC>DD111A0E0001BGEC0AEGCCGEGGFHGHHGHGHHGGHHHGGGGGGGGGGGGGHHGEGGGHHHHGHHGHHHGGHHHHGGGGGGGGGGGGGGGGHHHHHHHGGGGGGGGHGGHHHHHHHHGFHHFFGHHHHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFBFFFF@F@FFFFFFFFFFBBFF?@;@#################################### 

Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality score, the numerical score is converted into a code where each individual character represents the numerical quality score for an individual nucleotide. For example, in the line above, the quality score line is:

A>>1AFC>DD111A0E0001BGEC0AEGCCGEGGFHGHHGHGHHGGHHHGGGGGGGGGGGGGHHGEGGGHHHHGHHGHHHGGHHHHGGGGGGGGGGGGGGGGHHHHHHHGGGGGGGGHGGHHHHHHHHGFHHFFGHHHHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFBFFFF@F@FFFFFFFFFFBBFF?@;@#################################### 

The numerical value assigned to each of these characters depends on the sequencing platform that generated the reads. The sequencing machine used to generate our data uses the standard Sanger quality PHRED score encoding, used by Illumina version 1.8 onwards. Each character is assigned a quality score between 0 and 41, as shown in the chart below.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
                   |         |         |         |         |
Quality score:    01........11........21........31........41                                

Each quality score represents the probability that the corresponding nucleotide call is incorrect. The scale is logarithmic: the score is defined as Q = -10 × log10(P), where P is the probability of an incorrect base call, so a quality score of 10 reflects a base call accuracy of 90%, while a quality score of 20 reflects a base call accuracy of 99%. These probability values are the output of the base calling algorithm and depend on how much signal was captured for the base incorporation. In this link you can find more information about quality scores.
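You can check this encoding yourself in the shell. The following sketch converts a quality character to its numeric score by printing its ASCII code with printf and subtracting the Phred+33 offset:

$ char='A'
$ echo $(( $(printf '%d' "'$char") - 33 ))   # "'$char" makes printf print the ASCII code (65 for 'A')
32

A score of 32 for ‘A’ matches the chart above.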

Looking back at our read:

@MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:12622:2006 1:N:0:CTCAGA
CCCGTTCCTCGGGCGTGCAGTCGGGCTTGCGGTCTGCCATGTCGTGTTCGGCGTCGGTGGTGCCGATCAGGGTGAAATCCGTCTCGTAGGGGATCGCGAAGATGATCCGCCCGTCCGTGCCCTGAAAGAAATAGCACTTGTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTCAGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAGCAAACCTCTCACTCCCTCTACTCTACTCCCTT                                        
+                                                                                                
A>>1AFC>DD111A0E0001BGEC0AEGCCGEGGFHGHHGHGHHGGHHHGGGGGGGGGGGGGHHGEGGGHHHHGHHGHHHGGHHHHGGGGGGGGGGGGGGGGHHHHHHHGGGGGGGGHGGHHHHHHHHGFHHFFGHHHHHGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFBFFFF@F@FFFFFFFFFFBBFF?@;@#################################### 

We can now see that there is a range of quality scores, but that the end of the sequence is very poor (# = a quality score of 2).

Exercise 1: Looking at specific reads

How would you show in the terminal the ID and quality of the last read in JP4D_R1.fastq ?
a) tail JP4D_R1.fastq
b) head -n 4 JP4D_R1.fastq
c) more JP4D_R1.fastq
d) tail -n4 JP4D_R1.fastq
e) tail -n4 JP4D_R1.fastq | head -n2

Do you trust the sequence in this read?

Solution

  
a) It does show the ID and quality of the last read, but it also shows unnecessary lines from previous reads.  
b) No. It shows the first read's info.  
c) It shows the text of the entire file.  
d) This option is the best answer, as it only shows the info for the last read.  
e) It does show the ID of the last read but not its quality.  

@MISEQ-LAB244-W7:156:000000000-A80CV:1:2114:17866:28868 1:N:0:CTCAGA

CCCGTTCTCCACCTCGGCGCGCGCCAGCTGCGGCTCGTCCTTCCACAGGAACTTCCACGTCGCCGTCAGCCGCGACACGTTCTCCCCCCTCGCATGCTCGTCCTGTCTCTCGTGCTTGGCCGACGCCTGCGCCTCGCACTGCGCCCGCTCGGTGTCGTTCATGTTGATCTTCACCGTGGCGTGCATGAAGCGGTTCCCGGCCTCGTCGCCACCCACGCCATCCGCGTCGGCCAGCCACTCTCACTGCTCGC

+

AA11AC1>3@DC1F1111000A0/A///BB#############################################################################################################################################################################################################################          

This read has more consistent quality at its start than at its end, but it still has a range of quality scores, most of them low. We will look at variations in position-based quality in just a moment.

At this point, let’s validate that all the relevant tools are installed. If you are using the AWS AMI, these should be preinstalled.

$ fastqc -h 
            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

        fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of
    which will help to identify a different potential type of problem in your
    data.
.
.
.

If FastQC is not installed, then you would expect to see an error like this:

$ fastqc -h 
The program 'fastqc' is currently not installed. You can install it by typing:
sudo apt-get install fastqc

If this happens check with your instructor before trying to install it.

Assessing quality using FastQC

In real life, you won’t be assessing the quality of your reads by visually inspecting your FASTQ files. Rather, you’ll be using a software program to assess read quality and filter out poor quality reads. We’ll first use a program called FastQC to visualize the quality of our reads. Later in our workflow, we’ll use another program to filter out poor quality reads.

FastQC has a number of features which can give you a quick impression of any problems your data may have, so you can take these issues into consideration before moving forward with your analyses. Rather than looking at quality scores for each individual read, FastQC looks at quality collectively across all reads within a sample. The image below shows one FastQC-generated plot that indicates a very high quality sample:

Graphic of boxplots, where are all of them are at the top of the y axis in the good range of scores.

The x-axis displays the base position in the read, and the y-axis shows quality scores. In this example, the sample contains reads that are 40 bp long. This is much shorter than the reads we are working with in our workflow. For each position, there is a box-and-whisker plot showing the distribution of quality scores for all reads at that position. The horizontal red line indicates the median quality score and the yellow box shows the 1st to 3rd quartile range. This means that 50% of reads have a quality score that falls within the range of the yellow box at that position. The whiskers show the absolute range, which covers the lowest (0th quartile) to highest (4th quartile) values.

For each position in this sample, the quality values do not drop much lower than 32. This is a high quality score. The plot background is also color-coded to identify good (green), acceptable (yellow), and bad (red) quality scores.

Now let’s take a look at a quality plot on the other end of the spectrum.

Graphic of boxplots, where the first ones are in the good range of scores of the y axis and extend to the acceptable and bad ranges of scores toward the right of the x axis

Here, we see positions within the read in which the boxes span a much wider range. Also, quality scores drop quite low into the “bad” range, particularly on the tail end of the reads. The FastQC tool produces several other diagnostic plots to assess sample quality, in addition to the one plotted above.

Running FastQC

We will now assess the quality of the reads that we downloaded. First, make sure you’re still in the untrimmed_fastq directory:

$ cd ~/dc_workshop/data/untrimmed_fastq/ 

Exercise 2: Looking at files metadata

How would you see the size of the files in the untrimmed_fastq/ directory?
(Hint: Look at the options for the ls command to see how to show file sizes.)
a) ls -a
b) ls -S
c) ls -l
d) ls -lh
e) ls -ahlS

Solution

  
a) No. The flag `-a` shows all of the contents, including hidden files and directories, but it does not show file sizes.  
b) No. The flag `-S` sorts the contents by size, starting with the largest file, but on its own it does not show file sizes.  
c) Yes. The flag `-l` shows the contents with metadata, including file size.  
d) Yes. The flag `-lh` shows the contents with metadata in a human-readable manner.  
e) Yes. The combination of all of the flags shows all of the contents with metadata, including hidden files, sorted by size.  

-rw-r--r-- 1 dcuser dcuser  24M Nov 26 21:34 JC1A_R1.fastq.gz                      
-rw-r--r-- 1 dcuser dcuser  24M Nov 26 21:34 JC1A_R2.fastq.gz                      
-rw-r--r-- 1 dcuser dcuser 616M Nov 26 21:34 JP4D_R1.fastq              
-rw-r--r-- 1 dcuser dcuser 203M Nov 26 21:35 JP4D_R2.fastq.gz   

There are four FASTQ files ranging from 24M (24MB) to 616M.

FastQC can accept multiple file names as input and works on both zipped and unzipped files, so we can use the *.fastq* wildcard to run FastQC on all of the FASTQ files in this directory.

$ fastqc *.fastq* 

You will see an automatically updating output message telling you the progress of the analysis. It will start like this:

Started analysis of JC1A_R1.fastq.gz                                               
Approx 5% complete for JC1A_R1.fastq.gz                                            
Approx 10% complete for JC1A_R1.fastq.gz                                           
Approx 15% complete for JC1A_R1.fastq.gz                                           
Approx 20% complete for JC1A_R1.fastq.gz                                           
Approx 25% complete for JC1A_R1.fastq.gz                                           
Approx 30% complete for JC1A_R1.fastq.gz                                          
Approx 35% complete for JC1A_R1.fastq.gz  

In total, it should take about five minutes for FastQC to run on all four of our FASTQ files. When the analysis completes, your prompt will return. So your screen will look something like this:

Approx 80% complete for JP4D_R2.fastq.gz
Approx 85% complete for JP4D_R2.fastq.gz
Approx 90% complete for JP4D_R2.fastq.gz
Approx 95% complete for JP4D_R2.fastq.gz
Analysis complete for JP4D_R2.fastq.gz
$

The FastQC program has created several new files within our data/untrimmed_fastq/ directory.

$ ls 
JC1A_R1_fastqc.html             JP4D_R1.fastq                   
JC1A_R1_fastqc.zip              JP4D_R1_fastqc.html 
JC1A_R1.fastq.gz                JP4D_R1_fastqc.zip                   
JC1A_R2_fastqc.html             JP4D_R2_fastqc.html           
JC1A_R2_fastqc.zip              JP4D_R2_fastqc.zip 
JC1A_R2.fastq.gz                JP4D_R2.fastq.gz       

For each input FASTQ file, FastQC has created a .zip file and a .html file. The .zip file extension indicates that this is actually a compressed set of multiple output files. We’ll be working with these output files soon. The .html file is a stable webpage displaying the summary report for each of our samples.

We want to keep our data files and our results files separate, so we will move these output files into a new directory within our results/ directory.

$ mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads 
$ mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/ 
$ mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/ 

Now we can navigate into this results directory and do some closer inspection of our output files.

$ cd ~/dc_workshop/results/fastqc_untrimmed_reads/ 

Viewing the FastQC results

If we were working on our local computers, we’d be able to look at each of these HTML files by opening them in a web browser. However, these files are currently sitting on our remote AWS instance, where our local computer can’t see them. And since we are only logging into the AWS instance via the command line, it doesn’t have a web browser set up to display these files either. So, the easiest way to look at these webpage summary reports is to transfer them to our local computers (i.e., your laptop). To copy a file from a remote server to our own machines, we will use scp, which we learned yesterday in the Introduction to the Command Line lesson.

First, open a new terminal on your local computer; we will make a new directory there to store the HTML files we’re transferring. Let’s put it on our Desktop for now. Open a new tab in your terminal program (you can use the pull-down menu at the top of your screen or the Cmd+t keyboard shortcut) and type:

$ mkdir -p ~/Desktop/fastqc_html 

Now we can transfer our HTML files to our local computer using scp.

$ scp dcuser@ec2-34-238-162-94.compute-1.amazonaws.com:~/dc_workshop/results/fastqc_untrimmed_reads/*.html ~/Desktop/fastqc_html

As a reminder, the first part of the command dcuser@ec2-34-238-162-94.compute-1.amazonaws.com is the address for your remote computer. Make sure you replace everything after dcuser@ with your instance number (the one you used to log in).

The second part starts with a : and then gives the absolute path of the files you want to transfer from your remote computer. Don’t forget the :. We used a wildcard (*.html) to indicate that we want all of the HTML files.

The third part of the command gives the absolute path of the location you want to put the files in. This is on your local computer and is the directory we just created ~/Desktop/fastqc_html.

You should see a status output like this:

JC1A_R1_fastqc.html     100%  253KB 320.0KB/s   00:00     
JC1A_R2_fastqc.html     100%  262KB 390.1KB/s   00:00     
JP4D_R1_fastqc.html     100%  237KB 360.8KB/s   00:00     
JP4D_R2_fastqc.html     100%  244KB 385.2KB/s   00:00

Now we can go to our new directory and open the four HTML files.

Depending on your system, you should be able to select and open them all at once via a right click menu in your file browser.

Exercise 3: Discuss the quality of sequencing files

Discuss your results with a neighbor. Which sample(s) looks the best in terms of per base sequence quality? Which sample(s) look the worst?

Solution

All of the reads contain usable data, but the quality decreases toward the end of the reads. File JC1A_R2_fastqc shows the lowest quality.

Decoding the other FastQC outputs

We’ve now looked at quite a few “Per base sequence quality” FastQC graphs, but there are nine other graphs that we haven’t talked about! For a brief overview of how to interpret each of these plots, please see the FastQC documentation.

Working with the FastQC text output

Now that we’ve looked at our HTML reports to get a feel for the data, let’s look more closely at the other output files. Go back to the tab in your terminal program that is connected to your AWS instance (the tab label will start with dcuser@ip) and make sure you’re in our results subdirectory.

$ cd ~/dc_workshop/results/fastqc_untrimmed_reads/ 
$ ls 
JC1A_R1_fastqc.html           JP4D_R1_fastqc.html                  
JC1A_R1_fastqc.zip            JP4D_R1_fastqc.zip                   
JC1A_R2_fastqc.html           JP4D_R2_fastqc.html                  
JC1A_R2_fastqc.zip            JP4D_R2_fastqc.zip 

Our .zip files are compressed files. They each contain multiple different types of output files for a single input FASTQ file. To view the contents of a .zip file, we can use the program unzip to decompress these files. Let’s try doing them all at once using a wildcard.

$ unzip *.zip 
Archive:  JC1A_R1_fastqc.zip                                                       
caution: filename not matched:  JC1A_R2_fastqc.zip                                 
caution: filename not matched:  JP4D_R1_fastqc.zip                      
caution: filename not matched:  JP4D_R2_fastqc.zip  

This didn’t work. It identified the first file and then got a warning message for each of the other .zip files. This is because unzip expects to get only one zip file as input. We could go through and unzip each file one at a time, but this is very time consuming and error-prone. Someday you may have 500 files to unzip!

A more efficient way is to use a for loop like we learned in the Command Line lesson to iterate through all of our .zip files. Let’s see what that looks like and then we’ll discuss what we’re doing with each line of our loop.

$ for filename in *.zip
> do
> unzip $filename
> done

In this example, the input is the four filenames (one filename for each of our .zip files). Each time the loop iterates, it will assign a file name to the variable filename and run the unzip command. The first time through the loop, $filename is JC1A_R1_fastqc.zip. The interpreter runs the command unzip on JC1A_R1_fastqc.zip. For the second iteration, $filename becomes JC1A_R2_fastqc.zip. This time, the shell runs unzip on JC1A_R2_fastqc.zip. It then repeats this process for the other .zip files in our directory.
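A useful habit when writing a for loop is to do a dry run first: replace the command inside the loop with echo, so the shell prints what it would do without actually doing it. Once you are happy with the printed commands, run the loop with the real unzip.

$ for filename in *.zip
> do
> echo "unzip $filename"
> done
unzip JC1A_R1_fastqc.zip
unzip JC1A_R2_fastqc.zip
unzip JP4D_R1_fastqc.zip
unzip JP4D_R2_fastqc.zip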

When we run our for loop, you will see output that starts like this:

Archive:  JC1A_R1_fastqc.zip                                            
creating: JC1A_R1_fastqc/                                            
creating: JC1A_R1_fastqc/Icons/                                      
creating: JC1A_R1_fastqc/Images/                                    
inflating: JC1A_R1_fastqc/Icons/fastqc_icon.png                      
inflating: JC1A_R1_fastqc/Icons/warning.png                          
inflating: JC1A_R1_fastqc/Icons/error.png                            
inflating: JC1A_R1_fastqc/Icons/tick.png                             
inflating: JC1A_R1_fastqc/summary.txt                                
inflating: JC1A_R1_fastqc/Images/per_base_quality.png                
inflating: JC1A_R1_fastqc/Images/per_tile_quality.png                
inflating: JC1A_R1_fastqc/Images/per_sequence_quality.png            
inflating: JC1A_R1_fastqc/Images/per_base_sequence_content.png       
inflating: JC1A_R1_fastqc/Images/per_sequence_gc_content.png         
inflating: JC1A_R1_fastqc/Images/per_base_n_content.png              
inflating: JC1A_R1_fastqc/Images/sequence_length_distribution.png 
inflating: JC1A_R1_fastqc/Images/duplication_levels.png              
inflating: JC1A_R1_fastqc/Images/adapter_content.png                 
inflating: JC1A_R1_fastqc/fastqc_report.html                         
inflating: JC1A_R1_fastqc/fastqc_data.txt                            
inflating: JC1A_R1_fastqc/fastqc.fo  

The unzip program is decompressing the .zip files and creating a new directory (with subdirectories) for each of our samples, to store all of the different output that is produced by FastQC. There are a lot of files here. The one we’re going to focus on is the summary.txt file.

If you list the files in our directory now you will see:

$ ls 
JC1A_R1_fastqc                  JP4D_R1_fastqc                                                     
JC1A_R1_fastqc.html             JP4D_R1_fastqc.html                                        
JC1A_R1_fastqc.zip              JP4D_R1_fastqc.zip                                                  
JC1A_R2_fastqc                  JP4D_R2_fastqc                                               
JC1A_R2_fastqc.html             JP4D_R2_fastqc.html                                             
JC1A_R2_fastqc.zip              JP4D_R2_fastqc.zip                                         

The .html files and the .zip files are still present, but now we also have a new directory for each of our samples. We can see for sure that these are directories if we use the -F flag with ls.

$ ls -F 
JC1A_R1_fastqc/                  JP4D_R1_fastqc/                                                     
JC1A_R1_fastqc.html             JP4D_R1_fastqc.html                                        
JC1A_R1_fastqc.zip              JP4D_R1_fastqc.zip                                                  
JC1A_R2_fastqc/                  JP4D_R2_fastqc/                                               
JC1A_R2_fastqc.html             JP4D_R2_fastqc.html                                             
JC1A_R2_fastqc.zip              JP4D_R2_fastqc.zip                                         

Let’s see what files are present within one of these output directories.

$ ls -F JC1A_R1_fastqc/ 
fastqc_data.txt  fastqc.fo  fastqc_report.html	Icons/	Images/  summary.txt

Use less to preview the summary.txt file for this sample.

$ less JC1A_R1_fastqc/summary.txt 
PASS    Basic Statistics        JC1A_R1.fastq.gz                     
FAIL    Per base sequence quality       JC1A_R1.fastq.gz             
PASS    Per tile sequence quality       JC1A_R1.fastq.gz             
PASS    Per sequence quality scores     JC1A_R1.fastq.gz             
WARN    Per base sequence content       JC1A_R1.fastq.gz             
FAIL    Per sequence GC content JC1A_R1.fastq.gz                     
PASS    Per base N content      JC1A_R1.fastq.gz                     
PASS    Sequence Length Distribution    JC1A_R1.fastq.gz             
FAIL    Sequence Duplication Levels     JC1A_R1.fastq.gz             
PASS    Overrepresented sequences       JC1A_R1.fastq.gz             
FAIL    Adapter Content JC1A_R1.fastq.gz  

The summary file gives us a list of tests that FastQC ran, and tells us whether this sample passed, failed, or is borderline (WARN). Remember, to quit from less you must type q.

Documenting our work

We can make a record of the results we obtained for all our samples by concatenating all of our summary.txt files into a single file using the cat command. We’ll call this fastqc_summaries.txt and store it in ~/dc_workshop/docs.

$ mkdir -p ~/dc_workshop/docs
$ cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt

Exercise 4: Quality tests

Which samples failed at least one of FastQC’s quality tests? What test(s) did those samples fail?

Solution

We can get the list of all failed tests using grep.

$ cd ~/dc_workshop/docs
$ grep FAIL fastqc_summaries.txt
FAIL    Per base sequence quality       JC1A_R1.fastq.gz             
FAIL    Per sequence GC content JC1A_R1.fastq.gz                     
FAIL    Sequence Duplication Levels     JC1A_R1.fastq.gz             
FAIL    Adapter Content JC1A_R1.fastq.gz                             
FAIL    Per base sequence quality       JC1A_R2.fastq.gz             
FAIL    Per sequence GC content JC1A_R2.fastq.gz                     
FAIL    Sequence Duplication Levels     JC1A_R2.fastq.gz             
FAIL    Adapter Content JC1A_R2.fastq.gz                             
FAIL    Per base sequence content       JP4D_R1.fastq     
FAIL    Adapter Content JP4D_R1.fastq                     
FAIL    Per base sequence quality       JP4D_R2.fastq.gz  
FAIL    Per base sequence content       JP4D_R2.fastq.gz  
FAIL    Adapter Content JP4D_R2.fastq.gz

Quality Encodings Vary

Although we’ve used a particular quality encoding system to demonstrate interpretation of read quality, different sequencing machines use different encoding systems. This means that, depending on which sequencer you use to generate your data, a # may not be an indicator of a poor quality base call.

This mainly relates to older Solexa/Illumina data, but it’s essential that you know which sequencing platform was used to generate your data, so that you can tell your quality control program which encoding to use. If you choose the wrong encoding, you run the risk of throwing away good reads or (even worse) not throwing away bad reads!
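One rough way to inspect the encoding is to list the distinct characters that appear on the quality lines: characters with ASCII codes below ‘;’ (such as ‘#’ or the digits) only occur in Phred+33 data. Here is a sketch using GNU sed, run from the directory that holds the uncompressed file (every 4th line of a FASTQ file is a quality line):

$ head -n 4000 JP4D_R1.fastq | sed -n '4~4p' | fold -w1 | LC_ALL=C sort -u | tr -d '\n'

Tools such as FastQC and Trimmomatic make a similar guess automatically, but it never hurts to verify.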

Bonus Exercise: Automating a quality control workflow

If you lose your FastQC analysis results, how would you redo the analysis, but faster than the first time? As we have seen in a previous lesson, writing scripts for repetitive tasks is a very efficient practice in bioinformatic pipelines.

Solution

Make a new script with nano

nano quality_control.sh

Paste inside it the commands that we used, along with echo commands that show you how the script is running.

set -e # This will ensure that our script will exit if an error occurs
cd ~/dc_workshop/data/untrimmed_fastq/

echo "Running FastQC ..."
fastqc *.fastq*

mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads

echo "Saving FastQC results..."
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/

cd ~/dc_workshop/results/fastqc_untrimmed_reads/

echo "Unzipping..."
for filename in *.zip
    do
    unzip $filename
    done

echo "Saving summary..."
mkdir -p ~/dc_workshop/docs
cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt

If we were to run this script it would ask us for confirmation to redo several steps because we already did all of them. If you want to, you can run it to check that it works, but it is not necessary if you already completed every step of the previous episode.

Key Points

  • Quality encodings vary across sequencing platforms.

  • for loops let you perform the same set of operations on multiple files with a single command.

  • It is important to know the quality of our data to be able to make decisions in the subsequent steps.


Trimming and Filtering

Overview

Teaching: 30 min
Exercises: 25 min
Questions
  • How can we get rid of sequence data that doesn’t meet our quality standards?

Objectives
  • Clean FASTQ reads using Trimmomatic.

  • Select and set multiple options for command line bioinformatic tools.

  • Write for loops with two variables.

Cleaning reads

In the last episode, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of the quality at each base across all the reads from our sample. This information helps us determine the quality threshold we will accept, and thus we saw which samples fail which quality checks. Some of our samples failed quite a few of the quality metrics used by FastQC. However, this does not mean that our samples should be thrown out! It’s very common to have some quality metrics fail, and this may or may not be a problem for your downstream application. For our workflow, we will be removing some of the low-quality sequences to reduce our false-positive rate due to sequencing error.

To accomplish this, we will use a program called Trimmomatic. This useful tool filters poor quality reads and trims poor quality bases from the specified samples.

Trimmomatic options

Trimmomatic has a variety of options to accomplish its task. If we run the following command, we can see some of its options:

$ trimmomatic

Which will give you the following output:

Usage: 
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or: 
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
   or: 
       -version

This output shows us that we must first specify whether we have paired end (PE) or single end (SE) reads. Next, we specify which flags we would like to run Trimmomatic with. For example, you can specify -threads to indicate the number of processors on your computer that you want Trimmomatic to use. In most cases, using multiple threads (processors) helps the trimming run faster. These flags are not necessary, but they can give you more control over the command. The flags are followed by positional arguments, meaning the order in which you specify them is important. In paired end mode, Trimmomatic expects the two input files, and then the names of the output files. These files are described below. In single end mode, Trimmomatic expects one input file, then the name of the output file, and lastly the trimming steps.

Option Meaning
<inputFile1> Input forward reads to be trimmed. Typically the file name will contain an _1 or _R1 in the name.
<inputFile2> Input reverse reads to be trimmed. Typically the file name will contain an _2 or _R2 in the name.
<outputFile1P> Output file that contains surviving pairs from the _1 file.
<outputFile1U> Output file that contains orphaned reads from the _1 file.
<outputFile2P> Output file that contains surviving pairs from the _2 file.
<outputFile2U> Output file that contains orphaned reads from the _2 file.

The last thing Trimmomatic expects to see is the trimming parameters:

step meaning
ILLUMINACLIP Perform adapter removal.
SLIDINGWINDOW Perform sliding window trimming, cutting once the average quality within the window falls below a threshold.
LEADING Cut bases off the start of a read, if below a threshold quality.
TRAILING Cut bases off the end of a read, if below a threshold quality.
CROP Cut the read to a specified length.
HEADCROP Cut the specified number of bases from the start of the read.
MINLEN Drop an entire read if it is below a specified length.
TOPHRED33 Convert quality scores to Phred-33.
TOPHRED64 Convert quality scores to Phred-64.

We will use only a few of these options and trimming steps in our analysis. It is important to understand the steps you are using to clean your data. For more information about the Trimmomatic arguments and options, see the Trimmomatic manual.

Diagram showing the parts of the sequence that are reviewed by each parameter, and the parts that are maintained or discarded at the end of the reviewing process

However, a complete command for Trimmomatic will look something like the command below. This command is an example and will not work, as we do not have the files it refers to:

$ trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq \
              SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
              SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
              ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20

In this example, we’ve told Trimmomatic:

code meaning
PE that it will be taking a paired end file as input
-threads 4 to use four computing threads to run (this will speed up our run)
SRR_1056_1.fastq the first input file name. Forward
SRR_1056_2.fastq the second input file name. Reverse
SRR_1056_1.trimmed.fastq the output file for surviving pairs from the _1 file
SRR_1056_1un.trimmed.fastq the output file for orphaned reads from the _1 file
SRR_1056_2.trimmed.fastq the output file for surviving pairs from the _2 file
SRR_1056_2un.trimmed.fastq the output file for orphaned reads from the _2 file
ILLUMINACLIP:SRR_adapters.fa to clip the Illumina adapters from the input file using the adapter sequences listed in SRR_adapters.fa
SLIDINGWINDOW:4:20 to use a sliding window of size 4 that will remove bases if their phred score is below 20

Multi-line commands

Some of the commands we ran in this lesson are long! When typing a long command into your terminal, you can use the \ character to separate code chunks onto separate lines. This can make your code more readable.
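For example, the following two commands are equivalent; just make sure the \ is the very last character on the line:

$ fastqc JC1A_R1.fastq.gz JC1A_R2.fastq.gz
$ fastqc JC1A_R1.fastq.gz \
         JC1A_R2.fastq.gz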

Running Trimmomatic

Now, we will run Trimmomatic on our data. Navigate to your untrimmed_fastq data directory and verify that you are located in the untrimmed_fastq directory:

$ cd ~/dc_workshop/data/untrimmed_fastq
$ pwd
/home/dcuser/dc_workshop/data/untrimmed_fastq

You should have only four files in this directory. These files correspond to the forward and reverse reads of samples JC1A and JP4D.

$ ls
JC1A_R1.fastq.gz  JC1A_R2.fastq.gz  JP4D_R1.fastq  JP4D_R2.fastq.gz   

We are going to run Trimmomatic on one of our paired-end samples. While using FastQC, we saw that Universal adapters were present in our samples. The adapter sequences came with the installation of Trimmomatic, so we will first copy these sequences into our current directory.

$ cp ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/TruSeq3-PE.fa .

We will use a sliding window of size 4 that removes bases if their phred score is below 20 (like in our example above), and we will also discard any reads that do not have at least 35 bases remaining after this trimming step (the MINLEN:35 parameter below). This command will take a few minutes to run.

Earlier, we unzipped one of our files to work with it. Let’s compress the file corresponding to the sample JP4D again before we run Trimmomatic.

$ gzip JP4D_R1.fastq
$ trimmomatic PE JP4D_R1.fastq.gz JP4D_R2.fastq.gz \
JP4D_R1.trim.fastq.gz  JP4D_R1un.trim.fastq.gz \
JP4D_R2.trim.fastq.gz  JP4D_R2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:35 ILLUMINACLIP:TruSeq3-PE.fa:2:40:15 
TrimmomaticPE: Started with arguments:
 JP4D_R1.fastq.gz JP4D_R2.fastq.gz JP4D_R1.trim.fastq.gz JP4D_R1un.trim.fastq.gz JP4D_R2.trim.fastq.gz JP4D_R2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:35 ILLUMINACLIP:TruSeq3-PE.fa:2:40:15
Multiple cores found: Using 2 threads
Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 1123987 Both Surviving: 751427 (66.85%) Forward Only Surviving: 341434 (30.38%) Reverse Only Surviving: 11303 (1.01%) Dropped: 19823 (1.76%)
TrimmomaticPE: Completed successfully

Exercise 1: What did Trimmomatic do?

Use the output from your Trimmomatic command to answer the following questions.

1) What percent of reads did we discard from our sample?
2) For what percent of reads did we keep both pairs?

Solution

1) 1.76%
2) 66.85%

You may have noticed that Trimmomatic automatically detected the quality encoding of our sample. It is always a good idea to double-check this or to enter the quality encoding manually.
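If you prefer to set the encoding explicitly instead of relying on auto-detection, Trimmomatic accepts the -phred33 and -phred64 flags shown in its usage message. Our command with the encoding stated explicitly would be:

$ trimmomatic PE -phred33 JP4D_R1.fastq.gz JP4D_R2.fastq.gz \
JP4D_R1.trim.fastq.gz  JP4D_R1un.trim.fastq.gz \
JP4D_R2.trim.fastq.gz  JP4D_R2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:35 ILLUMINACLIP:TruSeq3-PE.fa:2:40:15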

We can confirm that we have our output files:

$ ls JP4D*
JP4D_R1.fastq.gz       JP4D_R1un.trim.fastq.gz	JP4D_R2.trim.fastq.gz
JP4D_R1.trim.fastq.gz  JP4D_R2.fastq.gz		JP4D_R2un.trim.fastq.gz

The output files are also FASTQ files. They should be smaller than our input files, because we’ve removed reads. We can confirm this with this command:

$ ls JP4D* -l -h
-rw-r--r-- 1 dcuser dcuser 179M Nov 26 12:44 JP4D_R1.fastq.gz
-rw-rw-r-- 1 dcuser dcuser 107M Mar 11 23:05 JP4D_R1.trim.fastq.gz
-rw-rw-r-- 1 dcuser dcuser  43M Mar 11 23:05 JP4D_R1un.trim.fastq.gz
-rw-r--r-- 1 dcuser dcuser 203M Nov 26 12:51 JP4D_R2.fastq.gz
-rw-rw-r-- 1 dcuser dcuser 109M Mar 11 23:05 JP4D_R2.trim.fastq.gz
-rw-rw-r-- 1 dcuser dcuser 1.3M Mar 11 23:05 JP4D_R2un.trim.fastq.gz

We’ve just successfully run Trimmomatic on one of our FASTQ files! However, there is some bad news. Trimmomatic can only operate on one sample at a time and we have more than one sample. The good news is that we can use a for loop to iterate through our sample files quickly!

$ for infile in *_R1.fastq.gz
do
base=$(basename ${infile} _R1.fastq.gz)
trimmomatic PE ${infile} ${base}_R2.fastq.gz \
${base}_R1.trim.fastq.gz ${base}_R1un.trim.fastq.gz \
${base}_R2.trim.fastq.gz ${base}_R2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:35 ILLUMINACLIP:TruSeq3-PE.fa:2:40:15
done
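The basename command inside this loop strips the directory path and the given suffix from a file name, leaving only the sample ID, which we then reuse to build the output file names. You can try it on its own:

$ basename JC1A_R1.fastq.gz _R1.fastq.gz
JC1A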

Go ahead and run the for loop. It should take a few minutes for Trimmomatic to run for each of our four input files. Once it’s done, take a look at your directory contents. You’ll notice that even though we ran Trimmomatic on file JP4D before running the for loop, there is only one set of files for it. Because we matched the ending _R1.fastq.gz, we re-ran Trimmomatic on this file, overwriting our first results. That’s ok, but it’s good to be aware that it happened.

$ ls
JC1A_R1.fastq.gz                     JP4D_R1.fastq.gz                                    
JC1A_R1.trim.fastq.gz                JP4D_R1.trim.fastq.gz                                
JC1A_R1un.trim.fastq.gz              JP4D_R1un.trim.fastq.gz                                
JC1A_R2.fastq.gz                     JP4D_R2.fastq.gz                                 
JC1A_R2.trim.fastq.gz                JP4D_R2.trim.fastq.gz                                 
JC1A_R2un.trim.fastq.gz              JP4D_R2un.trim.fastq.gz                                 
TruSeq3-PE.fa   

Exercise 2: Adapter files

We trimmed our FASTQ files with TruSeq3-PE adapters, but there are other adapters that are commonly used. How can you check the full list of adapters that came with the Trimmomatic installation?

  1. ls ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/
  2. cp ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/
  3. head ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/

Solution

  1. Yes, the ls command will display the contents of the adapters folder (which came with the Trimmomatic installation).
  2. No, this cp command is incomplete; cp is used to copy adapters to your working directory, and it would not show you the adapter list.
  3. No, the head command is used to display the first lines of a file, not the contents of a directory.
$ ls ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/
NexteraPE-PE.fa  TruSeq2-SE.fa    TruSeq3-PE.fa
TruSeq2-PE.fa    TruSeq3-PE-2.fa  TruSeq3-SE.fa

We’ve now completed the trimming and filtering steps of our quality control process! Before we move on, let’s move our trimmed FASTQ files to a new subdirectory within our data/ directory.

$ cd ~/dc_workshop/data/untrimmed_fastq
$ mkdir ../trimmed_fastq
$ mv *.trim* ../trimmed_fastq
$ cd ../trimmed_fastq
$ ls
JC1A_R1.trim.fastq.gz    JP4D_R1.trim.fastq.gz                
JC1A_R1un.trim.fastq.gz  JP4D_R1un.trim.fastq.gz              
JC1A_R2.trim.fastq.gz    JP4D_R2.trim.fastq.gz                
JC1A_R2un.trim.fastq.gz  JP4D_R2un.trim.fastq.gz   

Bonus Exercise (Advanced): Quality test after trimming

Now that our samples have gone through quality control, they should perform better on the quality tests run by FastQC.

Sort the following chunks of code and decide in which terminal (AWS or local) you should run them to be able to re-run FastQC on your trimmed FASTQ files and visualize the HTML files to see whether your per base sequence quality is higher after trimming.

$ scp dcuser@ec2-34-203-203-131.compute-1.amazonaws.com:~/dc_workshop/data/trimmed_fastq/*.html ~/Desktop/fastqc_html/trimmed
$ fastqc ~/dc_workshop/data/trimmed_fastq/*.fastq*
$ mkdir ~/Desktop/fastqc_html/trimmed

Solution

In your AWS terminal window do:

$ fastqc ~/dc_workshop/data/trimmed_fastq/*.fastq*

In a terminal standing on your local computer do:

$ mkdir ~/Desktop/fastqc_html/trimmed
$ scp dcuser@ec2-34-203-203-131.compute-1.amazonaws.com:~/dc_workshop/data/trimmed_fastq/*.html ~/Desktop/fastqc_html/trimmed

Then take a look at the html files in your browser.

Remember to replace everything between the @ and : in your scp command with your AWS instance number.

After trimming and filtering, our overall quality is much higher, we have a distribution of sequence lengths, and more samples pass adapter content. However, quality trimming is not perfect, and some programs are better at removing some sequences than others. Trimmomatic did pretty well though, and its performance is good enough for our workflow.

Key Points

  • The options you set for the command-line tools you use are important!

  • Data cleaning is essential at the beginning of metagenomics workflows.

  • Use Trimmomatic to get clean reads, without adapters or low-quality bases.

  • Carefully fill in the parameters and options required to run a command in the bash shell.

  • Automate repetitive workflows using for loops.


Metagenome Assembly

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • Why should genomic data be assembled?

  • What is the difference between reads and contigs?

  • How can we assemble a metagenome?

Objectives
  • Understand what an assembly is.

  • Run a metagenomics assembly workflow.

  • Use an environment in a bioinformatic pipeline.

Assembling reads

The assembly process groups reads into contigs, and contigs into scaffolds, in order to obtain (ideally) the sequence of a whole chromosome. There are many programs devoted to genome and metagenome assembly; some of the main strategies they use are greedy extension, OLC, and De Bruijn graphs. Contrary to metabarcoding, shotgun metagenomics needs an assembly step; this does not mean that metabarcoding never uses one, but in metabarcoding it is often not needed.

Three diagrams depicting the three assembly algorithms

MetaSPAdes is an NGS de novo assembler for assembling large and complex metagenomic data, and it is one of the most used and recommended. It is part of the SPAdes toolkit, which contains several assembly pipelines.

Some of the problems faced by metagenomic assembly are: i) the differences in coverage between the genomes, due to their different abundances in the sample; ii) the fact that different species often share conserved regions; and iii) the presence of several strains of a single species in the community. SPAdes already deals with the non-uniform coverage problem in its algorithm, so it is useful for the assembly of simple communities, but the metaSPAdes algorithm deals with the other problems as well, allowing it to assemble metagenomes from complex communities.

Let’s see what happens if we enter the metaspades.py command on our terminal.

$ metaspades.py
metaspades.py: command not found   

The reason is that we are not located in the environment where we can call SPAdes. But before going any further, let’s talk about environments.

Activating an environment

Environments are part of a bioinformatics movement toward reproducible research; they are a way to share and maintain the programs a pipeline needs, in the exact versions used, with our colleagues and with our future selves. MetaSPAdes is not available in the (base) environment, but these AWS instances come with an environment called metagenomics. We need to activate it in order to start using MetaSPAdes.

We will use Conda as our environment manager. Conda environments are activated with the conda activate command:

$ conda activate metagenomics  

After the environment has been activated, a label is shown before the $ sign.

(metagenomics) $
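If you are ever unsure which environments exist on a machine, you can list them, and when you finish working you can return to the base environment:

$ conda env list
$ conda deactivate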

Now, if we call MetaSPAdes at the command line, there won’t be any error; instead, a long help page will be displayed on our screen.

$ metaspades.py
SPAdes genome assembler v3.15.0 [metaSPAdes mode]

Usage: spades.py [options] -o <output_dir>

Basic options:
  -o <output_dir>             directory to store all the resulting files (required)
  --iontorrent                this flag is required for IonTorrent data
  --test                      runs SPAdes on toy dataset
  -h, --help                  prints this usage message
  -v, --version               prints version

Input data:
  --12 <filename>             file with interlaced forward and reverse paired-end reads
  -1 <filename>               file with forward paired-end reads
  -2 <filename>               file with reverse paired-end reads    

Conda is an environment management system

Environments help with scientific reproducibility, allowing us to share the specific conditions in which a pipeline is run. Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux.

MetaSPAdes is a metagenomics assembler

The help that we just saw tells us how to run metaspades.py. We are going to use the simplest options: just specifying our forward paired-end reads with -1, our reverse paired-end reads with -2, and the output directory where we want our results to be stored with -o.

$ cd ~/dc_workshop/data/trimmed_fastq
$ metaspades.py -1 JC1A_R1.trim.fastq.gz -2 JC1A_R2.trim.fastq.gz -o ../../results/assembly_JC1A &

Running commands on the background

The & sign that we are using at the end of the command tells the machine to run the command in the background; this will help us avoid the cancellation of the operation in case the connection to the AWS machine is unstable.
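While the assembly runs in the background, you can check on the job with jobs and follow the log file it writes (press Ctrl+C to stop following the log; this does not stop the assembly):

$ jobs
[1]+  Running    metaspades.py -1 JC1A_R1.trim.fastq.gz -2 JC1A_R2.trim.fastq.gz -o ../../results/assembly_JC1A &
$ tail -f ../../results/assembly_JC1A/spades.log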

When the run is finished it shows this message:

======= SPAdes pipeline finished.

SPAdes log can be found here: /home/dcuser/dc_workshop/results/assembly_JC1A/spades.log

Thank you for using SPAdes!

Now we need to press Enter to get our prompt back, and a message like this will be displayed:

[1]+  Done                    metaspades.py -1 JC1A_R1.trim.fastq.gz -2 JC1A_R2.trim.fastq.gz -o ../../results/assembly_JC1A

This is because of the use of the &. Now, let's go to the files:

$ cd ../../results/assembly_JC1A
$ ls -F
assembly_graph_after_simplification.gfa
assembly_graph.fastg
assembly_graph_with_scaffolds.gfa
before_rr.fasta
contigs.paths
corrected
dataset.info
first_pe_contigs.fasta
input_dataset.yaml
contigs.fasta
scaffolds.fasta
K21
K33
K55
misc
params.txt
pipeline_state
run_spades.sh
run_spades.yaml
scaffolds.paths
spades.log
strain_graph.gfa
tmp
            

As we can see, MetaSPAdes gave us a lot of files. The ones with the assembly are contigs.fasta and scaffolds.fasta. We also find three K folders: K21, K33, and K55; these contain the individual result files for assemblies with k-mer sizes of 21, 33, and 55. The best assembly results are the ones displayed outside these K folders. The corrected folder holds the reads corrected with the SPAdes algorithm. Moreover, the file assembly_graph_with_scaffolds.gfa has the information needed to visualize our assembly by different means, for example with programs such as Bandage.

The contigs are just made from assembled reads, but the scaffolds are the result of a subsequent process in which the contigs are ordered, oriented, and connected with Ns.
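
We can check this ourselves: a minimal sketch that counts the N characters in each file (scaffolds may contain runs of Ns at contig junctions, while contigs should contain few or none):

$ grep -v ">" contigs.fasta | tr -cd 'N' | wc -c
$ grep -v ">" scaffolds.fasta | tr -cd 'N' | wc -c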

We can recognize which sample our assembly outputs correspond to because they are inside the results folder for it: assembly_JC1A/. However, the files within it do not have the sample ID, so it is very useful to rename these files in case we need them outside their folder.

Exercise 1: Rename all files in a folder

Add JC1A (the sample ID), separated by a _, at the beginning of the names of all the contents of the assembly_JC1A directory. Remember that many solutions are possible.

A) $ mv * JC1A_
B) $ mv * JC1A_*
C) $ for name in *; do mv $name JC1A_; done
D) $ for name in *; do mv $name JC1A_$name; done

Solution

A) No, this option is going to give you an error: mv: target 'JC1A_' is not a directory. This is because mv has two forms:
mv file_1 file_2
mv file_1 file_2 ... file_n directory
When a list of files is passed to mv, mv expects the last parameter to be a directory.
Here, * gives you a list of all the files in the directory, and the last parameter is JC1A_ (which mv expects to be a directory).
B) No, here JC1A_* does not match any existing file, so it stays as the literal name JC1A_*, and mv fails in the same way as in option A.
C) No, every file is renamed to the same name JC1A_, each one overwriting the previous.
D) Yes, this is one of the possible solutions.

Do you have another solution?
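
One more possibility, assuming the Perl flavor of the rename utility is available on your system (its syntax varies between Linux distributions):

$ rename 's/^/JC1A_/' *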

Exercise 2: Compare two fasta files from the assembly output

You want to know how many contigs and how many scaffolds resulted from the assembly. Use the contigs.fasta and scaffolds.fasta files and sort the commands to create correct code lines.
Do they have the same number of lines? Why?
Hint: You can use the following commands: grep, | (pipe), -l, ">", wc, filename.fasta

Solution

$ grep ">" contigs.fasta | wc -l
$ grep ">" scaffolds.fasta | wc -l

A contig is created from reads and a scaffold from a group of contigs, so we expect fewer lines in scaffolds.fasta.

Key Points

  • Assembly groups reads into contigs.

  • De Bruijn graphs use k-mers to assemble cleaned reads.

  • MetaSPAdes is a metagenome assembler.

  • Assemblers take FastQ files as input and produce a Fasta file as output.


Metagenome Binning

Overview

Teaching: 50 min
Exercises: 10 min
Questions
  • How can we obtain the original genomes from a metagenome?

Objectives
  • Obtain Metagenome-Assembled Genomes from the metagenomic assembly.

  • Check the quality of the Metagenome-Assembled genomes.

Metagenomic binning

To analyze each of the species inside our sample individually, the original genomes in the sample can be separated with a process called binning. We call these genomes reconstructed from metagenomic assembly MAGs (Metagenome-Assembled Genomes). In this process, the assembled contigs from the metagenome will be assigned to different bins (FASTA files that contain certain contigs). Ideally, each bin corresponds to only one original genome (a MAG).

Diagram depicting the DNA sequences  in the original sample as circular chromosomes, then the DNA fragmented into reads, then assembled into contigs, and then binned

Although an obvious way to separate contigs that correspond to different species is by their taxonomic assignation, there are more reliable methods that do the binning using characteristics of the contigs, such as their GC content, their tetranucleotide usage (composition), or their coverage (abundance).
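
As an illustration of one of these signals, GC content can be computed directly from the contigs. A minimal awk sketch (independent of any binning program):

$ awk '/^>/ {if (name) print name, 100*gc/len; name=$1; gc=0; len=0; next} {len += length($0); gc += gsub(/[GCgc]/, "")} END {if (name) print name, 100*gc/len}' JC1A_contigs.fasta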

MaxBin is a binning algorithm that distinguishes contigs that belong to different bins according to their coverage levels and their tetranucleotide frequencies.

Let's bin the sample we just assembled. The command for running MaxBin is run_MaxBin.pl, and the arguments it needs are the FASTA file of the assembly, the FASTQ files with the forward and reverse reads, and the output directory and name.

$ cd ~/dc_workshop/results/assembly_JC1A
$ mkdir MAXBIN
$ run_MaxBin.pl -thread 8 -contig JC1A_contigs.fasta -reads ../../data/trimmed_fastq/JC1A_R1.trim.fastq.gz -reads2 ../../data/trimmed_fastq/JC1A_R2.trim.fastq.gz -out MAXBIN/JC1A &
MaxBin 2.2.7
Thread: 12
Input contig: JC1A_contigs.fasta
Located reads file [../../data/trimmed_fastq/JC1A_R1.trim.fastq.gz]
Located reads file [../../data/trimmed_fastq/JC1A_R2.trim.fastq.gz]
out header: MAXBIN/JC1A
Running Bowtie2 on reads file [../../data/trimmed_fastq/JC1A_R1.trim.fastq.gz]...this may take a while...
Reading SAM file to estimate abundance values...
Running Bowtie2 on reads file [../../data/trimmed_fastq/JC1A_R2.trim.fastq.gz]...this may take a while...
Reading SAM file to estimate abundance values...
Searching against 107 marker genes to find starting seed contigs for [JC1A_contigs.fasta]...
Running FragGeneScan....
Running HMMER hmmsearch....
Try harder to dig out marker genes from contigs.
Marker gene search reveals that the dataset cannot be binned (the medium of marker gene number <= 1). Program stop.

It seems that it is impossible to bin our assembly because the median number of marker genes is less than or equal to 1. We could have expected this since we know it is a small sample.

We will perform the binning process with the other sample from the same study, which is a little larger. We have its assembly precomputed in the ~/dc_workshop/mags/ directory.

$ cd ~/dc_workshop/mags/
$ mkdir MAXBIN
$ run_MaxBin.pl -thread 8 -contig JP4D_contigs.fasta -reads ../data/trimmed_fastq/JP4D_R1.trim.fastq.gz -reads2 ../data/trimmed_fastq/JP4D_R2.trim.fastq.gz -out MAXBIN/JP4D &

It will take a few minutes to run. And it will finish with an output like this:

========== Job finished ==========
Yielded 4 bins for contig (scaffold) file JP4D_contigs.fasta

Here are the output files for this run.
Please refer to the README file for further details.

Summary file: MAXBIN/JP4D.summary
Genome abundance info file: MAXBIN/JP4D.abundance
Marker counts: MAXBIN/JP4D.marker
Marker genes for each bin: MAXBIN/JP4D.marker_of_each_gene.tar.gz
Bin files: MAXBIN/JP4D.001.fasta - MAXBIN/JP4D.004.fasta
Unbinned sequences: MAXBIN/JP4D.noclass

Store abundance information of reads file [../data/trimmed_fastq/JP4D_R1.trim.fastq.gz] in [MAXBIN/JP4D.abund1].
Store abundance information of reads file [../data/trimmed_fastq/JP4D_R2.trim.fastq.gz] in [MAXBIN/JP4D.abund2].


========== Elapsed Time ==========
0 hours 6 minutes and 56 seconds.

With the .summary file we can have a quick look at the bins that MaxBin produced.

$ cat MAXBIN/JP4D.summary
Bin name	Completeness	Genome size	GC content
JP4D.001.fasta	57.9%	3141556	55.5
JP4D.002.fasta	87.9%	6186438	67.3
JP4D.003.fasta	51.4%	3289972	48.1
JP4D.004.fasta	77.6%	5692657	38.9

Discussion: The quality of MAGs

Can we trust the quality of our bins only with the given information? What else do we want to know about our MAGs to confidently use them for further analysis?

Quality check

The quality of a MAG is highly dependent on the size of the genome of the species, its abundance in the community, and the depth at which we sequenced it. Two important things to measure to know its quality are completeness (is the MAG a complete genome?) and contamination (does the MAG contain only one genome?).

CheckM is a good program to assess the quality of our MAGs. It gives a measure of completeness and contamination by counting marker genes in the MAGs. The lineage workflow that is part of CheckM places your bins in a reference tree to know to which lineage they correspond and to use the appropriate marker genes to estimate the quality parameters. Unfortunately, the lineage workflow uses a lot of memory, so it can't run on our machines; but we can tell CheckM to use marker genes from Bacteria only, which uses less memory. This is a less accurate approach, but it can also be very useful if you want all of your bins analyzed with the same markers.

We will run the taxonomy workflow specifying the use of markers at the domain level, specific for the rank Bacteria. We will also specify that our bins are in FASTA format, that they are located in the MAXBIN directory, and that we want our output in the CHECKM/ directory.

$ mkdir CHECKM
$ checkm taxonomy_wf domain Bacteria -x fasta MAXBIN/ CHECKM/ 

The run will end with our results printed in the console.

--------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id     Marker lineage   # genomes   # markers   # marker sets   0    1    2    3    4   5+   Completeness   Contamination   Strain heterogeneity  
--------------------------------------------------------------------------------------------------------------------------------------------------------
  JP4D.002      Bacteria         5449        104            58        3    34   40   21   5   1       94.83           76.99              11.19          
  JP4D.004      Bacteria         5449        104            58        12   40   46   6    0   0       87.30           51.64               3.12          
  JP4D.001      Bacteria         5449        104            58        24   65   11   3    1   0       70.48           13.09               0.00          
  JP4D.003      Bacteria         5449        104            58        44   49   11   0    0   0       64.44           10.27               0.00          
--------------------------------------------------------------------------------------------------------------------------------------------------------

To have these values in an output that is more usable and shareable, we can now run the quality step of CheckM, checkm qa, and make it print the output in a TSV table instead of the console. In this step, we can ask CheckM to give us more parameters, like contig number and length.

Ideally, we would like to get only one contig per bin, with a length similar to the genome size of the corresponding taxon. Since this scenario is very difficult to obtain, we can use parameters that show us how good our assembly is. Here are some of the most common metrics: if we arrange our contigs by size, from largest to smallest, and divide the whole sequence in half, N50 is the size of the smallest contig in the half that has the larger contigs, and L50 is the number of contigs in that half of the sequence. So we want a big N50 and a small L50 value for our genomes. Read What's N50?.
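
For reference, N50 and L50 can be computed from any FASTA file with standard command-line tools. A minimal sketch (not part of CheckM; the filename is one from this episode):

$ awk '/^>/ {if (len) print len; len=0; next} {len += length($0)} END {print len}' JP4D_contigs.fasta | sort -rn | awk '{lens[NR]=$1; total+=$1} END {half=total/2; for (i=1; i<=NR; i++) {cum+=lens[i]; if (cum>=half) {print "N50:", lens[i], "L50:", i; exit}}}'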

To get the table with these extra parameters, we need to specify the file with the markers that CheckM used in the previous step, Bacteria.ms; the name of the output file we want, quality_JP4D.tsv; that we want a table, --tab_table; and the option -o 2 to ask for the extra parameters to be printed in the table.

$  checkm qa CHECKM/Bacteria.ms CHECKM/ --file CHECKM/quality_JP4D.tsv --tab_table -o 2

The table we just made looks like this. This will be very useful when you need to document your work or communicate it.

The question of how much contamination we can tolerate and how much completeness we need certainly depends on the scientific question being tackled, but the CheckM paper provides some parameters that we can follow.

Exercise 1: Discuss the quality of the obtained MAGs

Fill in the blanks to complete the code you need to download the quality_JP4D.tsv to your local computer:

____ dcuser____ec2-18-207-132-236.compute-1.amazonaws.com____/home/dcuser/dc_workshop/mags/CHECKM/quality_JP4D.tsv ____

Solution

In a terminal on your local computer, run:

$ scp dcuser@ec2-18-207-132-236.compute-1.amazonaws.com:/home/dcuser/dc_workshop/mags/CHECKM/quality_JP4D.tsv <the destination directory of your choice>

Then open the table in a spreadsheet and discuss with your team which of the parameters in the table do you find useful.

Key Points

  • Metagenome-Assembled Genomes (MAGs) are sometimes obtained from curated contigs grouped into bins.

  • Use MaxBin to assign the contigs to bins of different taxa.

  • Use CheckM to evaluate the quality of each Metagenome-Assembled Genome.


Taxonomic Assignment

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How can I know to which taxa my sequences belong?

Objectives
  • Understand how taxonomic assignment works.

  • Use Kraken to assign taxonomies to reads and contigs.

  • Visualize taxonomic assignations in graphics.

What is taxonomic assignment?

Taxonomic assignment is the process of assigning an Operational Taxonomic Unit (OTU, that is, a group of related individuals) to sequences, which can be reads or contigs. To assign an OTU to a sequence, it is compared against a database, but this comparison can be done in different ways. The comparison database in this assignment process must be constructed using complete genomes. There are many programs for doing taxonomic mapping; almost all of them follow one of the following strategies:

  1. BLAST: Using BLAST or DIAMOND, these mappers search for the most likely hit for each sequence within a database of genomes (i.e. mapping). This strategy is slow.

  2. K-mers: A genome database is broken into pieces of length k, so as to be able to search for unique pieces by taxonomic group, from the lowest common ancestor (LCA), passing through phylum, down to species. Then, the algorithm breaks the query sequence (reads or contigs) into pieces of length k, looks for where these are placed within the tree, and makes the classification with the most probable position (see the short sketch after this list for how a sequence decomposes into k-mers).
    Diagram of taxonomic tree Figure 1. Lowest common ancestor assignment example.

  3. Markers: These search the sequences to be classified for markers from a database made a priori, and assign the taxonomy depending on the hits obtained.
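
To make the k-mer idea concrete, here is a tiny sketch (with a toy sequence and k=4, both made up for illustration) of how a sequence decomposes into overlapping k-mers:

$ echo "ATCGGCTA" | awk -v k=4 '{for (i=1; i<=length($0)-k+1; i++) print substr($0, i, k)}'
ATCG
TCGG
CGGC
GGCT
GCTA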

A key result when you do taxonomic assignment of metagenomes is the abundance of each taxon or OTU in your sample. The absolute abundance of a taxon is the number of sequences (reads or contigs, depending on what you did) assigned to it, and its relative abundance is the proportion of sequences assigned to it. It is important to be aware of the many biases that can skew the abundances along the metagenomics workflow, shown in the figure, and that because of them we may not be obtaining the real abundance of the organisms in the sample.

Flow diagram that shows how the initial composition of 33% for each of the three taxa in the sample ends up being 4%, 72% and 24% after the biases imposed by the extraction, PCR, sequencing and bioinformatics steps. Figure 2. Abundance biases during a metagenomics protocol.

Using Kraken 2

Kraken 2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. kraken2 is already installed in the metagenomics environment; let's have a look at the kraken2 help.

$ kraken2  
Need to specify input filenames!
Usage: kraken2 [options] <filename(s)>

Options:
  --db NAME               Name for Kraken 2 DB
                          (default: none)
  --threads NUM           Number of threads (default: 1)
  --quick                 Quick operation (use first hit or hits)
  --unclassified-out FILENAME
                          Print unclassified sequences to filename
  --classified-out FILENAME
                          Print classified sequences to filename
  --output FILENAME       Print output to filename (default: stdout); "-" will
                          suppress normal output
  --confidence FLOAT      Confidence score threshold (default: 0.0); must be
                          in [0, 1].
  --minimum-base-quality NUM
                          Minimum base quality used in classification (def: 0,
                          only effective with FASTQ input).
  --report FILENAME       Print a report with aggregrate counts/clade to file
  --use-mpa-style         With --report, format report output like Kraken 1's
                          kraken-mpa-report
  --report-zero-counts    With --report, report counts for ALL taxa, even if
                          counts are zero
  --report-minimizer-data With --report, report minimizer and distinct minimizer
                          count information in addition to normal Kraken report
  --memory-mapping        Avoids loading database into RAM
  --paired                The filenames provided have paired-end reads
  --use-names             Print scientific names instead of just taxids
  --gzip-compressed       Input files are compressed with gzip
  --bzip2-compressed      Input files are compressed with bzip2
  --minimum-hit-groups NUM
                          Minimum number of hit groups (overlapping k-mers
                          sharing the same minimizer) needed to make a call
                          (default: 2)
  --help                  Print this message
  --version               Print version information

If none of the *-compressed flags are specified, and the filename provided
is a regular file, automatic format detection is attempted.

In addition to our input files, we also need a database with which to compare them. There are several databases compatible with kraken2 for the taxonomic assignment process.

Very important to know your database!

The database you use will determine the result you get for your data. Imagine you are searching for a lineage that was recently discovered and it is not part of the available databases. Would you find it?

Minikraken is a popular database that attempts to conserve its sensitivity despite its small size (it needs 8 GB of RAM for the assignment). Unfortunately, although it is much smaller than most databases, it is not small enough to be run by the machines we are using, so we won't be able to run kraken2. We can check our available RAM with free -h to be sure of this.

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           3.9G        272M        3.3G         48M        251M        3.3G
Swap:            0B          0B          0B

If we were to download the database we would use the following command:

$ curl -O ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904.tgz         
$ tar -xvzf minikraken2_v2_8GB_201904.tgz 

Exercise 1: Remembering commands

Fill in the blanks to decompress the following file minikraken2_v2_8GB_201904.tgz.

$ ____ -xvzf  ____.tgz

Solution

$ tar -xvzf  minikraken2_v2_8GB_201904.tgz

The tar command is used in Linux to decompress files, so in this case it extracts the contents of the compressed file minikraken2_v2_8GB_201904.tgz.

Taxonomic assignment of metagenomic reads

As we have learned, taxonomic assignment can be attempted before the assembly process. In this case we would use FASTQ files as inputs, which would be JP4D_R1.trim.fastq.gz and JP4D_R2.trim.fastq.gz. And the outputs would be two files: the report JP4D.report and the kraken file JP4D.kraken.

To run kraken2 we would use a command like this:

$ mkdir TAXONOMY_READS
$ kraken2 --db kraken-db --threads 8 --paired JP4D_R1.trim.fastq.gz JP4D_R2.trim.fastq.gz --output TAXONOMY_READS/JP4D.kraken --report TAXONOMY_READS/JP4D.report

Since we can't run kraken2 here, we precomputed its results on a server, i.e., a more powerful machine. On the server we ran kraken2 and obtained JP4D.kraken and JP4D.report.

Let’s look at the precomputed outputs of kraken2 for our JP4D reads.

$ head ~/dc_workshop/taxonomy/JP4D.kraken
U	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:19691:2037	0	250|251	0:216 |:| 0:217
U	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:14127:2052	0	250|238	0:216 |:| 0:204
U	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:14766:2063	0	251|251	0:217 |:| 0:217
C	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:15697:2078	2219696	250|120	0:28 350054:5 1224:2 0:1 2:5 0:77 2219696:5 0:93 |:| 379:4 0:82
U	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:15529:2080	0	250|149	0:216 |:| 0:115
U	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:14172:2086	0	251|250	0:217 |:| 0:216
U	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:17552:2088	0	251|249	0:217 |:| 0:215
U	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:14217:2104	0	251|227	0:217 |:| 0:193
C	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:15110:2108	2109625	136|169	0:51 31989:5 2109625:7 0:39 |:| 0:5 74033:2 31989:5 1077935:1 31989:7 0:7 60890:2 0:105 2109625:1
C	MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:19558:2111	119045	251|133	0:18 1224:9 2:5 119045:4 0:181 |:| 0:99

This information may be confusing; let's take out our cheat sheet to understand some of its components:

Column Description
C Classified or unclassified
MISEQ-LAB244-W7:156:000000000-A80CV:1:1101:15697:2078 Sequence header (read ID)
2219696 Tax ID assigned to the sequence
250|120 Read lengths of the forward and reverse reads
0:28 350054:5 1224:2 0:1 2:5 0:77 2219696:5 0:93 |:| 379:4 0:82 k-mers hit to a taxonomic ID, e.g. tax ID 350054 has 5 hits, tax ID 1224 has 2 hits, etc.
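
For instance, a quick way to tally how many reads were classified (C) and unclassified (U) in the whole file is a sketch like this:

$ cut -f1 ~/dc_workshop/taxonomy/JP4D.kraken | sort | uniq -c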

As we can see, the kraken file is not very readable. So let’s look at the report file:

Reading a Kraken report

  1. Percentage of reads covered by the clade rooted at this taxon
  2. Number of reads covered by the clade rooted at this taxon
  3. Number of reads assigned directly to this taxon
  4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply ‘-‘.
  5. NCBI taxonomy ID
  6. Indented scientific name

$ head ~/dc_workshop/taxonomy/JP4D.report
 78.13	587119	587119	U	0	unclassified
 21.87	164308	1166	R	1	root
 21.64	162584	0	R1	131567	  cellular organisms
 21.64	162584	3225	D	2	    Bacteria
 18.21	136871	3411	P	1224	      Proteobacteria
 14.21	106746	3663	C	28211	        Alphaproteobacteria
  7.71	57950	21	O	204455	          Rhodobacterales
  7.66	57527	6551	F	31989	            Rhodobacteraceae
  1.23	9235	420	G	1060	              Rhodobacter
  0.76	5733	4446	S	1063	                Rhodobacter sphaeroides
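
Since the report is a plain tab-separated file, standard command-line tools can pull specific ranks out of it. For example, a small sketch that lists the most abundant phyla (the rank code is in column 4):

$ awk -F '\t' '$4 == "P"' ~/dc_workshop/taxonomy/JP4D.report | sort -rn | head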

Taxonomic assignment of the contigs of a MAG

We now have the taxonomic identity of the reads of the whole metagenome, but we do not know to which taxa our MAGs correspond. For this we have to make the taxonomic assignment with their contigs instead of their reads, because we do not have the reads that correspond to a MAG separated from the reads of the entire sample.

For this, the kraken2 command is a little different; here we can look at the command for the JP4D.001.fasta MAG:

$ mkdir TAXONOMY_MAG
$ kraken2 --db kraken-db --threads 12 JP4D.001.fasta --output TAXONOMY_MAG/JP4D.001.kraken --report TAXONOMY_MAG/JP4D.001.report

The results of this run are precomputed in the ~/dc_workshop/taxonomy/mags_taxonomy/ directory:

$ cd ~/dc_workshop/taxonomy/mags_taxonomy
$ ls
JP4D.001.kraken
JP4D.001.report
$ more ~/dc_workshop/taxonomy/mags_taxonomy/JP4D.001.report
 50.96	955	955	U	0	unclassified
 49.04	919	1	R	1	root
 48.83	915	0	R1	131567	  cellular organisms
 48.83	915	16	D	2	    Bacteria
 44.40	832	52	P	1224	      Proteobacteria
 19.37	363	16	C	28216	        Betaproteobacteria
 16.22	304	17	O	80840	          Burkholderiales
  5.66	106	12	F	506	            Alcaligenaceae
  2.72	51	3	G	517	              Bordetella
  1.12	21	21	S	2163011	                Bordetella sp. HZ20
  .
  .
  .

By looking at the report, we can see that half of the contigs are unclassified, and that a very small proportion of contigs have been assigned an OTU. This is weird because we expected to have only one genome in the bin.

Just to exemplify what the report of a complete and non-contaminated MAG should look like, let's look at the report of this MAG from another study:

100.00	108	0	R	1	root
100.00	108	0	R1	131567	  cellular organisms
100.00	108	0	D	2	    Bacteria
100.00	108	0	P	1224	      Proteobacteria
100.00	108	0	C	28211	        Alphaproteobacteria
100.00	108	0	O	356	          Rhizobiales
100.00	108	0	F	41294	            Bradyrhizobiaceae
100.00	108	0	G	374	              Bradyrhizobium
100.00	108	108	S	2057741	                Bradyrhizobium sp. SK17

Discussion:

Why do you think we found so many OTUs in this bin?

Visualization of taxonomic assignment results

After we have the taxonomic assignment, what follows is the visualization of our results. Krona is a hierarchical data visualization software. Krona allows data to be explored with zooming, multi-layered pie charts, and includes support for several bioinformatics tools and raw data formats. To use Krona with our results, let's go first into our taxonomy directory, which contains the pre-calculated Kraken outputs.

Krona

With Krona we will explore the taxonomy of the JP4D.001 MAG.

$ cd ~/dc_workshop/taxonomy/mags_taxonomy

Krona is called with the ktImportTaxonomy command, which needs an input and an output file.
In our case, we will create the input file with columns two and three from the JP4D.001.kraken file.

$ cut -f2,3 JP4D.001.kraken > JP4D.001.krona.input

Now we run Krona on our JP4D.001.krona.input file and save the results in JP4D.001.krona.out.html.

$ ktImportTaxonomy JP4D.001.krona.input -o JP4D.001.krona.out.html
Loading taxonomy...
Importing JP4D.001.krona.input...
   [ WARNING ]  The following taxonomy IDs were not found in the local database and were set to root
                (if they were recently added to NCBI, use updateTaxonomy.sh to update the local
                database): 1804984 2109625 2259134

And finally, open another terminal on your local computer, download the Krona output, and open it in a browser.

$ scp dcuser@ec2-3-235-238-92.compute-1.amazonaws.com:~/dc_workshop/taxonomy/mags_taxonomy/JP4D.001.krona.out.html . 

You will see a page like this:

Krona displays a circle-shaped bacterial taxonomy plot with abundance percentages of each taxon

Exercise 2: Exploring Krona visualization

Try double clicking on the segment of the pie chart that represents Bacteria and see what happens. What percentage of bacteria is represented by the genus Paracoccus?

Hint: There is a search box in the top left corner of the window.

Solution

2% of Bacteria corresponds to the genus Paracoccus in this sample. In the top right of the window we see little pie charts that change whenever we change the visualization to expand certain taxa.

Pavian

Pavian is another visualization tool that allows comparison between multiple samples. Pavian needs to be installed locally and requires R and Shiny, but we can try the Pavian demo website to visualize our results.

First we need to download the files needed as inputs in Pavian; this time we will visualize the assignation of the reads of both samples: JC1A.report and JP4D.report.
These files correspond to our Kraken reports. Again, on our local machine, let's use the scp command.

$ scp dcuser@ec2-3-235-238-92.compute-1.amazonaws.com:~/dc_workshop/taxonomy/*report . 

We go to the Pavian demo website, click on Browse, and choose our reports. You need to select both reports at the same time.

Pavian website showing the upload of two reports

We click on the Results Overview tab.

Results Overview tab of the Pavian website where it shows the number of reads classified to several categories for the two samples

We click on the Sample tab.

Sankey type visualization that shows the abundance of each taxonomic label in a tree-like manner

We can look at the abundance of a specific taxon by clicking on it.

A bar chart of the abundance of reads of the two samples, showing a segment for the reads identified at the specific taxon and another segment for the number of reads identified at children of the specified taxon

We can look at a comparison of both our samples in the Comparison tab.

A table of the same format as the Kraken report but for both samples at once.

Discussion: Taxonomic level of assignment

What do you think is harder to assign, a species (like E. coli) or a phylum (like Proteobacteria)?

Key Points

  • A database with previously gathered knowledge (genomes) is needed for taxonomic assignment.

  • Taxonomic assignment can be done using Kraken.

  • Krona and Pavian are web-based tools to visualize the assigned taxa.


Exploring Taxonomy with R

Overview

Teaching: 20 min
Exercises: 5 min
Questions
  • How can I use my taxonomic assignment results to make analyses?

Objectives
  • Comprehend which libraries are required for analysis of the taxonomy of metagenomes.

  • Create and manage a Phyloseq object.

Creating lineage and rank tables

In this episode we will use RStudio to analyze our microbial samples; you don't have to install anything, as you already have an instance on the cloud ready to be used.

Packages like QIIME 2, MEGAN, Vegan, or Phyloseq in R allow us to analyze diversity and abundance by manipulating taxonomic-assignment data. In this lesson, we will use Phyloseq. In order to do so, we need to generate an abundance matrix from the Kraken output files. One program widely used for this purpose is kraken-biom.

To do this we could go to our now familiar Bash terminal, but RStudio has an integrated terminal that uses the same language as the one we learned in the Command-line lessons, so let’s take advantage of it. Let’s open RStudio and go to the Terminal tab in the bottom left panel.

Kraken-biom

Kraken-biom is a program that creates BIOM tables from the Kraken output.

In order to run Kraken-biom, we have to move to the folder where our taxonomic-data files are located:

$ cd /home/dcuser/dc_workshop/taxonomy

First, we will visualize the contents of our directory with the ls command.

$ ls
JC1A.kraken  JC1A.report  JP41.report  JP4D.kraken  JP4D.report  mags_taxonomy

The kraken-biom program is installed inside our metagenomics environment, so let’s activate it.

$ conda activate metagenomics 

Let’s take a look at the different flags that kraken-biom has:

$ kraken-biom -h                  
usage: kraken-biom [-h] [--max {D,P,C,O,F,G,S}] [--min {D,P,C,O,F,G,S}]
                   [-o OUTPUT_FP] [--otu_fp OTU_FP] [--fmt {hdf5,json,tsv}]
                   [--gzip] [--version] [-v]
                   kraken_reports [kraken_reports ...]

Create BIOM-format tables (http://biom-format.org) from Kraken output
(http://ccb.jhu.edu/software/kraken/).
.
.
.

From a close look at the first output lines, it is noticeable that we need a specific output from Kraken: the .report files.

With the next command, we are going to create a table in Biom format called cuatroc.biom. We will include the two samples we have been working with (JC1A and JP4D) and a third one (JP41), to be able to perform certain analyses later on.

$ kraken-biom JC1A.report JP4D.report JP41.report --fmt json -o cuatroc.biom

If we inspect our folder, we will see that the cuatroc.biom file has been created. This is a biom object, which contains both the abundance and the ID (a number) of each OTU.
With this result, we are ready to return to RStudio's console and begin to manipulate our taxonomic data.

Command line prompts

Note that you can distinguish the Bash terminal from the R console by looking at the prompt: in Bash it is the $ sign, and in R it is the > sign.

Creating and manipulating Phyloseq objects

Load required packages

Phyloseq is a library with tools to analyze and plot the taxonomic assignment and abundance information of your metagenomics samples. Let’s install phyloseq (This instruction might not work on certain versions of R) and other libraries required for its execution:

> if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

> BiocManager::install("phyloseq") # Install phyloseq

> install.packages(c("RColorBrewer", "patchwork")) # install RColorBrewer for color palettes and patchwork to combine publication-quality plots

Once the libraries are installed, we must make them available for this R session. Now load the libraries (a process needed every time we begin a new work session in R):

> library("phyloseq")
> library("ggplot2")
> library("RColorBrewer")
> library("patchwork")

Creating the phyloseq object

First, we tell R in which directory we are working.

> setwd("~/dc_workshop/taxonomy/")

Let’s proceed to create the phyloseq object with the import_biom command:

> merged_metagenomes <- import_biom("cuatroc.biom")

Now, we can inspect the result by asking the class of the object created, and doing a close inspection of some of its content:

> class(merged_metagenomes)
[1] "phyloseq"
attr(,"package")
[1] "phyloseq"

The class() command indicates that we already have our phyloseq object.

Exploring the taxonomic labels

Let’s try to access the data that is stored inside our merged_metagenomes object. Since a phyloseq object is a special object in R, we need to use the operator @ to explore the subsections of data inside merged_metagenomes. If we type merged_metagenomes@ five options are displayed; tax_table and otu_table are the ones that we will use. After writing merged_metagenomes@otu_table or merged_metagenomes@tax_table, an option of .Data will be the one chosen in both cases. Let’s see what is inside of our tax_table:

> View(merged_metagenomes@tax_table@.Data)

A table where the taxonomic identification information of all OTUs is displayed. Each row represents one OTU and the columns represent its identification at different levels in the taxonomic classification ranks, beginning with Kingdom until we reach Species in the seventh column. Figure 1. Table of the taxonomic labels from our merged_metagenomes object.

Here we can see that the tax_table inside our phyloseq object stores all the taxonomic labels that correspond to each OTU. OTUs are identified by a number in the row names of the table.

Next, let's get rid of some of the unnecessary characters in the taxonomic labels and put names to the taxonomic ranks:

To remove the unnecessary characters in .Data (a matrix), we are going to use the command substring(). This command is useful to extract or replace characters in a vector. To use the command, we have to indicate the vector (x) followed by the position of the first element to extract (first) and the position of the last element to extract (last), like this: substring(x, first, last). substring() is a "flexible" command, especially for selecting character runs of different lengths, as in our case: if "last" is not indicated, it takes up to the last position of the character string by default. Considering that a matrix is an arrangement of vectors, we can use this command on it. Each entry in .Data is preceded by three characters, a letter and two underscores, for example o__Rhodobacterales. In this case, "Rhodobacterales" starts at position 4 with an R. So to remove the unnecessary characters, we will use the following code:
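
As a quick toy example of how substring() behaves before applying it to the whole table:

> substring("o__Rhodobacterales", 4)
[1] "Rhodobacterales"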

> merged_metagenomes@tax_table@.Data <- substring(merged_metagenomes@tax_table@.Data, 4)
> colnames(merged_metagenomes@tax_table@.Data)<- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

The same table we saw in Figure 1 but with informative names in each of the columns. Now, we can see which of the columns are associated with which taxonomic classification rank Figure 2. Table of the taxonomic labels from our merged_metagenomes object with corrections.

To explore how many phyla we have, we are going to use a command named unique(). Let’s see the result we obtain from the next code:

> unique(merged_metagenomes@tax_table@.Data[,"Phylum"])
 [1] "Proteobacteria"              "Actinobacteria"              "Firmicutes"                 
 [4] "Cyanobacteria"               "Deinococcus-Thermus"         "Chloroflexi"                
 [7] "Armatimonadetes"             "Bacteroidetes"               "Chlorobi"                   
[10] "Gemmatimonadetes"            "Planctomycetes"              "Verrucomicrobia"            
[13] "Lentisphaerae"               "Kiritimatiellaeota"          "Chlamydiae"                 
[16] "Acidobacteria"               "Spirochaetes"                "Synergistetes"              
[19] "Nitrospirae"                 "Tenericutes"                 "Coprothermobacterota"       
[22] "Ignavibacteriae"             "Candidatus Cloacimonetes"    "Fibrobacteres"              
[25] "Fusobacteria"                "Thermotogae"                 "Aquificae"                  
[28] "Thermodesulfobacteria"       "Deferribacteres"             "Chrysiogenetes"             
[31] "Calditrichaeota"             "Elusimicrobia"               "Caldiserica"                
[34] "Candidatus Saccharibacteria" "Dictyoglomi" 

This is useful, but what do we need to do if we want to know how many of our OTUs have been assigned to the phylum Firmicutes? Let's use the command sum() to ask R:

> sum(merged_metagenomes@tax_table@.Data[,"Phylum"] == "Firmicutes")
[1] 580

Now, to know which taxa of a certain rank are present in that particular phylum, we can ask phyloseq as well.

> unique(merged_metagenomes@tax_table@.Data[merged_metagenomes@tax_table@.Data[,"Phylum"] == "Firmicutes", "Class"])
[1] "Bacilli"          "Clostridia"       "Negativicutes"    "Limnochordia"     "Erysipelotrichia" "Tissierellia" 

Exercise 1: Explore a phylum

Go into groups and choose one phylum that is interesting for your group, and use the learned code to find out how many OTUs have been assigned to your chosen phylum and what the unique names of the genera inside it are. がんばれ! (ganbatte; good luck):

Solution

Change the phylum name wherever it is needed, along with the name of the rank that we are asking for, to get the result. As an example, here is the solution for Proteobacteria:

> sum(merged_metagenomes@tax_table@.Data[,"Phylum"] == "Proteobacteria")
> unique(merged_metagenomes@tax_table@.Data[merged_metagenomes@tax_table@.Data[,"Phylum"] == "Proteobacteria", "Genus"])

Exploring the abundance table

Until now we have looked at the part of the phyloseq object that stores the information about the taxonomy (at all the possible levels) of each OTU found in our samples. But there is also a part of the phyloseq object that stores the information about how many sequenced reads corresponding to a certain OTU there are in each sample. This table is the otu_table.

> View(merged_metagenomes@otu_table@.Data)

A table where the abundance of each OTU in each sample is shown. Each row represents one OTU and the columns represent the samples; in the intersection there is a number that indicates how many sequenced reads of that OTU are present in that sample. Figure 3. Table of the abundance of reads in the merged_metagenomes object.

We will take advantage of this information later on in our analyses.

Exercise 2: Searching for the read counts

Using the information from both the tax_table and the otu_table, find how many reads there are for any species of your interest (one that can be found in the tax_table).
Hint: Remember that you can access the contents of a data frame with the ["row_name","column_name"] syntax.
がんばれ! (ganbatte; good luck):

Solution

Go to the tax_table:

> View(merged_metagenomes@tax_table@.Data)

Take note of the OTU number for some species: The OTU number is in the leftmost space of the table as a row name for the searched species. Figure 4. The row of the tax_table corresponding to the species Paracoccus zhejiangensis.

Search for the row of the otu_table that has the row name that you chose.

> merged_metagenomes@otu_table@.Data["1077935",]
JC1A JP4D JP41 
  42  782  257 

Phyloseq objects

Finally, we can review our object and see that all datasets (i.e., JC1A, JP4D, and JP41) are in the object. If you look at our phyloseq object, you will see that there are more data types that we can use to build our object (see ?phyloseq), such as a phylogenetic tree and metadata concerning our samples. These are optional, so we will use our basic phyloseq object for now, composed of the abundances of specific OTUs and the names of those OTUs.

Key Points

  • kraken-biom formats Kraken output files from several samples into a single .biom file that will be the phyloseq input.

  • The library phyloseq manages metagenomics objects and computes analyses.

  • A phyloseq object stores a table with the taxonomic information of each OTU and a table with the abundance of each OTU.


Diversity Tackled With R

Overview

Teaching: 40 min
Exercises: 10 min
Questions
  • How can we measure diversity?

  • How can I use R to analyze diversity?

Objectives
  • Plot alpha and beta diversity.

Look at your fingers: controlled by the mind, they can do great things. But imagine if each one had a little brain of its own, with different ideas, desires, and fears. How wonderful the things made by an artist with such hands would be! -Ode to multidisciplinarity

First plunge into diversity

Species diversity, in its most simple definition, is the number of species in a particular area and their relative abundance (evenness). Once we know the taxonomic composition of our metagenomes, we can do diversity analyses. Here we will talk about the two most used diversity metrics, α diversity (within one metagenome) and β (across metagenomes).

Alpha diversity diagram: In lake A, we have three fishes, each one of a different species. On lake B, we have two fishes each one of a different species. And in lake C we have four fishes, each one of different species. Figure 1. Alpha diversity is represented by fishes in a pond. Here, alpha diversity is represented in its simplest way: Richness.

In the next example, we will look at the α and the β components of the diversity of a dataset of fishes in three lakes. The simplest way to calculate the β-diversity is to count the species that are distinct between two lakes (sites). Let's take Lake A and Lake B as an example. The number of species in Lake A is 3; from this quantity, we subtract the number of these species that are shared with Lake B: 2. So the number of species unique to Lake A compared to Lake B is (3-2) = 1. To this number, we add the result of the same operation with Lake B as our site of reference. In the end, the β-diversity between Lake A and Lake B is (3-2) + (3-2) = 2. This process can be repeated, taking each pair of lakes as the focal sites.
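
We can reproduce this small calculation in R with made-up species lists (the names are hypothetical, chosen only for illustration):

> lake_A <- c("perch", "pike", "carp")   # three species, two shared with lake B
> lake_B <- c("perch", "pike", "trout")
> length(setdiff(lake_A, lake_B)) + length(setdiff(lake_B, lake_A))
[1] 2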

 Alpha and Beta diversity diagram: Each lake has a different number of species and each species has a different number of fish individuals. Both metrics are taken into account to measure alfa and beta diversity. Figure 2. Alpha and Beta diversity represented by fishes in a pond.

If you want to read more about diversity, we recommend this paper on the concept of diversity.

α diversity

Diversity Indices Description
Shannon (H) Estimation of species richness and species evenness. More weight on richness.
Simpson's (D) Estimation of species richness and species evenness. More weight on evenness.
Chao1 Abundance based on species represented by a single individual (singletons) and two individuals (doubletons).

Shannon index: $ H = - \sum_{i=1}^{S} p_{i} \ln{p_{i}} $
Variable Definition
$ S $ Number of OTUs
$ p_{i} $ The proportion of the community represented by OTU i

Simpson's index: $ D = \frac{1}{\sum_{i=1}^{S} p_{i}^{2}} $
Variable Definition
$ S $ Total number of species in the community
$ p_{i} $ The proportion of the community represented by OTU i

Chao1 index: $ S_{chao1} = S_{Obs} + \frac{F_{1} \times (F_{1} - 1)}{2 \times (F_{2} + 1)} $
Variable Definition
$ F_{1}, F_{2} $ Count of singletons and doubletons, respectively
$ S_{Obs} $ The number of observed species
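
To make these formulas concrete, here is a minimal sketch of the Shannon index computed by hand in R, with made-up OTU counts:

> counts <- c(10, 5, 2, 1)   # hypothetical OTU counts
> p <- counts / sum(counts)  # proportion of the community represented by each OTU
> -sum(p * log(p))           # Shannon index H
[1] 1.087075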

β diversity

β diversity measures how different two or more communities are, either in their composition (richness) or in the abundance of the organisms that compose them (abundance).

There are different ways to plot and show the results of such analysis. Among others, PCA, PCoA, or NMDS analysis are widely used.

Exercise 1: Simple measures of alpha and beta diversities.

In the next picture there are two lakes with different fish species: in lake A, we have four different species; two of these species have three specimens each, a third species has two specimens, and a fourth species has only one specimen. We have nine fish in total. Lake B has only three different species: the most populated species, which is also present in lake A, has five specimens, and we have only one specimen of each of the other two species. We have seven fish in total in lake B. Figure 3.

Which of the options below is true for the alpha diversity in lake A, lake B, and beta diversity between lakes A and B, respectively?

  1. 4, 3, 1
  2. 4, 3, 5
  3. 9, 7, 16

Please, paste your result on the collaborative document provided by instructors. Hic Sunt Leones! (Here be Lions!)

Solution

Answer: 4, 3, 5

Plot alpha diversity

We want to know what the bacterial diversity is, so we will prune all of the non-bacterial organisms in our merged_metagenomes phyloseq object. To do this, we will make a subset of all bacterial groups and save them.

> merged_metagenomes <- subset_taxa(merged_metagenomes, Kingdom == "Bacteria")

Now let’s look at some statistics of our metagenomes:

> merged_metagenomes
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4024 taxa and 3 samples ]
tax_table()   Taxonomy Table:    [ 4024 taxa by 7 taxonomic ranks ]
> sample_sums(merged_metagenomes)
  JC1A   JP4D   JP41 
 18412 149590  76589 
> summary(merged_metagenomes@otu_table@.Data)
      JC1A              JP4D              JP41        
 Min.   :  0.000   Min.   :   0.00   Min.   :   0.00  
 1st Qu.:  0.000   1st Qu.:   3.00   1st Qu.:   1.00  
 Median :  0.000   Median :   7.00   Median :   5.00  
 Mean   :  4.575   Mean   :  37.17   Mean   :  19.03  
 3rd Qu.:  2.000   3rd Qu.:  21.00   3rd Qu.:  14.00  
 Max.   :399.000   Max.   :6551.00   Max.   :1994.00  

From the output of the sample_sums() command, we can see how many reads there are in each library. Also, the Max, Min, and Mean values in the summary() output can give us an idea of the evenness. Nevertheless, to have a more visual representation of the diversity inside the samples (i.e., α diversity), we can now look at a graph created using Phyloseq:

> plot_richness(physeq = merged_metagenomes, 
              measures = c("Observed","Chao1","Shannon")) 

A figure divided in three    panels. Each of these panels represents a different alpha diversity index.    Inside this section, each point represents the value assigned on this index to    the three different samples.The different indexes give    different values to the same sample. Figure 4. Alpha diversity indexes for both samples.

Each of these metrics can give an insight into the distribution of the OTUs inside our samples. For example, the Chao1 diversity index gives more weight to singletons and doubletons observed in our samples, while Shannon is an entropy index that reflects how hard it is to predict the OTU identity of a read taken at random from the metagenome "bag".
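
If we also want the numeric values behind these plots, phyloseq can compute them with estimate_richness():

> estimate_richness(merged_metagenomes, measures = c("Observed", "Chao1", "Shannon"))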

Exercise 2: Exploring function flags.

Using the help provided, explore these options available for the plot_richness() function:

  1. nrow
  2. sortby
  3. title

Use these options to generate new figures that show you other ways to present the data.

Solution

The code and the plot using the three options will look as follows: The “title” option adds a title to the figure.

> plot_richness(physeq = merged_metagenomes, 
             title = "Alpha diversity indexes for three samples from Cuatro Cienegas",
             measures = c("Observed","Chao1","Shannon"))

Figure 5. Alpha diversity plot with title.

The “nrow” option arranges the panels in rows; with nrow=3, each index is displayed in its own row.

> plot_richness(physeq = merged_metagenomes,
             measures = c("Observed","Chao1","Shannon"),
             nrow=3)

Figure 6. Alpha diversity plot with the three panels arranged in rows.

The “sortby” option orders the samples from least to greatest diversity depending on the parameter. In this case, it is ordered by “Shannon” and tells us that the JP4D sample has the lowest diversity and the JP41 sample the highest.

> plot_richness(physeq = merged_metagenomes, 
             measures = c("Observed","Chao1","Shannon"),
             sortby = "Shannon") 

The same panels as before but now the samples are arranged horizontaly according to the values in the Shannon index panel. Figure 7. Samples sorted by Shannon in alpha diversity index plots.

Considering the above, together with the three graphs, we can say that samples JP41 and JP4D present a high diversity with respect to JC1A, but that the diversity of sample JP41 is mainly given by singletons or doubletons, whereas the diversity of JP4D is given by species in much greater abundance. Moreover, Shannon (H) values above 3 are generally considered to indicate high diversity.

Discussion

How much can the α diversity be changed by eliminating the singletons and doubletons?

Absolute and relative abundances

From the read counts that we just saw, it is evident that there is a great difference in the number of total sequenced reads in each sample. Before we further process our data, let's take a look at whether we have any non-identified reads, marked as blank (i.e., "") at the different taxonomic levels:

> summary(merged_metagenomes@tax_table@.Data== "")
  Kingdom          Phylum          Class           Order           Family          Genus          Species       
 Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:4024      FALSE:4024      FALSE:3886      FALSE:4015      FALSE:3967      FALSE:3866      FALSE:3540     
                                 TRUE :138       TRUE :9         TRUE :57        TRUE :158       TRUE :484      

With the command above, we can see that there are blanks on different taxonomic levels. Although we could expect to see some blanks at the species, or even at the genus level, we will get rid of the ones at the genus level to proceed with the analysis:

> merged_metagenomes <- subset_taxa(merged_metagenomes, Genus != "")
> summary(merged_metagenomes@tax_table@.Data== "")
  Kingdom          Phylum          Class           Order           Family          Genus          Species       
 Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:3866      FALSE:3866      FALSE:3739      FALSE:3860      FALSE:3858      FALSE:3866      FALSE:3527     
                                 TRUE :127       TRUE :6         TRUE :8                         TRUE :339 

Next, since our metagenomes have different sizes, it is imperative to convert the number of assigned reads (i.e. absolute abundance) into percentages (i.e. relative abundances) to be able to compare them.

Right now our OTU table looks like this:

> head(merged_metagenomes@otu_table@.Data)
        JC1A JP4D JP41
1060      32  420   84
1063     316 5733 1212
2033869  135 1232  146
1850250  114  846  538
1061      42 1004  355
265       42  975  205

To make this transformation to percentages we will take advantage of a function of Phyloseq.

> percentages <- transform_sample_counts(merged_metagenomes, function(x) x*100 / sum(x) )
> head(percentages@otu_table@.Data)
             JC1A      JP4D      JP41
1060    0.1877383 0.3065134 0.1179709
1063    1.8539161 4.1839080 1.7021516
2033869 0.7920211 0.8991060 0.2050447
1850250 0.6688178 0.6174056 0.7555755
1061    0.2464066 0.7327130 0.4985675
265     0.2464066 0.7115490 0.2879052
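
We can verify the transformation: after it, the reads of each sample should add up to 100.

> sample_sums(percentages)
JC1A JP4D JP41 
 100  100  100 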

Beta diversity

As we mentioned before, beta diversity is a measure of how alike or different our samples are (the overlap between discretely defined sets of species or operational taxonomic units). In order to measure this, we need to calculate an index that suits the objectives of our research. With the next code, we can display all the possible distance metrics that Phyloseq can use:

> distanceMethodList
$UniFrac
[1] "unifrac"  "wunifrac"

$DPCoA
[1] "dpcoa"

$JSD
[1] "jsd"

$vegdist
 [1] "manhattan"  "euclidean"  "canberra"   "bray"       "kulczynski" "jaccard"    "gower"     
 [8] "altGower"   "morisita"   "horn"       "mountford"  "raup"       "binomial"   "chao"      
[15] "cao"       

$betadiver
 [1] "w"   "-1"  "c"   "wb"  "r"   "I"   "e"   "t"   "me"  "j"   "sor" "m"   "-2"  "co"  "cc"  "g"  
[17] "-3"  "l"   "19"  "hk"  "rlb" "sim" "gl"  "z"  

$dist
[1] "maximum"   "binary"    "minkowski"

$designdist
[1] "ANY"

Describing all these possible distance metrics is beyond the scope of this lesson, but note that some of them (unifrac, wunifrac, and dpcoa) need a phylogenetic relationship between the species/OTUs present in our samples.

We do not have a phylogenetic tree or phylogenetic relationships, so we cannot use any of those three. We will use Bray-Curtis, since it is one of the most robust and widely used distance metrics to calculate beta diversity.

Let’s keep this up! We already have all that we need to begin the beta diversity analysis. We will use the Phyloseq command ordinate to generate a new object where the distances between our samples will be allocated after they are calculated. For this command, we need to specify which method we will use to generate a matrix. In this example, we will use Non-Metric Multidimensional Scaling or NMDS. NMDS attempts to represent the pairwise dissimilarity between objects in a low-dimensional space, in this case, a two-dimensional plot.

> meta_ord <- ordinate(physeq = percentages, method = "NMDS", 
                     distance = "bray")

If you get some warning messages after running this script, fear not. This is because we only have three samples; this makes the algorithm display a warning concerning the lack of difficulty in generating the distance matrix.

By now, we just need the command plot_ordination(), to see the results from our beta diversity analysis:

> plot_ordination(physeq = percentages, ordination = meta_ord)

Plot with NMDS1 as label in x-axis that goes from -0.4 to 0.2 and NMDS2 in y-axis that goes from -0.2 to 0.1. There are three dots in the plot that are not clustered in any way. Figure 8. Beta diversity with NMDS of our three samples.

In this NMDS plot, each of the points represents the combined abundance of all its OTUs. As depicted, each of the samples occupies its own space in the plot without forming any clusters. This is because each sample is different enough to be considered its own point in the NMDS space.

Key Points

  • Alpha diversity measures the intra-sample diversity

  • Beta diversity measures the inter-sample diversity

  • Phyloseq includes diversity analyses such as alpha and beta diversity calculation.


Taxonomic Analysis with R

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can we know which taxa are in our samples?

  • How can we compare depth-contrasting samples?

  • How can we manipulate our data to deliver a message?

Objectives
  • Manipulate data types inside your phyloseq object

  • Extract specific information from taxonomic-assignation data

Explore our samples at specific taxonomic levels

With the taxonomic assignment information that we obtained from Kraken we have measured diversity, and we have visualized the taxa inside each sample with Krona and Pavian, but Phyloseq allows us to make this visualization in a more flexible and personalized manner. So now we will take advantage of Phyloseq to make abundance plots of the taxa in our samples.

We will start our exploration at the Phylum level. In order to group all the OTUs that have the same taxonomy at a certain taxonomic rank, we will use the function tax_glom().

> percentages_glom <- tax_glom(percentages, taxrank = 'Phylum')
> View(percentages_glom@tax_table@.Data)

Table containing the    taxonomic information of each of the OTUs inside the three samples. Here,    we can see how only the Phylum column has information, leaving the other    taxonomic levels blank. Figure 1. Taxonomic-data table after agglomeration at the phylum level.

Another phyloseq function is psmelt(), which melts phyloseq objects into a data.frame to manipulate them with packages like ggplot2 and vegan.

> percentages_df <- psmelt(percentages_glom)
> str(percentages_df)
'data.frame': 99 obs. of  5 variables:
 $ OTU      : chr  "1063" "1063" "1063" "2350" ...
 $ Sample   : chr  "JP4D" "JC1A" "JP41" "JP41" ...
 $ Abundance: num  85 73.5 58.7 23.8 19.1 ...
 $ Kingdom  : chr  "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
 $ Phylum   : chr  "Proteobacteria" "Proteobacteria" "Proteobacteria" "Bacteroidetes" ...
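Since these are relative abundances, the values inside each sample should add up to about 100. We can verify this with base R's aggregate() (a minimal sketch):

> # Sum the relative abundances per sample; each total should be ~100
> aggregate(Abundance ~ Sample, data = percentages_df, FUN = sum)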

Now, let’s create another data frame with the original data. This will help us to compare the absolute with the relative abundance and have a more complete picture of our samples.

> absolute_glom <- tax_glom(physeq = merged_metagenomes, taxrank = "Phylum")
> absolute_df <- psmelt(absolute_glom)
> str(absolute_df)
'data.frame': 99 obs. of  5 variables:
 $ OTU      : chr  "1063" "1063" "2350" "1063" ...
 $ Sample   : chr  "JP4D" "JP41" "JP41" "JC1A" ...
 $ Abundance: num  116538 41798 16964 12524 9227 ...
 $ Kingdom  : chr  "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
 $ Phylum   : chr  "Proteobacteria" "Proteobacteria" "Bacteroidetes" "Proteobacteria" ...

With these objects and what we have learned regarding R data structures and ggplot2, we can proceed to compare them with a plot. First, let’s take some steps that will allow us to personalize our plot, making it accessible for people with color blindness. We will create a color palette. With colorRampPalette, we will choose 8 colors from the Dark2 palette and make a “ramp” with them, that is, convert those 8 colors into the number of colors needed to have one for each phylum in our data frame. For this, we need our Phylum column to be a factor.

> absolute_df$Phylum <- as.factor(absolute_df$Phylum)
> phylum_colors_abs <- colorRampPalette(brewer.pal(8, "Dark2"))(length(levels(absolute_df$Phylum)))
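If the double set of parentheses looks odd, remember that colorRampPalette() returns a function, which we then immediately call with the number of colors we need. A minimal sketch (brewer.pal() comes from the RColorBrewer package):

> # colorRampPalette() builds a function that interpolates the 8 Dark2
> # colors into however many colors we ask for
> library(RColorBrewer)
> make_colors <- colorRampPalette(brewer.pal(8, "Dark2"))
> make_colors(12)  # returns 12 interpolated hex color codes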

Now, let’s create the figure for the data with absolute abundances (i.e., the absolute_plot object):

> absolute_plot <- ggplot(data = absolute_df, aes(x = Sample, y = Abundance, fill = Phylum)) +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = phylum_colors_abs)

With the position="stack" argument, we are telling ggplot that the values must be stacked on top of each other for each of the samples. In this way, we will have all of our different categories (OTUs) stacked in one bar per sample and not each in a separate one. For more information, see position_stack.
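For comparison, a sketch of the alternative: position = "dodge" draws each phylum as its own bar next to the others instead of stacking them:

> # Same data, but with the phyla drawn side by side instead of stacked
> ggplot(data = absolute_df, aes(x = Sample, y = Abundance, fill = Phylum)) +
    geom_bar(stat = "identity", position = "dodge")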

Next, we will create the figure for the representation of the relative abundance data and ask RStudio to show us both plots side by side thanks to the | operator from the patchwork library:

> percentages_df$Phylum <- as.factor(percentages_df$Phylum)
> phylum_colors_rel <- colorRampPalette(brewer.pal(8, "Dark2"))(length(levels(percentages_df$Phylum)))
> relative_plot <- ggplot(data = percentages_df, aes(x = Sample, y = Abundance, fill = Phylum)) +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = phylum_colors_rel)
> absolute_plot | relative_plot

A two-part plot contrasting the absolute versus the relative abundance of the three samples. On the left side, we can see how each of the bars has its own height, making it difficult to compare the information between samples, whereas the right side shows three bars of the same height after the abundance was transformed to percentages inside each sample. Figure 2. Taxonomic diversity of absolute and relative abundance.

At a glance, we can see the difference between the two plots and how processing the data can enhance the display of important results. However, it is noticeable that we have too many taxa to adequately distinguish each color, except for the ones with the greatest abundance. In order to change that, we will use the power of data frames and R and change the label of the phyla whose relative abundance is less than 0.5%:

> percentages_df$Phylum <- as.character(percentages_df$Phylum) # Return the Phylum column to the character type
> percentages_df$Phylum[percentages_df$Abundance < 0.5] <- "Phyla < 0.5% abund."
> unique(percentages_df$Phylum)
[1] "Proteobacteria"      "Bacteroidetes"       "Actinobacteria"      "Firmicutes"          "Cyanobacteria"      
[6] "Planctomycetes"      "Verrucomicrobia"     "Phyla < 0.5% abund."

Let’s ask R to display the figures again by re-running our code:

> percentages_df$Phylum <- as.factor(percentages_df$Phylum)
> phylum_colors_rel <- colorRampPalette(brewer.pal(8, "Dark2"))(length(levels(percentages_df$Phylum)))
> relative_plot <- ggplot(data = percentages_df, aes(x = Sample, y = Abundance, fill = Phylum)) +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = phylum_colors_rel)
> absolute_plot | relative_plot

A new two-part plot with a reassignment of the low-abundance taxa on the right side. Compared to the left legend, the one on the right has fewer groups because of the process of reassigning the taxa with an abundance lower than 0.5% to just one group/color. Figure 3. Taxonomic diversity of absolute and relative abundance with corrections.

Exercise 1: Taxa agglomeration

Get into groups and agglomerate the taxa in the data with absolute abundances, so as to have a better visualization of the data. Remember to check the data classes inside your data frame. According to the RColorBrewer package, it is recommended not to have more than 9 different colors in a plot.

What is the correct order to run the next chunks of code? Compare your graphs with your partners’.

Hic Sunt Leones! (Here be Lions!):

A) absolute_df$Phylum <- as.factor(absolute_df$Phylum)

B) absolute_plot <- ggplot(data = absolute_df, aes(x = Sample, y = Abundance, fill = Phylum)) +
geom_bar(stat = "identity", position = "stack") +
scale_fill_manual(values = phylum_colors_abs)

C) absolute_df$Phylum[absolute_df$Abundance < 300] <- "Minority Phyla"

D) phylum_colors_abs<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(absolute_df$Phylum)))

E) absolute_df$Phylum <- as.character(absolute_df$Phylum)

Solution

By agglomerating the phyla that have fewer than 300 reads, we can get a more readable plot. Admittedly, choosing a threshold is difficult since each of our samples has a contrasting number of reads.

E) absolute_df$Phylum <- as.character(absolute_df$Phylum)

C) absolute_df$Phylum[absolute_df$Abundance < 300] <- "Minority Phyla"

A) absolute_df$Phylum <- as.factor(absolute_df$Phylum)

D) phylum_colors_abs<- colorRampPalette(brewer.pal(8,"Dark2")) (length(levels(absolute_df$Phylum)))

B) absolute_plot <- ggplot(data = absolute_df, aes(x = Sample, y = Abundance, fill = Phylum)) +
geom_bar(stat = "identity", position = "stack") +
scale_fill_manual(values = phylum_colors_abs)

Show your plots:

absolute_plot | relative_plot

A new reassignment of the low-abundance taxa on the left part of the plot. A new class, "Minority Phyla", has been created that contains the taxa with fewer than 300 reads.

Going further, let’s take an interesting lineage and explore it thoroughly

As we have already reviewed, Phyloseq offers a lot of tools to manage and explore data. Let’s take a deeper look at a function that we have already used, but now with a guided exploration. The subset_taxa command is used to extract specific lineages from a stated taxonomic level; we used it to get rid of the reads that do not belong to bacteria with merged_metagenomes <- subset_taxa(merged_metagenomes, Kingdom == "Bacteria").

We are going to use it now to extract a specific phylum from our data, and explore it at a lower taxonomic level: Genus. We will take as an example the phylum Cyanobacteria (certainly, this is a biased and arbitrary decision, but who does not feel attracted to these incredible microorganisms?):

> cyanos <- subset_taxa(merged_metagenomes, Phylum == "Cyanobacteria")
> unique(cyanos@tax_table@.Data[,2])
[1] "Cyanobacteria"

Let’s do a little review of all that we have seen today: transformation of the data, manipulation of the information, and plotting:

> cyanos_percentages <- transform_sample_counts(cyanos, function(x) x*100 / sum(x) )
> cyanos_glom <- tax_glom(cyanos_percentages, taxrank = "Genus")
> cyanos_df <- psmelt(cyanos_glom)
> cyanos_df$Genus[cyanos_df$Abundance < 10] <- "Genera < 10% abund"
> cyanos_df$Genus <- as.factor(cyanos_df$Genus)
> genus_colors_cyanos <- colorRampPalette(brewer.pal(8, "Dark2"))(length(levels(cyanos_df$Genus)))
> plot_cyanos <- ggplot(data = cyanos_df, aes(x = Sample, y = Abundance, fill = Genus)) +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = genus_colors_cyanos)
> plot_cyanos

A new plot with three bars representing the relative abundance of Cyanobacteria in each of the samples. Each of the colors represents a genus. Because we are seeing relative abundances, all the bars are of the same height. Figure 5. Diversity of Cyanobacteria at the genus level inside our samples.

Exercise 3: Recap of abundance plotting

Match each chunk of code with its description and put them in the correct order to create a relative abundance plot at the genus level of a particular phylum. がんばって! (ganbatte; good luck):

| Description | Command |
| ----------- | ------- |
| Plot the relative abundance at the genus level. | plot_proteo |
| Convert all the genera that have less than 3% of abundance into only one label. | proteo_percentages <- transform_sample_counts(proteo, function(x) x*100 / sum(x) ) |
| Make just one row that groups all the observations of the same genus. | proteo <- subset_taxa(merged_metagenomes, Phylum == "Proteobacteria") |
| Create a phyloseq object only with the reads assigned to a certain phylum. | unique(proteo@tax_table@.Data[,2]) |
| Show the plot. | proteo_glom <- tax_glom(proteo_percentages, taxrank = "Genus") |
| Transform the phyloseq object to a data frame. | plot_proteo <- ggplot(data = proteo_df, aes(x = Sample, y = Abundance, fill = Genus)) + geom_bar(stat = "identity", position = "stack") + scale_fill_manual(values = genus_colors_proteo) |
| Convert the Genus column into the factor structure. | proteo_df$Genus[proteo_df$Abundance < 3] <- "Genera < 3% abund" |
| Look at the phyla present in your phyloseq object. | proteo_df <- psmelt(proteo_glom) |
| Convert the abundance counts to relative abundance. | genus_colors_proteo <- colorRampPalette(brewer.pal(8, "Dark2"))(length(levels(proteo_df$Genus))) |
| Make a palette with the appropriate colors for the amount of genera. | proteo_df$Genus <- as.factor(proteo_df$Genus) |

Solution

# Create a phyloseq object only with the reads assigned to a certain phylum.
proteo <- subset_taxa(merged_metagenomes, Phylum == "Proteobacteria")
# Look at the phyla present in your phyloseq object.
unique(proteo@tax_table@.Data[,2])
# Convert the abundance counts to relative abundance.
proteo_percentages <- transform_sample_counts(proteo, function(x) x*100 / sum(x) )
# Make just one row that groups all the observations of the same genus.
proteo_glom <- tax_glom(proteo_percentages, taxrank = "Genus")
# Transform the phyloseq object to a data frame.
proteo_df <- psmelt(proteo_glom)
# Convert all the genera that have less than 3% of abundance into only one label.
proteo_df$Genus[proteo_df$Abundance < 3] <- "Genera < 3% abund"
# Convert the Genus column into the factor structure.
proteo_df$Genus <- as.factor(proteo_df$Genus)
# Make a palette with the appropriate colors for the amount of genera.
genus_colors_proteo <- colorRampPalette(brewer.pal(8, "Dark2"))(length(levels(proteo_df$Genus)))
# Plot the relative abundance at the genus level.
plot_proteo <- ggplot(data = proteo_df, aes(x = Sample, y = Abundance, fill = Genus)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = genus_colors_proteo)
# Show the plot.
plot_proteo

A new plot with three bars representing the relative abundance of Proteobacteria in each of the samples. Each of the colors represents a genus. Because we are seeing relative abundances, all the bars are of the same height.

Key Points

  • Depths and abundances can be visualized using phyloseq

  • The phyloseq library lets you manipulate metagenomic data from a taxonomy-specific perspective.


Other Resources

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Where are other metagenomic resources?

  • How can lessons be previewed?

Objectives
  • Know some places where to find more information.

Other resources

Cuatro Ciénegas

More information about Cuatro Ciénegas is available in the podcasts Mexican oasis in the desert and Mexican Oasis, in the Cuatro Ciénegas Video, and in the manuscript Cuatro Ciénegas.

Other tutorials and books

Now that you have finished this metagenomics lesson, you are ready to explore the universe of available tutorials on your own. The Phyloseq tutorial contains many more examples of metagenomic data manipulation.
The Computational Genomics tutorial by Schmeier carefully explains each step of the process. To learn more about the history of metagenomics, the Meren Lab blog has a wonderful entry called History of metagenomics, as well as several videos: Metapangenomics: A nexus between pangenomes and metagenomes, The power of metagenomic read recruitment, and Genome-resolved metagenomics: key concepts in reconstructing genomes from metagenomes. Finally, for Spanish speakers, the ISME course ISME Análisis de diversidad is a detailed tutorial for 16S metabarcoding.

The book Microbiomes shares a contemporary concept of microbiomes. The books Statistical Analysis of Microbiome Data and Statistical Analysis of Microbiome Data with R contain a state-of-the-art compendium of statistical and computational microbiome-analysis techniques beyond diversity analysis.

Other studies and databases

Some useful databases are JGI, The Earth Microbiome Project, MetaSUB, The Atlas of Soil Microbiome, and The Human Microbiome Project.

MG-RAST

A useful and easy-to-use resource is MG-RAST, which is at the same time a database and an online analysis tool. MG-RAST is an online metagenomics platform where you can upload your raw data with its corresponding metadata and get a full taxonomic analysis of it. MG-RAST is a great place to get started with this type of analysis, and it is also a big repository of available data for future experiments. On the downside, it is not possible to greatly modify the steps and parameters of the MG-RAST workflow, so there is not much room for implementing our preferred analysis tools.

The Cuatro Ciénegas data that we used in the workshop is already in MG-RAST! You can check it out here.

We can check the taxonomic distribution of our sample at different taxonomic levels.

The most abundant phylum is Proteobacteria.
Pie chart showing the relative abundance at phylum level, and the legend with the phylum names, read count and percentages.

Since we have a shotgun metagenome, we can also investigate the metabolic functions present in our sample. MG-RAST can find genes and annotate their function through an implementation of RAST, or Rapid Annotation using Subsystems Technology. By looking at the charts generated by this analysis, we see that most of the genes are dedicated to metabolism.

Pie chart showing the relative abundance of general functional categories, and the legend with the category names, read count and percentages.

Pie chart showing the relative abundance of specific functional categories, and the legend with the category names, read count and percentages.

MG-RAST has its own specific pipeline, so it is a very useful tool for having a quick look at your data, and also for storing and sharing it! But it does not keep you from performing your own personalized analysis, like the one we have just learned!

Discussion: Exploring more resources

Explore one of the suggested resources and discuss your findings with a neighbor.

Carpentries Philosophy

A good lesson should be so complete and clear that it becomes easy for any instructor to teach. Carpentries lessons are developed for the community, and now you are part of it. This lesson is still being developed, and we are sure that you can collaborate with us and help us improve it.

Key Points

  • Enjoy metagenomics.