This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

Metagenome Binning

Overview

Teaching: 50 min
Exercises: 10 min
Questions
  • How can we obtain the original genomes from a metagenome?

Objectives
  • Obtain Metagenome-Assembled Genomes from the metagenomic assembly.

  • Check the quality of the Metagenome-Assembled genomes.

Metagenomic binning

To analyze each of the species inside our sample individually, the original genomes in the sample can be separated with a process called binning. We call these genomes reconstructed from metagenomic assembly MAGs (Metagenome-Assembled Genomes). In this process, the assembled contigs from the metagenome will be assigned to different bins (FASTA files that contain certain contigs). Ideally, each bin corresponds to only one original genome (a MAG).

Diagram depicting the DNA sequences  in the original sample as circular chromosomes, then the DNA fragmented into reads, then assembled into contigs, and then binned

Although an obvious way to separate contigs that correspond to a different species is by their taxonomic assignation, there are more reliable methods that do the binning using characteristics of the contigs, such as their GC content, the use of tetranucleotides (composition), or their coverage (abundance).

Maxbin is a binning algorithm that distinguishes between contigs that belong to different bins according to their coverage levels and the tetranucleotide frequencies they have.

Let’s bin the sample we just assembled. The command for running MaxBin is run_MaxBin.pl, and the arguments it needs are the FASTA file of the assembly, the FASTQ with the forward and reverse reads, the output directory, and name.

$ cd ~/dc_workshop/results/assembly_JC1A
$ mkdir MAXBIN
$ run_MaxBin.pl -thread 8 -contig JC1A_contigs.fasta -reads ../../data/trimmed_fastq/JC1A_R1.trim.fastq.gz -reads2 ../../data/trimmed_fastq/JC1A_R2.trim.fastq.gz -out MAXBIN/JC1A &
MaxBin 2.2.7
Thread: 12
Input contig: JC1A_contigs.fasta
Located reads file [../../data/trimmed_fastq/JC1A_R1.trim.fastq.gz]
Located reads file [../../data/trimmed_fastq/JC1A_R2.trim.fastq.gz]
out header: MAXBIN/JC1A
Running Bowtie2 on reads file [../../data/trimmed_fastq/JC1A_R1.trim.fastq.gz]...this may take a while...
Reading SAM file to estimate abundance values...
Running Bowtie2 on reads file [../../data/trimmed_fastq/JC1A_R2.trim.fastq.gz]...this may take a while...
Reading SAM file to estimate abundance values...
Searching against 107 marker genes to find starting seed contigs for [JC1A_contigs.fasta]...
Running FragGeneScan....
Running HMMER hmmsearch....
Try harder to dig out marker genes from contigs.
Marker gene search reveals that the dataset cannot be binned (the medium of marker gene number <= 1). Program stop.

It seems that it is impossible to bin our assembly because the amount of marker genes is less than 1. We could have expected this as we know it is a small sample.

We will perform the binning process with the other sample from the same study that is a little larger. We have the assembly precomputed in the ~/dc-workshop/mags/ directory.

$ cd ~/dc_workshop/mags/
$ mkdir MAXBIN
$ run_MaxBin.pl -thread 8 -contig JP4D_contigs.fasta -reads ../data/trimmed_fastq/JP4D_R1.trim.fastq.gz -reads2 ../data/trimmed_fastq/JP4D_R2.trim.fastq.gz -out MAXBIN/JP4D &

It will take a few minutes to run. And it will finish with an output like this:

========== Job finished ==========
Yielded 4 bins for contig (scaffold) file JP4D_contigs.fasta

Here are the output files for this run.
Please refer to the README file for further details.

Summary file: MAXBIN/JP4D.summary
Genome abundance info file: MAXBIN/JP4D.abundance
Marker counts: MAXBIN/JP4D.marker
Marker genes for each bin: MAXBIN/JP4D.marker_of_each_gene.tar.gz
Bin files: MAXBIN/JP4D.001.fasta - MAXBIN/JP4D.004.fasta
Unbinned sequences: MAXBIN/JP4D.noclass

Store abundance information of reads file [../data/trimmed_fastq/JP4D_R1.trim.fastq.gz] in [MAXBIN/JP4D.abund1].
Store abundance information of reads file [../data/trimmed_fastq/JP4D_R2.trim.fastq.gz] in [MAXBIN/JP4D.abund2].


========== Elapsed Time ==========
0 hours 6 minutes and 56 seconds.

With the .summary file we can have a quick look at the bins that MaxBin produced.

$ cat MAXBIN/JP4D.summary
Bin name	Completeness	Genome size	GC content
JP4D.001.fasta	57.9%	3141556	55.5
JP4D.002.fasta	87.9%	6186438	67.3
JP4D.003.fasta	51.4%	3289972	48.1
JP4D.004.fasta	77.6%	5692657	38.9

Discussion: The quality of MAGs

Can we trust the quality of our bins only with the given information? What else do we want to know about our MAGs to confidently use them for further analysis?

Quality check

The quality of a MAG is highly dependent on the size of the genome of the species, its abundance in the community, and the depth at which we sequenced it. Two important things that can be measured to know its quality are completeness (is the MAG a complete genome?) and if it is contaminated (does the MAG contain only one genome?).

CheckM is a good program to see the quality of our MAGs. It gives a measure of the completeness and the contamination by counting marker genes in the MAGs. The lineage workflow that is a part of CheckM places your bins in a reference tree to know to which lineage it corresponds to and to use the appropriate marker genes to estimate the quality parameters. Unfortunately, the lineage workflow uses a lot of memory so it can’t run in our machines, but we can tell CheckM to use marker genes from Bacteria only, to spend less memory. This is a less accurate approach but it can also be very useful if you want all of your bins analyzed with the same markers.

We will run the taxonomy workflow specifying the use of markers at the domain level, specific for the rank Bacteria, we will specify that our bins are in FASTA format, that they are located in the MAXBIN directory and that we want our output in the CHECKM/ directory.

$ mkdir CHECKM
$ checkm taxonomy_wf domain Bacteria -x fasta MAXBIN/ CHECKM/ 

The run will end with our results printed in the console.

--------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id     Marker lineage   # genomes   # markers   # marker sets   0    1    2    3    4   5+   Completeness   Contamination   Strain heterogeneity  
--------------------------------------------------------------------------------------------------------------------------------------------------------
  JP4D.002      Bacteria         5449        104            58        3    34   40   21   5   1       94.83           76.99              11.19          
  JP4D.004      Bacteria         5449        104            58        12   40   46   6    0   0       87.30           51.64               3.12          
  JP4D.001      Bacteria         5449        104            58        24   65   11   3    1   0       70.48           13.09               0.00          
  JP4D.003      Bacteria         5449        104            58        44   49   11   0    0   0       64.44           10.27               0.00          
--------------------------------------------------------------------------------------------------------------------------------------------------------

To have these values in an output that is more usable and shearable we can now run the quality step of CheckM checkm qa and make it print the output in a TSV table, instead of the console. In this step, we can ask CheckM to give us more parameters, like contig number and length.

Ideally, we would like to get only one contig per bin, with a length similar the genome size of the corresponding taxa. Since this scenario is very difficult to obtain we can use parameters that show us how good is our assembly. Here are some of the most common metrics: If we arrange our contigs by size, from larger to smaller, and divide the whole sequence in half, N50 is the size of the smallest contig in the half that has the larger contigs; and L50 is the number of contigs in this half of the sequence. So we want big N50 and small L50 values for our genomes. Read What’s N50?.

To get the table with these extra parameters we need to specify the file of the markers that CheckM used in the previous step Bacteria.ms, the name of the output file we want quality_JP4D.tsv, that we want a table --tab_table, and the option number 2 -o 2 is to ask for the extra parameters printed on the table.

$  checkm qa CHECKM/Bacteria.ms CHECKM/ --file CHECKM/quality_JP4D.tsv --tab_table -o 2

The table we just made looks like this. This will be very useful when you need to document your work or communicate it.

The question of, how much contamination we can tolerate and how much completeness do we need, certainly depends on the scientific question being tackled, but in the CheckM paper, there are some parameters that we can follow.

Exercise 1: Discuss the quality of the obtained MAGs

Fill in the blanks to complete the code you need to download the quality_JP4D.tsv to your local computer:

____ dcuser____ec2-18-207-132-236.compute-1.amazonaws.com____/home/dcuser/dc_workshop/mags/CHECKM/quality_JP4D.tsv ____

Solution

In a terminal that is standing on your local computer do:

$ scp dcuser@ec2-18-207-132-236.compute-1.amazonaws.com:/home/dcuser/dc_workshop/mags/CHECKM/quality_JP4D.tsv <the destination directory of your choice>

Then open the table in a spreadsheet and discuss with your team which of the parameters in the table do you find useful.

Key Points

  • Metagenome-Assembled Genomes (MAGs) sometimes are obtained from curated contigs grouped into bins.

  • Use MAXBIN to assign the contigs to bins of different taxa.

  • Use ChekM to evaluate the quality of each Metagenomics-Assembled Genome.