This lesson is still being designed and assembled (Pre-Alpha version)

Evolutionary Genome Mining

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • What is Evolutionary Genome Mining?

  • Which kind of BGCs can EvoMining find?

  • What do I need in order to run an evolutionary genome mining analysis?

Objectives
  • Understand EvoMining pipeline.

  • Run an example of evolutionary analysis in cpsG gene family.

  • Explore MicroReact interactive output interface.

a) EvoMining expansion-and-recruitment pipeline. A group of grey stacked cylinders representing genomes in a database (DB).                                    	Homologues and expansions of seed enzymes, represented as an orange arrow, from the enzyme DB                                    	are searched by blastp in the genome DB.                                    	The outcome is integrated as the expanded enzyme families (EFs) within the genome DB.                                    	Bidirectional best hits (BBH) of seed enzymes, red arrows, are marked as conserved metabolism.                                    	The EFs are amplified after being compared against a DB of natural products (NP) biosynthetic enzymes,                                    	represented by a blue cylinder, to find recruitments defined as enzymes of the family that are part of a MIBiG BGC.                                    	b) The genome DB, represented by the gray stacked cylinders, is searched as previously described.                                    	Additionally, antiSMASH predictions, cyan arrows, can be added by the user.                                    	antiSMASH enzyme predictions that are at the same time marked in red are defined as transition enzymes, purple arrows.                                    	c) EvoMining phylogenetic reconstruction and visualization. On the left side, a phylogenetic reconstruction of an EF is shown.                                    	On the right side it is shown the EvoMining tree displaying the EvoMining predictions (green),                                    	which are those extra copies closer to enzyme recruitments into BGC (blue) than to conserved metabolic enzymes (red).                                    	antiSMASH predicted enzymes are represented in cyan, transition enzymes in black and                                    	extra copies that are neither antiSMASH nor EvoMining predictions are left in grey.

Usually, bioinformatics tools related to the prediction of Natural Products (NP) biosynthetic genes try to find metabolic pathways of enzymes that are known to be related with the synthesis of secondary metabolites. However, these approaches fail for the discovery of novel biosynthetic systems. Thus, EvoMining tries to circumvent this problem by detecting novel enzymes that may be implicated in the synthesis of new natural products in Bacteria.

To know more about EvoMining you can read Selem et al, Microbial Genomics 2019.

EvoMining searches protein expansions that may have evolved from the conserved metabolism into a specialized metabolism. It builds phylogenetic trees based on all the protein copies of a certain enzyme in a given genome database. The output tree differentiates copies that are related with the conserved metabolism, copies that are known to be implicated in discovered NP-producing-BGCs i.e. BGCs from MiBIG database and, optionally, protein copies that belong to BGCs predicted by antiSMASH. Finally, some branch in the tree will be depicted as “EvoMining hits”, which represent enzyme expansions that are evolutionary closer to those copies related with the secondary metabolism (MiBIG or antiSMASH BGCs) than to those related with the conserved (primary) metabolism.

Run evomining image

First, place yourself at your working directory.

$ cd   ~/pan_workshop/results/genome-mining/corason-conda/EXAMPLE2  
$ ls
CORASON_GENOMES  Corason_Rast.IDs  cpsg.query  GENOMES  output

The general structure of a docker container is shown in the next bash-box. Note that it requires to specify which docker container will run. Optionally, with -v flag it is possible to share a directory with the container, with -p flag a port is shared and it is also possible to specify which program will run inside the container.

$ docker run --rm -i -t -v <your local directory>:<inside docker directory> -p <inside port>:80 <docker container> <program inside docker>    

EvoMining is inside a docker container, so the general structure to start your analysis will be as follows:

$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 8080:80 nselem/evomining:latest /bin/bash   

Let’s explain the pieces of this line.

command Explanation
docker tells the system that we are running a docker command
run the command that we are running is to run a docker container
–rm this container will be removed after closed
-i this container allows user interaction
-t this interaction will be through a terminal
-v a data volume (directory) will be shared between your local machine and the container
-p a port will allow a web based app

However, sometimes the port 80 is busy, in that case you can use other ports like 8080 or 8084. If this is the case, please use the port 80X where X is a number between 01..30 provided by your instructor.

$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 8080:80 nselem/evomining:latest /bin/bash  
$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 8084:80 nselem/evomining:latest /bin/bash  

If your docker container worked, now you will see in your terminal a new prompt. Instead of the usual dollar sign, there should be a number # at the beginning of your terminal. This is because now you are inside the docker container and you have sudo permissions inside the docker.

#  

To exit container use exit

# exit

And now your prompt must be back in the dollar sign

$   

Set EvoMining genomic database

Start the container again with your corresponding port.

$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 80X:80 nselem/evomining:latest /bin/bash   

Though we will NOT run the test EvoMining command, it looks as follows:

# perl startEvoMining.pl  

Instead of that, customize the genomic database by using the same as CORASON.
Notice that EvoMining requires RAST-like annotated genomes and for this reason we are using the fasta files that CORASON converts from our gbk inputs.

# perl startEvoMining.pl -g GENOMES -r  Corason_Rast.IDs

EvoMining phylogenetic reconstruction providing evolutionary insights into the metabolic origin                               	and the fate of members of diverse EF from the Streptococcus example.                               	Seed enzymes are labeled in orange. The most conserved copies or central metabolism copies are marked in red.                               	Enzyme copies recruited into specialized metabolism, contained in MIBiG, are labeled in blue.                               	Enzyme copies that are closer to blue enzyme recruitments than to red conserved enzymes are labeled in green                               	and represent EvoMining Hits. Extra copies with an unknown metabolic fate are shown in gray.

Finally, remember that X means your user-number and open your browser at the address: http://132.248.196.38:80X/EvoMining/html/index.html. Once there, just click the start button and enjoy! (click on the submit buttons!)

When you finish using this container, please exit it.

#exit

Visualize your results in MicroReact

Firstly, you have to run all the pipeline in the website: http:///EvoMining/html/index.html, and then all the output files will be generated. You can use the EvoMining basic interface or take your results into MicroReact.

EvoMining outputs are stored in the directory <conserved-db>_<natural-db>_<genomes-db>

$ ls

To explore EvoMining outputs, you need to upload 1.nwk and 1.csv files to microReact. There are many methods to download files from the server to your local computer.

If you are using JupyterHub you explore the file folders and select the files and then press the download button.

You can use the export button in the file panel of R studio. To download the files, first in the files panel open in the directory ~/pan_workshop/results/genome-mining/corason-conda/EXAMPLE2/ALL_curado.fasta_MiBIG_DB.faa_GENOMES/blast/seqf/tree,
Select the path to download file

Then, select files 1.nwk and 1.csv in that directory, click more in the engine icon, and select the export option in the menu. The files will be downloaded to your local computer, and now you will be able to upload them to MicroReact.

Select the path to download file

Alternatively, If your prefer to use the terminal to download files the scp protocol can download the files into your local machine.

scp betterlab@132.248.196.38:~/pan_workshop/results/genome-mining/corason-conda/EXAMPLE2/ALL_curado.fasta_MiBIG_DB.faa_GENOMES/blast/seqf/tree/1.nwk ~/Downloads/.
scp betterlab@132.248.196.38:~/pan_workshop/results/genome-mining/corason-conda/EXAMPLE2/ALL_curado.fasta_MiBIG_DB.faa_GENOMES/blast/seqf/tree/1.csv ~/Downloads  

Here you can find the MicroReact visualization of this EvoMining run.

MicroReact visualization of the EvoMining run Streptococcus example.                                              	At the left a bar-chart with the EF in the X axis and the number of entries in the Y axis.                                              	At the right, the EvoMining phylogenetic tree using the same color code as the chart.                                              	Right of the tree the legend indicating the colors by metabolism; central metabolism enzymes in red,                                              	expansion enzymes in gray, recruited enzymes contained in MIBiG in blue,                                              	secondary metabolism enzymes (EvoMining hits) are marked in green,                                              	and seed enzymes are colored in orange. Below appears the metadata from the run,                                              	organized in a five row table including Id, metabolism, genome, function and copies.

Other resources

To run EvoMining with a larger conserved-metabolite DB you can use EvoMining Zenodo data.

To explore more EvoMining options, please explore EvoMining wiki.

Set the conserved-enzymes database

When using EvoMining, oftenly you will desire to construct your own conserved enzymes database. To know more about how to configure a database, consult the EvoMining wiki in the EvoMining databases part. Natural products database could also be replaced for another set of genes that are “true positives”, for example a set of regulatory genes.

As an example, transform the file cpsg.query into the format of this database. This file contains the aminoacid sequence of the cpsG gene. Firstly, copy this file into what will become the conserved-enzymes database.

$ cp cpsg.query cpsg_cdb

Now, it requires some editing. Open nano editor and change the first line >cpsg to >SYSTEM1|1|phosphomannomutase|Saga . EvoMining conserved-database needs a four-field format pipe-separated that contains; the name of the metabolic system to which the enzyme belongs (SYSTEM1), a consecutive number of the enzyme (1 in this case), the function of the enzyme, and finally, an abbreviation of the organism Saga, (S. Agalactiae).
The reason behind this is that this was the way we needed EvoMining for its first use and we have not changed the headers since.

$ nano cpsg_cdb
>SYSTEM1|1|phosphomannomutase|Saga
MIFVTVGTHEQQFNRLIKEVDRLKGTGAIDQEVFIQTGYSDFEPQNCQWSKFLSYDDMNSYMKEAEIVITHGGPATFMSVISLGKLPVVVPRRKQFGEHINDHQIQFLKKIAHLYPLAWIED

Run your EvoMining docker

$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 80X:80 nselem/evomining:latest /bin/bash   

and inside this new container:

# perl startEvoMining.pl -g GENOMES -r  Corason_Rast.IDs -c cpsg_cdb

Use the website again and think about the results.

Exercise 1. Set EvoMining parameters

Complete the blanks in the following EvoMining run: actinoSMASH A file with the ids of antiSMASH recognized genes. Actinos a directory with RAST-like fasta and annotation files. Histidine-db A fasta file with some proteins in the histidine pathway.
Actinos.ids tabular files with the RAST ids and the name of the organisms.

 # perl starEvoMining.pl -g ____ -c _____ -r _____ -a ___________  

Solution

# perl starEvoMining.pl -g Actinos -c Histidine-db -r Actinos.ids -a actinoSMASH  

Actinos is the genomic database, Histidine-db is the conserved-enzymes database, Actinos.ids is the file that relates Rast ids with the organisms names, and actinoSMASH contains the genes identified by antiSMASH.

Discussion 1. Retro EvoMining in enzyme database

What do you learn from running in a conserved-enzymes database the gene cpsG that is part of a specialized BGC?

Solution

cpsG does not have extra copies in Streptococcus agalactiae, so there are no expansions that may be functional divergent. cpsG single copies in the genomes look red-colored in EvoMining output, as if they belong to the conserved-metabolism. However, this is not the case, the color is because there is only one copy and it is merged into MIBiG true-positives because it was originally a gene in the specialized metabolism. So it is important to know the seed enzymes.

cpsg_cdb_MiBIG_DB.faa_GENOMES

ARTS is another evolutionary genome mining software with its corresponding database ARTS-db .

Evolutionary Genome Mining

To learn more about Genome Mining you can read these references:

  • The confluence of big data and evolutionary genome mining for the discovery of natural products. Chevrette et al, 2021`
  • Evolutionary Genome Mining for the Discovery and Engineering of Natural Product Biosynthesis. Chevrette et al, 2022

More about docker

To see the running container use ps

$ docker ps

If there are containers in use you will see a list of all of them.

2f879ba6e337   nselem/evomining:latest   "/bin/bash"   11 hours ago   Up 11 hours   0.0.0.0:8014->80/tcp, :::8014->80/tcp   relaxed_dirac

To stop the running container use docker stop and to remove them use docker rm

$ docker stop relaxed_dirac  
$ docker remove relaxed_dirac  
2f879ba6e337

can be downloaded by specifying our queries with --format genbank.

Key Points

  • EvoMining is a command-line tool that performs evolutionary genome mining over gene families

  • EvoMining hits can belong to new BGC

  • MicroReact is an interactive genomic visualizer compatible with EvoMining output