Evolutionary Genome Mining

Overview

Teaching: 20 min
Exercises: 20 min

Questions

What is Evolutionary Genome Mining?

What kinds of BGCs can EvoMining find?

What do I need in order to run an evolutionary genome mining analysis?

Objectives

Understand the EvoMining pipeline.

Run an example evolutionary analysis of the cpsG gene family.

Explore the MicroReact interactive output interface.

Bioinformatics tools that predict Natural Product (NP) biosynthetic genes usually search for metabolic pathways built from enzymes already known to be involved in the synthesis of secondary metabolites. As a result, these approaches fail to discover novel biosynthetic systems. EvoMining tries to circumvent this problem by detecting novel enzymes that may be implicated in the synthesis of new natural products in Bacteria.

To learn more about EvoMining, you can read Selem et al., Microbial Genomics, 2019.

EvoMining searches for expansions of enzyme families that may have evolved from conserved metabolism into specialized metabolism. It builds phylogenetic trees from all the copies of a given enzyme found in a genome database. The output tree distinguishes copies related to conserved metabolism, copies known to be part of characterized NP-producing BGCs (that is, BGCs from the MIBiG database) and, optionally, copies that belong to BGCs predicted by antiSMASH. Finally, some branches in the tree are marked as “EvoMining hits”: enzyme copies that are evolutionarily closer to the copies related to secondary metabolism (MIBiG or antiSMASH BGCs) than to those related to conserved (primary) metabolism.

Run the EvoMining image

First, move to your working directory.

$ cd ~/pan_workshop/results/genome-mining/corason-conda/EXAMPLE2
$ ls

CORASON_GENOMES  Corason_Rast.IDs  cpsg.query  GENOMES  output

The general structure of a docker run command is shown in the next box. You must specify which docker image to run. Optionally, the -v flag shares a directory between your machine and the container, the -p flag maps a local port to a port inside the container, and a final argument names the program to run inside the container.

$ docker run --rm -i -t -v <your local directory>:<inside docker directory> -p <local port>:80 <docker image> <program inside docker>

EvoMining is inside a docker container, so the general structure to start your analysis will be as follows:

$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 8080:80 nselem/evomining:latest /bin/bash   

Let’s explain the pieces of this line.

Command	Explanation
docker	tells the system that we are running a docker command
run	the docker command we are running starts a container
--rm	the container will be removed after it exits
-i	the container allows user interaction
-t	the interaction will be through a terminal
-v	a data volume (directory) will be shared between your local machine and the container
-p	a local port will be mapped to a container port, which lets you reach the web-based app

Port 80 on the local machine is sometimes busy; in that case you can map a different local port, such as 8080 or 8084. In this workshop, please use port 80X, where X is the two-digit number (01 to 30) provided by your instructor.

$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 8080:80 nselem/evomining:latest /bin/bash  
$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 8084:80 nselem/evomining:latest /bin/bash  
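
The port rule above can also be scripted. The following is a minimal sketch, assuming a hypothetical helper named evomining_port (it is not part of EvoMining or docker) that receives the user number written without a leading zero (1 to 30) and returns the local port 80X. The last line only prints the docker command, so you can inspect it before running it yourself.

```shell
# Hypothetical helper (not part of EvoMining): build the local port 80X
# from the user number X (1 to 30, no leading zero) given by the instructor.
evomining_port() {
    user_number=$1
    case "$user_number" in
        ''|*[!0-9]*|0*)
            echo "user number must be an integer without leading zeros" >&2
            return 1
            ;;
    esac
    if [ "$user_number" -lt 1 ] || [ "$user_number" -gt 30 ]; then
        echo "user number must be between 1 and 30" >&2
        return 1
    fi
    # Pad to two digits so that user 7 gets port 8007.
    printf '80%02d\n' "$user_number"
}

port=$(evomining_port 7)

# Print the command instead of running it.
echo "docker run --rm -i -t -v \$(pwd):/var/www/html/EvoMining/exchange -p ${port}:80 nselem/evomining:latest /bin/bash"
```

Replace 7 with your own user number; the printed command has the same shape as the two examples above.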

If your docker container started correctly, you will see a new prompt in your terminal. Instead of the usual dollar sign, there should be a number sign (#) at the beginning of the line. This is because you are now inside the docker container, where you have root permissions.

To exit the container, use exit.

# exit

Your prompt should now be back to the dollar sign.

Set EvoMining genomic database

Start the container again with your corresponding port.

$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 80X:80 nselem/evomining:latest /bin/bash   

Although we will NOT run it, the test EvoMining command looks as follows:

# perl startEvoMining.pl

Instead, customize the genomic database by using the same one as CORASON.
Note that EvoMining requires RAST-like annotated genomes, which is why we use the fasta files that CORASON converts from our gbk inputs.

# perl startEvoMining.pl -g GENOMES -r  Corason_Rast.IDs

Finally, remembering that X is your user number, open your browser at the address http://132.248.196.38:80X/EvoMining/html/index.html. Once there, click the start button, then click the submit buttons as they appear, and enjoy!

When you finish using this container, please exit it.

# exit

Visualize your results in MicroReact

First, you have to run the whole pipeline on the website, http:///EvoMining/html/index.html, so that all the output files are generated. You can then use the EvoMining basic interface or take your results into MicroReact.

EvoMining outputs are stored in the directory <conserved-db>_<natural-db>_<genomes-db>

$ ls
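
The naming rule can be reproduced in the shell to predict where a run will write its results. This is a minimal sketch, assuming a hypothetical helper named evomining_outdir (not part of EvoMining). The three database names are inferred from the output path used later in this episode, so adjust them if your run used other databases.

```shell
# Hypothetical helper (not part of EvoMining): compose the output directory
# name <conserved-db>_<natural-db>_<genomes-db> from the three database names.
evomining_outdir() {
    conserved_db=$1
    natural_db=$2
    genomes_db=$3
    printf '%s_%s_%s\n' "$conserved_db" "$natural_db" "$genomes_db"
}

# Names inferred from the output path shown in this episode (an assumption).
outdir=$(evomining_outdir ALL_curado.fasta MiBIG_DB.faa GENOMES)
echo "$outdir"

# List the directory only if the run has already produced it.
if [ -d "$outdir" ]; then
    ls "$outdir"
fi
```

With a custom conserved-enzymes database such as cpsg_cdb, the same rule gives cpsg_cdb_MiBIG_DB.faa_GENOMES.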

To explore the EvoMining outputs, you need to upload the 1.nwk and 1.csv files to MicroReact. There are several ways to download files from the server to your local computer.

If you are using JupyterHub, browse the file folders, select the files, and press the download button.

If you are using RStudio, you can use the export option in the Files panel. First, in the Files panel, open the directory ~/pan_workshop/results/genome-mining/corason-conda/EXAMPLE2/ALL_curado.fasta_MiBIG_DB.faa_GENOMES/blast/seqf/tree.

Then select the files 1.nwk and 1.csv in that directory, click More (the gear icon), and select the Export option in the menu. The files will be downloaded to your local computer, and you will be able to upload them to MicroReact.

Alternatively, if you prefer to use the terminal, the scp command can copy the files to your local machine.

$ scp betterlab@132.248.196.38:~/pan_workshop/results/genome-mining/corason-conda/EXAMPLE2/ALL_curado.fasta_MiBIG_DB.faa_GENOMES/blast/seqf/tree/1.nwk ~/Downloads/
$ scp betterlab@132.248.196.38:~/pan_workshop/results/genome-mining/corason-conda/EXAMPLE2/ALL_curado.fasta_MiBIG_DB.faa_GENOMES/blast/seqf/tree/1.csv ~/Downloads/

Here you can find the MicroReact visualization of this EvoMining run.

Other resources

To run EvoMining with a larger conserved-metabolite DB you can use EvoMining Zenodo data.

To explore more EvoMining options, see the EvoMining wiki.

Set the conserved-enzymes database

When using EvoMining, you will often want to construct your own conserved-enzymes database. To learn how to configure a database, consult the databases section of the EvoMining wiki. The natural products database can also be replaced with another set of genes that are “true positives”, for example a set of regulatory genes.

As an example, we will transform the file cpsg.query into the format of this database. This file contains the amino acid sequence of the cpsG gene. First, copy this file into what will become the conserved-enzymes database.

$ cp cpsg.query cpsg_cdb

Now the file needs some editing. Open the nano editor and change the first line from >cpsg to >SYSTEM1|1|phosphomannomutase|Saga. The EvoMining conserved-enzymes database needs a header with four pipe-separated fields: the name of the metabolic system to which the enzyme belongs (SYSTEM1), a consecutive number for the enzyme (1 in this case), the function of the enzyme, and an abbreviation of the organism (Saga, for S. agalactiae).
This format exists because it was what EvoMining needed for its first use, and the headers have not changed since.

$ nano cpsg_cdb

>SYSTEM1|1|phosphomannomutase|Saga
MIFVTVGTHEQQFNRLIKEVDRLKGTGAIDQEVFIQTGYSDFEPQNCQWSKFLSYDDMNSYMKEAEIVITHGGPATFMSVISLGKLPVVVPRRKQFGEHINDHQIQFLKKIAHLYPLAWIED
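
The same header edit can be done without opening nano. The following is a minimal sketch that works in a scratch directory on a stand-in file with a shortened sequence, so your real cpsg.query is untouched. It assumes, as described above, that the first line of the query file is the >cpsg header. It uses sed to replace the first line and awk to check that every header has exactly four pipe-separated fields.

```shell
# Work in a scratch directory so real files are untouched.
workdir=$(mktemp -d)

# Stand-in for cpsg.query: a header line followed by a shortened sequence.
printf '>cpsg\nMIFVTVGTHEQQFNRLIKEVDRLKG\n' > "$workdir/cpsg.query"

# Replace only the first line with the four-field header.
sed '1s/.*/>SYSTEM1|1|phosphomannomutase|Saga/' "$workdir/cpsg.query" > "$workdir/cpsg_cdb"

# Count header lines that do not have exactly four pipe-separated fields.
bad_headers=$(awk -F'|' '/^>/ && NF != 4 { n++ } END { print n + 0 }' "$workdir/cpsg_cdb")
echo "headers with a wrong field count: $bad_headers"

# Show the new header.
head -n 1 "$workdir/cpsg_cdb"
```

To apply it to the real file, run the sed line on cpsg.query in your working directory; the awk check is also useful for databases that contain many enzymes.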

Run your EvoMining docker

$ docker run --rm -i -t -v $(pwd):/var/www/html/EvoMining/exchange -p 80X:80 nselem/evomining:latest /bin/bash   

and inside this new container:

# perl startEvoMining.pl -g GENOMES -r  Corason_Rast.IDs -c cpsg_cdb

Use the website again and think about the results.

Exercise 1. Set EvoMining parameters

Complete the blanks in the following EvoMining run, given these inputs:

actinoSMASH: a file with the ids of antiSMASH-recognized genes.

Actinos: a directory with RAST-like fasta and annotation files.

Histidine-db: a fasta file with some proteins in the histidine pathway.

Actinos.ids: a tabular file with the RAST ids and the names of the organisms.

# perl startEvoMining.pl -g ____ -c _____ -r _____ -a ___________

Solution

# perl startEvoMining.pl -g Actinos -c Histidine-db -r Actinos.ids -a actinoSMASH

Actinos is the genomic database, Histidine-db is the conserved-enzymes database, Actinos.ids is the file that relates RAST ids to the organism names, and actinoSMASH contains the genes identified by antiSMASH.

Discussion 1. Retro EvoMining in enzyme database

What do you learn from placing cpsG, a gene that is part of a specialized BGC, in the conserved-enzymes database?

Solution

cpsG does not have extra copies in Streptococcus agalactiae, so there are no expansions that could be functionally divergent. The single cpsG copies in the genomes appear red in the EvoMining output, as if they belonged to conserved metabolism. However, this is not the case: they are red because there is only one copy, and it is merged with the MIBiG true positives because cpsG was originally a gene from specialized metabolism. It is therefore important to know your seed enzymes.

cpsg_cdb_MiBIG_DB.faa_GENOMES

ARTS is another evolutionary genome mining software, with its corresponding database ARTS-db.

Evolutionary Genome Mining

To learn more about Genome Mining you can read these references:

The confluence of big data and evolutionary genome mining for the discovery of natural products. Chevrette et al, 2021

Evolutionary Genome Mining for the Discovery and Engineering of Natural Product Biosynthesis. Chevrette et al, 2022

More about docker

To see the running containers, use docker ps.
$ docker ps

If there are containers in use, you will see a list of all of them.
2f879ba6e337   nselem/evomining:latest   "/bin/bash"   11 hours ago   Up 11 hours   0.0.0.0:8014->80/tcp, :::8014->80/tcp   relaxed_dirac
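
The PORTS column of that listing shows which local port reaches port 80 of the container. This is a minimal sketch, assuming a hypothetical helper named local_port that parses an entry in the format shown above; with docker available, docker port relaxed_dirac reports the same mapping directly.

```shell
# Hypothetical helper: extract the local port from a PORTS entry such as
# "0.0.0.0:8014->80/tcp", as printed by `docker ps`.
local_port() {
    printf '%s\n' "$1" | sed -n 's/^[^ ,]*:\([0-9][0-9]*\)->80\/tcp.*/\1/p'
}

# PORTS entry copied from the listing above.
local_port "0.0.0.0:8014->80/tcp, :::8014->80/tcp"
```

Here the container is reachable through local port 8014, that is, user number 14 in the 80X scheme.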

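To stop a running container, use docker stop, and to remove a stopped container, use docker rm. Containers started with --rm, as in this episode, are removed automatically when they stop.
$ docker stop relaxed_dirac
$ docker rm relaxed_dirac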
2f879ba6e337

Key Points

EvoMining is a command-line tool that performs evolutionary genome mining over gene families

EvoMining hits can belong to new BGCs

MicroReact is an interactive genomic visualizer compatible with EvoMining output
