Finding Variation on Genomic Vicinities

Overview

Teaching: 20 min
Exercises: 15 min

Questions

How can I follow variation in genomic vicinities given a reference BGC?

Which gene families are the conserved part of a BGC family?

Objectives

Learn to find BGC-families

Visualize variation in a genomic vicinity

CORASON is a visual tool that identifies gene clusters that share a common genomic core and reconstructs multi-locus phylogenies of these gene clusters to explore their evolutionary relationships. CORASON was developed to find and prioritize biosynthetic gene clusters (BGC), but can be used for any kind of clusters.

Input: query gene, reference BGC and genome database. Output: SVG figure with BGC in the family sorted according to the multi-locus phylogeny of the core genes.

Advantages

SVG graphs Scalable graphs that allow metadata easy display.
Interactive, CORASON is not a static database, it allows to explore your own genomes.
Reproducibility, since CORASON runs on docker and conda, the containerization allows to always perform the same analysis even if you change your Linux/perl/blast/muscle/Gblocks/quicktree local distributions.

CORASON conda

Here we are testing the new stand-alone corason in a conda environment with gbk as input-files. Firstly, activate the corason-conda environment.

$ conda deactivate 
$ conda activate /miniconda3/envs/corason  

(corason) user@server:~$

With the environment activated, all CORASON-dependencies are ready to be used. The next step is to clone CORASON-software from its GitHub repository. Firstly, place yourself at the results directory and then clone CORASON-code.

$ mkdir -p ~/pan_workshop/results/genome-mining 
$ cd ~/pan_workshop/results/genome-mining
$ git clone https://github.com/miguel-mx/corason-conda.git 
$ ls

The GitHub CORASON repository must be now in your directory.

corason-conda

Change directory by descending to EXAMPLE2. The file cpsg.query, contains the reference protein cpsG, whose encoding gene is part of the polysaccharide BGC produced by some S. agalactiae

$ cd corason-conda/EXAMPLE2        
$ ls  

cpsg.query

As genomic database we will use the prokka-annotated gbk files of S. agalactiae. This database will be stored in the reserved directory CORASON_GENOMES.

$ mkdir  CORASON_GENOMES  
$ cp ~/pan_workshop/results/annotated/Streptococcus_agalactiae_*/*gbk CORASON_GENOMES/  

CORASON was written to be used with RAST annotation as input files, in this case we are using a genome database composed of .gbk files. Thus, we need to convert gbk files into CORASON-compatible input files.

$ ../CORASON/gbkIndex.pl CORASON_GENOMES ../CORASON CORASON_GENOMES

Directory CORASON_GENOMES                                                                                                
../CORASON/gbk_to_fasta.pl CORASON_GENOMES agalactiae_18RS21_prokka.gbk 100001 ../CORASON                                 
../CORASON/gbk_to_fasta.pl CORASON_GENOMES agalactiae_515_prokka.gbk 100002 ../CORASON                                   
../CORASON/gbk_to_fasta.pl CORASON_GENOMES agalactiae_A909_prokka.gbk 100003 ../CORASON                             
../CORASON/gbk_to_fasta.pl CORASON_GENOMES agalactiae_CJB111_prokka.gbk 100004 ../CORASON                         
../CORASON/gbk_to_fasta.pl CORASON_GENOMES agalactiae_COH1_prokka.gbk 100005 ../CORASON                           
../CORASON/gbk_to_fasta.pl CORASON_GENOMES agalactiae_H36B_prokka.gbk 100006 ../CORASON                            
../CORASON/gbk_to_fasta.pl CORASON_GENOMES thermophilus_LMD-9_prokka.gbk 100007 ../CORASON                          
../CORASON/gbk_to_fasta.pl CORASON_GENOMES thermophilus_LMG_18311_prokka.gbk 100008 ../CORASON   

Now, all converted genome files, aminoacid fasta files (.faa) and annotation files (.txt) should be placed in the ‘GENOMES directory’ one level up outside output.

$ mv output/* . 
$ ls

CORASON_GENOMES  Corason_Rast.IDs  cpsg.query  GENOMES  output

Finally, we have the query enzyme, the IDs file and a genomic database of S. agalactiae in the same directory. Run CORASON with cpsg.query as query with 1000006 as an example of reference BGC.

$ ../CORASON/corason.pl -q cpsg.query -s 100006  -rast_ids Corason_Rast.IDs

>> 100001_666.input  Cluster1237  Cluster1245  Cluster1253       cpsg.query_BGC.tre            cpsg.query_tree.svg        
>> 100002_2002.input Cluster1238  Cluster1246  Cluster1254       cpsg.query.BLAST              Frequency                  
>> 100003_1173.input Cluster1239  Cluster1247  Cluster1255       cpsg.query.BLAST.pre          GBK                        
>> 100004_35.input   Cluster1240  Cluster1248  Cluster1256       cpsg.query.parser             Joined.svg                 
>> 100005_912.input  Cluster1241  Cluster1249  Cluster1257       cpsg.query.parser.pre         MINI                       
>> 100006_1247.input Cluster1242  Cluster1250  Concatenados.faa  cpsg.query_PrincipalHits.tre  PrincipalHits.muscle       
>> 100007_1022.input Cluster1243  Cluster1251  CORASON           cpsg.query_Report             PrincipalHits.muscle-gb    
>> 100008_1024.input Cluster1244  Cluster1252  cpsg.query        cpsg.query.svg                TempConcatenados.faa 

Finally, we have all the genomic vicinities sorted phylogenetically according to the genes in the core-cluster. We can either see the output directly on the server or download the resulting svg file to our local computer. The format svg stands for scalable vector graphics, we are save it as an html file in order to get the option of open it as html. To see it in the server we must first copy the Joined.svg into Joined.html, then, go to the EXAMPLE2 directory in the r-studio files panel located at the bottom right. Once there clik in the Joined.html file and chose the option View in Web Browser. The corason figure will output in a new tab.

$ cp output/cpsg.query-output/Joined.svg  Joined.html

You can also use scp in a local terminal to download the resulting svg file to your local computer.

$ scp (remoto)/corason-conda/EXAMPLE2/output/cpsg.query-output  Downloads/.

CORASON has several installation and a conda beta versions

CORASON has one docker installation where you have to go inside the docker container, its input are RAST files. The script run_corason belongs to another installation that accepts gbk as input files. And finally, this is a beta version of CORASON conda container.

Key Points

CORASON is a command-line tool that finds BGC-families

Genomic vicinity variation is organized phylogenetically according to the conserved genes in the BGC-family

previous episode

Genome Mining in Prokaryotes

next episode

Finding Variation on Genomic Vicinities

Overview

CORASON conda

CORASON has several installation and a conda beta versions

Key Points

previous episode

next episode