Secondary metabolite biosynthetic gene cluster identification
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How can I annotate known BGCs?
Which kinds of analysis can antiSMASH perform?
Which file formats does antiSMASH accept?
Objectives
Understand antiSMASH applications.
Perform a Minimal antiSMASH run analysis.
Explore several Streptococcus genomes by identifying the BGCs they contain and the types of secondary metabolites they produce.
Introduction
Microbial genomes contain specific regions that take part in the biosynthesis of secondary metabolites. These regions are known as Biosynthetic Gene Clusters (BGCs), and they are relevant because of the possible applications of their products, for example as antimicrobial, antitumor, cholesterol-lowering, immunosuppressant, antiprotozoal, anthelmintic, antiviral and anti-aging agents.
What is a natural Product?
antiSMASH is a pipeline based on profile hidden Markov models that allows us to identify, within genome sequences, the gene clusters that encode secondary metabolites of all known broad chemical classes.
antiSMASH input files
The antiSMASH pipeline can work with three different file formats: GenBank, FASTA and EMBL. Both the GenBank and EMBL formats include genome annotations, while a FASTA file comprises only the nucleotides of each genomic contig.
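As a quick check before a run, the format of an input file can usually be guessed from its first non-empty line: FASTA records start with `>`, GenBank records with `LOCUS`, and EMBL records with `ID`. The sketch below relies on that convention only. `detect_format` is a hypothetical helper written for this lesson, not part of antiSMASH.

```bash
# Guess the sequence format of a file from its first non-empty line.
# detect_format is a hypothetical helper, not an antiSMASH command.
detect_format() {
    # awk prints the first line that has at least one field, then stops
    first_line=$(awk 'NF { print; exit }' "$1")
    case "$first_line" in
        ">"*)     echo "fasta" ;;
        "LOCUS"*) echo "genbank" ;;
        "ID "*)   echo "embl" ;;
        *)        echo "unknown" ;;
    esac
}

# Example use (the file name is only an illustration):
# detect_format genome.gbk
```

This only inspects the start of the file, so it says nothing about whether the rest of the file is valid.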
Running antiSMASH
The command line usage of antiSMASH is detailed in the following repositories:
In summary, you will need to use your genome as the input. Then, antiSMASH will create an output folder for each of your genomes. Within this folder, you will find a single .gbk file for each of the detected Biosynthetic Gene Clusters (we will use these files for subsequent analyses) and a .html file, among others. By opening the .html file, you can explore the antiSMASH annotations.
You can run antiSMASH in two ways, as a Minimal run or as a Full-featured run:
Run type | Command |
---|---|
Minimal run | antismash genome.gbk |
Full-featured run | antismash --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go --smcog-trees genome.gbk |
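Because the full-featured command is long, it can help to print it and check the flags before starting a run. The sketch below only prints the command for each run type in the table and never calls antiSMASH. `antismash_cmd` is a hypothetical helper name, and the flags are the ones listed in the table, written with double hyphens.

```bash
# Print (do not run) the antismash command for a given run type.
# antismash_cmd is a hypothetical helper; the flags come from the table above.
antismash_cmd() {
    run_type=$1
    genome=$2
    case "$run_type" in
        minimal)
            echo "antismash $genome" ;;
        full)
            echo "antismash --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go --smcog-trees $genome" ;;
        *)
            echo "unknown run type: $run_type" >&2
            return 1 ;;
    esac
}

# Show both variants for an example input file
antismash_cmd minimal genome.gbk
antismash_cmd full genome.gbk
```

The available options can differ between antiSMASH versions, so it is worth comparing the printed command against `antismash --help` on your installation.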
Exercise 1: antiSMASH scope
If you run a Streptococcus agalactiae annotated genome through antiSMASH, what results do you expect?
a. The substances that a cluster produces
b. A prediction of the metabolites that the clusters can produce
c. A prediction of genes that belong to a biosynthetic cluster
Solution
a. False. antiSMASH is not an algorithm devoted to substance prediction.
b. False. Although antiSMASH does have some information about metabolites produced by similar clusters, this is not its main purpose.
c. True. antiSMASH compares domains and performs a prediction of the genes that belong to biosynthetic gene clusters.
Run your own antiSMASH analysis
First, activate the GenomeMining conda environment:
$ conda deactivate
$ conda activate /miniconda3/envs/GenomeMining_Global
Second, run the antiSMASH command shown earlier in this lesson on the .gbk or .fasta data files. The command can be run on a single file, on all the files contained within a folder, or on a specific list of files. The examples below show how to do this.
Running antiSMASH on a single file
Choose the annotated file `Streptococcus_agalactiae_A909_prokka.gbk`:
$ mkdir -p ~/pan_workshop/results/antismash
$ cd ~/pan_workshop/results/antismash
$ antismash --genefinding-tool=none ~/pan_workshop/results/annotated/Streptococcus_agalactiae_A909_prokka/Streptococcus_agalactiae_A909_prokka.gbk
To see the output generated by antiSMASH, run the following:
$ tree -L 1 ~/pan_workshop/results/antismash/Streptococcus_agalactiae_A909_prokka
.
├── CP000114.1.region001.gbk
├── CP000114.1.region002.gbk
├── css
├── images
├── index.html
├── js
├── regions.js
├── Streptococcus_agalactiae_A909.gbk
├── Streptococcus_agalactiae_A909.json
├── Streptococcus_agalactiae_A909.zip
└── svg
Running antiSMASH over a list of files
Now, imagine that you want to run antiSMASH over all Streptococcus agalactiae annotated genomes. Use a for loop and * as a wildcard to run antiSMASH over all files that start with “S” in the annotated directory.
$ for gbk_file in ~/pan_workshop/results/annotated/*/S*.gbk
> do
> antismash --genefinding-tool=none $gbk_file
> done
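When the loop above is interrupted or re-run, it can be convenient to skip genomes that already have results. The following is a minimal sketch of that idea, under two assumptions: each genome gets an output folder named after its input file (set explicitly with antiSMASH's `--output-dir` option), and a finished run is recognized by the presence of index.html. `run_antismash_batch` and the `ANTISMASH` variable are hypothetical names introduced for illustration.

```bash
# run_antismash_batch is a hypothetical helper, not an antiSMASH command.
# ANTISMASH can be overridden (for example ANTISMASH=echo) for a dry run.
ANTISMASH=${ANTISMASH:-antismash}

run_antismash_batch() {
    out_root=$1   # folder that will hold one sub-folder per genome
    shift         # the remaining arguments are the input .gbk files
    for gbk_file in "$@"
    do
        # Name the output folder after the input file, without its extension
        name=$(basename "$gbk_file" .gbk)
        if [ -f "$out_root/$name/index.html" ]; then
            echo "skipping $name (results already present)"
            continue
        fi
        $ANTISMASH --genefinding-tool=none --output-dir "$out_root/$name" "$gbk_file"
    done
}

# Example use with the paths from this lesson:
# run_antismash_batch ~/pan_workshop/results/antismash ~/pan_workshop/results/annotated/*/S*.gbk
```

Setting `ANTISMASH=echo` before calling the function prints the commands instead of running them, which is a cheap way to check the file list first.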
Visualizing antiSMASH results
To see the results of an antiSMASH run, we need to open the index.html file. Where is this file located?
$ cd Streptococcus_agalactiae_A909_prokka
$ pwd
~/pan_workshop/results/antismash/Streptococcus_agalactiae_A909_prokka
As output you should get a folder made up mainly of the following files:
- .gbk files: one for each Biosynthetic Gene Cluster region found.
- .json file: records the input file name, the antiSMASH version used and the region data (id, sequence_data).
- index.html file: used to visualize the results of the analysis.
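Since each detected region gets its own .gbk file, counting those files gives a quick tally of regions per genome. The sketch below assumes the region files follow the naming seen earlier in this lesson (for example CP000114.1.region001.gbk). `count_regions` and `summarize_runs` are hypothetical helper names.

```bash
# count_regions is a hypothetical helper: it counts the per-region GenBank
# files (names containing ".region") directly inside one output folder.
count_regions() {
    find "$1" -maxdepth 1 -type f -name '*.region*.gbk' | wc -l | tr -d ' '
}

# summarize_runs prints one "folder<TAB>count" line per output folder
# found inside the directory given as its argument.
summarize_runs() {
    for run_dir in "$1"/*/
    do
        [ -d "$run_dir" ] || continue
        printf '%s\t%s\n' "$(basename "$run_dir")" "$(count_regions "$run_dir")"
    done
}

# Example use with the results folder from this lesson:
# summarize_runs ~/pan_workshop/results/antismash
```

A count of 0 can mean that no region was detected or that the run did not finish, so it is worth checking such folders by hand.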
To access these results, we can use scp to download the directory to your local computer.
If using scp, open a Git Bash terminal on your local machine in the destination folder and execute the following command:
$ scp -r user@bioinformatica.matmor.unam.mx:~/pan_workshop/results/antismash/S*A909_prokka/* ~/Downloads/.
If using RStudio, choose the “more” option in the left panel and “export” your file to your local computer. Decompress the Streptococcus_agalactiae_A909_prokka.zip file.
Another way to download the data to your computer is to first compress the folder and then download the compressed file from JupyterHub:
$ cd ~/pan_workshop/results/antismash
$ zip -r Streptococcus_agalactiae_A909_prokka.zip Streptococcus_agalactiae_A909_prokka
$ ls
In JupyterHub, navigate to the pan_workshop/results/antismash/ folder, select the file we just created, and press the download button at the top.
Once on your local machine, go to the Streptococcus_agalactiae_A909_prokka directory and open the index.html file in your web browser.
Understanding the output
The visualization of the results includes many different options. To understand all the possibilities, we suggest reading the following tutorial:
Briefly, the “Overview” page of the .html file lists all the regions found within every analyzed record/contig (antiSMASH inputs) and summarizes this information in five main features:
- Region: The region number.
- Type: Type of the product detected by antiSMASH.
- From,To: The region’s location (nucleotides).
- Most similar known cluster: The closest compound from the MIBiG database.
- Similarity: Percentage of genes within the closest known compound that have a significant BLAST hit. (The last two columns, containing comparisons to the MIBiG database, will only be shown if antiSMASH was run with the KnownClusterBlast option `--cc-mibig`.)
antiSMASH web services
antiSMASH can also be used through an online server on the antiSMASH website. You will be asked to give your email. The results will then be sent to you, and you will be able to download a folder with the annotations.
Exercises and discussion
Exercise 2: NCBI and antiSMASH webserver
Run the antiSMASH web server on S. agalactiae A909. First, explore the NCBI assembly database to obtain the accession. Then get the id of your results.
Solution
- Go to NCBI and search for S. agalactiae A909.
- Choose the assembly database and copy the GenBank sequence ID at the bottom of the page.
- Go to antiSMASH.
- Choose the accession from NCBI option and paste CP000114.1.
- Run antiSMASH and paste your results id into the collaborative document, for example `bacteria-cbd13a47-8095-495f-957c-dcf58c261529`.
Exercise 3: (*Optional exercise) Run antiSMASH over S. thermophilus
Download and annotate the S. thermophilus genomes of strain LMD-9 (MIBiG) and of strain CIRM-BIA 65 (reference genome). With the following information, generate the script to run antiSMASH only on the S. thermophilus annotated genomes.
done
antismash --genefinding-tool=none $__
for ___ in ____________
do
Solution
Guide to solve this problem: first, download the genomes with ncbi-genome-download. Remember to activate the conda environment!
Download the annotated genomes, and then the GenBank files.
ncbi-genome-download --formats genbank --genera "Streptococcus thermophilus" -S "LMD-9" -o thermophilus bacteria
ncbi-genome-download --formats fasta --genera "Streptococcus thermophilus" -S "CIRM-BIA 65" -n bacteria
Then, move the genomes to the directory ~/pan_workshop/results/annotated/.
First, we need to start the for loop:
for mygenome in ~/pan_workshop/results/annotated/S*.gbk
Note that we are using the name mygenome as the variable name in the for loop. Then you need to use the reserved word do to start the body of the loop. Then you have to call antiSMASH on your variable mygenome. Remember the $ before your variable to indicate to bash that you are now using the value of the variable. Finally, use the reserved word done to finish the loop.
~~~
for mygenome in ~/pan_workshop/results/annotated/t*.gbk
do
    antismash --genefinding-tool=none $mygenome
done
~~~
{: .language-bash}
Exercise 4: The size of a region
Arrange the following pieces to complete the commands that find the size of a region:
SOURCE
ORGANISM
LOCUS
head
grep
$ _____ region.gbk
$ _____ __________ region.gbk
Apply your solution to get the size of the first region of S. agalactiae A909.
Solution
$ cd ~/pan_workshop/results/antismash
$ head Streptococcus_agalactiae_A909_prokka/CP000114.1.region001.gbk
$ grep LOCUS Streptococcus_agalactiae_A909_prokka/CP000114.1.region001.gbk
The LOCUS line contains the characteristics of the sequence:
LOCUS CP000114.1 42196 bp DNA linear UNK 13-JUN-2022
In this case the size of the region is 42196 bp.
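The same lookup can be wrapped in a small function so it is easy to repeat for several region files. This is a minimal sketch that assumes the length is the third whitespace-separated field of the LOCUS line, as in the example line shown in this solution. `region_size` is a hypothetical helper name.

```bash
# region_size is a hypothetical helper: it prints the third field of the
# first LOCUS line of a GenBank file, which holds the sequence length in bp.
region_size() {
    awk '/^LOCUS/ { print $3; exit }' "$1"
}

# Example use: print "file<TAB>size" for every region of one genome
# for region in Streptococcus_agalactiae_A909_prokka/*.region*.gbk
# do
#     printf '%s\t%s\n' "$region" "$(region_size "$region")"
# done
```

If a record name is long enough to run into the length column, the fields shift and this simple approach would need adjusting.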
Discussion 1: BGC classes in Streptococcus
What BGC classes does the genus Streptococcus have?
Solution
In the antiSMASH website output for strain A909 we see clusters of two classes, T3PKS and arylpolyene. Nevertheless, this is only one Streptococcus genome; to infer all the BGC classes in the genus, we need to run antiSMASH on more genomes.
Key Points
antiSMASH is a bioinformatic tool capable of identifying, annotating and analysing secondary metabolite BGCs
antiSMASH can be used as a web-based tool or as a stand-alone command-line tool
The file formats accepted by antiSMASH are GenBank, FASTA and EMBL