This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

Clustering with BLAST Results

Overview

Teaching: 30 min
Exercises: 5 min
Questions
  • How can we use the blast results to form families?

Objectives
  • Use a clustering algorithm to form families using the E-value.

Using E-values to cluster sequences into families

In the previous episode, we obtained the E-value between each pair of sequences of our mini dataset. Even though it is not strictly an identity measure between the sequences, the E-value allows us to know from a pool of sequences which one is the best match to a query sequence. We will now use this information to cluster the sequences from families using an algorithm written in Python, and we will see how it joins the sequences progressively until each of them is part of a family.

Processing the BLAST results

For this section, we will use Python. Let’s open the notebook and start by importing the libraries that we will need.

import os 
import pandas as pd
from matplotlib import cm
import numpy as np

First, we need to read the mini-genomes.blast file that we produced. Let’s import the BLAST results to Python using the column names: qseqid,sseqid, evalue.

os.getcwd()
blastE = pd.read_csv( '~/pan_workshop/results/blast/mini/output_blast/mini-genomes.blast', sep = '\t',names = ['qseqid','sseqid','evalue'])
blastE.head()
  qseqid	               sseqid	               evalue
0	2603V|GBPINHCM_01420	NEM316|AOGPFIKH_01528	4.110000e-67
1	2603V|GBPINHCM_01420	A909|MGIDGNCP_01408	4.110000e-67
2	2603V|GBPINHCM_01420	515|LHMFJANI_01310	4.110000e-67
3	2603V|GBPINHCM_01420	2603V|GBPINHCM_01420	4.110000e-67
4	2603V|GBPINHCM_01420	A909|MGIDGNCP_01082	1.600000e+00

Now we want to make two columns that have the name of the genomes of the queries, and the name of the genomes of the subjects. We will take this information from the query and subject IDs (the label that we added at the beginning of the episode).

First, let’s obtain the genome of each query gene.

qseqid = pd.DataFrame(blastE,columns=['qseqid'])

newqseqid = qseqid["qseqid"].str.split("|", n = 1, expand = True)
newqseqid.columns= ["Genome1", "Gen"]
newqseqid["qseqid"]= qseqid
dfqseqid =newqseqid[['Genome1','qseqid']]

dfqseqid.head()
  Genome1	qseqid
0	2603V	2603V|GBPINHCM_01420
1	2603V	2603V|GBPINHCM_01420
2	2603V	2603V|GBPINHCM_01420
3	2603V	2603V|GBPINHCM_01420
4	2603V	2603V|GBPINHCM_01420

Now let’s repeat the same for the sseqid column.

sseqid = pd.DataFrame(blastE,columns=['sseqid'])

newsseqid = sseqid["sseqid"].str.split("|", n = 1, expand = True)
newsseqid.columns= ["Genome2", "Gen"]
newsseqid["sseqid"]= sseqid 
dfsseqid = newsseqid[['Genome2','sseqid']]

Now that we have two dataframes with the new columns that we wanted, let’s combine them with the evalue of the blastE dataframe into a new one called df.

evalue = pd.DataFrame(blastE, columns=['evalue'])
df = dfqseqid
df['Genome2']=dfsseqid['Genome2']
df['sseqid']=sseqid
df['evalue']=evalue
df.head()
  Genome1	qseqid	Genome2	sseqid	evalue
0	2603V	2603V|GBPINHCM_01420	NEM316	NEM316|AOGPFIKH_01528	4.110000e-67
1	2603V	2603V|GBPINHCM_01420	A909	A909|MGIDGNCP_01408	4.110000e-67
2	2603V	2603V|GBPINHCM_01420	515	515|LHMFJANI_01310	4.110000e-67
3	2603V	2603V|GBPINHCM_01420	2603V	2603V|GBPINHCM_01420	4.110000e-67
4	2603V	2603V|GBPINHCM_01420	A909	A909|MGIDGNCP_01082	1.600000e+00

Now we want a list of the unique genes in our dataset.

qseqid_unique=pd.unique(df['qseqid'])
sseqid_unique=pd.unique(df['sseqid'])
genes = pd.unique(np.append(qseqid_unique, sseqid_unique))

We can check that we have 43 genes in total with len(genes).

Now, we want to know which one is the biggest genome (the one with more genes) to make the comparisons.

First, we compute the unique genomes.

genomes=pd.unique(df['Genome1'])
genomes=list(genomes)
genomes
['2603V', '515', 'A909', 'NEM316']

Now, we will create a dictionary that shows which genes are in each genome.

dic_gen_genomes={}
for a in genomes:
    temp=[]
    for i in range(len(genes)):
        if a in genes[i]:
            gen=genes[i]
            temp.append(gen)
    dic_gen_genomes[a]=temp

We can now use this dictionary to know how many genes each genome has and therefore identify the biggest genome.

genome_temp=[]
size_genome=[]
for i in dic_gen_genomes.keys():
    size=len(dic_gen_genomes[i])
    genome_temp.append(i)
    size_genome.append(size)

genomes_sizes = pd.DataFrame(genome_temp, columns=['Genome'])
genomes_sizes['Size']=size_genome

genome_sizes_df = genomes_sizes.sort_values('Size', ascending=False)
genome_sizes_df
Genome	Size
2	A909	12
0	2603V	11
1	515	10
3	NEM316	10

Now we can sort our genomes by their size.

genomes=genome_sizes_df['Genome'].tolist()
genomes
['A909', '2603V', '515', 'NEM316']

So the biggest genome is A909 and we will start our clustering algorithm with it.

Finding gene families with the BBH algorithm

To make a gene family, we first need to identify the most similar genes between genomes. The Bidirectional best-hit algorithm will allow us to find the pairs of genes that are the most similar (lowest e-value) to each other in each pair of genomes.

 Bidirectional best-hit algorithm

For this, we will define a function to find in each genome the gene that is most similar to each gene in our biggest genome A909.

Clustering algorithms

Can the BBH algorithm make gene families that have more than one gene from the same genome?

Solution

No. Since BBH finds the best hit of a query in each of the other genomes it will only give one hit per genome. This will force to have a different gene family for each duplicate of a gene (paralog). This also means that you are getting the best hit, which is not necessarily a “good” hit. To define what a good hit is we would need to use an algorithm that considers a similarity threshold.

def besthit(gen,genome,data):
    # gen: a fixed gen in the list of unique genes
    # genome: the genome in which we will look the best hit
    # df: the data frame with the evalues
    filter_cd=(data['qseqid']==gen) & (data['Genome2']==genome) & (data['Genome1']!=genome)
    if (len(data[filter_cd]) == 0 ):
        gen_besthit = "NA"
    else:
        gen_besthit = data.loc[filter_cd,'sseqid'].at[data.loc[filter_cd,'evalue'].idxmin()]
   
    return(gen_besthit)

Now we will define a second function, that uses the previous one, to obtain the bidirectional best hits.

def besthit_bbh(gengenome,listgenomes,genome,data):
    # gengenome: a list with all the genes of the biggest genome.
    # listgenomes: the list with all the genomes in order.
    # genome: the genome to which the genes in `gengenome` belongs.
    # data: the data frame with the evalues.
    
    dic_besthits = {}
    for a in gengenome:
        temp=[]
        for b in listgenomes:
            temp2=besthit(a,b,data)
            temp3=besthit(temp2,genome,data)
            if temp3 == a:
                temp.append(temp2)
            else:
                temp.append('NA')
        dic_besthits[a]=temp
        
    return(dic_besthits)

In one of the previous steps, we created a dictionary with all the genes present in each genome. Since we know that the biggest genome is A909, we will obtain the genes belonging to A909 and gather them in a list.

genome_A909 = dic_gen_genomes['A909']

Now, we will apply the function besthit_bbh to the previous list, genomes, and the genome A909 that is genomes[0].

g_A909_bbh=besthit_bbh(genome_A909,genomes,genomes[0],df)

In g_A909_bbh we have a dictionary that has one gene family for each gene in A909. Let’s convert it to a dataframe and have a better look at it.

family_A909=pd.DataFrame(g_A909_bbh).transpose()
family_A909.columns = ['g_A909','g_2603V','g_515','g_NEM316']
family_A909.g_A909 = family_A909.index
family_A909
	                g_A909	                g_2603V	                g_515	                g_NEM316
A909|MGIDGNCP_01408	A909|MGIDGNCP_01408	2603V|GBPINHCM_01420	515|LHMFJANI_01310	NEM316|AOGPFIKH_01528
A909|MGIDGNCP_00096	A909|MGIDGNCP_00096	2603V|GBPINHCM_00097	515|LHMFJANI_00097	NEM316|AOGPFIKH_00098
A909|MGIDGNCP_01343	A909|MGIDGNCP_01343	NA                  	NA	                NEM316|AOGPFIKH_01415
A909|MGIDGNCP_01221	A909|MGIDGNCP_01221	NA	                515|LHMFJANI_01130	NA
A909|MGIDGNCP_01268	A909|MGIDGNCP_01268	2603V|GBPINHCM_01231	515|LHMFJANI_01178	NEM316|AOGPFIKH_01341
A909|MGIDGNCP_00580	A909|MGIDGNCP_00580	2603V|GBPINHCM_00554	515|LHMFJANI_00548	NEM316|AOGPFIKH_00621
A909|MGIDGNCP_00352	A909|MGIDGNCP_00352	2603V|GBPINHCM_00348	515|LHMFJANI_00342	NEM316|AOGPFIKH_00350
A909|MGIDGNCP_00064	A909|MGIDGNCP_00064	2603V|GBPINHCM_00065	515|LHMFJANI_00064	NEM316|AOGPFIKH_00065
A909|MGIDGNCP_00627	A909|MGIDGNCP_00627	NA	                NA	                NA
A909|MGIDGNCP_01082	A909|MGIDGNCP_01082	2603V|GBPINHCM_01042	NA	                NA
A909|MGIDGNCP_00877	A909|MGIDGNCP_00877	2603V|GBPINHCM_00815	515|LHMFJANI_00781	NEM316|AOGPFIKH_00855
A909|MGIDGNCP_00405	A909|MGIDGNCP_00405	2603V|GBPINHCM_00401	515|LHMFJANI_00394	NEM316|AOGPFIKH_00403

Here, we have all the families that contain one gene from the biggest genome. The following step is to repeat this for the second-biggest genome. To do this, we need to remove from the list genes the genes that are already placed in the current families.

list_g=[]
for elemt in g_A909_bbh.keys():
    list_g.append(elemt)
    for g_hit in g_A909_bbh[elemt]:
        list_g.append(g_hit)
genes2=genes
genes2=genes2.tolist()
genesremove=pd.unique(list_g).tolist()
genesremove.remove('NA')
for b_hits in genesremove:
    genes2.remove(b_hits)

genes2
['2603V|GBPINHCM_00748', '2603V|GBPINHCM_01226', '515|LHMFJANI_01625', 'NEM316|AOGPFIKH_01842']

For this 4 genes we will repeat the algorithm. First, we create the list with the genes that belongs to the second biggest genome 2603V.

genome_2603V=[]
for i in range(len(genes2)):
    if "2603V" in genes2[i]:
        gen = genes2[i]
        genome_2603V.append(gen)
        
genome_2603V
['2603V|GBPINHCM_00748', '2603V|GBPINHCM_01226']

We apply the function besthit_bbh to this list.

g_2603V_bbh=besthit_bbh(genome_2603V,genomes,genomes[1],df)

We convert the dictionary into a dataframe.

family_2603V=pd.DataFrame(g_2603V_bbh).transpose()
family_2603V.columns = ['g_A909','g_2603V','g_515','g_NEM316']
family_2603V.g_2603V = family_2603V.index
family_2603V.head()
                        g_A909	g_2603V	               g_515	  g_NEM316
2603V|GBPINHCM_00748	NA	2603V|GBPINHCM_00748	NA	   NA
2603V|GBPINHCM_01226	NA	2603V|GBPINHCM_01226	NA	   NA

Again, let’s eliminate the genes that are already placed in families to repeat the algorithm.

for a in genome_2603V:
    genes2.remove(a)

genes2
['515|LHMFJANI_01625', 'NEM316|AOGPFIKH_01842']
genome_515=[]
for i in range(len(genes2)):
    if "515" in genes2[i]:
        gen = genes2[i]
        genome_515.append(gen)
        
genome_515
['515|LHMFJANI_01625']
g_515_bbh=besthit_bbh(genome_515,genomes,genomes[2],df)
family_515=pd.DataFrame(g_515_bbh).transpose()
family_515.columns = ['g_A909','g_2603V','g_515','g_NEM316']
family_515.g_515 = family_515.index
family_515
                    g_A909  g_2603V g_515               g_NEM316
515|LHMFJANI_01625  NA      NA      515|LHMFJANI_01625  NEM316|AOGPFIKH_01842

Since in this last step we used all the genes, we have finished our algorithm.

Now we will only create a final dataframe to integrate all of the obtained families.

families_bbh=pd.concat([family_A909,family_2603V,family_515])
families_bbh.to_csv('families_bbh.csv')
families_bbh
	                 g_A909	           g_2603V	             g_515	             g_NEM316
A909|MGIDGNCP_01408	A909|MGIDGNCP_01408	2603V|GBPINHCM_01420	515|LHMFJANI_01310	NEM316|AOGPFIKH_01528
A909|MGIDGNCP_00096	A909|MGIDGNCP_00096	2603V|GBPINHCM_00097	515|LHMFJANI_00097	NEM316|AOGPFIKH_00098
A909|MGIDGNCP_01343	A909|MGIDGNCP_01343	NA	                NA	                  NEM316|AOGPFIKH_01415
A909|MGIDGNCP_01221	A909|MGIDGNCP_01221	NA	                515|LHMFJANI_01130	  NA
A909|MGIDGNCP_01268	A909|MGIDGNCP_01268	2603V|GBPINHCM_01231	515|LHMFJANI_01178	NEM316|AOGPFIKH_01341
A909|MGIDGNCP_00580	A909|MGIDGNCP_00580	2603V|GBPINHCM_00554	515|LHMFJANI_00548	NEM316|AOGPFIKH_00621
A909|MGIDGNCP_00352	A909|MGIDGNCP_00352	2603V|GBPINHCM_00348	515|LHMFJANI_00342	NEM316|AOGPFIKH_00350
A909|MGIDGNCP_00064	A909|MGIDGNCP_00064	2603V|GBPINHCM_00065	515|LHMFJANI_00064	NEM316|AOGPFIKH_00065
A909|MGIDGNCP_00627	A909|MGIDGNCP_00627	NA	                NA	                    NA
A909|MGIDGNCP_01082	A909|MGIDGNCP_01082	2603V|GBPINHCM_01042	NA	                NA
A909|MGIDGNCP_00877	A909|MGIDGNCP_00877	2603V|GBPINHCM_00815	515|LHMFJANI_00781	NEM316|AOGPFIKH_00855
A909|MGIDGNCP_00405	A909|MGIDGNCP_00405	2603V|GBPINHCM_00401	515|LHMFJANI_00394	NEM316|AOGPFIKH_00403
2603V|GBPINHCM_00748	NA	             2603V|GBPINHCM_00748	  NA	                NA
2603V|GBPINHCM_01226	NA	             2603V|GBPINHCM_01226	  NA	                NA
515|LHMFJANI_01625	NA	                NA	                 515|LHMFJANI_01625	  NEM316|AOGPFIKH_01842

Here we have our complete pangenome! In the first column, we have the gene family names, and then one column per genome with the genes that belong to each family.

Finally, we will export to a csv file.

families_bbh.to_csv('~/pan_workshop/results/blast/mini/families_minis.csv')

Explore functional annotation of gene families

Now that we have our genes grouped together in gene families and since we have the functional annotation of each gene, we can check if the obtained families coincide with the functional annotations. For this, we will go back to our Terminal to use Bash.

The unique functional annotation that our mini genomes have are the following.

$ cat mini-genomes.faa | grep '>' | cut -d' ' -f2- | sort | uniq
30S ribosomal protein S16
50S ribosomal protein L16
bifunctional DNA primase/polymerase
Glutamate 5-kinase 1
Glycine betaine transporter OpuD
glycosyltransferase
Glycosyltransferase GlyG
peptidase U32 family protein
Periplasmic murein peptide-binding protein
PII-type proteinase
Putative N-acetylmannosamine-6-phosphate 2-epimerase
Replication protein RepB
Ribosome hibernation promotion factor
UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase
Vitamin B12 import ATP-binding protein BtuD

Let’s use these functional annotation names to obtain the gene names in the mini-genomes.faa file and the gene family names from the families_minis.csv table. With all the information together let’s create a new table that describes our pangenome.

$ echo Function$'\t'Gene$'\t'Family > mini_pangenome.tsv
$ cat mini-genomes.faa | grep '>' | cut -d' ' -f2- | sort | uniq | while read function
do 
grep "$function" mini-genomes.faa | cut -d' ' -f1 | cut -d'>' -f2 | while read line
do 
family=$(grep $line families_minis.csv| cut -d',' -f1)
echo $function$'\t'$line$'\t'$family
done
done >> mini_pangenome.tsv
$ head mini_pangenome.tsv
Function                  Gene                  Family
30S ribosomal protein S16 2603V|GBPINHCM_01420  A909|MGIDGNCP_01408
30S ribosomal protein S16 515|LHMFJANI_01310    A909|MGIDGNCP_01408
30S ribosomal protein S16 A909|MGIDGNCP_01408   A909|MGIDGNCP_01408
30S ribosomal protein S16 NEM316|AOGPFIKH_01528 A909|MGIDGNCP_01408
50S ribosomal protein L16 2603V|GBPINHCM_00097  A909|MGIDGNCP_00096
50S ribosomal protein L16 515|LHMFJANI_00097    A909|MGIDGNCP_00096
50S ribosomal protein L16 A909|MGIDGNCP_00096   A909|MGIDGNCP_00096
50S ribosomal protein L16 NEM316|AOGPFIKH_00098 A909|MGIDGNCP_00096

Exercise 1(Begginer): Partitioning the pangenome

Since we have a very small pangenome we can know the partitions of our pangenome just by looking at a small table. Look at the mini_pangenom.tsv table and decide which families correspond to the Core, Shell and Cloud genomes.

Note: You might want to download the file to your computer and open it in a spreadsheet program to read it easily.

Solution

Functional annotation of family No. Genomes Partition
30S ribosomal protein S16 4 Core
50S ribosomal protein L16 4 Core
Glutamate 5-kinase 1 4 Core
glycosyltransferase 4 Core
peptidase U32 family protein 4 Core
Putative N-acetylmannosamine-6-phosphate 2-epimerase 4 Core
Ribosome hibernation promotion factor 4 Core
UDP-N-acetylglucosamine–N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase 4 Core
Glycine betaine transporter OpuD 2 Shell
Periplasmic murein peptide-binding protein 2 Shell
Replication protein RepB 2 Shell
Vitamin B12 import ATP-binding protein BtuD 2 Shell
bifunctional DNA primase/polymerase 1 Cloud
Glycosyltransferase GlyG 1 Cloud
PII-type proteinase 1 Cloud

Key Points

  • The Bidirectional Best-Hit algorithm groups sequences together into families according to the E-value.