This lesson is still being designed and assembled (Pre-Alpha version)

Retrieve an Initial Set of Sequences and Cluster


Teaching: 0 min
Exercises: 0 min
  • Key question (FIXME)

  • First learning objective. (FIXME)

Similarity Threshold

Users can choose a different similarity threshold if desired. For example, you may wish to work at 99% identity for small clades. In this case, replace 0.97 with 0.99 throughout.

Your Clade

In all commands, you must substitute your clade name for “NAME”.
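One way to avoid retyping the clade name and threshold is to set them once as shell variables. The following is a minimal sketch, not part of the original workflow: the variable names (`CLADE`, `ID`, `RAW`, `SORTED`, `CLUSTERED`) are hypothetical, and "Ciliophora" is only an example clade name.

```bash
# Hypothetical variable names; set these once at the start of your session.
CLADE="Ciliophora"   # example only: replace with your own clade name
ID=0.97              # similarity threshold; use 0.99 for 99% identity

# File names used in this lesson, built from the clade name.
RAW="${CLADE}.fasta"
SORTED="${CLADE}.sorted.fasta"
CLUSTERED="${CLADE}.clustered.fasta"

echo "$RAW $SORTED $CLUSTERED $ID"
```

You could then write `"$RAW"` wherever a command below uses NAME.fasta, and `"$ID"` wherever it uses 0.97.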

Retrieve initial sequences

There are three ways of doing this: using GenBank, PR2 or SILVA. We recommend using the SILVA or PR2 database. Use the grep command to pull out your sequences of interest. Note that clade names often differ between these databases.

grep -A 1 -i "NAME" silva_reference_database.fasta > NAME.fasta
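The grep step can be tried on a small made-up file first. This sketch assumes that each sequence sits on a single line after its header (otherwise `-A 1` would keep only the first sequence line), and the file names `toy_reference.fasta` and `toy_clade.fasta` are hypothetical. It also drops the `--` separator lines that grep prints between non-adjacent matches, which would otherwise end up in the FASTA file.

```bash
# Toy reference file (made-up records) standing in for the real database.
cat > toy_reference.fasta <<'EOF'
>AB000001 Eukaryota;Alveolata;Ciliophora
ACGTACGT
>AB000002 Eukaryota;Fungi
GGGGCCCC
>AB000003 Eukaryota;Alveolata;ciliophora
TTTTAAAA
EOF

# -i ignores case; -A 1 keeps the sequence line after each matching header.
# grep prints "--" between non-adjacent groups of matches; remove those lines.
grep -A 1 -i "Ciliophora" toy_reference.fasta | grep -v '^--$' > toy_clade.fasta

# Count the retrieved sequences by counting header lines.
grep -c '^>' toy_clade.fasta
```

On this toy file the final command reports 2, one for each matching header.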

This will target cultured isolates or isolates that were morphologically identified. After you have downloaded your sequences, clean the FASTA headers so that they contain only the GenBank accession number.

sed 's/gi|[0-9]*|gb|//' NAMEgb.fasta | sed 's/\..*//' > NAME.fasta

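To see what the header cleaning does, it can be run on a single made-up record. This is a minimal sketch: the header below is invented for illustration, and the pipe characters are written unescaped so that sed matches them literally (in GNU sed, `\|` means "or" rather than a literal pipe).

```bash
# Made-up GenBank-style record used only for illustration.
printf '>gi|12345|gb|AB123456.1| Example organism 18S rRNA\nACGTACGT\n' > toy_gb.fasta

# First sed removes the "gi|<number>|gb|" prefix;
# second sed removes everything from the first dot onward (version and description).
sed 's/gi|[0-9]*|gb|//' toy_gb.fasta | sed 's/\..*//' > toy_clean.fasta

cat toy_clean.fasta
```

The cleaned file contains the header `>AB123456` followed by the unchanged sequence line.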

Cluster your sequences

Clustering with usearch (two commands) reduces your sequences to a manageable number. The similarity threshold is set to 97% here, but you can adjust it by changing the value of the “-id” option. The first command sorts the sequences by length and removes those shorter than 500 bp. The second command clusters the sorted sequences, so that the longest sequence is chosen as the representative of each cluster. You will use these representative sequences for your initial alignment/tree.

usearch -sortbylength NAME.fasta -fastaout NAME.sorted.fasta -minseqlength 500 -notrunclabels
usearch -cluster_smallmem NAME.sorted.fasta -id 0.97 -centroids NAME.clustered.fasta -uc NAME.cluste
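A quick way to check that clustering has produced a manageable number of sequences is to count the records in each file. This is a sketch only: `count_seqs` is a hypothetical helper, not a usearch feature, and it simply counts FASTA header lines.

```bash
# count_seqs is a hypothetical helper: it counts FASTA records
# by counting the lines that start with ">".
count_seqs() {
    grep -c '^>' "$1"
}

# Report counts for the files from this lesson, skipping any that do not exist yet.
for f in NAME.fasta NAME.sorted.fasta NAME.clustered.fasta; do
    if [ -f "$f" ]; then
        echo "$f: $(count_seqs "$f") sequences"
    fi
done
```

The count for NAME.clustered.fasta should be no larger than the count for NAME.sorted.fasta, since each cluster contributes one representative.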

Key Points

  • First key point. Brief Answer to questions. (FIXME)