Getting Started
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
(NOTE: Cover how to organize directory at start of project.)
Choose group to curate
Your group should be of a manageable size in terms of taxonomic breadth and total available sequences (e.g. a subset of Rhizaria or diatoms rather than the entire clade). It will be more useful for you and EukRef in general to finish curation of a smaller group. You can expand later.
(NOTE: Provide test files so instructors / maintainers can run through and test lesson.)
Step 1 - Generate a curated starting sequence set, alignment, and tree for your group This initial sequence set will form the basis of your entire curation effort. Therefore it is critical that these sequences are well-curated, and you need an alignment of these sequences and a reliable phylogenetic tree built from them. There are multiple ways to get this initial sequence set, but in each case, you should ensure it meets the criteria below. Additionally, you need a file that contains diverse outgroups: 5-10 sequences from all of the neighboring clades (e.g., if you are working on ciliates you should include apicomplexa, dinoflagellates, and stramenopiles). This alignment MUST be untrimmed (e.g. contain all gaps) so that additional sequences can be added. The initial tree must be checked to make sure it covers the diversity of the group and reflects phylogenetic relationships expected based on current research/literature. Any errant sequences need to be removed from the alignment.
Criteria for the initial set of sequences, alignment, and tree:
- The initial set of sequences should be small, ~200 sequences or fewer, so that you can easily check the resulting alignment and tree.
- Sequences should cover the breadth of the clade, including divergent groups.
- The alignment and tree should have outgroups included to detect sequences that branch outside of your clade.
- The alignment should be untrimmed
- The tree produced from this alignment should display expected relationships and problematic sequences should be removed.
Three possible ways to get your initial alignment:
- You may already have a curated starting alignment and tree from previous work. In this case, you are ready to go after checking your tree and alignment. Proceed to Step 5.
- Download sequences from GenBank based on taxonomy. This will target the sequences from cultured isolates or those that were identified in other ways (e.g. morphologically identified and picked). Proceed to Step 2.
- Alternatively, you may wish to start with the SILVA or PR2 set of sequences for your group. You must ensure that these sequences meet the criteria above. Proceed to Step 2.
Key Points
First key point. Brief Answer to questions. (FIXME)
Retrieve an Initial Set of Sequences and Cluster
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Similarity Threshold
Users can choose a different similarity threshold if desired. For example, you may wish to work at 99% identity for small clades. In this case, replace 0.97 with 0.99 throughout.
Your Clade
In all commands, you must substitute your clade name for “NAME”.
Retrieve initial sequences
We offer three ways of doing this - using GenBank, PR2 or SILVA. We recommend using SILVA or PR2 database. Use the grep command to pull out your sequences of interest. Note that clade names often differ in these different databases.
grep –A 1 -i “NAME” silva_reference_database.fasta > NAME.fasta
This will target targeted cultured isolates or isolates that were morphologically identified After you have downloaded your sequences, clean fasta headers to contain the GenBank accession number only.
sed 's/gi\|[0-9]*\|gb\|//' NAMEgb.fasta | sed 's/\..*//' > NAME.fasta
(: .language-bash)
Cluster your sequences
Clustering will allow you to produce a manageable number of sequences using usearch (two commands). Similarity threshold is set at 97% here but can be adjusted by changing the “-id” command. These commands choose the longest sequence as the representative sequence for each cluster. We will use also this step to remove sequences shorter than 500 bp. You will use these representative sequences for your initial alignment/tree.
usearch -sortbylength NAME.fasta -fastaout NAME.sorted.fasta -minseqlength 500 -notrunclabels
usearch -cluster_smallmem NAME.sorted.fasta -id 0.97 -centroids NAME.clustered.fasta -uc NAME.cluste
Key Points
First key point. Brief Answer to questions. (FIXME)
Building Initial Alignment
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Assemble a set of close outgroup sequences
Outgroup sequences must be full-length. You should include outgroups from all of the clades that are neighbors to your group of interest, 5-10 sequences per group. These should be chosen based on your knowledge (ask for help if needed). You can get outgroup sequences from GenBank, SILVA or PR2. See step 2 for details. Add your outgroup sequences to clustered starting sequences (NAME.clustered.fasta file)
Align using MAFFT.
Use the default MAFFT algorithm (command below) unless there is a specific reason to use a different algorithm or approach. (Experience users can use different preferred alignment program)
mafft --reorder --auto NAME.clustered.fasta > NAME_aligned.fasta
Alternative: Align using INFERNAL
cmalign --outformat AFA -o NAME_aligned_infernal.fasta SSUref.cmm NAME.clustered.fasta sed 's/\./-/g' NAME_aligned_infernal.fasta > NAME_aligned.fasta
Alternative: Align using SINA*
./sina -i NAME.clustered.fasta -o NAME_aligned.fasta --outtype fasta --ptdb REFERENCE_ALN.ARB
If you choose to use SINA alignment, that you have to provide a reference alignment in ARB format. Easiest way to get one is to downloaded from SILVA’s ARB database page.
Trim the alignment using trimal.
trimal -in NAME_aligned.fasta -out NAME.trimal.fasta -gt 0.3 -st 0.001
(NOTE: Add explanation of options and flags for all function calls.)
Key Points
First key point. Brief Answer to questions. (FIXME)
Build an Initial Tree
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Build a phylogenetic tree using RAxML
raxmlHPC-PTHREADS-SSE3 -T 4 -m GTRCAT -c 25 -e 0.001 -p 31415 -f a -N 100 -x 02938 -n NAME -s NAME.trimal.fasta -w /full/path/to/output_directory
Constraint trees: You can use a constraint tree in RAxML (-g option), which will allow you to “dictate” how some part of the tree should look like based on previously published relationships for your clade, for example, based on morphology and/or multigene phylogeny. Constraint trees should be used with caution and should ONLY constrain robust relationships (those observed in multiple studies) since it is desirable to avoid fitting the tree into your personal favorite topology. All taxa in the constraint tree must be in the alignment and names in the alignment an in the constraint tree must match exactly. Moreover, constraint tree cannot contain taxa not present in the alignment. Constrained trees can be edited in Mesquite.
Alternative: Build tree using FastTree
FastTree is very easy to install and gives you “quick and dirty” tree, in case you don’t want to or can’t wait for RAxML tree or in cases when the number of taxa is too high (more than several thousand).
fasttreeMP -gtr -nt NAME.trimal.fasta > NAME.tre
Visualize tree using FigTree.
Identify errant sequences to be removed. Go back to step 3 to refine alignment, remove errant sequences and repeat as necessary until you have a high quality starting initial alignment and tree. If your tree does not reflect known relationships you may wish to use a constraint tree in step 4a. Proceed to step 5 once you are happy with your tree/alignment. You will need the unaligned set of sequences (without outgroups) as input for step 5 (initial_DS.fas).
Key Points
First key point. Brief Answer to questions. (FIXME)
Download Databases
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Download and unpack databases
tar -xvf databases.tar.gz
Inside the unpacked “databases” directory you will find three files: Reference_DB.fas and gb203_pr2_all_10_28_97p_noorg.fasta, and tax_d.bin. These files are used for filtering results within eukref_gbretrieve.py. All three files must be in the current working directory where you run the eukref_gbrtrieve.py script in Step 6.
Get GenBank nt database
Locate a copy of GenBank nt database in use at your institution or on your server. At the EukRef workshop there is a copy on the Amazon server. GenBank NT should ideally be downloaded to a server. There will be more than 40 files totaling more than 30 GB. Downloading can take several hours. You may wish to install and use a tool like wget. In a workshop setting download one time and allow access for all participants.
Make a new folder called DATABASEFOLDER. From ftp://ftp.ncbi.nlm.nih.gov/blast/db/ download taxdb.tar.gz and all nt.tar.gz files. Also, download all nt_.tar.gz.md5 files.
Unpack with the following command:
for i in nt*; do tar -xvf $i ; done
Run md5 checksum to ensure all files are fully downloaded.
(NOTE: Add explanation of what checksums are and instructions on how to run.)
Export database
Before running the pipeline run the following commands
export BLASTDB=/path/to/DATABASE_FOLDER
(you can also add BLASTDB to your PATH permanently)
(NOTE - Add instructions on how to modify PATH.)
Key Points
First key point. Brief Answer to questions. (FIXME)
Retrieve All Sequences That Belong To Your Clade
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Retrieve sequences from GenBank database
The eukref_gbretrieve.py
script script will run in a loop until no new sequences are retrieved. This may take several hours for large groups.
Inputs:
-i Unaligned fasta file of your clustered starting set of sequences (the output of Step 2, NAME.clustered.fasta). Should be refined version, with any errant sequences detected in tree inspection during Step 4 removed. Should NOT include outgroups. Fasta headers must either be in standard GenBank format (>gi|ginumber|gb|accession| ), or have the accession number followed by a space (>accession ) -dbnt (/path/to/DATABASE_FOLDER/nt)GenBank NT file from Step 5c . -dbsi (Reference_DB.udb)PR2 SSU reference database plus representative bacteria, used for filtering results. Must be in current working directory. -n Number of sequences retrieved from GenBank per blasted sequence. recommended: 100 -p Number of CPUs. -g Name of the most inclusive group you are working with from PR2 taxonomy. Find in the PR2 taxonomy file (ReferenceDB.fas) by grep. -m Blast method. recommended: megablast -idsi PR2 Blast cut-off (nothing less than 70% similar will be retrieved). -idnt GenBank Blast cut-off (average similarity to everything on the original database).
Example command, update with your information.
python eukref_gbretrieve.py -i current_DB.fasta -dbnt /scratch/NCBI_NT/nt -dbsi ../../Reference_DB.fas -n 100 -p 8 -g NAME -m megablast -idsi 75 -idnt 90 -td tax_d.bin
Outputs:
Reference set of sequences (current_DB.fas). Reference sequences with only accessions in header for downstream use (current_DB_final.fas). List of accession numbers (accessions_current_DB.txt). A fasta file of short reads (<500 bp), and of chimeras will also be generated.
Format sequences and metadata
Run the eukref_gbmetadata.py
script to pull taxonomy and environmental information from GenBank records to 1) generate your initial reference database, and 2) reformat the fasta headers for easier annotation in a tree.
First, download the GenBank format for all sequences retrieved in step 6a from NCBI batch entrez. Upload accessions ( accessions_current_DB.txt). Click the retrieve records link. Click “send to” then download as GenBank(full) file (.gb extension). This will download the gb file for all of your accessions. Save as gb_metadata.gb
Inputs:
-gb Genbank flat file -i fasta file output from step 6a (current_DB_final.fasta) –outgroup outgroup fasta file -t reference taxonomy file (pr2_4.11_full.txt)
python eukref_gbmetadata.py -gb gb_metadata.gb -i current_DB_final.fasta -o annotated_DB_for_tree.fasta -m metadata.txt -t /path/to/pr2_4.11_full.txt --outgroup outgroup_filtered.fasta
Outputs:
metadata.txt: metadata file (tab delimited format) that includes taxonomy from GenBank, reference taxonomy, environmental data available in the GenBank record, and the publication associated with an accession. annotated_DB_for_tree.fasta: fasta file with headers labeled for easier curation. We will use the PR2 taxonomic strings or alternatively the GenBank taxonomic string if the sequence is not in PR2. The outgroup_filtered.fasta will be added as well to this fasta file.
Key Points
First key point. Brief Answer to questions. (FIXME)
Build an Alignment with the Reference Sequences
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Cluster sequences using usearch
usearch -sortbylength annotated_DB_for_tree.fasta -fastaout current_DB.sorted.fasta -minseqlength 64 -notrunclabels
If the FASTA file to be clustered contains long (e.g. genomic) sequences, it is likely that the clustering step will run out of memory. Check the screen output of this “usearch -sortbylength” step, which indicates the length of the longest sequence found within the file; this should not exceed 5,000 (the hard limit imposed in Step 5).[/su_note]
NOTE: you should choose the -id (similarity threshold for clustering) that is appropriate for your group.
usearch -cluster_smallmem current_DB.sorted.fasta -id 0.97 -centroids current_DB.clustered.fasta -uc current_DB.clusters.uc -notrunclabels
Align using MAFFT
If there is not a specific reason to use a different approach to use the default mafft algorithm. See instructions below for alignments with groups known to be difficult.
mafft --reorder --auto current_DB.clustered.fasta > current_DB_aligned.fasta
Alternative: Align using INFERNAL
cmalign --outformat AFA -o current_DB_aligned_infernal.fasta SSUref.cmm current_DB.clustered.fasta sed 's/\./-/g' current_DB_aligned_infernal.fasta > current_DB_aligned.fasta
Alternative: Align using SINA*
./sina -i NAME.clustered.fasta -o NAME_aligned.fasta --outtype fasta --ptdb REFERENCE_ALN.ARB
If you choose to use SINA alignment, that you have to provide a reference alignment in ARB format. Easiest way to get one is to downloaded from SILVA’s ARB database page.
It is critical to open and check your alignment in Aliview software (or similar) to see if there are misaligned blocks or sequences. Make sure you look at the entire length of the alignment. If the alignment is not good you will not produce a good tree. If you have a poor alignment you can 1) align according to a template (seed) alignment, the command below, or 2) modify your alignment by hand.
If you already have a curated alignment for your group you can use it as a template (seed) to align your new sequences with the following command.
mafft --reorder --genafpair --maxiterate 1000 --retree --anysymbol --seed template_curated_alignment.fasta current_DB.OUTGROUP.clustered.fasta > current_DB_aligned.fasta
Remember to open your alignment file and check before proceeding.
Trim the alignment using trimal
trimal -in current_DB_aligned.fasta -out current_DB_trim.fasta -gt 0.3 -st 0.001
Key Points
First key point. Brief Answer to questions. (FIXME)
Build RaxML Trees and Clean Up
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Build a phylogenetic tree using RAxML
Warning: Threads
Do not use more threads (-T option) than available processors. Typically no more than 4 on a laptop and potentially many more on a cluster. NOTE: you can add a backbone constraint tree with the option -g. This should be done if clades within your group are always monophyletic in the literature (e.g. in phylogenomic or multigene analyses), but are not monophyletic in 18S trees. The sequences (tips) in this constraint tree must be in your alignment (they should be a subset of your sequences), and the names must match exactly.
raxmlHPC-PTHREADS-SSE3 -T 4 -m GTRCAT -c 25 -p 31415 -x 20398 -d -f a -N 100 -n NAME -s current_DB.trim.fasta –w /full/path/to/output_directory
Alternative: Build tree using FastTree
FastTree is very easy to install and gives you “quick and dirty” tree, in case you don’t want to or can’t wait for RAxML tree or in cases when the number of taxa is too high (more than several thousand).
fasttreeMP -gtr -nt current_DB.trim.fasta > NAME_fasttree.tre
Key Points
First key point. Brief Answer to questions. (FIXME)
Visualize Your Tree and Remove Errant Sequences
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Download your tree e.g. “RAxML_bestTree.clade”. Open tree in FigTree and annotate taxa with either “remove” for taxa you wish to remove, or you can modify/add to the existing name. You can only annotate taxa, NOT nodes or clades. You can select entire clades to annotate all taxa at once. Annotated nodes cannot be collapsed (annotations within the collapsed node will not be reported). “Save as” the tree with annotations for example “RAxML_bestTree.clade_annotated.tre”, which is a nexus file.
(NOTE: Add screenshots illustrating how to do the above in FigTree)
Inputs:
-t RAxML_bestTree.clade_annotated.tre -m metadata.txt
python eukref_treeparse.py -t RAxML_bestTree.clade_annotated.tre -m metadata.txt
Output: metadata_out.txt
The “metadata_out.txt” will be your tab-delimited metadata file with “remove” in the last column for the sequences you want to remove. You can sort your metadata_out.txt file in excel to get all of the accessions of sequences that should be removed from your tree. Save these accessions to be removed in a file called accessions_remove.txt.
Inputs:
-i current_DB.clustered.fasta -s accessions_remove.txt
python eukref_filter_fasta.py -i current_DB.clustered.fasta -s accessions_remove.txt -o current_DB.cleaned.fasta
Output:
current_DB.cleaned.fasta
Go back to Step 7 to align and build a new tree, repeat until you are happy with your current set of sequences.
Key Points
First key point. Brief Answer to questions. (FIXME)
Build Reference Tree
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Once you are happy with the set of sequences in your tree and your alignment you are ready to build your final reference tree. Your final, trimmed alignment is in input into RAxML. NOTE: you can add a backbone constraint tree with the option -g. The sequences (tips) in this constraint tree must be in your alignment (they should be a subset of your sequences), and the names must match exactly.
raxmlHPC-PTHREADS-SSE3 -T 4 -m GTRCAT -c 25 -p 31415 -x 20398 -d -f a -N 100 -n NAME -s final_DB.trim
Key Points
First key point. Brief Answer to questions. (FIXME)
Annotation
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Trim your metadata
Use the script eukref_metatrim.py Trim your metadata file to include only representative sequences of clusters – these are the sequences in your tree.
Inputs:
-i current_DB.cleaned.fasta is the fasta file with your clustered sequences containing only the sequences used in your reference tree. -m metadata.txt is your metadata file from eukref_gbmetadata.py.
python eukref_metatrim.py -i current_DB.cleaned.fasta -m metadata.txt -o metadata_ref.txt
Outputs:
-o metadata_ref.txt, which only has the accessions in your reference tree. This will make annotating easier.
Annotate classification
You need to annotate classification your metadata_ref.txt. Note that you should have GenBank taxonomy and SILVA taxonomy in your metadata_ref.txt file so you know the starting point. In many cases, it will be easiest to annotate clades directly on your reference tree. You can annotate the tree and then export annotations to your metadata_ref.txt file using the instructions below.
Download your reference tree e.g. “RAxML_bestTree.clade”. Open tree in FigTree and annotate taxa with the name of the taxon. You can only annotate the taxa, NOT nodes or clades. You can select entire clades to annotate all taxa at once. Annotated nodes cannot be collapsed (annotations within the collapsed node will not be reported). “Save as” the tree with annotations for example “RAxML_bestTree.clade_annotated.tre”, which is a nexus file.
Run script eukref_treeparse.py to add annotations in your tree to your metadata_ref.txt file.
Inputs:
-t RAxML_bestTree.clade_annotated.tre -m metadata_ref.txt
python eukref_treeparse.py -t RAxML_bestTree.clade_annotated.tre -m metadata_ref.txt
Outputs:
The “metadata_ref_out.txt” will be your tab-delimited metadata file that includes the Accession Number of representative for your clusters and the annotations that you made on the tree in the last column.
Expand your curation
Use the script eukref_metaexpand.py to expand your curation to all sequences within the clusters.
Inputs:
-i metadata.txt is the initial metadata file you got -r metadata_ref_out.txt is the reference metadata you just added annotations to -c current_DB.clusters.uc is the cluster file (.uc) from usearch. Use the most recent version if you clustered multiple times.
python eukref_metaexpand.py -i metadata.txt -r metadata_ref_out.txt -c current_DB.clusters.uc -o metadata_ref_expanded.txt
Outputs:
-o metadata_ref_expanded.txt is the expanded annotated metadata file
Key Points
First key point. Brief Answer to questions. (FIXME)
Getting Started
Overview
Teaching: 0 min
Exercises: 0 minQuestions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Key Points
First key point. Brief Answer to questions. (FIXME)