Retrieve All Sequences That Belong To Your Clade


Retrieve sequences from GenBank database

The script will run in a loop until no new sequences are retrieved. This may take several hours for large groups.


-i Unaligned fasta file of your clustered starting set of sequences (the output of Step 2, NAME.clustered.fasta). Should be the refined version, with any errant sequences detected in tree inspection during Step 4 removed. Should NOT include outgroups. Fasta headers must either be in standard GenBank format (>gi|ginumber|gb|accession| ), or have the accession number followed by a space (>accession ).

-dbnt (/path/to/DATABASE_FOLDER/nt) GenBank NT file from Step 5c.

-dbsi (Reference_DB.udb) PR2 SSU reference database plus representative bacteria, used for filtering results. Must be in current working directory.

-n Number of sequences retrieved from GenBank per blasted sequence. Recommended: 100.

-p Number of CPUs.

-g Name of the most inclusive group you are working with from PR2 taxonomy. Find in the PR2 taxonomy file (ReferenceDB.fas) by grep.

-m Blast method. Recommended: megablast.

-idsi PR2 Blast cut-off (nothing less than 70% similar will be retrieved).

-idnt GenBank Blast cut-off (average similarity to everything on the original database).
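The header requirement for the -i file can be checked before starting a long run. The following is a minimal sketch, not part of the protocol's scripts: the function names and the two regular expressions are assumptions written only from the two header styles described above, and may need adjusting for your files.

```python
import re

# Assumed patterns, derived only from the two header styles described above.
# Style 1: >gi|ginumber|gb|accession|   (database tag such as gb, emb, dbj)
GI_HEADER = re.compile(r"^>gi\|\d+\|[a-z]+\|([^|\s]+)\|")
# Style 2: >accession followed by a space
PLAIN_HEADER = re.compile(r"^>(\S+) ")


def accession_from_header(header):
    """Return the accession from a fasta header line, or None if neither style matches."""
    if header.startswith(">gi|"):
        m = GI_HEADER.match(header)
    else:
        m = PLAIN_HEADER.match(header)
    return m.group(1) if m else None


def check_fasta_headers(lines):
    """Return (line_number, header) pairs for headers that match neither style."""
    bad = []
    for n, line in enumerate(lines, start=1):
        header = line.rstrip("\n")
        if header.startswith(">") and accession_from_header(header) is None:
            bad.append((n, header))
    return bad
```

To use it, open the fasta file and pass the file handle to check_fasta_headers; an empty list means every header matched one of the two styles.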

Example command; update with your own information.

python -i current_DB.fasta -dbnt /scratch/NCBI_NT/nt -dbsi ../../Reference_DB.fas -n 100 -p 8 -g NAME -m megablast -idsi 75 -idnt 90 -td tax_d.bin


Reference set of sequences (current_DB.fas).

Reference sequences with only accessions in header, for downstream use (current_DB_final.fas).

List of accession numbers (accessions_current_DB.txt).

Fasta files of short reads (<500 bp) and of chimeras will also be generated.
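Before uploading the accession list, it can be worth confirming that it matches the headers of the accession-only fasta file. The sketch below is illustrative and not part of the protocol's scripts: the function names are hypothetical, and it assumes that the accession list holds one accession per line and that each fasta header starts with the accession.

```python
def fasta_accessions(fasta_lines):
    """Collect the first whitespace-delimited token of every fasta header."""
    found = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                found.add(tokens[0])
    return found


def compare_accessions(fasta_lines, accession_lines):
    """Return (only_in_fasta, only_in_list) as two sorted lists.

    Assumes one accession per line in accession_lines (an assumption,
    not something the protocol states).
    """
    in_fasta = fasta_accessions(fasta_lines)
    in_list = {line.strip() for line in accession_lines if line.strip()}
    return sorted(in_fasta - in_list), sorted(in_list - in_fasta)
```

Passing open file handles for current_DB_final.fas and accessions_current_DB.txt should return two empty lists when the files agree.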

Format sequences and metadata

Run the script to pull taxonomy and environmental information from GenBank records to 1) generate your initial reference database, and 2) reformat the fasta headers for easier annotation in a tree.

First, download the GenBank-format records for all sequences retrieved in step 6a from NCBI Batch Entrez. Upload the accession list (accessions_current_DB.txt). Click the retrieve records link. Click “send to”, then download as a GenBank (full) file (.gb extension). This will download the gb file for all of your accessions. Save as


-gb GenBank flat file.

-i Fasta file output from step 6a (current_DB_final.fasta).

--outgroup Outgroup fasta file.

-t Reference taxonomy file (pr2_4.11_full.txt).

python -gb -i current_DB_final.fasta -o annotated_DB_for_tree.fasta -m metadata.txt -t /path/to/pr2_4.11_full.txt --outgroup outgroup_filtered.fasta


metadata.txt: tab-delimited metadata file that includes the taxonomy from GenBank, the reference taxonomy, any environmental data available in the GenBank record, and the publication associated with each accession.

annotated_DB_for_tree.fasta: fasta file with headers labeled for easier curation. Headers use the PR2 taxonomic string or, if the sequence is not in PR2, the GenBank taxonomic string. The sequences in outgroup_filtered.fasta are also added to this file.
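Because metadata.txt is tab-delimited, it can be inspected with the Python standard library. This is a minimal sketch under one assumption the protocol does not confirm: that the first row of the file holds column names. The function names are hypothetical, and the column you summarize must be one that actually exists in your file.

```python
import csv
from collections import Counter


def load_metadata(handle):
    """Read a tab-delimited table with a header row into a list of dicts.

    Assumes the first row contains column names (not confirmed by the protocol).
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def count_by_column(rows, column):
    """Count rows per value of one column; missing or empty values count as 'NA'."""
    return Counter((row.get(column) or "NA") for row in rows)


# Example use (the column name is whatever your file actually contains):
#   with open("metadata.txt", newline="") as fh:
#       rows = load_metadata(fh)
#   print(count_by_column(rows, "<column name>"))
```

A summary like this gives a quick view of how many records lack a value in a given field before the table is used for annotation.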

