A tool to predict bacterial species from fasta files using RefSeq 16S.
Project description
Spestimator
Species Estimator & Genome Retriever
Species + Estimator = Spestimator
Spestimator is a lightweight Python command-line tool designed to quickly identify bacterial species from input FASTA sequences (contigs or reads) and automatically retrieve their corresponding reference genomes from NCBI RefSeq.
Why Spestimator?
I needed a tool that could:
- Quickly estimate a range of potential reference genomes for an input sample.
- Use a database maintained by a trustworthy, authoritative group (NCBI RefSeq) rather than maintaining a custom, static database that gets outdated.
Installation
From Pip/From Source
Spestimator requires NCBI BLAST+ to be installed and available in your system path.
sudo apt-get install ncbi-blast+
# installation from pypi
pip install spestimator
# from source
git clone https://github.com/erinyoung/Spestimator.git
cd Spestimator
pip install .
From Conda (includes blast)
conda install -c bioconda spestimator
Quick Start
Identify organisms in your FASTA file(s)
spestimator -i *.fasta -o results.csv
Identify organisms and download the matched RefSeq genomes to a folder:
spestimator -i *.fasta -o results.csv -d genomes_dir/
Updating the database
RefSeq updates quarterly. There is an attempt to keep this package on a similar update schedule, but this may not be feasible. To get a current version of the database.
# downloads https://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Bacteria/bacteria.16SrRNA.fna.gz for blast database
# uses eutils to get taxids for blast database (NCBI api-key will speed this step up)
# downloads https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt for refseq accesions
spestimator --update-db --db-dir database --api-key NCBI_API_KEY
This database can then be used by
spestimator -i *.fasta -o results.csv -d genomes_dir/ --db-dir database
Usage & Options
usage: spestimator [-h] [-v] [-i INPUT [INPUT ...]] [-o OUTPUT] [-d [DIR]] [--db-dir DB_DIR] [--db-name DB_NAME] [-u] [--api-key API_KEY] [-t THREADS]
[--max-target-seqs MAX_TARGET_SEQS] [--min-identity MIN_IDENTITY] [--min-coverage MIN_COVERAGE] [--min-hits MIN_HITS]
[--min-alignment-len MIN_ALIGNMENT_LEN] [--top-k-taxa TOP_K_TAXA]
Spestimator: Predict bacterial TaxIDs from 16S and download genomes.
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-i, --input INPUT [INPUT ...]
Input FASTA files
-o, --output OUTPUT Output CSV file
-d, --download-genomes [DIR]
Download found genomes. Defaults to 'genomes/' if flag is used without a path.
--db-dir DB_DIR Override path to BLAST database directory
--db-name DB_NAME Custom name for the database to appear in results (Default: DB filename)
-u, --update-db Download database and generate metadata
--api-key API_KEY NCBI API Key (Speeds up metadata generation)
-t, --threads THREADS
BLAST threads
Filtering Options:
--max-target-seqs MAX_TARGET_SEQS
BLAST: Hits to keep per read (Default: 10)
--min-identity MIN_IDENTITY
Filter: Minimum Percent Identity (0-100). Default: 90.0
--min-coverage MIN_COVERAGE
Filter: Minimum Query Coverage (0-100). Default: 0.0
--min-hits MIN_HITS Filter: Minimum reads required to report an organism
--min-alignment-len MIN_ALIGNMENT_LEN
Filter: Minimum Alignment Length in bp (Default: 0/No Filter)
--top-k-taxa TOP_K_TAXA
Report: Only keep the top K unique organisms per file (Default: 10)
How It Works
- Runs blastn of your input against the 16S database in the repo.
- Filters hits based on identity and coverage.
- Aggregates hits to identify the most likely species present.
- (Optional): Uses ncbi-datasets-pyclient to download the reference genome assembly (
GCF_xxxx) associated with the identified species.
Output Format
The results CSV contains the following columns:
- organism: The clean species name (e.g., E. coli).
- taxid: NCBI Taxonomy ID for the organism.
- species_taxid: NCBI Taxonomy ID for the species of the organism.
- refseq_assembly_accession: The Assembly ID (e.g.,
GCF_000005845.2) used for downloading. - sacc: The specific 16S sequence accession hit.
- count: Number of input sequences matching this organism.
- avg_pident: Average percent identity of the matches.
- avg_bitscore: A metric combining match quality and length.
Example
spestimator -i tests/sample_positive.fasta
The results should look like this
input file,sacc,organism,refseq_assembly_accession,taxid,species_taxid,count,total_bitscore,avg_bitscore,avg_pident,max_pident,avg_qcov,best_evalue
sample_positive.fasta,NR_112088,Streptococcus pyogenes,GCF_900475035.1,1314,1314,1,2772,2772.0,100.0,100.0,33.25947263461113,0.0
sample_positive.fasta,NR_115729,Streptococcus canis,GCF_900636575.1,1329,1329,1,2545,2545.0,98.085,98.085,32.395302459561265,0.0
sample_positive.fasta,NR_024633,Streptococcus canis,GCF_900636575.1,1329,1329,1,2536,2536.0,97.202,97.202,33.25947263461113,0.0
sample_positive.fasta,NR_037101,Streptococcus urinalis,GCF_900636885.1,149016,149016,1,2534,2534.0,97.135,97.135,33.25947263461113,0.0
sample_positive.fasta,NR_043661,Streptococcus dysgalactiae subsp. equisimilis,GCF_016128095.1,119602,1334,1,2534,2534.0,97.443,97.443,32.92709949036118,0.0
sample_positive.fasta,NR_025148,Streptococcus iniae,GCF_000831485.1,1346,1346,1,2532,2532.0,96.422,96.422,34.05716818081099,0.0
sample_positive.fasta,NR_115802,Streptococcus ictaluri 707-05,GCF_000188015.2,764299,380397,1,2523,2523.0,97.627,97.627,32.68335918457789,0.0
sample_positive.fasta,NR_178901,Streptococcus penaeicida,GCF_053204055.1,1765960,1765960,1,2514,2514.0,96.217,96.217,33.96853534234434,0.0
sample_positive.fasta,NR_199923,Streptococcus tangpeifui,GCF_044503215.1,2709400,2709400,1,2494,2494.0,95.858,95.858,34.23443385774429,0.0
sample_positive.fasta,NR_040821,Streptococcus agalactiae ATCC 13813,GCF_001552035.1,888745,1311,1,2484,2484.0,96.536,96.536,33.25947263461113,0.0
Disclaimer: NCBI Datasets
Spestimator relies on ncbi-datasets-pyclient to retrieve genomes. This library updates frequently, as does the underlying datasets tool. If you encounter unexpected errors during genome download (e.g., ApiException or connection drops), please submit an issue and use the accessions in the "refseq_accession" column to download with datasets separately.
# download datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets && chmod +x datasets
# get a list of accessions
cut -f 4 -d , results.csv | grep GCF > id_list.txt
# use the list of accessions to download genomes
./datasets download genome accession --inputfile id_list.txt --filename ncbi_dataset.zip
# decompress file and use for additional analysis
unzip ncbi_dataset.zip
AI Usage
Spestimator began as a set of custom shell and Python scripts used for ad-hoc analysis. To make these tools more reliable and accessible, Google's Gemini was used to accelerate the transition into a Python package. Gemini assisted in modularizing the codebase, replacing parsing logic with Pandas operations, and implementing a "mocked" testing suite that allows for safe CI/CD without hitting NCBI servers. Gemini also created the logo.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file spestimator-0.3.0.233.tar.gz.
File metadata
- Download URL: spestimator-0.3.0.233.tar.gz
- Upload date:
- Size: 6.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
561646e5dd11ce80cd674a04fd8ea2ababda8524aaf97d2e5b94607bdcba84fa
|
|
| MD5 |
f379b6096148f4653b3c7d144a7c658e
|
|
| BLAKE2b-256 |
85d2ebffd65cb1f6057785c7bc8d8ce53d290c1119686242099ccede44124dd1
|
File details
Details for the file spestimator-0.3.0.233-py3-none-any.whl.
File metadata
- Download URL: spestimator-0.3.0.233-py3-none-any.whl
- Upload date:
- Size: 6.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
78537c4fef5b3366fc16a8a663bc1e217461f6c7ae0ed73d0cee55c81075be87
|
|
| MD5 |
2dbbdf52f545594bd451d3cae86c4a1c
|
|
| BLAKE2b-256 |
48e88a7e950d12e2924ddecbf32215483a3f463495fffa169d01e468bc1053be
|