A tool to predict bacterial species from fasta files using RefSeq 16S.

These details have not been verified by PyPI

Project links

Homepage

Project description

Spestimator

PyPI - Version PyPI - Python Version GitHub Actions Workflow Status GitHub License

Species Estimator & Genome Retriever

Species + Estimator = Spestimator

Spestimator is a lightweight Python command-line tool designed to quickly identify bacterial species from input FASTA sequences (contigs or reads) and automatically retrieve their corresponding reference genomes from NCBI RefSeq.

Why Spestimator?

I needed a tool that could:

Quickly estimate a range of potential reference genomes for an input sample.
Use a database maintained by a trustworthy, authoritative group (NCBI RefSeq) rather than maintaining a custom, static database that gets outdated.

Installation

From Pip/From Source

Spestimator requires NCBI BLAST+ to be installed and available in your system path.

sudo apt-get install ncbi-blast+

# installation from pypi
pip install spestimator

# from source
git clone https://github.com/erinyoung/Spestimator.git
cd Spestimator
pip install .

From Conda (includes blast)

conda install -c bioconda spestimator

Quick Start

Identify organisms in your FASTA file(s)

spestimator -i *.fasta -o results.csv

Identify organisms and download the matched RefSeq genomes to a folder:

spestimator -i *.fasta -o results.csv -d genomes_dir/

Updating the database

RefSeq updates quarterly. There is an attempt to keep this package on a similar update schedule, but this may not be feasible. To get a current version of the database.

# downloads https://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Bacteria/bacteria.16SrRNA.fna.gz for blast database
# uses eutils to get taxids for blast database (NCBI api-key will speed this step up)
# downloads https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt for refseq accesions
spestimator --update-db --db-dir database --api-key NCBI_API_KEY

This database can then be used by

spestimator -i *.fasta -o results.csv -d genomes_dir/ --db-dir database

Usage & Options

usage: spestimator [-h] [-v] [-i INPUT [INPUT ...]] [-o OUTPUT] [-d [DIR]] [--db-dir DB_DIR] [--db-name DB_NAME] [-u] [--api-key API_KEY] [-t THREADS]
                   [--max-target-seqs MAX_TARGET_SEQS] [--min-identity MIN_IDENTITY] [--min-coverage MIN_COVERAGE] [--min-hits MIN_HITS]
                   [--min-alignment-len MIN_ALIGNMENT_LEN] [--top-k-taxa TOP_K_TAXA]

Spestimator: Predict bacterial TaxIDs from 16S and download genomes.

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i, --input INPUT [INPUT ...]
                        Input FASTA files
  -o, --output OUTPUT   Output CSV file
  -d, --download-genomes [DIR]
                        Download found genomes. Defaults to 'genomes/' if flag is used without a path.
  --db-dir DB_DIR       Override path to BLAST database directory
  --db-name DB_NAME     Custom name for the database to appear in results (Default: DB filename)
  -u, --update-db       Download database and generate metadata
  --api-key API_KEY     NCBI API Key (Speeds up metadata generation)
  -t, --threads THREADS
                        BLAST threads

Filtering Options:
  --max-target-seqs MAX_TARGET_SEQS
                        BLAST: Hits to keep per read (Default: 10)
  --min-identity MIN_IDENTITY
                        Filter: Minimum Percent Identity (0-100). Default: 90.0
  --min-coverage MIN_COVERAGE
                        Filter: Minimum Query Coverage (0-100). Default: 0.0
  --min-hits MIN_HITS   Filter: Minimum reads required to report an organism
  --min-alignment-len MIN_ALIGNMENT_LEN
                        Filter: Minimum Alignment Length in bp (Default: 0/No Filter)
  --top-k-taxa TOP_K_TAXA
                        Report: Only keep the top K unique organisms per file (Default: 10)

How It Works

Runs blastn of your input against the 16S database in the repo.
Filters hits based on identity and coverage.
Aggregates hits to identify the most likely species present.
(Optional): Uses ncbi-datasets-pyclient to download the reference genome assembly (GCF_xxxx) associated with the identified species.

Output Format

The results CSV contains the following columns:

organism: The clean species name (e.g., E. coli).
taxid: NCBI Taxonomy ID for the organism.
species_taxid: NCBI Taxonomy ID for the species of the organism.
refseq_assembly_accession: The Assembly ID (e.g., GCF_000005845.2) used for downloading.
sacc: The specific 16S sequence accession hit.
count: Number of input sequences matching this organism.
avg_pident: Average percent identity of the matches.
avg_bitscore: A metric combining match quality and length.

Example

spestimator -i tests/sample_positive.fasta

The results should look like this

input file,sacc,organism,refseq_assembly_accession,taxid,species_taxid,count,total_bitscore,avg_bitscore,avg_pident,max_pident,avg_qcov,best_evalue
sample_positive.fasta,NR_112088,Streptococcus pyogenes,GCF_900475035.1,1314,1314,1,2772,2772.0,100.0,100.0,33.25947263461113,0.0
sample_positive.fasta,NR_115729,Streptococcus canis,GCF_900636575.1,1329,1329,1,2545,2545.0,98.085,98.085,32.395302459561265,0.0
sample_positive.fasta,NR_024633,Streptococcus canis,GCF_900636575.1,1329,1329,1,2536,2536.0,97.202,97.202,33.25947263461113,0.0
sample_positive.fasta,NR_037101,Streptococcus urinalis,GCF_900636885.1,149016,149016,1,2534,2534.0,97.135,97.135,33.25947263461113,0.0
sample_positive.fasta,NR_043661,Streptococcus dysgalactiae subsp. equisimilis,GCF_016128095.1,119602,1334,1,2534,2534.0,97.443,97.443,32.92709949036118,0.0
sample_positive.fasta,NR_025148,Streptococcus iniae,GCF_000831485.1,1346,1346,1,2532,2532.0,96.422,96.422,34.05716818081099,0.0
sample_positive.fasta,NR_115802,Streptococcus ictaluri 707-05,GCF_000188015.2,764299,380397,1,2523,2523.0,97.627,97.627,32.68335918457789,0.0
sample_positive.fasta,NR_178901,Streptococcus penaeicida,GCF_053204055.1,1765960,1765960,1,2514,2514.0,96.217,96.217,33.96853534234434,0.0
sample_positive.fasta,NR_199923,Streptococcus tangpeifui,GCF_044503215.1,2709400,2709400,1,2494,2494.0,95.858,95.858,34.23443385774429,0.0
sample_positive.fasta,NR_040821,Streptococcus agalactiae ATCC 13813,GCF_001552035.1,888745,1311,1,2484,2484.0,96.536,96.536,33.25947263461113,0.0

Disclaimer: NCBI Datasets

Spestimator relies on ncbi-datasets-pyclient to retrieve genomes. This library updates frequently, as does the underlying datasets tool. If you encounter unexpected errors during genome download (e.g., ApiException or connection drops), please submit an issue and use the accessions in the "refseq_accession" column to download with datasets separately.

# download datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets && chmod +x datasets

# get a list of accessions
cut -f 4 -d , results.csv | grep GCF > id_list.txt

# use the list of accessions to download genomes
./datasets download genome accession --inputfile id_list.txt --filename ncbi_dataset.zip

# decompress file and use for additional analysis
unzip ncbi_dataset.zip

AI Usage

Spestimator began as a set of custom shell and Python scripts used for ad-hoc analysis. To make these tools more reliable and accessible, Google's Gemini was used to accelerate the transition into a Python package. Gemini assisted in modularizing the codebase, replacing parsing logic with Pandas operations, and implementing a "mocked" testing suite that allows for safe CI/CD without hitting NCBI servers. Gemini also created the logo.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

This version

0.3.0.234

Mar 24, 2026

0.3.0.233

Feb 4, 2026

0.1.0.232

Dec 11, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spestimator-0.3.0.234.tar.gz (13.7 MB view details)

Uploaded Mar 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

spestimator-0.3.0.234-py3-none-any.whl (14.7 MB view details)

Uploaded Mar 24, 2026 Python 3

File details

Details for the file spestimator-0.3.0.234.tar.gz.

File metadata

Download URL: spestimator-0.3.0.234.tar.gz
Upload date: Mar 24, 2026
Size: 13.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spestimator-0.3.0.234.tar.gz
Algorithm	Hash digest
SHA256	`b8ee88db8289e6b36c4426816af2f79c6708abacea9b1dc557af047846020308`
MD5	`5c00796d710ad8b9b8bbf285d14f18a1`
BLAKE2b-256	`dcda59b0d8ddae23e48415e82ceabd8c8faf41742379bb5f9f9ecccbd993fbbc`

See more details on using hashes here.

File details

Details for the file spestimator-0.3.0.234-py3-none-any.whl.

File metadata

Download URL: spestimator-0.3.0.234-py3-none-any.whl
Upload date: Mar 24, 2026
Size: 14.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spestimator-0.3.0.234-py3-none-any.whl
Algorithm	Hash digest
SHA256	`62352636ea1ce5836febddd7c4a44d3d592e2d520a4478bb22a9b437749516ba`
MD5	`be72e50e67758fee781ef96c0ece69aa`
BLAKE2b-256	`e250801efae597703cd5d31f451e8e2ac8885eda09adc0060b97ab0a0160ea04`

See more details on using hashes here.

spestimator 0.3.0.234

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Spestimator

Why Spestimator?

Installation

From Pip/From Source

From Conda (includes blast)

Quick Start

Updating the database

Usage & Options

How It Works

Output Format

Example

Disclaimer: NCBI Datasets

AI Usage

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes