Skip to main content

A tool to predict bacterial species from fasta files using RefSeq 16S.

Project description

Spestimator Logo

Spestimator

PyPI - Version PyPI - Python Version GitHub Actions Workflow Status GitHub License

Species Estimator & Genome Retriever

Species + Estimator = Spestimator

Spestimator is a lightweight Python command-line tool designed to quickly identify bacterial species from input FASTA sequences (contigs or reads) and automatically retrieve their corresponding reference genomes from NCBI RefSeq.

Why Spestimator?

I needed a tool that could:

  1. Quickly estimate a range of potential reference genomes for an input sample.
  2. Use a database maintained by a trustworthy, authoritative group (NCBI RefSeq) rather than maintaining a custom, static database that gets outdated.

Installation

From Pip/From Source

Spestimator requires NCBI BLAST+ to be installed and available in your system path.

sudo apt-get install ncbi-blast+
# installation from pypi
pip install spestimator
# from source
git clone https://github.com/erinyoung/Spestimator.git
cd Spestimator
pip install .

From Conda (includes blast)

conda install -c bioconda spestimator

Quick Start

Identify organisms in your FASTA file(s)

spestimator -i *.fasta -o results.csv

Identify organisms and download the matched RefSeq genomes to a folder:

spestimator -i *.fasta -o results.csv -d genomes_dir/

Updating the database

RefSeq updates quarterly. There is an attempt to keep this package on a similar update schedule, but this may not be feasible. To get a current version of the database.

# downloads https://ftp.ncbi.nlm.nih.gov/refseq/TargetedLoci/Bacteria/bacteria.16SrRNA.fna.gz for blast database
# uses eutils to get taxids for blast database (NCBI api-key will speed this step up)
# downloads https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt for refseq accesions
spestimator --update-db --db-dir database --api-key NCBI_API_KEY

This database can then be used by

spestimator -i *.fasta -o results.csv -d genomes_dir/ --db-dir database

Usage & Options

usage: spestimator [-h] [-v] [-i INPUT [INPUT ...]] [-o OUTPUT] [-d [DIR]] [--db-dir DB_DIR] [--db-name DB_NAME] [-u] [--api-key API_KEY] [-t THREADS]
                   [--max-target-seqs MAX_TARGET_SEQS] [--min-identity MIN_IDENTITY] [--min-coverage MIN_COVERAGE] [--min-hits MIN_HITS]
                   [--min-alignment-len MIN_ALIGNMENT_LEN] [--top-k-taxa TOP_K_TAXA]

Spestimator: Predict bacterial TaxIDs from 16S and download genomes.

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i, --input INPUT [INPUT ...]
                        Input FASTA files
  -o, --output OUTPUT   Output CSV file
  -d, --download-genomes [DIR]
                        Download found genomes. Defaults to 'genomes/' if flag is used without a path.
  --db-dir DB_DIR       Override path to BLAST database directory
  --db-name DB_NAME     Custom name for the database to appear in results (Default: DB filename)
  -u, --update-db       Download database and generate metadata
  --api-key API_KEY     NCBI API Key (Speeds up metadata generation)
  -t, --threads THREADS
                        BLAST threads

Filtering Options:
  --max-target-seqs MAX_TARGET_SEQS
                        BLAST: Hits to keep per read (Default: 10)
  --min-identity MIN_IDENTITY
                        Filter: Minimum Percent Identity (0-100). Default: 90.0
  --min-coverage MIN_COVERAGE
                        Filter: Minimum Query Coverage (0-100). Default: 0.0
  --min-hits MIN_HITS   Filter: Minimum reads required to report an organism
  --min-alignment-len MIN_ALIGNMENT_LEN
                        Filter: Minimum Alignment Length in bp (Default: 0/No Filter)
  --top-k-taxa TOP_K_TAXA
                        Report: Only keep the top K unique organisms per file (Default: 10)

How It Works

  • Runs blastn of your input against the 16S database in the repo.
  • Filters hits based on identity and coverage.
  • Aggregates hits to identify the most likely species present.
  • (Optional): Uses ncbi-datasets-pyclient to download the reference genome assembly (GCF_xxxx) associated with the identified species.

Output Format

The results CSV contains the following columns:

  • organism: The clean species name (e.g., E. coli).
  • taxid: NCBI Taxonomy ID for the organism.
  • species_taxid: NCBI Taxonomy ID for the species of the organism.
  • refseq_assembly_accession: The Assembly ID (e.g., GCF_000005845.2) used for downloading.
  • sacc: The specific 16S sequence accession hit.
  • count: Number of input sequences matching this organism.
  • avg_pident: Average percent identity of the matches.
  • avg_bitscore: A metric combining match quality and length.

Example

spestimator -i tests/sample_positive.fasta

The results should look like this

input file,sacc,organism,refseq_assembly_accession,taxid,species_taxid,count,total_bitscore,avg_bitscore,avg_pident,max_pident,avg_qcov,best_evalue
sample_positive.fasta,NR_112088,Streptococcus pyogenes,GCF_900475035.1,1314,1314,1,2772,2772.0,100.0,100.0,33.25947263461113,0.0
sample_positive.fasta,NR_115729,Streptococcus canis,GCF_900636575.1,1329,1329,1,2545,2545.0,98.085,98.085,32.395302459561265,0.0
sample_positive.fasta,NR_024633,Streptococcus canis,GCF_900636575.1,1329,1329,1,2536,2536.0,97.202,97.202,33.25947263461113,0.0
sample_positive.fasta,NR_037101,Streptococcus urinalis,GCF_900636885.1,149016,149016,1,2534,2534.0,97.135,97.135,33.25947263461113,0.0
sample_positive.fasta,NR_043661,Streptococcus dysgalactiae subsp. equisimilis,GCF_016128095.1,119602,1334,1,2534,2534.0,97.443,97.443,32.92709949036118,0.0
sample_positive.fasta,NR_025148,Streptococcus iniae,GCF_000831485.1,1346,1346,1,2532,2532.0,96.422,96.422,34.05716818081099,0.0
sample_positive.fasta,NR_115802,Streptococcus ictaluri 707-05,GCF_000188015.2,764299,380397,1,2523,2523.0,97.627,97.627,32.68335918457789,0.0
sample_positive.fasta,NR_178901,Streptococcus penaeicida,GCF_053204055.1,1765960,1765960,1,2514,2514.0,96.217,96.217,33.96853534234434,0.0
sample_positive.fasta,NR_199923,Streptococcus tangpeifui,GCF_044503215.1,2709400,2709400,1,2494,2494.0,95.858,95.858,34.23443385774429,0.0
sample_positive.fasta,NR_040821,Streptococcus agalactiae ATCC 13813,GCF_001552035.1,888745,1311,1,2484,2484.0,96.536,96.536,33.25947263461113,0.0

Disclaimer: NCBI Datasets

Spestimator relies on ncbi-datasets-pyclient to retrieve genomes. This library updates frequently, as does the underlying datasets tool. If you encounter unexpected errors during genome download (e.g., ApiException or connection drops), please submit an issue and use the accessions in the "refseq_accession" column to download with datasets separately.

# download datasets
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets && chmod +x datasets

# get a list of accessions
cut -f 4 -d , results.csv | grep GCF > id_list.txt

# use the list of accessions to download genomes
./datasets download genome accession --inputfile id_list.txt --filename ncbi_dataset.zip

# decompress file and use for additional analysis
unzip ncbi_dataset.zip

AI Usage

Spestimator began as a set of custom shell and Python scripts used for ad-hoc analysis. To make these tools more reliable and accessible, Google's Gemini was used to accelerate the transition into a Python package. Gemini assisted in modularizing the codebase, replacing parsing logic with Pandas operations, and implementing a "mocked" testing suite that allows for safe CI/CD without hitting NCBI servers. Gemini also created the logo.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spestimator-0.3.0.234.tar.gz (13.7 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

spestimator-0.3.0.234-py3-none-any.whl (14.7 MB view details)

Uploaded Python 3

File details

Details for the file spestimator-0.3.0.234.tar.gz.

File metadata

  • Download URL: spestimator-0.3.0.234.tar.gz
  • Upload date:
  • Size: 13.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for spestimator-0.3.0.234.tar.gz
Algorithm Hash digest
SHA256 b8ee88db8289e6b36c4426816af2f79c6708abacea9b1dc557af047846020308
MD5 5c00796d710ad8b9b8bbf285d14f18a1
BLAKE2b-256 dcda59b0d8ddae23e48415e82ceabd8c8faf41742379bb5f9f9ecccbd993fbbc

See more details on using hashes here.

File details

Details for the file spestimator-0.3.0.234-py3-none-any.whl.

File metadata

File hashes

Hashes for spestimator-0.3.0.234-py3-none-any.whl
Algorithm Hash digest
SHA256 62352636ea1ce5836febddd7c4a44d3d592e2d520a4478bb22a9b437749516ba
MD5 be72e50e67758fee781ef96c0ece69aa
BLAKE2b-256 e250801efae597703cd5d31f451e8e2ac8885eda09adc0060b97ab0a0160ea04

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page