Python package for efficiently parsing NCBI's taxdump database

Taxdumpy

A high-performance Python toolkit for parsing NCBI Taxonomy databases with lineage resolution and fuzzy search

Features

  • Fast Parsing: Optimized loading of NCBI taxdump files with pickle caching for ~3x speedup
  • Dual Backends: In-memory (TaxDb) for speed or SQLite (TaxSQLite) for memory efficiency
  • Lineage Resolution: Complete lineage tracing with automatic handling of merged/deleted nodes
  • Rank Codes: Comparable RC class for efficient rank-based operations and grouping
  • Fuzzy Search: Rapid approximate name matching using rapidfuzz
  • Batch Grouping: Group taxids by rank level with group_taxids_by_rank
  • CLI: Ready-to-use commands for caching, searching, and lineage resolution

Installation

pip install taxdumpy

Development Installation:

git clone https://github.com/omegahh/taxdumpy.git
cd taxdumpy
pip install -e .

With Development Dependencies:

pip install -e ".[dev]"  # quoted so shells such as zsh do not expand the brackets

Quick Start

Basic Usage

from taxdumpy import TaxDb, Taxon

# Initialize database (uses pickle cache if available)
taxdb = TaxDb("/path/to/taxdump")

# Create taxon objects
human = Taxon(9606, taxdb)  # Homo sapiens
ecoli = Taxon(511145, taxdb)  # E. coli K-12

# Access lineage information
print(human.name_lineage)
# ['Homo sapiens', 'Homo', 'Hominidae', 'Primates', ..., 'cellular organisms']

print(human.rank_lineage)
# ['species', 'genus', 'family', 'order', ..., 'superkingdom']

# Check taxonomic properties
print(f"Rank: {human.rank}")           # species
print(f"Division: {human.division}")   # Primates
print(f"Is legacy: {human.is_legacy}") # False

Using the Factory Function

from taxdumpy import create_database, Taxon

# Create database with preferred backend
db = create_database(backend="sqlite", taxdump_dir="/path/to/taxdump")
# Or: backend="dict" for in-memory

taxon = Taxon(9606, db)
print(taxon.name)  # Homo sapiens

Fuzzy Search

# Search with typos and partial matches
results = taxdb._rapid_fuzz("Escherichia coli", limit=5)
for match in results:
    print(f"{match['name']} (TaxID: {match['taxid']}, Score: {match['score']})")
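
The idea behind approximate name matching can be tried without a taxdump. The sketch below is only an illustration: it swaps in the standard library's difflib for rapidfuzz, and the NAMES table and fuzzy_lookup helper are hypothetical names that are not part of taxdumpy.

```python
import difflib

# Hypothetical name -> taxid table; a real one would be built from names.dmp.
NAMES = {
    "Escherichia coli": 562,
    "Escherichia fergusonii": 564,
    "Homo sapiens": 9606,
    "Drosophila melanogaster": 7227,
}


def fuzzy_lookup(query: str, limit: int = 3, cutoff: float = 0.6) -> list[dict]:
    """Return up to `limit` approximate matches, best match first."""
    lowered = {name.lower(): name for name in NAMES}
    # get_close_matches ranks candidates by SequenceMatcher similarity.
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=cutoff)
    results = []
    for key in hits:
        name = lowered[key]
        score = difflib.SequenceMatcher(None, query.lower(), key).ratio()
        results.append({"name": name, "taxid": NAMES[name], "score": round(score * 100, 1)})
    return results


matches = fuzzy_lookup("Eschericia coli")  # query contains a typo
for match in matches:
    print(f"{match['name']} (TaxID: {match['taxid']}, Score: {match['score']})")
```

difflib is much slower than rapidfuzz on the full NCBI name list, so this is suitable for experimenting with small tables only.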

Rank Codes and Grouping

from taxdumpy import TaxDb, Taxon, RC, group_taxids_by_rank

taxdb = TaxDb("/path/to/taxdump")
human = Taxon(9606, taxdb)

# RC (Rank Code) - comparable rank representation
print(human.rc)              # RC('S') for species
print(human.rankcode_lineage)  # [RC('S'), RC('G'), RC('F1'), RC('F'), ...]

# RC comparison (lower value = higher rank)
RC("R") < RC("D") < RC("G") < RC("S")  # True
RC("F") < RC("F1")  # True (F1 is below F)

# Group taxids by rank
taxids = [9606, 9598, 9597]  # Human, Chimp, Bonobo
by_family = group_taxids_by_rank(taxids, "family", taxdb)
# {9604: [9606, 9598, 9597]}  # All in Hominidae

by_genus = group_taxids_by_rank(taxids, "genus", taxdb)
# {9605: [9606], 9596: [9598, 9597]}  # Homo vs. Pan (chimp and bonobo share a genus)

# MPA-style lineage representation
print(human.mpa_repr)
# 'd__Eukaryota|k__Metazoa|p__Chordata|...|g__Homo|s__Homo_sapiens'
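
To make the ordering rules above concrete, here is a minimal sketch of a comparable rank code. It is not the library's RC implementation: the RankCode class, the letter order in _MAIN, and the "letter plus optional depth" parsing are all assumptions chosen to reproduce the comparisons shown above.

```python
from functools import total_ordering

# Assumed order of main rank letters, highest rank first (illustrative subset).
_MAIN = "RDKPCOFGS"


@total_ordering
class RankCode:
    """Hypothetical stand-in for a comparable rank code such as 'F' or 'F1'."""

    def __init__(self, code: str):
        depth = code[1:]
        if not code or code[0] not in _MAIN or (depth and not depth.isdigit()):
            raise ValueError(f"unknown rank code: {code!r}")
        self.code = code
        # Sort key: position of the main rank, then sub-level depth below it.
        self._key = (_MAIN.index(code[0]), int(depth or 0))

    def __eq__(self, other):
        if not isinstance(other, RankCode):
            return NotImplemented
        return self._key == other._key

    def __lt__(self, other):
        if not isinstance(other, RankCode):
            return NotImplemented
        return self._key < other._key

    def __hash__(self):
        return hash(self._key)

    def __repr__(self):
        return f"RankCode({self.code!r})"


ordered = sorted([RankCode("S"), RankCode("F1"), RankCode("G"), RankCode("F")])
print(ordered)
```

Using a tuple as the sort key keeps sub-ranks such as F1 between their main rank and the next lower main rank, which is the behavior the comparisons above describe.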

Command Line Interface

Check Version

taxdumpy --version

Cache Database

# Cache full NCBI taxonomy database
taxdumpy cache -d /path/to/taxdump

# Create fast cache with specific organisms
taxdumpy cache -d /path/to/taxdump -f important_taxids.txt

Search Operations

# Search for organisms (with fuzzy matching)
taxdumpy search --fast "Escherichia coli"
taxdumpy search "Homo sapiens"

# Limit search results
taxdumpy search --limit 5 "Influenza A"

# Search with custom database path
taxdumpy search -d /custom/path "Influenza A"

Lineage Tracing

# Get complete lineage for TaxID
taxdumpy lineage --fast 511145  # E. coli K-12 MG1655
taxdumpy lineage 9606           # Homo sapiens

# With custom database path
taxdumpy lineage -d /custom/path 9606

Database Setup

1. Download NCBI Taxonomy Data

# Create directory for taxonomy data
mkdir -p ~/.taxonkit

# Download latest taxdump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz -P ~/.taxonkit

# Extract files
tar -xzf ~/.taxonkit/taxdump.tar.gz -C ~/.taxonkit
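
The extracted archive contains plain-text tables such as nodes.dmp and names.dmp, in which fields are separated by a tab, a pipe, and a tab, and each record ends with a tab and a pipe. The sketch below shows one way to read that layout; parse_dmp_line and load_nodes are hypothetical helpers rather than taxdumpy functions, and the sample lines keep only the first three columns of nodes.dmp.

```python
def parse_dmp_line(line: str) -> list[str]:
    """Split one .dmp record into its fields."""
    # Strip the newline and the trailing "\t|", then split on "\t|\t".
    return line.rstrip("\n").removesuffix("\t|").split("\t|\t")


def load_nodes(lines) -> dict[int, tuple[int, str]]:
    """Map taxid -> (parent taxid, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = parse_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


# Illustrative sample; real records carry additional columns after the rank.
SAMPLE = [
    "1\t|\t1\t|\tno rank\t|\n",
    "9605\t|\t207598\t|\tgenus\t|\n",
    "9606\t|\t9605\t|\tspecies\t|\n",
]

nodes = load_nodes(SAMPLE)
print(nodes[9606])
```

For a real file, the same function can be fed an open file handle, since iterating over a text file yields one line at a time.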

2. Initialize Database

# Create full cache (recommended for regular use)
taxdumpy cache -d ~/.taxonkit

# Or create fast cache with specific organisms
printf "9606\n511145\n7227\n" > important_species.txt
taxdumpy cache -d ~/.taxonkit -f important_species.txt

3. Set Environment Variable (Optional)

export TAXDB_PATH=~/.taxonkit

Advanced Usage

SQLite Backend

from taxdumpy import TaxSQLite, Taxon

# Use SQLite for memory-efficient storage
db = TaxSQLite("/path/to/taxdump")  # Creates taxonomy.db in that directory

# Same API as TaxDb
taxon = Taxon(9606, db)
print(taxon.name_lineage)

# Use as context manager
with TaxSQLite("/path/to/taxdump") as db:
    taxon = Taxon(9606, db)
    print(taxon.name)
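
For readers curious how a lineage lookup can work on top of SQLite, the sketch below walks parent links with a recursive query. The nodes table, its columns, and the name_lineage helper are assumptions made for this illustration; the schema that TaxSQLite actually creates may differ.

```python
import sqlite3

# In-memory toy database with an assumed, abbreviated schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (taxid INTEGER PRIMARY KEY, parent INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        (1, 1, "root"),  # the root is its own parent, as in nodes.dmp
        (9604, 1, "Hominidae"),  # intermediate ranks omitted for brevity
        (9605, 9604, "Homo"),
        (9606, 9605, "Homo sapiens"),
    ],
)

LINEAGE_SQL = """
WITH RECURSIVE walk(taxid, parent, name, depth) AS (
    SELECT taxid, parent, name, 0 FROM nodes WHERE taxid = ?
    UNION ALL
    SELECT n.taxid, n.parent, n.name, w.depth + 1
    FROM nodes AS n JOIN walk AS w ON n.taxid = w.parent
    WHERE w.taxid != w.parent  -- stop once the root has been emitted
)
SELECT name FROM walk ORDER BY depth
"""


def name_lineage(taxid: int) -> list[str]:
    """Names from the given taxid up to the root."""
    return [row[0] for row in conn.execute(LINEAGE_SQL, (taxid,))]


print(name_lineage(9606))
```

Only the rows on the path to the root are read, which is why a disk-backed table can serve lineage queries without loading the whole taxonomy into memory.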

Batch Processing

from taxdumpy import TaxDb, Taxon

# Reuse database instance for efficiency
taxdb = TaxDb("/path/to/taxdump", fast=True)

taxids = [9606, 511145, 7227, 4932]  # Human, E.coli, Fly, Yeast
for taxid in taxids:
    taxon = Taxon(taxid, taxdb)
    print(f"{taxon.name}: {' > '.join(taxon.name_lineage[:3])}")

Handling Merged/Legacy TaxIDs

from taxdumpy import TaxDb, Taxon

taxdb = TaxDb("/path/to/taxdump")

# Merged taxids are automatically resolved.
# old_taxid is a placeholder: substitute a taxid that appears in merged.dmp.
taxon = Taxon(old_taxid, taxdb)
print(taxon.is_legacy)  # True if taxid was merged
print(taxon.taxid)      # Original taxid
print(taxon.node.taxid) # Current/resolved taxid
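
The resolution step can be pictured with two small lookup tables. This is a sketch under stated assumptions, not the library's code: PARENT and MERGED are toy dictionaries, the old taxid 99999901 is made up, the parent chain is abbreviated, and resolve and lineage are hypothetical helpers.

```python
# Toy tables; real ones would be parsed from nodes.dmp and merged.dmp.
PARENT = {1: 1, 9604: 1, 9605: 9604, 9606: 9605}  # taxid -> parent (root is its own parent)
MERGED = {99999901: 9606}  # made-up old taxid -> current taxid


def resolve(taxid: int) -> tuple[int, bool]:
    """Return (current taxid, whether the input was a merged taxid)."""
    if taxid in PARENT:
        return taxid, False
    if taxid in MERGED:
        return MERGED[taxid], True
    raise KeyError(f"unknown taxid: {taxid}")


def lineage(taxid: int) -> list[int]:
    """Taxids from the resolved node up to the root."""
    current, _ = resolve(taxid)
    path = [current]
    while PARENT[current] != current:
        current = PARENT[current]
        path.append(current)
    return path


current, was_merged = resolve(99999901)
print(current, was_merged, lineage(99999901))
```

Keeping the original taxid alongside the resolved one, as the Taxon properties above do, lets downstream code report which identifiers in a dataset are out of date.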

Exception Handling

from taxdumpy import TaxDb, Taxon, TaxidError, TaxRankError, TaxDbError, upper_rank_id

taxdb = TaxDb("/path/to/taxdump")

try:
    taxon = Taxon(999999999, taxdb)
except TaxidError as e:
    print(f"Invalid taxid: {e}")
    # Includes suggestions for similar taxids

try:
    rank_id = upper_rank_id(9606, "invalid_rank", taxdb)
except TaxRankError as e:
    print(f"Invalid rank: {e}")
    # Includes list of valid ranks

API Reference

Core Classes

TaxDb: In-memory dictionary-based database with pickle caching

TaxDb(taxdump_dir: str, fast: bool = False)

TaxSQLite: SQLite-based persistent database

TaxSQLite(taxdump_dir: str)

Taxon: Taxonomic unit with lineage resolution

Taxon(taxid: int, taxdb: TaxDb | TaxSQLite)

# Properties
.name: str               # Scientific name
.rank: str               # Taxonomic rank
.division: str           # NCBI division
.parent: int             # Parent taxid
.lineage: list[Node]     # Complete lineage as Node objects
.name_lineage: list[str] # Names from taxid to root
.rank_lineage: list[str] # Ranks from taxid to root
.taxid_lineage: list[int] # Taxids from taxid to root
.is_legacy: bool         # True if taxid was merged
.has_species_level: bool # True if lineage contains species rank
.species_taxid: int | None # Taxid at species level

Factory Function

from taxdumpy import create_database

# backend: "sqlite", "sql", "dict", "memory", or "pickle"
db = create_database(backend="sqlite", taxdump_dir="/path/to/taxdump")

Rank Functions

from taxdumpy import (
    upper_rank_id,      # Get taxid at rank (raises TaxRankError if not found)
    get_rank_taxid,     # Get taxid at rank (returns None if not found)
    get_canonical_ranks, # Get dict of all canonical ranks in lineage
    get_closest_rank,   # Get closest rank at or above target level
    get_rank_distance,  # Calculate distance between two ranks
    RANKNAMES,          # 8 canonical ranks: species → realm
    EXTENDED_RANKS,     # 30+ ranks including sub-ranks
)

Exceptions

from taxdumpy import (
    TaxdumpyError,          # Base exception
    TaxDbError,             # Database issues
    TaxidError,             # Invalid/unknown taxid
    TaxRankError,           # Invalid rank
    TaxdumpFileError,       # Taxdump file issues
    DatabaseCorruptionError, # Corrupted database
    ValidationError,        # Input validation failures
)

Performance Tips

  • Use Fast Mode: TaxDb(path, fast=True) provides ~3x speedup with pre-cached data
  • Reuse Instances: Create one TaxDb instance and reuse for multiple operations
  • Environment Variables: Set TAXDB_PATH to avoid repeating database paths
  • Choose Backend: Use TaxSQLite for large datasets with limited memory
  • Batch Operations: Process multiple TaxIDs in batches rather than individual calls
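
The caching idea behind the first tip can be sketched with the standard library alone. This is a generic parse-once, reload-from-pickle pattern under assumed names (load_or_build, slow_parse, taxdb.pickle), not taxdumpy's actual cache format or file layout.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_file: Path, build):
    """Load a pickled object if the cache exists, else build it and cache it."""
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    data = build()
    with cache_file.open("wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data


calls = []


def slow_parse() -> dict[int, str]:
    """Stand-in for an expensive parse of the dump files."""
    calls.append(1)
    return {9606: "Homo sapiens"}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "taxdb.pickle"
    first = load_or_build(cache, slow_parse)  # parses and writes the cache
    second = load_or_build(cache, slow_parse)  # served from the cache

print(first == second, len(calls))
```

Pickle files should only be loaded from locations you control, since unpickling untrusted data can execute arbitrary code.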

Use Cases

  • Metagenomics: Classify and annotate environmental sequences
  • Phylogenetics: Build taxonomic trees and study evolutionary relationships
  • Bioinformatics: Pipeline integration for taxonomy-aware analysis
  • Data Validation: Verify and standardize organism names in datasets
  • Research: Large-scale taxonomic studies and biodiversity analysis

Requirements

  • Python: 3.10+
  • Dependencies: rapidfuzz, tqdm
  • Data: NCBI taxdump files (~300MB compressed, ~2GB extracted)
  • Memory: ~500MB RAM for full database (less with SQLite backend)

Development & Contributing

Development Setup

git clone https://github.com/omegahh/taxdumpy.git
cd taxdumpy
pip install -e ".[dev]"

Testing

# Run all tests (requires 80% coverage)
pytest

# Run specific test categories
pytest -m "not slow and not integration"  # Fast tests only
pytest -m integration                      # Integration tests only

# With coverage report
pytest --cov=taxdumpy --cov-report=html

Code Quality

# Format code and sort imports
ruff format src/ tests/
ruff check src/ tests/ --fix

Release Process

# Automated release
python scripts/release.py --version 1.2.3 --upload

# Or manual: create tag and push (GitHub Actions publishes to PyPI)
git tag v1.2.3 && git push origin v1.2.3

License

MIT License - see LICENSE for details.


Related Projects: TaxonKit (Go), ete3 (Python), taxizedb (R)
