Python package for efficiently parsing NCBI's taxdump database

Taxdumpy

A high-performance Python toolkit for parsing NCBI Taxonomy databases with lineage resolution and fuzzy search

Features

Fast Parsing: Optimized loading of NCBI taxdump files with pickle caching for ~3x speedup
Dual Backends: In-memory (TaxDb) for speed or SQLite (TaxSQLite) for memory efficiency
Lineage Resolution: Complete lineage tracing with automatic handling of merged/deleted nodes
Rank Codes: Comparable RC class for efficient rank-based operations and grouping
Fuzzy Search: Rapid approximate name matching using rapidfuzz
Batch Grouping: Group taxids by rank level with group_taxids_by_rank
CLI: Ready-to-use commands for caching, searching, and lineage resolution

Installation

pip install taxdumpy

Development Installation:

git clone https://github.com/omegahh/taxdumpy.git
cd taxdumpy
pip install -e .

With Development Dependencies:

pip install -e ".[dev]"

Quick Start

Basic Usage

from taxdumpy import TaxDb, Taxon

# Initialize database (uses pickle cache if available)
taxdb = TaxDb("/path/to/taxdump")

# Create taxon objects
human = Taxon(9606, taxdb)  # Homo sapiens
ecoli = Taxon(511145, taxdb)  # E. coli K-12

# Access lineage information
print(human.name_lineage)
# ['Homo sapiens', 'Homo', 'Hominidae', 'Primates', ..., 'cellular organisms']

print(human.rank_lineage)
# ['species', 'genus', 'family', 'order', ..., 'superkingdom']

# Check taxonomic properties
print(f"Rank: {human.rank}")           # species
print(f"Division: {human.division}")   # Primates
print(f"Is legacy: {human.is_legacy}") # False
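
To make the idea of lineage resolution concrete, here is a minimal standard-library sketch. It is not taxdumpy's implementation. It parses a few rows in the NCBI .dmp layout (fields separated by a tab, a pipe, and a tab; rows ending in a tab and a pipe) and walks parent pointers up to the root. The helpers parse_dmp and lineage are hypothetical, and the sample is heavily abridged: real nodes.dmp rows carry more columns and real lineages contain many intermediate nodes.

```python
# Sketch only: a toy lineage walk over rows in the NCBI .dmp layout.
# parse_dmp() and lineage() are hypothetical helpers, not taxdumpy functions.

# Abridged sample: taxid, parent taxid, rank (intermediate nodes omitted).
NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "9604\t|\t1\t|\tfamily\t|\n"
    "9605\t|\t9604\t|\tgenus\t|\n"
    "9606\t|\t9605\t|\tspecies\t|\n"
)


def parse_dmp(text):
    """Split .dmp text into rows of fields."""
    rows = []
    for line in text.splitlines():
        if line.endswith("\t|"):
            line = line[:-2]  # drop the row terminator
        rows.append(line.split("\t|\t"))
    return rows


def lineage(taxid, parent):
    """Return taxids from the given node up to the root."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:  # the root is its own parent
        path.append(parent[path[-1]])
    return path


rows = parse_dmp(NODES_DMP)
parent = {int(r[0]): int(r[1]) for r in rows}
rank = {int(r[0]): r[2] for r in rows}

print(lineage(9606, parent))                     # [9606, 9605, 9604, 1]
print([rank[t] for t in lineage(9606, parent)])  # ['species', 'genus', 'family', 'no rank']
```

The walk stops when a node is its own parent, which is how the root (taxid 1) is encoded in nodes.dmp.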

Using the Factory Function

from taxdumpy import create_database, Taxon

# Create database with preferred backend
db = create_database(backend="sqlite", taxdump_dir="/path/to/taxdump")
# Or: backend="dict" for in-memory

taxon = Taxon(9606, db)
print(taxon.name)  # Homo sapiens

Fuzzy Search

# Search with typos and partial matches
results = taxdb._rapid_fuzz("Escherichia coli", limit=5)
for match in results:
    print(f"{match['name']} (TaxID: {match['taxid']}, Score: {match['score']})")
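
Taxdumpy relies on rapidfuzz for this. To show the general idea of approximate name matching without extra dependencies, the sketch below uses difflib from the standard library as a stand-in, so its scoring differs from rapidfuzz. The fuzzy_lookup helper and the four-entry name table are illustrative assumptions, not part of the package.

```python
import difflib

# Tiny illustrative name table; a real names.dmp holds millions of entries.
NAME_TO_TAXID = {
    "Escherichia coli": 562,
    "Escherichia": 561,
    "Homo sapiens": 9606,
    "Pan troglodytes": 9598,
}


def fuzzy_lookup(query, limit=5, cutoff=0.6):
    """Return (name, taxid) pairs for the closest names, best match first."""
    lowered = {name.lower(): name for name in NAME_TO_TAXID}
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [(lowered[hit], NAME_TO_TAXID[lowered[hit]]) for hit in hits]


print(fuzzy_lookup("Escherichia coly", limit=1))  # [('Escherichia coli', 562)]
```

Lowercasing both sides makes the comparison case-insensitive; the cutoff discards names that are too dissimilar to be useful suggestions.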

Rank Codes and Grouping

from taxdumpy import TaxDb, Taxon, RC, group_taxids_by_rank

taxdb = TaxDb("/path/to/taxdump")
human = Taxon(9606, taxdb)

# RC (Rank Code) - comparable rank representation
print(human.rc)              # RC('S') for species
print(human.rankcode_lineage)  # [RC('S'), RC('G'), RC('F1'), RC('F'), ...]

# RC comparison (lower value = higher rank)
RC("R") < RC("D") < RC("G") < RC("S")  # True
RC("F") < RC("F1")  # True (F1 is below F)

# Group taxids by rank
taxids = [9606, 9598, 9597]  # Human, Chimp, Bonobo
by_family = group_taxids_by_rank(taxids, "family", taxdb)
# {9604: [9606, 9598, 9597]}  # All in Hominidae

by_genus = group_taxids_by_rank(taxids, "genus", taxdb)
# {9605: [9606], 9596: [9598, 9597]}  # Homo vs. Pan (chimp and bonobo share a genus)

# MPA-style lineage representation
print(human.mpa_repr)
# 'd__Eukaryota|k__Metazoa|p__Chordata|...|g__Homo|s__Homo_sapiens'
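
The following sketch shows one way comparable rank codes and rank-based grouping could work. It is a simplified model, not the package's RC class or group_taxids_by_rank. The RankCode class, the letter order in MAJOR_ORDER, and the abridged parent/rank tables are assumptions made for illustration.

```python
from dataclasses import dataclass
from functools import total_ordering

# Assumed major-rank letters, root-most first; not taken from taxdumpy.
MAJOR_ORDER = "RDKPCOFGS"


@total_ordering
@dataclass(frozen=True)
class RankCode:
    """Hypothetical stand-in for a comparable rank code such as 'F' or 'F1'."""

    code: str

    def _key(self):
        letter, sublevel = self.code[0], self.code[1:]
        return (MAJOR_ORDER.index(letter), int(sublevel) if sublevel else 0)

    def __lt__(self, other):
        if not isinstance(other, RankCode):
            return NotImplemented
        return self._key() < other._key()


# Abridged toy taxonomy (intermediate nodes omitted).
PARENT = {1: 1, 9604: 1, 9605: 9604, 9596: 9604, 9606: 9605, 9598: 9596, 9597: 9596}
RANK = {1: "no rank", 9604: "family", 9605: "genus", 9596: "genus",
        9606: "species", 9598: "species", 9597: "species"}


def group_by_rank(taxids, target, parent, rank):
    """Map each ancestor at the target rank to the input taxids below it."""
    groups = {}
    for taxid in taxids:
        node = taxid
        while rank[node] != target and parent[node] != node:
            node = parent[node]
        if rank[node] == target:  # skip taxids with no ancestor at that rank
            groups.setdefault(node, []).append(taxid)
    return groups


print(RankCode("F") < RankCode("F1") < RankCode("G"))           # True
print(group_by_rank([9606, 9598, 9597], "genus", PARENT, RANK))
# {9605: [9606], 9596: [9598, 9597]}
```

Sorting by a (major rank, sublevel) tuple keeps sub-ranks such as F1 between their parent rank and the next major rank.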

Command Line Interface

Check Version

taxdumpy --version

Cache Database

# Cache full NCBI taxonomy database
taxdumpy cache -d /path/to/taxdump

# Create fast cache with specific organisms
taxdumpy cache -d /path/to/taxdump -f important_taxids.txt

Search Operations

# Search for organisms (with fuzzy matching)
taxdumpy search --fast "Escherichia coli"
taxdumpy search "Homo sapiens"

# Limit search results
taxdumpy search --limit 5 "Influenza A"

# Search with custom database path
taxdumpy search -d /custom/path "Influenza A"

Lineage Tracing

# Get complete lineage for TaxID
taxdumpy lineage --fast 511145  # E. coli K-12 MG1655
taxdumpy lineage 9606           # Homo sapiens

# With custom database path
taxdumpy lineage -d /custom/path 9606

Database Setup

1. Download NCBI Taxonomy Data

# Create directory for taxonomy data
mkdir -p ~/.taxonkit

# Download latest taxdump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz -P ~/.taxonkit

# Extract files
tar -xzf ~/.taxonkit/taxdump.tar.gz -C ~/.taxonkit

2. Initialize Database

# Create full cache (recommended for regular use)
taxdumpy cache -d ~/.taxonkit

# Or create fast cache with specific organisms
printf '9606\n511145\n7227\n' > important_species.txt
taxdumpy cache -d ~/.taxonkit -f important_species.txt
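
The format of the fast cache is not described here. One plausible reading is that it keeps only the listed taxids plus their ancestors, and the sketch below illustrates that idea under this assumption. The helpers read_taxids and ancestor_closure and the abridged PARENT table are hypothetical and are not taxdumpy code.

```python
import io

# Abridged toy parent table (intermediate nodes omitted).
PARENT = {1: 1, 2: 1, 562: 2, 511145: 562, 9606: 1, 7227: 1, 4932: 1}


def read_taxids(handle):
    """Read one taxid per line, skipping blank lines."""
    return [int(line) for line in handle if line.strip()]


def ancestor_closure(taxids, parent):
    """Return the given taxids together with all of their ancestors."""
    keep = set()
    for taxid in taxids:
        node = taxid
        while node not in keep:  # stop once we reach an already-kept ancestor
            keep.add(node)
            node = parent[node]
    return keep


wanted = read_taxids(io.StringIO("9606\n511145\n7227\n"))
subset = ancestor_closure(wanted, PARENT)
print(sorted(subset))  # [1, 2, 562, 7227, 9606, 511145]
```

Because each walk stops at the first ancestor already collected, shared upper branches are visited only once.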

3. Set Environment Variable (Optional)

export TAXDB_PATH=~/.taxonkit
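
How taxdumpy itself reads TAXDB_PATH is not shown here. In your own scripts, a small helper like the hypothetical resolve_taxdump_dir below can apply the same convention. The precedence (explicit argument first, then the environment variable) is an assumption of this sketch.

```python
import os
from pathlib import Path


def resolve_taxdump_dir(explicit=None, env=None):
    """Pick a taxdump directory: explicit argument first, then TAXDB_PATH."""
    env = os.environ if env is None else env
    raw = explicit or env.get("TAXDB_PATH")
    if not raw:
        raise ValueError("No taxdump directory: pass a path or set TAXDB_PATH")
    return Path(raw).expanduser()  # expand a leading ~ if present


print(resolve_taxdump_dir(env={"TAXDB_PATH": "/srv/taxdump"}))
```

Passing env explicitly keeps the helper easy to test without touching the real process environment.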

Advanced Usage

SQLite Backend

from taxdumpy import TaxSQLite, Taxon

# Use SQLite for memory-efficient storage
db = TaxSQLite("/path/to/taxdump")  # Creates taxonomy.db in that directory

# Same API as TaxDb
taxon = Taxon(9606, db)
print(taxon.name_lineage)

# Use as context manager
with TaxSQLite("/path/to/taxdump") as db:
    taxon = Taxon(9606, db)
    print(taxon.name)
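
To illustrate why an SQLite backend can stay memory-efficient, the sketch below resolves a lineage inside the database with a recursive query, so only the rows on the path are loaded. The table layout and the name_lineage helper are assumptions for illustration; the schema of the taxonomy.db file that TaxSQLite creates is not documented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes ("
    "taxid INTEGER PRIMARY KEY, parent INTEGER, rank_name TEXT, name TEXT)"
)
# Abridged toy data (intermediate nodes omitted).
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        (1, 1, "no rank", "root"),
        (9604, 1, "family", "Hominidae"),
        (9605, 9604, "genus", "Homo"),
        (9606, 9605, "species", "Homo sapiens"),
    ],
)

# Walk parent pointers until a node is its own parent (the root).
LINEAGE_SQL = """
WITH RECURSIVE walk(taxid, parent, name, depth) AS (
    SELECT taxid, parent, name, 0 FROM nodes WHERE taxid = ?
    UNION ALL
    SELECT n.taxid, n.parent, n.name, w.depth + 1
    FROM nodes AS n JOIN walk AS w ON n.taxid = w.parent
    WHERE w.taxid != w.parent
)
SELECT name FROM walk ORDER BY depth
"""


def name_lineage(conn, taxid):
    """Return names from the given taxid up to the root."""
    return [row[0] for row in conn.execute(LINEAGE_SQL, (taxid,))]


print(name_lineage(conn, 9606))  # ['Homo sapiens', 'Homo', 'Hominidae', 'root']
```

The WHERE clause in the recursive step prevents the self-referencing root row from looping forever.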

Batch Processing

from taxdumpy import TaxDb, Taxon

# Reuse database instance for efficiency
taxdb = TaxDb("/path/to/taxdump", fast=True)

taxids = [9606, 511145, 7227, 4932]  # Human, E.coli, Fly, Yeast
for taxid in taxids:
    taxon = Taxon(taxid, taxdb)
    print(f"{taxon.name}: {' > '.join(taxon.name_lineage[:3])}")

Handling Merged/Legacy TaxIDs

from taxdumpy import TaxDb, Taxon

taxdb = TaxDb("/path/to/taxdump")

# Merged taxids are automatically resolved
# (old_taxid is a placeholder: substitute a taxid listed in merged.dmp)
taxon = Taxon(old_taxid, taxdb)
print(taxon.is_legacy)  # True if taxid was merged
print(taxon.taxid)      # Original taxid
print(taxon.node.taxid) # Current/resolved taxid
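
As a rough picture of what resolving a merged taxid involves, the sketch below looks a taxid up in a merged map (as in merged.dmp, which pairs old and new taxids) and a deleted set (as in delnodes.dmp). It is not taxdumpy's code. The resolve_taxid helper is hypothetical and the taxids 50, 60, and 100 are made up for the example.

```python
# Made-up sample rows in the .dmp layout.
MERGED_DMP = "50\t|\t100\t|\n"   # old taxid, new taxid
DELNODES_DMP = "60\t|\n"         # deleted taxid


def parse_rows(text):
    """Split .dmp text into rows of fields."""
    return [line.removesuffix("\t|").split("\t|\t") for line in text.splitlines()]


merged = {int(old): int(new) for old, new in parse_rows(MERGED_DMP)}
deleted = {int(row[0]) for row in parse_rows(DELNODES_DMP)}
live = {1, 100}  # taxids present in the current taxonomy


def resolve_taxid(taxid, live, merged, deleted):
    """Return (current_taxid, is_legacy) or raise KeyError."""
    if taxid in live:
        return taxid, False
    if taxid in merged:
        return merged[taxid], True
    if taxid in deleted:
        raise KeyError(f"taxid {taxid} was deleted")
    raise KeyError(f"unknown taxid {taxid}")


print(resolve_taxid(50, live, merged, deleted))   # (100, True)
print(resolve_taxid(100, live, merged, deleted))  # (100, False)
```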

Exception Handling

from taxdumpy import TaxDb, Taxon, TaxidError, TaxRankError, TaxDbError, upper_rank_id

taxdb = TaxDb("/path/to/taxdump")

try:
    taxon = Taxon(999999999, taxdb)
except TaxidError as e:
    print(f"Invalid taxid: {e}")
    # Includes suggestions for similar taxids

try:
    rank_id = upper_rank_id(9606, "invalid_rank", taxdb)
except TaxRankError as e:
    print(f"Invalid rank: {e}")
    # Includes list of valid ranks

API Reference

Core Classes

TaxDb: In-memory dictionary-based database with pickle caching

TaxDb(taxdump_dir: str, fast: bool = False)

TaxSQLite: SQLite-based persistent database

TaxSQLite(taxdump_dir: str)

Taxon: Taxonomic unit with lineage resolution

Taxon(taxid: int, taxdb: TaxDb | TaxSQLite)

# Properties
.name: str               # Scientific name
.rank: str               # Taxonomic rank
.division: str           # NCBI division
.parent: int             # Parent taxid
.lineage: list[Node]     # Complete lineage as Node objects
.name_lineage: list[str] # Names from taxid to root
.rank_lineage: list[str] # Ranks from taxid to root
.taxid_lineage: list[int] # Taxids from taxid to root
.is_legacy: bool         # True if taxid was merged
.has_species_level: bool # True if lineage contains species rank
.species_taxid: int | None # Taxid at species level

Factory Function

from taxdumpy import create_database

# backend: "sqlite", "sql", "dict", "memory", or "pickle"
db = create_database(backend="sqlite", taxdump_dir="/path/to/taxdump")

Rank Functions

from taxdumpy import (
    upper_rank_id,      # Get taxid at rank (raises TaxRankError if not found)
    get_rank_taxid,     # Get taxid at rank (returns None if not found)
    get_canonical_ranks, # Get dict of all canonical ranks in lineage
    get_closest_rank,   # Get closest rank at or above target level
    get_rank_distance,  # Calculate distance between two ranks
    RANKNAMES,          # 8 canonical ranks: species → realm
    EXTENDED_RANKS,     # 30+ ranks including sub-ranks
)
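
The difference between a lookup that returns None and one that falls back to the closest higher rank can be sketched as follows. This is a simplified model, not the package's functions. The LEVELS list, the helpers rank_taxid and closest_rank_at_or_above, and the abridged lineage are assumptions for illustration.

```python
# Assumed rank order, most specific first; not necessarily taxdumpy's RANKNAMES.
LEVELS = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]

# Abridged lineage from leaf toward root as (taxid, rank) pairs.
HUMAN = [
    (9606, "species"),
    (9605, "genus"),
    (207598, "subfamily"),
    (9604, "family"),
    (9443, "order"),
]


def rank_taxid(lineage, target):
    """Return the taxid at exactly the target rank, or None if absent."""
    for taxid, rank in lineage:
        if rank == target:
            return taxid
    return None


def closest_rank_at_or_above(lineage, target):
    """Return the first (taxid, rank) at or above the target level, or None."""
    floor = LEVELS.index(target)
    for taxid, rank in lineage:  # leaf to root, so the first hit is the closest
        if rank in LEVELS and LEVELS.index(rank) >= floor:
            return taxid, rank
    return None


print(rank_taxid(HUMAN, "family"))                     # 9604
print(rank_taxid(HUMAN, "class"))                      # None
print(closest_rank_at_or_above(HUMAN[2:], "genus"))    # (9604, 'family')
```

In the last call the lineage starts at a subfamily, so there is no genus to return and the nearest canonical rank above it, the family, is used instead.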

Exceptions

from taxdumpy import (
    TaxdumpyError,          # Base exception
    TaxDbError,             # Database issues
    TaxidError,             # Invalid/unknown taxid
    TaxRankError,           # Invalid rank
    TaxdumpFileError,       # Taxdump file issues
    DatabaseCorruptionError, # Corrupted database
    ValidationError,        # Input validation failures
)

Performance Tips

Use Fast Mode: TaxDb(path, fast=True) provides ~3x speedup with pre-cached data
Reuse Instances: Create one TaxDb instance and reuse for multiple operations
Environment Variables: Set TAXDB_PATH to avoid repeating database paths
Choose Backend: Use TaxSQLite for large datasets with limited memory
Batch Operations: Process multiple TaxIDs in batches rather than individual calls
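
The speedup from caching comes from skipping the text parse on later runs. A minimal sketch of that pattern, assuming a cache file that is reused whenever it is at least as new as the source file, might look like this. The helpers load_with_cache and parse_parents are hypothetical and do not reflect how taxdumpy lays out its own cache.

```python
import pickle
import tempfile
from pathlib import Path


def parse_parents(path):
    """Parse taxid and parent taxid from rows in the .dmp layout."""
    parents = {}
    for line in path.read_text().splitlines():
        fields = line.removesuffix("\t|").split("\t|\t")
        parents[int(fields[0])] = int(fields[1])
    return parents


def load_with_cache(source, cache, parse):
    """Return parsed data, reusing the pickle when it is not older than the source."""
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        with cache.open("rb") as handle:
            return pickle.load(handle)  # only load pickles you created yourself
    data = parse(source)
    with cache.open("wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return data


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "nodes.dmp"
    source.write_text("1\t|\t1\t|\n9606\t|\t9605\t|\n")
    cache = Path(tmp) / "nodes.pkl"
    first = load_with_cache(source, cache, parse_parents)   # parses, then writes the cache
    second = load_with_cache(source, cache, parse_parents)  # served from the pickle
    print(first == second)  # True
```

Pickle files can execute arbitrary code when loaded, so a cache like this should only be read from a location you control.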

Use Cases

Metagenomics: Classify and annotate environmental sequences
Phylogenetics: Build taxonomic trees and study evolutionary relationships
Bioinformatics: Pipeline integration for taxonomy-aware analysis
Data Validation: Verify and standardize organism names in datasets
Research: Large-scale taxonomic studies and biodiversity analysis

Requirements

Python: 3.10+
Dependencies: rapidfuzz, tqdm
Data: NCBI taxdump files (~300MB compressed, ~2GB extracted)
Memory: ~500MB RAM for full database (less with SQLite backend)

Development & Contributing

Development Setup

git clone https://github.com/omegahh/taxdumpy.git
cd taxdumpy
pip install -e ".[dev]"

Testing

# Run all tests (requires 80% coverage)
pytest

# Run specific test categories
pytest -m "not slow and not integration"  # Fast tests only
pytest -m integration                      # Integration tests only

# With coverage report
pytest --cov=taxdumpy --cov-report=html

Code Quality

# Format code and sort imports
ruff format src/ tests/
ruff check src/ tests/ --fix

Release Process

# Automated release
python scripts/release.py --version 1.2.3 --upload

# Or manual: create tag and push (GitHub Actions publishes to PyPI)
git tag v1.2.3 && git push origin v1.2.3

License

MIT License - see LICENSE for details.

Related Projects: TaxonKit (Go), ete3 (Python), taxizedb (R)

Release history

1.2.1 (this version): Feb 6, 2026
1.1.8: Jan 3, 2026
1.1.7: Dec 31, 2025
1.1.6: Sep 2, 2025
1.1.5: Aug 18, 2025
1.1.4: Aug 18, 2025
1.1.3: Aug 18, 2025
0.1.1: Aug 13, 2025

Download files

Source Distribution

taxdumpy-1.2.1.tar.gz (173.9 kB)
Upload date: Feb 6, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2
SHA256: `904da4c2a0d59d4b0691dc02595508e4fde4d1cb6f65b98401fe67e146a58318`
MD5: `c32bfbd76dce8d2cae9ba6f92a3e82d5`
BLAKE2b-256: `2497285a2a74b4fc82cf98cba8181cc1330dbc2970f277de23e7738be3a97781`

Built Distribution

taxdumpy-1.2.1-py3-none-any.whl (30.2 kB)
Upload date: Feb 6, 2026
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2
SHA256: `e446d28a12c6a4cd5e231d2bd60ad8c037255edc97296ddcf7e5acccf5006cea`
MD5: `607e0d7783f2b593225e9c6eaf60b83a`
BLAKE2b-256: `752fa78261f057cda7d48ee906b7f54bd6e3f5c90efda4f064e7ec7e9ede7b67`
