ncbi-tree

PyPI version Python 3.8+ License: CC BY-NC 4.0

ncbi-tree is an open source, cross-platform command-line tool for downloading the latest NCBI taxonomy database and converting it to Newick tree format (.tre), with optional plain-text visualization (.txt).

Quick Start

pip install ncbi-tree
ncbi-tree ./output

That's it! The tool will download the latest NCBI taxonomy, generate phylogenetic trees, and create detailed reports.

Features

  • Automatic Download: Fetches the latest taxonomy data from NCBI FTP servers
  • Version Tracking: Automatically detects and records the exact server version
  • Smart Caching: Skips re-download and re-extraction when files already exist
  • Merged Taxa Support: Handles merged taxonomy IDs from merged.dmp
  • Name Sanitization: By default, the initial letter of each word is capitalized and spaces are replaced with -. Use the --no-sanitize option to keep the original names
  • Non-Interactive Mode: Avoids blocking on user input in automated environments. Use ncbi-tree ./output --no-prompt-1 to generate all files (core and optional), or ncbi-tree ./output --no-prompt-0 to generate only the core files and skip the optional ones

Installation

pip install ncbi-tree

Usage

Basic Usage

# Download and build taxonomy tree with default settings
ncbi-tree ./output

# Clean up intermediate files after processing
ncbi-tree ./output --no-cache

# Disable name sanitization (keep original spaces)
ncbi-tree ./output --no-sanitize

# Use custom download URL
ncbi-tree ./output --url https://custom-mirror.org/taxdump.tar.gz

# Combined options
ncbi-tree ./output --no-cache --no-sanitize

Non-Interactive Usage (Server-Side)

# Automatically generate all files (core + optional)
ncbi-tree ./output --no-prompt-1

# Generate only core files, skip optional files
ncbi-tree ./output --no-prompt-0
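
When the tool is driven from a scheduled job or pipeline script, the same two flags can be passed from Python. The following is a minimal sketch that only uses the command-line flags shown above; the helper names build_command and run_ncbi_tree are hypothetical and are not part of the ncbi-tree package.

```python
import subprocess
from typing import List


def build_command(output_dir: str, optional_files: bool) -> List[str]:
    """Build the ncbi-tree command line for a non-interactive run."""
    # --no-prompt-1 generates core + optional files, --no-prompt-0 core files only
    flag = "--no-prompt-1" if optional_files else "--no-prompt-0"
    return ["ncbi-tree", output_dir, flag]


def run_ncbi_tree(output_dir: str, optional_files: bool = False) -> None:
    """Run ncbi-tree without prompting; raises CalledProcessError on failure."""
    subprocess.run(build_command(output_dir, optional_files), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ncbi_tree("./output") to execute it.
    print(" ".join(build_command("./output", optional_files=True)))
```

Passing check=True makes a failed run raise an exception, so a pipeline step stops instead of continuing with missing output files.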

Help

ncbi-tree --help
ncbi-tree --version

Output Files

Core Files (Generated Automatically)

  1. output.NCBI.tree.tre - Newick tree with NCBI taxonomy IDs only
  2. output.NCBI.report.txt - Exploratory taxonomy analysis and statistics
  3. version.txt - Server timestamp identifying the version of the downloaded taxdump.tar.gz

Optional Files (User Prompted)

After core files are generated, you will be prompted:

Would you like to generate optional files (output.NCBI.tree.txt, output.NCBI.named.tree.tre, output.NCBI.ID.to.name.tsv)? [y/N]:

If you answer y, additional files will be generated without re-reading data:

  1. output.NCBI.tree.txt - Plain-text tree with Unicode box-drawing
  2. output.NCBI.named.tree.tre - Newick tree with rank:id:name labels
  3. output.NCBI.ID.to.name.tsv - TSV mapping of IDs to names (TaxID, Name, Rank)
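
The ID-to-name mapping can be loaded into a dictionary for lookups. This is a minimal sketch that assumes three tab-separated columns in the order TaxID, Name, Rank; whether the file starts with a header row is not stated here, so the sketch simply skips any row whose first column is not numeric. The function name load_id_to_name is hypothetical.

```python
import io
from typing import Dict, Iterable, Tuple


def load_id_to_name(lines: Iterable[str]) -> Dict[int, Tuple[str, str]]:
    """Map each TaxID to a (name, rank) pair from tab-separated lines."""
    mapping: Dict[int, Tuple[str, str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and a possible header row (assumption, see above)
        if len(fields) < 3 or not fields[0].strip().isdigit():
            continue
        mapping[int(fields[0])] = (fields[1], fields[2])
    return mapping


# Toy input standing in for output.NCBI.ID.to.name.tsv
sample = io.StringIO("TaxID\tName\tRank\n9606\tHuman;Homo-Sapiens\tspecies\n")
id_to_name = load_id_to_name(sample)
print(id_to_name[9606])
```

For the real file, pass an open file handle instead of the StringIO object, for example open("output/output.NCBI.ID.to.name.tsv", encoding="utf-8"), adjusting the path to your output directory.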

Name Sanitization

By default, taxon names are sanitized for consistent display:

  • Spaces replaced with -
  • Existing hyphens are temporarily escaped as <-> and later restored to -. To change this behavior, edit the line name = name.replace('-', '<->') in sanitize_name in core.py
  • Title case applied
  • Special characters removed

Default (sanitized):

"Human;Homo-Sapiens"
"Norway-Rat;Rattus-Norvegicus"

With --no-sanitize flag:

"human; Homo sapiens"
"Norway rat; Rattus norvegicus"
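
The rules above can be illustrated with a short sketch. This is not the sanitize_name function from core.py; it is a hypothetical sanitize_name_sketch that applies the listed steps in one plausible order, and it assumes that "special characters" means anything other than letters, digits, underscores, whitespace, and hyphens.

```python
import re


def sanitize_name_sketch(name: str) -> str:
    """Apply the documented sanitization steps to a single taxon name."""
    # Remove special characters (assumed: all but word chars, whitespace, hyphen)
    name = re.sub(r"[^\w\s-]", "", name)
    # Protect hyphens that were already in the name
    name = name.replace("-", "<->")
    # Capitalize the initial letter of each word, leaving the rest unchanged
    words = [word[:1].upper() + word[1:] for word in name.split()]
    # Replace spaces with hyphens
    name = "-".join(words)
    # Restore the protected hyphens
    return name.replace("<->", "-")


print(sanitize_name_sketch("Norway rat"))             # Norway-Rat
print(sanitize_name_sketch("Homo sapiens"))           # Homo-Sapiens
print(sanitize_name_sketch("Escherichia coli K-12"))  # Escherichia-Coli-K-12
```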

Advanced Configuration

Custom Name Display

To customize which name types are displayed, edit NAME_PRIORITIES in ncbi_tree/core.py:

# Default: both common and scientific names
NAME_PRIORITIES = {"genbank common name": 0, "scientific name": 1}
# Result: "Human; Homo sapiens"

# Scientific name only (disable common name)
NAME_PRIORITIES = {"genbank common name": -1, "scientific name": 0}
# Result: "Homo sapiens"

# Common name only (disable scientific name)
NAME_PRIORITIES = {"genbank common name": 0, "scientific name": -1}
# Result: "Human"

Note: Priority value -1 disables that name type, >= 0 enables it (lower number = higher priority).
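
To make the priority rule concrete, here is a minimal sketch of how such a mapping could select and order names. The function build_display_name is hypothetical and only mirrors the rule stated in the note; the actual logic in ncbi_tree/core.py may differ.

```python
from typing import Dict


def build_display_name(names: Dict[str, str], priorities: Dict[str, int]) -> str:
    """Join the enabled name types in priority order (lower number first)."""
    enabled = [
        (priority, name_type)
        for name_type, priority in priorities.items()
        if priority >= 0 and name_type in names  # -1 disables a name type
    ]
    enabled.sort()
    return "; ".join(names[name_type] for _, name_type in enabled)


human = {"genbank common name": "human", "scientific name": "Homo sapiens"}

print(build_display_name(human, {"genbank common name": 0, "scientific name": 1}))
# human; Homo sapiens
print(build_display_name(human, {"genbank common name": -1, "scientific name": 0}))
# Homo sapiens
```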

Requirements

  • Python 3.8 or higher
  • requests >= 2.25.0
  • tqdm >= 4.50.0

Technical Details

Data Source

  • Primary: NCBI Taxonomy Database (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/)
  • Updates: Automatic detection of latest version with timestamp tracking
  • Size: ~70-100 MB compressed, ~2.7M+ taxonomy entries at the time of writing (October 2025)
  • Format: NCBI taxdump format (nodes.dmp, names.dmp, merged.dmp)
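
For readers unfamiliar with the taxdump layout: fields in the .dmp files are separated by a tab, a pipe, and a tab, and each line ends with a tab and a pipe. In nodes.dmp the first three fields are the taxon ID, the parent taxon ID, and the rank, and the root node lists itself as its own parent. The sketch below parses a few toy lines in that layout and renders an ID-only Newick string. It is an illustration under these assumptions, not the implementation used by ncbi-tree; parse_nodes and to_newick are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_nodes(lines: Iterable[str]) -> Dict[int, int]:
    """Return a child -> parent mapping from nodes.dmp-style lines."""
    parents: Dict[int, int] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.endswith("\t|"):
            line = line[:-2]  # drop the line terminator
        fields = line.split("\t|\t")
        parents[int(fields[0])] = int(fields[1])  # fields[2] holds the rank
    return parents


def to_newick(parents: Dict[int, int], root: int = 1) -> str:
    """Render the tree as Newick, iteratively to avoid deep recursion."""
    children = defaultdict(list)
    for tax_id, parent_id in parents.items():
        if tax_id != parent_id:  # the root is its own parent
            children[parent_id].append(tax_id)
    rendered: Dict[int, str] = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = sorted(children.get(node, []))
        if not kids:
            rendered[node] = str(node)
        elif expanded:
            inner = ",".join(rendered.pop(kid) for kid in kids)
            rendered[node] = "(" + inner + ")" + str(node)
        else:
            stack.append((node, True))  # revisit after the children
            stack.extend((kid, False) for kid in kids)
    return rendered[root] + ";"


# Toy data in nodes.dmp layout (IDs are illustrative, not real taxa)
SAMPLE = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tsuperkingdom\t|\n",
    "3\t|\t1\t|\tsuperkingdom\t|\n",
    "4\t|\t2\t|\tphylum\t|\n",
]
print(to_newick(parse_nodes(SAMPLE)))  # ((4)2,3)1;
```

An explicit stack is used instead of recursion because a tree with millions of nodes can contain lineages deep enough to make recursive traversal fragile in Python.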

Output Formats

  1. Newick (.tre): Standard phylogenetic tree format supported by most major tree viewers
  2. Text Tree (.txt): Unicode-based visualization for terminal/text viewing
  3. TSV Mapping (.tsv): Tabular format for database integration and lookups
  4. Report (.txt): Statistical analysis with rank distribution and depth metrics

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

Acknowledgments

  • Schoch CL, et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020: baaa062.
  • Sayers EW, et al. GenBank. Nucleic Acids Res. 2019. 47(D1):D94-D99.
