Routines for loading, saving, and manipulating taxonomic trees

Taxonomy

This is a Rust library for reading, writing, and editing biological taxonomies. There are associated Python bindings for accessing most of the functionality from Python.

This library was initially developed as a component of One Codex's metagenomic classification pipeline before being refactored out, expanded, and open-sourced. It can be used as is with a number of taxonomic formats, or the Taxonomy trait it provides can be used to add last common ancestor, traversal, and similar methods to a downstream package's own taxonomy implementation.

The library ships with a number of features:

  • Common support for taxonomy handling across Rust and Python
  • Fast, with low(er) memory usage
  • NCBI taxonomy, JSON ("tree" and "node_link_data" formats), Newick, and PhyloXML support
  • Easily extensible (in Rust) to support other formats and operations

Installation

Rust

This library can be added to an existing Cargo.toml file and installed straight from crates.io.

Python

You can install the Python bindings directly from PyPI (binaries are only built for select architectures) with:

pip install taxonomy

Python Usage

The Python taxonomy API can open and manipulate all of the formats from the Rust library. Note that Taxonomy IDs in NCBI format are integers, but they're converted to strings on import. We find working with "string taxonomy IDs" greatly simplifies inter-operation between different taxonomy systems.

Loading a taxonomy

A Taxonomy can be loaded from a variety of sources:

  1. Taxonomy.from_newick(value: str): loads a Taxonomy from a Newick-encoded string.

  2. Taxonomy.from_ncbi(ncbi_folder: str): loads a Taxonomy from a pair of NCBI dump files. The folder needs to contain the individual files in the NCBI taxonomy directory (e.g. nodes.dmp and names.dmp).

  3. Taxonomy.from_json(value: str, /, json_pointer: str): loads a Taxonomy from a JSON-encoded string. The format can be either of the tree or node_link_data types and is detected automatically. If json_pointer is specified, the JSON will be traversed to that sub-object before being parsed as a taxonomy.

  4. Taxonomy.from_phyloxml(value: str): loads a Taxonomy from a PhyloXML-encoded string. This format is experimental.
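To illustrate what "traversed to that sub-object" means for from_json, here is a minimal pure-Python sketch of pointer resolution. It assumes that json_pointer uses RFC 6901 syntax (for example "/results/taxonomy"), which this README does not state. The resolve_pointer helper, the wrapper keys, and the shape of the sub-object are all hypothetical and are not part of the library.

```python
import json


def resolve_pointer(document, pointer):
    """Follow an RFC 6901 JSON Pointer such as "/results/taxonomy"."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # RFC 6901 escapes: "~1" means "/" and "~0" means "~" (decode in this order).
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


# Hypothetical wrapper document; the keys and the sub-object are placeholders.
raw = '{"results": {"taxonomy": {"id": "1", "name": "root", "children": []}}}'
sub_object = resolve_pointer(json.loads(raw), "/results/taxonomy")
print(sub_object["name"])  # root
```

Only the sub-object reached this way would then be parsed as a taxonomy; the rest of the wrapper document is ignored.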

Exporting a taxonomy

The following assumes that the taxonomy has been instantiated as a variable named tax.

  1. tax.to_newick(): exports a Taxonomy as a Newick-encoded byte string.
  2. tax.to_json_tree(): exports a Taxonomy as a JSON-encoded byte string in a tree format.
  3. tax.to_json_node_links(): exports a Taxonomy as a JSON-encoded byte string in a node links format.

Using a taxonomy

The following assumes that the taxonomy has been instantiated as a variable named tax. TaxonomyNode is a class with the following schema:

class TaxonomyNode:
    id: str
    name: str
    parent: Optional[str]
    rank: str

Note that the tax_id parameters passed to the functions described below are strings. For NCBI taxonomies, this means quoting the integer IDs: 562 -> "562". If you loaded a taxonomy from JSON and your file contained additional data, you can access it by indexing the node, for example node["readcount"].
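Because IDs are strings throughout, integer IDs coming from other tools need converting before any lookup. A minimal sketch of that normalization step follows; as_tax_id is a hypothetical helper written for this example and is not part of the library.

```python
def as_tax_id(value):
    """Normalize an NCBI-style identifier (int or str) to a string tax ID."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("tax IDs must be int or str, not bool")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return value.strip()
    raise TypeError(f"unsupported tax ID type: {type(value).__name__}")


print(as_tax_id(562))      # 562 (now a str)
print(as_tax_id(" 562 "))  # 562
```

The result can then be passed wherever a tax_id: str parameter is expected.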

tax.root -> TaxonomyNode

Points to the root of the taxonomy.

tax.parent(tax_id: str, /, at_rank: str) -> Optional[TaxonomyNode]

Returns the immediate parent TaxonomyNode of the node with the given id.

If at_rank is provided, scans all the nodes in the node's lineage and returns the ancestor at that rank.

Examples:

# Immediate parent of node "612"
parent = tax.parent("612")
# Ancestor of node "612" at the species rank
parent = tax.parent("612", at_rank="species")
# `parent` will be `None` if we can't find the parent
parent = tax.parent("unknown")
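The at_rank behavior can be pictured as a walk up the lineage. The sketch below is a pure-Python illustration over toy data, not the library's implementation; the NODES table and parent_at_rank are hypothetical, and it assumes the scan starts at the node's parent rather than at the node itself.

```python
# Toy data: tax ID -> (parent ID, rank). Values are illustrative only.
NODES = {
    "1": (None, "no rank"),
    "561": ("1", "genus"),
    "562": ("561", "species"),
    "83333": ("562", "strain"),
}


def parent_at_rank(tax_id, at_rank=None):
    """Return the ID of the immediate parent, or of the ancestor at `at_rank`."""
    if tax_id not in NODES:
        return None
    current = NODES[tax_id][0]
    if at_rank is None:
        return current
    # Walk towards the root until an ancestor with the requested rank is found.
    while current is not None:
        if NODES[current][1] == at_rank:
            return current
        current = NODES[current][0]
    return None


print(parent_at_rank("83333"))                   # 562
print(parent_at_rank("83333", at_rank="genus"))  # 561
print(parent_at_rank("unknown"))                 # None
```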

tax.parent_with_distance(tax_id: str, /, at_rank: str) -> (Optional[TaxonomyNode], Optional[float])

Same as parent, but also returns the distance, as a (TaxonomyNode, float) tuple.

tax.node(tax_id: str) -> Optional[TaxonomyNode]

Returns the node with that id, or None if not found. You can also use indexing, tax["some_id"], but this raises an exception if the node is not found.

tax.find_by_name(name: str) -> Optional[TaxonomyNode]

Returns the node with that name. Returns None if not found. In NCBI, it only accounts for scientific names and not synonyms.

tax.children(tax_id: str) -> List[TaxonomyNode]

Returns all nodes below the given tax id.

tax.lineage(tax_id: str) -> List[TaxonomyNode]

Returns all nodes above the given tax id, including itself.

tax.parents(tax_id: str) -> List[TaxonomyNode]

Returns all nodes above the given tax id.
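The difference between lineage and parents is only whether the starting node is included. A minimal sketch over a toy parent map (the PARENT table and both functions are stand-ins written for this example, not the library's code, and the node-to-root ordering is an assumption):

```python
# Toy parent map: tax ID -> parent ID (None marks the root).
PARENT = {"1": None, "561": "1", "562": "561"}


def lineage(tax_id):
    """IDs from the node up to the root, starting with the node itself."""
    path = []
    current = tax_id
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def parents(tax_id):
    """Same walk as lineage, without the starting node."""
    return lineage(tax_id)[1:]


print(lineage("562"))  # ['562', '561', '1']
print(parents("562"))  # ['561', '1']
```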

tax.lca(id1: str, id2: str) -> Optional[TaxonomyNode]

Returns the lowest common ancestor of the two given nodes.
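For readers unfamiliar with the term, the lowest common ancestor is the deepest node that appears in both lineages. The following is a pure-Python sketch of one common way to compute it over a toy parent map; it is not claimed to be the algorithm the library uses, and PARENT and lca here are hypothetical stand-ins.

```python
# Toy parent map: tax ID -> parent ID (None marks the root).
PARENT = {"1": None, "2": "1", "3": "2", "4": "2", "5": "1"}


def lca(id1, id2):
    """Return the deepest node found in the lineages of both id1 and id2."""
    # Collect every node on the path from id1 to the root.
    ancestors = set()
    current = id1
    while current is not None:
        ancestors.add(current)
        current = PARENT[current]
    # Walk up from id2; the first node already seen is the lowest common one.
    current = id2
    while current is not None:
        if current in ancestors:
            return current
        current = PARENT[current]
    return None


print(lca("3", "4"))  # 2
print(lca("3", "5"))  # 1
```

Note that in this sketch a node counts as its own ancestor, so lca("3", "2") is "2".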

tax.prune(keep: List[str], remove: List[str]) -> Taxonomy

Return a copy of the taxonomy containing:

  • if keep is provided, only the nodes in keep and their parents
  • if remove is provided, all of the nodes except those in remove and their children
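The two pruning rules can be sketched in pure Python as set operations over a toy parent map. This is an illustration of the rules as described above, not the library's implementation: PARENT and prune are hypothetical, the function returns surviving IDs rather than a Taxonomy, and applying keep before remove when both are given is an assumption.

```python
# Toy parent map: tax ID -> parent ID (None marks the root).
PARENT = {"1": None, "2": "1", "3": "2", "4": "2", "5": "1"}


def prune(keep=None, remove=None):
    """Return the set of node IDs that survive pruning."""
    kept = set(PARENT)
    if keep:
        # Keep the listed nodes plus every ancestor up to the root.
        kept = set()
        for tax_id in keep:
            current = tax_id
            while current is not None and current not in kept:
                kept.add(current)
                current = PARENT[current]
    if remove:
        removed = set(remove)

        def under_removed(tax_id):
            # True if the node or any of its ancestors is in `removed`.
            current = tax_id
            while current is not None:
                if current in removed:
                    return True
                current = PARENT[current]
            return False

        kept = {tax_id for tax_id in kept if not under_removed(tax_id)}
    return kept


print(sorted(prune(keep=["3"])))    # ['1', '2', '3']
print(sorted(prune(remove=["2"])))  # ['1', '5']
```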

tax.remove_node(tax_id: str)

Removes the node from the tree, re-attaching parents as needed. Only that single node is removed.
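One reading of "re-attaching parents as needed" is that the children of the removed node are attached to its former parent. The sketch below illustrates that reading on a toy parent map; it is an assumption about the behavior, and this remove_node is a hypothetical stand-alone function, not the library method.

```python
def remove_node(parent_of, tax_id):
    """Return a new parent map without `tax_id`; its children move up one level."""
    if tax_id not in parent_of:
        raise KeyError(tax_id)
    grandparent = parent_of[tax_id]
    return {
        child: (grandparent if parent == tax_id else parent)
        for child, parent in parent_of.items()
        if child != tax_id
    }


tree = {"1": None, "2": "1", "3": "2", "4": "2"}
print(remove_node(tree, "2"))  # {'1': None, '3': '1', '4': '1'}
```

Nodes "3" and "4" stay in the tree and now hang directly off "1".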

tax.add_node(parent_tax_id: str, new_tax_id: str)

Add a new node to the tree at the parent provided.

tax.edit_node(tax_id: str, /, name: str, rank: str, parent_id: str, parent_dist: float)

Edit properties on a taxonomy node.

Exceptions

Only one exception is raised intentionally by the library: TaxonomyError. If you get a pyo3_runtime.PanicException (or anything with pyo3 in its name), this is a bug in the underlying Rust library; please open an issue.

Development

Rust

There is a test suite runnable with cargo test. To test the Python bindings, you need to use the additional python_test feature: cargo test --features python_test.

Python

To work on the Python library on a Mac OS X/Unix system (requires Python 3):

# you need the nightly version of Rust installed
curl https://sh.rustup.rs -sSf | sh

# finally, install the library in the local virtualenv
maturin develop --cargo-extra-args="--features=python"

# or using pip
pip install .

Building binary wheels and pushing to PyPI

# The Mac build requires switching through a few different python versions
maturin build --cargo-extra-args="--features=python" --release --strip

# The linux build requires switching through different python versions and linux compatibility targets.
# For example, to build for Python 3.10 and manylinux2010 compatibility:
docker run --rm -v $(pwd):/io ghcr.io/pyo3/maturin:main build --features=python --release --strip --interpreter=python3.10 --compatibility=manylinux2010

# Upload the wheels to PyPI:
twine upload target/wheels/*

Other Taxonomy Libraries

There are taxonomic toolkits for other programming languages that offer different features and provided some inspiration for this library:

ETE Toolkit (http://etetoolkit.org/): a Python taxonomy library

Taxize (https://ropensci.github.io/taxize-book/): an R toolkit for working with taxonomic data
