Routines for loading, saving, and manipulating taxonomic trees

Taxonomy

This is a Rust library for reading, writing, and editing biological taxonomies. There are associated Python bindings for accessing most of the functionality from Python.

This library was initially developed as a component of One Codex's metagenomic classification pipeline before being refactored out, expanded, and open-sourced. It can be used as is with a number of taxonomic formats, or the Taxonomy trait it provides can be used to add last common ancestor, traversal, and other methods to a downstream package's taxonomy implementation.

The library ships with a number of features:

  • Common support for taxonomy handling across Rust and Python
  • Fast, with low(er) memory usage
  • NCBI taxonomy, JSON ("tree" and "node_link_data" formats), Newick, and PhyloXML support
  • Easily extensible (in Rust) to support other formats and operations

Installation

Rust

This library can be added to an existing Cargo.toml file and installed straight from crates.io.

Python

You can install the Python bindings directly from PyPI (binaries are only built for select architectures) with:

pip install taxonomy

Python Usage

The Python taxonomy API can open and manipulate all of the formats from the Rust library. Note that Taxonomy IDs in NCBI format are integers, but they're converted to strings on import. We find working with "string taxonomy IDs" greatly simplifies inter-operation between different taxonomy systems.

Loading a taxonomy

A taxonomy can be loaded from a variety of sources.

  1. Taxonomy.from_newick(value: str): loads a Taxonomy from a Newick-encoded string.

  2. Taxonomy.from_ncbi(nodes_path: str, names_path: str): loads a Taxonomy from a pair of NCBI dump files. The paths specified are to the individual files in the NCBI taxonomy directory (e.g. nodes.dmp and names.dmp).

  3. Taxonomy.from_json(value: str, /, path: List[str]): loads a Taxonomy from a JSON-encoded string. The format can either be of the tree or node_link_data types and will be automatically detected. If path is specified, the JSON will be traversed to that sub-object before being parsed as a taxonomy.

  4. Taxonomy.from_phyloxml(value: str): loads a Taxonomy from a PhyloXML-encoded string. This loader is experimental.
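The path behaviour described for from_json can be pictured with the standard library alone. The following is a minimal sketch, not the library's code: descend and the sample payload are hypothetical names used only for illustration, the field names in the payload are illustrative, and the sketch assumes every element of path is an object key.

```python
import json
from typing import Any, List


def descend(document: Any, path: List[str]) -> Any:
    """Follow `path` key by key and return the sub-object found there."""
    current = document
    for key in path:
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"path element {key!r} not found")
        current = current[key]
    return current


# Hypothetical payload with a taxonomy nested under "data" -> "taxonomy".
raw = '{"data": {"taxonomy": {"id": "1", "name": "root", "children": []}}}'
subtree = descend(json.loads(raw), ["data", "taxonomy"])
print(subtree["name"])
```

Only the sub-object returned by a traversal like this would then be parsed as a taxonomy; an empty path leaves the document unchanged.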

Exporting a taxonomy

The following assumes that the taxonomy has been instantiated as a variable named tax.

  1. tax.to_newick(): exports a Taxonomy as a Newick-encoded byte string.

  2. tax.to_json(/, as_node_link_data: bool): exports a Taxonomy as a JSON-encoded byte string. By default, the JSON format is a tree format unless the as_node_link_data parameter is set to True.

Using a taxonomy

The following assumes that the taxonomy has been instantiated as a variable named tax. Note that TaxonomyNode is a class with the following schema:

class TaxonomyNode:
    id: str
    name: str
    parent: Optional[str]
    rank: str

Note that the tax_id parameters passed to the functions described below are strings. For a taxonomy loaded from NCBI, however, they must be quoted integers: 562 -> "562". In that case, passing something that can't be converted to a number will raise an exception, even if the documentation below does not mention it.
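When IDs arrive from mixed sources (integers from one tool, strings from another), it can help to normalize them before calling the API. The helper below is a hypothetical sketch, not part of the library; to_tax_id is an illustrative name, and the check assumes NCBI-style IDs made only of ASCII digits.

```python
from typing import Union


def to_tax_id(value: Union[int, str]) -> str:
    """Return an NCBI-style taxonomy ID as a string of ASCII digits."""
    text = str(value).strip()
    # Reject anything that could not be read back as a non-negative integer.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a numeric taxonomy ID: {value!r}")
    return text


print(to_tax_id(562))
print(to_tax_id("562"))
```

Validating up front turns a bad ID into a plain ValueError in your own code rather than an exception raised later from inside a lookup.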

tax.root -> TaxonomyNode

Points to the root of the taxonomy

tax.parent(tax_id: str, /, at_rank: str) -> Optional[TaxonomyNode]

Returns the immediate parent TaxonomyNode of the node with the given id.

If at_rank is provided, scans all the nodes in the node's lineage and returns the node at that rank.

Examples:

parent = tax.parent("612")
parent = tax.parent("612", at_rank="species")
# `parent` will be `None` if we can't find the parent
parent = tax.parent("unknown")
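The at_rank lookup can be pictured as a walk up the lineage. The following is a minimal sketch on a toy, heavily simplified parent map, not the library's implementation: PARENTS, RANKS, and parent_at_rank are hypothetical names, and the sketch assumes the search starts at the node's parent (whether the library also considers the node itself is not stated above).

```python
from typing import Dict, Optional

# Toy data for illustration only: child ID -> parent ID (the root has none).
PARENTS: Dict[str, Optional[str]] = {
    "1": None,
    "561": "1",
    "562": "561",
    "83333": "562",
}
RANKS: Dict[str, str] = {
    "1": "no rank",
    "561": "genus",
    "562": "species",
    "83333": "strain",
}


def parent_at_rank(tax_id: str, at_rank: Optional[str] = None) -> Optional[str]:
    """Return the immediate parent, or the nearest ancestor with `at_rank`."""
    if tax_id not in PARENTS:
        return None
    current = PARENTS[tax_id]
    if at_rank is None:
        return current
    # Walk up until an ancestor with the requested rank is found.
    while current is not None and RANKS[current] != at_rank:
        current = PARENTS[current]
    return current


print(parent_at_rank("83333"))
print(parent_at_rank("83333", at_rank="genus"))
```

If no ancestor carries the requested rank, the walk runs past the root and the sketch returns None, mirroring the Optional return type in the signature above.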

tax.parent_with_distance(tax_id: str, /, at_rank: str) -> (Optional[TaxonomyNode], Optional[float])

Same as parent, but also returns the distance, as a (TaxonomyNode, float) tuple.

tax.node(tax_id: str) -> Optional[TaxonomyNode]

Returns the node with that id, or None if not found. You can also use indexing, tax["some_id"], but this will raise an exception if the node is not found.

tax.find_by_name(name: str) -> Optional[TaxonomyNode]

Returns the node with that name. Returns None if not found. In NCBI, it only accounts for scientific names and not synonyms.

tax.children(tax_id: str) -> List[TaxonomyNode]

Returns all nodes below the given tax id.

tax.lineage(tax_id: str) -> List[TaxonomyNode]

Returns all nodes above the given tax id, including itself.

tax.parents(tax_id: str) -> List[TaxonomyNode]

Returns all nodes above the given tax id.

tax.lca(id1: str, id2: str) -> Optional[TaxonomyNode]

Returns the lowest common ancestor of the two given nodes.
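As an illustration of what a lowest common ancestor is, here is a minimal sketch over a plain parent map. It is one straightforward way to compute it, not necessarily how the library does; PARENTS, lineage, and lca here are hypothetical stand-ins that work on IDs rather than TaxonomyNode objects.

```python
from typing import Dict, List, Optional

# Toy tree for illustration: child ID -> parent ID (the root has none).
PARENTS: Dict[str, Optional[str]] = {
    "root": None,
    "a": "root",
    "b": "a",
    "c": "a",
    "d": "root",
}


def lineage(tax_id: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the node followed by its ancestors, ending at the root."""
    nodes: List[str] = []
    current: Optional[str] = tax_id
    while current is not None:
        nodes.append(current)
        current = parents[current]
    return nodes


def lca(id1: str, id2: str, parents: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the deepest node that appears in both lineages."""
    if id1 not in parents or id2 not in parents:
        return None
    ancestors = set(lineage(id1, parents))
    # The first shared node met while climbing from id2 is the lowest one.
    for node in lineage(id2, parents):
        if node in ancestors:
            return node
    return None


print(lca("b", "c", PARENTS))
print(lca("b", "d", PARENTS))
```

Because each lineage includes the node itself, the lowest common ancestor of a node and one of its own ancestors is that ancestor.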

tax.prune(keep: List[str], remove: List[str]) -> Taxonomy

Return a copy of the taxonomy containing:

  • only the nodes in keep and their parents if provided
  • all of the nodes except those in remove and their children if provided
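The two pruning modes select different sets of nodes. The sketch below computes which IDs would survive each mode on a toy parent map; it only illustrates the rule described above, is not the library's implementation, and PARENTS, prune_keep, and prune_remove are hypothetical names.

```python
from typing import Dict, Iterable, Optional, Set

# Toy tree for illustration: child ID -> parent ID (the root has none).
PARENTS: Dict[str, Optional[str]] = {
    "root": None,
    "a": "root",
    "b": "a",
    "c": "a",
    "d": "root",
    "e": "d",
}


def prune_keep(parents: Dict[str, Optional[str]], keep: Iterable[str]) -> Set[str]:
    """IDs retained when keeping `keep` and all of their ancestors."""
    kept: Set[str] = set()
    for tax_id in keep:
        current: Optional[str] = tax_id
        # Stop early once we reach a part of the tree already kept.
        while current is not None and current not in kept:
            kept.add(current)
            current = parents[current]
    return kept


def prune_remove(parents: Dict[str, Optional[str]], remove: Iterable[str]) -> Set[str]:
    """IDs retained when dropping `remove` and all of their descendants."""
    doomed = set(remove)
    kept: Set[str] = set()
    for tax_id in parents:
        current: Optional[str] = tax_id
        # Climb until we hit a removed node or run past the root.
        while current is not None and current not in doomed:
            current = parents[current]
        if current is None:
            kept.add(tax_id)
    return kept


print(sorted(prune_keep(PARENTS, ["b"])))
print(sorted(prune_remove(PARENTS, ["a"])))
```

In the toy tree, keeping "b" retains its whole path to the root, while removing "a" drops "a" together with "b" and "c" and leaves the rest untouched.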

tax.remove_node(tax_id: str)

Remove the node from the tree, re-attaching parents as needed: only a single node is removed.
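One way to picture single-node removal is on a parent map. This is a sketch under stated assumptions, not the library's code: it assumes the children of the removed node are re-attached to that node's parent and that the root cannot be removed; PARENTS and this standalone remove_node are hypothetical.

```python
from typing import Dict, Optional

# Toy tree for illustration: child ID -> parent ID (the root has none).
PARENTS: Dict[str, Optional[str]] = {
    "root": None,
    "a": "root",
    "b": "a",
    "c": "a",
    "d": "root",
}


def remove_node(
    parents: Dict[str, Optional[str]], tax_id: str
) -> Dict[str, Optional[str]]:
    """Return a copy of `parents` without `tax_id`, re-parenting its children."""
    if tax_id not in parents:
        raise KeyError(tax_id)
    grandparent = parents[tax_id]
    if grandparent is None:
        # Assumption of this sketch: removing the root is not allowed.
        raise ValueError("cannot remove the root node")
    updated: Dict[str, Optional[str]] = {}
    for node, parent in parents.items():
        if node == tax_id:
            continue
        # Children of the removed node move up to its parent.
        updated[node] = grandparent if parent == tax_id else parent
    return updated


print(remove_node(PARENTS, "a"))
```

Only "a" disappears in this example; "b" and "c" stay in the tree and now hang directly off "root".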

tax.add_node(parent_tax_id: str, new_tax_id: str)

Add a new node to the tree at the parent provided.

tax.edit_node(tax_id: str, /, name: str, rank: str, parent_id: str, parent_dist: float)

Edit properties on a taxonomy node.

Exceptions

Only one exception is raised intentionally by the library: TaxonomyError. If you get a pyo3_runtime.PanicException (or anything with pyo3 in its name), this is a bug in the underlying Rust library; please open an issue.

Development

Rust

There is a test suite runnable with cargo test. To test the Python bindings, you need to use the additional python_test feature: cargo test --features python_test.

Python

To work on the Python library on a macOS/Unix system (requires Python 3):

# you need the nightly version of Rust installed
curl https://sh.rustup.rs -sSf | sh
rustup default nightly

# finally, install the library in the local virtualenv
maturin develop --cargo-extra-args="--features=python"

# or using pip
pip install .

Building binary wheels and pushing to PyPI

# The Mac build requires switching through a few different python versions
maturin build --cargo-extra-args="--features=python" --release --strip

# The linux build is automated through cross-compiling in a docker image
docker run --rm -v $(pwd):/io konstin2/maturin:master build --cargo-extra-args="--features=python" --release --strip
twine upload target/wheels/*

Other Taxonomy Libraries

There are taxonomic toolkits for other programming languages that offer different features and provided some inspiration for this library:

ETE Toolkit (http://etetoolkit.org/): a Python taxonomy library

Taxize (https://ropensci.github.io/taxize-book/): an R toolkit for working with taxonomic data
