Skip to main content

Utilities for working with taxonomic data.

Project description

taxutils

Utilities for working with NCBI taxonomic data, accession-to-taxon mappings, taxonomy branches, corrected ranks, and pathogen target taxa.

Setup

taxutils stores downloaded taxonomy files (names.dmp, nodes.dmp), pathogen target metadata, and accession-to-taxon mappings in a global save directory. Set TAXUTILS_GLOBALS before importing the package if you want to control where these files live:

export TAXUTILS_GLOBALS=/path/to/taxutils/saves

If TAXUTILS_GLOBALS is not set, taxutils defaults to ./taxutils/ in the current working directory.

The first run downloads NCBI taxonomy files. Accession lookups also use the NCBI accession-to-taxon mapping, which is large. By default, taxutils uses low-memory mode and scans the compressed mapping directly. For faster repeated lookups, use low_memory=False to build or reuse a local SQLite database:

from taxutils import taxutils

tu = taxutils(low_memory=False)

Core usage

Core functions are listed here. See the example notebook for a fuller walkthrough.

# Build object
tu = taxutils(accessions=None, low_memory=True, targets_json=None)

# Accession parsing and mapping
tu.parse_accession(header_strings, version=True)
tu.load_a2t(accessions, low_memory=None, extend=False)
tu.get_t2a(taxa, low_memory=None)

# Tree queries
tu.get_branch(taxon)
tu.get_subtree(taxon)
tu.get_lca(taxon_a, taxon_b)
tu.sort_taxa(taxa)

# Rank utilities
tu.get_rank_order()
tu.higher_than_rank(taxa, rank)

In taxutils, accessions=list/of/accessions can be passed to call load_a2t on construction of the taxutils object. A custom targets_json can similarly be passed in lieu of the default json explained below. load_a2t overwrites tu.a2t by default; pass extend=True to add missing mappings without discarding existing ones. Method-level low_memory=None uses the mode set when tu was built.

Rank correction

taxutils keeps the raw NCBI rank in rank and adds corrected rank columns. Canonical ranks (R, D, K, P, C, O, F, G, S) are used as anchors only when they move deeper than the corrected parent rank. Noncanonical ranks such as no rank, clade, and other unusual labels inherit position from the tree. If a child would be ranked at the same or a higher level than its parent, it is assigned a subrank such as S2, S3, or F2. The canonical name for the corrected rank is stored in new_rank.

Target taxa

In ZarLab, we are working on metagenomics in the clinical setting, with the goal of creating an "agnostic diagnostic". We often want to look at broad array of taxa (tu.target_taxa) that could cause harm to people. In June 2024, CZI did the work of compiling a list of pathogenic taxa. I did the easy work of turning this into a json and uploading it to my website, so that it is available and easily accessed for all time (in case that link ever breaks). taxutils will extend the taxa list to include subtrees of each of those pathogenic taxa. It will additionally include SARS-CoV2, since it was excluded from CZI's list. If you find any other obvious, missing pathogens, please send me a note, so I can update my json. You can also update the target_taxa member variable yourself, or store an entirely different set of targets, if you wanted.

Contact

Author: Will O'Brien
Affiliation: Computer Science Department, UCLA
Email: wob@cs.ucla.edu

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

taxutils-1.0.0.tar.gz (11.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

taxutils-1.0.0-py3-none-any.whl (10.3 kB view details)

Uploaded Python 3

File details

Details for the file taxutils-1.0.0.tar.gz.

File metadata

  • Download URL: taxutils-1.0.0.tar.gz
  • Upload date:
  • Size: 11.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for taxutils-1.0.0.tar.gz
Algorithm Hash digest
SHA256 77194e7c98c740e4a004fd066e231d54426b41d2622803babc25a72ee6214fae
MD5 a554e1a27c035e0afe6904ead26ed51e
BLAKE2b-256 486b502062d8db3a7dc4a0d12be5b099a00a34b362bdcf16f43df0427d710b42

See more details on using hashes here.

Provenance

The following attestation bundles were made for taxutils-1.0.0.tar.gz:

Publisher: workflow.yml on SwabSeq/taxutils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file taxutils-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: taxutils-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 10.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for taxutils-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 62984bace2c869d6d0e91a67bbe15169c98d786ea68a3236e1ed3f9cde56d48c
MD5 eee7d831157a19f8b0906154b782eec2
BLAKE2b-256 e1bae0e5daf958d395a96f52f24493776d224dd9261bd61c53e6c5873c550b0c

See more details on using hashes here.

Provenance

The following attestation bundles were made for taxutils-1.0.0-py3-none-any.whl:

Publisher: workflow.yml on SwabSeq/taxutils

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page