Utilities for working with taxonomic data.
taxutils
Utilities for working with NCBI taxonomic data, accession-to-taxon mappings, taxonomy branches, corrected ranks, and pathogen target taxa.
Setup
taxutils stores downloaded taxonomy files (names.dmp, nodes.dmp), pathogen target metadata, and accession-to-taxon mappings in a global save directory. Set TAXUTILS_GLOBALS before importing the package if you want to control where these files live:
export TAXUTILS_GLOBALS=/path/to/taxutils/saves
If TAXUTILS_GLOBALS is not set, taxutils defaults to ./taxutils/ in the current working directory.
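The variable can also be set from Python, provided this happens before the import. The snippet below is a minimal sketch of the rule described above; resolve_save_dir is a hypothetical helper written for illustration and is not part of taxutils.

```python
import os


def resolve_save_dir(env=os.environ):
    """Hypothetical helper mirroring the documented rule:
    use TAXUTILS_GLOBALS if it is set, otherwise fall back to ./taxutils/."""
    return env.get("TAXUTILS_GLOBALS", "./taxutils/")


# Set the variable before importing taxutils so the package can pick it up.
os.environ["TAXUTILS_GLOBALS"] = "/path/to/taxutils/saves"
print(resolve_save_dir())
```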
The first run downloads NCBI taxonomy files. Accession lookups also use the NCBI accession-to-taxon mapping, which is large. By default, taxutils uses low-memory mode and scans the compressed mapping directly. For faster repeated lookups, use low_memory=False to build or reuse a local SQLite database:
from taxutils import taxutils
tu = taxutils(low_memory=False)
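To make the trade-off concrete, here is a rough sketch of the two lookup strategies using only the standard library. It is not taxutils's implementation: the function names, the table name a2t, and the toy accessions are all assumptions. Only the four-column, tab-separated layout follows the NCBI accession2taxid format.

```python
import gzip
import io
import sqlite3

# Toy stand-in for a compressed accession2taxid file (made-up accessions).
ROWS = (
    "accession\taccession.version\ttaxid\tgi\n"
    "ACC0001\tACC0001.1\t562\t0\n"
    "ACC0002\tACC0002.2\t9606\t0\n"
)
BLOB = gzip.compress(ROWS.encode())


def scan_lookup(blob, wanted):
    """Low-memory style: stream the compressed file, keep only wanted rows."""
    wanted = set(wanted)
    found = {}
    with gzip.open(io.BytesIO(blob), "rt") as handle:
        next(handle)  # skip the header line
        for line in handle:
            _, acc_ver, taxid, _ = line.rstrip("\n").split("\t")
            if acc_ver in wanted:
                found[acc_ver] = int(taxid)
    return found


def build_db(blob):
    """One-time cost: load every row into an indexed SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a2t (accession TEXT PRIMARY KEY, taxid INTEGER)")
    with gzip.open(io.BytesIO(blob), "rt") as handle:
        next(handle)  # skip the header line
        rows = (line.rstrip("\n").split("\t") for line in handle)
        conn.executemany(
            "INSERT INTO a2t VALUES (?, ?)", ((r[1], int(r[2])) for r in rows)
        )
    conn.commit()
    return conn


def db_lookup(conn, wanted):
    """Repeated lookups hit the primary-key index instead of rescanning."""
    wanted = list(wanted)
    if not wanted:
        return {}
    marks = ",".join("?" * len(wanted))
    query = f"SELECT accession, taxid FROM a2t WHERE accession IN ({marks})"
    return dict(conn.execute(query, wanted).fetchall())
```

The scan reads the whole file on every call but holds only the matches in memory. The database pays for a full load once and then answers each query from the index, which is why it suits repeated lookups.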
Core usage
Core functions are listed here. See the example notebook for a fuller walkthrough.
# Build object
tu = taxutils(accessions=None, low_memory=True, targets_json=None)
# Accession parsing and mapping
tu.parse_accession(header_strings, version=True)
tu.load_a2t(accessions, low_memory=None, extend=False)
tu.get_t2a(taxa, low_memory=None)
# Tree queries
tu.get_branch(taxon)
tu.get_subtree(taxon)
tu.get_lca(taxon_a, taxon_b)
tu.sort_taxa(taxa)
# Rank utilities
tu.get_rank_order()
tu.higher_than_rank(taxa, rank)
When constructing the taxutils object, pass accessions (a list of accessions) to call load_a2t immediately. A custom targets_json can likewise be passed in place of the default JSON described under Target taxa. load_a2t overwrites tu.a2t by default; pass extend=True to add missing mappings without discarding existing ones. At the method level, low_memory=None uses the mode set when tu was built.
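For readers new to the tree queries, the sketch below shows what a branch, a subtree, and a lowest common ancestor mean on a toy child-to-parent map. It illustrates the concepts only; the data structure, function names, and return shapes are assumptions and may differ from what tu.get_branch, tu.get_subtree, and tu.get_lca return.

```python
# Toy child -> parent map; names stand in for taxon IDs.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Enterobacteriaceae": "Bacteria",
    "Escherichia": "Enterobacteriaceae",
    "Escherichia coli": "Escherichia",
    "Salmonella": "Enterobacteriaceae",
}


def branch(taxon):
    """Path from the root down to the given taxon."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path[::-1]


def lca(a, b):
    """Deepest taxon shared by both root-to-taxon paths."""
    last = None
    for x, y in zip(branch(a), branch(b)):
        if x != y:
            break
        last = x
    return last


def subtree(taxon):
    """The taxon itself plus every descendant."""
    return [t for t in PARENT if taxon in branch(t)]


print(branch("Escherichia"))
print(lca("Escherichia coli", "Salmonella"))
print(subtree("Escherichia"))
```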
Rank correction
taxutils keeps the raw NCBI rank in rank and adds corrected rank columns. Canonical ranks (R, D, K, P, C, O, F, G, S) are used as anchors only when they are deeper than the parent's corrected rank. Noncanonical ranks, such as no rank, clade, and other unusual labels, inherit their position from the tree. If a child would be ranked at the same level as its parent or higher, it is assigned a subrank such as S2, S3, or F2. The canonical name for the corrected rank is stored in new_rank.
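One way to read these rules is sketched below for a single root-to-leaf path. This is an interpretation and not taxutils's actual code: the function names, the (base, subrank) tuple, and the use of single-letter codes as raw ranks are all assumptions made for illustration.

```python
CANONICAL = ["R", "D", "K", "P", "C", "O", "F", "G", "S"]
DEPTH = {rank: i for i, rank in enumerate(CANONICAL)}


def correct_rank(raw_rank, parent_corrected):
    """Return a (base, subrank) pair for a node.

    parent_corrected is the parent's pair, or None for the root.
    """
    if parent_corrected is None:
        return ("R", 1)
    base, sub = parent_corrected
    # A canonical rank anchors only if it is strictly deeper than the parent.
    if raw_rank in DEPTH and DEPTH[raw_rank] > DEPTH[base]:
        return (raw_rank, 1)
    # Otherwise inherit the parent's position as the next subrank.
    return (base, sub + 1)


def label(corrected):
    """Format ("S", 1) as "S" and ("S", 3) as "S3"."""
    base, sub = corrected
    return base if sub == 1 else f"{base}{sub}"


def correct_path(raw_ranks):
    """Apply the correction along one root-to-leaf path of raw ranks."""
    labels, parent = [], None
    for raw in raw_ranks:
        parent = correct_rank(raw, parent)
        labels.append(label(parent))
    return labels


path = ["no rank", "D", "clade", "P", "F", "no rank", "G", "S", "no rank", "S"]
print(correct_path(path))
# ['R', 'D', 'D2', 'P', 'F', 'F2', 'G', 'S', 'S2', 'S3']
```

In this reading, the final S cannot anchor because its parent is already at S2, so it continues the subrank sequence as S3.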
Target taxa
In ZarLab, we are working on metagenomics in the clinical setting, with the goal of creating an "agnostic diagnostic". We often want to look at a broad array of taxa (tu.target_taxa) that could cause harm to people. In June 2024, CZI did the work of compiling a list of pathogenic taxa. I did the easy work of turning this list into a JSON file and uploading it to my website, so that it stays available and easy to access even if the original link ever breaks. taxutils extends the taxa list to include the subtree of each of those pathogenic taxa. It additionally includes SARS-CoV-2, since that was excluded from CZI's list. If you find any other obvious missing pathogens, please send me a note so I can update my JSON. You can also update the target_taxa member variable yourself, or store an entirely different set of targets.
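The subtree extension can be pictured as a breadth-first walk from each listed taxon. The sketch below uses a toy parent-to-children map with placeholder names; expand_targets and its extra argument are hypothetical and do not reflect how taxutils stores or builds target_taxa.

```python
from collections import deque

# Toy parent -> children map with placeholder taxon names.
CHILDREN = {
    "genus_A": ["species_A1", "species_A2"],
    "species_A1": ["strain_A1a"],
    "genus_B": ["species_B1"],
}


def expand_targets(seeds, extra=()):
    """Return the seeds, all of their descendants, and any extra taxa."""
    targets = set()
    queue = deque(seeds)
    while queue:
        taxon = queue.popleft()
        if taxon in targets:
            continue  # already visited via another seed
        targets.add(taxon)
        queue.extend(CHILDREN.get(taxon, []))
    # Extras are added as-is, like a taxon missing from the source list.
    return targets | set(extra)


print(sorted(expand_targets(["genus_A"], extra=["virus_X"])))
```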
Contact
Author: Will O'Brien
Affiliation: Computer Science Department, UCLA
Email: wob@cs.ucla.edu