Python package to obtain, parse and explore biological taxonomies
Project description
MultiTax
Python package to obtain, parse and explore biological taxonomies
Description
MultiTax is a Python package that provides a common and generalized set of functions to download, parse, filter and explore multiple biological taxonomies (GTDB, NCBI, Silva, Greengenes, Open Tree taxonomy) and custom formatted taxonomies. Main goals are:
- Be fast, intuitive, generalized and easy to use
- Explore different taxonomies with same set of commands
- Enable integration and compatibility with multiple taxonomies
- Translate and convert taxonomies (not yet implemented)
Installation
pip
pip install multitax
conda
conda install -c bioconda multitax
local
git clone https://github.com/pirovc/multitax.git
cd multitax
python setup.py install --record files.txt
Documentation
https://pirovc.github.io/multitax/
Basic Example with GTDB
from multitax import GtdbTx
# Download and parse taxonomy
tax = GtdbTx()
# Get lineage for the Escherichia genus
tax.lineage("g__Escherichia")
# ['1', 'd__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacterales', 'f__Enterobacteriaceae', 'g__Escherichia']
Further Examples
from multitax import GtdbTx
# Download and parse in memory
tax = GtdbTx()
# Parse local files
tax = GtdbTx(files=["bac120_taxonomy.tsv.gz", "ar122_taxonomy.tsv.gz"])
# Download, write and parse files
tax = GtdbTx(output_prefix="my/path/")
# Download and filter only specific branch
tax = GtdbTx(root_node="p__Proteobacteria")
# List parent node
tax.parent("g__Escherichia")
# f__Enterobacteriaceae
# List children nodes
tax.children("g__Escherichia")
# ['s__Escherichia flexneri', 's__Escherichia coli', 's__Escherichia dysenteriae', 's__Escherichia coli_D', 's__Escherichia albertii', 's__Escherichia marmotae', 's__Escherichia coli_C', 's__Escherichia sp005843885', 's__Escherichia sp000208585', 's__Escherichia fergusonii', 's__Escherichia sp001660175', 's__Escherichia sp004211955', 's__Escherichia sp002965065']
# Get specific rank parent node
tax.parent_rank("s__Lentisphaera araneosa", "phylum")
# 'p__Verrucomicrobiota'
# Get lineage
tax.lineage("g__Escherichia")
# ['1', 'd__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacterales', 'f__Enterobacteriaceae', 'g__Escherichia']
# Get lineage of names
tax.name_lineage("g__Escherichia")
# ['root', 'Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacterales', 'Enterobacteriaceae', 'Escherichia']
# Get lineage of ranks
tax.rank_lineage("g__Escherichia")
# ['root', 'domain', 'phylum', 'class', 'order', 'family', 'genus']
# Get lineage with specific ranks and root
tax.lineage("g__Escherichia", root_node="p__Proteobacteria", ranks=["phylum", "class", "family", "genus"])
# ['p__Proteobacteria', 'c__Gammaproteobacteria', 'f__Enterobacteriaceae', 'g__Escherichia']
# Build lineages in memory for faster access
tax.build_lineages()
# Get leaf nodes
tax.leaves("p__Hadarchaeota")
# ['s__DG-33 sp004375695', 's__DG-33 sp001515185', 's__Hadarchaeum yellowstonense', 's__B75-G9 sp003661465', 's__WYZ-LMO6 sp004347925', 's__B88-G9 sp003660555']
# Search names and filter by rank
tax.search_name("Escherichia", exact=False, rank="genus")
# ['g__Escherichia', 'g__Escherichia_C']
# Show stats of loaded tax
tax.stats()
#{'leaves': 31910,
# 'names': 45503,
# 'nodes': 45503,
# 'ranked_leaves': Counter({'species': 31910}),
# 'ranked_nodes': Counter({'species': 31910,
# 'genus': 9428,
# 'family': 2600,
# 'order': 1034,
# 'class': 379,
# 'phylum': 149,
# 'domain': 2,
# 'root': 1}),
# 'ranks': 45503}
# Filter ancestors (desc=True for descendants)
tax.filter(['g__Escherichia', 's__Pseudomonas aeruginosa'])
tax.stats()
#{'leaves': 2,
# 'names': 11,
# 'nodes': 11,
# 'ranked_leaves': Counter({'genus': 1, 'species': 1}),
# 'ranked_nodes': Counter({'genus': 2,
# 'family': 2,
# 'order': 2,
# 'class': 1,
# 'phylum': 1,
# 'domain': 1,
# 'species': 1,
# 'root': 1}),
# 'ranks': 11}
# Write tax to file
tax.write("custom_tax.tsv", cols=["node", "rank", "name_lineage"])
#g__Escherichia genus root|Bacteria|Proteobacteria|Gammaproteobacteria|Ent#erobacterales|Enterobacteriaceae|Escherichia
#f__Enterobacteriaceae family root|Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae
#o__Enterobacterales order root|Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales
#c__Gammaproteobacteria class root|Bacteria|Proteobacteria|Gammaproteobacteria
#...
The same goes for the other taxonomies
# NCBI
from multitax import NcbiTx
tax = NcbiTx()
tax.lineage("561")
# ['1', '131567', '2', '1224', '1236', '91347', '543', '561']
# Silva
from multitax import SilvaTx
tax = SilvaTx()
tax.lineage("46463")
# ['1', '3', '2375', '3303', '46449', '46454', '46463']
# Open Tree taxonomy
from multitax import OttTx
tax = OttTx()
tax.lineage("474503")
# ['805080', '93302', '844192', '248067', '822744', '768012', '424023', '474503']
# GreenGenes
from multitax import GreengenesTx
tax = GreengenesTx()
tax.lineage("f__Enterobacteriaceae")
# ['1', 'k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacteriales', 'f__Enterobacteriaceae']
LCA integration
Using pylca: https://github.com/pirovc/pylca
from pylca.pylca import LCA
from multitax import GtdbTx
# Download and parse GTDB Taxonomy
tax = GtdbTx()
# Build LCA structure
L = LCA(tax._nodes)
# Get LCA
L("s__Escherichia dysenteriae", "s__Pseudomonas aeruginosa")
# 'c__Gammaproteobacteria'
General information
- Taxonomies are parsed into nodes. Each node is annotated with a name and a rank.
- Some taxonomies have a numeric taxonomic identifier (NCBI, Silva) and other use the rank + name as an identifier. In MultiTax all identifiers are treated as strings.
- A single root node is defined by default for each taxonomy (or
1
when not defined). This can be changed withroot_node
when loading the taxonomy (as well as annotationsroot_parent
,root_name
,root_rank
). If theroot_node
already exists, the tree will be filtered. - Standard values for unknown/undefined nodes can be configured with
undefined_node
,undefined_name
andundefined_rank
. Those are default values returned when nodes/names/ranks are not found. - Taxonomy files are automatically download or can be loaded from disk (
files
parameter). Alternativeurls
can be provided. When downloaded, files are handled in memory. It is possible to save the downloaded file to disk withoutput_prefix
.
Translation between taxonomies
Not yet implemented. The goal here is to map different taxonomies if the linkage data is available. That's what I think will be possible.
from/to | NCBI | GTDB | SILVA | OTT | GG |
---|---|---|---|---|---|
NCBI | - | part | part | part | no |
GTDB | full | - | no | no | no |
SILVA | full | no | - | part | no |
OTT | part | no | part | - | no |
GG | no | no | no | no | - |
Further ideas
- Add/remove/update nodes
- Conversion between taxonomies (write on specific files/format)
Similar projects
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
multitax-1.1.0.tar.gz
(16.9 kB
view hashes)
Built Distribution
multitax-1.1.0-py3-none-any.whl
(18.4 kB
view hashes)