Python library that provides a common interface for biological taxonomies

MultiTax

Python library that provides a common interface to obtain, parse and interact with biological taxonomies (GTDB, NCBI, Silva, Greengenes, Open Tree taxonomy), as well as "Custom" formatted taxonomies.

Goals

  • Common interface to use different taxonomies
  • Fast, intuitive, generalized and easy to use
  • Enable integration and compatibility with multiple taxonomies without any effort
  • Translation and conversion between taxonomies (not yet implemented)

Installation

pip

pip install multitax

conda

conda install -c bioconda multitax

local

git clone https://github.com/pirovc/multitax.git
cd multitax
python setup.py install --record files.txt

Documentation

https://pirovc.github.io/multitax/

Basic Example with GTDB

from multitax import GtdbTx

# Download taxonomy
tax = GtdbTx()

# Get lineage for the Escherichia genus  
tax.lineage("g__Escherichia")
# ['1', 'd__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacterales', 'f__Enterobacteriaceae', 'g__Escherichia']
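
The lineage above can be read as a walk from the node up to the root through parent links. The following is a minimal sketch of that idea, assuming a plain child-to-parent dictionary; `toy_lineage` and `parents` are hypothetical names used for illustration and are not MultiTax's internal structure or API.

```python
# Hypothetical child -> parent map built from the identifiers in the output above.
parents = {
    "d__Bacteria": "1",
    "p__Proteobacteria": "d__Bacteria",
    "c__Gammaproteobacteria": "p__Proteobacteria",
    "o__Enterobacterales": "c__Gammaproteobacteria",
    "f__Enterobacteriaceae": "o__Enterobacterales",
    "g__Escherichia": "f__Enterobacteriaceae",
}


def toy_lineage(node, parents, root="1"):
    """Walk parent links from node up to root and return the path root-first."""
    if node != root and node not in parents:
        return []  # unknown node: no lineage to report
    path = [node]
    while path[-1] != root:
        path.append(parents[path[-1]])
    return path[::-1]


print(toy_lineage("g__Escherichia", parents))
```

With this toy map, the call reproduces the list shown in the GTDB example, and an unknown identifier yields an empty list.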

Further Examples

List of all functions

from multitax import GtdbTx

# Download and parse in memory
tax = GtdbTx()

# Parse local files
tax = GtdbTx(files=["bac120_taxonomy.tsv.gz", "ar122_taxonomy.tsv.gz"])

# Download, write and parse files
tax = GtdbTx(output_prefix="my/path/") 

# Download and filter only specific branch
tax = GtdbTx(root_node="p__Proteobacteria") 

# List parent node
tax.parent("g__Escherichia")
# f__Enterobacteriaceae

# List children nodes
tax.children("g__Escherichia")
# ['s__Escherichia flexneri', 's__Escherichia coli', 's__Escherichia dysenteriae', 's__Escherichia coli_D', 's__Escherichia albertii', 's__Escherichia marmotae', 's__Escherichia coli_C', 's__Escherichia sp005843885', 's__Escherichia sp000208585', 's__Escherichia fergusonii', 's__Escherichia sp001660175', 's__Escherichia sp004211955', 's__Escherichia sp002965065']

# Get specific rank parent node
tax.parent_rank("s__Lentisphaera araneosa", "phylum")
# 'p__Verrucomicrobiota'

# Get lineage
tax.lineage("g__Escherichia")
# ['1', 'd__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacterales', 'f__Enterobacteriaceae', 'g__Escherichia']

# Get lineage of names
tax.name_lineage("g__Escherichia")
# ['root', 'Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacterales', 'Enterobacteriaceae', 'Escherichia']

# Get lineage of ranks
tax.rank_lineage("g__Escherichia")
# ['root', 'domain', 'phylum', 'class', 'order', 'family', 'genus']

# Get lineage with specific ranks and root
tax.lineage("g__Escherichia", root_node="p__Proteobacteria", ranks=["phylum", "class", "family", "genus"])
# ['p__Proteobacteria', 'c__Gammaproteobacteria', 'f__Enterobacteriaceae', 'g__Escherichia']

# Build lineages in memory for faster access
tax.build_lineages()

# Get leaf nodes
tax.leaves("p__Hadarchaeota")
# ['s__DG-33 sp004375695', 's__DG-33 sp001515185', 's__Hadarchaeum yellowstonense', 's__B75-G9 sp003661465', 's__WYZ-LMO6 sp004347925', 's__B88-G9 sp003660555']

# Show stats of loaded tax
tax.stats()
#{'leaves': 31910,
# 'names': 45503,
# 'nodes': 45503,
# 'ranked_leaves': Counter({'species': 31910}),
# 'ranked_nodes': Counter({'species': 31910,
#                          'genus': 9428,
#                          'family': 2600,
#                          'order': 1034,
#                          'class': 379,
#                          'phylum': 149,
#                          'domain': 2,
#                          'root': 1}),
# 'ranks': 45503}

# Filter ancestors (desc=True for descendants)
tax.filter(['g__Escherichia', 's__Pseudomonas aeruginosa'])
tax.stats()
#{'leaves': 2,
# 'names': 11,
# 'nodes': 11,
# 'ranked_leaves': Counter({'genus': 1, 'species': 1}),
# 'ranked_nodes': Counter({'genus': 2,
#                          'family': 2,
#                          'order': 2,
#                          'class': 1,
#                          'phylum': 1,
#                          'domain': 1,
#                          'species': 1,
#                          'root': 1}),
# 'ranks': 11}

# Write tax to file
tax.write("custom_tax.tsv", cols=["node", "rank", "name_lineage"])

#g__Escherichia             genus    root|Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia
#f__Enterobacteriaceae      family   root|Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae
#o__Enterobacterales        order    root|Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales
#c__Gammaproteobacteria     class    root|Bacteria|Proteobacteria|Gammaproteobacteria
#...
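
A file written this way can be read back with the standard library. The sketch below assumes the layout suggested by the excerpt above: three tab-separated columns (node, rank, name_lineage) with the lineage joined by `|`. The helper `read_tax_rows` is a hypothetical name, and the sample text is copied from the excerpt rather than produced by running MultiTax.

```python
import csv
import io

# Sample rows copied from the excerpt above; tab-separated columns are an assumption.
sample = (
    "g__Escherichia\tgenus\troot|Bacteria|Proteobacteria|Gammaproteobacteria|"
    "Enterobacterales|Enterobacteriaceae|Escherichia\n"
    "f__Enterobacteriaceae\tfamily\troot|Bacteria|Proteobacteria|Gammaproteobacteria|"
    "Enterobacterales|Enterobacteriaceae\n"
)


def read_tax_rows(handle):
    """Map each node to (rank, list of names in its lineage)."""
    rows = {}
    for node, rank, name_lineage in csv.reader(handle, delimiter="\t"):
        rows[node] = (rank, name_lineage.split("|"))
    return rows


rows = read_tax_rows(io.StringIO(sample))
print(rows["g__Escherichia"][0])       # genus
print(rows["g__Escherichia"][1][-1])   # Escherichia
```

For a real file, `open("custom_tax.tsv", newline="")` would take the place of the `io.StringIO` object.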

The same goes for the other taxonomies

# NCBI
from multitax import NcbiTx
tax = NcbiTx()
tax.lineage("561")    
# ['1', '131567', '2', '1224', '1236', '91347', '543', '561']

# Silva
from multitax import SilvaTx
tax = SilvaTx()
tax.lineage("46463")    
# ['1', '3', '2375', '3303', '46449', '46454', '46463']

# Open Tree taxonomy
from multitax import OttTx
tax = OttTx()
tax.lineage("474503")
# ['805080', '93302', '844192', '248067', '822744', '768012', '424023', '474503']

# GreenGenes
from multitax import GreengenesTx
tax = GreengenesTx()
tax.lineage("f__Enterobacteriaceae")
# ['1', 'k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Enterobacteriales', 'f__Enterobacteriaceae']

LCA integration

Using pylca: https://github.com/pirovc/pylca

from pylca.pylca import LCA
from multitax import GtdbTx

# Download and parse GTDB Taxonomy
tax = GtdbTx()

# Build LCA structure
L = LCA(tax._nodes)

# Get LCA
L("s__Escherichia dysenteriae", "s__Pseudomonas aeruginosa")
# 'c__Gammaproteobacteria'
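
For intuition, a lowest common ancestor can also be derived from two root-first lineages by keeping the last node they share. This is a simple sketch of the concept only; it makes no claim about how pylca works internally. `toy_lca` is a hypothetical helper, and the Pseudomonas branch below is written by hand for illustration rather than taken from MultiTax output.

```python
def toy_lca(lineage_a, lineage_b):
    """Return the deepest node shared by two root-first lineages, or None."""
    common = None
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        common = a
    return common


# Shared prefix taken from the lineage example earlier in this README.
shared = ["1", "d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria"]
escherichia = shared + [
    "o__Enterobacterales", "f__Enterobacteriaceae",
    "g__Escherichia", "s__Escherichia dysenteriae",
]
# Illustrative branch (assumption, not MultiTax output).
pseudomonas = shared + [
    "o__Pseudomonadales", "f__Pseudomonadaceae",
    "g__Pseudomonas", "s__Pseudomonas aeruginosa",
]

print(toy_lca(escherichia, pseudomonas))
```

Comparing full lineages costs time proportional to the lineage length per query, which is fine for a handful of lookups; a dedicated LCA structure is the better choice for many queries.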

General information

  • Taxonomies are parsed into nodes. Each node is annotated with a name and a rank.
  • Some taxonomies have a numeric taxonomic identifier (NCBI, Silva) and others use the rank + name as an identifier. In MultiTax all identifiers are treated as strings.
  • Each taxonomy has a single root node, defined by default (1 when the taxonomy does not define one). This can be changed with root_node when loading the taxonomy (as well as the annotations root_parent, root_name, root_rank). If the given root_node already exists in the taxonomy, the tree is filtered to that branch.
  • Standard values for unknown/undefined nodes can be configured with undefined_node, undefined_name and undefined_rank. These are the default values returned when nodes/names/ranks are not found.
  • Taxonomy files are automatically downloaded or can be loaded from disk (files parameter). Alternative urls can be provided. When downloaded, files are handled in memory. It is possible to save the downloaded files to disk with output_prefix.
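
Two of the points above, string identifiers and configurable undefined values, can be illustrated with a toy class. This is a sketch under stated assumptions: `ToyTaxonomy` is hypothetical and is not MultiTax's implementation, the default of `None` for undefined values is chosen for the toy, and the two names are added by hand for illustration. The numeric identifiers come from the NCBI lineage example above.

```python
class ToyTaxonomy:
    """Toy stand-in showing string identifiers and undefined defaults."""

    def __init__(self, parents, names, undefined_node=None, undefined_name=None):
        # Identifiers are normalised to strings when loaded.
        self._parents = {str(child): str(parent) for child, parent in parents.items()}
        self._names = {str(node): name for node, name in names.items()}
        self.undefined_node = undefined_node
        self.undefined_name = undefined_name

    def parent(self, node):
        # Unknown nodes fall back to the configured undefined value.
        return self._parents.get(node, self.undefined_node)

    def name(self, node):
        return self._names.get(node, self.undefined_name)


tax = ToyTaxonomy(
    parents={561: 543, 543: 91347},
    names={561: "Escherichia", 543: "Enterobacteriaceae"},
    undefined_name="unknown",
)
print(tax.parent("561"))  # 543
print(tax.parent("XXX"))  # None
print(tax.name("XXX"))    # unknown
```

Returning a configurable placeholder instead of raising an error keeps bulk lookups simple, at the cost of having to check for the placeholder afterwards.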

Translation between taxonomies

Not yet implemented. The goal is to map between taxonomies where linkage data is available. The following is what I think will be possible:

from/to  NCBI  GTDB  SILVA  OTT   GG
NCBI     -     part  part   part  no
GTDB     full  -     no     no    no
SILVA    full  no    -      part  no
OTT      part  no    part   -     no
GG       no    no    no     no    -
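
The table can also be held as a small lookup structure, which makes the direction of each entry explicit (rows are the source taxonomy, columns the target). This only encodes the expectations listed above; it is not a MultiTax feature, and `expected_coverage` is a hypothetical helper name.

```python
# Column order of the table above (target taxonomies).
TARGETS = ["NCBI", "GTDB", "SILVA", "OTT", "GG"]

# One row per source taxonomy, values copied from the table.
COVERAGE = {
    "NCBI": ["-", "part", "part", "part", "no"],
    "GTDB": ["full", "-", "no", "no", "no"],
    "SILVA": ["full", "no", "-", "part", "no"],
    "OTT": ["part", "no", "part", "-", "no"],
    "GG": ["no", "no", "no", "no", "-"],
}


def expected_coverage(source, target):
    """Return the expected translation coverage from source to target."""
    return COVERAGE[source][TARGETS.index(target)]


print(expected_coverage("GTDB", "NCBI"))  # full
print(expected_coverage("NCBI", "GTDB"))  # part
```

The asymmetry is visible in the two calls: GTDB to NCBI is expected to be complete, while the reverse is only partial.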

Further ideas

  • Advanced name search
  • Add/remove/update nodes
  • Conversion between taxonomies (write on specific files/format)
