Skip to main content

A Python package for obtaining complete lineages and the lowest common ancestor (LCA) from a set of taxonomic identifiers.

Project description

taxopy

DOI

A Python package for manipulating NCBI-formatted taxonomic databases. Allows you to obtain complete lineages, determine lowest common ancestors (LCAs), get taxa names from their taxids, and more!

Installation

There are two ways to install taxopy:

  • Using pip:
pip install taxopy
  • Using conda:
conda install -c conda-forge -c bioconda taxopy

[!NOTE] To enable fuzzy matching of taxonomic names, you need to install the fuzzy-matching extra. This can be done by adding [fuzzy-matching] to the installation command in pip:

pip install taxopy[fuzzy-matching]

Alternatively, you can install the rapidfuzz library:

# Using pip
pip install rapidfuzz
# Using conda
conda install -c conda-forge rapidfuzz

Usage

import taxopy

First you need to download taxonomic information from NCBI's servers and put this data into a TaxDb object:

taxdb = taxopy.TaxDb()
# You can also use your own set of taxonomy files:
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp")
# If you want to support legacy taxonomic identifiers (that were merged to other identifier), you also need to provide a `merged.dmp` file. This is not necessary if the data is being downloaded from NCBI.
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp", merged_dmp="taxdb/merged.dmp")

The TaxDb object stores the name, rank and parent-child relationships of each taxonomic identifier:

print(taxdb.taxid2name[2])
print(taxdb.taxid2parent[2])
print(taxdb.taxid2rank[2])
Bacteria
131567
superkingdom

If you want to retrieve the new taxonomic identifier of a legacy identifier you can use the oldtaxid2newtaxid attribute:

print(taxdb.oldtaxid2newtaxid[260])
143224

To get information of a given taxon you can create a Taxon object using its taxonomic identifier:

saccharomyces = taxopy.Taxon(4930, taxdb)
human = taxopy.Taxon(9606, taxdb)
gorilla = taxopy.Taxon(9593, taxdb)
lagomorpha = taxopy.Taxon(9975, taxdb)

Each Taxon object stores a variety of information, such as the rank, identifier and name of the input taxon, and the identifiers and names of all the parent taxa:

print(lagomorpha.rank)
print(lagomorpha.name)
print(lagomorpha.name_lineage)
print(lagomorpha.ranked_name_lineage)
print(lagomorpha.rank_name_dictionary)
order
Lagomorpha
['Lagomorpha', 'Glires', 'Euarchontoglires', 'Boreoeutheria', 'Eutheria', 'Theria', 'Mammalia', 'Amniota', 'Tetrapoda', 'Dipnotetrapodomorpha', 'Sarcopterygii', 'Euteleostomi', 'Teleostomi', 'Gnathostomata', 'Vertebrata', 'Craniata', 'Chordata', 'Deuterostomia', 'Bilateria', 'Eumetazoa', 'Metazoa', 'Opisthokonta', 'Eukaryota', 'cellular organisms', 'root']
[('order', 'Lagomorpha'), ('clade', 'Glires'), ('superorder', 'Euarchontoglires'), ('clade', 'Boreoeutheria'), ('clade', 'Eutheria'), ('clade', 'Theria'), ('class', 'Mammalia'), ('clade', 'Amniota'), ('clade', 'Tetrapoda'), ('clade', 'Dipnotetrapodomorpha'), ('superclass', 'Sarcopterygii'), ('clade', 'Euteleostomi'), ('clade', 'Teleostomi'), ('clade', 'Gnathostomata'), ('clade', 'Vertebrata'), ('subphylum', 'Craniata'), ('phylum', 'Chordata'), ('clade', 'Deuterostomia'), ('clade', 'Bilateria'), ('clade', 'Eumetazoa'), ('kingdom', 'Metazoa'), ('clade', 'Opisthokonta'), ('superkingdom', 'Eukaryota'), ('no rank', 'cellular organisms'), ('no rank', 'root')]
{'order': 'Lagomorpha', 'clade': 'Opisthokonta', 'superorder': 'Euarchontoglires', 'class': 'Mammalia', 'superclass': 'Sarcopterygii', 'subphylum': 'Craniata', 'phylum': 'Chordata', 'kingdom': 'Metazoa', 'superkingdom': 'Eukaryota'}

You can use the parent method to get a Taxon object of the parent node of a given taxon:

lagomorpha_parent = lagomorpha.parent(taxdb)
print(lagomorpha_parent.rank)
print(lagomorpha_parent.name)
clade
Glires

LCA and majority vote

You can get the lowest common ancestor of a list of taxa using the find_lca function:

human_lagomorpha_lca = taxopy.find_lca([human, lagomorpha], taxdb)
print(human_lagomorpha_lca.name)
Euarchontoglires

You may also use the find_majority_vote to discover the most specific taxon that is shared by more than half of the lineages of a list of taxa:

majority_vote = taxopy.find_majority_vote([human, gorilla, lagomorpha], taxdb)
print(majority_vote.name)
Homininae

The find_majority_vote function allows you to control its stringency via the fraction parameter. For instance, if you would set fraction to 0.75 the resulting taxon would be shared by more than 75% of the input lineages. By default, fraction is 0.5.

majority_vote = taxopy.find_majority_vote([human, gorilla, lagomorpha], taxdb, fraction=0.75)
print(majority_vote.name)
Euarchontoglires

You can also assign weights to each input lineage:

majority_vote = taxopy.find_majority_vote([saccharomyces, human, gorilla, lagomorpha], taxdb)
weighted_majority_vote = taxopy.find_majority_vote([saccharomyces, human, gorilla, lagomorpha], taxdb, weights=[3, 1, 1, 1])
print(majority_vote.name)
print(weighted_majority_vote.name)
Euarchontoglires
Opisthokonta

To check the level of agreement between the taxa that were aggregated using find_majority_vote and the output taxon, you can check the agreement attribute.

print(majority_vote.agreement)
print(weighted_majority_vote.agreement)
0.75
1.0

Taxid from name

If you only have the name of a taxon, you can get its corresponding taxid using the taxid_from_name function:

taxid = taxopy.taxid_from_name('Homininae', taxdb)
print(taxid)
[207598]

This function returns a list of all taxonomic identifiers associated with the input name. In the case of homonyms, the list will contain multiple taxonomic identifiers:

taxid = taxopy.taxid_from_name('Aotus', taxdb)
print(taxid)
[9504, 114498]

In case a list of names is provided as input, the function will return a list of lists.

taxid = taxopy.taxid_from_name(['Homininae', 'Aotus'], taxdb)
print(taxid)
[[207598], [9504, 114498]]

In certain situations, it is useful to allow slight variations between the input name and the matches in the database. This can be accomplished by setting the fuzzy parameter of taxid_from_name to True. For instance, in GTDB, some taxa have suffixes. The fuzzy parameter enables you to locate the taxonomic identifiers of all taxa with a name similar to the input, so you don't need to know the exact name beforehand.

# The `taxdump_url` parameter of the `TaxDb` class can be used retrieve a custom taxdump from a URL. In this case, we will use a GTDB taxdump provided by Wei Shen (https://github.com/shenwei356/gtdb-taxdump)
gtdb_taxdb = taxopy.TaxDb(taxdump_url="https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.5.0/gtdb-taxdump-R220.tar.gz")
for t in taxopy.taxid_from_name("Myxococcota", gtdb_taxdb, fuzzy=True):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
Myxococcota_A
Myxococcota

You can control the minimum similarity between the input name and the matches in the database by setting the score_cutoff parameter (default is 0.9).

for t in taxopy.taxid_from_name(
    "Myxococcota", gtdb_taxdb, fuzzy=True, score_cutoff=0.7
):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
Myxococcales
Myxococcota_A
Myxococcus
Myxococcia
Myxococcota
Myxococcaceae

Acknowledgements

Some of the code used in taxopy was taken from the CAT/BAT tool for taxonomic classification of contigs and metagenome-assembled genomes.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

taxopy-0.13.0.tar.gz (25.0 kB view details)

Uploaded Source

Built Distribution

taxopy-0.13.0-py3-none-any.whl (24.4 kB view details)

Uploaded Python 3

File details

Details for the file taxopy-0.13.0.tar.gz.

File metadata

  • Download URL: taxopy-0.13.0.tar.gz
  • Upload date:
  • Size: 25.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.9

File hashes

Hashes for taxopy-0.13.0.tar.gz
Algorithm Hash digest
SHA256 0f56b8470864bc8c44cda1edb00dd210a0105e5aec25dfb9f6bb725b0f753f5c
MD5 5be7787d661f1f74bf13ece7832d2151
BLAKE2b-256 f91201d5c7b29d2ba6fd2bcb6cfcb649cc677fd3e586cad3662f212f50a58910

See more details on using hashes here.

File details

Details for the file taxopy-0.13.0-py3-none-any.whl.

File metadata

  • Download URL: taxopy-0.13.0-py3-none-any.whl
  • Upload date:
  • Size: 24.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.9

File hashes

Hashes for taxopy-0.13.0-py3-none-any.whl
Algorithm Hash digest
SHA256 296f2dda519b2c85cbbb893c2a9240e44d6c669f5e30c3f98a8690ac7e41db3a
MD5 8df2cea8c21729a3ec645bafb64a874e
BLAKE2b-256 d3f53618edd3eb2117c04f1865da12dca4d190436f42938779f4d79e362267c3

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page