A Python package for obtaining complete lineages and the lowest common ancestor (LCA) from a set of taxonomic identifiers.
taxopy
taxopy is a Python package that provides an interface for accessing NCBI-formatted taxonomic databases. It supports various operations on taxonomic data, such as obtaining complete lineages, determining lowest common ancestors (LCAs), and retrieving taxon names from taxonomic identifiers.
Installation
There are two ways to install taxopy:
- Using pip:
pip install taxopy
- Using conda:
conda install -c conda-forge -c bioconda taxopy
[!NOTE]
taxopy supports fuzzy string matching to search for taxa with names that are similar but not identical to the queries. This feature is not enabled by default to avoid additional dependencies. However, you can enable it by installing the fuzzy-matching extra using pip:
pip install taxopy[fuzzy-matching]
Alternatively, you can install the rapidfuzz library alongside taxopy:
# Using pip
pip install taxopy rapidfuzz
# Using conda
conda install -c conda-forge -c bioconda taxopy rapidfuzz
Usage
For a detailed guide on how to use taxopy, please refer to the documentation.
import taxopy
First you need to download taxonomic information from NCBI's servers and put this data into a TaxDb object:
taxdb = taxopy.TaxDb()
# You can also use your own set of taxonomy files:
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp")
# If you want to support legacy taxonomic identifiers (those that were merged into other identifiers), you also need to provide a `merged.dmp` file. This is not necessary if the data is downloaded from NCBI.
taxdb = taxopy.TaxDb(nodes_dmp="taxdb/nodes.dmp", names_dmp="taxdb/names.dmp", merged_dmp="taxdb/merged.dmp")
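If you are preparing your own files, it may help to see what the NCBI taxdump layout looks like. The following is a minimal sketch, not taxopy's own parser: it reads a few illustrative rows in which fields are separated by a tab, a pipe and a tab, and each line ends with a tab and a pipe. The toy `nodes.dmp` rows keep only the first three fields (identifier, parent identifier, rank), and the helper name `split_dmp_line` is hypothetical.

```python
# Illustrative rows in the NCBI taxdump layout (real files contain more columns and rows).
NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "131567\t|\t1\t|\tno rank\t|\n"
    "2\t|\t131567\t|\tsuperkingdom\t|\n"
)
NAMES_DMP = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "131567\t|\tcellular organisms\t|\t\t|\tscientific name\t|\n"
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n"
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n"
)


def split_dmp_line(line):
    """Split one taxdump line into its fields (hypothetical helper)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line terminator
    return line.split("\t|\t")


taxid2parent, taxid2rank, taxid2name = {}, {}, {}

for line in NODES_DMP.splitlines():
    taxid, parent, rank = split_dmp_line(line)[:3]
    taxid2parent[int(taxid)] = int(parent)
    taxid2rank[int(taxid)] = rank

for line in NAMES_DMP.splitlines():
    taxid, name, _unique_name, name_class = split_dmp_line(line)[:4]
    # Design choice of this sketch: keep only the scientific name of each taxon.
    if name_class == "scientific name":
        taxid2name[int(taxid)] = name

print(taxid2name[2], taxid2parent[2], taxid2rank[2])
```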
The TaxDb object stores the name, rank and parent-child relationships of each taxonomic identifier:
print(taxdb.taxid2name[2])
print(taxdb.taxid2parent[2])
print(taxdb.taxid2rank[2])
Bacteria
131567
superkingdom
To retrieve the current taxonomic identifier that replaced a legacy identifier, use the oldtaxid2newtaxid attribute:
print(taxdb.oldtaxid2newtaxid[260])
143224
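To make the role of these mappings concrete, here is a minimal sketch that uses plain dictionaries shaped like the attributes above. It is not taxopy's implementation: the dictionaries hold only a handful of entries, the root is assumed to be its own parent (as in NCBI's `nodes.dmp`), and the `resolve` and `lineage` helpers are hypothetical names.

```python
# Toy mappings shaped like the TaxDb attributes shown above.
taxid2parent = {1: 1, 131567: 1, 2: 131567}
taxid2name = {1: "root", 131567: "cellular organisms", 2: "Bacteria"}
oldtaxid2newtaxid = {260: 143224}


def resolve(taxid):
    """Map a legacy identifier to its replacement; leave current ones unchanged."""
    return oldtaxid2newtaxid.get(taxid, taxid)


def lineage(taxid):
    """Follow parent links from a taxon up to the root (assumed to be its own parent)."""
    taxid = resolve(taxid)
    path = [taxid]
    while taxid2parent[taxid] != taxid:
        taxid = taxid2parent[taxid]
        path.append(taxid)
    return path


print(resolve(260))
print([taxid2name[t] for t in lineage(2)])
```

Resolving legacy identifiers before the walk is what keeps a lookup from failing on an identifier that no longer appears in the parent mapping.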
To get information about a given taxon, create a Taxon object from its taxonomic identifier:
saccharomyces = taxopy.Taxon(4930, taxdb)
human = taxopy.Taxon(9606, taxdb)
gorilla = taxopy.Taxon(9593, taxdb)
lagomorpha = taxopy.Taxon(9975, taxdb)
Each Taxon object stores a variety of information, such as the rank, identifier and name of the input taxon, and the identifiers and names of all the parent taxa:
print(lagomorpha.rank)
print(lagomorpha.name)
print(lagomorpha.name_lineage)
print(lagomorpha.ranked_name_lineage)
print(lagomorpha.rank_name_dictionary)
order
Lagomorpha
['Lagomorpha', 'Glires', 'Euarchontoglires', 'Boreoeutheria', 'Eutheria', 'Theria', 'Mammalia', 'Amniota', 'Tetrapoda', 'Dipnotetrapodomorpha', 'Sarcopterygii', 'Euteleostomi', 'Teleostomi', 'Gnathostomata', 'Vertebrata', 'Craniata', 'Chordata', 'Deuterostomia', 'Bilateria', 'Eumetazoa', 'Metazoa', 'Opisthokonta', 'Eukaryota', 'cellular organisms', 'root']
[('order', 'Lagomorpha'), ('clade', 'Glires'), ('superorder', 'Euarchontoglires'), ('clade', 'Boreoeutheria'), ('clade', 'Eutheria'), ('clade', 'Theria'), ('class', 'Mammalia'), ('clade', 'Amniota'), ('clade', 'Tetrapoda'), ('clade', 'Dipnotetrapodomorpha'), ('superclass', 'Sarcopterygii'), ('clade', 'Euteleostomi'), ('clade', 'Teleostomi'), ('clade', 'Gnathostomata'), ('clade', 'Vertebrata'), ('subphylum', 'Craniata'), ('phylum', 'Chordata'), ('clade', 'Deuterostomia'), ('clade', 'Bilateria'), ('clade', 'Eumetazoa'), ('kingdom', 'Metazoa'), ('clade', 'Opisthokonta'), ('superkingdom', 'Eukaryota'), ('no rank', 'cellular organisms'), ('no rank', 'root')]
{'order': 'Lagomorpha', 'clade': 'Opisthokonta', 'superorder': 'Euarchontoglires', 'class': 'Mammalia', 'superclass': 'Sarcopterygii', 'subphylum': 'Craniata', 'phylum': 'Chordata', 'kingdom': 'Metazoa', 'superkingdom': 'Eukaryota'}
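Note that the dictionary holds a single 'clade' entry and no 'no rank' entry, even though the ranked lineage contains many clades. The sketch below shows one way to reproduce that shape from an abbreviated copy of the ranked lineage. It is an assumption about how the output arises, not a description of taxopy's code: a dictionary keeps one value per key, so a later pair with a repeated rank overwrites the earlier one.

```python
# Abbreviated copy of the ranked lineage printed above, most specific taxon first.
ranked_name_lineage = [
    ("order", "Lagomorpha"),
    ("clade", "Glires"),
    ("superorder", "Euarchontoglires"),
    ("clade", "Boreoeutheria"),
    ("class", "Mammalia"),
    ("clade", "Opisthokonta"),
    ("superkingdom", "Eukaryota"),
    ("no rank", "root"),
]

rank_name = {}
for rank, name in ranked_name_lineage:
    if rank != "no rank":  # unranked nodes are left out of the dictionary
        rank_name[rank] = name  # a repeated rank overwrites the earlier entry

# To keep every clade, filter the list of pairs instead of using the dictionary.
clades = [name for rank, name in ranked_name_lineage if rank == "clade"]

print(rank_name)
print(clades)
```

In practice this means the ranked lineage is the safer source whenever a rank such as 'clade' can occur more than once.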
You can use the parent method to get a Taxon object of the parent node of a given taxon:
saccharomyces_parent = saccharomyces.parent(taxdb)
print(saccharomyces_parent.rank)
print(saccharomyces_parent.name)
family
Saccharomycetaceae
LCA and majority vote
You can get the lowest common ancestor of a list of taxa using the find_lca function:
human_lagomorpha_lca = taxopy.find_lca([human, lagomorpha], taxdb)
print(human_lagomorpha_lca.name)
Euarchontoglires
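The idea behind the lowest common ancestor can be sketched with plain lists. The code below is illustrative only and does not reproduce taxopy's internals: the lineages are abbreviated, ordered from the most specific taxon to the root (like `name_lineage`), and `find_lca_sketch` is a hypothetical helper.

```python
# Abbreviated lineages, most specific taxon first.
human = ["Homo sapiens", "Homo", "Homininae", "Hominidae", "Primates",
         "Euarchontoglires", "Boreoeutheria", "Mammalia", "root"]
lagomorpha = ["Lagomorpha", "Glires", "Euarchontoglires", "Boreoeutheria",
              "Mammalia", "root"]


def find_lca_sketch(lineages):
    """Return the most specific taxon that appears in every lineage."""
    first, *rest = lineages
    rest_sets = [set(lineage) for lineage in rest]
    # Walk the first lineage from specific to general; the first shared taxon is the LCA.
    for taxon in first:
        if all(taxon in members for members in rest_sets):
            return taxon
    raise ValueError("the lineages share no taxon")


print(find_lca_sketch([human, lagomorpha]))
```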
You can also use the find_majority_vote function to find the most specific taxon that is shared by more than half of the lineages in a list of taxa:
majority_vote = taxopy.find_majority_vote([human, gorilla, lagomorpha], taxdb)
print(majority_vote.name)
Homininae
The find_majority_vote function lets you control its stringency via the fraction parameter. For instance, if you set fraction to 0.75, the resulting taxon is shared by more than 75% of the input lineages. By default, fraction is 0.5.
majority_vote = taxopy.find_majority_vote([human, gorilla, lagomorpha], taxdb, fraction=0.75)
print(majority_vote.name)
Euarchontoglires
You can also assign weights to each input lineage:
majority_vote = taxopy.find_majority_vote([saccharomyces, human, gorilla, lagomorpha], taxdb)
weighted_majority_vote = taxopy.find_majority_vote([saccharomyces, human, gorilla, lagomorpha], taxdb, weights=[3, 1, 1, 1])
print(majority_vote.name)
print(weighted_majority_vote.name)
Euarchontoglires
Opisthokonta
To check the level of agreement between the taxa that were aggregated using find_majority_vote and the output taxon, inspect the agreement attribute:
print(majority_vote.agreement)
print(weighted_majority_vote.agreement)
0.75
1.0
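The interplay of fraction, weights and agreement can be illustrated with a small sketch. It is written under stated assumptions rather than taken from taxopy: lineages are abbreviated and listed root first, a taxon wins a level only if its weighted share is strictly greater than fraction, and agreement is taken to be the weighted share of the winning taxon. The function name `majority_vote_sketch` is hypothetical.

```python
from collections import defaultdict

# Abbreviated lineages, root first, so a shared taxon sits at the same depth everywhere.
saccharomyces = ["root", "Eukaryota", "Opisthokonta", "Fungi", "Saccharomyces"]
human = ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
         "Primates", "Homininae", "Homo sapiens"]
gorilla = ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
           "Primates", "Homininae", "Gorilla gorilla"]
lagomorpha = ["root", "Eukaryota", "Opisthokonta", "Metazoa", "Euarchontoglires",
              "Lagomorpha"]


def majority_vote_sketch(lineages, fraction=0.5, weights=None):
    """Return (taxon, agreement) for the deepest taxon whose share exceeds fraction."""
    if weights is None:
        weights = [1.0] * len(lineages)
    total = sum(weights)
    winner, agreement = None, 0.0
    for depth in range(max(len(lineage) for lineage in lineages)):
        support = defaultdict(float)
        for lineage, weight in zip(lineages, weights):
            if depth < len(lineage):
                support[lineage[depth]] += weight
        best, best_weight = max(support.items(), key=lambda item: item[1])
        if best_weight / total <= fraction:
            break  # no taxon at this depth has enough support
        winner, agreement = best, best_weight / total
    return winner, agreement


print(majority_vote_sketch([human, gorilla, lagomorpha]))
print(majority_vote_sketch([human, gorilla, lagomorpha], fraction=0.75))
print(majority_vote_sketch([saccharomyces, human, gorilla, lagomorpha]))
print(majority_vote_sketch([saccharomyces, human, gorilla, lagomorpha], weights=[3, 1, 1, 1]))
```

With weights of 3, 1, 1 and 1, the Saccharomyces lineage carries exactly half of the total weight, so neither Fungi nor Metazoa exceeds the 0.5 threshold and the vote stops at Opisthokonta, which all four lineages share.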
Taxid from name
If you only have the name of a taxon, you can get its corresponding taxid using the taxid_from_name function:
taxid = taxopy.taxid_from_name('Homininae', taxdb)
print(taxid)
[207598]
This function returns a list of all taxonomic identifiers associated with the input name. In the case of homonyms, the list will contain multiple taxonomic identifiers:
taxid = taxopy.taxid_from_name('Aotus', taxdb)
print(taxid)
[9504, 114498]
In case a list of names is provided as input, the function will return a list of lists.
taxid = taxopy.taxid_from_name(['Homininae', 'Aotus'], taxdb)
print(taxid)
[[207598], [9504, 114498]]
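The list-valued results follow naturally once the identifier-to-name mapping is inverted. The following is a minimal sketch with a toy dictionary, not taxopy's code; `taxid_from_name_sketch` is a hypothetical helper that mimics the string and list behaviours described above.

```python
from collections import defaultdict

# Toy identifier-to-name mapping; 'Aotus' is a homonym with two identifiers.
taxid2name = {207598: "Homininae", 9504: "Aotus", 114498: "Aotus", 9606: "Homo sapiens"}

# Invert the mapping: one name may point to several identifiers.
name2taxids = defaultdict(list)
for taxid, name in taxid2name.items():
    name2taxids[name].append(taxid)


def taxid_from_name_sketch(names):
    """Return a list of identifiers for one name, or a list of lists for many names."""
    if isinstance(names, str):
        return sorted(name2taxids.get(names, []))
    return [sorted(name2taxids.get(name, [])) for name in names]


print(taxid_from_name_sketch("Aotus"))
print(taxid_from_name_sketch(["Homininae", "Aotus"]))
```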
When querying a TaxDb using a taxon name, you can enable fuzzy search by setting the fuzzy parameter of taxid_from_name to True. This allows the function to find taxa with names similar, but not identical, to the query string(s).
For a practical use case of this feature, consider the GTDB taxonomy. In GTDB some taxa have suffixes appended to their names because they are either not monophyletic in the GTDB reference tree or have unstable placements between different releases. By using fuzzy searches, you can find all the taxonomic identifiers representing a given taxon, such as Myxococcota, without needing to know in advance if any suffixes are appended to the name.
# The `taxdump_url` parameter of the `TaxDb` class can be used to retrieve a custom taxdump from a URL. In this case, we will use a GTDB taxdump provided by Wei Shen (https://github.com/shenwei356/gtdb-taxdump).
gtdb_taxdb = taxopy.TaxDb(taxdump_url="https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.5.0/gtdb-taxdump-R220.tar.gz")
for t in taxopy.taxid_from_name("Myxococcota", gtdb_taxdb, fuzzy=True):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
Myxococcota_A
Myxococcota
You can adjust the minimum similarity threshold between the query string(s) and the matches in the database using the score_cutoff parameter, which determines how closely a name must match a query string to be considered a valid result. The default value is 0.9, but you can lower this threshold to find matches that are less similar to the queries.
for t in taxopy.taxid_from_name(
    "Myxococcota", gtdb_taxdb, fuzzy=True, score_cutoff=0.7
):
    print(taxopy.Taxon(t, gtdb_taxdb).name)
Myxococcales
Myxococcota_A
Myxococcus
Myxococcia
Myxococcota
Myxococcaceae
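The effect of a similarity cutoff can be explored without downloading a database. The sketch below swaps in `difflib.SequenceMatcher` from the Python standard library as a stand-in scorer, so its scores are not expected to match those that taxopy obtains through rapidfuzz. The name list is a small illustrative sample and `fuzzy_names` is a hypothetical helper.

```python
from difflib import SequenceMatcher

# Small sample of names; 'Escherichia' is included as an unrelated control.
NAMES = ["Myxococcota", "Myxococcota_A", "Myxococcales", "Escherichia"]


def fuzzy_names(query, names, score_cutoff=0.9):
    """Keep the names whose similarity to the query is at least score_cutoff (0 to 1)."""
    hits = []
    for name in names:
        score = SequenceMatcher(None, query, name).ratio()
        if score >= score_cutoff:
            hits.append(name)
    return hits


print(fuzzy_names("Myxococcota", NAMES))
print(fuzzy_names("Myxococcota", NAMES, score_cutoff=0.7))
```

Lowering the cutoff widens the result set in the same way as in the taxopy example above: names that differ only by a suffix pass a strict threshold, while names that share just a stem need a more permissive one.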
Acknowledgements
Some of the code used in taxopy was taken from the CAT/BAT tool for taxonomic classification of contigs and metagenome-assembled genomes.
File details
Details for the file taxopy-0.14.0.tar.gz.
File metadata
- Download URL: taxopy-0.14.0.tar.gz
- Upload date:
- Size: 33.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.9.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c39c4c4330b322f2e0505de45c2980a53eb945f8129fc679dd655a093042dbe6 |
| MD5 | faea5f83b4b2b51ef0f6299ccf09acdd |
| BLAKE2b-256 | 2c6792ec274c4a445f8dc7043c0116af84c49152a5f6e007ee370faee21c74c3 |
File details
Details for the file taxopy-0.14.0-py3-none-any.whl.
File metadata
- Download URL: taxopy-0.14.0-py3-none-any.whl
- Upload date:
- Size: 25.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.9.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1dabbd091d0be081462d03d57ad0143c73273897ec2e2dfba53758bd24e65ee5 |
| MD5 | 4db967d897926698610ee09869b15bcc |
| BLAKE2b-256 | 849806ba4db905613fa24d227bc81219b7b2e6d48f803eac407d86f2ca54991d |