
Fungi DNA barcoder based on semantic searching


TaxoTagger


TaxoTagger is an open-source Python library for DNA taxonomy identification, that is, assigning DNA sequences to their taxonomic groups. It uses deep learning and semantic search to provide efficient and accurate results.

Key Features:

  • 🚀 Build vector databases from DNA sequences with ease
  • ⚡ Conduct efficient semantic searches for precise results
  • 🛠 Extend support for custom embedding models effortlessly
  • 🌐 Interact seamlessly through a user-friendly web app

Installation

TaxoTagger requires Python 3.10 or later.

# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10

# install the `taxotagger` package
pip install --pre taxotagger

Usage

Build a vector database from a FASTA file

from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# creating the database will take ~30s
tt.create_db('data/database.fasta')

By default, the ~/.cache/mycoai folder stores both the vector database and the embedding model. The MycoAI-CNN.pt model is downloaded to this folder automatically if it is not already present, and the vector database is created there and named after the model.
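To check what has been downloaded or built so far, you can list the contents of the cache folder. The sketch below uses only the standard library. The helper name `list_cache_files` is hypothetical and not part of TaxoTagger, and the default path is simply the one stated above.

```python
from pathlib import Path

# Default location taken from the text above; adjust it if your setup differs.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "mycoai"


def list_cache_files(cache_dir: Path = DEFAULT_CACHE_DIR) -> list[str]:
    """Return the sorted names of entries in the cache folder (hypothetical helper)."""
    if not cache_dir.is_dir():
        # Nothing has been downloaded or created yet.
        return []
    return sorted(entry.name for entry in cache_dir.iterdir())
```

Calling `list_cache_files()` after the first `create_db` run should show the model file and the database, assuming the default folder is in use.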

Conduct a semantic search with a FASTA file

from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# semantic search and return the top 1 result for each query sequence
res = tt.search('data/query.fasta', limit=1)

The data/query.fasta file contains two query sequences: KY106088 and KY106087.
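To confirm which sequences a FASTA file contains before searching, you can read its header lines. This is a minimal standard-library sketch, assuming the sequence ID is the first whitespace-delimited token after `>` on each header line (real files may use a different header layout). `read_fasta_ids` is a hypothetical helper, not part of TaxoTagger, and the bases shown are invented placeholders.

```python
from typing import Iterable


def read_fasta_ids(lines: Iterable[str]) -> list[str]:
    """Collect sequence IDs from FASTA header lines (hypothetical helper)."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            header = line[1:].strip()
            if header:
                ids.append(header.split()[0])
    return ids


# Placeholder records; the bases are invented for illustration only.
example = [
    ">KY106088 example description",
    "ACGTACGT",
    ">KY106087",
    "TTGACCA",
]
print(read_fasta_ids(example))  # ['KY106088', 'KY106087']
```

For a file on disk, pass an open file handle instead of a list, for example `with open('data/query.fasta') as fh: read_fasta_ids(fh)`.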

The search results res will be a dictionary with taxonomic level names as keys and matched results as values for each of the two query sequences. For example, res['phylum'] will look like:

[
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}]
]

The first inner list is the top results for the first query sequence, and the second inner list is the top results for the second query sequence.
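The nested structure can be flattened into one row per query. The sketch below hard-codes the example output above rather than calling TaxoTagger, so it relies only on the shape shown there. `top_hits` is a hypothetical helper, and it assumes each inner list is ordered with the best match first.

```python
# Example data copied from the res['phylum'] output shown above.
phylum_results = [
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
]


def top_hits(level: str, results: list[list[dict]]) -> list[tuple[str, str, float]]:
    """Return (matched ID, taxon name, score) for the first hit of each query."""
    rows = []
    for hits in results:
        if not hits:
            # A query with no matches contributes no row.
            continue
        best = hits[0]  # assumption: hits are ordered best-first
        rows.append((best["id"], best["entity"][level], best["distance"]))
    return rows


for matched_id, taxon, score in top_hits("phylum", phylum_results):
    print(f"{matched_id}\t{taxon}\t{score:.4f}")
```

With real search output, the same function could be applied to each taxonomic level in turn, for example `top_hits('phylum', res['phylum'])`.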

The id field is the sequence ID of the matched sequence. The distance field is the cosine similarity between the query sequence and the matched sequence. Its value lies between 0 and 1, and the closer it is to 1, the more similar the two sequences are. The entity field holds the taxonomic information of the matched sequence.
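For intuition about this score, cosine similarity between two embedding vectors can be computed as shown below. This is a generic pure-Python sketch of the formula, not TaxoTagger's internal code. The vectors are toy values; real embeddings come from the model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Parallel vectors point the same way, so the score is (approximately) 1.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors share no direction, so the score is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same, orthogonal)
```

A score just below 1, such as the 0.9999998807907104 in the example output, is consistent with floating-point rounding of an otherwise exact match.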

The top result for each query sequence is the query sequence itself, because both query sequences are also in the database. Try different query sequences to see other search results.

Docs

Please visit the official documentation for more details.

Questions and feedback

Please submit an issue if you have any questions or feedback.

Citation

If you use TaxoTagger in your work, please cite it by clicking "Cite this repository" at the top right of the project's repository page.
