Fungi DNA barcoder based on semantic searching

TaxoTagger

Identify taxonomy labels for fungal DNA sequences using semantic search.

Features:

  • Building vector databases directly from DNA sequences (FASTA file) with ease
  • Supporting various embedding models
  • Semantic searching with high efficiency

Installation

Install from PyPI:

# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10

# install the `taxotagger` package
pip install --pre taxotagger

Or install from source code:

# create a virtual environment
conda create -n venv-3.10 python=3.10
conda activate venv-3.10

# install from this repo
pip install git+https://github.com/MycoAI/taxotagger

Usage

Build a vector database from a FASTA file

from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# creating the database will take ~30s
tt.create_db('data/database.fasta')

By default, the MycoAI-CNN.pt model is used as the embedding model, and the database is created and stored in the default folder (~/.cache/mycoai) unless you set config.mycoai_home to a different path. The embedding model is downloaded to that folder automatically.
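To see what ended up in the default folder after building a database, a small standard-library helper can list its contents. This is only a sketch: the helper name list_cache_files is made up for illustration, and it assumes nothing about which file names taxotagger actually writes there.

```python
from pathlib import Path

# Default folder named in the text above. The helper below is illustrative
# only and is not part of the taxotagger API.
DEFAULT_MYCOAI_HOME = Path.home() / ".cache" / "mycoai"


def list_cache_files(home: Path = DEFAULT_MYCOAI_HOME) -> list[str]:
    """Return sorted relative paths of all files under `home`.

    Returns an empty list if the folder does not exist yet.
    """
    if not home.is_dir():
        return []
    return sorted(
        p.relative_to(home).as_posix() for p in home.rglob("*") if p.is_file()
    )


# Print whatever is currently stored in the default folder (possibly nothing).
for name in list_cache_files():
    print(name)
```

If config.mycoai_home was set to another location, pass that path to the helper instead of relying on the default.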

Conduct a semantic search with a FASTA file

from taxotagger import ProjectConfig
from taxotagger import TaxoTagger

config = ProjectConfig()
tt = TaxoTagger(config)

# semantic search and return the top 1 result for each query sequence
res = tt.search('data/query.fasta', limit=1)

The search result res is a dictionary keyed by taxonomic level name. Each value is a list with one entry per query sequence, and each entry holds the matched results for that query. For example, res['phylum'] will look like:

[
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}]
]

The first inner list holds the top results for the first query sequence, and the second inner list holds the top results for the second query sequence.
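The nested structure can be flattened into one summary per query. The sketch below uses the sample data shown above in place of a real res['phylum'], so it runs without taxotagger installed. The helper name best_matches is hypothetical, and it assumes that the hits in each inner list are ordered best first, which the example does not confirm.

```python
# Sample data copied from the example above; in practice this would be
# res["phylum"] as returned by tt.search(...).
phylum_results = [
    [{"id": "KY106088", "distance": 1.0, "entity": {"phylum": "Ascomycota"}}],
    [{"id": "KY106087", "distance": 0.9999998807907104, "entity": {"phylum": "Ascomycota"}}],
]


def best_matches(level_results, level):
    """Return (id, label, distance) for the first hit of each query.

    A query with no hits yields None. Assumes (not confirmed by the source)
    that each inner list is ordered with the best match first.
    """
    summary = []
    for hits in level_results:
        if not hits:
            summary.append(None)
            continue
        top = hits[0]
        summary.append((top["id"], top["entity"][level], top["distance"]))
    return summary


# One line per query sequence, in the same order as the query FASTA file.
for query_index, match in enumerate(best_matches(phylum_results, "phylum")):
    print(query_index, match)
```

The same helper works for any other taxonomic level present in res, as long as the level name passed in matches the key inside each hit's "entity" dictionary.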

Questions and feedback

Please submit an issue if you have any questions or feedback.
