Biosaic: a lightweight DNA and protein k-mer tokenizer with pre-trained vocab support.

Biosaic

Overview

Biosaic (Bio-Mosaic) is a tokenizer library built for Enigma2. It provides a tokenizer and an embedder for DNA and amino-acid (protein) sequences. It also includes VQ-VAE and Evoformer-based encoders that convert sequences into embeddings, and back again, for use in model training.

Features

  • Tokenization: converts sequences into k-mers.
  • Encoding: converts sequences into embeddings for classification and training.
  • Ease of use: a small library with a simple API.
  • SoTA encoders: the Evoformer and VQ-VAE models are inspired by AlphaFold 2.
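
To make the tokenization feature concrete, here is a minimal sketch of splitting a sequence into k-mers. The function name `kmer_tokenize` and its `overlapping` flag are hypothetical and are not part of the Biosaic API; the sketch only illustrates the general technique of sliding-window versus chunked k-mer splitting.

```python
def kmer_tokenize(sequence: str, k: int, overlapping: bool = True) -> list[str]:
    """Split a sequence into k-mers (illustrative helper, not Biosaic's code).

    overlapping=True slides a window of width k one character at a time.
    overlapping=False cuts consecutive chunks of width k; the final chunk
    may be shorter than k if the length is not a multiple of k.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if overlapping:
        # Last valid window starts at len(sequence) - k.
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


print(kmer_tokenize("TCTTAC", 3))                     # ['TCT', 'CTT', 'TTA', 'TAC']
print(kmer_tokenize("TCTTAC", 3, overlapping=False))  # ['TCT', 'TAC']
```

An overlapping split of a sequence of length n yields n - k + 1 tokens, so it keeps every position in context at the cost of a longer token list.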

Prerequisites

System Requirements

  • Operating System: Linux, macOS, or Windows with support for GCC or Clang.
  • Python: Version 3.9 or higher.

Dependencies

  • Python Modules (all part of the standard library, so nothing extra needs installing):
    • pickle: for loading and saving model files.
    • os: for file and path handling.
    • urllib: for loading the vocabs from the repo.
    • tempfile: for loading the vocabs from the repo.
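
As a rough illustration of how these standard-library modules fit together, the sketch below round-trips a vocabulary mapping through a pickle file in a temporary directory. The `vocab` contents, the file name `vocab.pkl`, and the helpers `save_vocab` and `load_vocab` are all hypothetical; Biosaic's actual file layout and loading code are not shown in this document.

```python
import os
import pickle
import tempfile

# Hypothetical vocabulary: a plain token -> id mapping.
vocab = {"A": 0, "C": 1, "G": 2, "T": 3}


def save_vocab(mapping: dict, path: str) -> None:
    """Serialize the mapping to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(mapping, f)


def load_vocab(path: str) -> dict:
    """Read the mapping back. Only unpickle files from a trusted source,
    because unpickling can execute arbitrary code."""
    with open(path, "rb") as f:
        return pickle.load(f)


# The temporary directory and everything in it is removed on exit.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.pkl")
    save_vocab(vocab, vocab_path)
    restored = load_vocab(vocab_path)

print(restored == vocab)  # True
```

For a remote vocabulary, the file would first be fetched into the temporary path (for example with `urllib.request.urlretrieve`) before being unpickled; no download URL is given here, so that step is left out.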

Installation

1. From PyPI:

pip install biosaic

2. Clone the Repo:

git clone https://github.com/delveopers/biosaic.git
cd biosaic

Usage

Create a tokenizer instance with a chosen mode and k-mer size, then use it to split a sequence into k-mer tokens, encode it into token ids, and decode the ids back into the original sequence:

from biosaic import tokenizer

# DNA tokenizer with a k-mer size of 3
token = tokenizer(mode="dna", kmer=3, continuous=True)
print(token.vocab_size)

sequence = "TCTTACATAGAAAGGAGCGGTATTTGGTATGAATTTATTTGCAACTGACTG"
tokenized = token.tokenize(sequence)  # list of k-mer strings
encoded = token.encode(sequence)      # list of integer token ids
decoded = token.decode(encoded)       # ids back to a sequence string

print(tokenized)
print(encoded[:100])
print(decoded[:300])
print(decoded == sequence)  # round-trip check

For more information, refer to the docs.

Output

84

['TCT', 'CTT', 'TTA', 'TAC', 'ACA', 'CAT', 'ATA', 'TAG', 'AGA', 'GAA', 'AAA', 'AAG', 'AGG', 'GGA', 'GAG', 'AGC', 'GCG', 'CGG', 'GGT', 'GTA', 'TAT', 'ATT', 'TTT', 'TTG', 'TGG', 'GGT', 'GTA', 'TAT', 'ATG', 'TGA', 'GAA', 'AAT', 'ATT', 'TTT', 'TTA', 'TAT', 'ATT', 'TTT', 'TTG', 'TGC', 'GCA', 'CAA', 'AAC', 'ACT', 'CTG', 'TGA', 'GAC', 'ACT', 'CTG']

[75, 51, 80, 69, 24, 39, 32, 70, 28, 52, 20, 22, 30, 60, 54, 29, 58, 46, 63, 64, 71, 35, 83, 82, 78, 63, 64, 71, 34, 76, 52, 23, 35, 83, 80, 71, 35, 83, 82, 77, 56, 36, 21, 27, 50, 76, 53, 27, 50]

TCTTACATAGAAAGGAGCGGTATTTGGTATGAATTTATTTGCAACTGACTG

True
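
The printed ids are consistent with a vocabulary that lists every DNA string of length 1 to 3 over the alphabet A, C, G, T, shorter strings first and each length in alphabetical order (4 + 16 + 64 = 84 entries). The sketch below rebuilds that ordering and reproduces the ids above. This layout is inferred from the output only; it is an assumption, not a confirmed description of Biosaic's internals, and the helper names `build_vocab`, `encode`, and `decode` are hypothetical.

```python
from itertools import product


def build_vocab(k: int, alphabet: str = "ACGT") -> list[str]:
    """All strings of length 1..k, shorter first, each length in alphabet order."""
    return ["".join(p) for n in range(1, k + 1) for p in product(alphabet, repeat=n)]


vocab = build_vocab(3)
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(sequence: str, k: int = 3) -> list[int]:
    """Map each overlapping k-mer to its position in the vocabulary."""
    return [token_to_id[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


def decode(ids: list[int]) -> str:
    """Rebuild the sequence from overlapping k-mers.

    Neighbouring tokens share k-1 characters, so keep the first token whole
    and append only the last character of every later token.
    """
    tokens = [vocab[i] for i in ids]
    if not tokens:
        return ""
    return tokens[0] + "".join(t[-1] for t in tokens[1:])


sequence = "TCTTACATAGAAAGGAGCGGTATTTGGTATGAATTTATTTGCAACTGACTG"
ids = encode(sequence)
print(len(vocab))               # 84
print(ids[:5])                  # [75, 51, 80, 69, 24]
print(decode(ids) == sequence)  # True
```

For example, `TCT` sits after the 4 single bases and 16 dinucleotides, at offset 3·16 + 1·4 + 3 = 55 within the 3-mers, giving id 20 + 55 = 75.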
