Biosaic: a lightweight DNA and protein k-mer tokenizer with pre-trained vocab support.
Biosaic
Overview
Biosaic (Bio-Mosaic) is a tokenizer library built for Enigma2. It provides a tokenizer and an embedder for DNA and amino-acid (protein) sequences. It also includes encoders based on the VQ-VAE and Evoformer architectures that convert sequences into embeddings and back, for use in model training.
Features
- Tokenization: splits sequences into k-mers.
- Encoding: converts sequences into embeddings for classification and training.
- Ease of use: a small library with a simple API.
- SoTA encoders: the Evoformer and VQ-VAE models are inspired by AlphaFold-2.
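To make the tokenization step concrete, here is a minimal sketch of k-mer splitting in plain Python. It is not Biosaic's implementation: `kmerize` and its `overlapping` flag are hypothetical names used only for illustration. The overlapping mode (stride 1) matches the tokens printed in the Usage output for `continuous=True`; the non-overlapping mode is shown for contrast, and whether `continuous=False` behaves this way in the library is an assumption.

```python
def kmerize(sequence, k, overlapping=True):
    """Split a sequence into k-mers (hypothetical helper, not the Biosaic API)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Stride 1 gives overlapping windows; stride k gives back-to-back chunks.
    step = 1 if overlapping else k
    # Windows that would run past the end are skipped, so a trailing
    # remainder shorter than k is dropped in the non-overlapping case.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


overlapping_kmers = kmerize("TCTTACATAG", 3)
chunked_kmers = kmerize("TCTTACATAG", 3, overlapping=False)
print(overlapping_kmers)  # ['TCT', 'CTT', 'TTA', 'TAC', 'ACA', 'CAT', 'ATA', 'TAG']
print(chunked_kmers)      # ['TCT', 'TAC', 'ATA']
```

A sequence of length n yields n - k + 1 overlapping k-mers, which is why the 51-base example in Usage produces 49 tokens.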
Prerequisites
System Requirements
- Operating System: Linux, macOS, or Windows with support for GCC or Clang.
- Python: Version 3.9 or higher.
Dependencies
- Python Modules:
  - pickle: for loading and saving model files.
  - os: for file and path handling.
  - urllib: for loading the vocabs from the repo.
  - tempfile: for loading the vocabs from the repo.
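As a rough illustration of how these standard-library modules can work together, the sketch below downloads a pickled vocabulary into a temporary file and then unpickles it. This is an assumption about the general pattern, not Biosaic's actual loader: `load_pickled_vocab` is a hypothetical function, the four-entry vocabulary is made up for the demo, and a local `file://` URL stands in for the real repository URL so the example runs offline.

```python
import os
import pickle
import tempfile
import urllib.request
from pathlib import Path


def load_pickled_vocab(url):
    """Fetch a pickled vocab from `url` via a temp file (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    # Store the download in a temp file so fetching and parsing stay separate.
    with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
        tmp.write(payload)
        tmp_path = tmp.name
    try:
        with open(tmp_path, "rb") as handle:
            # pickle can execute arbitrary code: only load trusted files.
            return pickle.load(handle)
    finally:
        os.remove(tmp_path)


# Offline demo: write a made-up vocab locally and load it through a file:// URL.
demo_dir = tempfile.mkdtemp()
demo_file = Path(demo_dir) / "demo_vocab.pkl"
demo_file.write_bytes(pickle.dumps({"A": 0, "C": 1, "G": 2, "T": 3}))
demo_vocab = load_pickled_vocab(demo_file.as_uri())
print(demo_vocab)
```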
Installation
1. From PyPI:
pip install biosaic
2. Clone the Repo:
git clone https://github.com/delveopers/biosaic.git
cd biosaic
Usage
Create a tokenizer instance with a specified k-mer size, then use it to split a sequence into tokens and to encode and decode it:
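from biosaic import tokenizer

# DNA tokenizer with k-mer size 3; continuous=True yields the overlapping k-mers shown in the output
token = tokenizer(mode="dna", kmer=3, continuous=True)
print(token.vocab_size)

sequence = "TCTTACATAGAAAGGAGCGGTATTTGGTATGAATTTATTTGCAACTGACTG"
encoded = token.encode(sequence)      # sequence -> list of token ids
decoded = token.decode(encoded)       # token ids -> sequence
tokenized = token.tokenize(sequence)  # sequence -> list of k-mer strings

print(tokenized)
print(encoded[:100])
print(decoded[:300])
print(decoded == sequence)  # round-trip check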
For more information refer to the docs:
Output
84
['TCT', 'CTT', 'TTA', 'TAC', 'ACA', 'CAT', 'ATA', 'TAG', 'AGA', 'GAA', 'AAA', 'AAG', 'AGG', 'GGA', 'GAG', 'AGC', 'GCG', 'CGG', 'GGT', 'GTA', 'TAT', 'ATT', 'TTT', 'TTG', 'TGG', 'GGT', 'GTA', 'TAT', 'ATG', 'TGA', 'GAA', 'AAT', 'ATT', 'TTT', 'TTA', 'TAT', 'ATT', 'TTT', 'TTG', 'TGC', 'GCA', 'CAA', 'AAC', 'ACT', 'CTG', 'TGA', 'GAC', 'ACT', 'CTG']
[75, 51, 80, 69, 24, 39, 32, 70, 28, 52, 20, 22, 30, 60, 54, 29, 58, 46, 63, 64, 71, 35, 83, 82, 78, 63, 64, 71, 34, 76, 52, 23, 35, 83, 80, 71, 35, 83, 82, 77, 56, 36, 21, 27, 50, 76, 53, 27, 50]
TCTTACATAGAAAGGAGCGGTATTTGGTATGAATTTATTTGCAACTGACTG
True
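The printed IDs are consistent with a vocabulary that lists all 1-mers, then all 2-mers, then all 3-mers, each group in alphabetical order over A, C, G, T, which gives 4 + 16 + 64 = 84 entries. The sketch below reproduces the numbers above under that assumption. It is an inference from the output, not Biosaic's actual implementation, and `build_vocab`, `encode`, and `decode` are hypothetical helpers.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed symbol order, inferred from the printed ids


def build_vocab(max_k):
    """Map every string of length 1..max_k over ALPHABET to an integer id."""
    vocab = {}
    for k in range(1, max_k + 1):
        # product() enumerates strings of length k in alphabetical order.
        for letters in product(ALPHABET, repeat=k):
            vocab["".join(letters)] = len(vocab)
    return vocab


def encode(sequence, vocab, k):
    """Look up the id of each overlapping k-mer (stride 1)."""
    return [vocab[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


def decode(ids, vocab):
    """Rebuild the sequence: first k-mer in full, then the last letter of each later k-mer."""
    id_to_kmer = {index: kmer for kmer, index in vocab.items()}
    kmers = [id_to_kmer[index] for index in ids]
    if not kmers:
        return ""
    return kmers[0] + "".join(kmer[-1] for kmer in kmers[1:])


vocab = build_vocab(3)
sequence = "TCTTACATAGAAAGGAGCGGTATTTGGTATGAATTTATTTGCAACTGACTG"
ids = encode(sequence, vocab, 3)
print(len(vocab))                      # 84
print(ids[:4])                         # [75, 51, 80, 69]
print(decode(ids, vocab) == sequence)  # True
```

Because consecutive overlapping k-mers share k - 1 letters, decoding only needs the last letter of every k-mer after the first, which is why the round trip is lossless.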
File details
Details for the file biosaic-0.1.5.tar.gz.
File metadata
- Download URL: biosaic-0.1.5.tar.gz
- Upload date:
- Size: 22.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a821aac3debb59fa5d5a4e765c98df2e0c6133b56ba7ed6f37f439d72096e7e7 |
| MD5 | e0e6f51bd4f7c1f5197545e87921dd8d |
| BLAKE2b-256 | f15432795547636285f548111d718aa61b503d7e31fd0edc2d7cdcfebf4b0adb |
File details
Details for the file biosaic-0.1.5-py3-none-any.whl.
File metadata
- Download URL: biosaic-0.1.5-py3-none-any.whl
- Upload date:
- Size: 21.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bd829c1040a38cecac4c5d415d0b18b2754867e9f9f69465e602dd211b8b86d3 |
| MD5 | 49873f840b73d753fed42d704361be58 |
| BLAKE2b-256 | 0d27ccef022ec452b2161c6c286631d13bc284bcddda7054e073e3bd9940c2dd |