Genomic Tokenizer
About
This is a tokenizer for DNA that aligns with the central dogma of molecular biology: it reads a sequence codon by codon rather than base by base. You can use it to train genomic transformer models. See the BERT and GPT2 models trained on the human genome. The tokenizer has not been tested yet, so feel free to try it and improve it. Please cite the project or contact me if you use it in your research.
🚀 Installation
pip install git+https://github.com/dermatologist/genomic-tokenizer.git
🔧 Example usage
from genomic_tokenizer import GenomicTokenizer
# Fasta header if present is ignored.
fasta = """
AGGCGAGGCGCGGGCGGAGGCGGTGCGCGGGCGGAGGCGGGGCGCGGAGATGTGGCGGAGGTGGAGGCGG
AGGCGTAGCCGCCCCTGGGGACGTCATTGGTGGCGGAAGCAATCGCCGGCAACCAGCTGTAAGCGAGGTA
GGCTCACTCGGGCACGGAGGGTGCGGGTGAGAAAGGGAACGATTTGCTAGGAGTGTATGCGCCCGTGCTA
"""
model_max_length = 2048
tokenizer = GenomicTokenizer(model_max_length)
tokens = tokenizer(fasta)
print(tokens)
✨ Output
{'input_ids': [2, 7, 12, 17, 19, 16, 1, 7, 20, 6, 12, 21, 16, 12, 20, 12, 12, 8, 12, 1, 10, 20, 10, 20, 11, 7, 20, 21, 23, 8, 7, 20, 7, 6, 12, 21, 19, 10, 11, 16, 19, 7, 1, 22, 7, 1, 19, 21, 7, 16, 1, 21, 12, 23, 19, 12, 20, 6, 1],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
🔧 Tokenization algorithm
- Identify the first occurrence of the start codon ATG.
- Split the sequence into codons of length 3 starting from the start codon.
- Convert synonymous codons to the same token.
- Convert stop codons to the [SEP] token.
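The steps above can be sketched in plain Python. This is an illustrative sketch only, not the package's implementation: the names `CODON_TABLE` and `codon_tokens` are hypothetical, the single-letter amino acid symbols and the `[UNK]` fallback are assumptions, and the real tokenizer emits integer ids from its own vocabulary. The sketch also assumes that tokenization continues past a stop codon, as the sample output above suggests.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def codon_tokens(sequence):
    """Return one symbolic token per codon, starting at the first ATG."""
    # Drop FASTA header lines and join the remaining lines into one string.
    seq = "".join(
        line.strip() for line in sequence.splitlines() if not line.startswith(">")
    ).upper()

    # Step 1: find the first start codon; no ATG means nothing to tokenize.
    start = seq.find("ATG")
    if start == -1:
        return []

    tokens = []
    # Step 2: walk the sequence in non-overlapping triplets from the start codon.
    # A trailing fragment shorter than 3 bases is ignored.
    for i in range(start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        # Step 3: synonymous codons share an amino acid, hence the same token.
        # Codons with ambiguous bases (for example N) fall back to [UNK].
        amino_acid = CODON_TABLE.get(codon, "[UNK]")
        # Step 4: stop codons become the separator token.
        tokens.append("[SEP]" if amino_acid == "*" else amino_acid)
    return tokens


print(codon_tokens("CCATGGCTTAAGGT"))
```

Here the two bases before `ATG` are skipped, `GCT` maps to alanine (`A`), and `TAA` is replaced by `[SEP]`. Mapping each symbol to an integer id would be a further lookup step that depends on the vocabulary the package defines.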
🧠 Inspired by
- https://github.com/HazyResearch/hyena-dna/blob/main/src/dataloaders/datasets/hg38_char_tokenizer.py
- https://github.com/dariush-bahrami/character-tokenizer/blob/master/charactertokenizer/core.py
- The CanineTokenizer in the transformers package.
- Read this article for details on more elaborate tokenization strategies.
📚 Cite
@misc{genomic-tokenizer,
author = {Bell Raj Eapen},
title = {Genomic Tokenizer},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/dermatologist/genomic-tokenizer}},
}
Give us a star ⭐️
If you find this project useful, give us a star. It helps others discover the project.