Genomic Tokenizer

⛓️ Genomic Tokenizer

About

This is a tokenizer for DNA ⛓️ that aligns with the central dogma of molecular biology. You can use it to train genomic transformer models; see the BERT and GPT-2 models trained on the human genome. It has not been tested yet, but feel free to try it and improve it. Please cite it, or contact me, if you use it in your research.

🚀 Installation

pip install git+https://github.com/dermatologist/genomic-tokenizer.git

🔧 Example usage

from genomic_tokenizer import GenomicTokenizer
# A FASTA header, if present, is ignored.
fasta = """
AGGCGAGGCGCGGGCGGAGGCGGTGCGCGGGCGGAGGCGGGGCGCGGAGATGTGGCGGAGGTGGAGGCGG
AGGCGTAGCCGCCCCTGGGGACGTCATTGGTGGCGGAAGCAATCGCCGGCAACCAGCTGTAAGCGAGGTA
GGCTCACTCGGGCACGGAGGGTGCGGGTGAGAAAGGGAACGATTTGCTAGGAGTGTATGCGCCCGTGCTA
"""
model_max_length = 2048
tokenizer = GenomicTokenizer(model_max_length)
tokens = tokenizer(fasta)
print(tokens)

✨ Output

{'input_ids': [2, 7, 12, 17, 19, 16, 1, 7, 20, 6, 12, 21, 16, 12, 20, 12, 12, 8, 12, 1, 10, 20, 10, 20, 11, 7, 20, 21, 23, 8, 7, 20, 7, 6, 12, 21, 19, 10, 11, 16, 19, 7, 1, 22, 7, 1, 19, 21, 7, 16, 1, 21, 12, 23, 19, 12, 20, 6, 1],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

🔧 Tokenization algorithm

  • Identify the first occurrence of the start codon ATG.
  • Split the sequence into codons of length 3, starting from the start codon.
  • Convert synonymous codons to the same token.
  • Convert stop codons to the [SEP] token.
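
The steps above can be sketched in plain Python. This is a minimal illustration, not the package's actual implementation: the names `CODON_TABLE` and `codon_tokens` are hypothetical, the tokens are amino-acid letters rather than the integer IDs the tokenizer emits, and it assumes that reading continues in the same frame after a stop codon and that unrecognised codons map to `[UNK]`. It also does not strip FASTA headers.

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_tokens(sequence: str) -> list[str]:
    """Tokenize the reading frame that begins at the first ATG (illustrative only)."""
    # Join lines and normalise case so the codon lookup is uniform.
    seq = "".join(sequence.split()).upper()

    # Step 1: find the first start codon.
    start = seq.find("ATG")
    if start == -1:
        return []

    tokens = []
    # Step 2: walk the sequence in steps of 3; a trailing partial codon is dropped.
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3])
        if aa is None:
            tokens.append("[UNK]")   # assumption: ambiguous bases such as N
        elif aa == "*":
            tokens.append("[SEP]")   # Step 4: stop codons become [SEP]
        else:
            tokens.append(aa)        # Step 3: synonymous codons share one token
    return tokens


print(codon_tokens("ccATGGCTGCCTAAGG"))  # ['M', 'A', 'A', '[SEP]']
```

Here GCT and GCC both encode alanine, so they collapse to the same token, which is the point of the synonymous-codon step.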

🧠 Inspired by

📚 Cite

@misc{genomic-tokenizer,
  author = {Bell Raj Eapen},
  title = {Genomic Tokenizer},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/dermatologist/genomic-tokenizer}},
}

Give us a star ⭐️

If you find this project useful, give us a star. It helps others discover the project.
