A domain-specific Byte Pair Encoding tokenizer for medical and general NLP corpora

These details have not been verified by PyPI

Project description

Domain-Specific-BPE-Tokenizer

A from-scratch implementation of a scalable domain-specific Byte Pair Encoding (BPE) tokenizer for medical and general NLP corpora.

This project implements a complete byte-level BPE tokenizer pipeline without relying on external tokenizer frameworks. It includes scalable corpus collection, corpus cleaning, weighted BPE training, merge-rule optimization, serialization, evaluation tooling, benchmarking against GPT-2/tiktoken, automated testing, and PyPI-ready packaging.

The tokenizer was designed for large-scale medical-domain language modeling experiments and was later integrated into a custom GPT-style language model training pipeline.

Project Overview

This repository focuses on building and evaluating a domain-specific BPE tokenizer from scratch.

Core goals:

Implement byte-level BPE training from scratch
Support scalable weighted word-frequency training
Train tokenizers on medical and general corpora
Optimize merge-rule lookup and encoding speed
Provide serialization and checkpoint support
Benchmark against GPT-2/tiktoken
Evaluate fragmentation quality for biomedical terms
Create a reusable PyPI-ready tokenizer package

Key Features

Byte-level BPE tokenizer
Weighted word-frequency BPE training
Domain-specific tokenizer training
Medical corpus preprocessing pipeline
Incremental merge-rule learning
Optimized ranked merge lookup
Save/load tokenizer support
Checkpointed tokenizer training
Scalable corpus chunk processing
Special token support
Serialization utilities
PyPI-ready package structure
Automated unit tests
Tokenizer evaluation pipeline
GPT-2/tiktoken benchmarking

Repository Structure

Domain-Specific-BPE-Tokenizer/
├── domain_specific_bpe_tokenizer/
│   ├── __init__.py
│   ├── bpe_tokenizer.py
│   ├── trainer.py
│   ├── encoder.py
│   ├── decoder.py
│   ├── serialization.py
│   └── vocab.py
│
├── tests/
│   ├── test_bpe_tokenizer.py
│   ├── test_vocab.py
│   ├── test_trainer.py
│   ├── test_encoder.py
│   ├── test_decoder.py
│   ├── test_serialization.py
│   ├── test_save_load.py
│   ├── test_encode_decode.py
│   ├── test_special_tokens.py
│   └── test_invalid_inputs.py
│
├── examples/
│   ├── basic_usage.py
│   ├── load_pretrained_tokenizer.py
│   └── medical_tokenization_demo.py
│
├── scripts/
│   ├── prepare_medical_corpus.py
│   ├── collect_general_corpus.py
│   ├── collect_pubmed_corpus.py
│   ├── collect_pmc_open_corpus.py
│   ├── clean_corpus.py
│   ├── build_word_frequencies.py
│   ├── train_tokenizer.py
│   ├── tokenize_corpus.py
│   ├── evaluate_tokenizer.py
│   └── benchmark_against_tiktoken.py
│
├── resources/
│   ├── raw_corpus/
│   ├── cleaned_corpus/
│   ├── tokenized_corpus/
│   ├── trained_tokenizer/
│   └── evaluation/
│
├── docs/
│   ├── architecture.md
│   ├── evaluation.md
│   └── benchmarks.md
│
├── .github/
│   └── workflows/
│       └── tests.yml
│
├── pyproject.toml
├── requirements.txt
├── .gitignore
└── README.md

Installation

1. Clone the Repository

git clone https://github.com/adithya-prabhu-22/Domain-Specific-BPE-Tokenizer.git

cd Domain-Specific-BPE-Tokenizer

2. Install the Package

pip install -e .

3. Install Optional Benchmark Dependency

pip install tiktoken

Package Usage

Import the Tokenizer

from domain_specific_bpe_tokenizer import BPETokenizer

Train a Tokenizer

from domain_specific_bpe_tokenizer import BPETokenizer

tokenizer = BPETokenizer(
    vocab_size=5000,
    min_frequency=2,
)

text = "medical corpus text goes here"

tokenizer.train(text)

Encode Text

encoded = tokenizer.encode(
    "myocardial infarction"
)

print(encoded)

Decode Tokens

decoded = tokenizer.decode(encoded)

print(decoded)

Save Tokenizer

tokenizer.save(
    "resources/trained_tokenizer/bpe_tokenizer.json"
)

Load Tokenizer

loaded_tokenizer = BPETokenizer.load(
    "resources/trained_tokenizer/bpe_tokenizer.json"
)

Special Tokens

The tokenizer supports the following built-in special tokens:

Token	Purpose
`[UNK]`	Unknown token
`[PAD]`	Padding token
`[BOS]`	Beginning of sequence
`[EOS]`	End of sequence

Training Pipeline

The tokenizer training pipeline supports scalable corpus preprocessing.

Pipeline stages:

Corpus collection
Corpus cleaning
Word-frequency building
Weighted BPE training
Merge-rule optimization
Tokenization
Evaluation
Benchmarking

Corpus Sources

The project includes scripts for collecting:

General English corpora
PubMed abstracts
PMC Open Access biomedical papers

Example:

python -m scripts.collect_pubmed_corpus

Weighted BPE Training

The tokenizer supports weighted word-frequency BPE training for scalable preprocessing.

Features:

Frequency-based merge learning
Rare-word filtering
Checkpoint saving
Large-corpus subset selection
Optimized pair-frequency updates

Example:

python -m scripts.train_tokenizer

Tokenizer Evaluation

The repository includes tokenizer evaluation tooling.

Evaluation metrics include:

Compression ratio
Average characters per token
Unknown token rate
Medical term fragmentation
Token count statistics

Run evaluation:

python -m scripts.evaluate_tokenizer

Benchmark Against GPT-2/tiktoken

The repository includes benchmarking against GPT-2/tiktoken.

Comparison metrics:

Token count
Compression ratio
Medical-term fragmentation
Domain-token efficiency

Run benchmark:

python -m scripts.benchmark_against_tiktoken

Example Benchmark Result

Medical Term Fragmentation

Medical Term	Custom BPE Tokens	GPT-2 Tokens
electrocardiogram	1	5
pneumothorax	1	5
myocardial infarction	3	6
hepatocellular carcinoma	3	7

The custom medical-domain tokenizer significantly reduces fragmentation for biomedical terminology compared with GPT-2/tiktoken.

Tokenizer Training Scale

The tokenizer training pipeline evolved across multiple scalability stages:

Stage	Vocabulary Size	Corpus Scale
Initial Prototype	500	Small sample corpus
Intermediate Training	24K	Multi-million token corpus
Final Biomedical Tokenizer	52K	~240M-token medical corpus

Optimization Features

The tokenizer includes several performance optimizations:

Ranked merge lookup
Pair-frequency indexing
Safe affected-word updates
Chunk-based corpus processing
Cached frequency-table loading
Incremental merge updates

These optimizations significantly improved scalability during large-corpus tokenizer training.

Testing

The repository includes a comprehensive automated test suite.

Current coverage includes:

Tokenizer initialization
Encoding
Decoding
Save/load serialization
Special tokens
Invalid input handling
End-to-end tokenizer consistency

Run tests:

pytest

Current Status

26 tests passed

PyPI Packaging

The repository is structured as a reusable Python package using pyproject.toml.

Editable installation:

pip install -e .

The public API is exposed through:

from domain_specific_bpe_tokenizer import BPETokenizer

Documentation

Additional documentation is available in:

docs/

Including:

tokenizer architecture
evaluation methodology
benchmark reports

Integration with LLM Training

This tokenizer was later integrated into a custom GPT-style medical language model training pipeline.

The tokenizer was used for:

domain-specific tokenization
autoregressive GPT training
tokenizer-quality experiments
KV-cache benchmarking studies

Future Work

Future improvements include:

Rust implementation for faster preprocessing
Parallel BPE training
Memory-efficient billion-token preprocessing
Streaming tokenizer training
Unicode normalization improvements
Faster serialization formats
Hugging Face tokenizer interoperability
Tokenizer visualization tools
Vocabulary pruning experiments

Conclusion

This project demonstrates a complete from-scratch implementation of a scalable domain-specific BPE tokenizer.

The repository evolved from a simple educational tokenizer into a scalable biomedical NLP preprocessing pipeline featuring weighted BPE training, optimized merge learning, tokenizer evaluation, benchmarking against GPT-2/tiktoken, automated testing, and PyPI-ready packaging.

The strongest results appeared in medical-domain tokenization, where the tokenizer significantly reduced fragmentation for biomedical terminology compared with GPT-2/tiktoken.

Overall, the project demonstrates the importance of tokenizer design, merge-rule optimization, scalable preprocessing, and domain-specific vocabulary construction for modern NLP and LLM systems.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0

May 30, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

adithya_domain_specific_bpe_tokenizer-0.1.0.tar.gz (11.3 kB view details)

Uploaded May 30, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

adithya_domain_specific_bpe_tokenizer-0.1.0-py3-none-any.whl (10.9 kB view details)

Uploaded May 30, 2026 Python 3

File details

Details for the file adithya_domain_specific_bpe_tokenizer-0.1.0.tar.gz.

File metadata

Download URL: adithya_domain_specific_bpe_tokenizer-0.1.0.tar.gz
Upload date: May 30, 2026
Size: 11.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for adithya_domain_specific_bpe_tokenizer-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ff5957be28dcdfce09c052401d262a0b6dab99e05c7ed034379454e55a1fdca0`
MD5	`66f036677b93cd5534e65ba2fd6151f7`
BLAKE2b-256	`8d04b2b027ac23a2c1c5f939224bba9c72fc14d1c39cc0fadc293c034804c9df`

See more details on using hashes here.

File details

Details for the file adithya_domain_specific_bpe_tokenizer-0.1.0-py3-none-any.whl.

File metadata

Download URL: adithya_domain_specific_bpe_tokenizer-0.1.0-py3-none-any.whl
Upload date: May 30, 2026
Size: 10.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for adithya_domain_specific_bpe_tokenizer-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dee1f73e9633f04377d041d63e1f301a3dca0e3b37b1680b4165c6675fe1e91c`
MD5	`4ccc791a4921c692725b196f53c6abf6`
BLAKE2b-256	`91c796cfb20115a48ab99ce90311443087675acf901f6f668479ed9059b920a3`

See more details on using hashes here.

adithya-domain-specific-bpe-tokenizer 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Domain-Specific-BPE-Tokenizer

Project Overview

Key Features

Repository Structure

Installation

1. Clone the Repository

2. Install the Package

3. Install Optional Benchmark Dependency

Package Usage

Import the Tokenizer

Train a Tokenizer

Encode Text

Decode Tokens

Save Tokenizer

Load Tokenizer

Special Tokens

Training Pipeline

Corpus Sources

Weighted BPE Training

Tokenizer Evaluation

Benchmark Against GPT-2/tiktoken

Example Benchmark Result

Medical Term Fragmentation

Tokenizer Training Scale

Optimization Features

Testing

Current Status

PyPI Packaging

Documentation

Integration with LLM Training

Future Work

Conclusion

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes