Custom Byte-Level BPE Tokenizer built from scratch in Python
Sujit-Tokenizer
A custom Byte-Level BPE (Byte Pair Encoding) Tokenizer implemented from scratch in Python.
Features
- Byte-level tokenization
- Custom BPE training
- Text encoding and decoding
- Save and load tokenizer models
- UTF-8 support
- Lightweight and dependency-free
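Byte-level tokenization and UTF-8 support go together: any string can be turned into byte values between 0 and 255, so no character falls outside the base vocabulary. The sketch below shows that idea with the standard library only. The helper names `to_byte_ids` and `from_byte_ids` are hypothetical and are not part of this package.

```python
def to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values (0-255) that make up `text`."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Rebuild the original string from its UTF-8 byte values."""
    return bytes(ids).decode("utf-8")


print(to_byte_ids("AI"))   # [65, 73]
print(to_byte_ids("é"))    # [195, 169]  (one character, two bytes)
print(from_byte_ids(to_byte_ids("naïve 🙂")))
```

Because every input reduces to the same 256 base symbols, a byte-level tokenizer can always encode unseen characters, at the cost of longer sequences for non-ASCII text.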
Installation
Clone Repository
git clone https://github.com/your-username/Sujit-Tokenizer.git
cd Sujit-Tokenizer
Install Package
pip install -e .
Project Structure
Sujit-Tokenizer/
│
├── Sujit_Tokenizer/
│   ├── __init__.py
│   └── tokenizer.py
│
├── corpus.txt
├── train_and_test.py
│
├── README.md
├── setup.py
├── pyproject.toml
└── LICENSE
Quick Start
Import Tokenizer
from Sujit_Tokenizer import CustomByteLevelBPETokenizer
Train Tokenizer
corpus = [
    "Transformers are amazing.",
    "Machine Learning is powerful.",
    "Python is widely used in AI.",
]

tokenizer = CustomByteLevelBPETokenizer(vocab_size=1000)
tokenizer.train(corpus)
Save Model
tokenizer.save_model("tokenizer.model")
Load Model
tokenizer = CustomByteLevelBPETokenizer()
tokenizer.load_model("tokenizer.model")
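The on-disk format of `tokenizer.model` is not described here. As an illustration only, a BPE model can be persisted by writing its ordered merge list to JSON, since the merges are enough to rebuild the vocabulary. The functions `save_merges` and `load_merges` and the JSON layout below are assumptions made for this sketch, not the format this package uses.

```python
import json
import os
import tempfile


def save_merges(path, merges):
    """Write merges as a JSON list of [left_id, right_id, new_id] triples."""
    payload = {"merges": [[a, b, new_id] for (a, b), new_id in merges]}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)


def load_merges(path):
    """Read the triples back into ((left_id, right_id), new_id) tuples."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return [((a, b), new_id) for a, b, new_id in payload["merges"]]


# Round trip through a temporary file.
example_merges = [((97, 98), 256), ((256, 99), 257)]
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "merges.json")
    save_merges(model_path, example_merges)
    restored = load_merges(model_path)

print(restored == example_merges)  # True
```

Keeping the merges in training order matters, because encoding replays them in that same order.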
Encode Text
encoded = tokenizer.encode("Transformers use attention.")
print(encoded)
Example Output:
[2, 451, 723, 812, 3]
Decode Text
decoded = tokenizer.decode(encoded)
print(decoded)
Output:
Transformers use attention.
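To make the encode and decode steps concrete, here is a minimal sketch of how a byte-level BPE tokenizer can apply learned merges and then reverse them. It is not this package's implementation: the tiny hand-written `MERGES` table and the functions `merge_pair`, `build_vocab`, `encode`, and `decode` are assumptions for illustration, and the sketch adds no special tokens.

```python
def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def build_vocab(merges):
    """Map every token ID to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def encode(text, merges):
    """Convert text to bytes, then replay the merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, vocab):
    """Concatenate the bytes behind each ID and decode them as UTF-8."""
    data = b"".join(vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")


# Hand-written merges: "a"+"b" becomes 256, then 256+"c" becomes 257.
MERGES = [((97, 98), 256), ((256, 99), 257)]
VOCAB = build_vocab(MERGES)

ids = encode("abcab", MERGES)
print(ids)                 # [257, 256]
print(decode(ids, VOCAB))  # abcab
```

Decoding never needs the merge order, only the ID-to-bytes table, which is why the round trip is lossless for any valid UTF-8 input.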
Training Workflow
Corpus
↓
Byte Conversion
↓
Pair Frequency Counting
↓
BPE Merging
↓
Vocabulary Construction
↓
Tokenizer Model
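The workflow above can be sketched in a few lines of plain Python. This is a minimal illustration of byte-level BPE training under simple assumptions (no pre-tokenization, no special tokens, ties resolved by first occurrence), not the code in `tokenizer.py`. The names `train_bpe`, `merge_pair`, and `num_merges` are hypothetical.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    # Byte conversion: each text becomes a list of UTF-8 byte values.
    sequences = [list(text.encode("utf-8")) for text in corpus]
    # Base vocabulary: one entry per possible byte value.
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    for _ in range(num_merges):
        # Pair frequency counting over all sequences.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # BPE merging: fuse the most frequent adjacent pair.
        best = max(pair_counts, key=pair_counts.get)
        new_id = len(vocab)
        # Vocabulary construction: the new token is the two parts joined.
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges.append((best, new_id))
        sequences = [merge_pair(seq, best, new_id) for seq in sequences]
    return merges, vocab


merges, vocab = train_bpe(["Machine Learning", "Deep Learning"], num_merges=10)
print(len(vocab))  # 266: 256 byte tokens plus 10 learned merges
```

With a target such as `vocab_size=1000`, the number of merges in a scheme like this would be the target size minus the base tokens, assuming the corpus is large enough to supply that many distinct pairs.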
Special Tokens
| Token | ID |
|---|---|
| | 0 |
| | 1 |
| | 2 |
| | 3 |
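The table reserves IDs 0 to 3, and the example output in Quick Start begins with 2 and ends with 3. One common way to make room for reserved IDs is to shift every byte value past them and wrap each sequence in start and end markers. The sketch below shows that layout. It is an assumption, not a description of this package: the names `START_ID`, `END_ID`, and `BYTE_OFFSET`, and the roles given to IDs 2 and 3, are guesses made for illustration.

```python
RESERVED_IDS = (0, 1, 2, 3)      # the four IDs listed in the table
START_ID, END_ID = 2, 3          # assumed roles, inferred from the example output
BYTE_OFFSET = len(RESERVED_IDS)  # assumed: byte value b is stored as b + 4


def bytes_to_ids(text):
    """Map UTF-8 bytes to IDs that do not collide with the reserved range."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")]


def add_markers(ids):
    """Wrap a sequence in the assumed start and end markers."""
    return [START_ID, *ids, END_ID]


def ids_to_text(ids):
    """Drop reserved IDs and undo the offset (unmerged byte IDs only)."""
    payload = bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET)
    return payload.decode("utf-8")


wrapped = add_markers(bytes_to_ids("A"))
print(wrapped)               # [2, 69, 3]
print(ids_to_text(wrapped))  # A
```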
Example
from Sujit_Tokenizer import CustomByteLevelBPETokenizer

# Train on a small in-memory corpus.
tokenizer = CustomByteLevelBPETokenizer(vocab_size=1000)
corpus = [
    "Artificial Intelligence",
    "Machine Learning",
    "Deep Learning",
]
tokenizer.train(corpus)

# Save the trained model, then load it back.
tokenizer.save_model("tokenizer.model")
tokenizer.load_model("tokenizer.model")

# Encode a string to token IDs and decode it back to text.
text = "Machine Learning"
encoded = tokenizer.encode(text)
print(encoded)
decoded = tokenizer.decode(encoded)
print(decoded)
Use Cases
- NLP experiments
- Tokenization research
- Educational projects
- Understanding BPE internals
- Building custom language models
- Learning tokenizer architecture
Future Enhancements
- Faster BPE training
- WordPiece tokenizer
- SentencePiece tokenizer
- Parallel processing
- Vocabulary statistics
- Token frequency analysis
- Hugging Face compatibility
Author
Sujit Maity
License
MIT License
Download files
Source Distribution
sujit_tokenizer-1.0.0.tar.gz (4.7 kB)
Built Distribution
sujit_tokenizer-1.0.0-py3-none-any.whl (4.7 kB)
File details
Details for the file sujit_tokenizer-1.0.0.tar.gz.
File metadata
- Download URL: sujit_tokenizer-1.0.0.tar.gz
- Upload date:
- Size: 4.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dcd81ac43fc4fcf5408fc175ad4ccc52ec410b2c024690606ed0e8e2ca7f4f39 |
| MD5 | 6672f67f7ef3f566b9ffcbbdeafdfe09 |
| BLAKE2b-256 | d87ce23487b11bed6edf2b271620bf5c9cfaf55f1a96e72ffa5cbe4cb6af71a6 |
File details
Details for the file sujit_tokenizer-1.0.0-py3-none-any.whl.
File metadata
- Download URL: sujit_tokenizer-1.0.0-py3-none-any.whl
- Upload date:
- Size: 4.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d8ee42af4ed35c0e9f6eb1ed8fb34921dbde0712081c45d96693fe718d85e74c |
| MD5 | ba446678e1d1ac0576f770f85067b40c |
| BLAKE2b-256 | 228fb3ed225413aee5ce804896ab5965eb60e92ca4251bc5a20064ef4887c071 |