A GPT-4 compatible Byte Pair Encoding (BPE) tokenizer.
SmolBPE
Overview
SmolBPE is a lightweight and efficient Byte Pair Encoding (BPE) tokenizer designed for deep learning applications and large language models (LLMs) such as GPT-4. It provides a simple interface to tokenize textual data, facilitating better handling of out-of-vocabulary words and improving the performance of language models.
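The core of BPE is simple to state: start from a base alphabet (typically the 256 byte values), repeatedly count adjacent token pairs in the corpus, and merge the most frequent pair into a new token. A minimal, self-contained sketch of this training loop in plain Python (illustrative only, not SmolBPE's actual implementation):

```python
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes and perform a few merges.
ids = list("aaabdaaabac".encode("utf-8"))
merges = {}
for step in range(3):
    counts = get_pair_counts(ids)
    pair = max(counts, key=counts.get)   # most frequent adjacent pair
    new_id = 256 + step                  # new ids start after the 256 byte values
    ids = merge_pair(ids, pair, new_id)
    merges[pair] = new_id
```

Each merge shortens the sequence and grows the vocabulary by one; training stops once the vocabulary reaches the requested size.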
Features
- Efficient Tokenization: Implements the BPE algorithm for effective subword tokenization.
- Customizable Vocabulary Size: Allows you to specify the desired vocabulary size according to your needs.
- Unicode Support: Handles a wide range of characters, including Unicode characters, enabling multilingual tokenization.
- Easy Integration: Designed for seamless integration with existing Python projects and NLP pipelines.
- Command-Line Interface: Provides a CLI tool for training and using the tokenizer without writing additional code.
- Open Source: Licensed under the MIT License, promoting openness and collaboration.
Installation
You can install SmolBPE using pip:
pip install smolbpe
Alternatively, you can install it directly from the source code:
git clone https://github.com/T4ras123/SmolBPE.git
cd SmolBPE
pip install .
Quick Start Guide
Using the Tokenizer in Python
1. Importing the Tokenizer
from smolbpe.tokenizer import Tokenizer
2. Initializing the Tokenizer
tokenizer = Tokenizer()
You can specify a custom path for the saved vocabulary file and a custom regex pattern if needed:
tokenizer = Tokenizer(output='vocab.json', pattern=r"\p{L}+|\p{Z}+|\p{N}+|[\p{P}&&[^.]]")
3. Training the Tokenizer
Train the tokenizer on your dataset to build the vocabulary and merge rules:
with open("path_to_your_data", "r", encoding="utf-8") as f:
    text = f.read()
tokenizer.train(text, vocab_size=400)
4. Encoding Text
Convert text into a list of token IDs:
encoded_tokens = tokenizer.encode("Tokenizing isn't real")
print(encoded_tokens)
5. Decoding Tokens
Convert token IDs back into human-readable text:
decoded_text = tokenizer.decode(encoded_tokens)
print(decoded_text)
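Decoding inverts the learned merges: each token id above the byte range expands back into the pair it was merged from, until only raw bytes remain, which are then UTF-8 decoded. A self-contained sketch of this inversion (illustrative; SmolBPE's internals may differ, and the merge table below is hypothetical):

```python
def expand(token_id, merges_rev):
    """Recursively expand a token id into its underlying byte values."""
    if token_id < 256:
        return [token_id]           # base case: already a raw byte
    left, right = merges_rev[token_id]
    return expand(left, merges_rev) + expand(right, merges_rev)

# Hypothetical merge table: 256 = ('h', 'e'), 257 = (256, 'y')
merges_rev = {256: (104, 101), 257: (256, 121)}
ids = [257]
raw = bytes(b for t in ids for b in expand(t, merges_rev))
text = raw.decode("utf-8")          # the single token 257 expands to "hey"
```

Because every token bottoms out in raw bytes, decoding an encoded sequence always reproduces the original text exactly.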
Command-Line Interface
SmolBPE provides a command-line interface for easy tokenization tasks.
Training the Tokenizer
tokenizer --text smth.txt --vocab_size 400 --output vocab.json
Advanced Usage
Loading a Pre-trained Vocabulary
If you have a pre-trained vocabulary and merges file, you can load them directly:
tokenizer = Tokenizer()
tokenizer.load_vocab('vocab.json')
Custom Regex Pattern
Customize the tokenization by providing a different regex pattern:
custom_pattern = r"\w+|\s+|[^\s\w]+"
tokenizer = Tokenizer(pattern=custom_pattern)
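The pattern pre-splits text into chunks before BPE runs, so merges never cross word, whitespace, or punctuation boundaries. This particular pattern uses only standard syntax, so you can preview its effect with Python's built-in re module:

```python
import re

custom_pattern = r"\w+|\s+|[^\s\w]+"
chunks = re.findall(custom_pattern, "Hello, world!")
# Words, whitespace runs, and punctuation runs become separate chunks:
# ['Hello', ',', ' ', 'world', '!']
```

Note that the default GPT-style pattern shown earlier uses `\p{...}` Unicode property classes, which require the third-party regex package rather than the standard re module.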
Custom special tokens
Add custom special tokens that appear in your dataset:
special_tokens = ['<|start_text|>', '<|good_luck|>']
tokenizer = Tokenizer(special_tokens=special_tokens)
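Special tokens are typically split out of the text before byte-level tokenization so that each one maps to a single reserved id instead of being broken into pieces. A sketch of the common approach (not necessarily SmolBPE's exact mechanism):

```python
import re

special_tokens = ['<|start_text|>', '<|good_luck|>']

# A capturing group makes re.split keep the special tokens in the output.
special_pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"

parts = [p for p in re.split(special_pattern, "<|start_text|>hello world") if p]
# -> ['<|start_text|>', 'hello world']: the special chunk gets one reserved
# id, while the ordinary text goes through the usual BPE pipeline.
```

`re.escape` matters here: characters like `|` in the token strings would otherwise be interpreted as regex syntax.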
Project Structure
SmolBPE/
├── smolbpe/
│ ├── __init__.py
│ └── tokenizer.py
├── LICENSE
├── MANIFEST.in
├── README.md
└── setup.py
Contributing
Contributions are welcome! To contribute:
- Fork the repository on GitHub.
- Create a new branch for your feature or bug fix.
- Commit your changes with descriptive commit messages.
- Push your branch to your forked repository.
- Open a pull request on the main repository.
Please ensure your code adheres to the project's coding standards and includes appropriate tests.
License
This project is licensed under the MIT License. You are free to use, modify, and distribute this software in accordance with the license.
Contact
For any inquiries or feedback, please contact the author:
- Author: Vover
- Email: vovatara123@gmail.com
- GitHub: T4ras123
Acknowledgments
- Inspired by tokenization techniques used in GPT models.
- Special thanks to the open-source community for continuous support.
Happy tokenizing with SmolBPE!