High performance BPE tokenizer written in Rust with Python bindings
MayaTok
MayaTok is a Byte-Pair Encoding (BPE) tokenizer written in Rust, built with performance and extensibility in mind. I started this project to study how Byte-Pair Encoding works.
Version: 2.1.4
⚡️ Features (more optimizations in progress)
- Multithreaded training for fast vocab generation
- Persistent merges
- Checkpoint saving
- Focus on raw speed, built for performance benchmarking
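For readers new to BPE, here is a minimal pure-Python sketch of the idea behind training and applying merges. It is a single-threaded toy for illustration only and is not MayaTok's Rust implementation; the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical and are not part of the `mayatok` API.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text`, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}                       # (left_id, right_id) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                     # no pair repeats, so nothing is worth merging
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts keep insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back into bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Saving the `merges` mapping to disk is, roughly, what makes merges "persistent": the same mapping can be reloaded later to encode and decode without retraining.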
🚀 Installation
Prerequisites
- Rust (required)
- Python 3.9+ (for Python bindings)
PIP Installation
pip install mayatok
From Source
git clone https://github.com/AlgoBrother/MayaTok-BPE.git
cd MayaTok-BPE
Use maturin for building wheels.
pip install maturin
maturin build --release
pip install target/wheels/*.whl
Quick Start
Using with Python
To use MayaTok with Python:
import mayatok as bpe
my_tokenizer = bpe.get_tokenizer("v2-100k") # or 'mayatok-base' if you wish to use v1 tokenizer
test = "Hello, world!"
tokens = my_tokenizer.encode(test)
print(tokens)
decoded_text = my_tokenizer.decode(tokens)
print(decoded_text)
Output of the sample code above
[11608, 77, 3641, 62]
Hello, world!
If you want to create your own Vocab
If you are using HuggingFace Datasets, refer to this for creating your own vocab.
If your dataset is on your local machine
Make sure you have forked or cloned the Rust tokenizer code and built the wheels in target/wheels as described in the previous steps.
stream method - Use this if you have a large dataset and want to stream it in chunks so that you do not overload your machine.
non-stream method - Use this if the whole dataset fits in RAM once loaded; training is much faster this way.
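The difference between the two methods can be sketched with the standard library alone. This is only an illustration of the streaming versus in-memory trade-off, not MayaTok's training API; `iter_chunks`, `count_words_streaming`, and `count_words_in_memory` are hypothetical helpers, and word counting stands in for whatever statistics a real trainer collects.

```python
from collections import Counter


def iter_chunks(path, chunk_bytes=1 << 20):
    """Yield the file as text pieces of roughly `chunk_bytes`, split on line boundaries."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # The argument is a size hint: reading stops once about that much is collected.
            lines = f.readlines(chunk_bytes)
            if not lines:
                break
            yield "".join(lines)


def count_words_streaming(path, chunk_bytes=1 << 20):
    """Stream variant: only one chunk is held in memory at a time."""
    counts = Counter()
    for chunk in iter_chunks(path, chunk_bytes):
        counts.update(chunk.split())
    return counts


def count_words_in_memory(path):
    """Non-stream variant: load the whole file at once (faster, needs enough RAM)."""
    with open(path, "r", encoding="utf-8") as f:
        return Counter(f.read().split())
```

Both variants produce the same counts; the streaming one trades some speed for a bounded memory footprint.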
📈 Benchmarks
Batch Encoding
| Tokenizer | Tokens/sec | Avg Compression Ratio |
|---|---|---|
| MayaTok-BPE | 6,757,698 | 2.75 |
| tiktoken-cl100k | 262,016 | 3.36 |
| tiktoken-p50k | 288,657 | 3.27 |
| GPT2 | 1,940,899 | 2.94 |
| Falcon-7B | 1,554,393 | 3.26 |
Normal Encoding
| Tokenizer | Tokens/sec | Compression Ratio |
|---|---|---|
| MayaTok | 1,249,426 | 2.75 |
| tiktoken-cl100k | 2,318,683 | 3.36 |
| tiktoken-p50k | 2,670,190 | 3.27 |
| GPT2 | 519,369 | 2.94 |
| Falcon-7B | 346,040 | 3.26 |
Note: Performance optimizations are ongoing, and these numbers may change because I am applying a new benchmark method.
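As a rough guide to how numbers like those in the tables above can be measured, here is a small sketch. It assumes that compression ratio means UTF-8 bytes per token, which this README does not state, and `benchmark` is a hypothetical helper, not the script used to produce the tables. It accepts any `encode(str) -> list` callable, so it can wrap `my_tokenizer.encode` or another tokenizer.

```python
import time


def benchmark(encode, texts):
    """Return (tokens_per_sec, compression_ratio) for an `encode(str) -> list` callable.

    Assumption: compression ratio = total UTF-8 bytes / total tokens.
    """
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)  # measured outside the timer

    start = time.perf_counter()
    total_tokens = sum(len(encode(t)) for t in texts)
    elapsed = max(time.perf_counter() - start, 1e-12)         # guard against a zero interval

    tokens_per_sec = total_tokens / elapsed
    compression_ratio = total_bytes / total_tokens if total_tokens else 0.0
    return tokens_per_sec, compression_ratio
```

For example, `benchmark(my_tokenizer.encode, ["Hello, world!"] * 1000)` would time the tokenizer from the Quick Start. Results depend heavily on hardware, text, and batch size, so compare tokenizers only on the same machine and corpus.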
💽 Corpus Used for V2
cosmopedia-v2
c4-english
wikipedia
openwebtext
github-top-code
arxiv-papers
Check dataset_training/train.py for more details
🙌 Contributing
Pull requests and suggestions are welcome! Feel free to open issues for bugs, feature requests, or optimizations.
📄 License
Apache-2.0
Future Targets [COMPLETED]
- [✓] PyPI package distribution
- [✓] The examples folder has a lot of Python implementation. I will experiment with moving this into the Rust side of the code to make the Python side smaller and easier for users.
- [✓] Enhanced CPU utilization and faster merging algorithms
- [✓] Improved merge quality and compression ratios
- [✓] Bincode support for faster model loading
- [✓] New Line Format support