High performance BPE tokenizer written in Rust with Python bindings

MayaTok

MayaTok is a Byte-Pair Encoding (BPE) tokenizer written in Rust, built with performance and extensibility in mind. I made this project because I wanted to study how Byte-Pair Encoding works.

Version: V2

⚡️ Features (more optimizations in progress)

  • Multithreaded training for fast vocab generation

  • Persistent merges

  • Checkpoint saving

  • Focus on raw speed — built for performance benchmarking
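For readers who, like me, want to see how merges are learned, here is a toy, single-threaded sketch of BPE training in pure Python. It is not MayaTok's Rust implementation; the function names (`learn_bpe_merges`, `merge_pair`, `decode_ids`) are hypothetical and exist only for this illustration.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def learn_bpe_merges(text, num_merges):
    """Learn up to `num_merges` merges from `text`, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}      # (left_id, right_id) -> new_id, in the order learned
    next_id = 256    # ids 0..255 are reserved for single bytes
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break    # no pair repeats, so another merge would not shorten the text
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges, ids


def decode_ids(ids, merges):
    """Rebuild the text by expanding merged ids back into bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dicts keep insertion order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


if __name__ == "__main__":
    merges, ids = learn_bpe_merges("aaabdaaabac", 3)
    print(merges)
    print(ids)
    print(decode_ids(ids, merges))
```

Saving the `merges` mapping is what makes a trained vocabulary reusable, since encoding and decoding only need the merge list in the order it was learned.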

🚀 Installation

Prerequisites

  • Rust (required)
  • Python 3.7+ (for Python bindings)

PIP Installation

pip install mayatok

From Source

git clone https://github.com/AlgoBrother/MayaTok-BPE.git
cd MayaTok-BPE

Use maturin to build the wheel:

pip install maturin
maturin build --release
pip install target/wheels/*.whl

Quick Start

Using with Python

To use MayaTok with Python:

import mayatok as bpe

# Load the v2 tokenizer (use "mayatok-base" instead for the v1 tokenizer)
my_tokenizer = bpe.get_tokenizer("v2-100k")
test = "Hello, world!"
# Encode the text into token ids
tokens = my_tokenizer.encode(test)
print(tokens)
# Decode the token ids back into text
decoded_text = my_tokenizer.decode(tokens)
print(decoded_text)

Output of the sample code above:

[11608, 77, 3641, 62]
Hello, world!
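A quick way to check a tokenizer on your own text is a round-trip test: encode each sample, decode it, and compare with the original. The sketch below works with any object that has `encode` and `decode` methods like the ones used above. `ByteTokenizer` and `failed_round_trips` are hypothetical names, and the stand-in class is there only so the example runs without MayaTok installed.

```python
class ByteTokenizer:
    """Stand-in tokenizer (not part of MayaTok): one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def failed_round_trips(tokenizer, samples):
    """Return the samples for which decode(encode(s)) differs from s."""
    return [s for s in samples if tokenizer.decode(tokenizer.encode(s)) != s]


if __name__ == "__main__":
    samples = ["Hello, world!", "naïve café", ""]
    print(failed_round_trips(ByteTokenizer(), samples))
```

With MayaTok installed, passing `bpe.get_tokenizer("v2-100k")` in place of `ByteTokenizer()` should run the same check against the real tokenizer.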

If you want to create your own vocab

If you are using HuggingFace Datasets, refer to this for creating your own vocab.

If your dataset is on your local machine

Make sure you have cloned the Rust tokenizer code and built the wheel in target/wheels, as described in the From Source steps above.

stream method - Use this if you have a large dataset and want to stream it in chunks so that it does not overload your machine.

non-stream method - Use this if your dataset fits in RAM once loaded. Training is much faster this way.
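The difference between the two approaches can be sketched in plain Python. This is only an illustration of streaming versus loading a local text file, not the code used by MayaTok's training scripts. `iter_text_chunks`, `load_text`, and the `chunk_chars` parameter are hypothetical names.

```python
from pathlib import Path
from typing import Iterator


def iter_text_chunks(path, chunk_chars: int = 1 << 20) -> Iterator[str]:
    """Stream: yield the file as chunks of at most `chunk_chars` characters."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)  # text mode counts characters, not bytes
            if not chunk:
                break
            yield chunk


def load_text(path) -> str:
    """Non-stream: read the whole file into memory at once."""
    return Path(path).read_text(encoding="utf-8")


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "corpus.txt")
        Path(sample).write_text("Hello, world!\n" * 3, encoding="utf-8")
        # Peak memory for streaming is bounded by the chunk size.
        for chunk in iter_text_chunks(sample, chunk_chars=16):
            print(len(chunk))
        # Loading everything keeps the full corpus in RAM.
        print(len(load_text(sample)))
```

Note that a fixed-size chunk can end in the middle of a word. A real trainer would likely split on line or whitespace boundaries so that pair counts near chunk edges stay accurate.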

📈 Benchmarks

Batch Encoding

Tokenizer          Tokens/sec    Avg Compression Ratio
MayaTok-BPE        7,306,114     2.75
tiktoken-cl100k      262,016     3.36
tiktoken-p50k        288,657     3.27
GPT2               1,227,199     2.94
Falcon-7B            946,393     3.26

Normal Encoding

Tokenizer          Tokens/sec    Compression Ratio
MayaTok            1,181,709     2.75
tiktoken-cl100k    1,184,446     3.36
tiktoken-p50k      1,591,801     3.27
GPT2                 252,369     2.94
Falcon-7B            172,114     3.26

Note: Performance optimizations are ongoing, and these numbers may change because I am switching to a new benchmark method.
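For anyone who wants to collect comparable numbers on their own data, here is a minimal sketch of how the two metrics could be measured. It is not the benchmark script behind the tables above. It assumes that compression ratio means input characters per output token, which this README does not define, and `benchmark_encode` is a hypothetical helper name.

```python
import time
from typing import Callable, Dict, List, Sequence


def benchmark_encode(
    encode: Callable[[str], Sequence[int]], texts: List[str]
) -> Dict[str, float]:
    """Time `encode` over `texts`; report token count, throughput, and ratio."""
    total_tokens = 0
    start = time.perf_counter()
    for text in texts:
        total_tokens += len(encode(text))
    elapsed = time.perf_counter() - start

    total_chars = sum(len(text) for text in texts)
    return {
        "tokens": total_tokens,
        "tokens_per_sec": total_tokens / elapsed if elapsed > 0 else float("inf"),
        # Assumption: ratio = characters per token (higher means fewer tokens).
        "compression_ratio": total_chars / total_tokens if total_tokens else 0.0,
    }


if __name__ == "__main__":
    # Stand-in encoder: one token per UTF-8 byte.
    stats = benchmark_encode(lambda s: list(s.encode("utf-8")), ["Hello, world!"] * 1000)
    print(stats)
```

To benchmark MayaTok itself, pass `my_tokenizer.encode` from the quick start as the `encode` argument. Running several passes and discarding the first helps reduce warm-up noise.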

💽 Corpus Used for V2

cosmopedia-v2

c4-english

wikipedia

openwebtext

github-top-code

arxiv-papers

See dataset_training/train.py for more details.

🙌 Contributing

Pull requests and suggestions are welcome! Feel free to open issues for bugs, feature requests, or optimizations.

📄 License

Apache-2.0

Future Targets [COMPLETED]

  • [✓] PyPI package distribution

  • [✓] The examples folder contains a lot of Python implementation code. I will experiment with moving it to the Rust side so that the Python side becomes smaller and easier for users.

  • [✓] Enhanced CPU utilization and faster merging algorithms

  • [✓] Improved merge quality and compression ratios

  • [✓] Bincode support for faster model loading

  • [✓] New Line Format support
