High performance BPE tokenizer written in Rust with Python bindings

MayaTok

MayaTok is a Byte-Pair Encoding (BPE) tokenizer written in Rust, built with performance and extensibility in mind. I started this project because I wanted to study how Byte-Pair Encoding works.

Version: V2

⚡️ Features (More optimizations in Progress)

  • Multithreaded training for fast vocab generation

  • Persistent merges

  • Checkpoint saving

  • Focus on raw speed — built for performance benchmarking
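For readers new to BPE, the training loop and the idea of persistent merges can be sketched in a few lines of pure Python. This is a minimal, single-threaded illustration and not MayaTok's Rust implementation: the function names (`train_bpe`, `save_merges`, `load_merges`), the JSON file format, and the tie-breaking rule are all assumptions made for the example.

```python
import json
import os
import tempfile
from collections import Counter


def count_pairs(seqs):
    """Count adjacent token-id pairs across all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(texts, num_merges):
    """Learn up to `num_merges` merges over UTF-8 bytes.

    Returns a list of (left_id, right_id, new_id) tuples in learned order.
    """
    seqs = [list(t.encode("utf-8")) for t in texts]
    merges = []
    for step in range(num_merges):
        counts = count_pairs(seqs)
        if not counts:
            break
        # Pick the most frequent pair; break ties by the smaller pair
        # so the result is deterministic (an assumption of this sketch).
        pair = min(counts, key=lambda p: (-counts[p], p))
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
        merges.append((pair[0], pair[1], new_id))
    return merges


def save_merges(merges, path):
    """Persist the merge list so training does not have to be repeated."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(merges, f)


def load_merges(path):
    """Load a merge list written by `save_merges`."""
    with open(path, encoding="utf-8") as f:
        return [tuple(m) for m in json.load(f)]


merges = train_bpe(["aaabdaaabac"], num_merges=3)
print(merges)  # [(97, 97, 256), (97, 98, 257), (256, 257, 258)]

# Round-trip the merges through a file (hypothetical file name).
merges_path = os.path.join(tempfile.mkdtemp(), "merges.json")
save_merges(merges, merges_path)
assert load_merges(merges_path) == merges
```

Saving the merge list after each batch of merges, rather than only at the end, is one simple way to think about checkpointing: a later run can reload the file and continue from the last recorded merge.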

🚀 Installation

Prerequisites

  • Rust (required)
  • Python 3.9+ (for Python bindings)

PIP Installation

pip install mayatok

From Source

git clone https://github.com/AlgoBrother/MayaTok-BPE.git
cd MayaTok-BPE

Use maturin for building wheels.

pip install maturin
maturin build --release
pip install target/wheels/*.whl

Quick Start

Using with Python

To use MayaTok with Python:

import mayatok as bpe

my_tokenizer = bpe.get_tokenizer("v2-100k")  # or "mayatok-base" to use the v1 tokenizer
test = "Hello, world!"
tokens = my_tokenizer.encode(test)
print(tokens)
decoded_text = my_tokenizer.decode(tokens)
print(decoded_text)

Output of the sample code above

[11608, 77, 3641, 62]
Hello, world!
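To make the encode/decode round trip less of a black box, here is a toy pure-Python sketch of what a byte-level BPE tokenizer does with a merge table. It is not MayaTok's code: the `encode` and `decode` functions and the two-entry `TOY_MERGES` table are hypothetical and exist only for illustration, so the ids differ from the real output shown above.

```python
def encode(text, merges):
    """Apply each merge, in learned order, to the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    for left, right, new_id in merges:
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == left and ids[i + 1] == right:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode(ids, merges):
    """Expand every id back to its byte string, then decode as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for left, right, new_id in merges:
        table[new_id] = table[left] + table[right]
    return b"".join(table[i] for i in ids).decode("utf-8")


# Hypothetical merge table: "ll" -> 256, then "ll" + "o" -> 257.
TOY_MERGES = [(108, 108, 256), (256, 111, 257)]

tokens = encode("Hello, world!", TOY_MERGES)
print(tokens)  # [72, 101, 257, 44, 32, 119, 111, 114, 108, 100, 33]
print(decode(tokens, TOY_MERGES))  # Hello, world!
```

A real vocabulary has on the order of 100k entries, which is why the actual tokenizer compresses the same string into only four tokens.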

If you want to create your own Vocab

If you are using HuggingFace Datasets, refer to this for creating your own vocab.

If your dataset is on your local machine

Make sure you have forked or cloned the Rust tokenizer code and built the wheel in target/wheels as described in the From Source steps above.

Stream method: use this if your dataset is large and you want to stream it in chunks so that it does not overload your machine.

Non-stream method: use this if your RAM can hold the whole dataset once it is loaded; training is much faster this way.
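The trade-off between the two methods can be illustrated with a small pure-Python sketch that counts byte pairs either batch by batch or from a fully loaded file. None of these helpers (`iter_batches`, `count_streaming`, `count_in_memory`) are part of MayaTok; they are hypothetical names used only to show the memory pattern, and the sample file is created just for the demonstration.

```python
import os
import tempfile
from collections import Counter
from itertools import islice


def iter_batches(path, lines_per_batch):
    """Yield lists of at most `lines_per_batch` lines without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        while True:
            batch = list(islice(f, lines_per_batch))
            if not batch:
                return
            yield batch


def pair_counts(lines):
    """Count adjacent byte pairs within each line."""
    counts = Counter()
    for line in lines:
        data = line.encode("utf-8")
        counts.update(zip(data, data[1:]))
    return counts


def count_streaming(path, lines_per_batch=1000):
    """Stream style: only one batch of lines is held in memory at a time."""
    total = Counter()
    for batch in iter_batches(path, lines_per_batch):
        total.update(pair_counts(batch))
    return total


def count_in_memory(path):
    """Non-stream style: read everything first, then count in one pass."""
    with open(path, encoding="utf-8") as f:
        return pair_counts(f.readlines())


# Tiny demonstration corpus (hypothetical file name).
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write("low lower\nlowest low\nnew newer\n")

streamed = count_streaming(corpus_path, lines_per_batch=2)
loaded = count_in_memory(corpus_path)
assert streamed == loaded  # same statistics, different peak memory
```

Both paths produce the same counts here; the streaming version bounds peak memory by the batch size, while the in-memory version avoids repeated I/O and per-batch bookkeeping.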

📈 Benchmarks

Batch Encoding

Tokenizer       | Tokens/sec | Avg Compression Ratio
MayaTok-BPE     | 7,306,114  | 2.75
tiktoken-cl100k | 262,016    | 3.36
tiktoken-p50k   | 288,657    | 3.27
GPT2            | 1,227,199  | 2.94
Falcon-7B       | 946,393    | 3.26

Normal Encoding

Tokenizer       | Tokens/sec | Compression Ratio
MayaTok         | 1,181,709  | 2.75
tiktoken-cl100k | 1,184,446  | 3.36
tiktoken-p50k   | 1,591,801  | 3.27
GPT2            | 252,369    | 2.94
Falcon-7B       | 172,114    | 3.26

Note: Performance optimizations are ongoing, and these numbers may change because I am applying a new benchmarking method.
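The README does not define how the two columns are measured, so the following is only one plausible way to compute them. The sketch assumes that tokens/sec is total tokens divided by wall-clock encoding time and that compression ratio means UTF-8 bytes per token; both definitions, the `benchmark` helper, and the whitespace stand-in encoder are assumptions, not the method behind the tables above.

```python
import time


def benchmark(encode, texts):
    """Time `encode` over `texts` and report throughput and bytes per token.

    `encode` is any callable that maps a string to a sequence of tokens.
    """
    start = time.perf_counter()
    n_tokens = sum(len(encode(t)) for t in texts)
    elapsed = time.perf_counter() - start

    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return {
        "tokens": n_tokens,
        # Guard against a zero reading from the clock on very small inputs.
        "tokens_per_sec": n_tokens / elapsed if elapsed > 0 else float("inf"),
        # Assumed definition: input bytes per produced token.
        "compression_ratio": n_bytes / n_tokens if n_tokens else 0.0,
    }


# Stand-in encoder so the sketch runs without any tokenizer installed.
def whitespace_encode(text):
    return text.split()


stats = benchmark(whitespace_encode, ["hello world", "a b c"])
print(stats["tokens"], stats["compression_ratio"])  # 5 3.2
```

To compare real tokenizers, pass each one's encode method in place of `whitespace_encode` and use the same list of texts for all of them, since both metrics depend heavily on the corpus.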

💽 Corpus Used for V2

cosmopedia-v2

c4-english

wikipedia

openwebtext

github-top-code

arxiv-papers

Check dataset_training/train.py for more details

🙌 Contributing

Pull requests and suggestions are welcome! Feel free to open issues for bugs, feature requests, or optimizations.

📄 License

Apache-2.0

Future Targets [COMPLETED]

  • [✓] PyPI package distribution

  • [✓] The examples folder contains a lot of Python implementation. Experiment with moving it into the Rust side of the code so that the Python side is smaller and easier for users.

  • [✓] Enhanced CPU utilization and faster merging algorithms

  • [✓] Improved merge quality and compression ratios

  • [✓] Bincode support for faster model loading

  • [✓] New Line Format support
