High performance BPE tokenizer written in Rust with Python bindings

MayaTok

MayaTok is a Byte-Pair Encoding (BPE) tokenizer written in Rust, built with performance and extensibility in mind. I made this project because I wanted to study how Byte-Pair Encoding works.

Version: V2

⚡️ Features (More optimizations in Progress)

  • Multithreaded training for fast vocab generation

  • Persistent merges

  • Checkpoint saving

  • Focus on raw speed — built for performance benchmarking
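For readers new to BPE, the training loop that produces a merge list can be sketched in a few lines of Python. This is a minimal, single-threaded illustration over UTF-8 bytes, not MayaTok's Rust implementation; the function names `train_bpe` and `merge_pair` are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(ids: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Replace each non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out: List[int] = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> List[Tuple[int, int]]:
    """Learn up to `num_merges` merges, starting from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges: List[Tuple[int, int]] = []
    next_id = 256  # IDs 0..255 are reserved for the raw bytes
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress anything
        ids = merge_pair(ids, best, next_id)
        merges.append(best)
        next_id += 1
    return merges
```

The returned list of pairs, in order, is the kind of data a tokenizer would need to store in order to reuse a trained vocabulary later.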

🚀 Installation

Prerequisites

  • Rust (required)
  • Python 3.7+ (for Python bindings)

From Source

git clone https://github.com/AlgoBrother/MayaTok-BPE.git
cd MayaTok-BPE

Use maturin for building wheels.

pip install maturin
maturin build --release
pip install target/wheels/*.whl

Quick Start

Create your own Vocab

To create your own vocab file from a different corpus, first make sure you have forked or cloned the Rust tokenizer code and built the wheels in target/wheels as described in the previous steps. Two training methods are available:

stream method - For a large dataset. It streams your data in chunks so that training does not overload your machine.

non-stream method - For a dataset that your RAM can handle once it is loaded. Use this for much faster training.
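The trade-off between the two methods can be illustrated with plain Python file handling. This is only a sketch of the general idea (bounded memory versus a single full read); it does not show MayaTok's training API, and the helper names `iter_text_chunks` and `load_whole_text` are hypothetical.

```python
from typing import Iterator


def iter_text_chunks(path: str, chunk_chars: int = 1 << 20) -> Iterator[str]:
    """Stream style: yield the file in pieces of at most `chunk_chars` characters."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            yield chunk


def load_whole_text(path: str) -> str:
    """Non-stream style: read the entire file into memory at once."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

With the streaming helper, peak memory is bounded by the chunk size, at the cost of extra bookkeeping per chunk. Reading the whole file is simpler and usually faster, as long as the corpus fits in RAM.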

Using with Python

To use MayaTok with Python:

import mayatok as bpe

# Load the v2 tokenizer; pass 'mayatok-base' instead to use the v1 tokenizer.
my_tokenizer = bpe.get_tokenizer("v2-100k")
test = "Hello, world!"
tokens = my_tokenizer.encode(test)  # text -> list of token IDs
print(tokens)
decoded_text = my_tokenizer.decode(tokens)  # token IDs -> original text
print(decoded_text)

Output of the sample code above

[11608, 77, 3641, 62]
Hello, world!

📈 Benchmarks

Batch Encoding

Tokenizer          Tokens/sec    Avg Compression Ratio
MayaTok-BPE        7,306,114     2.75
tiktoken-cl100k    262,016       3.36
tiktoken-p50k      288,657       3.27
GPT2               1,227,199     2.94
Falcon-7B          946,393       3.26

Normal Encoding

Tokenizer          Tokens/sec    Compression Ratio
MayaTok            1,181,709     2.75
tiktoken-cl100k    1,184,446     3.36
tiktoken-p50k      1,591,801     3.27
GPT2               252,369       2.94
Falcon-7B          172,114       3.26

Note: Performance optimizations are ongoing, and these numbers may change because I am applying a new benchmark method.
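As a rough guide to how figures like these can be measured, here is a minimal timing sketch. It is not the benchmark script used for the tables above. It assumes that "compression ratio" means UTF-8 bytes per token, which this README does not define, and it uses a stand-in byte-level encoder (`byte_encode`, hypothetical) so the example runs without any tokenizer installed.

```python
import time
from typing import Callable, List, Tuple


def measure(encode: Callable[[str], List[int]], texts: List[str]) -> Tuple[float, float]:
    """Return (tokens_per_second, bytes_per_token) for `encode` over `texts`."""
    # Count input bytes outside the timed region so only encoding is timed.
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = 0
    start = time.perf_counter()
    for text in texts:
        total_tokens += len(encode(text))
    elapsed = time.perf_counter() - start
    tokens_per_second = total_tokens / elapsed if elapsed > 0 else float("inf")
    # Assumption: compression ratio = UTF-8 bytes per token.
    bytes_per_token = total_bytes / total_tokens if total_tokens else 0.0
    return tokens_per_second, bytes_per_token


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder: one token per UTF-8 byte, so bytes per token is 1.0."""
    return list(text.encode("utf-8"))
```

To benchmark a real tokenizer, pass its `encode` method in place of `byte_encode`, for example `measure(my_tokenizer.encode, texts)`.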

💽 Corpus Used for V2

cosmopedia-v2

c4-english

wikipedia

openwebtext

github-top-code

arxiv-papers

See dataset_training/train.py for more details.

🙌 Contributing

Pull requests and suggestions are welcome! Feel free to open issues for bugs, feature requests, or optimizations.

📄 License

Apache-2.0

Future Targets

  • PyPI package distribution

  • [✓] The examples folder contains a lot of Python implementation code. I will experiment with moving this into the Rust side of the code to make the Python side smaller and easier for users.

  • [✓] Enhanced CPU utilization and faster merging algorithms

  • [✓] Improved merge quality and compression ratios

  • [✓] Bincode support for faster model loading

  • [✓] New Line Format support
