High performance BPE tokenizer written in Rust with Python bindings

MayaTok

MayaTok is a Byte-Pair Encoding (BPE) tokenizer written in Rust, built with performance and extensibility in mind. I made this project because I wanted to study how Byte-Pair Encoding works.

Version: 2.1.4

⚡️ Features (more optimizations in progress)

  • Multithreaded training for fast vocab generation

  • Persistent merges

  • Checkpoint saving

  • Focus on raw speed — built for performance benchmarking
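
For readers new to BPE, here is a minimal single-threaded sketch of what "training" and "persistent merges" mean in general. It is not MayaTok's Rust implementation: the function names (`train_bpe`, `merge_pair`, `save_merges`, `load_merges`) and the JSON file format are assumptions made for illustration only.

```python
import json
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress anything
        ids = merge_pair(ids, best, 256 + len(merges))
        merges.append(best)
    return merges


def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        ids = merge_pair(ids, pair, 256 + rank)
    return ids


def decode(ids, merges):
    """Rebuild the byte string for each token id, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        vocab[256 + rank] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


def save_merges(merges, path):
    """Persist merges as JSON (hypothetical format, not MayaTok's)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(merges, f)


def load_merges(path):
    """Load merges saved by save_merges, restoring the (int, int) tuples."""
    with open(path, "r", encoding="utf-8") as f:
        return [tuple(pair) for pair in json.load(f)]
```

Saving the merge list part-way through training is one simple way to get a checkpoint, since the merges alone are enough to rebuild the vocabulary.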

🚀 Installation

Prerequisites

  • Rust (required)
  • Python 3.9+ (for Python bindings)

PIP Installation

pip install mayatok

From Source

git clone https://github.com/AlgoBrother/MayaTok-BPE.git
cd MayaTok-BPE

Use maturin for building wheels.

pip install maturin
maturin build --release
pip install target/wheels/*.whl

Quick Start

Using with Python

To use MayaTok with Python:

import mayatok as bpe

my_tokenizer = bpe.get_tokenizer("v2-100k")  # or "mayatok-base" to use the v1 tokenizer
test = "Hello, world!"
tokens = my_tokenizer.encode(test)
print(tokens)
decoded_text = my_tokenizer.decode(tokens)
print(decoded_text)

Output of the sample code above

[11608, 77, 3641, 62]
Hello, world!
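
A quick way to sanity-check any tokenizer is to confirm that decoding the encoded text returns the original. The helper below is a sketch that relies only on the `encode` and `decode` methods shown above. `roundtrip_report` and `ByteTokenizer` are hypothetical names, and `ByteTokenizer` is just a stand-in so the example runs without MayaTok installed.

```python
def roundtrip_report(tokenizer, samples):
    """Return (text, round_trip_ok, n_tokens) for each sample string."""
    report = []
    for text in samples:
        tokens = tokenizer.encode(text)
        report.append((text, tokenizer.decode(tokens) == text, len(tokens)))
    return report


class ByteTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


if __name__ == "__main__":
    samples = ["Hello, world!", "naïve café", ""]
    for text, ok, n_tokens in roundtrip_report(ByteTokenizer(), samples):
        print(ok, n_tokens, repr(text))
```

To check MayaTok itself, pass the object returned by `bpe.get_tokenizer("v2-100k")` in place of `ByteTokenizer()`.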

If you want to create your own vocab

If you are using HuggingFace Datasets, refer to this for creating your own vocab.

If your dataset is on your local machine

Make sure you have forked or cloned the Rust tokenizer code and built the wheels in target/wheels, as described in the From Source steps above.

stream method - Use this if you have a large dataset and want to stream the data in chunks so that it does not overload your machine.

non-stream method - Use this if your dataset fits in RAM once loaded; training is much faster.
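
The difference between the two approaches can be sketched in plain Python. This is not MayaTok's training API: `iter_text_chunks`, `load_text`, and `count_byte_pairs` are hypothetical helpers that only illustrate the trade-off between reading a corpus in fixed-size chunks and loading it whole.

```python
from collections import Counter


def iter_text_chunks(path, chunk_chars=1 << 20):
    """Stream a UTF-8 text file in chunks of at most `chunk_chars` characters."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            yield chunk


def load_text(path):
    """Non-streaming alternative: read the whole file into memory at once."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def count_byte_pairs(chunks):
    """Count adjacent byte pairs across chunks, carrying the last byte over
    so that pairs spanning a chunk boundary are not lost."""
    counts = Counter()
    prev = None
    for chunk in chunks:
        data = chunk.encode("utf-8")
        if not data:
            continue
        if prev is not None:
            counts[(prev, data[0])] += 1
        counts.update(zip(data, data[1:]))
        prev = data[-1]
    return counts
```

Streaming keeps peak memory near one chunk at the cost of extra bookkeeping at chunk boundaries, while loading everything at once is simpler and usually faster when the data fits in RAM.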

📈 Benchmarks

Batch Encoding

| Tokenizer | Tokens/sec | Avg Compression Ratio |
|---|---|---|
| MayaTok-BPE | 6,757,698 | 2.75 |
| tiktoken-cl100k | 262,016 | 3.36 |
| tiktoken-p50k | 288,657 | 3.27 |
| GPT2 | 1,940,899 | 2.94 |
| Falcon-7B | 1,554,393 | 3.26 |

Normal Encoding

| Tokenizer | Tokens/sec | Compression Ratio |
|---|---|---|
| MayaTok | 1,249,426 | 2.75 |
| tiktoken-cl100k | 2,318,683 | 3.36 |
| tiktoken-p50k | 2,670,190 | 3.27 |
| GPT2 | 519,369 | 2.94 |
| Falcon-7B | 346,040 | 3.26 |
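
The benchmark script itself is not reproduced here, so the following is only a sketch of how metrics like these could be measured. It assumes that "compression ratio" means UTF-8 bytes per token, which may differ from the definition used for the numbers above. `measure` is a hypothetical helper that accepts any `encode` callable.

```python
import time


def measure(encode, texts, repeats=3):
    """Return (tokens_per_sec, compression_ratio) for an `encode` callable.

    compression_ratio is assumed to be total UTF-8 bytes / total tokens.
    The fastest of `repeats` runs is used to reduce timing noise.
    """
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = 0
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(len(encode(t)) for t in texts)
        best = min(best, time.perf_counter() - start)
    tokens_per_sec = total_tokens / best if best > 0 else float("inf")
    ratio = total_bytes / total_tokens if total_tokens else 0.0
    return tokens_per_sec, ratio
```

With MayaTok, `encode` would be the `encode` method of the tokenizer returned by `bpe.get_tokenizer(...)`. Results depend heavily on the corpus, hardware, and whether inputs are batched.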

Note: Performance optimizations are ongoing. These numbers may change, since I am switching to a new benchmarking method.

💽 Corpus Used for V2

cosmopedia-v2

c4-english

wikipedia

openwebtext

github-top-code

arxiv-papers

Check dataset_training/train.py for more details

🙌 Contributing

Pull requests and suggestions are welcome! Feel free to open issues for bugs, feature requests, or optimizations.

📄 License

Apache-2.0

Future Targets [COMPLETED]

  • [✓] PyPI package distribution

  • [✓] The examples folder contains a lot of Python implementation code. I will experiment with moving it to the Rust side to make the Python side smaller and easier for users.

  • [✓] Enhanced CPU utilization and faster merging algorithms

  • [✓] Improved merge quality and compression ratios

  • [✓] Bincode support for faster model loading

  • [✓] New Line Format support
