A fast library for Byte Pair Encoding (BPE) tokenization.

Project description

SmolToken

SmolToken is a fast library for tokenizing text using the Byte Pair Encoding (BPE) algorithm. Inspired by OpenAI's tiktoken, SmolToken is designed to fill a critical gap by enabling BPE training from scratch while maintaining high performance for encoding and decoding tasks.

Unlike tiktoken, SmolToken supports training tokenizers on custom data. It is up to ~4x faster than a Rust port of the unoptimized educational implementation (_educational.py).
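To make "BPE training from scratch" concrete, here is a minimal byte-level sketch in plain Python. It is not SmolToken's API or implementation: the function names (train_bpe, merge_pair, encode, decode) are hypothetical, and the sketch skips the regex pre-tokenization step that tiktoken-style tokenizers typically apply before merging. It only illustrates the core loop: count adjacent pairs, merge the most frequent one, repeat until the vocabulary is full.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn merge rules: (left_id, right_id) -> new_id, in the order learned."""
    if vocab_size < 256:
        raise ValueError("vocab_size must cover the 256 base byte values")
    ids = list(text.encode("utf-8"))  # start from raw bytes: ids 0..255
    merges = {}
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges


def encode(text, merges):
    """Tokenize by replaying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = "aaabdaaabac"
    rules = train_bpe(sample, vocab_size=256 + 3)
    tokens = encode(sample, rules)
    print(tokens, decode(tokens, rules) == sample)
```

The training loop rescans the whole sequence for every merge, which is the kind of cost an optimized implementation would try to avoid; the sketch favors readability over speed.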

Benchmark Results

SmolToken is already faster than the baseline educational implementation at BPE training:

Implementation               Runtime (sec)
Unoptimized Implementation   36.94385
SmolToken Optimized          17.63223
SmolToken (with rayon)       7.489850

Tested on:

  • Vocabulary size: 500
  • Dataset: Tiny Stories (~18 MB)
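As a quick check on the numbers, the speedup factors implied by the runtimes listed above can be computed directly. The short sketch below uses only the three reported runtimes; the speedups helper is a hypothetical name introduced here for illustration.

```python
# Runtimes in seconds, copied from the benchmark results above.
runtimes_sec = {
    "Unoptimized Implementation": 36.94385,
    "SmolToken Optimized": 17.63223,
    "SmolToken (with rayon)": 7.489850,
}


def speedups(runtimes, baseline):
    """Return baseline_runtime / runtime for every entry."""
    base = runtimes[baseline]
    return {name: base / seconds for name, seconds in runtimes.items()}


result = speedups(runtimes_sec, "Unoptimized Implementation")
for name, factor in result.items():
    print(f"{name}: {factor:.2f}x")
```

On these figures the single-threaded optimized version is roughly 2.1x faster than the baseline, and the rayon version roughly 4.9x faster, for this one dataset and vocabulary size.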

Installation

Add smoltoken to your Rust project via crates.io:

cargo add smoltoken

Or add smoltoken to your Python project via PyPI:

pip install smoltoken

Roadmap

  • Concurrency: Add multi-threading support using rayon for faster training, encoding, and decoding.
  • Python Bindings: Integrate with Python using PyO3 to make it accessible for Python developers.
  • Serialization: Add serialization support to save/load trained tokenizer vocabulary.

Contributing

We very much welcome contributions that make SmolToken fast, robust, and efficient. Make a fork, create a feature branch if needed, and submit your pull request. Since the library is in its early release stage, we also look forward to community feedback to improve on. Just raise an issue and we will address it promptly.

License

SmolToken is open source and licensed under the MIT License.

Acknowledgements

Special thanks to OpenAI's tiktoken for inspiration and foundational ideas.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

  • smoltoken-0.1.2-cp313-cp313-win_amd64.whl (960.0 kB): CPython 3.13, Windows x86-64
  • smoltoken-0.1.2-cp313-cp313-musllinux_1_2_x86_64.whl (1.3 MB): CPython 3.13, musllinux: musl 1.2+ x86-64
  • smoltoken-0.1.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB): CPython 3.13, manylinux: glibc 2.17+ x86-64
  • smoltoken-0.1.2-cp313-cp313-macosx_11_0_arm64.whl (1.1 MB): CPython 3.13, macOS 11.0+ ARM64
  • smoltoken-0.1.2-cp313-cp313-macosx_10_13_x86_64.whl (1.1 MB): CPython 3.13, macOS 10.13+ x86-64
  • smoltoken-0.1.2-cp312-cp312-win_amd64.whl (960.0 kB): CPython 3.12, Windows x86-64
  • smoltoken-0.1.2-cp312-cp312-musllinux_1_2_x86_64.whl (1.3 MB): CPython 3.12, musllinux: musl 1.2+ x86-64
  • smoltoken-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB): CPython 3.12, manylinux: glibc 2.17+ x86-64
  • smoltoken-0.1.2-cp312-cp312-macosx_11_0_arm64.whl (1.1 MB): CPython 3.12, macOS 11.0+ ARM64
  • smoltoken-0.1.2-cp312-cp312-macosx_10_13_x86_64.whl (1.1 MB): CPython 3.12, macOS 10.13+ x86-64
  • smoltoken-0.1.2-cp311-cp311-win_amd64.whl (960.0 kB): CPython 3.11, Windows x86-64
  • smoltoken-0.1.2-cp311-cp311-musllinux_1_2_x86_64.whl (1.3 MB): CPython 3.11, musllinux: musl 1.2+ x86-64
  • smoltoken-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB): CPython 3.11, manylinux: glibc 2.17+ x86-64
  • smoltoken-0.1.2-cp311-cp311-macosx_11_0_arm64.whl (1.1 MB): CPython 3.11, macOS 11.0+ ARM64
  • smoltoken-0.1.2-cp311-cp311-macosx_10_12_x86_64.whl (1.1 MB): CPython 3.11, macOS 10.12+ x86-64

File details

Details for the file smoltoken-0.1.2-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.2-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 960.0 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for smoltoken-0.1.2-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 e2356c5d63f2458e5d6b99a3dd4288a6ee37ab4cae9d807ac0e3eef0e52f9851
MD5 488b541e12bc6e0b2f7094822bdbe7c0
BLAKE2b-256 39d860e48aa4ea2fd2fc99214b2dbd2666ee7c4be07fd7af39f7f7e4193393f9

See more details on using hashes here.
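One way to use the published digests is to hash a downloaded wheel locally and compare the result with the SHA256 value listed for that file. The sketch below uses only Python's standard hashlib module; the helper names (sha256_of_file, matches_published_hash) and the local file path are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a local file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical local path; the digest is the SHA256 listed above for this wheel.
    wheel = "smoltoken-0.1.2-cp313-cp313-win_amd64.whl"
    published = "e2356c5d63f2458e5d6b99a3dd4288a6ee37ab4cae9d807ac0e3eef0e52f9851"
    try:
        print("match" if matches_published_hash(wheel, published) else "MISMATCH")
    except FileNotFoundError:
        print(f"{wheel} not found; download it first")
```

pip can also enforce this automatically through its hash-checking mode, where each requirement in a requirements file carries a `--hash=sha256:...` option.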

File details

Details for the file smoltoken-0.1.2-cp313-cp313-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp313-cp313-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 d3fff3b116b195b02a5eddedfbb498f546a284ddb27424a4fcf350462fa47e06
MD5 18462f1e4523abec0825bc9066665907
BLAKE2b-256 15f3976edfc37b67134facc43c54b0feb69e61824d57adae9ddf506478431135

File details

Details for the file smoltoken-0.1.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 11c683b321a00bf5321d1b7c037ed55c5360d93991ff7d38ee2a8bed8bbcf289
MD5 b4e3b0e9053071f1d35031db40b3f0a3
BLAKE2b-256 4f260b30ed912de840b41b33ebfd99045dd05decdba2e90a55ca6567b1af15d0

File details

Details for the file smoltoken-0.1.2-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7f1a1d93b9f7751007bb4936335bbcc77d61a35f7de296a13fa1d3a1d68ce78b
MD5 0d7a4fe2e5a54fe83aea80fcd1f784b9
BLAKE2b-256 3c94ca5b4b3cbf0acf9706573fdc9c46ab1399c255a9db4eb65ff3ed131a03a0

File details

Details for the file smoltoken-0.1.2-cp313-cp313-macosx_10_13_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp313-cp313-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 775d22af403cdd6204db4a4e5141e7932a5f98fb2ff55b4fc7142a90bd558ea0
MD5 f739578db83436ea28f13189bf9625c7
BLAKE2b-256 1ef163e43b2ad3073caec483e02408bbe2bff6814da5a7449f4efa15ea4e884c

File details

Details for the file smoltoken-0.1.2-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.2-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 960.0 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for smoltoken-0.1.2-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 605776d9b21cf500d6aa1d6de394141feb86fe22f550b08fea121b16f9e5c1ad
MD5 6212a02ff8f6bf1b1a9be29e86223c15
BLAKE2b-256 6df776b200166b082004a214ecce8eb8396bd530d14ff691e64915ccd1dfe3b0

File details

Details for the file smoltoken-0.1.2-cp312-cp312-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp312-cp312-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 ed2f815a70814225448559234e3944a5a25bd2082225c9c592dc05914b48ab0f
MD5 ba907ced04bc0bc9c30235c9b89d1495
BLAKE2b-256 aaeae80e9557f0c4b7e8d8d002413c3739e29c293ad263093930e5a74e8fe0b9

File details

Details for the file smoltoken-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9e6590ad8358146fef5acf697adf3f4d8d964cf76cfc1787e6de5b1ed0b1cc15
MD5 6ac3e7ff4bfe9fb2a81f6a963efed18d
BLAKE2b-256 534daeaf74c223fd5214af613c2e2d9d52a28aa31eb4b399a16eaeca7901bb72

File details

Details for the file smoltoken-0.1.2-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 05dc8bded939a5544588960fef34e61ff0cd4c52283440315a9fdb0e888c5483
MD5 e76551efe6cee162bcb848ae5edafb0e
BLAKE2b-256 f0a5400bdc4b311f23c09bfb24cbe59b390da1710d24e1e2e8f8cd73d4b2dc90

File details

Details for the file smoltoken-0.1.2-cp312-cp312-macosx_10_13_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 7e00a451abaeb8545d223085fc9e2eb9ae0a9901aedcbe552bf750db7424b744
MD5 236d647f4abf51285c3b142ab5059b39
BLAKE2b-256 072ee3d0646c0cfe53edbeaf1c1d97f14cd060f86345dc2134368e9312d6fefb

File details

Details for the file smoltoken-0.1.2-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.2-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 960.0 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for smoltoken-0.1.2-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 3cd90012888ebf2eb0c87eb7b98ac81b3fb94cf3df23d5585bd0a20f915b46dc
MD5 30483b583ec731f0cfac8eb59c35c14c
BLAKE2b-256 3a23a6432cbd5c1496aed345b2bdc21e1230234e7fc8931e2da6c5a8ffa5134a

File details

Details for the file smoltoken-0.1.2-cp311-cp311-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp311-cp311-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 ab0aab4050d1d4b13c767619f20b8d5a6fb9ce047e2c98c7551afd8aa1539b74
MD5 189309858e6e1a1c16dc7539708be945
BLAKE2b-256 e90961f055ef4a64b7ffab60835d3c1d3c0ff514f064890b5bb422ceb556eee8

File details

Details for the file smoltoken-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1f5467c0793ae526d8bc505fa04aba57be7558eca3b21f7bb4dccc9e9fe2cc34
MD5 e32354f4bb0f04e63e27ba2082d4cebd
BLAKE2b-256 9adedc461b04ec4706d3466ed1ed153d58c6cd979ecdb3a31f489ecf62bbe7a4

File details

Details for the file smoltoken-0.1.2-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3fbe7c6b82fbc8f7f7e58eda8b95028b4587c8de21587177564003bc0ebdc313
MD5 5007c8af06e6bf10e65ed9d17eadb65f
BLAKE2b-256 510c7bb11ecf4e1b4ba73d6d6c98464cf566106b3172c16fb4de1b2bc6c37c76

File details

Details for the file smoltoken-0.1.2-cp311-cp311-macosx_10_12_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.2-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 008d9b2263fe1e10234f01d47a36325e0e7ed0b5d9bd6e94bc4f1628f0a9fca8
MD5 46de495cc1866d7174cccb40939b7262
BLAKE2b-256 21975f2c16d41c971ffca0a078166bc75fc5acdd2a08ae91af0ccdb8c88e3b8f
