A light-weight & fast library for Byte Pair Encoding (BPE) tokenization.

Project description

SmolToken

SmolToken is a fast library for tokenizing text using the Byte Pair Encoding (BPE) algorithm. Inspired by OpenAI's tiktoken, SmolToken is designed to fill a critical gap by enabling BPE training from scratch while maintaining high performance for encoding and decoding tasks.

Unlike tiktoken, SmolToken supports training tokenizers on custom data. In the benchmark below, training is about 2x faster than a Rust port of tiktoken's unoptimized educational implementation (_educational.py), and nearly 5x faster with rayon enabled.
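
To make "BPE training from scratch" concrete, here is a minimal, unoptimized sketch in pure Python. It is not SmolToken's implementation or API: the function name train_bpe and its parameters are hypothetical, and the sketch skips the regex pre-splitting that tiktoken-style tokenizers apply before merging.

```python
from collections import Counter


def train_bpe(data: bytes, vocab_size: int) -> dict[bytes, int]:
    """Toy byte-level BPE trainer: returns a mapping from token bytes to rank."""
    if vocab_size < 256:
        raise ValueError("vocab_size must be at least 256 to cover every byte")

    # Start with one token per possible byte value.
    ranks = {bytes([b]): b for b in range(256)}
    # The corpus as a sequence of tokens, initially single bytes.
    tokens = [bytes([b]) for b in data]

    while len(ranks) < vocab_size:
        # Count every adjacent pair of tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties go to the pair seen first.
        (left, right), _count = pairs.most_common(1)[0]
        merged = left + right
        ranks[merged] = len(ranks)

        # Replace each occurrence of the pair, scanning left to right.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out

    return ranks
```

For example, train_bpe(b"aaabdaaabac", 259) learns three merges on top of the 256 single-byte tokens: b"aa", b"aaa", and b"aaab". Recounting every pair on each iteration is what makes this approach slow on real corpora.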

Benchmark Results

SmolToken is already faster than the baseline educational implementation of BPE training:

Implementation                Runtime (sec)
Unoptimized Implementation    36.94385
SmolToken Optimized           17.63223
SmolToken (with rayon)        7.489850

Tested on:

  • Vocabulary size: 500
  • Dataset: Tiny Stories (~18 MB)
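
The speedup factors implied by the runtimes above can be worked out directly. The short script below only restates the published numbers; it does not rerun the benchmark.

```python
# Runtimes in seconds, copied from the benchmark results above.
runtimes = {
    "Unoptimized Implementation": 36.94385,
    "SmolToken Optimized": 17.63223,
    "SmolToken (with rayon)": 7.489850,
}

baseline = runtimes["Unoptimized Implementation"]

# Speedup = baseline runtime divided by the runtime of each variant.
speedups = {name: baseline / secs for name, secs in runtimes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

This works out to roughly 2.1x for the optimized build and roughly 4.9x with rayon, for this dataset and vocabulary size only.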

Installation

Add smoltoken to your Rust project via crates.io:

cargo add smoltoken

Or add smoltoken to your Python project via PyPI:

pip install smoltoken

Features

  • Concurrency: Multi-threading support with rayon for accelerated training, encoding, and decoding processes.
  • Python Bindings: Seamless integration with Python via PyO3, enabling accessibility for Python developers.
  • Serialization: Support for saving and loading trained tokenizer vocabulary through serialization.
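
As an illustration of the serialization feature, the sketch below saves and loads a token-to-rank vocabulary. The on-disk format SmolToken actually uses is not described here, so this assumes the plain-text layout of tiktoken's vocabulary files (one line per token: base64-encoded bytes, a space, the rank). The helper names dump_ranks and load_ranks are hypothetical.

```python
import base64


def dump_ranks(ranks: dict[bytes, int]) -> str:
    """Serialize token ranks as lines of '<base64 token> <rank>', sorted by rank."""
    lines = []
    for token, rank in sorted(ranks.items(), key=lambda item: item[1]):
        lines.append(f"{base64.b64encode(token).decode('ascii')} {rank}")
    return "\n".join(lines) + "\n"


def load_ranks(text: str) -> dict[bytes, int]:
    """Parse the format written by dump_ranks back into a dict."""
    ranks = {}
    for line in text.splitlines():
        if not line:
            continue  # tolerate blank lines
        encoded, rank = line.split()
        ranks[base64.b64decode(encoded)] = int(rank)
    return ranks
```

Base64 keeps arbitrary byte sequences, including bytes that are not valid UTF-8, safe inside a text file.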

Contributing

We very much welcome contributions to make SmolToken fast, robust, and efficient. Fork the repository, create a feature branch if needed, and submit your pull request. Since the library is still in its early release stage, we also welcome community feedback: just raise an issue and we will address it promptly.

License

SmolToken is open source and licensed under the MIT License.

Acknowledgements

Special thanks to OpenAI's tiktoken for inspiration and foundational ideas.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

smoltoken-0.1.4-cp313-cp313-win_amd64.whl (1.0 MB)
Tags: CPython 3.13, Windows x86-64

smoltoken-0.1.4-cp313-cp313-musllinux_1_2_x86_64.whl (1.4 MB)
Tags: CPython 3.13, musllinux: musl 1.2+ x86-64

smoltoken-0.1.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Tags: CPython 3.13, manylinux: glibc 2.17+ x86-64

smoltoken-0.1.4-cp313-cp313-macosx_11_0_arm64.whl (1.1 MB)
Tags: CPython 3.13, macOS 11.0+ ARM64

smoltoken-0.1.4-cp313-cp313-macosx_10_13_x86_64.whl (1.2 MB)
Tags: CPython 3.13, macOS 10.13+ x86-64

smoltoken-0.1.4-cp312-cp312-win_amd64.whl (1.0 MB)
Tags: CPython 3.12, Windows x86-64

smoltoken-0.1.4-cp312-cp312-musllinux_1_2_x86_64.whl (1.4 MB)
Tags: CPython 3.12, musllinux: musl 1.2+ x86-64

smoltoken-0.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64

smoltoken-0.1.4-cp312-cp312-macosx_11_0_arm64.whl (1.1 MB)
Tags: CPython 3.12, macOS 11.0+ ARM64

smoltoken-0.1.4-cp312-cp312-macosx_10_13_x86_64.whl (1.2 MB)
Tags: CPython 3.12, macOS 10.13+ x86-64

smoltoken-0.1.4-cp311-cp311-win_amd64.whl (1.0 MB)
Tags: CPython 3.11, Windows x86-64

smoltoken-0.1.4-cp311-cp311-musllinux_1_2_x86_64.whl (1.4 MB)
Tags: CPython 3.11, musllinux: musl 1.2+ x86-64

smoltoken-0.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64

smoltoken-0.1.4-cp311-cp311-macosx_11_0_arm64.whl (1.1 MB)
Tags: CPython 3.11, macOS 11.0+ ARM64

smoltoken-0.1.4-cp311-cp311-macosx_10_12_x86_64.whl (1.2 MB)
Tags: CPython 3.11, macOS 10.12+ x86-64

File details

Details for the file smoltoken-0.1.4-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.4-cp313-cp313-win_amd64.whl
  • Size: 1.0 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for smoltoken-0.1.4-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 9769f60fad99605dd9571d1de41f950fb1c0ddf6b0401d2ce061220aa282312f
MD5 2f0c2057f9e8700ac74fa636c2192822
BLAKE2b-256 cee9a55b10e930c7655d97f366fc85f034f9c926e20036aa3c15529e90b52665

See more details on using hashes here.
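
A downloaded wheel can be checked against its published SHA256 digest with Python's standard hashlib module. This is a minimal sketch; the helper names sha256_of and matches are illustrative and not part of SmolToken or pip.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

To use it, pass the path of the downloaded file and the SHA256 value listed for that exact file name; a False result means the file should not be installed.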

File details

Details for the file smoltoken-0.1.4-cp313-cp313-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp313-cp313-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 907c8394f11fe4b614a76e4b90d0433b2685549b34126fe0d62db22a36b85ce1
MD5 cd346cb850e1d6d231ba487a2a877e57
BLAKE2b-256 30ff09e002fdf2243f2f5b02e94e5c406c83db802d2f6c2dbe7f4598bb2838df

File details

Details for the file smoltoken-0.1.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 98da8e72a401d1fe7eb16c86ee670df66d03f159b0d86cbe6b268e3de2d418f2
MD5 0029a3d006c7edf137b8094c93b88ada
BLAKE2b-256 9555254ca9772a5474158b8b218c10de09818a647c5b624073a10723d65534af

File details

Details for the file smoltoken-0.1.4-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0e794f66c93a8bc1c3616ba8717027415f8ddf08dd5f6c9f2bae9231c8798d34
MD5 7394816d325c1b0dffa0d3c4f2b7197d
BLAKE2b-256 a17d63c659dac48051fdccea60f6695c01c12406c9bd47800230af5a73ab4219

File details

Details for the file smoltoken-0.1.4-cp313-cp313-macosx_10_13_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp313-cp313-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 1fdf83d41ae0d07babd2384b86574d6c756cb347548a524f888f1368644196ac
MD5 af2ce6f3bf1e91fb45fc6d0f3bb78b8b
BLAKE2b-256 e02bfeb943f4741819e14c4228101dfd47d3b4f96f99c6c01f4566ead8bad3ec

File details

Details for the file smoltoken-0.1.4-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.4-cp312-cp312-win_amd64.whl
  • Size: 1.0 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for smoltoken-0.1.4-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 9a20a73fbe5fd45a5b76ef5f9ee7eae4b3f148497da5aedc4c7432f39a4bdb06
MD5 666c98118393855ff8b72ffb7472fe3f
BLAKE2b-256 220235dd3fa0594830d8bd1079d0b3fdb27c51c44221fe3b1a2b11f7904df8f8

File details

Details for the file smoltoken-0.1.4-cp312-cp312-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp312-cp312-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 a04d3009f9df6376eee81f29d164f0a711e37529be23184b387571687b942ec5
MD5 c4661c92f3e81b8a6ce4dc95e9d9da54
BLAKE2b-256 0e364cb6b5b544fe8563881299946aa44258c71851cd4bb0acecc3e8e5fa14ca

File details

Details for the file smoltoken-0.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ee49442cc72cea63af358e425f475589ef66b584a8d0ce9d6186b948737c874f
MD5 b14b3bc038a24f5c477d5e04399ed471
BLAKE2b-256 527650a75afb299a74a1eca0fada90d24eb89f8fc3aa0c222e37bf2ecf01d90b

File details

Details for the file smoltoken-0.1.4-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d9ae7e2af8a5636ef49084eb52d5c3292050a834dbe7143cb324503c7c705668
MD5 9274b2dfde208b82abee1782fdfb06bd
BLAKE2b-256 c91a75c80057db50232de45dc24a0538b8be628ecc12e80f157a981beb2df519

File details

Details for the file smoltoken-0.1.4-cp312-cp312-macosx_10_13_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 ba05ea1dfbc36a0bed00852cd2395aff39d244de304b8fd2e13a469dc9211aa2
MD5 2acf7e0232e32247ada7e163f3c7a483
BLAKE2b-256 08d37dc11c3dfe5a1e46074664a61c974c5ff69b6dc93ea87b17929d4f47badc

File details

Details for the file smoltoken-0.1.4-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.4-cp311-cp311-win_amd64.whl
  • Size: 1.0 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for smoltoken-0.1.4-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 fea6a733de27983a90639041bc3f294da706488af8658f76d9a3435026d1d5be
MD5 caa2601fba4741a07b3001e6c994cd83
BLAKE2b-256 40d1555d3f4e7c409d96fd48470a9976b5b957fe472e3bca21d8d0a1d6bad755

File details

Details for the file smoltoken-0.1.4-cp311-cp311-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp311-cp311-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 feb99a7b1a102f678aaaab6dd2e2cf3af971c3d40941f78c5b9ca20960b79827
MD5 a9ba7ca65db4182389ec84c76c08141e
BLAKE2b-256 ac3a1a76688d2edeee7e82375e1ef1b7a98a4a04fe8fb0419ac78d55953de866

File details

Details for the file smoltoken-0.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 23f1c9767016297be760e3ba59f5f1928bb9acd5006acda04ce3062a3f11ebc1
MD5 4e284428ec65499eb9e575762702c370
BLAKE2b-256 626dbe57e7d97593391636625d56eeaeea7596b96a31bd4ae0b9fee6f88e3c0b

File details

Details for the file smoltoken-0.1.4-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 499aa0c6766e6f6f72ac879f6f7b6acdbb6f747a5430b9376bce2c3026fa6ef3
MD5 dfaa6fb058a3d7a3c00682ecfcbb1c46
BLAKE2b-256 e3506aef34602254b8c7494aa90bbb8a6274e0e6a3a4b1a331e06d09a5f7fbbd

File details

Details for the file smoltoken-0.1.4-cp311-cp311-macosx_10_12_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.4-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 6ba90e96a504c5f6c97c0207fc32d24d6d965b2dee34b705c687b0eb700a46fd
MD5 e865eedfd3ea8b4f41b0f76b5e4e4828
BLAKE2b-256 a1fbacd9dcdc27cbc1405ac5438cccb985be7943c731c2a244bfffdb65415402
