A fast library for Byte Pair Encoding (BPE) tokenization.

SmolToken

SmolToken is a fast library for tokenizing text using the Byte Pair Encoding (BPE) algorithm. Inspired by OpenAI's tiktoken, SmolToken is designed to fill a critical gap by enabling BPE training from scratch while maintaining high performance for encoding and decoding tasks.

Unlike tiktoken, SmolToken supports training tokenizers on custom data. Training is up to ~4x faster than a Rust port of the unoptimized educational implementation (_educational.py).
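To make "BPE training from scratch" concrete, here is a minimal pure-Python sketch of byte-level BPE: training, encoding, and decoding. It is an illustration of the general algorithm only, not SmolToken's API or implementation; the function names (`train_bpe`, `merge`, `encode`, `decode`) are hypothetical, and the regex pre-splitting that tiktoken-style tokenizers apply before merging is omitted.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def merge(tokens: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out: List[int] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(data: bytes, vocab_size: int) -> Dict[Pair, int]:
    """Learn merges over raw bytes; returns {(left, right): new_token_id}."""
    if vocab_size < 256:
        raise ValueError("vocab_size must cover the 256 base byte tokens")
    tokens = list(data)  # start from one token per byte (ids 0..255)
    merges: Dict[Pair, int] = {}
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent pair
        tokens = merge(tokens, best, new_id)
        merges[best] = new_id
    return merges


def encode(text: str, merges: Dict[Pair, int]) -> List[int]:
    """Apply the learned merges in the order they were learned."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        tokens = merge(tokens, pair, new_id)
    return tokens


def decode(tokens: List[int], merges: Dict[Pair, int]) -> str:
    """Expand token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[t] for t in tokens).decode("utf-8", errors="replace")


# Usage: train on a tiny corpus, then round-trip a string.
merges = train_bpe(b"low lower lowest low low", vocab_size=260)
ids = encode("lower low", merges)
print(ids, decode(ids, merges))
```

This sketch recounts every pair on each iteration, which is the kind of cost an optimized trainer avoids; it is meant to show the algorithm's shape, not to be fast.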

Benchmark Results

SmolToken already trains faster than the baseline educational implementation of BPE:

| Implementation | Runtime (sec) |
| --- | --- |
| Unoptimized Implementation | 36.94385 |
| SmolToken Optimized | 17.63223 |
| SmolToken (with rayon) | 7.489850 |

Tested on:

  • Vocabulary size: 500
  • Dataset: Tiny Stories (~18 MB)
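The relative speedups implied by these runtimes can be worked out directly. The short sketch below only divides the baseline runtime by each reported runtime; the dictionary name `runtimes_sec` is just an illustrative label.

```python
# Runtimes in seconds, copied from the benchmark results above.
runtimes_sec = {
    "Unoptimized Implementation": 36.94385,
    "SmolToken Optimized": 17.63223,
    "SmolToken (with rayon)": 7.489850,
}

baseline = runtimes_sec["Unoptimized Implementation"]

# Speedup = baseline runtime / runtime of the implementation being compared.
speedups = {name: baseline / secs for name, secs in runtimes_sec.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

On these numbers the optimized single-threaded version is roughly 2.1x faster than the baseline, and the rayon version roughly 4.9x faster.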

Installation

Add smoltoken to your Rust project via crates.io:

cargo add smoltoken

Or add smoltoken to your Python project via PyPI:

pip install smoltoken

Roadmap

  • Concurrency: Add multi-threading support using rayon for faster training, encoding, and decoding.
  • Python Bindings: Integrate with Python using PyO3 to make it accessible for Python developers.
  • Serialization: Add serialization support to save/load trained tokenizer vocabulary.
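As an illustration of what the serialization item might involve, the sketch below saves and reloads a BPE merge table as JSON. The layout (a `version` field plus an ordered list of `[left, right, new_id]` triples) and the function names are assumptions made for this example; they are not SmolToken's actual file format or API.

```python
import json
from typing import Dict, Tuple

Pair = Tuple[int, int]


def dump_merges(merges: Dict[Pair, int]) -> str:
    """Serialize merges to JSON, keeping the order in which they were learned."""
    # JSON object keys must be strings, so tuple keys are stored as triples.
    triples = [[left, right, new_id] for (left, right), new_id in merges.items()]
    return json.dumps({"version": 1, "merges": triples})


def load_merges(payload: str) -> Dict[Pair, int]:
    """Rebuild the merge table from the JSON produced by dump_merges."""
    doc = json.loads(payload)
    if doc.get("version") != 1:
        raise ValueError("unsupported vocabulary file version")
    return {(left, right): new_id for left, right, new_id in doc["merges"]}


# Usage: round-trip a small, hand-written merge table.
example = {(108, 111): 256, (256, 119): 257}
restored = load_merges(dump_merges(example))
print(restored == example)
```

Merge order matters for encoding, which is why the sketch stores an ordered list rather than an unordered mapping.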

Contributing

We very much welcome contributions that make SmolToken faster, more robust, and more efficient. Fork the repository, create a feature branch if needed, and submit your pull request. Since the library is still in an early release stage, we also welcome community feedback: raise an issue and we will address it promptly.

License

SmolToken is open source and licensed under the MIT License.

Acknowledgements

Special thanks to OpenAI's tiktoken for inspiration and foundational ideas.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

| File | Size | Python | Platform |
| --- | --- | --- | --- |
| smoltoken-0.1.1-cp313-cp313-win_amd64.whl | 958.3 kB | CPython 3.13 | Windows x86-64 |
| smoltoken-0.1.1-cp313-cp313-musllinux_1_2_x86_64.whl | 1.3 MB | CPython 3.13 | musllinux: musl 1.2+ x86-64 |
| smoltoken-0.1.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 1.3 MB | CPython 3.13 | manylinux: glibc 2.17+ x86-64 |
| smoltoken-0.1.1-cp313-cp313-macosx_11_0_arm64.whl | 1.1 MB | CPython 3.13 | macOS 11.0+ ARM64 |
| smoltoken-0.1.1-cp313-cp313-macosx_10_13_x86_64.whl | 1.1 MB | CPython 3.13 | macOS 10.13+ x86-64 |
| smoltoken-0.1.1-cp312-cp312-win_amd64.whl | 958.5 kB | CPython 3.12 | Windows x86-64 |
| smoltoken-0.1.1-cp312-cp312-musllinux_1_2_x86_64.whl | 1.3 MB | CPython 3.12 | musllinux: musl 1.2+ x86-64 |
| smoltoken-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 1.3 MB | CPython 3.12 | manylinux: glibc 2.17+ x86-64 |
| smoltoken-0.1.1-cp312-cp312-macosx_11_0_arm64.whl | 1.1 MB | CPython 3.12 | macOS 11.0+ ARM64 |
| smoltoken-0.1.1-cp312-cp312-macosx_10_13_x86_64.whl | 1.1 MB | CPython 3.12 | macOS 10.13+ x86-64 |
| smoltoken-0.1.1-cp311-cp311-win_amd64.whl | 958.2 kB | CPython 3.11 | Windows x86-64 |
| smoltoken-0.1.1-cp311-cp311-musllinux_1_2_x86_64.whl | 1.3 MB | CPython 3.11 | musllinux: musl 1.2+ x86-64 |
| smoltoken-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 1.3 MB | CPython 3.11 | manylinux: glibc 2.17+ x86-64 |
| smoltoken-0.1.1-cp311-cp311-macosx_11_0_arm64.whl | 1.1 MB | CPython 3.11 | macOS 11.0+ ARM64 |
| smoltoken-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl | 1.1 MB | CPython 3.11 | macOS 10.12+ x86-64 |

File details

Details for the file smoltoken-0.1.1-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.1-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 958.3 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for smoltoken-0.1.1-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 54c06ca798b1ba673fdd43c44cdc0cd7d846e7f9b46b4075dd53807854079340
MD5 c60fa0d79922684f701f2b244f4ad051
BLAKE2b-256 3219243f199eb2324711cec23b188adac402caa2305e082956c1c9364c9b101f

See more details on using hashes here.

File details

Details for the file smoltoken-0.1.1-cp313-cp313-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp313-cp313-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 61df5cb27ad17815527afdd7094f99fefa4dc80ba0b4116895b8c5bc7731dc58
MD5 872b3c2835c0b0e0beff39fd7b84d864
BLAKE2b-256 25118ca1db91615168154c0a8a0f25d4388bb7e23a4ad54f20a6cd921aa2bc41

File details

Details for the file smoltoken-0.1.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c8154c07320e320d4480f1bb6912bdabc25a4b96319b37601e0dcaca8957965c
MD5 e8adc607e6029426d1679045792453a5
BLAKE2b-256 bd0a319dd0c9705464648617effa8a612f5eff4f182b2fc8bf5555b125dcc50d

File details

Details for the file smoltoken-0.1.1-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5cad2f1ccbde23e055374967af4ca062bb27d2bfb09a1fc1cc73aa71f6f8e84d
MD5 5321b4a044201e1afb4b9fe39172ada5
BLAKE2b-256 6e9ae0c7ea71fe44e28b2ce633c407e2c69edea62c6775b42d477c9d7bc90ac9

File details

Details for the file smoltoken-0.1.1-cp313-cp313-macosx_10_13_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp313-cp313-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 f5419eea69b88d2a099783ebd58a19c16eac20482432161c49f1b09fa1afb181
MD5 ef799537f9e1809b8db62cbeda2b7401
BLAKE2b-256 2a5216a4c68fa98cdc5a7e0b210cc993143fcb6fb84da78acb52d0fd233800d0

File details

Details for the file smoltoken-0.1.1-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.1-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 958.5 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for smoltoken-0.1.1-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 896f9f4a61e6605893f32af9923b988a7e2110855b64b499fe2ac6c3775f679b
MD5 65278c2ea036dcff2f6802fe3041a1be
BLAKE2b-256 51cc4a9c2b7ffc8b14c98a2630572a704effce3bb92d4183d02878033becf1b1

File details

Details for the file smoltoken-0.1.1-cp312-cp312-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp312-cp312-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 1ccaa93e676730c49eaafcf9eaa035820e0b5560014c25cd38987e9964bb699e
MD5 d2e923e272715c165caaecd9af179b8e
BLAKE2b-256 384579d3a3cb08945f940ce7d982cdb5e23f7957dc6b6a8915c96dcd5ccd9b93

File details

Details for the file smoltoken-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1c852a9c3620579ab3e4b1175f4af6435fa52b5546aacd2e94f7dd3b00f0fe89
MD5 e6526c7eae1e6cbc887a7f095f67e34f
BLAKE2b-256 fd3f01a96dd218dc6243e3c39777500604fd108fd18f55be1f450e81dca80a78

File details

Details for the file smoltoken-0.1.1-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fc631086727bb3a5f6c109a6c710df0fd0debe9543308220ee4a13fda2cefed5
MD5 cc4577b10792f05de525669790ef346f
BLAKE2b-256 ecf7610a2aed85f3e7f2fd1a2707b4ba48f45e988e03dc68ee059ebcbb649403

File details

Details for the file smoltoken-0.1.1-cp312-cp312-macosx_10_13_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 ebc30a58eb88baeb29e95f45f346814294e69220f7644f5d3727f9b2fc832d96
MD5 fcb5e7e3bd5ca41169000f46f230971e
BLAKE2b-256 f51e21ba05f417d611a45a1ddac8d6def1d93792328283df4ddcf880a33f9d20

File details

Details for the file smoltoken-0.1.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: smoltoken-0.1.1-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 958.2 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.8.10

File hashes

Hashes for smoltoken-0.1.1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 7543934fc2f97954f1b38b259b6b494e560e876583081b6f293edb149468ee02
MD5 56a7f5adc76ee81bb72805af05ac6baf
BLAKE2b-256 7d9da8b35613eb996d626dac1eb9cbaafa12a0f0c0e369ec6ea39bb369043e97

File details

Details for the file smoltoken-0.1.1-cp311-cp311-musllinux_1_2_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp311-cp311-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 d36d267d9dfc1c0a92aa1fb0e707e3b4392f100570d6c3f73bf3537ac6fc2ee1
MD5 8010ac43bff605796b52bbf7acbf5a7d
BLAKE2b-256 ec6b0b21f697c400e696e27db4170644c0ae7e9e853d2c3301631f0a6aa3d57c

File details

Details for the file smoltoken-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7da5b6984033d487b532d78a87915df830ea7bd843663a2a34408d833ca9c2ce
MD5 b1816d9a89c14fca1a538182ef078c71
BLAKE2b-256 a18f6584bce8bc42821bca4488f61a974ddaf815b983771eda11b82b01aae1da

File details

Details for the file smoltoken-0.1.1-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 013b8f74c92a24b2a08d662634f26ecfb773c8aa272166b153ad30e3233a59e9
MD5 4569f2ce7e29cf8f133a685f51d8ba63
BLAKE2b-256 29069580e8502126dbca5d97c38fbdc2ad62a3f1fa2b19edc2e171abb91901e9

File details

Details for the file smoltoken-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl.

File hashes

Hashes for smoltoken-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 4a3d3b90aa76a6b5bf0ed59d63dc1e49144d6b0def165e54d173892c301483be
MD5 7183ade7e096ed3e3a7d68a13b85d39c
BLAKE2b-256 7132867f54d67afc9bce25be73a3229e4cdf58e0faee5f1c71d309b5ea64d402
