Fast BPE tokenizer with C++ backend
supertokens
A high-performance tokenizer for Python, powered by a C++ extension built with nanobind and Meson.
Features
- Fast: core tokenization in C++, zero-copy bindings via nanobind
- BPE tokenization: Byte-Pair Encoding implementation with full Unicode support via utfcpp
- Robin-map backend: hash map operations use tsl::robin_map for cache-friendly performance
- Clean Python API: expressive, Pythonic interface over the C++ core
- Meson build system: reproducible builds, easy subproject management
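For readers new to Byte-Pair Encoding, the idea behind the BPE feature above can be sketched in a few lines of pure Python. This is a textbook illustration, not the supertokens C++ implementation: the function names (`train_bpe`, `merge_pair`, `decode`) are hypothetical, and starting from raw UTF-8 bytes with new token ids allocated from 256 upward is an assumption made for the example.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair in ids, or None if ids has no pairs."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of pair (scanning left to right) with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merges over the UTF-8 bytes of text (illustrative only)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in merge order
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # ids 0..255 are reserved for single bytes
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return merges, ids


def decode(ids, merges):
    """Expand token ids back to text by replaying the merges in order."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges, ids = train_bpe("aaabdaaabac", 3)
print(len(ids), decode(ids, merges))
```

Each merge shortens the sequence by replacing a frequent pair with one new token, which is why the encoded form ends up shorter than the original byte string while still decoding back to the same text.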
Installation
From PyPI
pip install supertokens
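To confirm that the install succeeded, one option is to ask Python's standard `importlib.metadata` for the installed distribution version. This is a minimal sketch; the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `supertokens`, as in the `pip` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("supertokens"))
```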
From source
Prerequisites:
git clone https://github.com/shaheen-coder/supertokens.git
cd supertokens
pip install .
Or, for an editable/dev install:
pip install --no-build-isolation -e .
Building Manually (Meson)
# Configure
meson setup build --wipe
# Compile
meson compile -C build
# Run tests
meson test -C build
Quick Start
from supertokens import BPE
# Load a pre-trained tokenizer
tokenizer = BPE.from_file("tokenizer.sha")
# Encode text
ids = tokenizer.encode("Hello, world!")
print(ids) # [15496, 11, 995, 0]
# Decode tokens
text = tokenizer.decode(ids)
print(text) # "Hello, world!"
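A quick way to sanity-check a tokenizer is to verify that text survives an encode/decode round trip, including non-ASCII input. The sketch below relies only on the `encode` and `decode` methods shown in the Quick Start. `ByteTokenizer` is a hypothetical stand-in so the example runs without a tokenizer file, and `failed_round_trips` is an illustrative helper, not part of the supertokens API.

```python
class ByteTokenizer:
    """Hypothetical stand-in with the same encode/decode shape as the Quick Start."""

    def encode(self, text):
        # One token id per UTF-8 byte.
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")


def failed_round_trips(tokenizer, samples):
    """Return the samples that do not come back unchanged from encode -> decode."""
    return [s for s in samples if tokenizer.decode(tokenizer.encode(s)) != s]


samples = ["Hello, world!", "naïve café", "日本語", ""]
print(failed_round_trips(ByteTokenizer(), samples))  # []
```

With a real tokenizer, pass the loaded object in place of `ByteTokenizer()`. A non-empty result is not necessarily a bug, since some tokenizers normalize text on purpose, but it shows which inputs to inspect.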
API Reference
supertokens.model
Lower-level access to the underlying model data structures. See python/supertokens/model.py for details.
Project Structure
supertokens/
├── python/supertokens/          # Python package
│   ├── __init__.py
│   ├── BPE.py                   # High-level BPE tokenizer
│   └── model.py                 # Model data types
│
├── src/                         # C++ extension source
│   ├── bindings.cxx             # nanobind Python bindings
│   ├── tokenizer/
│   │   ├── bpe.cxx / .hxx       # BPE algorithm
│   │   └── datatypes.hxx        # Shared types
│   ├── utils/                   # String & BPE utilities
│   └── libs/
│       ├── expected.hxx         # std::expected polyfill
│       └── utfcpp/              # UTF-8 string handling
│
├── subprojects/                 # Meson subprojects (vendored)
│   ├── nanobind-2.12.0/         # Python/C++ bindings
│   └── robin-map-1.4.0/         # Fast hash map
│
├── meson.build                  # Top-level build definition
└── pyproject.toml               # Python package metadata
Dependencies
| Dependency | Version | Role |
|---|---|---|
| nanobind | 2.12.0 | C++/Python bindings |
| tsl::robin-map | 1.4.0 | High-performance hash map |
| utfcpp | bundled | UTF-8 string processing |
| expected.hxx | bundled | Error handling polyfill |
nanobind and robin-map are vendored under subprojects/ and managed by Meson; utfcpp and expected.hxx are bundled under src/libs/. No manual installation is required.
Code style
black python/
ruff check python/
Changelog
See CHANGELOG.md for release history.
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/my-feature)
- Commit your changes
- Open a Pull Request
For major changes, open an issue first to discuss the approach.
License
This project is licensed under the MIT License. See LICENSE for details.
Third-party licenses:
- nanobind: BSD 3-Clause (subprojects/nanobind-2.12.0/LICENSE)
- tsl::robin-map: MIT (subprojects/robin-map-1.4.0/LICENSE)