Fast BPE tokenizer with C++ backend

🚀 supertokens

A high-performance tokenizer for Python, powered by a C++ extension built with nanobind and Meson.

Supported platforms: Linux, macOS, Windows

✨ Features

  • ⚡ Fast: core tokenization in C++, zero-copy bindings via nanobind
  • 🔤 BPE tokenization: Byte-Pair Encoding implementation with full Unicode support via utfcpp
  • 🗺️ Robin-map backend: hash map operations use tsl::robin_map for cache-friendly performance
  • 🐍 Clean Python API: expressive, Pythonic interface over the C++ core
  • 🏗️ Meson build system: reproducible builds, easy subproject management
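
To make the BPE bullet concrete, here is a minimal pure-Python sketch of how byte-pair merges can be applied at encode time. It is illustrative only: `bpe_encode`, `TOY_MERGES`, and the convention that a lower merged ID means an earlier merge are all assumptions made for this example, and none of it reflects how the supertokens C++ core is actually written.

```python
def bpe_encode(text, merges):
    """Greedy BPE over UTF-8 bytes.

    `merges` maps a pair of token IDs to the ID of the merged token.
    Assumption for this sketch: a lower merged ID means an earlier
    (higher-priority) merge.
    """
    ids = list(text.encode("utf-8"))  # start from raw bytes: IDs 0..255
    while len(ids) >= 2:
        # Pick the adjacent pair whose merge has the highest priority.
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no known merge applies any more
        new_id = merges[best]
        out, i = [], 0
        while i < len(ids):
            # Replace every non-overlapping occurrence of the pair.
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


# Toy merge table (hypothetical): "l"+"l" -> 256, then "e"+256 -> 257.
TOY_MERGES = {(108, 108): 256, (101, 256): 257}
print(bpe_encode("hello", TOY_MERGES))  # [104, 257, 111]
```

Starting from bytes rather than characters means any Unicode input can be represented even with an empty merge table, at the cost of longer sequences for non-ASCII text.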

📦 Installation

From PyPI

pip install supertokens
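
After installing, it can be useful to confirm which version ended up in the active environment. The sketch below uses only the standard library; `installed_version` is a hypothetical helper written for this example, not part of supertokens.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("supertokens"))
```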

From source

Prerequisites:

  • Python ≥ 3.10
  • C++17 compiler (g++, clang++, or MSVC 2019+)
  • Meson ≥ 1.1 and Ninja

git clone https://github.com/yourorg/supertokens.git
cd supertokens
pip install .

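Or, for an editable (development) install. Because --no-build-isolation is used, the build requirements declared in pyproject.toml must already be installed in the active environment: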

pip install --no-build-isolation -e .

🔧 Building Manually (Meson)

# Configure
meson setup build --wipe

# Compile
meson compile -C build

# Run tests
meson test -C build

🚀 Quick Start

from supertokens import BPE

# Load a pre-trained tokenizer
tokenizer = BPE.from_file("tokenizer.sha")

# Encode text
ids = tokenizer.encode("Hello, world!")
print(ids)  # e.g. [15496, 11, 995, 0]; actual IDs depend on the loaded vocabulary

# Decode tokens
text = tokenizer.decode(ids)
print(text)  # "Hello, world!"
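
A quick way to sanity-check a loaded tokenizer is an encode/decode round trip over a few sample strings. The sketch below relies only on the `encode` and `decode` methods shown above; `roundtrip_failures` and the stand-in `ByteTokenizer` class are hypothetical names introduced for this example, so that it runs without a tokenizer file.

```python
def roundtrip_failures(tokenizer, samples):
    """Return the samples for which decode(encode(s)) != s."""
    failures = []
    for s in samples:
        ids = tokenizer.encode(s)
        if tokenizer.decode(ids) != s:
            failures.append(s)
    return failures


class ByteTokenizer:
    """Stand-in with the same encode/decode shape (not part of supertokens)."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")


samples = ["Hello, world!", "naïve café", "日本語", ""]
print(roundtrip_failures(ByteTokenizer(), samples))  # []
```

With the real package, pass the object returned by BPE.from_file(...) in place of ByteTokenizer(). Whether every input round-trips exactly is something to test for a given tokenizer file, not a guarantee stated by the project.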

📖 API Reference

supertokens.model

Lower-level access to the underlying model data structures. See python/supertokens/model.py for details.


🗂️ Project Structure

supertokens/
├── python/supertokens/       # Python package
│   ├── __init__.py
│   ├── BPE.py                # High-level BPE tokenizer
│   └── model.py              # Model data types
│
├── src/                      # C++ extension source
│   ├── bindings.cxx          # nanobind Python bindings
│   ├── tokenizer/
│   │   ├── bpe.cxx / .hxx    # BPE algorithm
│   │   └── datatypes.hxx     # Shared types
│   ├── utils/                # String & BPE utilities
│   └── libs/
│       ├── expected.hxx      # std::expected polyfill
│       └── utfcpp/           # UTF-8 string handling
│
├── subprojects/              # Meson subprojects (vendored)
│   ├── nanobind-2.12.0/      # Python ↔ C++ bindings
│   └── robin-map-1.4.0/      # Fast hash map
│
├── meson.build               # Top-level build definition
└── pyproject.toml            # Python package metadata

🔩 Dependencies

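| Dependency                  | Version | Role                      |
|-----------------------------|---------|---------------------------|
| nanobind                    | 2.12.0  | C++ ↔ Python bindings     |
| robin-map (tsl::robin_map)  | 1.4.0   | High-performance hash map |
| utfcpp                      | bundled | UTF-8 string processing   |
| expected.hxx                | bundled | Error handling polyfill   |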

nanobind and robin-map are vendored under subprojects/ and managed by Meson; utfcpp and expected.hxx are bundled under src/libs/. No manual installation is required.


Code style

  • C++: follow the existing .cxx/.hxx style; C++17 standard
  • Python: format with Black and lint with Ruff:

black python/
ruff check python/

📋 Changelog

See CHANGELOG.md for release history.


🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Commit your changes
  4. Push the branch to your fork
  5. Open a Pull Request

For major changes, open an issue first to discuss the approach.


📄 License

This project is licensed under the MIT License. See LICENSE for details.

Third-party licenses:

Download files

Source Distribution

supertokens-0.1.5.tar.gz (2.0 MB)

Uploaded Source

Built Distributions

supertokens-0.1.5-cp313-cp313-win_amd64.whl (92.5 kB)

Uploaded CPython 3.13, Windows x86-64

supertokens-0.1.5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (169.9 kB)

Uploaded CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

supertokens-0.1.5-cp313-cp313-macosx_11_0_arm64.whl (72.0 kB)

Uploaded CPython 3.13, macOS 11.0+ ARM64
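
The wheel filenames above encode which interpreter, ABI, and platform each build targets. As a rough sketch, the last three dash-separated fields of a wheel filename can be read off as the Python tag, ABI tag, and platform tag; `wheel_tags` is a hypothetical helper written for this example and does no validation beyond a basic shape check.

```python
def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel filename: " + filename)
    # The final three fields are always the python, ABI, and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag


name = "supertokens-0.1.5-cp313-cp313-macosx_11_0_arm64.whl"
print(wheel_tags(name))  # ('cp313', 'cp313', 'macosx_11_0_arm64')
```

All three wheels listed here carry the cp313 tag, so on other Python versions or platforms pip would normally fall back to the source distribution, which needs the compiler and build tools from the prerequisites.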

File details

Details for the file supertokens-0.1.5.tar.gz.

File metadata

  • Size: 2.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for supertokens-0.1.5.tar.gz
Algorithm Hash digest
SHA256 39694b02804d5f88072077ab1ab5f622d6ccf1d3421ad80d7bfb7dac9acbdc51
MD5 d2c6552d7a5e58ece8b0852962b3ed45
BLAKE2b-256 ca4d73a01376718d6bd901f14f5bffba5edce9705c8469889193d9564abc8c24
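
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the file path in the usage note is an assumption about where the download was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 with a published digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, calling matches("supertokens-0.1.5.tar.gz", "39694b02804d5f88072077ab1ab5f622d6ccf1d3421ad80d7bfb7dac9acbdc51") should return True for an intact copy of the source distribution saved under that name in the current directory.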

Provenance

The following attestation bundles were made for supertokens-0.1.5.tar.gz:

Publisher: main.yml on shaheen-coder/supertokens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file supertokens-0.1.5-cp313-cp313-win_amd64.whl.

File hashes

Hashes for supertokens-0.1.5-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 cb5b819c5443ac7bb43f634a80a06723e40a24996cf36a334cf0ed31f1cc7c09
MD5 327a7562239f809852fe82b47b8c8b32
BLAKE2b-256 621118927e03e3336e9925f8d1e3353a76e154f940252b9d21a5d2a2229f5250

Provenance

The following attestation bundles were made for supertokens-0.1.5-cp313-cp313-win_amd64.whl:

Publisher: main.yml on shaheen-coder/supertokens

File details

Details for the file supertokens-0.1.5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for supertokens-0.1.5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 14c1a11e6b2c12db3dd7df0dce194d543ce3d2bf23ce0d7284f7e212a092d68a
MD5 5dfce11756cb502e36411d3ea1e255b3
BLAKE2b-256 61ff70fc977c23893513164722c5cfbe957a43aea4e4d5ae974185f1e6547144

Provenance

The following attestation bundles were made for supertokens-0.1.5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: main.yml on shaheen-coder/supertokens

File details

Details for the file supertokens-0.1.5-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for supertokens-0.1.5-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 450f7cfca402489f541eb844ccd482182ce528c9948c64ef65d5ba47347512d7
MD5 d1a62f5ecfdf4a5132afc0ab61c501b0
BLAKE2b-256 c120a09ebdb46a0b102c2ca8a22eb3e07123a6571c8d0561bfd845049a47af3a

Provenance

The following attestation bundles were made for supertokens-0.1.5-cp313-cp313-macosx_11_0_arm64.whl:

Publisher: main.yml on shaheen-coder/supertokens
