Fast BPE tokenizer with C++ backend

Project description

🚀 supertokens

A high-performance tokenizer for Python, powered by a C++ extension built with nanobind and Meson.


✨ Features

  • ⚡ Fast: core tokenization in C++, zero-copy bindings via nanobind
  • 🔤 BPE tokenization: Byte-Pair Encoding implementation with full Unicode support via utfcpp
  • 🗺️ Robin-map backend: hash map operations use tsl::robin_map for cache-friendly performance
  • 🐍 Clean Python API: expressive, Pythonic interface over the C++ core
  • 🏗️ Meson build system: reproducible builds, easy subproject management
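
To make the BPE bullet concrete, here is a minimal pure-Python sketch of byte-pair encoding over UTF-8 bytes. It illustrates the general technique only and is not the supertokens implementation; the names `train_bpe`, `encode_bpe`, and `decode_bpe` are hypothetical and are not part of the package.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair in ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return None
    # Ties go to the pair that was seen first (Counter keeps insertion order).
    return max(pairs, key=pairs.get)


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merge rules, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new token id
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # ids 0..255 are reserved for single bytes
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode_bpe(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode_bpe(ids, merges):
    """Expand merged ids back to bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Starting from bytes rather than characters means any Unicode input can be encoded, because every string reduces to values in the range 0..255 before merging begins.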

📦 Installation

From PyPI

pip install supertokens

From source

Prerequisites:

  • Python ≥ 3.10
  • C++17 compiler (g++, clang++, or MSVC 2019+)
  • Meson ≥ 1.1 and Ninja

git clone https://github.com/yourorg/supertokens.git
cd supertokens
pip install .

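For an editable (development) install, use the command below. The --no-build-isolation flag makes pip build in the current environment, so the build requirements declared in pyproject.toml must already be installed there: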

pip install --no-build-isolation -e .

🔧 Building Manually (Meson)

# Configure
meson setup build --wipe

# Compile
meson compile -C build

# Run tests
meson test -C build

🚀 Quick Start

from supertokens import BPE

# Load a pre-trained tokenizer
tokenizer = BPE.from_file("tokenizer.sha")

# Encode text
ids = tokenizer.encode("Hello, world!")
print(ids)  # e.g. [15496, 11, 995, 0] (exact ids depend on the loaded vocabulary)

# Decode tokens
text = tokenizer.decode(ids)
print(text)  # Hello, world!
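
A quick sanity check for a loaded tokenizer is that decoding the encoded ids returns the original text. The sketch below assumes only the `encode` and `decode` methods shown above. `roundtrip_failures` is a hypothetical helper, and `ByteTokenizer` is a stand-in so the example runs without the package or a tokenizer file. Whether a given vocabulary round-trips every input exactly depends on how it was trained and normalized.

```python
def roundtrip_failures(tokenizer, texts):
    """Return the texts that do not survive encode() followed by decode()."""
    failures = []
    for text in texts:
        ids = tokenizer.encode(text)
        if tokenizer.decode(ids) != text:
            failures.append(text)
    return failures


class ByteTokenizer:
    """Stand-in with the same encode/decode shape: one id per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        return bytes(ids).decode("utf-8")


samples = ["Hello, world!", "naïve café", "日本語", ""]
print(roundtrip_failures(ByteTokenizer(), samples))  # []
```

With the real package, pass the object returned by BPE.from_file in place of ByteTokenizer().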

📖 API Reference

supertokens.model

Lower-level access to the underlying model data structures. See python/supertokens/model.py for details.


🗂️ Project Structure

supertokens/
├── python/supertokens/       # Python package
│   ├── __init__.py
│   ├── BPE.py                # High-level BPE tokenizer
│   └── model.py              # Model data types
│
├── src/                      # C++ extension source
│   ├── bindings.cxx          # nanobind Python bindings
│   ├── tokenizer/
│   │   ├── bpe.cxx / .hxx    # BPE algorithm
│   │   └── datatypes.hxx     # Shared types
│   ├── utils/                # String & BPE utilities
│   └── libs/
│       ├── expected.hxx      # std::expected polyfill
│       └── utfcpp/           # UTF-8 string handling
│
├── subprojects/              # Meson subprojects (vendored)
│   ├── nanobind-2.12.0/      # Python ↔ C++ bindings
│   └── robin-map-1.4.0/      # Fast hash map
│
├── meson.build               # Top-level build definition
└── pyproject.toml            # Python package metadata

🔩 Dependencies

| Dependency | Version | Role |
|---|---|---|
| nanobind | 2.12.0 | C++ ↔ Python bindings |
| robin-map (tsl::robin_map) | 1.4.0 | High-performance hash map |
| utfcpp | bundled | UTF-8 string processing |
| expected.hxx | bundled | Error handling polyfill |

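All C++ dependencies ship with the source tree: nanobind and robin-map are vendored under subprojects/ and managed by Meson, while utfcpp and expected.hxx are bundled under src/libs/. No manual installation is required.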


Code style

  • C++: follow the existing .cxx/.hxx style; C++17 standard
  • Python: Black + Ruff

black python/
ruff check python/

📋 Changelog

See CHANGELOG.md for release history.


🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Commit your changes
  4. Open a Pull Request

For major changes, open an issue first to discuss the approach.


📄 License

This project is licensed under the MIT License. See LICENSE for details.

Third-party licenses:


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

supertokens-0.1.0.tar.gz (2.0 MB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

supertokens-0.1.0-cp313-cp313-win_amd64.whl (92.3 kB)

Uploaded CPython 3.13, Windows x86-64

supertokens-0.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (169.4 kB)

Uploaded CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

supertokens-0.1.0-cp313-cp313-macosx_11_0_arm64.whl (71.8 kB)

Uploaded CPython 3.13, macOS 11.0+ ARM64
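
The wheel names above follow the standard pattern name-version(-build)-python-abi-platform.whl. As a rough sketch of how to read them, the hypothetical helper `split_wheel_name` below breaks a file name into those fields. It is a simplified illustration, not a full parser; the third-party packaging library offers a more complete one.

```python
def split_wheel_name(filename):
    """Split a wheel file name into its dash-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields in {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        # The optional build tag sits between the version and the Python tag.
        "build": parts[2] if len(parts) == 6 else None,
        "python": python_tag,
        "abi": abi_tag,
        # Several compatible platforms can be joined with dots.
        "platforms": platform_tag.split("."),
    }


info = split_wheel_name(
    "supertokens-0.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl"
)
print(info["python"], info["platforms"])
```

Here cp313 means CPython 3.13 for both the interpreter and ABI tags, and the dotted platform field lists every platform the wheel is compatible with.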

File details

Details for the file supertokens-0.1.0.tar.gz.

File metadata

  • Download URL: supertokens-0.1.0.tar.gz
  • Upload date:
  • Size: 2.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for supertokens-0.1.0.tar.gz
Algorithm Hash digest
SHA256 1a0674c7ab1d9fee4cb5ae1e3e6430b72db8ff8d047d6110ab359810d6b54dac
MD5 31cbdd46cea6b4dbd689b1ab781a752d
BLAKE2b-256 76b72cd5a8876dfa1a0e42d2d372826c72552454e0c2b2fca3bfea5ea4cc668a
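
To check a downloaded file against the SHA256 digest listed above, a short script using the standard-library hashlib module is enough. This is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA-256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Call matches_published_hash with the path of the downloaded supertokens-0.1.0.tar.gz and the SHA256 value from the table; a False result means the file differs from the one that was published.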

See more details on using hashes here.

Provenance

The following attestation bundles were made for supertokens-0.1.0.tar.gz:

Publisher: main.yml on shaheen-coder/supertokens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file supertokens-0.1.0-cp313-cp313-win_amd64.whl.

File hashes

Hashes for supertokens-0.1.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 b23784aabd08b34bbc05e1ac75142268b1a4b4a4181d5fb2377f2b28043dd3ad
MD5 de1ec37bd406ef4f1551227eef4b21a1
BLAKE2b-256 bcfefe7e0aa23bb28ca8c3cf7315730add88f1e128bc15fd7e36e5bc7b33322b

Provenance

The following attestation bundles were made for supertokens-0.1.0-cp313-cp313-win_amd64.whl:

Publisher: main.yml on shaheen-coder/supertokens

File details

Details for the file supertokens-0.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for supertokens-0.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 dfdf5492f3a733b8aac183c8334153a27986c3cd2c7d9053e968a4906ecc500d
MD5 30fb702e813011909e16bd76f5c7324f
BLAKE2b-256 21a714b14a64bcf32eb95837df7b80fc2d1cb3a9cc778110d50ee51ef278baf1

Provenance

The following attestation bundles were made for supertokens-0.1.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl:

Publisher: main.yml on shaheen-coder/supertokens

File details

Details for the file supertokens-0.1.0-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for supertokens-0.1.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c2fe5b324bdae13d35b0f53cd4c6732732f30f84a5586090f34650f889b60394
MD5 05e0c8992495b8e57faed3413b3dc8fd
BLAKE2b-256 ff265fa93c469facb483542fce20b55620716a439423b77c1b9bcd112a866d4b

Provenance

The following attestation bundles were made for supertokens-0.1.0-cp313-cp313-macosx_11_0_arm64.whl:

Publisher: main.yml on shaheen-coder/supertokens
