High-performance BPE tokenizer. 20-60x faster than tiktoken.

nanotok

A high-performance BPE tokenizer written in C++ with Python bindings. 20-60x faster than tiktoken.

Installation

pip install nanotok

With optional dependencies:

pip install "nanotok[all]"  # includes huggingface-hub and jinja2
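
To check which optional features are usable in the current environment, a small sketch like the following may help. It only probes for the two optional dependencies named above; the `OPTIONAL_MODULES` mapping and the `optional_features` helper are hypothetical names for this example, not part of nanotok.

```python
from importlib.util import find_spec

# Import names of the optional dependencies pulled in by the "all" extra.
# The feature-to-module mapping is an assumption based on the Quick Start comments.
OPTIONAL_MODULES = {
    "from_pretrained": "huggingface_hub",
    "apply_chat_template": "jinja2",
}


def optional_features():
    """Map each optional feature to True if its dependency is importable."""
    return {
        feature: find_spec(module) is not None
        for feature, module in OPTIONAL_MODULES.items()
    }


for feature, available in optional_features().items():
    status = "available" if available else "missing dependency"
    print(f"{feature}: {status}")
```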

Quick Start

from nanotok import Tokenizer

# Load from Hugging Face Hub (requires huggingface-hub)
tokenizer = Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Load from tiktoken encoding
tokenizer = Tokenizer.from_tiktoken("cl100k_base")

# Load from local file
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")

# Encode/decode
ids = tokenizer.encode("Hello, world!")
text = tokenizer.decode(ids)

# Batch processing
batch_ids = tokenizer.encode_batch(["Hello", "World"])
batch_texts = tokenizer.decode_batch(batch_ids)

# Hugging Face-style API
result = tokenizer("Hello, world!", padding=True, return_tensors="pt")
print(result["input_ids"], result["attention_mask"])

# Chat templates (requires jinja2)
messages = [{"role": "user", "content": "Hello!"}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
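
The `padding=True` call above returns `input_ids` together with an `attention_mask`. The sketch below illustrates that convention in plain Python so the shape of the result is clear. It is not nanotok's implementation: the `pad_batch` helper is hypothetical, and right-padding with a pad ID of 0 is an assumption, since the padding side and pad token are not specified here.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad token ID lists to equal length and build attention masks.

    pad_id=0 and right-side padding are assumptions for illustration.
    """
    max_len = max((len(ids) for ids in batch_ids), default=0)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[15, 7, 9], [4]], pad_id=0)
print(padded["input_ids"])       # [[15, 7, 9], [4, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```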

Features

  • Fast: 20-60x faster than tiktoken, written in C++ with SIMD optimizations
  • Compatible: Drop-in replacement for tiktoken and Hugging Face tokenizers
  • Batch processing: Efficient batch encode/decode
  • Chat templates: Support for Jinja2 chat templates
  • Special tokens: Full support for special token handling
  • Cache: Built-in encoding cache for repeated text
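
Speedups depend on hardware, text, and batch size, so it may be worth measuring throughput on your own workload. The following is a minimal timing sketch, not an official benchmark: `tokens_per_second` is a hypothetical helper that accepts any callable mapping a string to a sequence of tokens, and `str.split` stands in for a real encoder so the example runs without any tokenizer installed.

```python
import time


def tokens_per_second(encode, texts, repeats=5):
    """Return the best observed throughput of `encode` over `texts`."""
    best = float("inf")
    total_tokens = 0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(len(encode(t)) for t in texts)
        best = min(best, time.perf_counter() - start)
    return total_tokens / best if best > 0 else float("inf")


# Stand-in encoder; swap in tokenizer.encode to measure a real tokenizer.
sample = ["Hello, world!"] * 1000
print(f"{tokens_per_second(str.split, sample):,.0f} tokens/s")
```

When the same texts are encoded repeatedly, the built-in cache can dominate the measurement. Calling `set_cache_enabled(False)` first may give a fairer comparison.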

API Reference

Tokenizer

Class Methods

  • from_file(path) - Load from tokenizer.json file
  • from_pretrained(repo_id) - Load from Hugging Face Hub
  • from_tiktoken(encoding_name) - Load from tiktoken encoding (gpt2, r50k_base, p50k_base, cl100k_base, o200k_base)
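
The three loaders take different kinds of input. One way to choose between them from a single string is sketched below. `pick_loader` is a hypothetical helper, and the rules it applies (known encoding name, then `.json` path or existing file, then Hub repo ID) are assumptions for this example rather than behavior nanotok provides.

```python
import os

# Encoding names listed for from_tiktoken above.
TIKTOKEN_ENCODINGS = {"gpt2", "r50k_base", "p50k_base", "cl100k_base", "o200k_base"}


def pick_loader(source):
    """Return the name of the Tokenizer class method suited to `source`."""
    # Note: "gpt2" is also a valid Hub repo ID; this sketch prefers tiktoken.
    if source in TIKTOKEN_ENCODINGS:
        return "from_tiktoken"
    if source.endswith(".json") or os.path.isfile(source):
        return "from_file"
    return "from_pretrained"  # assume anything else is a Hub repo ID


for src in ["cl100k_base", "path/to/tokenizer.json", "meta-llama/Meta-Llama-3-8B"]:
    print(src, "->", pick_loader(src))
    # With nanotok installed: getattr(Tokenizer, pick_loader(src))(src)
```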

Methods

  • encode(text, allowed_special=None, add_special_tokens=False) - Encode text to token IDs
  • decode(ids, skip_special_tokens=False) - Decode token IDs to text
  • encode_batch(texts, ...) - Batch encode
  • decode_batch(batch_ids, ...) - Batch decode
  • token_to_id(token) - Get ID for token
  • id_to_token(id) - Get token for ID
  • apply_chat_template(messages, tokenize=True, add_generation_prompt=False) - Apply chat template
  • clear_cache() - Clear encoding cache
  • set_cache_enabled(enabled) - Enable/disable cache
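
To make the semantics of `clear_cache()` and `set_cache_enabled(enabled)` concrete, here is a conceptual sketch of an encoding cache wrapped around an arbitrary encode function. It is illustrative only: `CachedEncoder` and `fake_encode` are hypothetical, and nanotok's actual cache lives in C++ and may use a different keying or eviction strategy.

```python
class CachedEncoder:
    """Wrap an encode callable with a dict-based cache (illustrative only)."""

    def __init__(self, encode):
        self._encode = encode
        self._cache = {}
        self._enabled = True

    def encode(self, text):
        if not self._enabled:
            return self._encode(text)
        if text not in self._cache:
            self._cache[text] = self._encode(text)
        # Return a copy so callers cannot mutate the cached list.
        return list(self._cache[text])

    def clear_cache(self):
        self._cache.clear()

    def set_cache_enabled(self, enabled):
        self._enabled = bool(enabled)


calls = []


def fake_encode(text):
    """Stand-in encoder that records each call it actually performs."""
    calls.append(text)
    return [ord(c) for c in text]


enc = CachedEncoder(fake_encode)
enc.encode("hi")
enc.encode("hi")   # served from the cache
print(len(calls))  # 1
```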

Properties

  • vocab_size - Vocabulary size
  • special_tokens - Dict of special tokens
  • eos_token, bos_token, pad_token, unk_token - Special token strings
  • eos_token_id, bos_token_id, pad_token_id, unk_token_id - Special token IDs

Development

# Clone and install with uv
git clone https://github.com/ishaan/nanotok
cd nanotok
uv sync

# Run tests
uv run pytest

# Build wheel
uv build

License

MIT

Download files

Source Distribution

nanotok-0.1.0.tar.gz (1.2 MB)

Uploaded Source

Built Distribution

nanotok-0.1.0-cp312-cp312-macosx_15_0_arm64.whl (188.8 kB)

Uploaded CPython 3.12, macOS 15.0+ ARM64

File details

Details for the file nanotok-0.1.0.tar.gz.

File metadata

  • Download URL: nanotok-0.1.0.tar.gz
  • Upload date:
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for nanotok-0.1.0.tar.gz
Algorithm Hash digest
SHA256 012588233d1f376811d7b4904c20b235a73aa0e4b9adc71ed2abeebec987be64
MD5 3a8fad80389c4c4aa5a103b975302edf
BLAKE2b-256 58b6f816e4c6d12dd4a306b54a88abdc427a55ef42ef4a213e07c25e42c6b10e

File details

Details for the file nanotok-0.1.0-cp312-cp312-macosx_15_0_arm64.whl.

File hashes

Hashes for nanotok-0.1.0-cp312-cp312-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 c3ee15da1ad81ddd4f7eca7223a736d9559e2dbb2758a822598b328bf5d67378
MD5 dd84f27b82cb8e7cdbf231c715ce3e12
BLAKE2b-256 b07a9279a97441d489c710f5736d1d2aea45662be25f31a30b6676ccb83d9265
