nanotok
A high-performance BPE tokenizer written in C++ with Python bindings. 20-60x faster than tiktoken.
Installation
pip install nanotok
With optional dependencies:
pip install "nanotok[all]" # includes huggingface-hub and jinja2
Quick Start
from nanotok import Tokenizer
# Load from Hugging Face Hub (requires huggingface-hub)
tokenizer = Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# Load from tiktoken encoding
tokenizer = Tokenizer.from_tiktoken("cl100k_base")
# Load from local file
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")
# Encode/decode
ids = tokenizer.encode("Hello, world!")
text = tokenizer.decode(ids)
# Batch processing
batch_ids = tokenizer.encode_batch(["Hello", "World"])
batch_texts = tokenizer.decode_batch(batch_ids)
# Hugging Face-style API
result = tokenizer("Hello, world!", padding=True, return_tensors="pt")
print(result["input_ids"], result["attention_mask"])
# Chat templates (requires jinja2)
messages = [{"role": "user", "content": "Hello!"}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
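The Hugging Face-style call above returns `input_ids` and `attention_mask`. As a rough illustration of what `padding=True` conventionally means in that API style, the pure-Python sketch below right-pads a batch to its longest sequence and builds the matching mask. `pad_batch` and the pad ID of 0 are hypothetical, chosen only for this example; they are not part of nanotok, and nanotok's actual padding side and return types may differ.

```python
from typing import Dict, List


def pad_batch(batch_ids: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build an attention mask.

    Hypothetical helper for illustration; not part of the nanotok API.
    """
    max_len = max((len(ids) for ids in batch_ids), default=0)
    input_ids = []
    attention_mask = []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs; a real pad ID would come from tokenizer.pad_token_id.
padded = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(padded["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

In practice the pad ID would be read from the `pad_token_id` property listed in the API Reference rather than hard-coded.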
Features
- Fast: 20-60x faster than tiktoken, written in C++ with SIMD optimizations
- Compatible: Drop-in replacement for tiktoken and Hugging Face tokenizers
- Batch processing: Efficient batch encode/decode
- Chat templates: Support for Jinja2 chat templates
- Special tokens: Full support for special token handling
- Cache: Built-in encoding cache for repeated text
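The speed figure above depends on hardware, text, and vocabulary, so it is worth measuring on your own workload. The sketch below is one minimal way to do that with only the standard library: it times any `encode` callable over a list of texts and reports the best observed tokens per second. `tokens_per_second` and `fake_encode` are hypothetical names introduced for this example, and the stand-in encoder exists only so the snippet runs without a tokenizer installed.

```python
import time
from typing import Callable, List, Sequence


def tokens_per_second(
    encode: Callable[[str], Sequence[int]],
    texts: List[str],
    repeats: int = 5,
) -> float:
    """Return the best observed throughput (tokens/second) over `repeats` passes."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        n_tokens = sum(len(encode(t)) for t in texts)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, n_tokens / elapsed)
    return best


def fake_encode(text: str) -> List[int]:
    """Stand-in encoder: one made-up ID per whitespace-separated word."""
    return [len(word) for word in text.split()]


corpus = ["the quick brown fox jumps over the lazy dog"] * 1000
rate = tokens_per_second(fake_encode, corpus)
print(f"{rate:,.0f} tokens/s")
```

To compare libraries, pass `tokenizer.encode` from nanotok and, for example, `tiktoken.get_encoding("cl100k_base").encode`, then divide the two rates. Because nanotok caches encodings of repeated text, use varied inputs or call `set_cache_enabled(False)` first so the measurement reflects tokenization rather than cache hits.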
API Reference
Tokenizer
Class Methods
- from_file(path) - Load from tokenizer.json file
- from_pretrained(repo_id) - Load from Hugging Face Hub
- from_tiktoken(encoding_name) - Load from tiktoken encoding (gpt2, r50k_base, p50k_base, cl100k_base, o200k_base)
Methods
- encode(text, allowed_special=None, add_special_tokens=False) - Encode text to token IDs
- decode(ids, skip_special_tokens=False) - Decode token IDs to text
- encode_batch(texts, ...) - Batch encode
- decode_batch(batch_ids, ...) - Batch decode
- token_to_id(token) - Get ID for token
- id_to_token(id) - Get token for ID
- apply_chat_template(messages, tokenize=True, add_generation_prompt=False) - Apply chat template
- clear_cache() - Clear encoding cache
- set_cache_enabled(enabled) - Enable/disable cache
Properties
- vocab_size - Vocabulary size
- special_tokens - Dict of special tokens
- eos_token, bos_token, pad_token, unk_token - Special token strings
- eos_token_id, bos_token_id, pad_token_id, unk_token_id - Special token IDs
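The `clear_cache()` and `set_cache_enabled(enabled)` methods listed above control the encoding cache. How nanotok implements that cache internally is not described here, so the sketch below only illustrates the general idea of memoizing encodings by input text. `CachedEncoder`, its `misses` counter, and the byte-level stand-in encoder are all hypothetical, and the choice to drop stored entries when the cache is disabled is an assumption made for this example.

```python
from typing import Callable, Dict, List


class CachedEncoder:
    """Memoize an encode function by input text (illustrative only)."""

    def __init__(self, encode: Callable[[str], List[int]]) -> None:
        self._encode = encode
        self._cache: Dict[str, List[int]] = {}
        self._enabled = True
        self.misses = 0  # number of calls that reached the underlying encoder

    def encode(self, text: str) -> List[int]:
        if self._enabled and text in self._cache:
            # Return a copy so callers cannot mutate the cached list.
            return list(self._cache[text])
        self.misses += 1
        ids = self._encode(text)
        if self._enabled:
            self._cache[text] = list(ids)
        return ids

    def clear_cache(self) -> None:
        self._cache.clear()

    def set_cache_enabled(self, enabled: bool) -> None:
        self._enabled = enabled
        if not enabled:
            # Assumption for this sketch: disabling also drops stored entries.
            self._cache.clear()


# Stand-in encoder: UTF-8 byte values instead of real BPE token IDs.
enc = CachedEncoder(lambda s: list(s.encode("utf-8")))
enc.encode("hello")
enc.encode("hello")  # served from the cache
print(enc.misses)    # 1
enc.clear_cache()
enc.encode("hello")  # recomputed after clearing
print(enc.misses)    # 2
```

A cache like this pays off when the same strings recur, such as system prompts or chat-template boilerplate, and costs memory when inputs are mostly unique.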
Development
# Clone and install with uv
git clone https://github.com/ishaan/nanotok
cd nanotok
uv sync
# Run tests
uv run pytest
# Build wheel
uv build
License
MIT
Download files
Source Distribution
nanotok-0.1.0.tar.gz (1.2 MB)
Built Distribution
File details
Details for the file nanotok-0.1.0.tar.gz.
File metadata
- Download URL: nanotok-0.1.0.tar.gz
- Upload date:
- Size: 1.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 012588233d1f376811d7b4904c20b235a73aa0e4b9adc71ed2abeebec987be64 |
| MD5 | 3a8fad80389c4c4aa5a103b975302edf |
| BLAKE2b-256 | 58b6f816e4c6d12dd4a306b54a88abdc427a55ef42ef4a213e07c25e42c6b10e |
File details
Details for the file nanotok-0.1.0-cp312-cp312-macosx_15_0_arm64.whl.
File metadata
- Download URL: nanotok-0.1.0-cp312-cp312-macosx_15_0_arm64.whl
- Upload date:
- Size: 188.8 kB
- Tags: CPython 3.12, macOS 15.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c3ee15da1ad81ddd4f7eca7223a736d9559e2dbb2758a822598b328bf5d67378 |
| MD5 | dd84f27b82cb8e7cdbf231c715ce3e12 |
| BLAKE2b-256 | b07a9279a97441d489c710f5736d1d2aea45662be25f31a30b6676ccb83d9265 |