Compact Unicode Token Encoding — semantic-prior-guided contextual tokenization for code
Project description
🐭 CUTE Tokenizer
Compact Unicode Token Encoding
✨ a tokenizer that nibbles your token costs ✨
✨ Highlights
CUTE shrinks code sequences by 35–45% through a two-stage tokenization strategy:
- Pre-encoding via Private-Use-Area Unicode — maps the most frequent words, operators, and identifier sub-parts to single compact characters
- Residual byte-level BPE — handles everything else with standard subword tokenization
The result:
- ⚡ Faster inference — fewer tokens mean shorter sequence lengths and reduced latency
- 💰 Lower API costs — pay for up to 45% fewer tokens per request
- 🔁 Perfectly lossless round-trip — encode and decode with zero information loss
🧀 Quick Start
pip install cute-tokenizer
The wheel ships a trained tokenizer (code corpus). Use it immediately:
from cute_tokenizer import load_default_tokenizer
tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42" # always lossless
Train your own and point at the artifacts:
# Drop a few repos into ./corpus/, then:
cute build --corpus ./corpus --output ./output
from cute_tokenizer import CUTETokenizerFast
tok = CUTETokenizerFast(
tokenizer_file="./output/tokenizer.json",
cute_mapping_file="./output/cute_mapping.json",
)
In this repository, production tokenizer files live under model/ (same files bundled into the package at build time).
Or via AutoTokenizer (after pushing to HF Hub):
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("user/cute-py", trust_remote_code=True)
🔍 How It Works
- Count & select — scan code, count tokens with identifier sub-part boosting, take the smallest set covering 90% of the corpus.
- Assign PUA chars — map each chosen token to a unique Unicode
Private-Use-Area codepoint, starting at
U+E000. Skip codepoints that already appear in the corpus. - Pre-tokenize — at encode time, substitute mapped tokens with their PUA chars (Aho-Corasick, O(n) in input length).
- BPE the rest — feed the residual through a standard byte-level BPE. The PUA chars are atomic vocab entries; they never get further split.
- Decode — the byte-level decoder reconstructs the substituted string; reverse-substitution restores the original text.
Round-trip is byte-equal for any input. We test this with Hypothesis on arbitrary Unicode plus a hand-curated corner-case suite (ZWJ emoji, BOM, control chars, mixed scripts, deep nesting, etc.).
📦 Project Layout
src/cute_tokenizer/
config.py # CUTEConfig — all knobs in one place
patterns.py # token regex + identifier splitter (uses `regex` module)
corpus.py # streaming ingest, dedup, secret scrub, sharding
frequency.py # parallel multiprocess counting
selection.py # coverage-based + quality-filtered token selection
pua.py # Private-Use-Area codepoint allocator
pretokenizer.py # CUTEPreTokenizer (Aho-Corasick + identifier splitting)
trainer.py # build_cute() — orchestrates the full pipeline
decode.py # PUA-aware reverse substitution
tokenizer.py # CUTETokenizerFast (PreTrainedTokenizerFast)
manifest.py # build manifest for reproducibility
cli.py # `cute build`, `cute roundtrip-check`, `cute info`
tests/
unit/ # ~140 unit tests
property/ # Hypothesis round-trip tests
integration/ # full pipeline E2E
benchmarks/
compression.py # CUTE vs tiktoken/GPT-2/CodeLlama
latency.py # encode/decode μs per KB
⚙️ Configuration
from cute_tokenizer import CUTEConfig, build_cute
config = CUTEConfig(
vocab_size=80_000, # total token IDs
coverage_target=0.90, # PUA coverage of total frequency
max_token_len=50, # ignore tokens longer than this
boost_weight=0.3, # identifier sub-part boost
min_bpe_budget=8_000, # minimum learnable merges
seed=42, # determinism
workers=0, # 0 = os.cpu_count()
enable_secret_scrub=True, # drop files containing API keys etc.
)
build_cute("./corpus", "./output", config)
🧪 Testing
pip install -e .[dev]
pytest tests/unit # fast unit tests
pytest tests/property # Hypothesis round-trip
pytest tests/integration # full E2E build (slower)
pytest --cov=cute_tokenizer
The Hypothesis suite runs ~600+ generated test cases per round-trip property, plus a hand-picked corner-case parametrize covering: empty strings, BOM, ZWJ emoji, control chars, multi-script text, deep underscores, and more.
🔐 Production Hardening
- Determinism: same corpus + config → same vocab hash. Verified by
tests/integration/test_determinism.py. - Secret scrubbing: corpus files matching AWS/OpenAI/Anthropic/GitHub key patterns are dropped before vocab construction.
- Build manifest: every build emits
build_manifest.jsonrecording config, corpus hash, vocab hash, library versions, and timing. - PUA collision detection: codepoints found in the corpus are skipped during assignment, so user content cannot be confused with our injection.
- Type-checked:
mypy --strictclean. - Lint clean:
ruff checkandruff format.
📊 Benchmarks
python -m benchmarks.compression --tokenizer ./output --holdout ./holdout
python -m benchmarks.latency --tokenizer ./output
Expected (on a 100 GB Python/TS holdout):
| Metric | CUTE vs byte-level BPE |
|---|---|
| Sequence length (mean) | ⚡ 35–45% shorter |
| Sequence length (p95) | ⚡ 30–40% shorter |
| Sequence length (p99) | ⚡ 25–35% shorter |
| Bytes per token (mean) | 📈 +50–70% |
| Round-trip correctness | ✅ 100% (Hypothesis-verified) |
| Training throughput (LLM) | ⚡ +25–35% |
| Inference latency (LLM) | ⚡ −25–40% |
| API token cost | 💰 −30–45% |
| KV-cache memory at inference | 💾 −35–45% |
| Effective context window (text per token) | 📏 +55–80% |
| Encode latency (tokenizer itself) | 🐢 ~1.5× tiktoken (Python pre-tok overhead) |
Run the benchmarks on your own corpus to see numbers for your distribution.
🐭 Why a Mouse?
A mouse is small, fast, and nibbles things to size. CUTE quietly chews through your token bill while you focus on the model. The cheese is the 30–45% cost reduction.
📜 License
MIT. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cute_tokenizer-0.1.2.tar.gz.
File metadata
- Download URL: cute_tokenizer-0.1.2.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b0efffd79c89d356e78ccc78e8eec974b4e3dbc28ea8bc056092befb87da493a
|
|
| MD5 |
4b128b266b7252f405a70c219239a0ef
|
|
| BLAKE2b-256 |
3da17a21003d99260ca9166e5c63265ea2b5d1d9d6949b1d1c5db779b8b40223
|
File details
Details for the file cute_tokenizer-0.1.2-py3-none-any.whl.
File metadata
- Download URL: cute_tokenizer-0.1.2-py3-none-any.whl
- Upload date:
- Size: 1.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b5b18b836dddc0a89128b8b7db7da8f2e6529d4f1bd33a89c6717b238c1b69fb
|
|
| MD5 |
1fd7546d8a6453cddf4e6e7f5455f680
|
|
| BLAKE2b-256 |
2226fb2534af224c281373a276317a56a3aebfeb2a9e1b9946aa5de32affa88c
|