
🐭 CUTE Tokenizer

Compact Unicode Token Encoding

✨ semantic-prior-guided contextual tokenization for code ✨

Python 3.10+ License: MIT HuggingFace PyPI version CI


✨ Highlights

CUTE is a code-aware tokenizer that combines explicit semantic anchors with contextual subword merges to produce compact, lossless token sequences for Python, TypeScript, JavaScript, Rust, Go, and other common programming languages.

The architecture has two stages:

  • Savings-based PUA mapping — high-value words, operators, and identifier sub-parts are mapped to single Unicode Private-Use-Area characters, ranked by expected token savings vs the cl100k baseline (not raw frequency).
  • Contextual byte-level BPE — the trainer sees PUA-substituted text, so it can learn merges around those anchors (e.g. whitespace + PUA), while a post-train safety filter forbids PUA + PUA pairs to keep the semantic units atomic.
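
The two stages can be illustrated with a toy substitution pass. The anchor list and PUA assignments below are invented for illustration; CUTE learns its mapping from corpus statistics and uses Aho-Corasick matching rather than naive string replacement:

```python
# Toy illustration of stage 1: map high-value strings to single
# supplementary-plane PUA characters, then substitute and reverse.
# The anchor list and codepoint assignments are invented for illustration.
PUA_BASE = 0xF0000  # supplementary-plane Private Use Area

anchors = ["return", "def", "self"]
mapping = {word: chr(PUA_BASE + i) for i, word in enumerate(anchors)}
reverse = {pua: word for word, pua in mapping.items()}

def substitute(text: str) -> str:
    # Naive replacement; assumes the input contains no mapped PUA chars.
    for word, pua in mapping.items():
        text = text.replace(word, pua)
    return text

def restore(text: str) -> str:
    for pua, word in reverse.items():
        text = text.replace(pua, word)
    return text

src = "def f(self): return 42"
sub = substitute(src)
assert len(sub) < len(src)   # shorter stream reaches the BPE trainer
assert restore(sub) == src   # lossless reverse substitution
```

Stage 2 then trains byte-level BPE on the substituted stream, so merges can form around the PUA anchors.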

The result:

  • 🪄 Beats every open-source code tokenizer we tested
  • 🔁 Byte-equal lossless round-trip on arbitrary Unicode (verified on 3,000 held-out files: Python + JS + TS + Rust + Go)
  • 🔒 Deterministic within a fixed (OS, python, tokenizers) host triple
  • 🤗 Drop-in AutoTokenizer compatibility via trust_remote_code=True

📊 Benchmarks

Numbers below are measured, not theoretical, on held-out code that was never seen during training. Lower mean tokens = better compression; higher bytes/token = better.

Python (1,500 held-out files from The Stack)

Tokenizer            mean tokens   bytes/token   vocab   roundtrip
OpenAI cl100k_base       1,874.1          4.17    100k   1500/1500
OpenAI o200k_base        1,885.6          4.14    200k   1500/1500
CUTE                     2,009.3          3.89    150k   1500/1500
StarCoder2               2,210.0          3.53     49k    685/1500 ❌
CodeLlama                2,572.9          3.03     32k   1493/1500 ⚠
GPT-2                    3,580.7          2.18     50k   1500/1500

Multi-language (1,500 held-out files: JS / TS / Rust / Go)

Tokenizer            mean tokens   bytes/token   vocab   roundtrip
OpenAI cl100k_base       1,966.0          3.91    100k   1500/1500
OpenAI o200k_base        1,970.1          3.90    200k   1500/1500
CUTE                     2,078.0          3.70    150k   1500/1500
StarCoder2               2,262.0          3.40     49k    566/1500 ❌
CodeLlama                2,650.2          2.90     32k   1500/1500
GPT-2                    3,365.4          2.28     50k   1500/1500

What this means

  • CUTE beats every open-source code tokenizer we benchmarked (StarCoder2, CodeLlama, GPT-2) on both Python and the multi-lang holdout — by ~9–44% depending on the comparison.
  • OpenAI's cl100k still beats CUTE by ~5–7% on this corpus. We're closing on it but not there yet.
  • CUTE is the only specialty code tokenizer with zero roundtrip failures on the test set. StarCoder2 corrupts ~62% of multi-lang files (934/1,500) and ~54% of Python files (815/1,500); CodeLlama fails to round-trip 7 Python files.
  • CUTE is slower than cl100k at encode time — the Python-side PUA substitution adds overhead. Expect roughly an order of magnitude higher encode latency than cl100k. Decode latency is comparable.

This is the first public release and there is significant room for improvement: bigger and more diverse training corpora, multi-language training tuned for the deployment language, smarter PUA selection, faster Python-side substitution, possibly a Rust pre-tokenizer. This is only the beginning.

Reproduce these numbers locally:

python -m benchmarks.runner \
    --tokenizer ./model \
    --holdout ./your-holdout-corpus \
    --output reports/mine

🧀 Quick Start

pip install cute-tokenizer

The wheel ships a pretrained tokenizer. Use it immediately — no training required:

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"  # always lossless

Use via 🤗 HuggingFace

The same pretrained tokenizer is hosted on the HuggingFace Hub:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)

trust_remote_code=True is required because CUTE's wrapper class (CUTETokenizerFast) does Python-side PUA substitution before delegating to the underlying ByteLevel BPE.

Train your own

# Drop a few repos into ./corpus/, then:
pip install 'cute-tokenizer[baseline]'  # pulls tiktoken for cl100k-aware ranking
cute build --corpus ./corpus --output ./output

Then load the trained tokenizer in Python:

from cute_tokenizer import CUTETokenizerFast

tok = CUTETokenizerFast(
    tokenizer_file="./output/tokenizer.json",
    cute_mapping_file="./output/cute_mapping.json",
)

🔍 How It Works

  1. Corpus ingest — stream files, dedup by content hash, scrub secrets (AWS / OpenAI / Anthropic / GitHub keys, JWTs, PEM private keys), optionally license-filter, write deterministic gzipped shards.
  2. Frequency mining — parallel multiprocess token counter with identifier sub-part boosting (camelCase / snake_case / SCREAMING_CASE).
  3. Savings-based selection — for each candidate token, compute score = frequency × max(0, cl100k_count − 1). Tokens whose cl100k cost is already 1 token (single-byte ASCII such as ( or ,) score zero — byte fallback already handles them optimally. Hashes, UUIDs, and base64 blobs are filtered out by shape.
  4. PUA assignment — selected tokens get unique codepoints in the Unicode supplementary planes (U+F0000+). The Basic Multilingual Plane PUA range (U+E000–U+F8FF) is deliberately skipped because real source code occasionally contains literal BMP PUA chars (Asian fonts, Unicode mapping tables in TS/JS) and using them would cause decode-time collisions.
  5. Contextual BPE training — the training stream is PUA-substituted before it reaches the trainer, so byte-level BPE actually sees PUA chars and can learn merges like [Ġ][⟦return⟧] (whitespace + anchor). PUA chars are also registered as AddedTokens so any anchor that wasn't picked up still has an atomic vocab id.
  6. Atomicity audit — post-train, the merge_policy module walks the tokenizer JSON and (under strict_pua_atomicity) drops any PUA-PUA merges. Four invariants are asserted on every save: model is BPE, decoder is ByteLevel, pre-tokenizer is ByteLevel, every mapping PUA char has a vocab id.
  7. Decode — the byte-level decoder reconstructs the substituted string; reverse-substitution restores the original text.
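
The scoring rule from step 3 can be sketched as follows — the candidate frequencies and baseline token counts here are invented for illustration (the real baseline queries tiktoken's cl100k_base):

```python
# Sketch of step 3's scoring rule: score = frequency * max(0, baseline_count - 1).
# Frequencies and baseline counts below are invented for illustration.

def savings_score(frequency: int, baseline_count: int) -> int:
    # Candidates the baseline already encodes as one token save nothing.
    return frequency * max(0, baseline_count - 1)

candidates = {                      # token: (frequency, baseline token count)
    "(": (1_000_000, 1),            # single-byte ASCII: baseline cost 1
    "unwrap_or_else": (40_000, 5),
    "ToString": (90_000, 2),
}
ranked = sorted(
    candidates,
    key=lambda t: savings_score(*candidates[t]),
    reverse=True,
)
assert savings_score(*candidates["("]) == 0  # byte fallback is already optimal
print(ranked)  # highest expected savings first
```

Note how raw frequency alone would rank ( first; savings-based ranking demotes it to last because mapping it to a PUA char saves nothing.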

Round-trip is byte-equal for any input. We test this with Hypothesis on arbitrary Unicode (including supplementary planes), a hand-curated torture set (ZWJ family emoji, RTL + bidi controls, BOM, control chars, NFC/NFD variants, mixed scripts, deep underscores), and 3,000 held-out real-world code files.
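
A self-contained miniature of that round-trip property, using a toy substitute/restore pair and a few torture strings in the same spirit:

```python
# Minimal round-trip check over a hand-picked Unicode torture set, mirroring
# the property the test suite verifies with Hypothesis. substitute/restore
# are a toy stand-in for CUTE's encode/decode path.
PUA_BASE = 0xF0000
mapping = {"return": chr(PUA_BASE), "def": chr(PUA_BASE + 1)}
reverse = {v: k for k, v in mapping.items()}

def substitute(text: str) -> str:
    # Naive replacement; assumes the input contains no mapped PUA chars.
    for word, pua in mapping.items():
        text = text.replace(word, pua)
    return text

def restore(text: str) -> str:
    for pua, word in reverse.items():
        text = text.replace(pua, word)
    return text

torture = [
    "def f(): return '\U0001F468\u200D\U0001F469'",  # ZWJ family emoji
    "\ufeffdef bom(): pass",                         # BOM
    "\u202Ereturn rtl\u202C",                        # bidi controls
    "cafe\u0301 vs caf\u00e9",                       # NFD vs NFC variants
    "deep__underscore___name",
]
for s in torture:
    assert restore(substitute(s)) == s               # byte-equal round trip
print("all round trips OK")
```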


📦 Project Layout

src/cute_tokenizer/
  baseline.py       # Cl100kBaseline / NullBaseline (savings scoring)
  config.py         # CUTEConfig — all knobs in one place
  patterns.py       # token regex + identifier splitter (uses `regex` module)
  corpus.py         # streaming ingest, dedup, secret scrub, sharding
  frequency.py      # parallel multiprocess counting
  selection.py      # savings-based selection + tightened PUA filter
  pua.py            # Private-Use-Area codepoint allocator (skips BMP by default)
  pretokenizer.py   # PUA substitution (Aho-Corasick + identifier splitting)
  trainer.py        # build_cute() — pre-substituted BPE training
  merge_policy.py   # PUA atomicity audit + invariant assertions
  decode.py         # PUA-aware reverse substitution
  tokenizer.py      # CUTETokenizerFast (PreTrainedTokenizerFast)
  manifest.py       # build manifest for reproducibility
  cli.py            # `cute build`, `cute roundtrip-check`, `cute info`

tests/
  unit/             # ~180 unit tests
  property/         # Hypothesis round-trip + Unicode torture
  integration/      # full pipeline E2E + determinism + collision regressions

benchmarks/
  baselines.py      # cl100k / o200k / gpt2 / codellama / starcoder2 adapters
  runner.py         # research-grade compression + latency report
  compression.py    # legacy compression-only script
  latency.py        # standalone latency benchmark

scripts/
  download_stack_python.py  # download a Stack subset, train/holdout split
  find_roundtrip_failures.py  # diagnostic: find files that don't roundtrip

⚙️ Configuration

from cute_tokenizer import CUTEConfig, Cl100kBaseline, build_cute

config = CUTEConfig(
    vocab_size=200_000,            # total token IDs
    pua_budget=50_000,             # max PUA-mapped tokens
    min_bpe_budget=130_000,        # minimum learnable BPE merges
    max_token_len=50,              # ignore tokens longer than this
    boost_weight=0.3,              # identifier sub-part boost
    seed=42,                       # determinism
    workers=0,                     # 0 = os.cpu_count()
    use_savings_selection=True,    # use cl100k-aware ranking (default)
    strict_pua_atomicity=True,     # forbid PUA+PUA merges (default)
    allow_supplementary_pua=True,  # use full 50k PUA budget
    pua_skip_bmp=True,             # avoid BMP collisions (production default)
    enable_secret_scrub=True,      # drop files containing API keys etc.
)
build_cute("./corpus", "./output", config=config, baseline=Cl100kBaseline())

The vocab math (validated at construction time) is:

byte_alphabet (256) + special_tokens + pua_budget + min_bpe_budget ≤ vocab_size
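
As a back-of-envelope check with the default budgets shown above (the special-token count here is an assumption for illustration):

```python
# Back-of-envelope check of the vocab budget inequality above; the
# special-token count is assumed for illustration (the real validation
# lives in CUTEConfig at construction time).
byte_alphabet = 256
special_tokens = 4          # assumed for illustration
pua_budget = 50_000
min_bpe_budget = 130_000
vocab_size = 200_000

reserved = byte_alphabet + special_tokens + pua_budget + min_bpe_budget
assert reserved <= vocab_size, f"over budget by {reserved - vocab_size}"
print(vocab_size - reserved, "ids of slack above the minimum BPE budget")
```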

🧪 Testing

pip install -e .[dev]
pytest tests/unit          # fast unit tests
pytest tests/property      # Hypothesis round-trip + Unicode torture
pytest tests/integration   # full E2E build (slower)
pytest --cov=cute_tokenizer

🔐 Production Hardening

  • Determinism: same (OS, python, tokenizers, corpus_hash, seed) → byte-identical tokenizer.json. Verified on Linux. Cross-platform byte-identity is explicitly not part of the contract.
  • Roundtrip integrity: 1500/1500 on Python holdout, 1500/1500 on multi-language holdout — verified by the benchmark runner on every release.
  • Atomicity invariants: merge_policy.assert_invariants enforces model.type=BPE, decoder.type=ByteLevel, pre_tokenizer.type=ByteLevel, and that every mapping PUA char has a vocab id, after every save.
  • No BMP-PUA collisions: literal BMP PUA chars in user source (TS Unicode tables, CJK fonts) roundtrip unchanged because we assign mappings only to supplementary-plane PUAs.
  • No special-token text collisions: <s>, </s>, <unk>, <pad> are deliberately not in the default special-token list — they collide with natural text in code.
  • Secret scrubbing: corpus files matching AWS / OpenAI / Anthropic / GitHub / Slack / Google API key patterns, JWTs, and PEM private keys are dropped before vocab construction.
  • Build manifest: every build emits build_manifest.json recording config, baseline name, corpus hash, vocab hash, library versions, merge audit counts, ingest stats, and timing.
  • Lint clean: ruff check and ruff format.
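
For illustration, a simplified version of the kind of pattern filter the secret scrub describes — these regexes are examples, not CUTE's actual patterns:

```python
# Illustrative secret-scrub filter in the spirit of the corpus ingest step.
# These regexes are simplified examples, not CUTE's actual patterns.
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key id
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key
    re.compile(r"eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\."),    # JWT header.payload.
]

def contains_secret(text: str) -> bool:
    # Files matching any pattern are dropped before vocab construction.
    return any(p.search(text) for p in SECRET_PATTERNS)

assert contains_secret("key = 'AKIAABCDEFGHIJKLMNOP'")
assert not contains_secret("def hello(): return 42")
```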

🐭 Why a Mouse?

A mouse is small, fast, and nibbles things to size. CUTE quietly chews through your tokenization while you focus on the model.


📜 License

MIT. See LICENSE.



Download files

Download the file for your platform.

Source Distribution

cute_tokenizer-0.2.0.tar.gz (3.0 MB)

Built Distribution

cute_tokenizer-0.2.0-py3-none-any.whl (3.1 MB)

File details

Details for the file cute_tokenizer-0.2.0.tar.gz.

File metadata

  • Download URL: cute_tokenizer-0.2.0.tar.gz
  • Size: 3.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for cute_tokenizer-0.2.0.tar.gz
Algorithm Hash digest
SHA256 2fcc6301841396c34e50abe9cb7433394f9d2d26c0b96d88702d7919da77e334
MD5 3833213f9d190b47fc022a909e38d4a6
BLAKE2b-256 d8312321256c4c6e5b26dcf0d583251981f662ea2df14eea844128463ec8593a


File details

Details for the file cute_tokenizer-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: cute_tokenizer-0.2.0-py3-none-any.whl
  • Size: 3.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for cute_tokenizer-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a4dece63dce10da7ecdaba732faab489353a14a0309d418e2392822000f7708f
MD5 e95b0c4f4edeb3205149d77d40294e45
BLAKE2b-256 f519acadfe7cb0f2cbf46bdc5c40a7792293ff04c8a08c5f3afed6fa8f751a24

