
Compact Unicode Token Encoding — semantic-prior-guided contextual tokenization for code

Project description

CUTE Tokenizer Mascot

🐭 CUTE Tokenizer

Compact Unicode Token Encoding

✨ semantic-prior-guided contextual tokenization for code ✨

Python 3.10+ License: MIT HuggingFace Compatible PyPI version CI


✨ Highlights

CUTE is a code-aware tokenizer that combines explicit semantic anchors with contextual subword merges to produce compact, lossless token sequences for Python, TypeScript, JavaScript, Rust, Go, and other common programming languages.

The architecture has two stages:

  • Savings-based PUA mapping — high-value words, operators, and identifier sub-parts are mapped to single Unicode Private-Use-Area characters, ranked by expected token savings vs the cl100k baseline (not raw frequency).
  • Contextual byte-level BPE — the trainer sees PUA-substituted text, so it can learn merges around those anchors (e.g. whitespace + PUA), while a post-train safety filter forbids PUA + PUA pairs to keep the semantic units atomic.
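
To make the first stage concrete, here is a minimal, hypothetical sketch of PUA substitution. The anchor map and function below are illustrative only; CUTE mines the real mapping from corpus statistics, applies it with an Aho-Corasick matcher in pretokenizer.py, and writes it to cute_mapping.json.

import re

# Hypothetical anchor map for illustration; the real one is mined from the corpus
# and stored in cute_mapping.json.
ANCHOR_MAP = {"return": "\ue000", "def": "\ue001"}  # word -> BMP Private-Use-Area char

_WORDS = re.compile(r"\b(" + "|".join(map(re.escape, ANCHOR_MAP)) + r")\b")

def substitute(text: str) -> str:
    """Replace anchor words with single PUA characters before BPE ever sees the text."""
    return _WORDS.sub(lambda m: ANCHOR_MAP[m.group(0)], text)

substitute("def hello(): return 42")  # -> '\ue001 hello(): \ue000 42'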

The result:

  • 🪄 Shorter sequences on real code than vanilla byte-level BPE
  • 🔁 Byte-equal lossless round-trip on arbitrary Unicode (Hypothesis-verified)
  • 🔒 Deterministic within a fixed (OS, python, tokenizers) host triple
  • 🤗 Drop-in AutoTokenizer compatibility via trust_remote_code

🧀 Quick Start

pip install cute-tokenizer

The wheel ships a pretrained tokenizer (v1 model, code corpus). Use it immediately — no training required:

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"  # always lossless

Train your own and point at the artifacts:

# Drop a few repos into ./corpus/, then:
pip install 'cute-tokenizer[baseline]'  # pulls tiktoken for cl100k-aware ranking
cute build --corpus ./corpus --output ./output

Then load the artifacts:

from cute_tokenizer import CUTETokenizerFast

tok = CUTETokenizerFast(
    tokenizer_file="./output/tokenizer.json",
    cute_mapping_file="./output/cute_mapping.json",
)

Or via AutoTokenizer (after pushing to HF Hub):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("user/cute-py", trust_remote_code=True)

🔍 How It Works

  1. Corpus ingest — stream files, dedup by content hash, scrub secrets (AWS / OpenAI / Anthropic / GitHub / private keys / JWTs), optionally license-filter, write deterministic gzipped shards.
  2. Frequency mining — parallel multiprocess token counter with identifier sub-part boosting (camelCase / snake_case / SCREAMING_CASE).
  3. Savings-based selection — for each candidate token, compute score = frequency × max(0, cl100k_count − 1). Tokens whose cl100k cost is 1 (single-byte ASCII such as `(` or `,`) score zero — byte fallback already handles them optimally (see the sketch after this list). Hashes / UUIDs / base64 blobs are filtered out by shape.
  4. PUA assignment — selected tokens get unique codepoints in the Private-Use-Area, BMP first (U+E000 …) for the cheapest UTF-8 encoding. Codepoints already present in the corpus are skipped.
  5. Contextual BPE training — the training stream is PUA-substituted before it reaches the trainer, so byte-level BPE actually sees PUA chars and can learn merges like [Ġ][⟦return⟧] (whitespace + anchor). PUA chars are also registered as AddedTokens so any anchor that wasn't picked up still has an atomic vocab id.
  6. Atomicity audit — post-train, the merge_policy module walks the tokenizer JSON and (under strict_pua_atomicity) drops any PUA-PUA merges. Four invariants are asserted on every save: model is BPE, decoder is ByteLevel, pre-tokenizer is ByteLevel, every mapping PUA char has a vocab id.
  7. Decode — the byte-level decoder reconstructs the substituted string; reverse-substitution restores the original text.
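
A minimal sketch of the step-3 scoring rule, assuming tiktoken is installed (pip install 'cute-tokenizer[baseline]'). The function name is illustrative; the real implementation lives in baseline.py and selection.py:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def savings_score(token: str, frequency: int) -> int:
    # frequency × how many token IDs a single PUA anchor would save vs cl100k
    cl100k_count = len(enc.encode(token))
    return frequency * max(0, cl100k_count - 1)

savings_score("self.assertEqual", 10_000)  # multi-token under cl100k -> large savings
savings_score("(", 10_000)                 # already one token -> 0, byte fallback is optimal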

Round-trip is byte-equal for any input. We test this with Hypothesis on arbitrary Unicode (incl. supplementary planes) plus a hand-curated torture set: ZWJ family emoji, RTL+bidi controls, BOM, control chars, NFC/NFD variants, mixed scripts, deep underscores.
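
The property itself is compact. A minimal version in the spirit of the suite (the real tests live under tests/property/ and may use different strategies and settings):

from hypothesis import given, strategies as st
from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()

@given(st.text())  # arbitrary Unicode, including supplementary planes
def test_roundtrip_is_byte_equal(text):
    ids = tok(text, add_special_tokens=False).input_ids
    assert tok.decode(ids, skip_special_tokens=True) == text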


📦 Project Layout

src/cute_tokenizer/
  baseline.py       # Cl100kBaseline / NullBaseline (savings scoring)
  config.py         # CUTEConfig — all knobs in one place
  patterns.py       # token regex + identifier splitter (uses `regex` module)
  corpus.py         # streaming ingest, dedup, secret scrub, sharding
  frequency.py      # parallel multiprocess counting
  selection.py      # savings-based selection + tightened PUA filter
  pua.py            # Private-Use-Area codepoint allocator
  pretokenizer.py   # PUA substitution (Aho-Corasick + identifier splitting)
  trainer.py        # build_cute() — pre-substituted BPE training
  merge_policy.py   # PUA atomicity audit + invariant assertions
  decode.py         # PUA-aware reverse substitution
  tokenizer.py      # CUTETokenizerFast (PreTrainedTokenizerFast)
  manifest.py       # build manifest for reproducibility
  cli.py            # `cute build`, `cute roundtrip-check`, `cute info`

tests/
  unit/             # ~180 unit tests (incl. baseline + selection + merge_policy)
  property/         # Hypothesis round-trip + Unicode torture
  integration/      # full pipeline E2E + determinism

benchmarks/
  compression.py    # CUTE vs cl100k / GPT-2 / CodeLlama
  latency.py        # encode/decode μs per KB

plans/
  cute-refit.md     # 9-step blueprint for the full v2 production refit

⚙️ Configuration

from cute_tokenizer import CUTEConfig, Cl100kBaseline, build_cute

config = CUTEConfig(
    vocab_size=120_000,            # total token IDs
    pua_budget=50_000,             # max PUA-mapped tokens
    min_bpe_budget=50_000,         # minimum learnable BPE merges
    max_token_len=50,              # ignore tokens longer than this
    boost_weight=0.3,              # identifier sub-part boost
    seed=42,                       # determinism
    workers=0,                     # 0 = os.cpu_count()
    use_savings_selection=True,    # use cl100k-aware ranking (default)
    strict_pua_atomicity=True,     # forbid PUA+PUA merges (default)
    allow_supplementary_pua=False, # cap budget at BMP (6,400) for byte efficiency
    enable_secret_scrub=True,      # drop files containing API keys etc.
)
build_cute("./corpus", "./output", config=config, baseline=Cl100kBaseline())

The vocab math (validated at construction time) is:

byte_alphabet (256) + special_tokens + pua_budget + min_bpe_budget ≤ vocab_size
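
With the example config above (and assuming a handful of special tokens, say four), the budget is comfortably satisfied:

assert 256 + 4 + 50_000 + 50_000 <= 120_000  # leaves ~19,740 IDs of headroom for extra merges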

🧪 Testing

pip install -e .[dev]
pytest tests/unit          # fast unit tests
pytest tests/property      # Hypothesis round-trip + Unicode torture
pytest tests/integration   # full E2E build (slower)
pytest --cov=cute_tokenizer

The Hypothesis suite runs hundreds of generated cases per property, plus a hand-picked torture set covering: empty strings, BOM, ZWJ family emoji, RTL+bidi controls, combining marks, control chars, supplementary planes, NFC vs NFD, mixed scripts.


🔐 Production Hardening

  • Determinism: same (OS, python, tokenizers, corpus_hash, seed) → byte-identical tokenizer.json. Verified on Linux by tests/integration/test_tokenizer_determinism.py. Cross-platform byte-identity is explicitly not part of the contract.
  • Atomicity invariants: merge_policy.assert_invariants enforces model.type=BPE, decoder.type=ByteLevel, pre_tokenizer.type=ByteLevel, and that every mapping PUA char has a vocab id, after every save.
  • Secret scrubbing: corpus files matching AWS / OpenAI / Anthropic / GitHub / Slack / Google API key patterns, JWTs, and PEM private keys are dropped before vocab construction.
  • Build manifest: every build emits build_manifest.json recording config, baseline name, corpus hash, vocab hash, library versions, merge audit counts, ingest stats, and timing (a reading sketch follows this list).
  • PUA collision detection: codepoints found in the corpus are skipped during assignment, so user content cannot be confused with our injection.
  • Lint clean: ruff check and ruff format.
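
For the build-manifest bullet above, a hedged reading sketch. The key names used here (corpus_hash, vocab_hash) are assumptions, so check the build_manifest.json emitted by your version for the actual schema:

import json

with open("./output/build_manifest.json") as f:
    manifest = json.load(f)

# Record these alongside your model card so the exact tokenizer build can be reproduced.
print(manifest.get("corpus_hash"), manifest.get("vocab_hash"))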

📊 Benchmarks

# Compare CUTE against cl100k (and other baselines if installed)
python -m benchmarks.compression --tokenizer ./output --holdout ./holdout
python -m benchmarks.latency --tokenizer ./output

The benchmark suite measures bytes-per-token, p50/p95/p99 sequence lengths, encode and decode latency, and peak RSS, on a held-out corpus that was never seen during training. Run it on your own corpus to see numbers for your distribution.
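
For a quick spot check outside the harness, a minimal bytes-per-token comparison might look like this (assumes tiktoken is installed; example.py is a placeholder for any source file):

import tiktoken
from cute_tokenizer import load_default_tokenizer

source = open("example.py", encoding="utf-8").read()
cute = load_default_tokenizer()
cl100k = tiktoken.get_encoding("cl100k_base")

n_bytes = len(source.encode("utf-8"))
print("CUTE   bytes/token:", n_bytes / len(cute(source, add_special_tokens=False).input_ids))
print("cl100k bytes/token:", n_bytes / len(cl100k.encode(source)))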

Quantitative compression claims are intentionally omitted from this README until a reproducible v2-model benchmark against cl100k / GPT-2 / CodeLlama / StarCoder2 is published.


🐭 Why a Mouse?

A mouse is small, fast, and nibbles things to size. CUTE quietly chews through your tokenization while you focus on the model.


📜 License

MIT. See LICENSE.



Download files

Download the file for your platform.

Source Distribution

cute_tokenizer-0.1.3.tar.gz (1.8 MB)


Built Distribution


cute_tokenizer-0.1.3-py3-none-any.whl (1.8 MB)


File details

Details for the file cute_tokenizer-0.1.3.tar.gz.

File metadata

  • Download URL: cute_tokenizer-0.1.3.tar.gz
  • Upload date:
  • Size: 1.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for cute_tokenizer-0.1.3.tar.gz
Algorithm Hash digest
SHA256 799e427d44579be2ae360d9b547a43bf2e6696015b0b99c4bd3cab24afee5b6e
MD5 16d3a432ec287ce835f27814be8a8838
BLAKE2b-256 1f7cd5c50a8836c37babe1b5ac8837400ec8f7f1f5297b978e367b5c24dbd7ed


File details

Details for the file cute_tokenizer-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: cute_tokenizer-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for cute_tokenizer-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 7ab465e69fa02d82eb26f6f05c853087bf890d3db290f6f8cb98040e39c8f36f
MD5 3ee5dd581975aa254fd039d0407056eb
BLAKE2b-256 99b2613fe2254e4078b133b33cc12115af1d5c1826d4d86e78fd44ad4a6ca713

