Compact Unicode Token Encoding — semantic-prior-guided contextual tokenization for code

These details have not been verified by PyPI

Project links

Project description

CUTE Tokenizer Mascot

🐭 CUTE Tokenizer

Compact Unicode Token Encoding

✨ a tokenizer that nibbles your token costs ✨

✨ Highlights

CUTE shrinks code sequences by 35–45% through a two-stage tokenization strategy:

Pre-encoding via Private-Use-Area Unicode — maps the most frequent words, operators, and identifier sub-parts to single compact characters
Residual byte-level BPE — handles everything else with standard subword tokenization

The result:

⚡ Faster inference — fewer tokens mean shorter sequence lengths and reduced latency
💰 Lower API costs — pay for up to 45% fewer tokens per request
🔁 Perfectly lossless round-trip — encode and decode with zero information loss

🧀 Quick Start

pip install cute-tokenizer

The wheel ships a trained tokenizer (code corpus). Use it immediately:

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"  # always lossless

Train your own and point at the artifacts:

# Drop a few repos into ./corpus/, then:
cute build --corpus ./corpus --output ./output

from cute_tokenizer import CUTETokenizerFast

tok = CUTETokenizerFast(
    tokenizer_file="./output/tokenizer.json",
    cute_mapping_file="./output/cute_mapping.json",
)

In this repository, production tokenizer files live under model/ (same files bundled into the package at build time).

Or via AutoTokenizer (after pushing to HF Hub):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("user/cute-py", trust_remote_code=True)

🔍 How It Works

Count & select — scan code, count tokens with identifier sub-part boosting, take the smallest set covering 90% of the corpus.
Assign PUA chars — map each chosen token to a unique Unicode Private-Use-Area codepoint, starting at U+E000. Skip codepoints that already appear in the corpus.
Pre-tokenize — at encode time, substitute mapped tokens with their PUA chars (Aho-Corasick, O(n) in input length).
BPE the rest — feed the residual through a standard byte-level BPE. The PUA chars are atomic vocab entries; they never get further split.
Decode — the byte-level decoder reconstructs the substituted string; reverse-substitution restores the original text.

Round-trip is byte-equal for any input. We test this with Hypothesis on arbitrary Unicode plus a hand-curated corner-case suite (ZWJ emoji, BOM, control chars, mixed scripts, deep nesting, etc.).

📦 Project Layout

src/cute_tokenizer/
  config.py         # CUTEConfig — all knobs in one place
  patterns.py       # token regex + identifier splitter (uses `regex` module)
  corpus.py         # streaming ingest, dedup, secret scrub, sharding
  frequency.py      # parallel multiprocess counting
  selection.py      # coverage-based + quality-filtered token selection
  pua.py            # Private-Use-Area codepoint allocator
  pretokenizer.py   # CUTEPreTokenizer (Aho-Corasick + identifier splitting)
  trainer.py        # build_cute() — orchestrates the full pipeline
  decode.py         # PUA-aware reverse substitution
  tokenizer.py      # CUTETokenizerFast (PreTrainedTokenizerFast)
  manifest.py       # build manifest for reproducibility
  cli.py            # `cute build`, `cute roundtrip-check`, `cute info`

tests/
  unit/             # ~140 unit tests
  property/         # Hypothesis round-trip tests
  integration/      # full pipeline E2E

benchmarks/
  compression.py    # CUTE vs tiktoken/GPT-2/CodeLlama
  latency.py        # encode/decode μs per KB

⚙️ Configuration

from cute_tokenizer import CUTEConfig, build_cute

config = CUTEConfig(
    vocab_size=80_000,        # total token IDs
    coverage_target=0.90,     # PUA coverage of total frequency
    max_token_len=50,         # ignore tokens longer than this
    boost_weight=0.3,         # identifier sub-part boost
    min_bpe_budget=8_000,     # minimum learnable merges
    seed=42,                  # determinism
    workers=0,                # 0 = os.cpu_count()
    enable_secret_scrub=True, # drop files containing API keys etc.
)
build_cute("./corpus", "./output", config)

🧪 Testing

pip install -e .[dev]
pytest tests/unit          # fast unit tests
pytest tests/property      # Hypothesis round-trip
pytest tests/integration   # full E2E build (slower)
pytest --cov=cute_tokenizer

The Hypothesis suite runs ~600+ generated test cases per round-trip property, plus a hand-picked corner-case parametrize covering: empty strings, BOM, ZWJ emoji, control chars, multi-script text, deep underscores, and more.

🔐 Production Hardening

Determinism: same corpus + config → same vocab hash. Verified by tests/integration/test_determinism.py.
Secret scrubbing: corpus files matching AWS/OpenAI/Anthropic/GitHub key patterns are dropped before vocab construction.
Build manifest: every build emits build_manifest.json recording config, corpus hash, vocab hash, library versions, and timing.
PUA collision detection: codepoints found in the corpus are skipped during assignment, so user content cannot be confused with our injection.
Type-checked: mypy --strict clean.
Lint clean: ruff check and ruff format.

📊 Benchmarks

python -m benchmarks.compression --tokenizer ./output --holdout ./holdout
python -m benchmarks.latency --tokenizer ./output

Expected (on a 100 GB Python/TS holdout):

Metric	CUTE vs byte-level BPE
Sequence length (mean)	⚡ 35–45% shorter
Sequence length (p95)	⚡ 30–40% shorter
Sequence length (p99)	⚡ 25–35% shorter
Bytes per token (mean)	📈 +50–70%
Round-trip correctness	✅ 100% (Hypothesis-verified)
Training throughput (LLM)	⚡ +25–35%
Inference latency (LLM)	⚡ −25–40%
API token cost	💰 −30–45%
KV-cache memory at inference	💾 −35–45%
Effective context window (text per token)	📏 +55–80%
Encode latency (tokenizer itself)	🐢 ~1.5× tiktoken (Python pre-tok overhead)

Run the benchmarks on your own corpus to see numbers for your distribution.

🐭 Why a Mouse?

A mouse is small, fast, and nibbles things to size. CUTE quietly chews through your token bill while you focus on the model. The cheese is the 30–45% cost reduction.

📜 License

MIT. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.0

May 8, 2026

0.1.3 yanked

May 7, 2026

This version

0.1.2 yanked

May 7, 2026

0.1.1 yanked

May 6, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cute_tokenizer-0.1.2.tar.gz (1.1 MB view details)

Uploaded May 7, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cute_tokenizer-0.1.2-py3-none-any.whl (1.1 MB view details)

Uploaded May 7, 2026 Python 3

File details

Details for the file cute_tokenizer-0.1.2.tar.gz.

File metadata

Download URL: cute_tokenizer-0.1.2.tar.gz
Upload date: May 7, 2026
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for cute_tokenizer-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`b0efffd79c89d356e78ccc78e8eec974b4e3dbc28ea8bc056092befb87da493a`
MD5	`4b128b266b7252f405a70c219239a0ef`
BLAKE2b-256	`3da17a21003d99260ca9166e5c63265ea2b5d1d9d6949b1d1c5db779b8b40223`

See more details on using hashes here.

File details

Details for the file cute_tokenizer-0.1.2-py3-none-any.whl.

File metadata

Download URL: cute_tokenizer-0.1.2-py3-none-any.whl
Upload date: May 7, 2026
Size: 1.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for cute_tokenizer-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b5b18b836dddc0a89128b8b7db7da8f2e6529d4f1bd33a89c6717b238c1b69fb`
MD5	`1fd7546d8a6453cddf4e6e7f5455f680`
BLAKE2b-256	`2226fb2534af224c281373a276317a56a3aebfeb2a9e1b9946aa5de32affa88c`

See more details on using hashes here.

cute-tokenizer 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

🐭 CUTE Tokenizer

Compact Unicode Token Encoding

✨ Highlights

🧀 Quick Start

🔍 How It Works

📦 Project Layout

⚙️ Configuration

🧪 Testing

🔐 Production Hardening

📊 Benchmarks

🐭 Why a Mouse?

📜 License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes