LLM-friendly random byte sequence decoder

These details have been verified by PyPI

Project links

Repository

GitHub Statistics

Maintainers

japlete

These details have not been verified by PyPI

Project description

utf-token

Convert random string identifiers to an LLM-friendly format to reduce token usage in certain retrieval and agentic tasks.

utf-token encodes the identifier into a 2-token sequence by default with 30 bits of entropy. Collisions are prevented automatically and the conversion is fully reversible.

[Figure: token savings vs hex, base64, and uuid]

Install

uv add utf-token

Usage

The IdTokenBiMap class is used to encode identifiers and store the full original bytes so you can recover them later.

from utf_token import IdTokenBiMap

bimap = IdTokenBiMap()

hex_str = "215aada34d0987ebfb9de132d913e46b"
# 17 tokens: 215 a ada 34 d 098 7 eb fb 9 de 132 d 913 e 46 b

token_hex = bimap.fromhex(hex_str)
print(token_hex)
# 2 tokens: ao 691

reconstructed_hex = bimap.tohex(token_hex) # Recovers the original hex string

Forward methods: frombytes, fromhex, frombase64, fromuuid. Reverse methods: tobytes, tohex, tobase64, touuid.

Both forward and reverse methods accept either:

a single value -> returns one encoded str (or recovered value)
an iterable of values -> returns a lazy iterator
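
The single-value versus iterable behavior can be pictured with a small dispatch helper. This is only an illustrative sketch: `one_or_many` and `fake_encode` are hypothetical names, not part of utf-token, and the sketch handles `str` inputs only. It shows why a string has to be treated as a single value (a `str` is itself iterable) and what "lazy" means in practice.

```python
from collections.abc import Iterable, Iterator
from typing import Callable, Union


def one_or_many(
    convert: Callable[[str], str],
    value: Union[str, Iterable[str]],
) -> Union[str, Iterator[str]]:
    """Apply `convert` to one string, or lazily to each string in an iterable."""
    # A str is itself iterable, so check for it first and treat it as one value.
    if isinstance(value, str):
        return convert(value)
    # A generator expression defers the work until the caller iterates.
    return (convert(item) for item in value)


calls = []


def fake_encode(s: str) -> str:
    """Hypothetical stand-in for a real encoder; records each call."""
    calls.append(s)
    return s.upper()


single = one_or_many(fake_encode, "ab")
lazy = one_or_many(fake_encode, ["cd", "ef"])
calls_before_iteration = list(calls)  # only the single-value call has run so far
many = list(lazy)                     # iterating triggers the remaining conversions
```

With a lazy iterator, nothing is converted until the result is consumed, so wrap it in `list(...)` when all values are needed at once.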

Persisting the reversible map

The internal map in IdTokenBiMap can be saved and restored to transfer offline conversions for online usage:

to_dict / from_dict
to_json / from_json
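
To illustrate what persisting a reversible map involves, here is a minimal standard-library sketch. It does not use or reproduce utf-token's own serialization format; `dump_reverse_map`, `load_reverse_map`, and the key `tok_a` are made up for this example. The main point is that JSON has no bytes type, so the original bytes need a text encoding such as hex.

```python
import json


def dump_reverse_map(reverse: dict[str, bytes]) -> str:
    """Serialize an encoded-string -> original-bytes map to JSON text."""
    # JSON cannot hold raw bytes, so each original value is stored as hex.
    return json.dumps(
        {token: raw.hex() for token, raw in reverse.items()},
        sort_keys=True,
    )


def load_reverse_map(text: str) -> dict[str, bytes]:
    """Rebuild the map produced by dump_reverse_map."""
    return {token: bytes.fromhex(hexed) for token, hexed in json.loads(text).items()}


# "tok_a" is a placeholder key, not a real utf-token output.
offline = {"tok_a": bytes.fromhex("215aada34d0987ebfb9de132d913e46b")}
restored = load_reverse_map(dump_reverse_map(offline))
```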

Optional arguments

LLM - token vocabulary pairing

Pick the token vocabulary that matches the model you are using. Current options are:

Default: o200k (OpenAI GPT-5+)
gemma4 (Google Gemma 4)

bimap = IdTokenBiMap(vocab="gemma4")

Controlling how many bits are encoded with `keep_bits`

IdTokenBiMap takes keep_bits at construction (default 30):

a positive integer that is a multiple of the vocab's pair_index_bits (15 for shipped vocabs): you get 1 token per 15 bits
None or "all": encode the full input

short_bimap = IdTokenBiMap()                                     # keep_bits=30
longer_bimap = IdTokenBiMap(keep_bits=45)
full_bimap = IdTokenBiMap(keep_bits="all")

short = short_bimap.frombytes(b"\x01\x02\x03\x04\x05\x06")
short_bimap.tobytes(short) == b"\x01\x02\x03\x04\x05\x06"        # reverse returns the full input

The default 30 bits (two full 15-bit chunks) is enough entropy for retrieval workloads where you only need a handful of distinct identifiers visible to the model at once, and is also the minimum we recommend for the healing logic described below to stay reliable. Use a larger multiple of 15 if you need more in-context disambiguation.
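
As a rough guide to how far 30 bits go, the birthday bound estimates the chance that at least two of n uniformly random identifiers share the same truncated prefix. The sketch below is a standard approximation, not something taken from the package; `collision_probability` is a name chosen for this example. Since IdTokenBiMap resolves collisions automatically, the number indicates how often that fallback would be exercised rather than a failure rate.

```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    pairs = n_ids * (n_ids - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) accurately for small x.
    return -math.expm1(-pairs / 2**bits)


for n in (100, 1_000, 10_000):
    p30 = collision_probability(n, 30)
    p45 = collision_probability(n, 45)
    print(f"{n:>6} ids: 30 bits -> {p30:.2e}, 45 bits -> {p45:.2e}")
```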

Healing transcription errors on reverse lookup

LLMs occasionally make transcription errors when copying identifiers. Reverse methods accept an errors keyword to control what happens when the input is not an exact match in the reverse map:

errors="fix" (default): return the closest previously encoded identifier by Levenshtein distance.
errors="raise": if the exact lookup misses, raise KeyError. Useful when you want to manage error handling yourself.

bimap = IdTokenBiMap()
encoded = bimap.fromuuid("123e4567-e89b-12d3-a456-426614174000")

bimap.touuid(encoded)                                    # exact match
bimap.touuid(encoded[:-1] + "Z")                         # heals to nearest stored id

try:
    bimap.touuid("not_a_real_id", errors="raise")        # no exact match
except KeyError:
    print("unknown identifier")

if encoded in bimap:                                     # supports membership checks
    print("This will print")
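
The idea behind errors="fix" can be sketched with a plain Levenshtein distance and a nearest-neighbor search. This is an assumption-laden illustration: it measures distance per character and breaks ties by insertion order, which may differ from what utf-token does internally, and `levenshtein`, `heal`, and the sample identifiers are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def heal(candidate: str, known: list[str]) -> str:
    """Return the known identifier closest to candidate; ties go to the earliest."""
    return min(known, key=lambda k: levenshtein(candidate, k))


known_ids = ["ao691", "Bx_204", "quiet77"]  # made-up encoded identifiers
healed = heal("ao69Z", known_ids)           # one substitution away from "ao691"
```

A linear scan like this is fine for a handful of identifiers; with many stored identifiers, a distance cutoff or an index would be worth considering.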

Standalone forward-only helpers

frombytes, fromhex, frombase64, and fromuuid are also available as standalone module-level functions. They perform only the forward conversion, and they default to keeping the full input rather than truncating. They are useful when you want to plug utf-token into your own data flow or build your own reverse-lookup table:

from utf_token import fromhex

my_hex = "215aada34d0987ebfb9de132d913e46b"
encoded_hex = fromhex(my_hex)                            # full input
short_hex = fromhex(my_hex, keep_bits=30)                 # 30 most significant bits

Both keep_bits=None and keep_bits="all" keep the full input.

For the standalone functions, pass the vocab parameter in the call.

Included safe character set in tokens

Both o200k and gemma4 lookup tables are restricted to ASCII (A-Z, a-z, 0-9, _) to avoid LLM confusion.

Neither vocabulary emits quotes, slashes, brackets, commas, pipes, whitespace, or other delimiter characters, which makes the output easy to embed in JSON, Markdown, logs, tables, and prompts where the LLM or code needs to see clearly where an identifier begins and ends.

Instructions to include in prompts/tools

To avoid confusion when your agent sees these IDs, you can adapt these instructions to your specific use case:

Identifiers are random LLM token sequences containing only ASCII alphanumeric or _ characters. They are delimited by <insert your delimiters here>. Some identifiers may contain words or parts of words; this is just a coincidence due to the use of tokens. Do not translate or fix typos in the identifiers. Transcribe them verbatim.

Other recommendations for maximum reliability in identifier retrieval

Use consistent delimiters to clearly separate identifiers from other text in the prompt.
Keep the default keep_bits=30 (or a higher multiple of 15) so the healing logic has enough signal to disambiguate identifiers.
Use structured outputs / JSON tools to request the identifiers. Provide a regex pattern such as ^[A-Za-z0-9_]+$ for the output strings in the JSON schema.
Use smart models. For OpenAI, use at least GPT-5.4-mini (not nano). For Google, use at least Gemma 4. For Anthropic, use at least Haiku 4.5.
Use low temperature if the model supports it.
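
The regex recommendation above can also be enforced on the application side before a reverse lookup. The following is a minimal sketch; `is_valid_identifier` and `ID_SCHEMA` are hypothetical names, and the schema shape is only one possible layout for a tool that returns a single identifier.

```python
import re

ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_identifier(text: str) -> bool:
    """True if text consists only of ASCII letters, digits, and underscores."""
    # fullmatch is used on purpose: in Python, "$" with re.match also matches
    # just before a trailing newline, so "abc\n" would slip through.
    return ID_PATTERN.fullmatch(text) is not None


# Hypothetical JSON Schema fragment for a tool that returns one identifier.
ID_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "pattern": "^[A-Za-z0-9_]+$"},
    },
    "required": ["id"],
    "additionalProperties": False,
}
```

Checking the model's output locally catches stray whitespace or delimiters before they reach the healing step.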

Retrieval benchmark

A needle-in-a-haystack (NIAH) style benchmark is included to test the retrieval accuracy of small LLMs (GPT-5.4-mini, Gemma 4, Claude Haiku). With 100 samples per model, using both full-input and default keep_bits=30 identifiers, the success rate is 100%. The context length is 32k tokens (calibrated for hex identifiers, then re-encoded for each encoding), and identifiers carry 16 bytes of entropy.

See docs/benchmarks_niah.md.

The synthetic NIAH benchmark was adapted from NVIDIA/RULER.

How it works

utf-token encodes the underlying bytes directly. Each vocabulary ships two pre-built lookup tables, generated offline by scripts/process_token_vocab.py: a large pair table indexed by either 15 or 16 bits (depending on how many clean tokens the vocabulary can supply) and a small tail table indexed by 8 bits.

For 15-bit pair tables (both shipped vocabs) the encoder treats the input as an MSB-first bitstream, splits it into 15-bit chunks for the pair table, and uses the tail table for any 1–8 bit residual at the end. A 16-bit fast path is also implemented for any future vocabulary that can fill a 16-bit pair table under the curated latin_16bit recipe.
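
The MSB-first chunking described above can be sketched in a few lines. This is not the package's code: `split_bits` is a name invented for this example, and it only splits the bitstream without performing any table lookup. It also just reports the residual width and does not model how residuals wider than 8 bits would be handled, since that is not described here.

```python
def split_bits(data: bytes, chunk_bits: int = 15) -> tuple[list[int], int, int]:
    """Split data, read MSB-first, into chunk_bits-wide integers plus a residual.

    Returns (chunks, residual_value, residual_bits).
    """
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")  # big-endian gives an MSB-first bitstream
    n_chunks, residual_bits = divmod(total_bits, chunk_bits)
    mask = (1 << chunk_bits) - 1
    chunks = [
        (value >> (total_bits - (i + 1) * chunk_bits)) & mask
        for i in range(n_chunks)
    ]
    # Whatever is left after the last full chunk sits in the lowest bits.
    residual = value & ((1 << residual_bits) - 1)
    return chunks, residual, residual_bits


chunks, residual, residual_bits = split_bits(
    bytes.fromhex("215aada34d0987ebfb9de132d913e46b")
)
# A 16-byte input has 128 bits: eight 15-bit chunks and an 8-bit residual.
```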

IdTokenBiMap keeps a forward map and a reverse map so the generated string can be resolved back to the original bytes later. Collisions can happen when different inputs produce the same encoded string, especially when keep_bits truncates them to a short prefix. When IdTokenBiMap sees that a new value would collide with an existing one, it deterministically moves to the next prefix until it finds an unused encoded string. The stored reverse map still points that generated string back to the original full input.
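
The collision handling can be illustrated with a toy bidirectional map. Several things here are assumptions made for the sketch: `PrefixBiMap` is a hypothetical class, "next prefix" is interpreted as adding 1 (wrapping around), and the key is left as an integer prefix instead of being turned into tokens.

```python
class PrefixBiMap:
    """Toy reversible map: keys are keep_bits-wide prefixes, values are full inputs."""

    def __init__(self, keep_bits: int = 30) -> None:
        self.keep_bits = keep_bits
        self.forward: dict[bytes, int] = {}
        self.reverse: dict[int, bytes] = {}

    def encode(self, data: bytes) -> int:
        if data in self.forward:  # the same input always gets the same key
            return self.forward[data]
        space = 1 << self.keep_bits
        if len(self.reverse) >= space:
            raise RuntimeError("no free prefixes left")
        total_bits = len(data) * 8
        value = int.from_bytes(data, "big")
        # Keep only the most significant keep_bits bits of the input.
        prefix = value >> max(total_bits - self.keep_bits, 0)
        # Assumed strategy: step to the next prefix until an unused one is found.
        while prefix in self.reverse:
            prefix = (prefix + 1) % space
        self.forward[data] = prefix
        self.reverse[prefix] = data
        return prefix

    def decode(self, prefix: int) -> bytes:
        return self.reverse[prefix]  # always the full original input


demo = PrefixBiMap(keep_bits=8)
first = demo.encode(b"\x12\x34")   # prefix 0x12
second = demo.encode(b"\x12\x99")  # same prefix, so it moves to 0x13
```

Because the reverse map stores the full input, truncation only affects what the model sees, not what can be recovered.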

Project details

Release history

This version

0.1.5

May 18, 2026

Download files

Source Distribution

utf_token-0.1.5.tar.gz (235.0 kB)

Uploaded May 18, 2026 Source

Built Distribution

utf_token-0.1.5-py3-none-any.whl (219.1 kB)

Uploaded May 18, 2026 Python 3

File details

Details for the file utf_token-0.1.5.tar.gz.

File metadata

Download URL: utf_token-0.1.5.tar.gz
Upload date: May 18, 2026
Size: 235.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for utf_token-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`0a77a25315b703d6cb87e7eb49b1807ecf72b1de9dcb858c9ae4f09707cc9706`
MD5	`9cb05bfd4a91ff986051bf62e2a6b41f`
BLAKE2b-256	`9516237fef1077975f4a9deb3a2f8af351efa0d570e477028fde353b06ef86ad`

Provenance

The following attestation bundles were made for utf_token-0.1.5.tar.gz:

Publisher: release.yml on japlete/utf-token

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: utf_token-0.1.5.tar.gz
- Subject digest: 0a77a25315b703d6cb87e7eb49b1807ecf72b1de9dcb858c9ae4f09707cc9706
- Sigstore transparency entry: 1565268433
- Sigstore integration time: May 18, 2026
Source repository:
- Permalink: japlete/utf-token@8886be8191db0913c25bfc2be0e4f1df8230b576
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/japlete
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8886be8191db0913c25bfc2be0e4f1df8230b576
- Trigger Event: push

File details

Details for the file utf_token-0.1.5-py3-none-any.whl.

File metadata

Download URL: utf_token-0.1.5-py3-none-any.whl
Upload date: May 18, 2026
Size: 219.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for utf_token-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ea3286edf4dbbd80671c40d75d74915abfcf77c03a775800dc0c5951bc7d3a8d`
MD5	`34261aaf9f8fb971dbf41702d82bf4ff`
BLAKE2b-256	`9da3db1ccae6a94c714266e380bfc9eb3ba76f4fb1d401b7536ed0ea26630ab2`

Provenance

The following attestation bundles were made for utf_token-0.1.5-py3-none-any.whl:

Publisher: release.yml on japlete/utf-token

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: utf_token-0.1.5-py3-none-any.whl
- Subject digest: ea3286edf4dbbd80671c40d75d74915abfcf77c03a775800dc0c5951bc7d3a8d
- Sigstore transparency entry: 1565268475
- Sigstore integration time: May 18, 2026
Source repository:
- Permalink: japlete/utf-token@8886be8191db0913c25bfc2be0e4f1df8230b576
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/japlete
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8886be8191db0913c25bfc2be0e4f1df8230b576
- Trigger Event: push
