LLM-friendly random byte sequence decoder
utf-token
Convert random string identifiers to an LLM-friendly format to reduce token usage in certain retrieval and agentic tasks.
By default, utf-token encodes each identifier into a 2-token sequence carrying 30 bits of entropy. Collisions are resolved automatically and the conversion is fully reversible.
Install
uv add utf-token
Usage
The IdTokenBiMap class is used to encode identifiers and store the full original bytes so you can recover them later.
from utf_token import IdTokenBiMap
bimap = IdTokenBiMap()
hex_str = "215aada34d0987ebfb9de132d913e46b"
# 17 tokens: 215 a ada 34 d 098 7 eb fb 9 de 132 d 913 e 46 b
token_hex = bimap.fromhex(hex_str)
print(token_hex)
# 2 tokens: ao 691
reconstructed_hex = bimap.tohex(token_hex) # Recovers the original hex string
Forward methods: frombytes, fromhex, frombase64, fromuuid.
Reverse methods: tobytes, tohex, tobase64, touuid.
Both forward and reverse methods accept either:
- a single value -> returns one encoded str (or recovered value)
- an iterable of values -> returns a lazy iterator
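The single-value versus iterable behaviour can be pictured with a small dispatch helper. This is only a sketch: `map_one_or_many` is a hypothetical name that is not part of utf-token, and it assumes that strings and bytes count as single values rather than as iterables.

```python
from collections.abc import Iterable, Iterator


def map_one_or_many(func, value):
    """Apply func to one value, or lazily to each item of an iterable.

    Hypothetical helper for illustration; not the package's implementation.
    """
    # str and bytes are iterable in Python, but here they are single identifiers.
    if isinstance(value, (str, bytes, bytearray)):
        return func(value)
    if isinstance(value, Iterable):
        # A generator expression does no work until it is consumed.
        return (func(item) for item in value)
    return func(value)


single = map_one_or_many(str.upper, "ab")         # one value in, one value out
many = map_one_or_many(str.upper, ["ab", "cd"])   # iterable in, lazy iterator out
print(single, list(many))
```

Because the iterable path is lazy, nothing is converted until the result is iterated, for example with `list(...)` or a `for` loop.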
Persisting the reversible map
The internal map in IdTokenBiMap can be saved and restored to transfer offline conversions for online usage:
- to_dict / from_dict
- to_json / from_json
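The exact signatures and on-disk format of these methods are not described here, so the following is only a stand-alone sketch of the general idea: a reverse map from encoded string to original bytes is serialized to JSON offline and rebuilt online. The function names and the `example_id` key are hypothetical.

```python
import json


def dump_reverse_map(reverse: dict[str, bytes]) -> str:
    """Serialize an encoded-string -> original-bytes map as JSON text."""
    # JSON has no bytes type, so the original bytes are stored as hex text.
    return json.dumps({token: raw.hex() for token, raw in reverse.items()},
                      sort_keys=True)


def load_reverse_map(payload: str) -> dict[str, bytes]:
    """Rebuild the map produced by dump_reverse_map."""
    return {token: bytes.fromhex(hex_text)
            for token, hex_text in json.loads(payload).items()}


# Offline step: build the map and write it out (placeholder key and value).
offline = {"example_id": bytes.fromhex("215aada34d0987ebfb9de132d913e46b")}
payload = dump_reverse_map(offline)

# Online step: load the map and resolve identifiers returned by the model.
online = load_reverse_map(payload)
print(online["example_id"].hex())
```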
Optional arguments
LLM - token vocabulary pairing
Pick the token vocabulary that matches the model you are using. Current options are:
- Default: o200k (OpenAI GPT-5+)
- gemma4 (Google Gemma 4)
bimap = IdTokenBiMap(vocab="gemma4")
Controlling how many bits are encoded with keep_bits
IdTokenBiMap takes keep_bits at construction (default 30):
- a positive integer that is a multiple of the vocab's pair_index_bits (15 for shipped vocabs): you get 1 token per 15 bits
- None or "all": encode the full input
short_bimap = IdTokenBiMap() # keep_bits=30
longer_bimap = IdTokenBiMap(keep_bits=45)
full_bimap = IdTokenBiMap(keep_bits="all")
short = short_bimap.frombytes(b"\x01\x02\x03\x04\x05\x06")
short_bimap.tobytes(short) == b"\x01\x02\x03\x04\x05\x06" # reverse returns the full input
The default 30 bits (two full 15-bit chunks) is enough entropy for retrieval workloads where you only need a handful of distinct identifiers visible to the model at once, and is also the minimum we recommend for the healing logic described below to stay reliable. Use a larger multiple of 15 if you need more in-context disambiguation.
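To get a feel for what 30 bits buys, the standard birthday approximation estimates how likely it is that at least two truncated prefixes coincide. This is a rough sketch that assumes uniformly random input identifiers; `collision_probability` and `tokens_needed` are illustrative helpers, not part of the package.

```python
import math


def collision_probability(n_ids: int, keep_bits: int = 30) -> float:
    """Birthday estimate: chance that at least two of n_ids random prefixes match."""
    space = 2 ** keep_bits
    # 1 - exp(-n(n-1) / (2 * space)), computed with expm1 for accuracy near zero.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


def tokens_needed(keep_bits: int, pair_index_bits: int = 15) -> int:
    """Tokens produced for a keep_bits value, at one token per pair_index_bits."""
    if keep_bits <= 0 or keep_bits % pair_index_bits:
        raise ValueError("keep_bits must be a positive multiple of pair_index_bits")
    return keep_bits // pair_index_bits


for n in (10, 100, 1000, 10000):
    print(f"{n:>6} ids: 30 bits -> {collision_probability(n, 30):.2e}, "
          f"45 bits -> {collision_probability(n, 45):.2e}")
```

A prefix collision is not an error, since IdTokenBiMap resolves it automatically; the estimate only indicates how often that resolution path would be exercised for a given number of identifiers.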
Healing transcription errors on reverse lookup
LLMs occasionally make transcription errors when copying identifiers. Reverse methods accept an errors keyword to control what happens when the input is not an exact match in the reverse map:
- errors="fix" (default): return the closest previously encoded identifier by Levenshtein distance.
- errors="raise": if the exact lookup misses, raise KeyError. Useful when you want to manage error handling yourself.
bimap = IdTokenBiMap()
encoded = bimap.fromuuid("123e4567-e89b-12d3-a456-426614174000")
bimap.touuid(encoded) # exact match
bimap.touuid(encoded[:-1] + "Z") # heals to nearest stored id
bimap.touuid("not_a_real_id", errors="raise") # raises KeyError
if encoded in bimap: # supports membership checks
    print("This will print")
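The healing behaviour can be approximated with a plain nearest-neighbour search over the stored identifiers. The sketch below is self-contained and does not use utf-token; `levenshtein` and `lookup` are hypothetical helpers, and details such as tie-breaking or any distance cutoff in the real package are not described here and are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def lookup(reverse: dict, encoded: str, errors: str = "fix"):
    """Exact lookup first; on a miss, either raise or fall back to the nearest key."""
    if encoded in reverse:
        return reverse[encoded]
    if errors == "raise" or not reverse:
        raise KeyError(encoded)
    # Assumption: ties are broken by insertion order of the stored keys.
    nearest = min(reverse, key=lambda known: levenshtein(encoded, known))
    return reverse[nearest]


stored = {"alpha42": b"\x01", "bravo77": b"\x02"}   # placeholder identifiers
print(lookup(stored, "alpha4Z"))                     # healed to the "alpha42" entry
```

A linear scan like this costs one distance computation per stored identifier, which is reasonable for the handful of identifiers typically visible to the model at once.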
Standalone forward-only helpers
frombytes, fromhex, frombase64, and fromuuid are also available as standalone module-level functions. They perform only the forward conversion, and they default to keeping the full input rather than truncating. They are useful when you want to plug utf-token into your own data flow or build your own reverse-lookup table:
from utf_token import fromhex
my_hex = "215aada34d0987ebfb9de132d913e46b"
encoded_hex = fromhex(my_hex) # full input
short_hex = fromhex(my_hex, keep_bits=30) # top 30 MSBs
Both keep_bits=None and keep_bits="all" keep the full input.
For the standalone functions, pass the vocab parameter in the call.
Included safe character set in tokens
Both o200k and gemma4 lookup tables are restricted to ASCII (A-Z, a-z, 0-9, _) to avoid LLM confusion.
Neither vocabulary emits quotes, slashes, brackets, commas, pipes, whitespace, or other delimiter characters, which makes the output easy to embed in JSON, Markdown, logs, tables, and prompts where the LLM or code needs to see clearly where an identifier begins and ends.
Instructions to include in prompts/tools
To avoid confusion when your agent sees these IDs, you can adapt these instructions to your specific use case:
Identifiers are random LLM token sequences containing only ASCII alphanumeric or _ characters. They are delimited by <insert your delimiters here>. Some identifiers may contain words or parts of words; this is just a coincidence due to the use of tokens. Do not translate or fix typos in the identifiers. Transcribe them verbatim.
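As one way to apply consistent delimiters, identifiers can be wrapped before they go into the prompt and extracted from the model's reply with a matching pattern. The `<id>` and `</id>` delimiters and both helper names below are arbitrary choices for illustration, not something utf-token prescribes.

```python
import re

# Character set the encoded identifiers are restricted to.
ID_CHARS = r"[A-Za-z0-9_]+"


def wrap(identifier: str, left: str = "<id>", right: str = "</id>") -> str:
    """Surround an identifier with delimiters after checking its characters."""
    if not re.fullmatch(ID_CHARS, identifier):
        raise ValueError(f"unexpected characters in identifier: {identifier!r}")
    return f"{left}{identifier}{right}"


def extract(text: str, left: str = "<id>", right: str = "</id>") -> list[str]:
    """Return every delimited identifier found in text, in order."""
    pattern = re.escape(left) + "(" + ID_CHARS + ")" + re.escape(right)
    return re.findall(pattern, text)


prompt_line = f"Document {wrap('ao691')} mentions {wrap('x_1')}."
print(prompt_line)
print(extract(prompt_line))
```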
Other recommendations for maximum reliability in identifier retrieval
- Use consistent delimiters to clearly separate identifiers from other text in the prompt.
- Keep the default keep_bits=30 (or a higher multiple of 15) so the healing logic has enough signal to disambiguate identifiers.
- Use structured outputs / JSON tools to request the identifiers. Provide a regex pattern such as ^[A-Za-z0-9_]+$ for the output strings in the JSON schema.
- Use smart models. For OpenAI, use at least GPT-5.4-mini (not nano). For Gemini, use at least Gemma 4. For Anthropic, use at least Haiku 4.5.
- Use low temperature if the model supports it.
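The structured-output recommendation could look roughly like the sketch below. The schema shape and the `identifiers` property name are assumptions made for illustration, and how a schema is passed to a given provider's API is outside the scope of this example; the validation step uses only the standard library.

```python
import json
import re

ID_REGEX = r"^[A-Za-z0-9_]+$"

# Hypothetical JSON schema for a tool or structured-output response.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "identifiers": {
            "type": "array",
            "items": {"type": "string", "pattern": ID_REGEX},
        }
    },
    "required": ["identifiers"],
    "additionalProperties": False,
}


def parse_identifiers(raw_json: str) -> list[str]:
    """Parse a model reply and re-check each identifier against the pattern."""
    identifiers = json.loads(raw_json)["identifiers"]
    bad = [item for item in identifiers
           if not isinstance(item, str) or not re.fullmatch(ID_REGEX, item)]
    if bad:
        raise ValueError(f"identifiers with unexpected characters: {bad!r}")
    return identifiers


print(parse_identifiers('{"identifiers": ["ao691", "x_1"]}'))
```

Re-checking on the client side is a cheap safeguard in case the provider does not enforce the `pattern` keyword strictly.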
Retrieval benchmark
A NIAH-style benchmark is included to test small LLMs (GPT-5.4-mini, Gemma 4, Claude Haiku) on retrieval accuracy. With 100 samples for each model, and both full-input and default keep_bits=30 identifiers, the success rate is 100%. The context length is 32k tokens (calibrated for hex identifiers, then re-encoded for each encoding), and identifiers have 16 bytes of entropy.
The synthetic NIAH benchmark was adapted from NVIDIA/RULER.
How it works
utf-token encodes the underlying bytes directly. Each vocabulary ships two pre-built lookup tables, generated offline by scripts/process_token_vocab.py: a large pair table indexed by either 15 or 16 bits (depending on how many clean tokens the vocabulary can supply) and a small tail table indexed by 8 bits.
For 15-bit pair tables (both shipped vocabs) the encoder treats the input as an MSB-first bitstream, splits it into 15-bit chunks for the pair table, and uses the tail table for any 1–8 bit residual at the end. A 16-bit fast path is also implemented for any future vocabulary that can fill a 16-bit pair table under the curated latin_16bit recipe.
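The MSB-first chunking can be sketched as follows. This illustrates only the bit splitting, not the token lookup: `split_msb_first` is a hypothetical helper, the lookup tables are omitted, and how a residual is mapped to a token (including residuals longer than 8 bits) is not covered here, so the sketch simply reports it.

```python
def split_msb_first(data: bytes, chunk_bits: int = 15):
    """Split data into chunk_bits-wide integers, most significant bits first.

    Returns (chunks, residual, residual_bits), where residual holds the
    leftover low-order bits that do not fill a whole chunk.
    """
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    n_chunks, residual_bits = divmod(total_bits, chunk_bits)
    mask = (1 << chunk_bits) - 1
    chunks = []
    for i in range(n_chunks):
        # Shift so that the i-th chunk from the top lands in the low bits.
        shift = total_bits - (i + 1) * chunk_bits
        chunks.append((value >> shift) & mask)
    residual = value & ((1 << residual_bits) - 1)
    return chunks, residual, residual_bits


raw = bytes.fromhex("215aada34d0987ebfb9de132d913e46b")  # 16 bytes = 128 bits
chunks, residual, residual_bits = split_msb_first(raw)
print(len(chunks), residual_bits)  # 128 = 8 * 15 + 8
```

With keep_bits=30, only the first two of these chunks would be used as pair-table indices, which matches the 2-token default.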
IdTokenBiMap keeps a forward map and a reverse map so the generated string can be resolved back to the original bytes later. Collisions can happen when different inputs produce the same encoded string, especially when keep_bits truncates them to a short prefix. When IdTokenBiMap sees that a new value would collide with an existing one, it deterministically moves to the next prefix until it finds an unused encoded string. The stored reverse map still points that generated string back to the original full input.
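A toy version of this collision handling is sketched below. It is not the package's code: `ToyBiMap` is a hypothetical class, the encoded form is plain hex instead of tokens, and "next prefix" is interpreted here as incrementing the truncated prefix by one, which is an assumption.

```python
class ToyBiMap:
    """Minimal forward/reverse map with deterministic collision probing."""

    def __init__(self, keep_bits: int = 30):
        self.keep_bits = keep_bits
        self.forward = {}  # original bytes -> encoded string
        self.reverse = {}  # encoded string -> original bytes

    def _prefix(self, data: bytes) -> int:
        """Top keep_bits of data as an integer (zero-padded if data is shorter)."""
        total_bits = len(data) * 8
        value = int.from_bytes(data, "big")
        if total_bits >= self.keep_bits:
            return value >> (total_bits - self.keep_bits)
        return value << (self.keep_bits - total_bits)

    def encode(self, data: bytes) -> str:
        if data in self.forward:  # same input always gets the same string
            return self.forward[data]
        space = 1 << self.keep_bits
        prefix = self._prefix(data)
        for step in range(space):
            # Assumed probing rule: try prefix, prefix + 1, ... (wrapping around).
            candidate = format((prefix + step) % space, "x")
            if candidate not in self.reverse:
                self.forward[data] = candidate
                self.reverse[candidate] = data  # full original input is kept
                return candidate
        raise RuntimeError("no unused encoded string left for this keep_bits")

    def decode(self, encoded: str) -> bytes:
        return self.reverse[encoded]


toy = ToyBiMap(keep_bits=8)
first = toy.encode(b"\x12\x34")
second = toy.encode(b"\x12\x99")  # same top 8 bits, so it is moved along
print(first, second, toy.decode(second).hex())
```

Because the reverse map stores the full input, both identifiers remain recoverable even though only one of them keeps its natural prefix.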
Download files
File details
Details for the file utf_token-0.1.5.tar.gz.
File metadata
- Download URL: utf_token-0.1.5.tar.gz
- Upload date:
- Size: 235.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a77a25315b703d6cb87e7eb49b1807ecf72b1de9dcb858c9ae4f09707cc9706 |
| MD5 | 9cb05bfd4a91ff986051bf62e2a6b41f |
| BLAKE2b-256 | 9516237fef1077975f4a9deb3a2f8af351efa0d570e477028fde353b06ef86ad |
Provenance
The following attestation bundles were made for utf_token-0.1.5.tar.gz:
Publisher: release.yml on japlete/utf-token
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: utf_token-0.1.5.tar.gz
- Subject digest: 0a77a25315b703d6cb87e7eb49b1807ecf72b1de9dcb858c9ae4f09707cc9706
- Sigstore transparency entry: 1565268433
- Sigstore integration time:
- Permalink: japlete/utf-token@8886be8191db0913c25bfc2be0e4f1df8230b576
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/japlete
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8886be8191db0913c25bfc2be0e4f1df8230b576
- Trigger Event: push
File details
Details for the file utf_token-0.1.5-py3-none-any.whl.
File metadata
- Download URL: utf_token-0.1.5-py3-none-any.whl
- Upload date:
- Size: 219.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ea3286edf4dbbd80671c40d75d74915abfcf77c03a775800dc0c5951bc7d3a8d |
| MD5 | 34261aaf9f8fb971dbf41702d82bf4ff |
| BLAKE2b-256 | 9da3db1ccae6a94c714266e380bfc9eb3ba76f4fb1d401b7536ed0ea26630ab2 |
Provenance
The following attestation bundles were made for utf_token-0.1.5-py3-none-any.whl:
Publisher: release.yml on japlete/utf-token
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: utf_token-0.1.5-py3-none-any.whl
- Subject digest: ea3286edf4dbbd80671c40d75d74915abfcf77c03a775800dc0c5951bc7d3a8d
- Sigstore transparency entry: 1565268475
- Sigstore integration time:
- Permalink: japlete/utf-token@8886be8191db0913c25bfc2be0e4f1df8230b576
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/japlete
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8886be8191db0913c25bfc2be0e4f1df8230b576
- Trigger Event: push