Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement
jtoken
Author: Hermann Samimi
jtoken compresses JSON-shaped documents for LLM prompts: fewer tokens, readable line-oriented output, and a lossless round-trip for nested dicts of supported scalar types. It includes normalization for Elasticsearch hits and MongoDB JSON, a CLI, and token measurement helpers.
Python 3.8+.
Installation
Core (no extra runtime dependencies)
pip install jtoken
With accurate OpenAI-style token counting
pip install "jtoken[tiktoken]"
The core package uses only the Python standard library. Install the tiktoken extra when you want tokenizer-accurate counts for OpenAI-compatible models.
Quick start
import jtoken
data = {
    "user": "alice",
    "age": 30,
    "premium": True,
    "verified": True,
    "is_remote": False,
    "trial": False,
    "score": 9.5,
    "referral": None,
    "last_login": None,
}
text = jtoken.encode(data)
restored = jtoken.decode(text)
assert restored == data
Aliases: jtoken.dumps = encode, jtoken.loads = decode.
End-to-end document workflow
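import jtoken

# Read the raw Elasticsearch hit as text and close the file afterwards
with open("hit.json", encoding="utf-8") as f:
    raw = f.read()

# Keep the returned context: it is needed to rebuild the hit envelope on decode
text, context = jtoken.encode_document(raw, source="elastic_hit")
restored = jtoken.decode_document(text, target="elastic_hit", context=context)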
Keep the normalization context sidecar when you need a lossless decode back into Mongo shell, Extended JSON, or an Elasticsearch hit envelope.
Format overview
JSON
{"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}
jtoken
name: Alice
age: 30
trues: active
falses: verified
nulls: ref
Encoding rules
- Nested dicts flatten with dot notation.
- True, False, and None collapse into trues:, falses:, and nulls: summary lines.
- Ambiguous strings keep quotes on encode.
- Multiline strings are JSON-quoted on one line.
- Keys containing . are escaped during normalization and restored from context.
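The rules above can be illustrated with a small standalone sketch. It is not jtoken's implementation: the helper names `flatten` and `sketch_encode` are hypothetical, the comma separator on summary lines is an assumption (the format overview only shows one key per summary line), and the quoting of ambiguous strings is left out for brevity.

```python
import json


def flatten(data, prefix=""):
    """Flatten nested dicts into {dotted_path: scalar}."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def sketch_encode(data):
    """Illustrative encoder only; NOT the library's codec."""
    lines, trues, falses, nulls = [], [], [], []
    for path, value in flatten(data).items():
        # Identity checks keep the int 1 from being mistaken for True
        if value is True:
            trues.append(path)
        elif value is False:
            falses.append(path)
        elif value is None:
            nulls.append(path)
        elif isinstance(value, str) and "\n" in value:
            # Multiline strings are JSON-quoted so they stay on one line
            lines.append(f"{path}: {json.dumps(value)}")
        else:
            lines.append(f"{path}: {value}")
    # Assumption: several paths on one summary line are comma-separated
    for label, paths in (("trues", trues), ("falses", falses), ("nulls", nulls)):
        if paths:
            lines.append(f"{label}: {','.join(paths)}")
    return "\n".join(lines)


doc = {"name": "Alice", "address": {"city": "Oslo"}, "active": True, "ref": None}
print(sketch_encode(doc))
```

For real documents, use jtoken.encode, which also handles quoting and key escaping.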
Supported scalar types
str, int, float, bool, None, and nested dict.
Limitations
- Keys cannot contain ": " in the core codec.
- Reserved top-level keys: nulls, trues, falses.
- Lists are normalized into nested dicts with numeric keys before encoding.
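The last limitation can be pictured as follows. This is a minimal sketch of the idea, assuming string indices as keys and a set of dotted paths that remembers where lists were; `normalize_lists` and `restore_lists` are hypothetical helpers, not functions exported by jtoken.

```python
def normalize_lists(value, path="", list_paths=None):
    """Replace lists with index-keyed dicts and record which paths were lists."""
    if list_paths is None:
        list_paths = set()
    if isinstance(value, list):
        list_paths.add(path)
        value = {str(i): item for i, item in enumerate(value)}
    if isinstance(value, dict):
        out = {}
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            out[key], _ = normalize_lists(item, child, list_paths)
        return out, list_paths
    return value, list_paths


def restore_lists(value, list_paths, path=""):
    """Inverse step: turn index-keyed dicts back into lists at recorded paths."""
    if isinstance(value, dict):
        restored = {
            key: restore_lists(item, list_paths, f"{path}.{key}" if path else str(key))
            for key, item in value.items()
        }
        if path in list_paths:
            return [restored[str(i)] for i in range(len(restored))]
        return restored
    return value


doc = {"tags": ["a", "b"], "user": {"roles": [{"name": "admin"}]}}
normalized, paths = normalize_lists(doc)
print(normalized)
print(sorted(paths))
```

Without the recorded paths, a dict with keys "0" and "1" would be indistinguishable from a former list, which is why the context has to travel with the encoded text.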
Public API reference
Package metadata
| name | type | description |
|---|---|---|
| jtoken.__version__ | str | package version |
| jtoken.__author__ | str | author name (Hermann Samimi) |
Core codec
| function | signature | description |
|---|---|---|
| encode | encode(data: dict) -> str | compress a nested scalar dict into jtoken text |
| decode | decode(text: str) -> dict | reconstruct the nested dict |
| dumps | alias of encode | json-style alias |
| loads | alias of decode | json-style alias |
Normalization and denormalization
| function | signature | description |
|---|---|---|
| parse_input | parse_input(text, *, source="auto") | parse foreign text into Python data |
| normalize | normalize(data, *, source="auto", context=None) | return (normalized_dict, NormalizationContext) |
| denormalize | denormalize(data, *, target="python", context) | restore lists, typed values, and dialect shape |
| render_output | render_output(value, *, target="python") -> str | render denormalized data as text |
| encode_document | encode_document(raw, *, source="auto", context=None) | return (jtoken_text, NormalizationContext) |
| decode_document | decode_document(text, *, target="python", context) | decode jtoken text and denormalize |
Token measurement
| function | signature | description |
|---|---|---|
| count_tokens | count_tokens(data, *, model="cl100k_base", backend="auto") -> int | count tokens for a dict or encoded jtoken string |
| count_text_tokens | count_text_tokens(text, *, model="cl100k_base", backend="auto") -> int | count tokens for raw text |
| token_savings | token_savings(data, *, model="cl100k_base", backend="auto", json_indent=2) | compare jtoken vs pretty JSON token usage |
TokenSavings properties
| property | type | description |
|---|---|---|
| jtoken_tokens | int | token count for the jtoken representation |
| json_tokens | int | token count for the JSON baseline |
| saved | int | json_tokens - jtoken_tokens |
| percent | float | percent saved relative to JSON |
str(stats) returns a one-line summary.
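The relationship between these properties can be sketched with a small stand-in class. `SavingsSketch` is hypothetical and not the library's TokenSavings; in particular, the 0 to 100 scale for percent and the zero-baseline guard are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavingsSketch:
    """Hypothetical stand-in for TokenSavings; not the library's class."""

    jtoken_tokens: int
    json_tokens: int

    @property
    def saved(self) -> int:
        # Tokens saved by using jtoken text instead of the JSON baseline
        return self.json_tokens - self.jtoken_tokens

    @property
    def percent(self) -> float:
        # Assumption: expressed on a 0-100 scale; guard against an empty baseline
        if self.json_tokens == 0:
            return 0.0
        return 100.0 * self.saved / self.json_tokens


stats = SavingsSketch(jtoken_tokens=30, json_tokens=40)
print(stats.saved, stats.percent)  # 10 25.0
```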
NormalizationContext fields
| field | type | description |
|---|---|---|
| source_format | str | input dialect used during normalization |
| target_format | str \| None | optional output hint |
| typed_values | dict[str, str] | dotted paths with BSON-like type markers |
| lists | set[str] | dotted paths that were lists before flattening |
| dotted_keys | dict[str, str] | escaped keys that originally contained . |
| elastic | dict \| None | Elasticsearch envelope metadata |
Methods: to_dict(), from_dict(data).
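One reason a to_dict()/from_dict() pair is useful is that a set-valued field such as lists cannot be written to JSON directly. The sketch below shows how a sidecar round-trip could work, assuming a hypothetical `ContextSketch` dataclass that mirrors the documented fields; the real NormalizationContext may serialize differently.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ContextSketch:
    """Hypothetical stand-in mirroring the documented fields; not the library's class."""

    source_format: str = "auto"
    target_format: Optional[str] = None
    typed_values: Dict[str, str] = field(default_factory=dict)
    lists: Set[str] = field(default_factory=set)
    dotted_keys: Dict[str, str] = field(default_factory=dict)
    elastic: Optional[dict] = None

    def to_dict(self) -> dict:
        return {
            "source_format": self.source_format,
            "target_format": self.target_format,
            "typed_values": dict(self.typed_values),
            # Sets are not JSON-serializable, so emit a sorted list
            "lists": sorted(self.lists),
            "dotted_keys": dict(self.dotted_keys),
            "elastic": self.elastic,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ContextSketch":
        return cls(
            source_format=data.get("source_format", "auto"),
            target_format=data.get("target_format"),
            typed_values=dict(data.get("typed_values", {})),
            lists=set(data.get("lists", [])),
            dotted_keys=dict(data.get("dotted_keys", {})),
            elastic=data.get("elastic"),
        )


ctx = ContextSketch(source_format="mongo_extended", lists={"tags", "user.roles"})
sidecar = json.dumps(ctx.to_dict())  # text that could be saved next to the jtoken output
print(ContextSketch.from_dict(json.loads(sidecar)) == ctx)
```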
Format enums
InputFormat: auto, json, python, mongo_extended, mongo_shell, elastic_hit, elastic_source
OutputFormat: python, json, mongo_extended, mongo_shell, elastic_hit, elastic_source
Exceptions
| exception | base | when raised |
|---|---|---|
| JPackError | Exception | base library error |
| JPackEncodeError | JPackError | encoding fails |
| JPackDecodeError | JPackError | decoding fails |
| NormalizationError | JPackError | normalization fails |
| DenormalizationError | JPackError | denormalization fails |
| TokenCountError | JPackError | token counting fails |
Token counting
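import jtoken
# data is the dict from the quick start; backend="tiktoken" requires the tiktoken extra
stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)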
| backend | behavior |
|---|---|
| auto | use tiktoken when installed, otherwise estimate |
| tiktoken | require tiktoken |
| estimate | simple character heuristic |
json_indent=2 compares against prompt-style pretty JSON. Use json_indent=None for compact JSON.
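The three backend behaviors can be sketched as below. This is not jtoken's code: `estimate_tokens` and `count_text_tokens_sketch` are hypothetical names, and the ratio of roughly four characters per token is an assumption, since the README does not say which character heuristic the library uses. The tiktoken branch relies on `tiktoken.get_encoding`, which may need to download the encoding on first use.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Character heuristic; the 4-chars-per-token ratio is an assumption."""
    if not text:
        return 0
    return max(1, math.ceil(len(text) / chars_per_token))


def count_text_tokens_sketch(text, encoding_name="cl100k_base", backend="auto"):
    """Illustrative backend selection; NOT the library's implementation."""
    if backend not in ("auto", "tiktoken", "estimate"):
        raise ValueError(f"unknown backend: {backend!r}")
    if backend != "estimate":
        try:
            import tiktoken
        except ImportError:
            # "tiktoken" must fail loudly; "auto" falls through to the estimate
            if backend == "tiktoken":
                raise
        else:
            return len(tiktoken.get_encoding(encoding_name).encode(text))
    return estimate_tokens(text)


print(count_text_tokens_sketch("name: Alice\nage: 30", backend="estimate"))
```

Estimated counts are useful for quick comparisons, but only the tokenizer-backed path reflects what a given model will actually be billed for.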
CLI
jtoken encode --input-format mongo_shell -f doc.json --context-out doc.ctx.json
jtoken decode --output-format mongo_shell -f doc.jtoken --context-in doc.ctx.json
jtoken stats --input-format json -f doc.json --model gpt-4o --backend tiktoken
jtoken count --input-format json -f doc.json --backend estimate
python -m jtoken encode
Common flags:
- -f/--file
- --input-format
- --output-format
- --context-out
- --context-in
- --model
- --backend
Links
- Homepage: https://github.com/hermannsamimi/jtoken
- Repository: https://github.com/hermannsamimi/jtoken
- Issues: https://github.com/hermannsamimi/jtoken/issues
License
MIT — Copyright (c) 2026 Hermann Samimi