Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement
jtoken
Author: Hermann Samimi
jtoken compresses JSON-shaped documents for LLM prompts: fewer tokens, readable line-oriented output, lossless round-trip. Pass a file, a string, or a dict — it figures out the rest.
Python 3.8+. No extra runtime dependencies.
Installation
pip install jtoken
pip install "jtoken[tiktoken]" # for OpenAI-compatible token counting
Quick start
import jtoken
# From a file — read as text, pass directly
with open("data.json") as f:
    raw = f.read()
encoded = jtoken.encode(raw)
print(encoded)
# From a Python dict
data = {"user": "alice", "age": 30, "active": True, "ref": None}
encoded = jtoken.encode(data)
decoded = jtoken.decode(encoded)
assert decoded == data
Aliases: jtoken.dumps = encode, jtoken.loads = decode.
Format overview
JSON
{"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}
jtoken
name: Alice
age: 30
trues: active
falses: verified
nulls: ref
Encoding rules
- Nested dicts flatten with dot notation.
- True, False, and None collapse into trues:, falses:, and nulls: summary lines.
- Ambiguous strings keep quotes on encode.
- Keys containing . are escaped during normalization and restored from context.
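The first two rules can be illustrated with a small standalone sketch. This is not jtoken's implementation: the helper name sketch_encode is hypothetical, and it covers only dot-notation flattening and the trues:/falses:/nulls: summary lines. Quoting of ambiguous strings, lists, and key escaping are left out, and the comma-separated layout used when several keys share a summary line is an assumption.

```python
def sketch_encode(data):
    """Flatten a nested dict into line-oriented text (illustrative only)."""
    lines, trues, falses, nulls = [], [], [], []

    def walk(node, path):
        for key, value in node.items():
            full = f"{path}.{key}" if path else key
            if isinstance(value, dict):
                walk(value, full)          # nested dicts become dotted paths
            elif value is True:
                trues.append(full)
            elif value is False:
                falses.append(full)
            elif value is None:
                nulls.append(full)
            else:
                lines.append(f"{full}: {value}")

    walk(data, "")
    # Booleans and nulls are emitted once, as summary lines listing their keys.
    for label, keys in (("trues", trues), ("falses", falses), ("nulls", nulls)):
        if keys:
            lines.append(f"{label}: {', '.join(keys)}")
    return "\n".join(lines)


print(sketch_encode({"user": {"name": "Alice", "address": {"city": "Paris"}}, "active": True}))
```

Identity checks (value is True) are used rather than equality so that the integers 1 and 0 are not mistaken for booleans.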
What jtoken accepts
encode accepts a string (file content) or a dict/list. When given a string, it auto-detects the format:
- Standard JSON objects and arrays
- Multiple bare JSON objects in a single string (no array wrapper needed)
- MongoDB shell format (ObjectId(...), ISODate(...), NumberInt(...))
- MongoDB Extended JSON ($oid, $date, $numberInt, …)
- Elasticsearch search hits (with _source)
No format flag required — just pass the text.
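As an illustration of the "multiple bare JSON objects" case, the sketch below shows one way such input can be split using only the standard library. How jtoken does this internally is not documented here; the function name parse_concatenated_json is hypothetical.

```python
import json


def parse_concatenated_json(text):
    """Return every top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


docs = parse_concatenated_json('{"a": 1}\n{"b": 2} {"c": [3]}')
print(len(docs))
```

Malformed input still raises json.JSONDecodeError from raw_decode, so a caller can fall back to another dialect when plain JSON parsing fails.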
Normalization and denormalization
For lossless round-trips back into MongoDB shell or Elasticsearch hit format, use encode_document / decode_document:
import jtoken
with open("hit.json") as f:
    raw = f.read()
text, context = jtoken.encode_document(raw)
restored = jtoken.decode_document(text, target="mongo_shell", context=context)
jtoken encode -f doc.json --context-out doc.ctx.json
jtoken decode --output-format mongo_shell -f doc.jtoken --context-in doc.ctx.json
Input and output formats
auto (the default input format) detects the dialect from the content. Override with source= / target= only when needed.
| Input format | Description |
|---|---|
| auto | detect from content (default) |
| json | standard JSON |
| mongo_shell | MongoDB shell (ObjectId, ISODate, …) |
| mongo_extended | MongoDB Extended JSON |
| elastic_hit | Elasticsearch hit with _source |
| elastic_source | _source wrapper only |

| Output format | Description |
|---|---|
| json | pretty-printed JSON (CLI default) |
| python | Python repr (Python API default) |
| mongo_shell | MongoDB shell document |
| mongo_extended | MongoDB Extended JSON |
| elastic_hit | full Elasticsearch hit envelope |
| elastic_source | _source wrapper |
Public API reference
Core codec
| function | description |
|---|---|
| encode(data) -> str | compress string, dict, or list to jtoken |
| decode(text: str) -> dict | reconstruct the nested dict |
| dumps / loads | json-style aliases |
Normalization
| function | description |
|---|---|
| encode_document(raw, *, source="auto", context=None) | return (jtoken_text, NormalizationContext) |
| decode_document(text, *, target="json", context=None) | decode and denormalize |
| normalize(data, *, source="auto", context=None) | return (normalized_dict, NormalizationContext) |
| denormalize(data, *, target="python", context) | restore lists, typed values, and dialect |
| parse_input(text, *, source="auto") | parse foreign text into Python data |
| render_output(value, *, target="python") -> str | render denormalized data as text |
Token measurement
| function | description |
|---|---|
| count_tokens(data, *, model, backend) -> int | token count for dict or jtoken string |
| count_text_tokens(text, *, model, backend) -> int | token count for raw text |
| token_savings(data, *, model, backend, json_indent=2) | compare jtoken vs pretty JSON |
TokenSavings properties
| property | type | description |
|---|---|---|
| jtoken_tokens | int | tokens in jtoken representation |
| json_tokens | int | tokens in JSON baseline |
| saved | int | json_tokens - jtoken_tokens |
| percent | float | percent saved |
NormalizationContext fields
| field | description |
|---|---|
| source_format | detected input dialect |
| target_format | optional output hint |
| typed_values | BSON-like type markers per path |
| lists | paths that were lists before flattening |
| dotted_keys | paths with escaped . keys |
| elastic | Elasticsearch envelope metadata |
Methods: to_dict(), from_dict(data).
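To make the idea behind typed_values concrete, here is a minimal sketch of recording a type marker per dotted path while unwrapping MongoDB Extended JSON values, then using that record to restore them. It is an assumption about the general approach, not jtoken's code: both function names are hypothetical, only the $oid, $date, and $numberInt wrappers are recognized, and lists and keys containing dots are ignored.

```python
WRAPPERS = ("$oid", "$date", "$numberInt")


def strip_extended_json(node, path="", typed=None):
    """Unwrap {"$oid": ...}-style values, recording the wrapper per dotted path."""
    typed = {} if typed is None else typed
    if isinstance(node, dict):
        if len(node) == 1:
            key, value = next(iter(node.items()))
            if key in WRAPPERS:
                typed[path] = key          # remember which wrapper lived here
                return value, typed
        out = {}
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            out[key], _ = strip_extended_json(value, child, typed)
        return out, typed
    return node, typed


def restore_extended_json(node, typed, path=""):
    """Re-wrap values at the paths recorded by strip_extended_json."""
    if path in typed:
        return {typed[path]: node}
    if isinstance(node, dict):
        return {
            key: restore_extended_json(value, typed, f"{path}.{key}" if path else key)
            for key, value in node.items()
        }
    return node


doc = {"_id": {"$oid": "507f1f77bcf86cd799439011"}, "meta": {"views": {"$numberInt": "7"}}}
plain, typed = strip_extended_json(doc)
print(typed)
```

Keeping the markers outside the document is what lets the compressed text stay free of wrapper syntax while the round trip remains lossless.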
Exceptions
| exception | when raised |
|---|---|
| JPackEncodeError | encoding fails |
| JPackDecodeError | decoding fails |
| NormalizationError | normalization fails |
| DenormalizationError | denormalization fails |
| TokenCountError | token counting fails |
Token counting
stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)
| backend | behavior |
|---|---|
| auto | use tiktoken when installed, otherwise estimate |
| tiktoken | require tiktoken |
| estimate | character heuristic |
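The backend selection described in the table could be sketched as follows. This is a hedged illustration, not jtoken's implementation: sketch_count_tokens is a hypothetical name, the four-characters-per-token ratio is an assumed heuristic (the actual one is not documented here), and a missing tiktoken surfaces as ImportError rather than TokenCountError.

```python
import math


def sketch_count_tokens(text, model="gpt-4o", backend="auto"):
    """Count tokens with tiktoken when available, else estimate from length."""
    if backend not in ("auto", "tiktoken", "estimate"):
        raise ValueError(f"unknown backend: {backend!r}")
    if backend in ("auto", "tiktoken"):
        try:
            import tiktoken
        except ImportError:
            if backend == "tiktoken":
                raise  # tiktoken was explicitly required
            # backend == "auto": fall through to the estimate below
        else:
            return len(tiktoken.encoding_for_model(model).encode(text))
    # Assumed heuristic: roughly four characters per token.
    return math.ceil(len(text) / 4)


print(sketch_count_tokens("name: Alice\nage: 30", backend="estimate"))
```

An estimate is only useful for rough comparisons; for numbers that match a specific model's tokenizer, install the tiktoken extra and request that backend explicitly.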
Representative token counts
| Document type | JSON | jtoken |
|---|---|---|
| ELK hit | 1537 | 583 |
| Mongo shell | 770 | 508 |
| PostgreSQL structured document | 831 | 685 |
| Standard JSON | 617 | 503 |
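The savings implied by the table can be worked out with a few lines of arithmetic. The sketch below mirrors the TokenSavings property names, but the class SavingsSketch is hypothetical, and measuring percent against the JSON baseline is an assumption (the source only says "percent saved").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavingsSketch:
    jtoken_tokens: int
    json_tokens: int

    @property
    def saved(self):
        return self.json_tokens - self.jtoken_tokens

    @property
    def percent(self):
        # Assumption: percent is relative to the JSON baseline.
        return 100.0 * self.saved / self.json_tokens if self.json_tokens else 0.0


# Token counts copied from the table above (jtoken first, JSON second).
rows = {
    "ELK hit": SavingsSketch(583, 1537),
    "Mongo shell": SavingsSketch(508, 770),
    "PostgreSQL structured document": SavingsSketch(685, 831),
    "Standard JSON": SavingsSketch(503, 617),
}

for name, row in rows.items():
    print(f"{name}: saved {row.saved} tokens ({row.percent:.1f}%)")
```

Under that assumption the ELK hit saves 954 tokens (about 62%), while the standard JSON document saves 114 (about 18%), so the benefit depends heavily on how much envelope and type-wrapper syntax the source format carries.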
CLI
cat data.json | jtoken encode
cat data.jtoken | jtoken decode
jtoken encode -f data.json
jtoken stats -f data.json --model gpt-4o --backend tiktoken
jtoken count -f data.json
python -m jtoken encode
Links
- Homepage: https://github.com/hermannsamimi/jtoken
- Issues: https://github.com/hermannsamimi/jtoken/issues
License
MIT — Copyright (c) 2026 Hermann Samimi