Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement

jtoken

Author: Hermann Samimi

jtoken compresses JSON-shaped documents for LLM prompts: fewer tokens, readable line-oriented output, lossless round-trip. Pass a file, a string, or a dict — it figures out the rest.

Python 3.8+. No extra runtime dependencies.

Installation

pip install jtoken
pip install "jtoken[tiktoken]"   # for OpenAI-compatible token counting

Quick start

import jtoken

# From a file: read as text, pass directly
with open("data.json", encoding="utf-8") as f:
    raw = f.read()
encoded = jtoken.encode(raw)
print(encoded)

# From a Python dict
data = {"user": "alice", "age": 30, "active": True, "ref": None}
encoded = jtoken.encode(data)
decoded = jtoken.decode(encoded)
assert decoded == data

Aliases: jtoken.dumps = encode, jtoken.loads = decode.

Format overview

JSON

{"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}

jtoken

name: Alice
age: 30
trues: active
falses: verified
nulls: ref

Encoding rules

  • Nested dicts flatten with dot notation.
  • True, False, and None collapse into trues:, falses:, and nulls: summary lines.
  • Ambiguous strings keep quotes on encode.
  • Keys containing . are escaped during normalization and restored from context.
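
The first two rules can be illustrated with a minimal sketch. This is not jtoken's implementation: `flatten` and `sketch_encode` are hypothetical helpers, lists, string quoting, and key escaping are left out, and joining several paths with a comma on a summary line is an assumption.

```python
def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Non-empty dicts are descended into; their keys extend the path.
            yield from flatten(value, path)
        else:
            yield path, value


def sketch_encode(doc):
    """Render a dict as 'path: value' lines plus trues/falses/nulls summaries."""
    lines, trues, falses, nulls = [], [], [], []
    for path, value in flatten(doc):
        # Identity checks keep 1 and 0 from being mistaken for True and False.
        if value is True:
            trues.append(path)
        elif value is False:
            falses.append(path)
        elif value is None:
            nulls.append(path)
        else:
            lines.append(f"{path}: {value}")
    for label, paths in (("trues", trues), ("falses", falses), ("nulls", nulls)):
        if paths:
            # Assumption: multiple paths share one line, comma-separated.
            lines.append(f"{label}: {', '.join(paths)}")
    return "\n".join(lines)


doc = {"name": "Alice", "age": 30, "active": True, "verified": False, "ref": None}
print(sketch_encode(doc))
```

Run on the document from the format overview, the sketch prints the same five lines shown there.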

What jtoken accepts

encode accepts a string (file content) or a dict/list. When given a string, it auto-detects the format:

  • Standard JSON objects and arrays
  • Multiple bare JSON objects in a single string (no array wrapper needed)
  • MongoDB shell format (ObjectId(...), ISODate(...), NumberInt(...))
  • MongoDB Extended JSON ($oid, $date, $numberInt, …)
  • Elasticsearch search hits (with _source)

No format flag required — just pass the text.
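
To make the "multiple bare JSON objects" case concrete, here is one way such input could be split using only the standard library. `split_bare_objects` is a hypothetical helper written for illustration; how jtoken itself parses this input is not stated here.

```python
import json


def split_bare_objects(text):
    """Parse a string holding several JSON values back to back."""
    decoder = json.JSONDecoder()
    docs, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        docs.append(value)
    return docs


print(split_bare_objects('{"a": 1}\n{"b": 2} {"c": [3]}'))
```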

Normalization and denormalization

For lossless round-trips back into MongoDB shell or Elasticsearch hit format, use encode_document / decode_document:

import jtoken

with open("hit.json", encoding="utf-8") as f:
    raw = f.read()
text, context = jtoken.encode_document(raw)
restored = jtoken.decode_document(text, target="mongo_shell", context=context)

From the command line:

jtoken encode -f doc.json --context-out doc.ctx.json
jtoken decode --output-format mongo_shell -f doc.jtoken --context-in doc.ctx.json
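
Why a separate context is needed for a lossless round trip can be seen in a small sketch. It strips MongoDB Extended JSON wrappers such as {"$oid": ...} and records which path carried which marker, then puts them back. `normalize_extended` and `denormalize_extended` are hypothetical names, and the plain dict of markers only stands in for the idea of per-path type information; the real NormalizationContext may be organized differently.

```python
def normalize_extended(doc, path="", typed=None):
    """Unwrap {"$marker": value} dicts and remember the marker per dotted path."""
    typed = {} if typed is None else typed
    out = {}
    for key, value in doc.items():
        here = f"{path}.{key}" if path else key
        is_wrapper = (
            isinstance(value, dict)
            and len(value) == 1
            and next(iter(value)).startswith("$")
        )
        if is_wrapper:
            marker, inner = next(iter(value.items()))
            typed[here] = marker          # e.g. "_id" -> "$oid"
            out[key] = inner
        elif isinstance(value, dict):
            out[key], _ = normalize_extended(value, here, typed)
        else:
            out[key] = value
    return out, typed


def denormalize_extended(doc, typed, path=""):
    """Re-wrap values whose path was recorded during normalization."""
    out = {}
    for key, value in doc.items():
        here = f"{path}.{key}" if path else key
        if here in typed:
            out[key] = {typed[here]: value}
        elif isinstance(value, dict):
            out[key] = denormalize_extended(value, typed, here)
        else:
            out[key] = value
    return out


original = {"_id": {"$oid": "507f1f77bcf86cd799439011"}, "meta": {"n": {"$numberInt": "3"}}}
plain, typed = normalize_extended(original)
assert denormalize_extended(plain, typed) == original
```

Without the recorded markers, the plain document alone cannot tell an ObjectId string from an ordinary string, which is the information the context has to carry.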

Input and output formats

auto (the default) handles everything automatically. Override with source= / target= only when needed.

Input format      Description
auto              detect from content (default)
json              standard JSON
mongo_shell       MongoDB shell (ObjectId, ISODate, …)
mongo_extended    MongoDB Extended JSON
elastic_hit       Elasticsearch hit with _source
elastic_source    _source wrapper only

Output format     Description
json              pretty-printed JSON (CLI default)
python            Python repr (Python API default)
mongo_shell       MongoDB shell document
mongo_extended    MongoDB Extended JSON
elastic_hit       full Elasticsearch hit envelope
elastic_source    _source wrapper
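
As a rough illustration of what content-based detection for the auto input format could look like, the sketch below maps a text to one of the input format names listed above. The heuristics and the `guess_source` helper are assumptions made for this example, not a description of jtoken's detection logic.

```python
import json
import re

# Constructor-style calls that only appear in MongoDB shell output.
SHELL_CALL = re.compile(r"\b(ObjectId|ISODate|NumberInt)\s*\(")


def has_dollar_key(value):
    """True if any dict key at any depth starts with '$' (Extended JSON style)."""
    if isinstance(value, dict):
        return any(k.startswith("$") or has_dollar_key(v) for k, v in value.items())
    if isinstance(value, list):
        return any(has_dollar_key(v) for v in value)
    return False


def guess_source(text):
    """Guess an input format name from the text alone (heuristic sketch)."""
    if SHELL_CALL.search(text):
        return "mongo_shell"          # not valid JSON, so check before parsing
    doc = json.loads(text)
    if isinstance(doc, dict) and "_source" in doc:
        # A lone _source key is the bare wrapper; extra keys suggest a full hit.
        return "elastic_source" if set(doc) == {"_source"} else "elastic_hit"
    if has_dollar_key(doc):
        return "mongo_extended"
    return "json"


print(guess_source('{"_index": "logs", "_source": {"msg": "ok"}}'))
```

A heuristic like this can misfire, for example on a plain JSON string that happens to contain "ObjectId(", which is one reason an explicit source= override is useful.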

Public API reference

Core codec

function                     description
encode(data) -> str          compress string, dict, or list to jtoken
decode(text: str) -> dict    reconstruct the nested dict
dumps / loads                json-style aliases

Normalization

function                                                  description
encode_document(raw, *, source="auto", context=None)      return (jtoken_text, NormalizationContext)
decode_document(text, *, target="json", context=None)     decode and denormalize
normalize(data, *, source="auto", context=None)           return (normalized_dict, NormalizationContext)
denormalize(data, *, target="python", context)            restore lists, typed values, and dialect
parse_input(text, *, source="auto")                       parse foreign text into Python data
render_output(value, *, target="python") -> str           render denormalized data as text

Token measurement

function                                                 description
count_tokens(data, *, model, backend) -> int             token count for dict or jtoken string
count_text_tokens(text, *, model, backend) -> int        token count for raw text
token_savings(data, *, model, backend, json_indent=2)    compare jtoken vs pretty JSON

TokenSavings properties

property         type     description
jtoken_tokens    int      tokens in jtoken representation
json_tokens      int      tokens in JSON baseline
saved            int      json_tokens - jtoken_tokens
percent          float    percent saved

NormalizationContext fields

field            description
source_format    detected input dialect
target_format    optional output hint
typed_values     BSON-like type markers per path
lists            paths that were lists before flattening
dotted_keys      paths with escaped . keys
elastic          Elasticsearch envelope metadata

Methods: to_dict(), from_dict(data).

Exceptions

exception               when raised
JPackEncodeError        encoding fails
JPackDecodeError        decoding fails
NormalizationError      normalization fails
DenormalizationError    denormalization fails
TokenCountError         token counting fails

Token counting

stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)

backend     behavior
auto        use tiktoken when installed, otherwise estimate
tiktoken    require tiktoken
estimate    character heuristic
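
The backend choices above can be sketched as follows. `resolve_backend` and `estimate_tokens` are hypothetical helpers, and the figure of roughly four characters per token is a common rule of thumb assumed here; the heuristic jtoken actually uses is not specified.

```python
import importlib.util
import math


def resolve_backend(backend="auto"):
    """Pick a concrete backend name following the table's three options."""
    if backend not in ("auto", "tiktoken", "estimate"):
        raise ValueError(f"unknown backend: {backend!r}")
    # find_spec checks whether the package is importable without importing it.
    has_tiktoken = importlib.util.find_spec("tiktoken") is not None
    if backend == "auto":
        return "tiktoken" if has_tiktoken else "estimate"
    if backend == "tiktoken" and not has_tiktoken:
        raise RuntimeError("backend 'tiktoken' requested but tiktoken is not installed")
    return backend


def estimate_tokens(text, chars_per_token=4):
    """Character heuristic. Assumption: about 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


print(resolve_backend("auto"), estimate_tokens("name: Alice\nage: 30"))
```

An estimate of this kind is only useful for relative comparisons; exact counts for a given model need a real tokenizer.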

Representative token counts

Document type                     JSON    jtoken
ELK hit                           1537    583
Mongo shell                       770     508
PostgreSQL structured document    831     685
Standard JSON                     617     503
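
The savings implied by these counts can be worked out with a small stand-in for the TokenSavings properties. `SavingsSketch` is a hypothetical class, and computing percent relative to the JSON baseline is an assumption about how that property is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavingsSketch:
    """Hypothetical mirror of the documented TokenSavings properties."""
    json_tokens: int
    jtoken_tokens: int

    @property
    def saved(self):
        return self.json_tokens - self.jtoken_tokens

    @property
    def percent(self):
        # Assumption: percent is measured against the JSON baseline.
        if self.json_tokens == 0:
            return 0.0
        return 100.0 * self.saved / self.json_tokens


# (JSON tokens, jtoken tokens) copied from the table above.
ROWS = {
    "ELK hit": (1537, 583),
    "Mongo shell": (770, 508),
    "PostgreSQL structured document": (831, 685),
    "Standard JSON": (617, 503),
}

for name, (json_count, jtoken_count) in ROWS.items():
    s = SavingsSketch(json_count, jtoken_count)
    print(f"{name}: saved {s.saved} tokens ({s.percent:.1f}%)")
```

Under that definition the reduction ranges from about 18% for the PostgreSQL document to about 62% for the ELK hit.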

[Figure: Token count by representation]

CLI

cat data.json | jtoken encode
cat data.jtoken | jtoken decode
jtoken encode -f data.json
jtoken stats -f data.json --model gpt-4o --backend tiktoken
jtoken count -f data.json
python -m jtoken encode

License

MIT — Copyright (c) 2026 Hermann Samimi
