Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement

jtoken

Full documentation, diagrams, and the GitHub README: github.com/HermannSamimi/jtoken.

Author: Hermann Samimi

jtoken compresses JSON-shaped documents for LLM prompts. It produces fewer tokens, readable line-oriented output, and a lossless round-trip for nested dicts of supported scalar values. It includes normalization for Elasticsearch hits and MongoDB JSON, a CLI, and token measurement helpers.

Python 3.8+.

Installation

Core (no extra runtime dependencies)

pip install jtoken

With accurate OpenAI-style token counting

pip install "jtoken[tiktoken]"

The core package uses only the Python standard library. Install the tiktoken extra when you want tokenizer-accurate counts for OpenAI-compatible models.

Quick start

import jtoken

data = {
    "user": "alice",
    "age": 30,
    "premium": True,
    "verified": True,
    "is_remote": False,
    "trial": False,
    "score": 9.5,
    "referral": None,
    "last_login": None,
}

text = jtoken.encode(data)
restored = jtoken.decode(text)
assert restored == data

Aliases: jtoken.dumps = encode, jtoken.loads = decode.

End-to-end document workflow

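from pathlib import Path

import jtoken

# Read the raw Elasticsearch hit as text and close the file promptly.
raw = Path("hit.json").read_text(encoding="utf-8")
# encode_document returns the jtoken text plus the normalization context.
text, context = jtoken.encode_document(raw, source="elastic_hit")
restored = jtoken.decode_document(text, target="elastic_hit", context=context)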

Keep the normalization context sidecar when you need a lossless decode back into Mongo shell, Extended JSON, or an Elasticsearch hit envelope.

Format overview

JSON

{"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}

jtoken

name: Alice
age: 30
trues: active
falses: verified
nulls: ref

Encoding rules

  • Nested dicts flatten with dot notation.
  • True, False, and None collapse into trues:, falses:, and nulls: summary lines.
  • Ambiguous strings keep quotes on encode.
  • Multiline strings are JSON-quoted on one line.
  • Keys containing . are escaped during normalization and restored from context.
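
The rules above can be illustrated with a small standard-library sketch. It is not jtoken's implementation: the names flatten, needs_quotes, encode_sketch, and decode_sketch are hypothetical, the ", " separator on summary lines is an assumption, and dotted-key escaping is left out.

```python
import json

# Assumed summary-line names and the value each one stands for.
SUMMARY = {"trues": True, "falses": False, "nulls": None}


def flatten(data, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def needs_quotes(text):
    """True when a bare string would not decode back to the same string."""
    if not text or text != text.strip() or "\n" in text or text.startswith('"'):
        return True
    try:
        json.loads(text)  # "30", "true", "null" would otherwise change type
        return True
    except ValueError:
        return False


def encode_sketch(data):
    lines = []
    buckets = {name: [] for name in SUMMARY}
    for path, value in flatten(data):
        if value is True:
            buckets["trues"].append(path)
        elif value is False:
            buckets["falses"].append(path)
        elif value is None:
            buckets["nulls"].append(path)
        elif isinstance(value, str):
            # Ambiguous or multiline strings are JSON-quoted on one line.
            shown = json.dumps(value) if needs_quotes(value) else value
            lines.append(f"{path}: {shown}")
        else:
            lines.append(f"{path}: {value!r}")
    # Assumption: paths on a summary line are joined with ", ".
    lines += [f"{name}: {', '.join(paths)}" for name, paths in buckets.items() if paths]
    return "\n".join(lines)


def decode_sketch(text):
    result = {}

    def put(path, value):
        *parents, leaf = path.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value

    for line in text.splitlines():
        key, _, raw = line.partition(": ")
        if key in SUMMARY:
            for path in raw.split(", "):
                put(path, SUMMARY[key])
        else:
            try:
                put(key, json.loads(raw))  # numbers and quoted strings
            except ValueError:
                put(key, raw)  # bare, unambiguous string
    return result


doc = {"name": "Alice", "age": 30, "active": True, "verified": False, "ref": None}
print(encode_sketch(doc))
```

For the document from the format overview, the sketch prints the same five lines shown there. Splitting each line on the first ": " is also why a key containing that sequence cannot be represented.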

Supported value types

The scalars str, int, float, bool, and None, plus nested dicts of those scalars.

Limitations

  • Keys cannot contain ": " in the core codec.
  • Reserved top-level keys: nulls, trues, falses.
  • Lists are normalized into nested dicts with numeric keys before encoding.
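
As a rough illustration of the list rule, the sketch below turns lists into dicts with numeric string keys and records the dotted path of each list so the shape can be restored later. The helper names lists_to_dicts and dicts_to_lists are hypothetical and this is not the library's own code.

```python
def lists_to_dicts(data):
    """Replace every list with a dict keyed by "0", "1", ... and record its path."""
    list_paths = set()

    def walk(value, path):
        if isinstance(value, list):
            list_paths.add(path)
            value = {str(index): item for index, item in enumerate(value)}
        if isinstance(value, dict):
            return {
                key: walk(item, f"{path}.{key}" if path else key)
                for key, item in value.items()
            }
        return value

    return walk(data, ""), list_paths


def dicts_to_lists(data, list_paths):
    """Inverse of lists_to_dicts, driven by the recorded paths."""

    def walk(value, path):
        if isinstance(value, dict):
            value = {
                key: walk(item, f"{path}.{key}" if path else key)
                for key, item in value.items()
            }
            if path in list_paths:
                # Numeric keys restore the original element order.
                return [value[key] for key in sorted(value, key=int)]
        return value

    return walk(data, "")


doc = {"tags": ["a", "b"], "items": [{"id": 1}, {"id": 2, "sub": [True]}]}
normalized, paths = lists_to_dicts(doc)
print(normalized)
print(sorted(paths))
```

Without the recorded paths, a dict such as {"0": "a", "1": "b"} cannot be told apart from a list, which is why a lossless decode needs that information kept alongside the encoded text.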

Input and output formats

Select a dialect with source= / target= in Python, or with --input-format / --output-format on the CLI. The encode, stats, and count subcommands accept --input-format (default auto); decode accepts --output-format (default json).

Input (source / --input-format)    Use when
auto                               Let jtoken detect the dialect from the text or object shape
json                               Standard JSON object
python                             Same JSON parser as json
mongo_extended                     MongoDB Extended JSON with $oid, $date, $numberInt, $numberLong, $numberDouble, $numberDecimal
mongo_shell                        MongoDB shell document with ObjectId(), ISODate(), NumberInt(), NumberLong()
elastic_hit                        Elasticsearch search hit with _source (and optional fields)
elastic_source                     _source payload only, or a document wrapped as {"_source": {...}}

Output (target / --output-format)  Use when
python                             Python repr (Python API default)
json                               Pretty-printed JSON (CLI decode default)
mongo_extended                     Extended JSON; requires a context sidecar for BSON-like types
mongo_shell                        Mongo shell document; requires a context sidecar for BSON-like types
elastic_hit                        Full Elasticsearch hit envelope; requires a context sidecar
elastic_source                     JSON shaped like an Elasticsearch _source wrapper

With auto, jtoken picks mongo_shell when it sees ObjectId(...) or ISODate(...), elastic_hit when the object has a dict _source, mongo_extended when Extended JSON markers such as $oid or $date appear, and otherwise json.
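
A minimal sketch of that detection order, assuming the checks run in the order the sentence lists them. The function detect_dialect and its helpers are hypothetical names, not part of jtoken, and the regular expression is a simplification that would also match a shell literal appearing inside a string value.

```python
import json
import re

# Shell literals are not valid JSON, so they are checked on the raw text first.
SHELL_LITERAL = re.compile(r"\b(?:ObjectId|ISODate)\(")
EXTENDED_MARKERS = ("$oid", "$date")


def has_extended_marker(value):
    """Recursively look for Extended JSON marker keys."""
    if isinstance(value, dict):
        return any(
            key in EXTENDED_MARKERS or has_extended_marker(item)
            for key, item in value.items()
        )
    if isinstance(value, list):
        return any(has_extended_marker(item) for item in value)
    return False


def detect_dialect(text):
    if SHELL_LITERAL.search(text):
        return "mongo_shell"
    obj = json.loads(text)
    if isinstance(obj, dict) and isinstance(obj.get("_source"), dict):
        return "elastic_hit"
    if has_extended_marker(obj):
        return "mongo_extended"
    return "json"


print(detect_dialect('{"_id": ObjectId("507f1f77bcf86cd799439011")}'))
print(detect_dialect('{"_index": "logs", "_source": {"msg": "hi"}}'))
print(detect_dialect('{"_id": {"$oid": "507f1f77bcf86cd799439011"}}'))
print(detect_dialect('{"name": "Alice"}'))
```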

Write the normalization context to a sidecar on encode (--context-out / NormalizationContext.to_dict()) and pass it back on decode when the output dialect is not plain JSON or Python. The sidecar records list paths, dotted keys, Elasticsearch envelope metadata, and MongoDB type markers in typed_values (object_id, datetime, long).

MongoDB shell and Extended JSON

Mongo shell input is parsed as JSON after rewriting shell literals: ObjectId("...") and ISODate("...") become Extended JSON, NumberInt(n) becomes a plain integer, and NumberLong(n) becomes {"$numberLong": "n"}. On normalize, object_id, datetime, and long values are stored in the context so mongo_extended and mongo_shell output can restore {"$oid": ...} / ObjectId(...), {"$date": ...} / ISODate(...), and {"$numberLong": ...} / NumberLong(...). $numberInt, $numberDouble, and $numberDecimal are coerced to Python scalars and are not tracked in typed_values.
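
The literal rewriting described above might look roughly like the following. This is a sketch under stated assumptions, not jtoken's parser: shell_to_extended is a hypothetical name, keys are assumed to be double-quoted already, and the patterns do not protect literals that appear inside string values.

```python
import json
import re

# (pattern, replacement) pairs applied to the raw shell text before json.loads.
REWRITES = [
    (re.compile(r'ObjectId\(\s*"([^"]*)"\s*\)'), r'{"$oid": "\1"}'),
    (re.compile(r'ISODate\(\s*"([^"]*)"\s*\)'), r'{"$date": "\1"}'),
    # NumberInt(n) becomes a plain integer.
    (re.compile(r'NumberInt\(\s*"?(-?\d+)"?\s*\)'), r'\1'),
    # NumberLong(n) keeps its type as an Extended JSON wrapper.
    (re.compile(r'NumberLong\(\s*"?(-?\d+)"?\s*\)'), r'{"$numberLong": "\1"}'),
]


def shell_to_extended(text):
    """Rewrite shell literals, then parse the result as ordinary JSON."""
    for pattern, replacement in REWRITES:
        text = pattern.sub(replacement, text)
    return json.loads(text)


shell_doc = (
    '{"_id": ObjectId("507f1f77bcf86cd799439011"), '
    '"created": ISODate("2024-01-01T00:00:00Z"), '
    '"retries": NumberInt(5), '
    '"bytes": NumberLong(9007199254740993)}'
)
print(shell_to_extended(shell_doc))
```

Keeping NumberLong as a string inside a wrapper avoids precision loss for values beyond what a double can represent exactly, and it leaves a marker that a later step can use to restore NumberLong(...) on output.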

Elasticsearch hits

elastic_hit encodes the merged _source document (plus any fields values that are not already present in _source) and stores _index, _id, _version, _score, _type, and _routing in the context for lossless elastic_hit output.
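
The split between document and envelope can be sketched as below. The function split_hit is a hypothetical illustration of the merge rule, not the library's code; it leaves fields values in whatever shape the hit provides.

```python
# Envelope keys named in the description above.
ENVELOPE_KEYS = ("_index", "_id", "_version", "_score", "_type", "_routing")


def split_hit(hit):
    """Return (document_to_encode, envelope_metadata) for one search hit."""
    document = dict(hit.get("_source", {}))
    for name, value in hit.get("fields", {}).items():
        # _source wins; fields only contributes names that are missing from it.
        document.setdefault(name, value)
    envelope = {key: hit[key] for key in ENVELOPE_KEYS if key in hit}
    return document, envelope


hit = {
    "_index": "logs",
    "_id": "1",
    "_score": 1.5,
    "_source": {"msg": "hi", "level": "info"},
    "fields": {"level": ["info"], "host": ["web-1"]},
}
document, envelope = split_hit(hit)
print(document)
print(envelope)
```

Only the merged document is encoded into prompt text; the envelope stays out of the prompt and is needed again only when rebuilding the full hit.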

Public API reference

Package metadata

name                type  description
jtoken.__version__  str   package version
jtoken.__author__   str   author name (Hermann Samimi)

Core codec

function  signature                  description
encode    encode(data: dict) -> str  compress a nested scalar dict into jtoken text
decode    decode(text: str) -> dict  reconstruct the nested dict
dumps     alias of encode            json-style alias
loads     alias of decode            json-style alias

Normalization and denormalization

function         signature                                             description
parse_input      parse_input(text, *, source="auto")                   parse foreign text into Python data
normalize        normalize(data, *, source="auto", context=None)       return (normalized_dict, NormalizationContext)
denormalize      denormalize(data, *, target="python", context)        restore lists, typed values, and dialect shape
render_output    render_output(value, *, target="python") -> str       render denormalized data as text
encode_document  encode_document(raw, *, source="auto", context=None)  return (jtoken_text, NormalizationContext)
decode_document  decode_document(text, *, target="python", context)    decode jtoken text and denormalize

Token measurement

function           signature                                                                   description
count_tokens       count_tokens(data, *, model="cl100k_base", backend="auto") -> int           count tokens for a dict or encoded jtoken string
count_text_tokens  count_text_tokens(text, *, model="cl100k_base", backend="auto") -> int      count tokens for raw text
token_savings      token_savings(data, *, model="cl100k_base", backend="auto", json_indent=2)  compare jtoken vs pretty JSON token usage

TokenSavings properties

property       type   description
jtoken_tokens  int    token count for the jtoken representation
json_tokens    int    token count for the JSON baseline
saved          int    json_tokens - jtoken_tokens
percent        float  percent saved relative to JSON

str(stats) returns a one-line summary.

NormalizationContext fields

field          type            description
source_format  str             input dialect used during normalization
target_format  str | None      optional output hint
typed_values   dict[str, str]  dotted paths with BSON-like type markers
lists          set[str]        dotted paths that were lists before flattening
dotted_keys    dict[str, str]  escaped keys that originally contained .
elastic        dict | None     Elasticsearch envelope metadata

Methods: to_dict(), from_dict(data).

Format enums

InputFormat: auto, json, python, mongo_extended, mongo_shell, elastic_hit, elastic_source

OutputFormat: python, json, mongo_extended, mongo_shell, elastic_hit, elastic_source

Exceptions

exception             base        when raised
JPackError            Exception   base library error
JPackEncodeError      JPackError  encoding fails
JPackDecodeError      JPackError  decoding fails
NormalizationError    JPackError  normalization fails
DenormalizationError  JPackError  denormalization fails
TokenCountError       JPackError  token counting fails

Token counting

stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)

backend   behavior
auto      use tiktoken when installed, otherwise estimate
tiktoken  require tiktoken
estimate  simple character heuristic

json_indent=2 compares against prompt-style pretty JSON. Use json_indent=None for compact JSON.
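
To make the backend choice concrete, here is a hedged sketch of how such a fallback could be written. It is not jtoken's code: count_text_tokens_sketch and estimate_tokens are hypothetical names, the four-characters-per-token ratio is an assumed heuristic, and json.dumps(indent=None) only stands in for the compact baseline. The tiktoken calls used (encoding_for_model, get_encoding, encode) are part of that library's public API.

```python
import json


def estimate_tokens(text, chars_per_token=4.0):
    """Rough estimate; the 4-characters-per-token ratio is an assumption."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def count_text_tokens_sketch(text, model="cl100k_base", backend="auto"):
    if backend not in ("auto", "tiktoken", "estimate"):
        raise ValueError(f"unknown backend: {backend!r}")
    if backend != "estimate":
        try:
            import tiktoken
        except ImportError:
            if backend == "tiktoken":
                raise  # the caller asked for tiktoken explicitly
        else:
            try:
                encoding = tiktoken.encoding_for_model(model)  # e.g. "gpt-4o"
            except KeyError:
                encoding = tiktoken.get_encoding(model)  # e.g. "cl100k_base"
            return len(encoding.encode(text))
    return estimate_tokens(text)


data = {"name": "Alice", "age": 30, "active": True}
pretty = json.dumps(data, indent=2)      # pretty-printed baseline
compact = json.dumps(data, indent=None)  # single-line baseline
print(count_text_tokens_sketch(pretty, backend="estimate"))
print(count_text_tokens_sketch(compact, backend="estimate"))
```

The pretty-printed baseline is larger than the single-line one because of the added newlines and indentation, so the reported savings depend on which baseline is chosen.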

Representative token counts

Token counts for sample documents of each type, measured as pretty-printed JSON versus jtoken:

Document type                   JSON  jtoken
ELK hit                         1537  583
Mongo shell                     770   508
PostgreSQL structured document  831   685
Standard JSON                   617   503
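
The saved and percent figures for these rows follow directly from the definitions in the TokenSavings table (saved = json_tokens - jtoken_tokens, percent relative to JSON). The short calculation below uses only the counts listed above; the savings helper is a hypothetical name.

```python
# (json_tokens, jtoken_tokens) taken from the table of representative counts.
ROWS = {
    "ELK hit": (1537, 583),
    "Mongo shell": (770, 508),
    "PostgreSQL structured document": (831, 685),
    "Standard JSON": (617, 503),
}


def savings(json_tokens, jtoken_tokens):
    """Return (saved, percent) with percent measured against the JSON count."""
    saved = json_tokens - jtoken_tokens
    percent = 100.0 * saved / json_tokens if json_tokens else 0.0
    return saved, percent


for name, (json_tokens, jtoken_tokens) in ROWS.items():
    saved, percent = savings(json_tokens, jtoken_tokens)
    print(f"{name}: saved {saved} tokens ({percent:.1f}%)")
```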

Token count by representation

CLI

jtoken encode --input-format mongo_shell -f doc.json --context-out doc.ctx.json
jtoken decode --output-format mongo_shell -f doc.jtoken --context-in doc.ctx.json
jtoken stats --input-format json -f doc.json --model gpt-4o --backend tiktoken
jtoken count --input-format json -f doc.json --backend estimate
python -m jtoken encode

Common flags:

  • -f/--file
  • --input-format
  • --output-format
  • --context-out
  • --context-in
  • --model
  • --backend

License

MIT — Copyright (c) 2026 Hermann Samimi
