Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement

jtoken

Author: Hermann Samimi

jtoken compresses JSON-shaped documents for LLM prompts. It produces fewer tokens and readable, line-oriented output, and it round-trips losslessly for nested dicts of the supported scalar types. It also includes normalization for Elasticsearch hits and MongoDB JSON, a CLI, and token measurement helpers.

Requires Python 3.8+.

Installation

Core (no extra runtime dependencies)

pip install jtoken

With accurate OpenAI-style token counting

pip install "jtoken[tiktoken]"

The core package uses only the Python standard library. Install the tiktoken extra when you want tokenizer-accurate counts for OpenAI-compatible models.

Quick start

import jtoken

data = {
    "user": "alice",
    "age": 30,
    "premium": True,
    "verified": True,
    "is_remote": False,
    "trial": False,
    "score": 9.5,
    "referral": None,
    "last_login": None,
}

text = jtoken.encode(data)
restored = jtoken.decode(text)
assert restored == data

Aliases: jtoken.dumps = encode, jtoken.loads = decode.

End-to-end document workflow

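import jtoken

# Read the raw Elasticsearch hit; the context manager closes the file promptly.
with open("hit.json", encoding="utf-8") as fh:
    raw = fh.read()

# Encode, keeping the normalization context needed for a lossless decode.
text, context = jtoken.encode_document(raw, source="elastic_hit")
restored = jtoken.decode_document(text, target="elastic_hit", context=context)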

Keep the normalization context sidecar when you need a lossless decode back into Mongo shell, Extended JSON, or an Elasticsearch hit envelope.

Format overview

JSON

{"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}

jtoken

name: Alice
age: 30
trues: active
falses: verified
nulls: ref

Encoding rules

Nested dicts flatten with dot notation.
True, False, and None collapse into trues:, falses:, and nulls: summary lines.
Ambiguous strings (those that would otherwise decode as another type, such as "30") keep their quotes on encode.
Multiline strings are JSON-quoted on one line.
Keys containing . are escaped during normalization and restored from context.
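
The flattening and summary-line rules above can be illustrated with a short sketch. This is not jtoken's implementation: `sketch_encode` is a hypothetical helper, the comma separator used when several paths share a summary line is an assumption, and the quoting of ambiguous strings and the escaping of dotted keys are left out.

```python
import json
from typing import Any, Dict, List


def sketch_encode(data: Dict[str, Any]) -> str:
    """Illustrative only: flatten nested dicts and collapse True/False/None."""
    lines: List[str] = []
    trues: List[str] = []
    falses: List[str] = []
    nulls: List[str] = []

    def walk(node: Dict[str, Any], prefix: str) -> None:
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                walk(value, path)  # nested dicts flatten with dot notation
            elif value is True:
                trues.append(path)
            elif value is False:
                falses.append(path)
            elif value is None:
                nulls.append(path)
            elif isinstance(value, str) and "\n" in value:
                # multiline strings are JSON-quoted so they stay on one line
                lines.append(f"{path}: {json.dumps(value)}")
            else:
                lines.append(f"{path}: {value}")

    walk(data, "")
    # Summary lines; the ", " separator between paths is an assumption.
    for label, paths in (("trues", trues), ("falses", falses), ("nulls", nulls)):
        if paths:
            lines.append(f"{label}: {', '.join(paths)}")
    return "\n".join(lines)


print(sketch_encode({"name": "Alice", "age": 30, "active": True, "verified": False, "ref": None}))
```

For the document shown in the format overview, the sketch prints the same five lines as the jtoken example there.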

Supported scalar types

str, int, float, bool, None, and nested dict.
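
To make the round-trip for these scalar types concrete, here is a minimal decoding sketch for the line format. `sketch_decode` and `parse_scalar` are hypothetical names, the ", " separator inside summary lines is an assumption, and the type inference is deliberately naive (Python's `int` and `float` accept inputs such as `1_000` or `inf` that a real codec would probably treat more strictly).

```python
import json
from typing import Any, Dict


def parse_scalar(raw: str) -> Any:
    """Turn the text after "key: " back into a str, int, or float."""
    if raw.startswith('"'):
        return json.loads(raw)  # quoted values are JSON string literals
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # anything else is an unquoted string


def sketch_decode(text: str) -> Dict[str, Any]:
    """Illustrative only: rebuild a nested dict from jtoken-style lines."""
    result: Dict[str, Any] = {}
    constants = {"trues": True, "falses": False, "nulls": None}

    def assign(path: str, value: Any) -> None:
        node = result
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # dotted paths become nested dicts
        node[leaf] = value

    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, raw = line.partition(": ")
        if key in constants:
            for path in raw.split(", "):
                assign(path, constants[key])
        else:
            assign(key, parse_scalar(raw))
    return result


print(sketch_decode("name: Alice\nage: 30\ntrues: active\nfalses: verified\nnulls: ref"))
```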

Limitations

Keys cannot contain ": " in the core codec.
Reserved top-level keys: nulls, trues, falses.
Lists are normalized into nested dicts with numeric keys before encoding.
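
As a rough illustration of the list rule, the sketch below replaces each list with a dict keyed by position and records the dotted path of every list it converts, which is the kind of information a context needs to restore lists later. `lists_to_dicts` is a hypothetical helper; using string keys ("0", "1", ...) rather than integers is an assumption.

```python
from typing import Any, Set


def lists_to_dicts(value: Any, list_paths: Set[str], path: str = "") -> Any:
    """Replace each list with a dict keyed by "0", "1", ... and record its path."""
    if isinstance(value, list):
        list_paths.add(path)
        return {
            str(i): lists_to_dicts(item, list_paths, f"{path}.{i}" if path else str(i))
            for i, item in enumerate(value)
        }
    if isinstance(value, dict):
        return {
            key: lists_to_dicts(item, list_paths, f"{path}.{key}" if path else key)
            for key, item in value.items()
        }
    return value  # scalars pass through unchanged


paths: Set[str] = set()
doc = {"tags": ["a", "b"], "owner": {"roles": ["admin"]}}
normalized = lists_to_dicts(doc, paths)
print(normalized)
print(sorted(paths))
```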

Input and output formats

Select a dialect with source= / target= in Python, or with --input-format / --output-format on the CLI. The encode, stats, and count subcommands accept --input-format (default auto); the decode subcommand accepts --output-format (default json).

| Input (`source` / `--input-format`) | Use when |
| --- | --- |
| `auto` | Let jtoken detect the dialect from the text or object shape |
| `json` | Standard JSON object |
| `python` | Same JSON parser as `json` |
| `mongo_extended` | MongoDB Extended JSON with `$oid`, `$date`, `$numberInt`, `$numberLong`, `$numberDouble`, `$numberDecimal` |
| `mongo_shell` | MongoDB shell document with `ObjectId()`, `ISODate()`, `NumberInt()`, `NumberLong()` |
| `elastic_hit` | Elasticsearch search hit with `_source` (and optional `fields`) |
| `elastic_source` | `_source` payload only, or a document wrapped as `{"_source": {...}}` |

| Output (`target` / `--output-format`) | Use when |
| --- | --- |
| `python` | Python `repr` (Python API default) |
| `json` | Pretty-printed JSON (CLI `decode` default) |
| `mongo_extended` | Extended JSON; requires a context sidecar for BSON-like types |
| `mongo_shell` | Mongo shell document; requires a context sidecar for BSON-like types |
| `elastic_hit` | Full Elasticsearch hit envelope; requires a context sidecar |
| `elastic_source` | JSON shaped like an Elasticsearch `_source` wrapper |

With auto, jtoken picks mongo_shell when it sees ObjectId(...) or ISODate(...), elastic_hit when the object has a dict _source, mongo_extended when Extended JSON markers such as $oid or $date appear, and otherwise json.
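
The detection order described above can be sketched as follows. `guess_format` and `has_extended_marker` are hypothetical names, not part of jtoken, and the exact set of markers that triggers `mongo_extended` is an assumption based on the input format table.

```python
import json
from typing import Any

# Assumed marker set, taken from the `mongo_extended` row of the input table.
EXTENDED_MARKERS = {"$oid", "$date", "$numberInt", "$numberLong", "$numberDouble", "$numberDecimal"}


def has_extended_marker(node: Any) -> bool:
    """Return True if any dict key in the tree is an Extended JSON marker."""
    if isinstance(node, dict):
        return any(k in EXTENDED_MARKERS or has_extended_marker(v) for k, v in node.items())
    if isinstance(node, list):
        return any(has_extended_marker(item) for item in node)
    return False


def guess_format(text: str) -> str:
    """Illustrative only: apply the checks in the order the text gives them."""
    if "ObjectId(" in text or "ISODate(" in text:
        return "mongo_shell"  # shell literals are not valid JSON, so check the raw text
    obj = json.loads(text)
    if isinstance(obj, dict) and isinstance(obj.get("_source"), dict):
        return "elastic_hit"
    if has_extended_marker(obj):
        return "mongo_extended"
    return "json"


print(guess_format('{"_index": "logs", "_source": {"a": 1}}'))
```

Checking for shell literals first matters because a shell document cannot be parsed as JSON until those literals are rewritten.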

Write the normalization context to a sidecar on encode (--context-out / NormalizationContext.to_dict()) and pass it back on decode when the output dialect is not plain JSON or Python. The sidecar records list paths, dotted keys, Elasticsearch envelope metadata, and MongoDB type markers in typed_values (object_id, datetime, long).

MongoDB shell and Extended JSON

Mongo shell input is parsed as JSON after rewriting shell literals: ObjectId("...") and ISODate("...") become Extended JSON, NumberInt(n) becomes a plain integer, and NumberLong(n) becomes {"$numberLong": "n"}. On normalize, object_id, datetime, and long values are stored in the context so mongo_extended and mongo_shell output can restore {"$oid": ...} / ObjectId(...), {"$date": ...} / ISODate(...), and {"$numberLong": ...} / NumberLong(...). $numberInt, $numberDouble, and $numberDecimal are coerced to Python scalars and are not tracked in typed_values.
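
A minimal sketch of the literal rewriting described above, using regular expressions. `shell_to_extended` is a hypothetical helper and the patterns are assumptions: they expect quoted keys, simple arguments, and no shell literals embedded inside string values.

```python
import json
import re

# (pattern, replacement) pairs; \1 refers to the captured argument.
_SHELL_REWRITES = [
    (re.compile(r'ObjectId\(\s*"([^"]*)"\s*\)'), r'{"$oid": "\1"}'),
    (re.compile(r'ISODate\(\s*"([^"]*)"\s*\)'), r'{"$date": "\1"}'),
    (re.compile(r'NumberInt\(\s*"?(-?\d+)"?\s*\)'), r'\1'),  # plain integer
    (re.compile(r'NumberLong\(\s*"?(-?\d+)"?\s*\)'), r'{"$numberLong": "\1"}'),
]


def shell_to_extended(text: str) -> dict:
    """Illustrative only: rewrite shell literals, then parse the result as JSON."""
    for pattern, replacement in _SHELL_REWRITES:
        text = pattern.sub(replacement, text)
    return json.loads(text)


shell_doc = (
    '{"_id": ObjectId("507f1f77bcf86cd799439011"), '
    '"created": ISODate("2024-01-01T00:00:00Z"), '
    '"n": NumberInt(5), "big": NumberLong(9007199254740993)}'
)
parsed = shell_to_extended(shell_doc)
print(parsed)
```

Keeping `NumberLong` as a string inside `{"$numberLong": ...}` preserves 64-bit values for consumers whose JSON numbers are doubles.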

Elasticsearch hits

elastic_hit encodes the merged _source document (plus any fields values that are not already present in _source) and stores _index, _id, _version, _score, _type, and _routing in the context for lossless elastic_hit output.
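
The merge can be pictured with the sketch below, which splits a hit into the document to encode and the envelope metadata to keep in the context. `split_hit` is a hypothetical helper, not jtoken's API; it assumes `_source` wins whenever a name appears in both `_source` and `fields`, as the text implies.

```python
from typing import Any, Dict, Tuple

ENVELOPE_KEYS = ("_index", "_id", "_version", "_score", "_type", "_routing")


def split_hit(hit: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Illustrative only: return (merged document, envelope metadata)."""
    document = dict(hit.get("_source", {}))
    for name, value in hit.get("fields", {}).items():
        document.setdefault(name, value)  # only add fields missing from _source
    envelope = {key: hit[key] for key in ENVELOPE_KEYS if key in hit}
    return document, envelope


hit = {
    "_index": "logs",
    "_id": "1",
    "_score": 1.5,
    "_source": {"user": "alice"},
    "fields": {"user": ["ALICE"], "day": ["2024-01-01"]},
}
document, envelope = split_hit(hit)
print(document)
print(envelope)
```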

Public API reference

Package metadata

| name | type | description |
| --- | --- | --- |
| `jtoken.__version__` | `str` | package version |
| `jtoken.__author__` | `str` | author name (`Hermann Samimi`) |

Core codec

| function | signature | description |
| --- | --- | --- |
| `encode` | `encode(data: dict) -> str` | compress a nested scalar dict into jtoken text |
| `decode` | `decode(text: str) -> dict` | reconstruct the nested dict |
| `dumps` | alias of `encode` | json-style alias |
| `loads` | alias of `decode` | json-style alias |

Normalization and denormalization

| function | signature | description |
| --- | --- | --- |
| `parse_input` | `parse_input(text, *, source="auto")` | parse foreign text into Python data |
| `normalize` | `normalize(data, *, source="auto", context=None)` | return `(normalized_dict, NormalizationContext)` |
| `denormalize` | `denormalize(data, *, target="python", context)` | restore lists, typed values, and dialect shape |
| `render_output` | `render_output(value, *, target="python") -> str` | render denormalized data as text |
| `encode_document` | `encode_document(raw, *, source="auto", context=None)` | return `(jtoken_text, NormalizationContext)` |
| `decode_document` | `decode_document(text, *, target="python", context)` | decode jtoken text and denormalize |

Token measurement

| function | signature | description |
| --- | --- | --- |
| `count_tokens` | `count_tokens(data, *, model="cl100k_base", backend="auto") -> int` | count tokens for a dict or encoded jtoken string |
| `count_text_tokens` | `count_text_tokens(text, *, model="cl100k_base", backend="auto") -> int` | count tokens for raw text |
| `token_savings` | `token_savings(data, *, model="cl100k_base", backend="auto", json_indent=2)` | compare jtoken vs pretty JSON token usage |

`TokenSavings` properties

| property | type | description |
| --- | --- | --- |
| `jtoken_tokens` | `int` | token count for the jtoken representation |
| `json_tokens` | `int` | token count for the JSON baseline |
| `saved` | `int` | `json_tokens - jtoken_tokens` |
| `percent` | `float` | percent saved relative to JSON |

str(stats) returns a one-line summary.
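
The arithmetic behind these properties is simple enough to sketch. `SavingsSketch` is a hypothetical stand-in, not the library's `TokenSavings` class; expressing `percent` on a 0 to 100 scale and returning 0.0 for an empty baseline are both assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavingsSketch:
    jtoken_tokens: int
    json_tokens: int

    @property
    def saved(self) -> int:
        # tokens saved by the jtoken form compared with the JSON baseline
        return self.json_tokens - self.jtoken_tokens

    @property
    def percent(self) -> float:
        # assumed 0-100 scale; guard against an empty baseline
        if self.json_tokens == 0:
            return 0.0
        return 100.0 * self.saved / self.json_tokens

    def __str__(self) -> str:
        return (
            f"jtoken={self.jtoken_tokens} json={self.json_tokens} "
            f"saved={self.saved} ({self.percent:.1f}%)"
        )


stats_sketch = SavingsSketch(jtoken_tokens=60, json_tokens=80)
print(stats_sketch)
```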

`NormalizationContext` fields

| field | type | description |
| --- | --- | --- |
| `source_format` | `str` | input dialect used during normalization |
| `target_format` | `str \| None` | optional output hint |
| `typed_values` | `dict[str, str]` | dotted paths with BSON-like type markers |
| `lists` | `set[str]` | dotted paths that were lists before flattening |
| `dotted_keys` | `dict[str, str]` | escaped keys that originally contained `.` |
| `elastic` | `dict \| None` | Elasticsearch envelope metadata |

Methods: to_dict(), from_dict(data).
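
To show why a `to_dict()` / `from_dict()` pair is useful for a JSON sidecar, here is a sketch of a context-like dataclass that survives a round trip through `json`. `ContextSketch` is a hypothetical stand-in that mirrors the field table; how the real class serializes `lists` (here, a sorted list) is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Set


@dataclass
class ContextSketch:
    source_format: str = "auto"
    target_format: Optional[str] = None
    typed_values: Dict[str, str] = field(default_factory=dict)
    lists: Set[str] = field(default_factory=set)
    dotted_keys: Dict[str, str] = field(default_factory=dict)
    elastic: Optional[Dict[str, Any]] = None

    def to_dict(self) -> Dict[str, Any]:
        return {
            "source_format": self.source_format,
            "target_format": self.target_format,
            "typed_values": dict(self.typed_values),
            "lists": sorted(self.lists),  # sets are not JSON-serializable
            "dotted_keys": dict(self.dotted_keys),
            "elastic": self.elastic,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ContextSketch":
        return cls(
            source_format=data.get("source_format", "auto"),
            target_format=data.get("target_format"),
            typed_values=dict(data.get("typed_values", {})),
            lists=set(data.get("lists", [])),  # restore the set
            dotted_keys=dict(data.get("dotted_keys", {})),
            elastic=data.get("elastic"),
        )


ctx = ContextSketch(
    source_format="mongo_shell",
    typed_values={"_id": "object_id"},
    lists={"tags"},
)
sidecar_text = json.dumps(ctx.to_dict())  # what would be written to the sidecar file
restored_ctx = ContextSketch.from_dict(json.loads(sidecar_text))
print(restored_ctx == ctx)
```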

Format enums

InputFormat: auto, json, python, mongo_extended, mongo_shell, elastic_hit, elastic_source

OutputFormat: python, json, mongo_extended, mongo_shell, elastic_hit, elastic_source

Exceptions

| exception | base | when raised |
| --- | --- | --- |
| `JPackError` | `Exception` | base library error |
| `JPackEncodeError` | `JPackError` | encoding fails |
| `JPackDecodeError` | `JPackError` | decoding fails |
| `NormalizationError` | `JPackError` | normalization fails |
| `DenormalizationError` | `JPackError` | denormalization fails |
| `TokenCountError` | `JPackError` | token counting fails |
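
Because every exception in the table derives from `JPackError`, callers can catch a specific failure first and fall back to the base class. The sketch below uses local stand-in classes that copy only the names and bases from the table, so it runs without the package; `describe` is a hypothetical helper.

```python
class JPackError(Exception):
    """Stand-in for the base library error."""


class JPackEncodeError(JPackError):
    """Stand-in: encoding fails."""


class JPackDecodeError(JPackError):
    """Stand-in: decoding fails."""


class NormalizationError(JPackError):
    """Stand-in: normalization fails."""


class DenormalizationError(JPackError):
    """Stand-in: denormalization fails."""


class TokenCountError(JPackError):
    """Stand-in: token counting fails."""


def describe(exc: Exception) -> str:
    """Handle the specific case first, then anything else from the hierarchy."""
    try:
        raise exc
    except JPackDecodeError:
        return "decode failed"
    except JPackError:
        return "other jtoken error"


print(describe(JPackDecodeError("bad line")))
print(describe(TokenCountError("no tokenizer")))
```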

Token counting

import jtoken
data = {"user": "alice", "age": 30, "premium": True}
stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)

| `backend` | behavior |
| --- | --- |
| `auto` | use `tiktoken` when installed, otherwise estimate |
| `tiktoken` | require `tiktoken` |
| `estimate` | simple character heuristic |

json_indent=2 compares against prompt-style pretty JSON. Use json_indent=None for compact JSON.
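
The backend choice described in the table might look roughly like this. Both functions are hypothetical sketches, not jtoken's code: the 4-characters-per-token ratio is an assumption (the source only says "simple character heuristic"), and accepting either a model name or an encoding name in `model` is a guess at how the two example values (`gpt-4o` and `cl100k_base`) are resolved.

```python
import importlib.util
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Character heuristic; the default ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def count_text_tokens_sketch(text: str, model: str = "cl100k_base", backend: str = "auto") -> int:
    """Illustrative only: pick a counting backend the way the table describes."""
    if backend not in ("auto", "tiktoken", "estimate"):
        raise ValueError(f"unknown backend: {backend!r}")
    have_tiktoken = importlib.util.find_spec("tiktoken") is not None
    if backend == "estimate" or (backend == "auto" and not have_tiktoken):
        return estimate_tokens(text)
    if not have_tiktoken:
        raise RuntimeError("backend='tiktoken' requires the tiktoken package")
    import tiktoken  # imported lazily so the estimate path has no dependency

    try:
        encoding = tiktoken.encoding_for_model(model)  # model name such as "gpt-4o"
    except KeyError:
        encoding = tiktoken.get_encoding(model)  # encoding name such as "cl100k_base"
    return len(encoding.encode(text))


print(count_text_tokens_sketch("name: Alice\nage: 30", backend="estimate"))
```

An estimate is useful for quick comparisons, but only a real tokenizer gives counts that match what a model is billed for.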

CLI

jtoken encode --input-format mongo_shell -f doc.json --context-out doc.ctx.json
jtoken decode --output-format mongo_shell -f doc.jtoken --context-in doc.ctx.json
jtoken stats --input-format json -f doc.json --model gpt-4o --backend tiktoken
jtoken count --input-format json -f doc.json --backend estimate
python -m jtoken encode

Common flags:

-f/--file
--input-format
--output-format
--context-out
--context-in
--model
--backend

