Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement

jtoken

Author: Hermann Samimi

jtoken compresses JSON-shaped documents for LLM prompts. It produces fewer tokens than JSON, keeps the output readable and line-oriented, and round-trips losslessly for nested dicts of supported scalar types. It also includes normalization for Elasticsearch hits and MongoDB JSON, a CLI, and token measurement helpers.

Requires Python 3.8 or later.

Installation

Core (no extra runtime dependencies)

pip install jtoken

With accurate OpenAI-style token counting

pip install "jtoken[tiktoken]"

The core package uses only the Python standard library. Install the tiktoken extra when you want tokenizer-accurate counts for OpenAI-compatible models.

Quick start

import jtoken

data = {
    "user": "alice",
    "age": 30,
    "premium": True,
    "verified": True,
    "is_remote": False,
    "trial": False,
    "score": 9.5,
    "referral": None,
    "last_login": None,
}

text = jtoken.encode(data)
restored = jtoken.decode(text)
assert restored == data

Aliases: jtoken.dumps = encode, jtoken.loads = decode.

End-to-end document workflow

import jtoken

# Read the raw document; the context manager closes the file afterwards.
with open("hit.json", encoding="utf-8") as fh:
    raw = fh.read()

text, context = jtoken.encode_document(raw, source="elastic_hit")
restored = jtoken.decode_document(text, target="elastic_hit", context=context)

Keep the returned normalization context as a sidecar. decode_document needs it to restore the document losslessly as Mongo shell, Extended JSON, or an Elasticsearch hit envelope.

Format overview

JSON

{"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}

jtoken

name: Alice
age: 30
trues: active
falses: verified
nulls: ref

Encoding rules

  • Nested dicts flatten with dot notation.
  • True, False, and None collapse into trues:, falses:, and nulls: summary lines.
  • Ambiguous strings keep quotes on encode.
  • Multiline strings are JSON-quoted on one line.
  • Keys containing . are escaped during normalization and restored from context.
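To make the rules above concrete, here is a minimal sketch of how flattening and the summary lines could work. It is not jtoken's implementation: the names flatten, needs_quotes, and sketch_encode are hypothetical, the quoting heuristic is a guess at what "ambiguous" means, and joining several paths on a summary line with ", " is an assumption, since the format overview only shows one path per line.

```python
import json

SUMMARY_KEYS = ("trues", "falses", "nulls")


def flatten(data, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def needs_quotes(text):
    """Guess whether a bare string could be misread on decode (assumed rule)."""
    if text == "" or text != text.strip() or "\n" in text or text.startswith('"'):
        return True
    if text.lower() in ("true", "false", "null", "none"):
        return True
    try:
        float(text)  # strings such as "30" or "9.5" would look like numbers
        return True
    except ValueError:
        return False


def sketch_encode(data):
    """Encode a nested scalar dict into jtoken-like lines (illustration only)."""
    lines = []
    summary = {name: [] for name in SUMMARY_KEYS}
    for path, value in flatten(data):
        # Identity checks come first because bool is a subclass of int.
        if value is True:
            summary["trues"].append(path)
        elif value is False:
            summary["falses"].append(path)
        elif value is None:
            summary["nulls"].append(path)
        elif isinstance(value, str):
            # json.dumps keeps multiline strings on one line via "\n" escapes.
            shown = json.dumps(value) if needs_quotes(value) else value
            lines.append(f"{path}: {shown}")
        else:
            lines.append(f"{path}: {value}")
    for name in SUMMARY_KEYS:
        if summary[name]:
            # Assumption: multiple paths are joined with ", ".
            lines.append(f"{name}: {', '.join(summary[name])}")
    return "\n".join(lines)


print(sketch_encode({"user": {"name": "Alice", "zip": "0123"}, "active": True}))
```

Grouping booleans and nulls by value means each literal is written once per document instead of once per key, which is where much of the saving on flag-heavy records would come from.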

Supported scalar types

str, int, float, bool, None, and nested dict.

Limitations

  • Keys cannot contain ": " in the core codec.
  • Reserved top-level keys: nulls, trues, falses.
  • Lists are normalized into nested dicts with numeric keys before encoding.
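The list limitation can be illustrated with a small sketch of the idea: replace each list with a dict keyed by "0", "1", and so on, and remember which dotted paths were lists so they can be restored later. The helper names lists_to_dicts and dicts_to_lists are hypothetical and this is not the library's code. The sketch also assumes keys do not contain ".", which jtoken handles separately through key escaping.

```python
def lists_to_dicts(value, path="", list_paths=None):
    """Return (normalized_value, list_paths) with every list turned into a dict."""
    if list_paths is None:
        list_paths = set()
    if isinstance(value, list):
        list_paths.add(path)  # remember that this path used to be a list
        value = {str(i): item for i, item in enumerate(value)}
    if isinstance(value, dict):
        out = {}
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            out[key], _ = lists_to_dicts(item, child, list_paths)
        return out, list_paths
    return value, list_paths


def dicts_to_lists(value, list_paths, path=""):
    """Inverse of lists_to_dicts: rebuild lists at the recorded paths."""
    if isinstance(value, dict):
        out = {
            key: dicts_to_lists(item, list_paths, f"{path}.{key}" if path else key)
            for key, item in value.items()
        }
        if path in list_paths:
            return [out[str(i)] for i in range(len(out))]
        return out
    return value


original = {"tags": ["a", "b"], "empty": [], "user": {"roles": [{"name": "admin"}]}}
normalized, paths = lists_to_dicts(original)
assert dicts_to_lists(normalized, paths) == original
```

Without the recorded paths, an empty list and an empty dict would be indistinguishable after normalization, which is one reason a context object has to travel with the encoded text.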

Public API reference

Package metadata

name | type | description
--- | --- | ---
jtoken.__version__ | str | package version
jtoken.__author__ | str | author name (Hermann Samimi)

Core codec

function | signature | description
--- | --- | ---
encode | encode(data: dict) -> str | compress a nested scalar dict into jtoken text
decode | decode(text: str) -> dict | reconstruct the nested dict
dumps | alias of encode | json-style alias
loads | alias of decode | json-style alias

Normalization and denormalization

function | signature | description
--- | --- | ---
parse_input | parse_input(text, *, source="auto") | parse foreign text into Python data
normalize | normalize(data, *, source="auto", context=None) | return (normalized_dict, NormalizationContext)
denormalize | denormalize(data, *, target="python", context) | restore lists, typed values, and dialect shape
render_output | render_output(value, *, target="python") -> str | render denormalized data as text
encode_document | encode_document(raw, *, source="auto", context=None) | return (jtoken_text, NormalizationContext)
decode_document | decode_document(text, *, target="python", context) | decode jtoken text and denormalize

Token measurement

function | signature | description
--- | --- | ---
count_tokens | count_tokens(data, *, model="cl100k_base", backend="auto") -> int | count tokens for a dict or encoded jtoken string
count_text_tokens | count_text_tokens(text, *, model="cl100k_base", backend="auto") -> int | count tokens for raw text
token_savings | token_savings(data, *, model="cl100k_base", backend="auto", json_indent=2) | compare jtoken vs pretty JSON token usage

TokenSavings properties

property | type | description
--- | --- | ---
jtoken_tokens | int | token count for the jtoken representation
json_tokens | int | token count for the JSON baseline
saved | int | json_tokens - jtoken_tokens
percent | float | percent saved relative to JSON

str(stats) returns a one-line summary.
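The arithmetic behind these properties can be sketched with a small stand-in class. SavingsSketch is hypothetical and only mirrors the documented property names. Whether the real percent is expressed on a 0 to 100 scale, and how it behaves for an empty baseline, are assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavingsSketch:
    """Stand-in for the documented TokenSavings properties (illustration only)."""

    jtoken_tokens: int
    json_tokens: int

    @property
    def saved(self):
        # Documented as json_tokens - jtoken_tokens.
        return self.json_tokens - self.jtoken_tokens

    @property
    def percent(self):
        # Assumption: a 0-100 scale, and 0.0 when the JSON baseline is empty.
        if self.json_tokens == 0:
            return 0.0
        return 100.0 * self.saved / self.json_tokens

    def __str__(self):
        return (
            f"jtoken={self.jtoken_tokens} json={self.json_tokens} "
            f"saved={self.saved} ({self.percent:.1f}%)"
        )


stats = SavingsSketch(jtoken_tokens=60, json_tokens=80)
print(stats.saved, stats.percent)
```

With 60 jtoken tokens against an 80-token JSON baseline, 20 tokens are saved, which is 25 percent of the baseline under the scale assumed above.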

NormalizationContext fields

field | type | description
--- | --- | ---
source_format | str | input dialect used during normalization
target_format | Optional[str] | optional output hint
typed_values | dict[str, str] | dotted paths with BSON-like type markers
lists | set[str] | dotted paths that were lists before flattening
dotted_keys | dict[str, str] | escaped keys that originally contained .
elastic | Optional[dict] | Elasticsearch envelope metadata

Methods: to_dict(), from_dict(data).
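To show why a to_dict() and from_dict(data) pair is useful for a sidecar file, the sketch below mirrors the documented fields in a hypothetical ContextSketch dataclass and round-trips it through JSON. It is not the library's class. The conversion of the lists set to a sorted list is an assumption, made because JSON has no set type, and the "ObjectId" marker is only an example value.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ContextSketch:
    """Hypothetical mirror of the documented NormalizationContext fields."""

    source_format: str = "auto"
    target_format: Optional[str] = None
    typed_values: Dict[str, str] = field(default_factory=dict)
    lists: Set[str] = field(default_factory=set)
    dotted_keys: Dict[str, str] = field(default_factory=dict)
    elastic: Optional[dict] = None

    def to_dict(self):
        """Return a JSON-serializable dict; the set becomes a sorted list."""
        return {
            "source_format": self.source_format,
            "target_format": self.target_format,
            "typed_values": dict(self.typed_values),
            "lists": sorted(self.lists),
            "dotted_keys": dict(self.dotted_keys),
            "elastic": self.elastic,
        }

    @classmethod
    def from_dict(cls, data):
        """Rebuild the context from the output of to_dict()."""
        return cls(
            source_format=data["source_format"],
            target_format=data.get("target_format"),
            typed_values=dict(data.get("typed_values", {})),
            lists=set(data.get("lists", [])),
            dotted_keys=dict(data.get("dotted_keys", {})),
            elastic=data.get("elastic"),
        )


ctx = ContextSketch(
    source_format="mongo_shell",
    typed_values={"_id": "ObjectId"},  # example marker, not a confirmed value
    lists={"tags"},
)
sidecar = json.dumps(ctx.to_dict())  # what a context file could contain
assert ContextSketch.from_dict(json.loads(sidecar)) == ctx
```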

Format enums

InputFormat: auto, json, python, mongo_extended, mongo_shell, elastic_hit, elastic_source

OutputFormat: python, json, mongo_extended, mongo_shell, elastic_hit, elastic_source

Exceptions

exception | base | when raised
--- | --- | ---
JPackError | Exception | base library error
JPackEncodeError | JPackError | encoding fails
JPackDecodeError | JPackError | decoding fails
NormalizationError | JPackError | normalization fails
DenormalizationError | JPackError | denormalization fails
TokenCountError | JPackError | token counting fails

Token counting

stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)

backend | behavior
--- | ---
auto | use tiktoken when installed, otherwise estimate
tiktoken | require tiktoken
estimate | simple character heuristic

json_indent=2 compares against prompt-style pretty JSON. Use json_indent=None for compact JSON.
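The effect of the JSON baseline and of a character-based estimate can be sketched with the standard library alone. The estimate_tokens helper and its four-characters-per-token ratio are assumptions for illustration. The heuristic jtoken actually uses for backend="estimate" is not specified here, and the real json_indent=None baseline may use different separators.

```python
import json
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate: character count divided by an assumed ratio."""
    return math.ceil(len(text) / chars_per_token)


doc = {"name": "Alice", "age": 30, "address": {"city": "Paris", "zip": "75001"}}

pretty = json.dumps(doc, indent=2)      # baseline comparable to json_indent=2
compact = json.dumps(doc, indent=None)  # baseline comparable to json_indent=None

print("pretty :", len(pretty), "chars, about", estimate_tokens(pretty), "tokens")
print("compact:", len(compact), "chars, about", estimate_tokens(compact), "tokens")
```

Pretty-printed JSON spends extra characters on newlines and indentation, so savings measured against json_indent=2 will generally look larger than savings measured against compact JSON. Stating which baseline was used keeps reported percentages comparable.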

CLI

jtoken encode --input-format mongo_shell -f doc.json --context-out doc.ctx.json
jtoken decode --output-format mongo_shell -f doc.jtoken --context-in doc.ctx.json
jtoken stats --input-format json -f doc.json --model gpt-4o --backend tiktoken
jtoken count --input-format json -f doc.json --backend estimate
python -m jtoken encode

Common flags:

  • -f/--file
  • --input-format
  • --output-format
  • --context-out
  • --context-in
  • --model
  • --backend

License

MIT — Copyright (c) 2026 Hermann Samimi
