
DISTILL

Data Intelligent Structure Token-efficient Interchange for LLMs

A Python package that compresses JSON data for LLM consumption while maintaining semantic readability. Unlike binary compression (gzip, LZ77), DISTILL preserves meaning while reducing token count by 60-85%.

Why DISTILL?

When sending data to LLMs, every token counts:

  • Cost: Fewer tokens = lower API costs
  • Context: Fit more data in the context window
  • Speed: Fewer tokens = faster responses

DISTILL compresses JSON while keeping it LLM-readable - no binary blobs, no unreadable encodings.

Best For: DISTILL achieves maximum compression (60-85%) on repetitive, structured data - arrays of objects with repeated field names and values (logs, events, API responses, database records). For mostly unique data with no repetition, compression benefits are minimal (~20-30%).

Installation

pip install distill-json

# With accurate token counting (recommended)
pip install "distill-json[tiktoken]"  # quotes needed in some shells (e.g. zsh)

Quick Start

from distill import compress, decompress

# Your JSON data
data = [
    {"id": 1, "name": "Alice", "role": "developer", "team": "backend", "remote": True},
    {"id": 2, "name": "Bob", "role": "developer", "team": "backend", "remote": True},
    {"id": 3, "name": "Charlie", "role": "designer", "team": "frontend", "remote": False},
    {"id": 4, "name": "Diana", "role": "developer", "team": "backend", "remote": True},
]

# Compress
result = compress(data)

print(result["compressed"])
print(f"Reduced by {result['meta']['reduction_percent']}%")
# Output: Reduced by 72.5%

# Decompress back to original (100% lossless)
original = decompress(result["compressed"])
assert original == data  # Exact match guaranteed

How It Works

DISTILL uses a 3-layer compression architecture:

Input JSON
    ↓
Layer 1: Schema Extraction (objects → tuples + field names)
    ↓
Layer 2: Dictionary Encoding (frequent values → single-letter codes a-z)
    ↓
Layer 3: Equivalence Partitioning (repeated tuples → #N references)
    ↓
Output: Compressed JSON

Layer 1: Schema Extraction

Extracts field names into a schema, converts objects to value tuples:

Input:  [{"name": "Alice", "role": "dev"}, {"name": "Bob", "role": "dev"}]
Output: schema=["name", "role"], tuples=[["Alice", "dev"], ["Bob", "dev"]]

Benefit: Field names stored once instead of repeated per object.
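A minimal sketch of this layer, for illustration only — the sorted field order, handling of missing keys, and function name are assumptions, not the package's actual implementation:

```python
def extract_schema(records):
    """Illustrative schema extraction: field names are stored once, and
    each object is reduced to a value tuple in a fixed (sorted) order."""
    schema = sorted({key for record in records for key in record})
    tuples = [[record.get(key) for key in schema] for record in records]
    return schema, tuples

schema, tuples = extract_schema(
    [{"name": "Alice", "role": "dev"}, {"name": "Bob", "role": "dev"}]
)
# schema -> ["name", "role"]; tuples -> [["Alice", "dev"], ["Bob", "dev"]]
```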

Layer 2: Dictionary Encoding

Maps frequent values to single-letter codes (a-z, max 26):

Frequency analysis: "dev" appears 100x, "home" appears 80x
Dictionary: {"a": "dev", "b": "home", ...}
Encoded: "dev" → "a", "home" → "b"

Benefit: Long repeated strings become single characters.
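This layer can be sketched as a frequency count followed by code assignment. The threshold and tie-breaking here are assumptions; the package's real rules may differ:

```python
from collections import Counter
import string

def build_value_dict(values, max_codes=26):
    """Illustrative dictionary encoding: the most frequent values that
    appear more than once get single-letter codes a-z."""
    counts = Counter(values)
    frequent = [v for v, n in counts.most_common(max_codes) if n > 1]
    return dict(zip(string.ascii_lowercase, frequent))

codes = build_value_dict(["dev"] * 100 + ["home"] * 80 + ["rare"])
# codes -> {"a": "dev", "b": "home"}
```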

Layer 3: Equivalence Partitioning

Groups identical encoded tuples into references:

Input:  ["abc", "abc", "abc", "abd"]
Output: equiv={"#0": "abc"}, data=["#0", "#0", "#0", "abd"]

Benefit: Repeated records stored once with short references.
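The grouping step above can be sketched as follows (an illustration of the technique, not the package's internals):

```python
from collections import Counter

def partition_tuples(encoded):
    """Illustrative equivalence partitioning: encoded tuples that occur
    more than once are replaced by short #N references."""
    counts = Counter(encoded)
    equiv = {}
    for value, n in counts.items():
        if n > 1:
            equiv[f"#{len(equiv)}"] = value
    reverse = {value: ref for ref, value in equiv.items()}
    data = [reverse.get(value, value) for value in encoded]
    return equiv, data

equiv, data = partition_tuples(["abc", "abc", "abc", "abd"])
# equiv -> {"#0": "abc"}; data -> ["#0", "#0", "#0", "abd"]
```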

Output Format

DISTILL produces a JSON structure with a $ metadata section:

{
  "$": {
    "schema": ["name", "role", "team"],
    "dict": {"a": "\"developer\"", "b": "\"backend\"", "c": "\"frontend\""},
    "equiv": {"#0": "abc"}
  },
  "data": ["#0", "#0", "acd", "#0"]
}

Format Breakdown

Key        Purpose                                      Example
$.schema   Field names in sorted order                  ["name", "role", "team"]
$.dict     Value → code mapping                         {"a": "\"click\"", "b": "\"home\""}
$.equiv    Tuple → reference mapping                    {"#0": "abc", "#1": "abd"}
$._bare    Original was a bare list (not wrapped)       true
data       Compressed records (or original key name)    ["#0", "#0", "abc"]
_extra     Preserved non-array data from the original   {"meta": {"count": 100}}

API Reference

compress(data, level="auto")

Compress JSON data for LLM consumption.

Parameters:

  • data: JSON-compatible data (dict, list, or JSON string)
  • level: Compression level (kept for API compatibility, uses optimal settings)

Returns:

{
    "compressed": "...",  # Compressed JSON string
    "meta": {
        "method": "schema+dict+equiv",
        "original_tokens": 1520,
        "compressed_tokens": 228,
        "reduction_percent": 85.0,
        "tokens_saved": 1292,
        "schema_fields": 5,
        "dict_codes": 12,
        "equiv_classes": 3,
        "data_key": "events",
        "has_extra": False
    }
}

decompress(compressed)

Reconstruct original JSON from compressed output. 100% lossless guaranteed.

Parameters:

  • compressed: DISTILL compressed string or result dict from compress()

Returns:

  • Original JSON-compatible data structure (exact match)

# Both work:
original = decompress(result["compressed"])
original = decompress(result)  # Pass whole result dict

analyze(data)

Analyze data for compression potential without compressing.

from distill import analyze

analysis = analyze(data)
print(analysis)
# {
#     "original_tokens": 1520,
#     "compressible": True,
#     "schema_fields": 5,
#     "total_tuples": 100,
#     "unique_values": 45,
#     "repeated_tuples": 12,
#     "estimated_reduction": 75,
#     "data_key": "events"
# }

Utility Functions

from distill import compress_to_string, is_distill_format
from distill.core.tokenizer import count_tokens

# Get just the compressed string (no metadata)
compressed = compress_to_string(data)

# Check if text is DISTILL format
if is_distill_format(text):
    original = decompress(text)

# Count tokens
tokens = count_tokens(json_string)

Performance Metrics

Dataset             Items    Original tokens  Compressed tokens  Reduction  Ratio
Simple repetitive     100        701               180             74.3%    3.9x
Events (3 fields)   1,000     10,001             1,535             84.7%    6.5x
Large dataset      10,000    100,001            15,092             84.9%    6.6x
Log entries           500      5,801             1,372             76.3%    4.2x
API responses         200      2,201               817             62.9%    2.7x
Partial repetition    100          -                 -             57.6%   ~2.4x
Mostly unique          50          -                 -             28.8%   ~1.4x
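The Reduction and Ratio columns follow directly from the token counts; a quick sketch of the arithmetic:

```python
def reduction_percent(original_tokens, compressed_tokens):
    # Percentage of tokens removed, as reported in the table above.
    return round((1 - compressed_tokens / original_tokens) * 100, 1)

def compression_ratio(original_tokens, compressed_tokens):
    # How many times smaller the compressed output is.
    return round(original_tokens / compressed_tokens, 1)

# Events (3 fields): 10,001 tokens -> 1,535 tokens
# reduction_percent(10001, 1535) -> 84.7
# compression_ratio(10001, 1535) -> 6.5
```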

Compression Factor Analysis

For highly repetitive data (100 identical records):

Layer                     Contribution  Cumulative
Schema Extraction         ~70%          70%
Dictionary Encoding       ~10%          80%
Equivalence Partitioning  ~7%           87%

Type Preservation

DISTILL guarantees 100% lossless roundtrip with exact type preservation:

# These remain distinct after compression/decompression:
{"value": "123"}   # String "123"
{"value": 123}     # Integer 123
{"value": "null"}  # String "null"
{"value": None}    # Actual null
{"value": "true"}  # String "true"
{"value": True}    # Boolean true

Configuration

from distill.config import with_config

# Customize compression settings
with with_config(
    max_depth=100,           # Max nesting depth (default: 50)
    dict_min_frequency=2,    # Min occurrences for dictionary (default: 1)
    min_equiv_count=2,       # Min occurrences for equivalence (default: 2)
    fallback_on_increase=True  # Return original if compression increases size
):
    result = compress(data)

Error Handling

from distill.exceptions import (
    DistillError,           # Base exception
    CompressionError,       # Compression failed
    DecompressionError,     # Decompression failed
    ValidationError,        # Invalid data (NaN, Inf, sets)
    InvalidInputError       # None or empty input
)

try:
    result = compress(data)
except InvalidInputError as e:
    print(f"Bad input: {e}")
except ValidationError as e:
    print(f"Invalid data: {e}")
except CompressionError as e:
    print(f"Compression failed: {e}")

Requirements

  • Python 3.9+
  • Optional: tiktoken for accurate token counting

License

MIT License

Contributing

Contributions welcome! Please read our contributing guidelines.

