
DISTILL

Data Intelligent Structure Token-efficient Interchange for LLMs

A Python package that compresses JSON data for LLM consumption while maintaining semantic readability. Unlike binary compression (gzip, LZ77), DISTILL preserves meaning while reducing token count by 60-85%.

Why DISTILL?

When sending data to LLMs, every token counts:

  • Cost: Fewer tokens = lower API costs
  • Context: Fit more data in the context window
  • Speed: Fewer tokens = faster responses

DISTILL compresses JSON while keeping it LLM-readable - no binary blobs, no unreadable encodings.

Best For: DISTILL achieves maximum compression (60-85%) on repetitive, structured data - arrays of objects with repeated field names and values (logs, events, API responses, database records). For mostly unique data with no repetition, compression benefits are minimal (~20-30%).

Installation

pip install distill-json

# With accurate token counting (recommended)
pip install "distill-json[tiktoken]"  # quotes needed in some shells (e.g. zsh)

Quick Start

from distill import compress, decompress

# Your JSON data
data = [
    {"id": 1, "name": "Alice", "role": "developer", "team": "backend", "remote": True},
    {"id": 2, "name": "Bob", "role": "developer", "team": "backend", "remote": True},
    {"id": 3, "name": "Charlie", "role": "designer", "team": "frontend", "remote": False},
    {"id": 4, "name": "Diana", "role": "developer", "team": "backend", "remote": True},
]

# Compress
result = compress(data)

print(result["compressed"])
print(f"Reduced by {result['meta']['reduction_percent']}%")
# Output: Reduced by 72.5%

# Decompress back to original (100% lossless)
original = decompress(result["compressed"])
assert original == data  # Exact match guaranteed

How It Works

DISTILL uses a 3-layer compression architecture:

Input JSON
    ↓
Layer 1: Schema Extraction (objects → tuples + field names)
    ↓
Layer 2: Dictionary Encoding (frequent values → single-letter codes a-z)
    ↓
Layer 3: Equivalence Partitioning (repeated tuples → #N references)
    ↓
Output: Compressed JSON

Layer 1: Schema Extraction

Extracts field names into a schema, converts objects to value tuples:

Input:  [{"name": "Alice", "role": "dev"}, {"name": "Bob", "role": "dev"}]
Output: schema=["name", "role"], tuples=[["Alice", "dev"], ["Bob", "dev"]]

Benefit: Field names stored once instead of repeated per object.
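A minimal sketch of this layer, for illustration only — the sorted field order, handling of missing keys, and function name are assumptions, not the package's actual implementation:

```python
def extract_schema(records):
    """Illustrative schema extraction: field names are stored once, and
    each object is reduced to a value tuple in a fixed (sorted) order."""
    schema = sorted({key for record in records for key in record})
    tuples = [[record.get(key) for key in schema] for record in records]
    return schema, tuples

schema, tuples = extract_schema(
    [{"name": "Alice", "role": "dev"}, {"name": "Bob", "role": "dev"}]
)
# schema -> ["name", "role"]; tuples -> [["Alice", "dev"], ["Bob", "dev"]]
```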

Layer 2: Dictionary Encoding

Maps frequent values to single-letter codes (a-z, max 26):

Frequency analysis: "dev" appears 100x, "home" appears 80x
Dictionary: {"a": "dev", "b": "home", ...}
Encoded: "dev" → "a", "home" → "b"

Benefit: Long repeated strings become single characters.
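This layer can be sketched as a frequency count followed by code assignment. The threshold and tie-breaking here are assumptions; the package's real rules may differ:

```python
from collections import Counter
import string

def build_value_dict(values, max_codes=26):
    """Illustrative dictionary encoding: the most frequent values that
    appear more than once get single-letter codes a-z."""
    counts = Counter(values)
    frequent = [v for v, n in counts.most_common(max_codes) if n > 1]
    return dict(zip(string.ascii_lowercase, frequent))

codes = build_value_dict(["dev"] * 100 + ["home"] * 80 + ["rare"])
# codes -> {"a": "dev", "b": "home"}
```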

Layer 3: Equivalence Partitioning

Groups identical encoded tuples into references:

Input:  ["abc", "abc", "abc", "abd"]
Output: equiv={"#0": "abc"}, data=["#0", "#0", "#0", "abd"]

Benefit: Repeated records stored once with short references.
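The grouping step above can be sketched as follows (an illustration of the technique, not the package's internals):

```python
from collections import Counter

def partition_tuples(encoded):
    """Illustrative equivalence partitioning: encoded tuples that occur
    more than once are replaced by short #N references."""
    counts = Counter(encoded)
    equiv = {}
    for value, n in counts.items():
        if n > 1:
            equiv[f"#{len(equiv)}"] = value
    reverse = {value: ref for ref, value in equiv.items()}
    data = [reverse.get(value, value) for value in encoded]
    return equiv, data

equiv, data = partition_tuples(["abc", "abc", "abc", "abd"])
# equiv -> {"#0": "abc"}; data -> ["#0", "#0", "#0", "abd"]
```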

Output Format

DISTILL produces a JSON structure with a $ metadata section:

{
  "$": {
    "schema": ["name", "role", "team"],
    "dict": {"a": "\"developer\"", "b": "\"backend\"", "c": "\"frontend\""},
    "equiv": {"#0": "abc"}
  },
  "data": ["#0", "#0", "acd", "#0"]
}

Format Breakdown

Key        Purpose                                      Example
$.schema   Field names in sorted order                  ["name", "role", "team"]
$.dict     Value → code mapping                         {"a": "\"click\"", "b": "\"home\""}
$.equiv    Tuple → reference mapping                    {"#0": "abc", "#1": "abd"}
$._bare    Original was a bare list (not wrapped)       true
data       Compressed records (or original key name)    ["#0", "#0", "abc"]
_extra     Preserved non-array data from the original   {"meta": {"count": 100}}

API Reference

compress(data, level="auto")

Compress JSON data for LLM consumption.

Parameters:

  • data: JSON-compatible data (dict, list, or JSON string)
  • level: Compression level (kept for API compatibility, uses optimal settings)

Returns:

{
    "compressed": "...",  # Compressed JSON string
    "meta": {
        "method": "schema+dict+equiv",
        "original_tokens": 1520,
        "compressed_tokens": 228,
        "reduction_percent": 85.0,
        "tokens_saved": 1292,
        "schema_fields": 5,
        "dict_codes": 12,
        "equiv_classes": 3,
        "data_key": "events",
        "has_extra": False
    }
}

decompress(compressed)

Reconstruct original JSON from compressed output. 100% lossless guaranteed.

Parameters:

  • compressed: DISTILL compressed string or result dict from compress()

Returns:

  • Original JSON-compatible data structure (exact match)

# Both work:
original = decompress(result["compressed"])
original = decompress(result)  # Pass whole result dict

analyze(data)

Analyze data for compression potential without compressing.

from distill import analyze

analysis = analyze(data)
print(analysis)
# {
#     "original_tokens": 1520,
#     "compressible": True,
#     "schema_fields": 5,
#     "total_tuples": 100,
#     "unique_values": 45,
#     "repeated_tuples": 12,
#     "estimated_reduction": 75,
#     "data_key": "events"
# }

Utility Functions

from distill import compress_to_string, is_distill_format
from distill.core.tokenizer import count_tokens

# Get just the compressed string (no metadata)
compressed = compress_to_string(data)

# Check if text is DISTILL format
if is_distill_format(text):
    original = decompress(text)

# Count tokens
tokens = count_tokens(json_string)

Performance Metrics

Dataset             Items    Original tokens  Compressed tokens  Reduction  Ratio
Simple repetitive     100        701               180             74.3%    3.9x
Events (3 fields)   1,000     10,001             1,535             84.7%    6.5x
Large dataset      10,000    100,001            15,092             84.9%    6.6x
Log entries           500      5,801             1,372             76.3%    4.2x
API responses         200      2,201               817             62.9%    2.7x
Partial repetition    100          -                 -             57.6%   ~2.4x
Mostly unique          50          -                 -             28.8%   ~1.4x
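The Reduction and Ratio columns follow directly from the token counts; a quick sketch of the arithmetic:

```python
def reduction_percent(original_tokens, compressed_tokens):
    # Percentage of tokens removed, as reported in the table above.
    return round((1 - compressed_tokens / original_tokens) * 100, 1)

def compression_ratio(original_tokens, compressed_tokens):
    # How many times smaller the compressed output is.
    return round(original_tokens / compressed_tokens, 1)

# Events (3 fields): 10,001 tokens -> 1,535 tokens
# reduction_percent(10001, 1535) -> 84.7
# compression_ratio(10001, 1535) -> 6.5
```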

Compression Factor Analysis

For highly repetitive data (100 identical records):

Layer                     Contribution  Cumulative
Schema Extraction         ~70%          70%
Dictionary Encoding       ~10%          80%
Equivalence Partitioning  ~7%           87%

Type Preservation

DISTILL guarantees 100% lossless roundtrip with exact type preservation:

# These remain distinct after compression/decompression:
{"value": "123"}   # String "123"
{"value": 123}     # Integer 123
{"value": "null"}  # String "null"
{"value": None}    # Actual null
{"value": "true"}  # String "true"
{"value": True}    # Boolean true

Configuration

from distill.config import with_config

# Customize compression settings
with with_config(
    max_depth=100,           # Max nesting depth (default: 50)
    dict_min_frequency=2,    # Min occurrences for dictionary (default: 1)
    min_equiv_count=2,       # Min occurrences for equivalence (default: 2)
    fallback_on_increase=True  # Return original if compression increases size
):
    result = compress(data)

Error Handling

from distill.exceptions import (
    DistillError,           # Base exception
    CompressionError,       # Compression failed
    DecompressionError,     # Decompression failed
    ValidationError,        # Invalid data (NaN, Inf, sets)
    InvalidInputError       # None or empty input
)

try:
    result = compress(data)
except InvalidInputError as e:
    print(f"Bad input: {e}")
except ValidationError as e:
    print(f"Invalid data: {e}")
except CompressionError as e:
    print(f"Compression failed: {e}")

Requirements

  • Python 3.9+
  • Optional: tiktoken for accurate token counting

License

MIT License

Contributing

Contributions welcome! Please read our contributing guidelines.

