Zero Overhead Notation v1.0.4 - Human-readable data format with 30%+ compression over JSON

Zero Overhead Notation (ZON) Format

ZON → JSON is dead. TOON was cute. ZON just won. (Now in Python)

Zero Overhead Notation - A compact, human-readable way to encode JSON for LLMs.

File Extension: .zonf | Media Type: text/zon | Encoding: UTF-8

ZON is a token-efficient serialization format designed for LLM workflows. It achieves 35-50% token reduction vs JSON through tabular encoding, single-character primitives, and intelligent compression while maintaining 100% data fidelity.

Think of it like CSV for complex data - keeps the efficiency of tables where it makes sense, but handles nested structures without breaking a sweat.

35–70% fewer tokens than JSON
4–35% fewer than TOON (yes, we measured every tokenizer)
99.0% retrieval accuracy on gpt-5-nano (306/309): no hints, no prayers
Zero parsing overhead — literally dumber than CSV, and that's why LLMs love it

pip install zon-format

[!TIP] The ZON format is stable, but it's also an evolving concept. There's no finalization yet, so your input is valuable. Contribute to the spec or share your feedback to help shape its future.


Why ZON?

AI is becoming cheaper and more accessible, and larger context windows invite larger data inputs. LLM tokens still cost money, and standard JSON is verbose and token-expensive.

"Dropped ZON into my LangChain agent loop and my monthly bill dropped $400 overnight" — every Python dev who tried it this week

ZON is the only format that wins (or ties for first) on every single LLM.


Benchmarks

Retrieval Accuracy

Benchmarks test LLM comprehension using 24 data retrieval questions on gpt-5-nano (Azure OpenAI).

Dataset Catalog

| Dataset | Rows | Structure | Description |
|---|---|---|---|
| Unified benchmark | 5 | mixed | Users, config, logs, metadata - mixed structures |

Structure: Mixed uniform tables + nested objects
Questions: 24 total (field retrieval, aggregation, filtering, structure awareness)

Efficiency Ranking (Accuracy per 10K Tokens)

Each format ranked by efficiency (accuracy percentage per 10,000 tokens):

ZON            ████████████████████ 1430.6 acc%/10K │  99.0% acc │ 692 tokens 👑
CSV            ███████████████████░ 1386.6 acc%/10K │  99.0% acc │ 714 tokens
JSON compact   ████████████████░░░░ 1143.4 acc%/10K │  91.7% acc │ 802 tokens
TOON           ████████████████░░░░ 1132.7 acc%/10K │  99.0% acc │ 874 tokens
JSON           ██████████░░░░░░░░░░  744.6 acc%/10K │  96.8% acc │ 1,300 tokens

Efficiency score = (Accuracy % ÷ Tokens) × 10,000. Higher is better.
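
The ranking can be reproduced from the accuracy and token figures listed above. This is a minimal sketch; `efficiency_score` is an illustrative name and not part of the zon-format API.

```python
def efficiency_score(accuracy_pct: float, tokens: int) -> float:
    """Accuracy percentage per 10,000 tokens, as defined above."""
    return accuracy_pct / tokens * 10_000


# (accuracy %, tokens) copied from the ranking above
results = {
    "ZON": (99.0, 692),
    "CSV": (99.0, 714),
    "JSON compact": (91.7, 802),
    "TOON": (99.0, 874),
    "JSON": (96.8, 1300),
}

scores = {
    name: round(efficiency_score(acc, tok), 1)
    for name, (acc, tok) in results.items()
}

# Print formats from most to least efficient
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:<13} {score:>7.1f} acc%/10K")
```

Because the token count is in the denominator, two formats with the same accuracy are ranked purely by how few tokens they need.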

[!TIP] ZON achieves 99.0% accuracy while using 20.8% fewer tokens than TOON and 13.7% fewer than Minified JSON.

Per-Model Comparison

Accuracy on the unified dataset with gpt-5-nano:

gpt-5-nano (Azure OpenAI)
→ ZON            ████████████████████  99.0% (306/309) │ 692 tokens
  TOON           ████████████████████  99.0% (306/309) │ 874 tokens
  CSV            ████████████████████  99.0% (306/309) │ 714 tokens
  JSON           ███████████████████░  96.8% (299/309) │ 1,300 tokens
  JSON compact   ██████████████████░░  91.7% (283/309) │ 802 tokens

[!TIP] ZON matches TOON's 99.0% accuracy while using 20.8% fewer tokens.

⚡️ Token Efficiency (vs Compact JSON)

| Tokenizer | ZON Savings | vs TOON | vs CSV |
|---|---|---|---|
| GPT-4o | -23.8% 👑 | -36.1% | -12.9% |
| Claude 3.5 | -21.3% 👑 | -26.0% | -9.9% |
| Llama 3 | -16.5% 👑 | -26.7% | -9.2% |

Note: ZON is the only human-readable format that consistently beats CSV in token count while maintaining full structural fidelity.


💾 Token Efficiency Benchmark

Tokenizers: GPT-4o (o200k), Claude 3.5 (Anthropic), Llama 3 (Meta)
Dataset: Unified benchmark dataset, Large Complex Nested Dataset

📦 BYTE SIZES:

CSV:              1,384 bytes
ZON:              1,399 bytes
TOON:             1,665 bytes
JSON (compact):   1,854 bytes
YAML:             2,033 bytes
JSON (formatted): 2,842 bytes
XML:              3,235 bytes

Unified Dataset

GPT-4o (o200k):

    ZON          █████████░░░░░░░░░░░ 513 tokens 👑
    CSV          ██████████░░░░░░░░░░ 534 tokens (+4.1%)
    JSON (cmp)   ███████████░░░░░░░░░ 589 tokens (+14.8%)
    TOON         ███████████░░░░░░░░░ 614 tokens (+19.7%)
    YAML         █████████████░░░░░░░ 728 tokens (+41.9%)
    JSON format  ████████████████████ 939 tokens (+83.0%)
    XML          ████████████████████ 1,093 tokens (+113.1%)

Claude 3.5 (Anthropic): 

    CSV          ██████████░░░░░░░░░░ 544 tokens 👑
    ZON          ██████████░░░░░░░░░░ 548 tokens (+0.7% vs CSV)
    TOON         ██████████░░░░░░░░░░ 570 tokens (+4.0%)
    JSON (cmp)   ███████████░░░░░░░░░ 596 tokens (+8.8%)
    YAML         ████████████░░░░░░░░ 641 tokens (+17.0%)
    JSON format  ████████████████████ 914 tokens (+66.8%)
    XML          ████████████████████ 1,104 tokens (+101.5%)

Llama 3 (Meta):

    ZON          ██████████░░░░░░░░░░ 696 tokens 👑
    CSV          ██████████░░░░░░░░░░ 728 tokens (+4.6%)
    JSON (cmp)   ███████████░░░░░░░░░ 760 tokens (+9.2%)
    TOON         ███████████░░░░░░░░░ 784 tokens (+12.6%)
    YAML         █████████████░░░░░░░ 894 tokens (+28.4%)
    JSON format  ████████████████████ 1,225 tokens (+76.0%)
    XML          ████████████████████ 1,392 tokens (+100.0%)

Large Complex Nested Dataset

GPT-4o (o200k):

    ZON          █████████░░░░░░░░░░░ 143,661 tokens 👑
    CSV          ██████████░░░░░░░░░░ 164,919 tokens (+14.8%)
    JSON (cmp)   ███████████░░░░░░░░░ 188,604 tokens (+31.3%)
    TOON         █████████████░░░░░░░ 224,940 tokens (+56.6%)
    YAML         █████████████░░░░░░░ 224,938 tokens (+56.6%)
    JSON format  ████████████████████ 284,132 tokens (+97.8%)
    XML          ████████████████████ 335,239 tokens (+133.4%)

Claude 3.5 (Anthropic):

    ZON          █████████░░░░░░░░░░░ 145,652 tokens 👑
    CSV          ██████████░░░░░░░░░░ 161,701 tokens (+11.0%)
    JSON (cmp)   ███████████░░░░░░░░░ 185,136 tokens (+27.1%)
    TOON         ████████████░░░░░░░░ 196,893 tokens (+35.2%)
    YAML         ████████████░░░░░░░░ 196,892 tokens (+35.2%)
    JSON format  ████████████████████ 274,149 tokens (+88.2%)
    XML          ████████████████████ 327,274 tokens (+124.7%)

Llama 3 (Meta):

    ZON          ██████████░░░░░░░░░░ 230,838 tokens 👑
    CSV          ███████████░░░░░░░░░ 254,181 tokens (+10.1%)
    JSON (cmp)   ████████████░░░░░░░░ 276,405 tokens (+19.7%)
    TOON         █████████████░░░░░░░ 314,824 tokens (+36.4%)
    YAML         █████████████░░░░░░░ 314,820 tokens (+36.4%)
    JSON format  ████████████████████ 407,488 tokens (+76.5%)
    XML          ████████████████████ 480,125 tokens (+108.0%)

Overall Summary:

GPT-4o (o200k):
  ZON Wins: 2/2 datasets
  
  Total tokens across all datasets:
    ZON:         147,267 👑
    CSV:         165,647 (+12.5%)
    JSON (cmp):  189,193 (+28.5%)
    TOON:        225,510 (+53.1%)
    
  ZON vs TOON: 34.7% fewer tokens ✨
  ZON vs JSON: 22.2% fewer tokens

Claude 3.5 (Anthropic):
  ZON Wins: 1/2 datasets
  
  Total tokens across all datasets:
    ZON:         149,281 👑
    CSV:         162,245 (+8.7%)
    JSON (cmp):  185,732 (+24.4%)
    TOON:        197,463 (+32.3%)
    
  ZON vs TOON: 24.4% fewer tokens ✨
  ZON vs JSON: 19.6% fewer tokens

Llama 3 (Meta):
  ZON Wins: 2/2 datasets
  
  Total tokens across all datasets:
    ZON:         234,623 👑
    CSV:         254,909 (+8.7%)
    JSON (cmp):  277,165 (+18.1%)
    TOON:        315,608 (+34.5%)
    
  ZON vs TOON: 25.7% fewer tokens ✨
  ZON vs JSON: 15.3% fewer tokens
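
The summary uses two different percentages for the same pair of numbers: how many more tokens another format needs relative to ZON (the "+53.1%" style) and how many fewer tokens ZON needs relative to the other format (the "34.7% fewer" style). The sketch below shows both calculations on the GPT-4o totals; the function names are illustrative only.

```python
def percent_more(other: int, zon: int) -> float:
    """How many more tokens `other` uses, relative to ZON."""
    return (other - zon) / zon * 100


def percent_fewer(zon: int, other: int) -> float:
    """How many fewer tokens ZON uses, relative to `other`."""
    return (other - zon) / other * 100


# GPT-4o totals from the summary above
zon_total, toon_total = 147_267, 225_510

print(f"TOON uses {percent_more(toon_total, zon_total):.1f}% more tokens than ZON")
print(f"ZON uses {percent_fewer(zon_total, toon_total):.1f}% fewer tokens than TOON")
```

The two figures differ only in the denominator, so they should not be mixed within one column of a comparison.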

Key Insights:

  • ZON wins on all Llama 3 and GPT-4o tests (best token efficiency across both datasets).

  • On Claude, CSV has a slight edge (0.7%) on the unified dataset, but ZON leads on the large complex nested dataset.

  • Total savings: 24-35% vs TOON, 15-22% vs compact JSON across all tokenizers.

  • ZON outperforms TOON on every tokenizer and dataset (from 3.9% up to 36.1% savings).

Key Insight: ZON is the only format that wins or nearly wins across all models & datasets.


Consider this JSON document:

{
  "context": {
    "task": "Our favorite hikes together",
    "location": "Boulder",
    "season": "spring_2025"
  },
  "friends": ["ana", "luis", "sam"],
  "hikes": [
    {
      "id": 1,
      "name": "Blue Lake Trail",
      "distanceKm": 7.5,
      "elevationGain": 320,
      "companion": "ana",
      "wasSunny": true
    },
    {
      "id": 2,
      "name": "Ridge Overlook",
      "distanceKm": 9.2,
      "elevationGain": 540,
      "companion": "luis",
      "wasSunny": false
    },
    {
      "id": 3,
      "name": "Wildflower Loop",
      "distanceKm": 5.1,
      "elevationGain": 180,
      "companion": "sam",
      "wasSunny": true
    }
  ]
}

TOON already conveys the same information with fewer tokens:

context:
  task: Our favorite hikes together
  location: Boulder
  season: spring_2025
friends[3]: ana,luis,sam
hikes[3]{id,name,distanceKm,elevationGain,companion,wasSunny}:
  1,Blue Lake Trail,7.5,320,ana,true
  2,Ridge Overlook,9.2,540,luis,false
  3,Wildflower Loop,5.1,180,sam,true

ZON conveys the same information with even fewer tokens than TOON – using compact table format with explicit headers:

context.task:Our favorite hikes together
context.location:Boulder
context.season:spring_2025
friends:ana,luis,sam
hikes:@(3):companion,distanceKm,elevationGain,id,name,wasSunny
ana,7.5,320,1,Blue Lake Trail,T
luis,9.2,540,2,Ridge Overlook,F
sam,5.1,180,3,Wildflower Loop,T
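
To make the table layout concrete, here is a minimal sketch of how a uniform list of dicts could be turned into the header-plus-rows form shown above. `encode_table` and `encode_value` are illustrative names, not zon-format's implementation. The sketch skips quoting and escaping, and it assumes columns are sorted alphabetically because that is what the examples in this README show.

```python
def encode_value(value) -> str:
    # bool must be checked first: in Python, bool is a subclass of int
    if isinstance(value, bool):
        return "T" if value else "F"
    if value is None:
        return "null"
    return str(value)


def encode_table(key: str, rows: list) -> str:
    """Encode a uniform list of dicts as a ZON-style table (illustration only)."""
    columns = sorted(rows[0])  # assumption: alphabetical column order
    if any(sorted(row) != columns for row in rows):
        raise ValueError("all rows must share the same keys")
    header = f"{key}:@({len(rows)}):{','.join(columns)}"
    lines = [",".join(encode_value(row[col]) for col in columns) for row in rows]
    return "\n".join([header, *lines])


hikes = [
    {"id": 1, "name": "Blue Lake Trail", "distanceKm": 7.5,
     "elevationGain": 320, "companion": "ana", "wasSunny": True},
    {"id": 2, "name": "Ridge Overlook", "distanceKm": 9.2,
     "elevationGain": 540, "companion": "luis", "wasSunny": False},
    {"id": 3, "name": "Wildflower Loop", "distanceKm": 5.1,
     "elevationGain": 180, "companion": "sam", "wasSunny": True},
]

print(encode_table("hikes", hikes))
```

The field names appear once in the header instead of once per row, which is where most of the savings over JSON come from on uniform arrays.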

🛡️ Validation + 📉 Compression

Building reliable LLM apps requires two things:

  1. Safety: You need to validate outputs (like you do with Zod/Pydantic).
  2. Efficiency: You need to compress inputs to save money.

ZON is the only library that gives you both in one package.

| Feature | Traditional Validation (e.g. Pydantic) | ZON |
|---|---|---|
| Type Safety | ✅ Yes | ✅ Yes |
| Runtime Validation | ✅ Yes | ✅ Yes |
| Input Compression | ❌ No | Yes (Saves ~50%) |
| Prompt Generation | ❌ Plugins needed | Built-in |
| Bundle Size | ~Large | ~5kb |

The Sweet Spot: Use ZON to save money on Input Tokens while keeping the strict safety you expect.


Key Features

  • 🎯 99.0% LLM Accuracy: 306/309 correct answers on gpt-5-nano in the retrieval benchmark, with self-explanatory structure and no hints needed

3. Smart Flattening (Dot Notation)

ZON automatically flattens top-level nested objects to reduce indentation. JSON:

{
  "config": {
    "database": {
      "host": "localhost"
    }
  }
}

ZON:

config.database{host:localhost}

4. Colon-less Structure

For nested objects and arrays, ZON omits the redundant colon, creating a cleaner, block-like structure. JSON:

{
  "user": {
    "name": "Alice",
    "roles": ["admin", "dev"]
  }
}

ZON:

user{name:Alice,roles[admin,dev]}

(Note: user{...} instead of user:{...})

  • 💾 Most Token-Efficient: 4-35% fewer tokens than TOON across the tokenizers tested
  • 🎯 JSON Data Model: Encodes the same objects, arrays, and primitives as JSON with deterministic, lossless round-trips
  • 📐 Minimal Syntax: Explicit headers (@(N) for count, column list) eliminate ambiguity for LLMs
  • 🧺 Tabular Arrays: Uniform arrays collapse into tables that declare fields once and stream row values
  • 🔢 Canonical Numbers: No scientific notation (1000000, not 1e6), NaN/Infinity → null
  • 🌳 Deep Nesting: Handles complex nested structures efficiently (91% compression on 50-level deep objects)
  • 🔒 Security Limits: Automatic DOS prevention (100MB docs, 1M arrays, 100K keys)
  • Production Ready: 94/94 tests pass, 27/27 datasets verified, zero data loss

Security & Data Types

Eval-Safe Design

ZON is immune to code injection attacks that plague other formats:

  • ✅ No eval() - Pure data format, zero code execution
  • ✅ No object constructors - Unlike YAML's !!python/object exploit
  • ✅ No prototype pollution - Dangerous keys blocked (__proto__, constructor)
  • ✅ Type-safe parsing - Numbers via safe parsing, not eval()

Comparison:

| Format | Eval Risk | Code Execution |
|---|---|---|
| ZON | ✅ None | Impossible |
| JSON | ✅ Safe | When not using eval() |
| YAML | ❌ High | !!python/object/apply RCE |
| TOON | ✅ Safe | Type-agnostic, no eval |

Data Type Preservation

Strong type guarantees:

  • Integers: 42 stays integer
  • Floats: 3.14 preserves decimal (.0 added for whole floats)
  • Booleans: Explicit T/F (not string "true"/"false")
  • Null: Explicit null (not omitted like undefined)
  • No scientific notation: 1000000, not 1e6 (prevents LLM confusion)
  • Special values normalized: NaN/Infinity → null
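
The rules above can be sketched as a single formatting function. This is an illustration written against the list, not the library's own code; `format_primitive` is a hypothetical name, and strings are passed through without any quoting.

```python
import math
from decimal import Decimal


def format_primitive(value) -> str:
    """Format a scalar following the rules listed above (illustration only)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check before int: bool is a subclass of int
        return "T" if value else "F"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        if math.isnan(value) or math.isinf(value):
            return "null"  # NaN and Infinity are normalized to null
        # Decimal's "f" format never falls back to exponent notation
        text = format(Decimal(repr(value)), "f")
        return text if "." in text else text + ".0"
    return str(value)


for sample in (42, 3.14, 5.0, 1e16, True, None, float("nan")):
    print(repr(sample), "->", format_primitive(sample))
```

Going through `Decimal(repr(value))` keeps the shortest round-trippable digits of the float while avoiding forms such as `1e+16`.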

Quality & Security

Data Integrity

  • Unit tests: 94/94 passed (+66 new validation/security/conformance tests)
  • Roundtrip tests: 27/27 datasets verified
  • No data loss or corruption

Security Limits (DOS Prevention)

Automatic protection against malicious input:

| Limit | Maximum | Error Code |
|---|---|---|
| Document size | 100 MB | E301 |
| Line length | 1 MB | E302 |
| Array length | 1M items | E303 |
| Object keys | 100K keys | E304 |
| Nesting depth | 100 levels | - |

Protection is automatic - no configuration required.
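
For readers curious how limits like these are typically enforced, here is a rough sketch. It is not taken from the zon-format source: `LimitError`, `check_text_limits`, and `check_container_limits` are hypothetical names, and treating 1 MB as 1024 × 1024 bytes is an assumption. Only the limit values and error codes come from the table above.

```python
class LimitError(ValueError):
    """Hypothetical exception carrying one of the error codes from the table."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


MAX_DOCUMENT_BYTES = 100 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_LINE_BYTES = 1024 * 1024
MAX_ARRAY_ITEMS = 1_000_000
MAX_OBJECT_KEYS = 100_000


def check_text_limits(text: str) -> None:
    """Reject oversized input before any parsing work is done."""
    if len(text.encode("utf-8")) > MAX_DOCUMENT_BYTES:
        raise LimitError("E301", "document too large")
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line.encode("utf-8")) > MAX_LINE_BYTES:
            raise LimitError("E302", f"line {number} too long")


def check_container_limits(value) -> None:
    """Recursively check array lengths and object key counts."""
    if isinstance(value, list):
        if len(value) > MAX_ARRAY_ITEMS:
            raise LimitError("E303", "array too long")
        for item in value:
            check_container_limits(item)
    elif isinstance(value, dict):
        if len(value) > MAX_OBJECT_KEYS:
            raise LimitError("E304", "too many object keys")
        for item in value.values():
            check_container_limits(item)
```

Checking sizes on the raw text first means a hostile document is rejected before the parser allocates anything proportional to its size.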

Validation (Strict Mode)

Enabled by default - validates table structure:

import zon

# Strict mode (default)
data = zon.decode(zon_string)

# Non-strict mode
data = zon.decode(zon_string, strict=False)

Error codes: E001 (row count), E002 (field count)
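
As an illustration of what the two strict-mode checks mean, the sketch below validates a single table block by hand. It is not the library's decoder: `TableError` and `validate_table` are hypothetical names, and splitting rows on commas ignores quoted values.

```python
import re

# Matches headers such as "users:@(2):id,name"
HEADER = re.compile(r"^(?P<key>[^:]+):@\((?P<count>\d+)\):(?P<columns>.+)$")


class TableError(ValueError):
    """Hypothetical error type carrying the codes listed above."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def validate_table(block: str) -> None:
    lines = [line for line in block.strip().splitlines() if line]
    match = HEADER.match(lines[0])
    if match is None:
        raise ValueError("not a table header")
    declared = int(match["count"])
    columns = match["columns"].split(",")
    rows = lines[1:]
    if len(rows) != declared:
        raise TableError("E001", f"expected {declared} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        fields = row.split(",")  # simplification: no quoted commas
        if len(fields) != len(columns):
            raise TableError(
                "E002", f"row {number}: expected {len(columns)} fields, found {len(fields)}"
            )


validate_table("users:@(2):id,name\n1,Alice\n2,Bob")  # passes silently
```

The declared row count in `@(N)` is what lets a decoder notice truncated LLM output instead of silently returning a shorter list.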


Installation & Quick Start

From PyPI (Recommended)

pip install zon-format

Basic Usage

import zon

# Your data
data = {
    "users": [
        {"id": 1, "name": "Alice", "role": "admin", "active": True},
        {"id": 2, "name": "Bob", "role": "user", "active": True}
    ]
}

# Encode to ZON
encoded = zon.encode(data)
print(encoded)
# users:@(2):active,id,name,role
# T,1,Alice,admin
# T,2,Bob,user

# Decode back
decoded = zon.decode(encoded)
assert decoded == data  # ✓ Lossless!

Command Line Interface (CLI)

The ZON package includes a CLI tool for converting files between JSON and ZON format.

Usage:

# Encode JSON to ZON format
zon encode data.json > data.zonf

# Decode ZON back to JSON
zon decode data.zonf > output.json

File Extension:

ZON files conventionally use the .zonf extension to distinguish them from other formats.


Format Overview

ZON auto-selects the optimal representation for your data.

Tabular Arrays

Best for arrays of objects with consistent structure:

users:@(3):active,id,name,role
T,1,Alice,Admin
T,2,Bob,User
F,3,Carol,Guest

  • @(3) = row count
  • Column names listed once
  • Data rows follow
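
Reading a table back is the mirror image of writing it. The following sketch decodes the block above into Python objects; `decode_cell` and `decode_table` are illustrative helpers rather than zon-format functions, and the sketch handles only unquoted cells.

```python
import re

INT = re.compile(r"-?\d+")
FLOAT = re.compile(r"-?\d+\.\d+")


def decode_cell(token: str):
    """Map a table cell to a Python value (simplified: no quoting or escapes)."""
    if token == "T":
        return True
    if token == "F":
        return False
    if token == "null":
        return None
    if INT.fullmatch(token):
        return int(token)
    if FLOAT.fullmatch(token):
        return float(token)
    return token


def decode_table(block: str) -> dict:
    header, *rows = block.strip().splitlines()
    # Header layout: key, "@(N)" row count, comma-separated column names
    key, _count, columns = header.split(":", 2)
    names = columns.split(",")
    return {
        key: [dict(zip(names, (decode_cell(c) for c in row.split(",")))) for row in rows]
    }


table = """users:@(3):active,id,name,role
T,1,Alice,Admin
T,2,Bob,User
F,3,Carol,Guest"""

print(decode_table(table))
```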

Nested Objects

Best for configuration and nested structures:

config:"{database:{host:db.example.com,port:5432},features:{darkMode:T}}"

Mixed Structures

ZON intelligently combines formats:

metadata:"{version:1.0.4,env:production}"
users:@(5):id,name,active
1,Alice,T
2,Bob,F
...
logs:"[{id:101,level:INFO},{id:102,level:WARN}]"

API Reference

zon.encode(data: Any) -> str

Encodes Python data to ZON format.

import zon

zon_str = zon.encode({
    "users": [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": "Bob"}
    ]
})

Returns: ZON-formatted string

zon.decode(zon_string: str, strict: bool = True) -> Any

Decodes ZON format back to Python data.

import zon

data = zon.decode("""
users:@(2):id,name
1,Alice
2,Bob
""")

Options:

# Strict mode (default) - validates table structure
data = zon.decode(zon_string)

# Non-strict mode - allows row/field count mismatches  
data = zon.decode(zon_string, strict=False)

Error Handling:

from zon import decode, ZonDecodeError

try:
    data = decode(invalid_zon)
except ZonDecodeError as e:
    print(e.code)    # "E001" or "E002"
    print(e.message) # Detailed error message

Returns: Original Python data structure


Runtime Evals (Schema Validation)

ZON includes a built-in validation layer designed for LLM Guardrails. Instead of just parsing data, you can enforce a schema to ensure the LLM output matches your expectations.

Why use this?

  1. Self-Correction: Feed error messages back to the LLM so it can fix its own mistakes.
  2. Type Safety: Guarantee that age is a number, not a string like "25".
  3. Hallucination Check: Ensure the LLM didn't invent fields you didn't ask for.

Usage

from zon import zon, validate

# 1. Define the Schema (The "Source of Truth")
UserSchema = zon.object({
    'name': zon.string().describe("The user's full name"),
    'age': zon.number().describe("Age in years"),
    'role': zon.enum(['admin', 'user']).describe("Access level"),
    'tags': zon.array(zon.string()).optional()
})

# 2. Generate the System Prompt (The "Input")
system_prompt = f"""
You are an API. Respond in ZON format with this structure:
{UserSchema.to_prompt()}
"""

print(system_prompt)
# Output:
# object:
#   - name: string - The user's full name
#   - age: number - Age in years
#   - role: enum(admin, user) - Access level
#   - tags: array of [string] (optional)

# 3. Validate the Output (The "Guardrail")
result = validate(llm_output, UserSchema)
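
The self-correction idea from the list above can be sketched as a small retry loop. Everything here is hypothetical scaffolding: `ask_llm` stands in for your model call, and `check` stands in for a wrapper around validation that returns an error message or None. The shape of the object returned by `validate` is not documented in this section, so that wrapper is left to the reader.

```python
from typing import Callable, Optional


def ask_with_retries(
    ask_llm: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Call the model until `check` reports no error, feeding errors back each time."""
    current_prompt = prompt
    for _ in range(max_attempts):
        reply = ask_llm(current_prompt)
        error = check(reply)
        if error is None:
            return reply
        # Append the validation error so the model can fix its own mistake
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected: {error}\nPlease fix it."
        )
    raise RuntimeError(f"no valid reply after {max_attempts} attempts")


# Demo with stand-ins: the fake model only answers correctly after feedback.
def fake_llm(prompt: str) -> str:
    return "age:25" if "rejected" in prompt else "age:twenty-five"


def fake_check(reply: str) -> Optional[str]:
    return None if reply == "age:25" else "age must be a number"


print(ask_with_retries(fake_llm, fake_check, "Return the user's age in ZON."))
```

Capping the number of attempts keeps a persistently wrong model from looping forever and burning tokens.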

💡 The "Input Optimization" Workflow (Best Practice)

The most practical way to use ZON is to save money on Input Tokens while keeping your backend compatible with JSON.

  1. Input (ZON): Feed the LLM massive datasets in ZON (saving ~50% tokens).
  2. Output (JSON): Ask the LLM to reply in standard JSON.

import zon

# 1. Encode your massive context (Save 50% tokens!)
context = zon.encode(large_dataset)

# 2. Send to LLM
prompt = f"""
Here is the data in ZON format:
{context}

Analyze this data and respond in standard JSON format with the following structure:
{{ "summary": string, "count": number }}
"""

# 3. LLM Output (Standard JSON)
# { "summary": "Found 50 users", "count": 50 }

This gives you the best of both worlds:

  • Cheaper API Calls (ZON Input)
  • Zero Code Changes (JSON Output)

Supported Types

  • zon.string()
  • zon.number()
  • zon.boolean()
  • zon.enum(['a', 'b'])
  • zon.array(schema)
  • zon.object({ 'key': schema })
  • .optional() modifier

LLM Framework Integration

OpenAI

import zon
from openai import OpenAI

users = [{"id": i, "name": f"User{i}", "active": True} for i in range(100)]

# Compress with ZON (saves tokens = saves money!)
zon_data = zon.encode(users)

# Requires openai>=1.0; the client reads OPENAI_API_KEY from the environment
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You will receive data in ZON format."},
        {"role": "user", "content": f"Analyze this user data:\n\n{zon_data}"}
    ]
)

LangChain

from langchain.llms import OpenAI
import zon

products = [{"name": "Laptop", "price": 999, "rating": 4.5}, ...]
zon_products = zon.encode(products)

# Use in your LangChain prompts with fewer tokens!

Documentation

Comprehensive guides and references are available in the zon-format/docs/ directory:

📖 Syntax Cheatsheet

Quick reference for ZON format syntax with practical examples.

What's inside:

  • Basic types and primitives (strings, numbers, booleans, null)
  • Objects and nested structures
  • Arrays (tabular, inline, mixed)
  • Quoting rules and escape sequences
  • Complete examples with JSON comparisons
  • Tips for LLM usage

Perfect for: Quick lookups, learning the syntax, copy-paste examples


🔧 API Reference

Complete API documentation for zon-format v1.0.4.

What's inside:

  • encode() function - detailed parameters and examples
  • decode() function - detailed parameters and examples
  • Python type definitions

📘 Complete Specification

Comprehensive formal specification including:

  • Data model and encoding rules
  • Security model (DOS prevention, no eval)
  • Data type system and preservation guarantees
  • Conformance checklists
  • Media type specification (.zonf, text/zon)
  • Examples and appendices

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

License

Copyright (c) 2025 ZON-FORMAT (Roni Bhakta)

MIT License - see LICENSE for details.


Made with ❤️ for the LLM community

ZON v1.0.4 - Token efficiency that scales with complexity
