
Smart structured-data-to-TOON gateway with pragmatic auto-gating for LLM prompts.


datoon

smart structured-data→TOON gateway — converts only when it actually saves tokens


Before/After · Install · What You Get · How It Works · Benchmarks · Full install guide


Raw structured data is often verbose in LLM prompts. TOON can save tokens — but blind conversion can also make payloads worse. datoon adds a decision layer: convert when structure and savings justify it, skip when they don't, and always explain why.

Supports JSON, CSV, JSONL, YAML, XML, Parquet, Avro, ORC, Excel, and Apple Numbers — auto-detected from file extension.

Before / After

JSON in the prompt (43 tokens)

{"users":[
  {"id":1,"name":"Ada","role":"admin"},
  {"id":2,"name":"Lin","role":"analyst"},
  {"id":3,"name":"Grace","role":"viewer"}
]}

datoon converts → TOON (24 tokens)

users[3]{id,name,role}:
  1,Ada,admin
  2,Lin,analyst
  3,Grace,viewer
{"decision":"convert","reason":"Estimated savings 44.19% (threshold 15.00%)."}
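For simple cases, the tabular layout above can be reproduced in a few lines of Python. This is only an illustrative sketch: `to_toon_table` is a hypothetical helper, not part of datoon or the TOON CLI, and it ignores the quoting, escaping, and nesting rules a real encoder has to handle.

```python
def to_toon_table(key, rows):
    """Render a uniform list of flat dicts in the tabular layout shown above.

    Hypothetical helper for illustration; no quoting or escaping is applied.
    """
    if not rows:
        raise ValueError("need at least one row")
    fields = list(rows[0])
    # Every row must expose the same keys in the same order.
    if any(list(row) != fields for row in rows):
        raise ValueError("rows do not share the same keys in the same order")
    header = key + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    body = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + body)


users = [
    {"id": 1, "name": "Ada", "role": "admin"},
    {"id": 2, "name": "Lin", "role": "analyst"},
    {"id": 3, "name": "Grace", "role": "viewer"},
]
print(to_toon_table("users", users))
```

The field names appear once in the header instead of once per row, which is where the savings on uniform arrays come from.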

CSV from a data pipeline (111 tokens as JSON)

id,name,role
1,Ada,admin
2,Lin,analyst
3,Grace,viewer

datoon auto-converts → TOON (24 tokens)

datoon data.csv --report-stdout

Same result. Zero JSON serialization in your code.

Non-uniform payload (26 tokens)

{"config":{"debug":true},"tags":["a","b"]}

datoon skips → keeps JSON

{"decision":"skip","reason":"No uniform object arrays found with at least 3 rows."}

No Node.js call. No silent corruption.
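One way to picture the structural check behind this skip is a recursive search for arrays of objects that share the same keys. The sketch below is an assumption about the general idea, not datoon's analyzer: `find_uniform_arrays` and its `min_rows` parameter are hypothetical names, and the default of 3 is taken from the reason string above.

```python
import json


def find_uniform_arrays(node, min_rows=3, path="$"):
    """Return paths of lists whose items are dicts with identical key sets."""
    found = []
    if isinstance(node, list):
        all_dicts = bool(node) and all(isinstance(item, dict) for item in node)
        if all_dicts and len(node) >= min_rows:
            keys = set(node[0])
            if all(set(item) == keys for item in node):
                found.append(path)
        # Keep descending: uniform arrays may be nested inside list items.
        for index, item in enumerate(node):
            found.extend(find_uniform_arrays(item, min_rows, f"{path}[{index}]"))
    elif isinstance(node, dict):
        for key, value in node.items():
            found.extend(find_uniform_arrays(value, min_rows, f"{path}.{key}"))
    return found


skipped = json.loads('{"config":{"debug":true},"tags":["a","b"]}')
tabular = json.loads('{"users":[{"id":1},{"id":2},{"id":3}]}')
print(find_uniform_arrays(skipped))  # no candidate arrays in this payload
print(find_uniform_arrays(tabular))
```

Because a check like this needs only the parsed data, it can run before any external converter is invoked.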

Same data. Right format. Always explained.

┌──────────────────────────────────────────────────┐
│  PAYLOAD SAVINGS (auto avg)    ████░░░░░░   28%  │
│  PAYLOAD SAVINGS (agent skill) ████████░░   62%  │
│  DECISION ACCURACY             ██████████  100%  │
│  HARMFUL CONVERSIONS BLOCKED   ██████████  100%  │
└──────────────────────────────────────────────────┘

Important: datoon saves payload tokens, that is, the structured-data portion of your prompt. Token savings depend on payload shape: uniform tabular data converts well; deeply nested or non-uniform structures are skipped. Every decision includes a reason so pipelines can log, debug, and trust the outcome.

Install

# core (JSON, CSV, JSONL, XML — no extra deps)
uv add datoon
pip install datoon

# with YAML support
pip install "datoon[yaml]"

# with Excel support
pip install "datoon[excel]"

# with Parquet / ORC / Avro support
pip install "datoon[columnar]"

# with Apple Numbers support
pip install "datoon[numbers]"

# with tiktoken-based token counting
pip install "datoon[tokens]"

# with MCP server
pip install "datoon[mcp]"

# everything
pip install "datoon[all]"

Requires Python 3.12+. TOON conversion requires Node.js with npx in PATH — analysis and format reading work without it.
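A pipeline can check the Node.js prerequisite up front with the standard library. This is a minimal sketch, assuming that finding `npx` on PATH is a good enough signal; `npx_available` is a hypothetical helper, not a datoon function.

```python
import shutil


def npx_available():
    """Return True when an `npx` executable can be found on PATH."""
    return shutil.which("npx") is not None


if npx_available():
    print("npx found: TOON conversion can be attempted")
else:
    print("npx missing: only analysis and format reading will work")
```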

For Claude Code plugin, Codex, and MCP config → INSTALL.md.

What You Get

| Component | What it does |
| --- | --- |
| datoon CLI | Auto-gate any supported format → TOON from terminal or scripts |
| Python API | convert_json_for_llm() + read_tabular() for any LLM pipeline |
| MCP Server | convert_json, convert_text, analyze_json tools for Claude Desktop, Cursor, Windsurf |
| Claude Code Plugin | /datoon in-session trigger, installs from GitHub in one command |
| Codex Plugin | Marketplace plugin: structured-data mode for Codex |

Supported input formats

| Format | Extension | Extra needed |
| --- | --- | --- |
| JSON | .json | none |
| JSONL | .jsonl, .ndjson | none |
| CSV | .csv | none |
| XML | .xml | none |
| YAML | .yaml, .yml | datoon[yaml] |
| Excel | .xlsx, .xls | datoon[excel] |
| Parquet | .parquet | datoon[columnar] |
| Avro | .avro | datoon[columnar] |
| ORC | .orc | datoon[columnar] |
| Apple Numbers | .numbers | datoon[numbers] |
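Extension-based detection over this table can be sketched as a dictionary lookup. The code below is an assumption about the approach, not datoon's implementation: `detect_format` and `EXTENSION_FORMATS` are hypothetical names, the format strings are the values the `--format` flag accepts, and lowercasing the suffix and raising on an unknown extension are choices made for this example.

```python
from pathlib import Path

# Extension-to-format mapping copied from the table above.
EXTENSION_FORMATS = {
    ".json": "json",
    ".jsonl": "jsonl",
    ".ndjson": "jsonl",
    ".csv": "csv",
    ".xml": "xml",
    ".yaml": "yaml",
    ".yml": "yaml",
    ".xlsx": "excel",
    ".xls": "excel",
    ".parquet": "parquet",
    ".avro": "avro",
    ".orc": "orc",
    ".numbers": "numbers",
}


def detect_format(path=None, explicit=None):
    """Pick a format: explicit override, then file extension, then JSON."""
    if explicit:
        return explicit
    if path is None:
        return "json"  # stdin without an override is treated as JSON
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return EXTENSION_FORMATS[suffix]


print(detect_format("data.csv"))
print(detect_format("logs.ndjson"))
print(detect_format())
```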

How It Works

  1. Detect format — from --format flag, file extension, or default to JSON for stdin
  2. Read + normalize — parse source into list of row dicts; serialize to compact JSON
  3. Analyze structure — uniform object arrays? acceptable depth? minimum rows?
  4. Gate early — non-candidates skip before any CLI call; no Node.js overhead
  5. Convert + estimate — TOON CLI runs, token savings calculated
  6. Gate savings — below threshold → return JSON; above → return TOON with report

Every path returns a ConversionReport with decision, reason, and token estimates. Pipelines never get silent surprises.
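The final savings gate comes down to one ratio compared against a threshold. The sketch below is illustrative only: `gate_on_savings` is a hypothetical function, the returned dict is a stand-in for the real ConversionReport, and treating a ratio exactly equal to the threshold as "convert" is an assumption.

```python
def gate_on_savings(json_tokens, toon_tokens, min_savings_ratio=0.15):
    """Return a report-like dict deciding between TOON and the original JSON."""
    if json_tokens <= 0:
        raise ValueError("json_tokens must be positive")
    # Relative savings: the fraction of the JSON token count that TOON removes.
    savings = (json_tokens - toon_tokens) / json_tokens
    decision = "convert" if savings >= min_savings_ratio else "skip"
    reason = (
        f"Estimated savings {savings * 100:.2f}% "
        f"(threshold {min_savings_ratio * 100:.2f}%)."
    )
    return {"decision": decision, "reason": reason, "savings_ratio": savings}


print(gate_on_savings(43, 24))  # token counts from the Before / After example
print(gate_on_savings(93, 91))  # orders-nested from the benchmark table
```

With the Before / After counts (43 JSON tokens, 24 TOON tokens) the ratio is 19/43, which is the 44.19% shown in that example's report.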


Quick Start

JSON (stdin):

echo '{"users":[{"id":1,"name":"Ada"},{"id":2,"name":"Lin"},{"id":3,"name":"Grace"}]}' | datoon --report-stdout

CSV (auto-detected from extension):

datoon data.csv --report-stdout

JSONL:

datoon data.jsonl -o output.toon

YAML (requires datoon[yaml]):

datoon data.yaml --report-stdout

Parquet (requires datoon[columnar]):

datoon data.parquet --report ./report.json

Explicit format override:

datoon --format csv --report-stdout < data.csv

Force conversion (bypass gating — for experiments):

datoon data.json --force --report-stdout

Python API

JSON conversion:

from datoon import convert_json_for_llm, ConversionConfig, DatoonError

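# raw_json is the JSON text to gate; send_to_model (below) stands in for your own LLM call
config = ConversionConfig(min_savings_ratio=0.15, max_depth=6, min_uniform_rows=3)

try:
    outcome = convert_json_for_llm(raw_json, config)
except DatoonError:
    # log or handle datoon-specific failures here, then propagate
    raise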

# outcome.payload_text  — TOON or original JSON
# outcome.report.decision  — "convert" | "skip"
# outcome.report.reason    — human-readable explanation
send_to_model(outcome.payload_text)

Any format via read_tabular:

import json
from pathlib import Path
from datoon import read_tabular, convert_json_for_llm, ConversionConfig

# text formats: csv, jsonl, yaml, xml
rows = read_tabular("csv", text=csv_string)

# binary formats: excel, parquet, orc, avro, numbers
rows = read_tabular("parquet", path=Path("data.parquet"))

json_text = json.dumps(rows, separators=(",", ":"))
outcome = convert_json_for_llm(json_text, ConversionConfig())
send_to_model(outcome.payload_text)
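To see what "list of row dicts" and "compact JSON" mean without datoon installed, the same shape can be produced with the standard library. This sketch uses `csv.DictReader` as a stand-in and makes no claim about how `read_tabular` parses CSV internally; in particular, `DictReader` leaves every value as a string, and whether datoon coerces types is not covered here.

```python
import csv
import io
import json

csv_string = "id,name,role\n1,Ada,admin\n2,Lin,analyst\n3,Grace,viewer\n"

# csv.DictReader yields one dict per data row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_string)))

# Compact separators drop the spaces json.dumps inserts by default.
json_text = json.dumps(rows, separators=(",", ":"))
print(json_text)
```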

Structure-only analysis (no Node.js required):

from datoon.analyzer import analyze_payload
from datoon.models import ConversionConfig

analysis = analyze_payload(parsed_data, ConversionConfig())
print(analysis.is_candidate, analysis.reason)

MCP Server

datoon ships an MCP server with three tools:

| Tool | Description |
| --- | --- |
| convert_json | Full JSON conversion with policy gating |
| convert_text | Converts CSV, YAML, XML, or JSONL text with policy gating |
| analyze_json | Structure analysis only; no Node.js needed |

Claude Desktop / Cursor / Windsurf config:

{
  "mcpServers": {
    "datoon": {
      "command": "uvx",
      "args": ["--from", "datoon[mcp]", "datoon", "mcp"]
    }
  }
}

Run locally:

datoon mcp     # or the standalone script: datoon-mcp

Listed on the MCP Registry, Smithery, and Glama. See MARKETPLACES.md.


Claude Code Plugin

Install directly from GitHub:

claude plugin marketplace add andrii-su/datoon
claude plugin install datoon@datoon

Trigger in-session:

/datoon
convert this JSON to TOON if it saves tokens
use datoon mode for structured data

CLI Reference

| Flag | Default | Description |
| --- | --- | --- |
| --format | auto | Input format: json, csv, jsonl, yaml, xml, excel, parquet, avro, orc, numbers |
| --force | false | Bypass gating and minimum savings threshold |
| --min-savings | 0.15 | Minimum relative token savings required |
| --max-depth | 6 | Maximum nesting depth for auto-conversion |
| --min-uniform-rows | 3 | Minimum rows in uniform object arrays |
| --timeout | 30 | Seconds before TOON CLI call is aborted |
| --report <path> | | Write JSON conversion report to file |
| --report-stdout | | Print JSON conversion report to stderr |
| -o <path> | stdout | Output file path |
| --version | | Print version and exit |

Format is auto-detected from file extension. Use --format to override or when reading from stdin.
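When calling the CLI from a Python script, the flags above can be assembled into an argument list for `subprocess`. This is a sketch under assumptions: `build_datoon_argv` and `run_datoon` are hypothetical wrappers, only flags listed in the table are used, and how datoon signals failure through its exit code is not documented here.

```python
import subprocess


def build_datoon_argv(input_path, *, fmt=None, min_savings=None,
                      report_path=None, output_path=None, force=False):
    """Assemble a datoon command line from the flags in the table above."""
    argv = ["datoon", str(input_path)]
    if fmt is not None:
        argv += ["--format", fmt]
    if min_savings is not None:
        argv += ["--min-savings", str(min_savings)]
    if force:
        argv.append("--force")
    if report_path is not None:
        argv += ["--report", str(report_path)]
    if output_path is not None:
        argv += ["-o", str(output_path)]
    return argv


def run_datoon(argv, timeout=60):
    """Run the command; subprocess raises CalledProcessError on a non-zero exit."""
    return subprocess.run(argv, check=True, capture_output=True,
                          text=True, timeout=timeout)


print(build_datoon_argv("data.csv", min_savings=0.2, report_path="report.json"))
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.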


Benchmarks

PYTHONPATH=src python benchmarks/run.py --dry-run
PYTHONPATH=src python benchmarks/run.py
PYTHONPATH=src python benchmarks/run.py --update-readme

Why auto mode outperforms forced conversion

Auto mode avoids low-benefit and high-risk payloads (orders-nested, mixed-non-uniform) while matching forced TOON's average token count on suitable ones. Every decision comes with a reasoned report.

| Scenario | JSON Baseline | Forced TOON | datoon Auto |
| --- | --- | --- | --- |
| Average tokens | 103 | 67 | 67 |
| Avg token saved | 0.0% | 27.5% | 28.2% |
| Decision quality | n/a | Converts all | Converts 3/5, skips harmful cases |

| Dataset | JSON | TOON (forced) | Raw Saved | Auto | Auto Tokens | Auto Saved |
| --- | --- | --- | --- | --- | --- | --- |
| users-small | 56 | 31 | 44.6% | convert | 31 | 44.6% |
| events-medium | 198 | 111 | 43.9% | convert | 111 | 43.9% |
| orders-nested | 93 | 91 | 2.2% | skip | 93 | 0.0% |
| mixed-non-uniform | 35 | 37 | -5.7% | skip | 35 | 0.0% |
| metrics-wide | 133 | 63 | 52.6% | convert | 63 | 52.6% |
| Average | 103 | 67 | 27.5% | 3/5 convert | 67 | 28.2% |

Forced conversion succeeded for 5/5 payloads.
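The gap between forced and auto mode in the per-dataset table can be checked by hand. The sketch below assumes the Average row is an unweighted mean of the per-dataset percentages, which reproduces the published figures; the token counts and decisions are copied from the table, and `saved_pct` is a hypothetical helper.

```python
# (dataset, json_tokens, forced_toon_tokens, auto_decision) from the table above.
RESULTS = [
    ("users-small", 56, 31, "convert"),
    ("events-medium", 198, 111, "convert"),
    ("orders-nested", 93, 91, "skip"),
    ("mixed-non-uniform", 35, 37, "skip"),
    ("metrics-wide", 133, 63, "convert"),
]


def saved_pct(before, after):
    """Percentage of tokens removed relative to the JSON baseline."""
    return (before - after) / before * 100


# Forced mode always takes the TOON count, even when TOON is larger.
forced = [saved_pct(j, t) for _, j, t, _ in RESULTS]
# Auto mode keeps the JSON (0% saved) whenever the decision is "skip".
auto = [saved_pct(j, t) if d == "convert" else 0.0 for _, j, t, d in RESULTS]

forced_avg = sum(forced) / len(forced)
auto_avg = sum(auto) / len(auto)
print(f"forced: {forced_avg:.1f}%  auto: {auto_avg:.1f}%")
```

The difference comes from mixed-non-uniform, where forced conversion adds tokens (35 → 37) and so contributes a negative percentage to the forced average, while the small gain on orders-nested that auto mode gives up does not offset it.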

Format conversion benchmark

Token savings when converting from common structured formats (CSV, JSONL, XML, YAML). Baseline is the JSON representation of the same data — what an LLM would receive without datoon.

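| Dataset | Format | JSON Tokens | TOON (forced) | Auto | Auto Tokens | Auto Saved |
| --- | --- | --- | --- | --- | --- | --- |
| users-csv | csv | 53 | 29 | convert | 29 | 45.3% |
| events-jsonl | jsonl | 194 | 109 | convert | 109 | 43.8% |
| catalog-xml | xml | 96 | 50 | convert | 50 | 47.9% |
| metrics-yaml | yaml | 129 | 61 | convert | 61 | 52.7% |
| Average | | 118 | 62 | 4/4 convert | 62 | 47.4% |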

Forced conversion succeeded for 4/4 payloads.

Agent skill evaluation

Artifact-based subagent comparison — identical analysis tasks, two modes:

  • with_skill: agent received the datoon skill and followed the conversion workflow.
  • without_skill: agent used JSON directly, no TOON or datoon.

3 payload sizes × 3 iterations × 2 modes = 18 total agent runs. Both modes: 100% correct answers.

| Scenario | Avg JSON Tokens | Avg TOON Tokens | Avg Payload Saved |
| --- | --- | --- | --- |
| small | 225 | 118 | 47.6% |
| medium | 2,972 | 1,138 | 61.7% |
| large | 17,757 | 6,673 | 62.4% |

Full report and raw outputs: benchmarks/agent_skill_eval/. Savings are payload-token estimates, not full end-to-end model-token usage.


Development

Contributor workflow: CONTRIBUTING.md. Maintainer/agent notes: CLAUDE.md.

Setup:

uv sync --extra dev
uvx pre-commit install

Tests:

pytest -m "not integration"   # unit only (102 tests)
pytest                        # with integration (requires Node.js + npx)

Skill sync + plugin metadata:

python scripts/validate_skill_sync.py
python scripts/validate_plugin_metadata.py


License

MIT
