Semantic Data Interchange Format. Compact, canonicalizable and token-efficient structured data
Semantic Data Interchange Format
Compact, semantic and canonicalizable structured data
for AI agents, deterministic workflows and human-auditable records.
What is SDIF? · Quick start · Format at a glance · Token efficiency · Ecosystem · Documentation
- Compact: Shape declared once. No repeated field names.
- Semantic: Tables, relations, metadata and intent.
- Canonical: Stable output for hashing, signing and comparison.
- Auditable: Designed to be read, reviewed and trusted.
What is SDIF?
SDIF (Semantic Data Interchange Format) is a compact, canonicalizable and AI-friendly data format for structured information that needs to move cleanly between humans, tools, agents and deterministic workflows.
It is designed for cases where data should be:
- small enough to be efficient in AI context windows;
- structured enough for machines to parse and validate;
- readable enough for humans to review;
- deterministic enough for hashing, signing and reproducible workflows;
- semantic enough to express tables, relations, metadata and intent.
SDIF also includes an AI projection surface, .sdif.ai, designed for token-dense agent exchange while remaining reversible back into canonical SDIF when the projection contract is respected.
Quick start
pip install -e '.[dev]'
sdif parse examples/plan.sdif
sdif canon examples/plan.sdif
sdif canon examples/plan.sdif --schema examples/schema.sdif
sdif hash examples/plan.sdif
sdif validate examples/plan.sdif --schema examples/schema.sdif
sdif tokens examples/plan.sdif
sdif to-json examples/plan.sdif
sdif from-json document.json
sdif ai examples/plan.sdif --alias kind=k --alias status=st
sdif tokens reports byte size, tokenizer identity and token count. It uses tiktoken/cl100k_base when available and falls back to a deterministic 4-bytes-per-token estimate.
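The fallback behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind sdif tokens: the function names, the identity label for the estimate and the choice to round up are hypothetical. The text above only fixes the 4-bytes-per-token ratio and the preference for tiktoken/cl100k_base.

```python
import math

BYTES_PER_TOKEN = 4  # ratio stated in the text above


def estimate_tokens(text: str) -> int:
    """Deterministic fallback: UTF-8 byte length / 4, rounded up (rounding is an assumption)."""
    byte_size = len(text.encode("utf-8"))
    return math.ceil(byte_size / BYTES_PER_TOKEN)


def count_tokens(text: str) -> tuple[str, int]:
    """Return (tokenizer identity, token count), preferring tiktoken when installed."""
    try:
        import tiktoken
    except ImportError:
        # Hypothetical identity label for the estimate.
        return ("estimate/4-bytes-per-token", estimate_tokens(text))
    encoding = tiktoken.get_encoding("cl100k_base")
    return ("tiktoken/cl100k_base", len(encoding.encode(text)))
```

Because the estimate depends only on byte length, it gives the same count on every machine, which keeps reports comparable when tiktoken is not installed.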
Format at a glance
JSON repeats field names across every record:
[
{ "id": "R1", "status": "done", "owner": "build", "evidence": "reports/build.md" },
{ "id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md" },
{ "id": "R3", "status": "done", "owner": "security", "evidence": "reports/audit.md" }
]
SDIF declares the shape once and uses literal tabs between cells. Editors must preserve tabs — this is a deliberate tradeoff for compactness:
@sdif 1.0
kind Plan
id release.v1
title "Release readiness plan"
items[id,status,owner,evidence]:
R1 done build reports/build.md
R2 open qa reports/tests.md
R3 done security reports/audit.md
rel:
release.v1 validated_by R1
release.v1 blocked_by R2
release.v1 governed_by R3
Semantic relationships are first-class, not embedded strings.
The result is structured information that reads closer to a document while still behaving like a contract.
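To make the "shape declared once" idea concrete, here is a minimal Python sketch that turns uniform JSON records into a header line plus tab-separated rows, modelled only on the example above. It does not use the sdif package. The helper name records_to_table is hypothetical, and quoting, escaping and indentation rules are left out because they are defined by the specification, not by this sketch.

```python
import json


def records_to_table(name: str, records: list[dict]) -> str:
    """Render uniform records as one header line and tab-separated rows."""
    if not records:
        raise ValueError("need at least one record to infer the columns")
    columns = list(records[0])
    lines = [f"{name}[{','.join(columns)}]:"]
    for record in records:
        if list(record) != columns:
            raise ValueError(f"record does not match declared shape: {record}")
        cells = [str(record[column]) for column in columns]
        # Escaping is out of scope here, so refuse cells that would break a row.
        if any("\t" in cell or "\n" in cell for cell in cells):
            raise ValueError("cells containing tabs or newlines are not handled")
        lines.append("\t".join(cells))
    return "\n".join(lines)


records = json.loads("""[
  {"id": "R1", "status": "done", "owner": "build", "evidence": "reports/build.md"},
  {"id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md"}
]""")
table = records_to_table("items", records)
print(table)
```

The field names appear once in the header instead of once per record, which is where the size reduction for repeated records comes from.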
Token efficiency
The benchmark derives every compared format from the same canonical JSON source in examples/golden/. Results below are from the most recent run across 21 documents and 3 tokenizers.
| Format | Consensus avg rank | Median ratio vs JSON Compact |
|---|---|---|
| SDIF AI | 1.10 | 56.8% |
| SDIF | 2.60 | 59.5% |
| CSV Bundle | 2.70 | 61.2% |
| YAML | 5.35 | 95.3% |
| JSON Compact | 5.65 | 100.0% |
| JSON Pretty | 7.00 | 137.3% |
| XML | 8.00 | 171.7% |
SDIF AI wins 57 of 63 tokenizer/document pairs. SDIF canonical wins 2.
The benchmark repository contains the exact corpus model, generated artifacts and methodology needed to reproduce these numbers.
These results are corpus-dependent. Not every data shape benefits equally from tabular projection. Claude and Llama tokenizers require separate opt-in before claiming results for those models.
For full methodology, corpus model and per-document breakdowns, see sdif-benchmarks.
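As a reading aid for the "Median ratio vs JSON Compact" column, the sketch below shows one plausible way to compute such a figure: divide each document's token count by the JSON Compact count for the same document, then take the median. This is an assumption about the metric, not the benchmark's actual code, and the token counts are made-up placeholders, not measured results.

```python
from statistics import median


def median_ratio(format_tokens: list[int], baseline_tokens: list[int]) -> float:
    """Median of per-document token ratios against a baseline, as a percentage."""
    if len(format_tokens) != len(baseline_tokens) or not format_tokens:
        raise ValueError("need one count per document for both formats")
    ratios = [100.0 * f / b for f, b in zip(format_tokens, baseline_tokens)]
    return median(ratios)


# Hypothetical token counts for three documents (placeholders, not benchmark data).
baseline = [200, 400, 1000]
candidate = [120, 200, 700]
print(f"{median_ratio(candidate, baseline):.1f}%")
```

A median is less sensitive than a mean to a single document that compresses unusually well or badly, which matters when results are corpus-dependent.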
Ecosystem
- CORE FORMAT: sdif. Specification, parser, canonicalizer and CLI. This repository.
- MEASUREMENT: sdif-benchmarks. Reproducible benchmark datasets and reports comparing SDIF with existing formats across token efficiency, context packing, round-trip fidelity and retrieval accuracy.
- SYNTAX TOOLING: tree-sitter-sdif. Tree-sitter grammar foundation for syntax highlighting and editor integrations. Registers both …
What SDIF is not
SDIF does not try to replace JSON, YAML, CSV, Markdown, XML, Parquet or Protocol Buffers. Those formats are useful and battle-tested.
- JSON: Universal and reliable, but noisy when repeated records dominate.
- YAML: Readable, but too permissive for deterministic workflows.
- CSV: Compact, but loses structure, relations and meaning quickly.
- Markdown: Great for humans, not enough when data must be parsed and verified.
SDIF focuses on a narrower problem:
compact, semantic, canonicalizable structured data
that can move cleanly between humans, machines and AI systems.
Documentation
| Document | Description |
|---|---|
| docs/spec.md | Full v1.0.0 specification |
| docs/canonicalization.md | Canonicalization contract |
| docs/comparison.md | Format comparison |
| docs/semantic-quality.md | Semantic quality methodology |
| docs/ai-speed-profile.md | AI speed profile contract |
| examples/ | Annotated examples |
| conformance/ | Shared conformance fixtures |
Limitations
SDIF 1.0 is the stable core contract. Current benchmark results are promising, but should be read with these boundaries:
- results are corpus-dependent;
- not every data shape benefits equally from tabular projection;
- editors and tools must preserve literal tabs in table rows;
- .sdif.ai is an agent projection surface, not the canonical signing surface;
- Claude and Llama3 token counting must be enabled separately before claiming results for those tokenizers.
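The tab-preservation limitation is easy to check mechanically. The following Python sketch flags table rows whose tab-separated cell count no longer matches the declared columns, for example after an editor converted tabs to spaces. It is a hypothetical helper based only on the table example in this README, not part of the sdif CLI, and it ignores any quoting rules the specification may define.

```python
def find_damaged_rows(columns: list[str], rows: list[str]) -> list[int]:
    """Return 0-based indexes of rows whose tab-separated cell count differs from the header."""
    expected = len(columns)
    return [i for i, row in enumerate(rows) if len(row.split("\t")) != expected]


columns = ["id", "status", "owner", "evidence"]
rows = [
    "R1\tdone\tbuild\treports/build.md",
    "R2 open qa reports/tests.md",  # tabs replaced by spaces, e.g. by an editor
]
damaged = find_damaged_rows(columns, rows)
print(damaged)
```

A check like this could run in a pre-commit hook so that whitespace damage is caught before a document is canonicalized or hashed.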
License
MIT. See LICENSE.
File details
Details for the file sdif_format-1.0.0.tar.gz.
File metadata
- Download URL: sdif_format-1.0.0.tar.gz
- Upload date:
- Size: 856.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 873ac75d57722a607d93e29971cfd9ee90ee06a75cb3a69af10ff79b4b2e38f1 |
| MD5 | aa6289f81ec76605396d57bc8edcc4e9 |
| BLAKE2b-256 | ef35bc913edfda79127b37edcee29dc70005f1c91419d4586a4304034598901f |
File details
Details for the file sdif_format-1.0.0-py3-none-any.whl.
File metadata
- Download URL: sdif_format-1.0.0-py3-none-any.whl
- Upload date:
- Size: 33.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 732b3c4123548815f4bc714cb2094d969f7d3d35a3e9944c2117e219348bd830 |
| MD5 | c312881201674cc027f87205d02e9eee |
| BLAKE2b-256 | 3b4726683acb67f9a05b2ac3d3c7e76a5ab5a688a33e3aaecb97d81f98791b70 |