Semantic Data Interchange Format. Compact, canonicalizable and token-efficient structured data

SDIF Format

Semantic Data Interchange Format

Compact, semantic and canonicalizable structured data
for AI agents, deterministic workflows and human-auditable records.

What is SDIF? · Quick start · Format at a glance · Token efficiency · Ecosystem · Documentation

Spec v1.0.0 (stable) · Python 3.10+ · MIT · AI-native data


  • Compact: shape declared once, no repeated field names.
  • Semantic: tables, relations, metadata and intent.
  • Canonical: stable output for hashing, signing and comparison.
  • Auditable: designed to be read, reviewed and trusted.


What is SDIF?

SDIF (Semantic Data Interchange Format) is a compact, canonicalizable and AI-friendly data format for structured information that needs to move cleanly between humans, tools, agents and deterministic workflows.

It is designed for cases where data should be:

  • small enough to be efficient in AI context windows;
  • structured enough for machines to parse and validate;
  • readable enough for humans to review;
  • deterministic enough for hashing, signing and reproducible workflows;
  • semantic enough to express tables, relations, metadata and intent.
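To illustrate why a canonical form matters for hashing and signing, here is a minimal sketch. It uses canonical JSON (sorted keys, fixed separators) purely as a stand-in; it is not SDIF's canonicalization, whose contract lives in docs/canonicalization.md. The helper names `canonical_bytes` and `digest` are hypothetical.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and fixed separators (a JSON stand-in, not SDIF)."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def digest(record: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# Same content, different key order.
a = {"kind": "Plan", "id": "release.v1", "title": "Release readiness plan"}
b = {"title": "Release readiness plan", "id": "release.v1", "kind": "Plan"}

# Naive serialization follows insertion order, so the two texts differ...
assert json.dumps(a) != json.dumps(b)
# ...while the canonical form yields one stable digest for both.
assert digest(a) == digest(b)
```

The point is only that a hash is meaningful when every producer emits the same bytes for the same content; the real rules for SDIF are whatever the canonicalization contract specifies.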

SDIF also includes an AI projection surface, .sdif.ai, designed for token-dense agent exchange while remaining reversible back into canonical SDIF when the projection contract is respected.
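As a rough intuition for what a reversible projection requires, the sketch below renames field names to shorter aliases and restores them. It is an assumption-laden illustration, not the .sdif.ai contract: the functions `project` and `restore` are hypothetical, and the only detail taken from this README is the alias pair style (`kind=k`, `status=st`) and the R2 row values.

```python
def project(record: dict, aliases: dict) -> dict:
    """Rename keys using aliases (long name -> short name)."""
    shorts = list(aliases.values())
    if len(set(shorts)) != len(shorts):
        raise ValueError("aliases must map to distinct short names")
    # A short name that is already used by an unaliased key would be ambiguous.
    clashes = set(shorts) & (set(record) - set(aliases))
    if clashes:
        raise ValueError(f"alias collides with existing key(s): {sorted(clashes)}")
    return {aliases.get(key, key): value for key, value in record.items()}


def restore(projected: dict, aliases: dict) -> dict:
    """Invert project() by mapping short names back to long names."""
    reverse = {short: long for long, short in aliases.items()}
    return {reverse.get(key, key): value for key, value in projected.items()}


aliases = {"kind": "k", "status": "st"}
row = {"id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md"}

projected = project(row, aliases)
assert projected == {"id": "R2", "st": "open", "owner": "qa", "evidence": "reports/tests.md"}
assert restore(projected, aliases) == row
```

The two checks in `project` are the sketch's version of "respecting the contract": without unique, non-colliding aliases the mapping could not be inverted.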



Quick start

pip install -e '.[dev]'
sdif parse    examples/plan.sdif
sdif canon    examples/plan.sdif
sdif canon    examples/plan.sdif --schema examples/schema.sdif
sdif hash     examples/plan.sdif
sdif validate examples/plan.sdif --schema examples/schema.sdif
sdif tokens   examples/plan.sdif
sdif to-json  examples/plan.sdif
sdif from-json document.json
sdif ai       examples/plan.sdif --alias kind=k --alias status=st

sdif tokens reports byte size, tokenizer identity and token count. It uses tiktoken/cl100k_base when available and falls back to a deterministic 4-bytes-per-token estimate.
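The fallback behaviour described above can be sketched as follows. This is not the CLI's implementation: the function names, the rounding direction (rounding up) and the `estimate/4-bytes-per-token` label are assumptions made for illustration. Only `tiktoken.get_encoding("cl100k_base")` and `encode` are real tiktoken calls.

```python
import math


def fallback_token_estimate(text: str) -> int:
    """Estimate tokens as UTF-8 byte length / 4, rounded up (rounding is an assumption)."""
    return math.ceil(len(text.encode("utf-8")) / 4)


def count_tokens(text: str) -> tuple[str, int]:
    """Return (tokenizer identity, token count), preferring tiktoken when usable."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding("cl100k_base")
        return "tiktoken/cl100k_base", len(encoding.encode(text))
    except Exception:
        # tiktoken is not installed, or its vocabulary could not be loaded.
        return "estimate/4-bytes-per-token", fallback_token_estimate(text)
```

Calling `count_tokens(open("examples/plan.sdif").read())` would then report which counter was used alongside the count, which matters when comparing numbers across machines.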



Format at a glance

JSON repeats field names across every record:

[
  { "id": "R1", "status": "done",    "owner": "build",    "evidence": "reports/build.md"  },
  { "id": "R2", "status": "open",    "owner": "qa",       "evidence": "reports/tests.md"  },
  { "id": "R3", "status": "done",    "owner": "security", "evidence": "reports/audit.md"  }
]

SDIF declares the shape once and uses literal tabs between cells. Editors must preserve these tabs; this is a deliberate tradeoff for compactness:

@sdif 1.0

kind Plan
id   release.v1
title "Release readiness plan"

items[id,status,owner,evidence]:
  R1	done	build	reports/build.md
  R2	open	qa	reports/tests.md
  R3	done	security	reports/audit.md

rel:
  release.v1  validated_by  R1
  release.v1  blocked_by    R2
  release.v1  governed_by   R3

Semantic relationships are first-class, not embedded strings.
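To make the example above concrete, here is a minimal sketch that reads a table block and a rel block into Python structures. It is not the sdif parser and does not follow the full specification: `parse_blocks` is a hypothetical helper, it ignores scalar fields such as `kind`, and it assumes table cells are tab-separated while rel rows are whitespace-separated triples.

```python
import re

# Matches a table header such as "items[id,status,owner,evidence]:"
TABLE_HEADER = re.compile(r"^(\w+)\[([^\]]*)\]:$")


def parse_blocks(text: str) -> dict:
    """Parse table blocks and a rel block from a small SDIF-like snippet."""
    tables = {}
    relations = []
    block = None  # ("table", name, columns), ("rel",) or None
    for line in text.splitlines():
        if not line.strip():
            block = None  # a blank line ends the current block
            continue
        if not line.startswith(" "):
            header = TABLE_HEADER.match(line)
            if header:
                name, columns = header.group(1), header.group(2).split(",")
                tables[name] = []
                block = ("table", name, columns)
            elif line == "rel:":
                block = ("rel",)
            else:
                block = None  # scalar fields are ignored in this sketch
            continue
        if block is None:
            continue
        if block[0] == "table":
            _, name, columns = block
            cells = line.lstrip(" ").split("\t")  # literal tabs separate cells
            if len(cells) != len(columns):
                raise ValueError(f"expected {len(columns)} cells, got {len(cells)}")
            tables[name].append(dict(zip(columns, cells)))
        else:
            subject, predicate, obj = line.split()
            relations.append((subject, predicate, obj))
    return {"tables": tables, "relations": relations}


SNIPPET = (
    "items[id,status,owner,evidence]:\n"
    "  R1\tdone\tbuild\treports/build.md\n"
    "  R2\topen\tqa\treports/tests.md\n"
    "\n"
    "rel:\n"
    "  release.v1  validated_by  R1\n"
    "  release.v1  blocked_by    R2\n"
)

parsed = parse_blocks(SNIPPET)
```

Each row becomes a dict keyed by the declared columns, and each relation becomes a (subject, predicate, object) triple. If an editor converts the tabs to spaces, the cell count no longer matches the header and the sketch raises, which is the failure mode the tab warning is about.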


Structured information closer to a document,
while still behaving like a contract.



Token efficiency

The benchmark derives every compared format from the same canonical JSON source in examples/golden/. Results below are from the most recent run across 21 documents and 3 tokenizers.

Format        Consensus avg rank  Median ratio vs JSON Compact
SDIF AI       1.10                56.8%
SDIF          2.60                59.5%
CSV Bundle    2.70                61.2%
YAML          5.35                95.3%
JSON Compact  5.65                100.0%
JSON Pretty   7.00                137.3%
XML           8.00                171.7%

SDIF AI wins 57 of 63 tokenizer/document pairs. SDIF canonical wins 2.

The benchmark repository contains the exact corpus model, generated artifacts and methodology needed to reproduce these numbers.

These results are corpus-dependent. Not every data shape benefits equally from tabular projection. Claude and Llama tokenizers require separate opt-in before claiming results for those models.
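For intuition about what a "ratio vs JSON Compact" measures, the sketch below compares byte sizes for the three records from "Format at a glance". It is only an illustration: it counts bytes rather than tokens, covers a single table rather than a full document, and is not the benchmark's methodology. The variable names are hypothetical.

```python
import json

records = [
    {"id": "R1", "status": "done", "owner": "build", "evidence": "reports/build.md"},
    {"id": "R2", "status": "open", "owner": "qa", "evidence": "reports/tests.md"},
    {"id": "R3", "status": "done", "owner": "security", "evidence": "reports/audit.md"},
]

# Compact JSON repeats every field name in every record.
json_compact = json.dumps(records, separators=(",", ":"))

# Tabular form: one header declaring the columns, then tab-separated rows.
columns = list(records[0])
rows = ["  " + "\t".join(record[column] for column in columns) for record in records]
tabular = "\n".join([f"items[{','.join(columns)}]:"] + rows)

json_bytes = len(json_compact.encode("utf-8"))
tabular_bytes = len(tabular.encode("utf-8"))
ratio = tabular_bytes / json_bytes

print(f"JSON compact: {json_bytes} B, tabular: {tabular_bytes} B, ratio: {ratio:.1%}")
```

The saving comes from dropping repeated field names, so it grows with the number of rows and shrinks for documents with few or irregular records, which is why the published figures depend on the corpus.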

For full methodology, corpus model and per-document breakdowns, see sdif-benchmarks.



Ecosystem

  • sdif (core format): specification, parser, canonicalizer and CLI. The normative reference for the format. This repository.
  • sdif-benchmarks (measurement): reproducible benchmark datasets and reports comparing SDIF with existing formats across token efficiency, context packing, round-trip fidelity and retrieval accuracy.
  • tree-sitter-sdif (syntax tooling): Tree-sitter grammar foundation for syntax highlighting and editor integrations. Registers both .sdif and .sdif.ai file types.



What SDIF is not

SDIF does not try to replace JSON, YAML, CSV, Markdown, XML, Parquet or Protocol Buffers. Those formats are useful and battle-tested.

  • JSON: universal and reliable, but noisy when repeated records dominate.
  • YAML: readable, but too permissive for deterministic workflows.
  • CSV: compact, but loses structure, relations and meaning quickly.
  • Markdown: great for humans, but not enough when data must be parsed and verified.

SDIF focuses on a narrower problem:

compact, semantic, canonicalizable structured data
that can move cleanly between humans, machines and AI systems.



Documentation

Document                   Description
docs/spec.md               Full v1.0.0 specification
docs/canonicalization.md   Canonicalization contract
docs/comparison.md         Format comparison
docs/semantic-quality.md   Semantic quality methodology
docs/ai-speed-profile.md   AI speed profile contract
examples/                  Annotated examples
conformance/               Shared conformance fixtures


Limitations

SDIF 1.0 is the stable core contract. Current benchmark results are promising, but should be read with these boundaries:

  • results are corpus-dependent;
  • not every data shape benefits equally from tabular projection;
  • editors and tools must preserve literal tabs in table rows;
  • .sdif.ai is an agent projection surface, not the canonical signing surface;
  • Claude and Llama3 token counting must be enabled separately before claiming results for those tokenizers.


License

MIT. See LICENSE.
