cartulary

Foreign keys for your Markdown docs: cross-document referential integrity, not just schema validation.

cartulary validates a folder of Markdown files against a declarative YAML schema. Plenty of tools check one file's frontmatter or heading structure; cartulary's job is the part they don't: referential integrity across the whole set. Declare a primary key, point ref: fields at it, and cartulary resolves every cross-document reference, verifies the reciprocal link exists on the other side, and flags dangling, malformed, and duplicate ids — across different document types sharing one key namespace.

Think of it as JSON Schema for an interlinked corpus of Markdown — or, more plainly, foreign keys for your docs.

A cartulary is a medieval register that bound scattered charters into one cross-referenced, authoritative collection. This does the same for a folder of Markdown.


Who it's for

  • Docs-as-code / knowledge bases kept as plain Markdown in a repo, where pages reference each other (specs → ADRs, services → owners, terms → glossary) and you want CI to catch a broken or one-sided link.
  • Validating machine-generated Markdown — when an LLM fills a fixed document template, cartulary checks the output actually conforms: required fields and sections, typed tables, and that the ids it emitted resolve and reciprocate. (None of the incumbents below target this.)
  • Anyone who has outgrown frontmatter-only validation and wants the body — tables, reference lists, logs — held to a schema too.

If you only need single-file frontmatter or heading checks, the established tools below are lighter and a better fit.


What's novel — and what already exists

Being honest about the landscape, because most of it is solved:

Capability                                         Already well-served by
-------------------------------------------------  --------------------------------------------------------
Frontmatter against a schema                       remark-lint-frontmatter-schema, Astro + zod, JSON Schema
Required sections / heading order                  markdownlint (rule MD043)
Cross-file link & heading existence                remark-validate-links
One file's body structure (headings/tables/lists)  jackchuka/mdschema
Cross-collection references (framework-bound)      Astro reference()
Reciprocal frontmatter links (one tool, one app)   Obsidian Sync Semantic Links

The first four are table stakes — cartulary does them, but they're not the point. The last two are the closest prior art to cartulary's actual wedge, and both stop short:

  • Astro reference() resolves references between content collections — but it's bound to a JS framework and build step, and as of Astro 5 it no longer checks that the referenced entry exists (withastro/astro#13268), the exact guarantee cartulary is built to provide.
  • Obsidian's reciprocal-link tooling mirrors relationships — but only over frontmatter properties, inside a single vault, as an editor plugin, not a schema you can run in CI.

What I could not find packaged anywhere is cartulary's combination: a standalone, language-agnostic, declarative validator that does typed primary-key reference resolution + reciprocal (inverse) checking + duplicate-key detection across heterogeneous document types in one shared namespace, spanning both frontmatter and body constructs (tables, reference lists) — driven by a single YAML file and runnable over any folder in any CI. That's the gap this fills.


How it works

A small compiler pipeline:

markdown text
   │  marko (GFM)            lex + parse
   ▼
flat AST
   │  SectionTreeBuilder     impose the heading hierarchy marko doesn't
   ▼
section tree
   │  SchemaValidator        semantic analysis against the YAML schema
   ▼
list of (path, message, severity) findings
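
The SectionTreeBuilder step above can be pictured with a small stack-based sketch. This is an illustration only, not cartulary's code: the Section class, the build_section_tree function, and the (level, title) input shape are assumptions made here, and the real builder works on marko's AST rather than on tuples.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    level: int
    title: str
    children: list["Section"] = field(default_factory=list)


def build_section_tree(headings: list[tuple[int, str]]) -> Section:
    """Nest a flat, document-ordered list of (level, title) headings.

    Heading levels are assumed to be >= 1; level 0 is the synthetic root.
    """
    root = Section(level=0, title="<root>")
    stack = [root]  # stack[-1] is the innermost section still open
    for level, title in headings:
        # Close every open section at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(level, title)
        stack[-1].children.append(node)
        stack.append(node)
    return root


flat = [(1, "The Hobbit"), (2, "Summary"), (2, "Written By"), (3, "Notes")]
tree = build_section_tree(flat)
book = tree.children[0]
print([child.title for child in book.children])  # ['Summary', 'Written By']
```

With a tree like this, a schema rule such as "section Written By is required" becomes a lookup among the children of the right parent instead of a scan over a flat list.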

Cross-document checks run as a second and third pass over the whole file set: collect every primary key and reference, resolve refs against known keys, then verify reciprocity.
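
As a rough sketch of the collect and resolve steps (reciprocity is left out here), assuming the per-file work has already extracted each document's type, primary key, and refs. The corpus shape, the check_references function, and the rule names duplicate-primary-key and wrong-document-type are hypothetical; only unresolved-reference is a rule id named in this README.

```python
from collections import defaultdict

# Hypothetical extracted corpus: file -> (document_type, primary key, refs).
# Each ref is (expected_document_type, target_id).
corpus = {
    "hobbit.md": ("book", "the-hobbit", [("author", "tolkien")]),
    "tolkien.md": ("author", "tolkien",
                   [("book", "the-hobbit"), ("book", "silmarillion")]),
    "copy.md": ("book", "the-hobbit", []),
}


def check_references(corpus):
    findings = []
    # Step 1: collect every primary key into one shared namespace.
    owners = defaultdict(list)
    for path, (doc_type, key, _refs) in corpus.items():
        owners[key].append((path, doc_type))
    for key, docs in owners.items():
        if len(docs) > 1:
            for path, _doc_type in docs:
                findings.append((path, "duplicate-primary-key", key))
    # Step 2: every ref must resolve to a known key of the expected type.
    for path, (_doc_type, _key, refs) in corpus.items():
        for expected_type, target in refs:
            if target not in owners:
                findings.append((path, "unresolved-reference", target))
            elif all(t != expected_type for _path, t in owners[target]):
                findings.append((path, "wrong-document-type", target))
    return findings


for finding in check_references(corpus):
    print(finding)
```

Here the two files sharing the-hobbit are flagged as duplicates and tolkien.md is flagged for the unknown id silmarillion. Keeping all keys in a single dictionary is what lets books and authors share one namespace.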


Install

pip install cartulary     # from PyPI

pip install -e .          # or, from a clone (for development)
# runtime deps: marko, PyYAML

Quickstart

# validate one or many files against a schema
python -m cartulary examples/library.schema.yaml tests/fixtures/*.md

# (after install) via the console script
cartulary examples/library.schema.yaml tests/fixtures/*.md

# machine-readable output for editors / CI
# (in bash, ** only recurses with `shopt -s globstar`; zsh recurses by default)
cartulary --json examples/library.schema.yaml docs/**/*.md

# SARIF 2.1.0 for GitHub code scanning (renders findings on the PR diff)
cartulary --sarif examples/library.schema.yaml docs/**/*.md > cartulary.sarif

Passing multiple files turns on cross-document reference checking.

from cartulary import validate_file, validate_files

# single file: iterate the findings
errors = validate_file("schema.yaml", "doc.md")
for err in errors:
    print(err.severity, err.path, err.message)

# corpus: passing several files turns on cross-document reference checks
results = validate_files("schema.yaml", ["a.md", "b.md"])

The CLI prints a per-file report and exits non-zero if any error (as opposed to warning) is found.


A taste of the schema

# a book references its authors; each author must list the book back
schemas:
  book:
    primary_key: book_id
    title_pattern: "{title} ({year})"
    frontmatter:
      fields:
        document_type: { value: book, required: true }
        book_id:  { type: book_id_format, required: true }
        title:    { required: true }
        year:     { type: year, required: true }
    sections:
      - heading: Summary
        required: true
        content: { type: prose }
      - heading: Written By
        content:
          type: ref_list
          style: unlabeled
          ref: author_id        # foreign key → author primary key
          inverse: Books         # author's "Books" section must list this book
  author:
    frontmatter:
      fields:
        document_type: { value: author, required: true }
        author_id: { type: author_id_format, required: true, primary_key: true }
        name:      { required: true }
    sections:
      - heading: Books
        content: { type: ref_list, style: unlabeled, ref: book_id, inverse: Written By }

See examples/ for three worked schemas: a multi-type library catalogue, a single-schema note format, and a family tree (the use case this began as: every person is a file, and every parent/child/spouse link must reciprocate). SCHEMA.md has the complete format reference.

The family-tree corpus is the quickest way to see the headline feature. It validates clean:

$ cartulary examples/family-tree.schema.yaml examples/family-tree/*.md
  ✓ Valid   (×7)

…but make a one-sided edit — say, remove Bilbo from his father's Children while he still lists his father under Parents — and cartulary catches the dangling half of the relationship that a frontmatter or link checker can't:

  ⚠ [section[Children]] Missing reciprocal reference: 'bilbo-baggins' lists
    'bungo-baggins' in Parents, but Children here does not reference 'bilbo-baggins'
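
The check behind that warning can be sketched in a few lines. This is a simplified model, not cartulary's implementation: the people dictionary, the INVERSE map, and missing_reciprocals are names invented for the illustration, and the real tool reads the inverse from the schema's inverse: key.

```python
# Hypothetical extracted data: person id -> {section heading: [referenced ids]}.
people = {
    "bungo-baggins": {"Children": []},               # the one-sided edit
    "bilbo-baggins": {"Parents": ["bungo-baggins"]},
}

# Each section names the section that must link back on the referenced document.
INVERSE = {"Parents": "Children", "Children": "Parents"}


def missing_reciprocals(people, inverse):
    findings = []
    for source, sections in people.items():
        for heading, targets in sections.items():
            back = inverse[heading]
            for target in targets:
                if target not in people:
                    continue  # unresolved ids belong to the resolution step
                if source not in people[target].get(back, []):
                    # Reported on the target: it is the side missing the link.
                    findings.append((target, back, source))
    return findings


print(missing_reciprocals(people, INVERSE))
# [('bungo-baggins', 'Children', 'bilbo-baggins')]
```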

Features

  • Frontmatter: required fields, exact-value, enums, regex/value_types, and list-valued fields (items, including any_of and object items).
  • value_types: named, reusable pattern / enum / any_of definitions, plus on-disk exists checks.
  • Title: {field} substitution with an optional ~ "circa" allowance.
  • Filenames: filename_must_match (stem equals a field) or the general filename_pattern template (e.g. "{slug}.md", "{year}-{slug}.md").
  • The schema is the contract: an invalid schema (unknown/misspelled or misplaced key, bad type reference) is a hard error — validate_files raises SchemaError and the CLI exits non-zero — so a malformed schema can't silently under-validate. x- prefixed keys are allowed for annotations.
  • Sections: required/deprecated, strict ordering, position: last, unknown-section policy, and recursive subsections.
  • Content types: prose, typed table (per-column types, nullable, min_rows), ref_list (labeled & unlabeled, with min_items/max_items cardinality), and log (regex per entry).
  • Cross-document: reference resolution (refs must resolve and point at the right document type), reciprocal (inverse) checks, duplicate-primary-key detection, and multi-schema routing by document_type over a shared key namespace.
  • Blast-radius scoping (--changed): validate the whole corpus but report only the findings a set of changed files is responsible for — ideal for gating a PR (see below).
  • Output for CI & editors: a per-file human report, --json, or --sarif (SARIF 2.1.0 — GitHub code scanning renders findings inline on the PR diff). Every finding carries a stable rule id (e.g. unresolved-reference, missing-reciprocal) so its identity survives message-wording changes.
  • Every finding is either an error or a warning; many rules let you choose which.
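
To make the {field} templates used by the title and filename rules concrete, here is a minimal sketch of the substitution idea. The render_template and filename_matches helpers are hypothetical, not part of cartulary's API, and the ~ "circa" allowance is not modelled.

```python
import re


def render_template(template: str, fields: dict) -> str:
    """Replace each {name} placeholder with the matching frontmatter value."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in fields:
            raise KeyError(f"template references unknown field {name!r}")
        return str(fields[name])
    return re.sub(r"\{(\w+)\}", substitute, template)


def filename_matches(template: str, fields: dict, filename: str) -> bool:
    """True when the file name equals the rendered template."""
    return render_template(template, fields) == filename


frontmatter = {"year": 1937, "slug": "the-hobbit", "title": "The Hobbit"}
print(render_template("{title} ({year})", frontmatter))
# The Hobbit (1937)
print(filename_matches("{year}-{slug}.md", frontmatter, "1937-the-hobbit.md"))
# True
```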

Validating a change against the whole corpus

Referential integrity is a property of the entire graph, so cartulary always reads the whole corpus — there's no correct way to check one file in isolation. Two consequences trip people up:

  • Validating a single file on its own is not a cheaper subset of the work — it's wrong. Its references to other documents look dangling (the id namespace is just that one file), and one-sided links from elsewhere are invisible. Always pass the whole corpus.
  • A problem you introduce by editing file A is often reported on a different file B. Have Bilbo list his father under Parents without adding Bilbo to that father's Children, and the missing reciprocal is reported on the father's file, not on Bilbo's (the one you edited), because the father is the side missing a link.

That second point is why gating CI on "findings in the file I changed" would miss exactly the breakage that edit caused. --changed solves it by reporting every finding whose blast radius touches a changed file — including ones attributed to a counterpart — and nothing else:

# Validate the whole corpus, but only report (and fail on) findings that the
# files changed in this PR are responsible for:
cartulary schema.yaml docs/ --changed "$(git diff --name-only origin/main)"

The whole corpus is still validated; --changed only scopes the output and the exit status, so a repo with pre-existing findings elsewhere won't fail a PR that didn't touch them. Each finding's blast radius is also exposed as caused_by in --json output (and on ValidationError) for editor/CI use.
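
The scoping rule can be sketched as a filter on caused_by. The Finding class below is a stand-in invented for this illustration (the real ValidationError may carry other fields), and the severities are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str              # file the finding is reported on
    severity: str          # "error" or "warning"
    rule: str
    caused_by: frozenset   # files in the finding's blast radius


def scope_to_changed(findings, changed_files):
    """Keep findings whose blast radius touches at least one changed file."""
    changed = set(changed_files)
    return [f for f in findings if f.caused_by & changed]


def exit_status(findings):
    """Non-zero only when an error-level finding is in scope."""
    return 1 if any(f.severity == "error" for f in findings) else 0


all_findings = [
    # Reported on the father, but Bilbo's file is part of the cause.
    Finding("bungo-baggins.md", "error", "missing-reciprocal",
            frozenset({"bungo-baggins.md", "bilbo-baggins.md"})),
    # Pre-existing problem elsewhere in the corpus.
    Finding("frodo-baggins.md", "error", "unresolved-reference",
            frozenset({"frodo-baggins.md"})),
]

in_scope = scope_to_changed(all_findings, ["bilbo-baggins.md"])
print([f.path for f in in_scope], exit_status(in_scope))
# ['bungo-baggins.md'] 1
```

Filtering on path instead of caused_by would return nothing for this change, which is exactly the miss described above.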


CI & pre-commit

pre-commit. cartulary ships a hook. In your .pre-commit-config.yaml:

repos:
  - repo: https://github.com/jdhorne/cartulary
    rev: v0.2.0
    hooks:
      - id: cartulary
        args: [schema.yaml, docs/]   # your schema, then the corpus path(s)

The hook validates the whole corpus (not just the staged files — referential integrity is a whole-graph property), and runs whenever a Markdown or YAML file changes.

GitHub Action. A composite action runs cartulary and uploads findings to code scanning, so they render inline on the PR diff:

# .github/workflows/cartulary.yml
permissions:
  security-events: write        # required for the SARIF upload
jobs:
  cartulary:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: jdhorne/cartulary@v0.2.0
        with:
          schema: schema.yaml
          files: docs/
          # optional: only fail on / report findings the PR is responsible for
          changed: ${{ github.event_name == 'pull_request' && 'docs/' || '' }}

The action installs cartulary from its own checkout (no PyPI release needed), emits SARIF, uploads it, and fails the job if any error-level finding is in scope. Set upload-sarif: "false" to skip the upload (e.g. on forks).
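
For readers unfamiliar with SARIF, the following sketch shows roughly what a minimal SARIF 2.1.0 log for such findings looks like. It is not cartulary's actual output: the to_sarif function and the tuple shape are assumptions, and the real file may include more detail, such as line regions.

```python
import json


def to_sarif(findings):
    """Build a minimal SARIF 2.1.0 log from (path, rule_id, severity, message)."""
    results = [
        {
            "ruleId": rule_id,
            "level": severity,  # SARIF levels include "error" and "warning"
            "message": {"text": message},
            "locations": [
                {"physicalLocation": {"artifactLocation": {"uri": path}}}
            ],
        }
        for path, rule_id, severity, message in findings
    ]
    rule_ids = sorted({rule_id for _path, rule_id, _sev, _msg in findings})
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {
                    "driver": {
                        "name": "cartulary",
                        "rules": [{"id": rule_id} for rule_id in rule_ids],
                    }
                },
                "results": results,
            }
        ],
    }


log = to_sarif([
    ("bungo-baggins.md", "missing-reciprocal", "warning",
     "Children here does not reference 'bilbo-baggins'"),
])
print(json.dumps(log, indent=2))
```

The stable rule id goes into ruleId, which is why a finding keeps its identity in code scanning even if the message wording changes.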


Specification & conformance

cartulary is a specification, and this Python package is its reference implementation. The schema format is documented in SCHEMA.md, and its observable behaviour is pinned by a language-neutral conformance suite: {schema, documents, expected-findings} cases that any implementation in any language can run to prove it conforms. The portable contract is the set of (document, path, severity) findings; exact wording is implementation-private. If you port cartulary to another language, make it pass conformance/.
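
A port's test harness might compare findings along these lines. This is a sketch under assumptions: the dictionary keys and the portable and conforms helpers are invented here, the on-disk layout of conformance/ is not shown, and treating findings as a set (so duplicates collapse) follows the wording above rather than a confirmed rule.

```python
def portable(findings):
    """Reduce findings to the portable contract: (document, path, severity)."""
    return {(f["document"], f["path"], f["severity"]) for f in findings}


def conforms(actual, expected):
    """Message wording is ignored; only the portable triples must match."""
    return portable(actual) == portable(expected)


expected = [
    {"document": "bungo-baggins.md", "path": "section[Children]",
     "severity": "warning"},
]
actual = [
    {"document": "bungo-baggins.md", "path": "section[Children]",
     "severity": "warning", "message": "wording specific to this port"},
]
print(conforms(actual, expected))  # True
```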

Develop / test

pip install -e ".[dev]"
pytest

Two test layers: tests/test_conformance.py runs the portable conformance suite (the spec contract), and tests/test_validator.py covers implementation-specific internals.

License

MIT
