Foreign keys for your Markdown docs: cross-document referential integrity, not just schema validation.
cartulary
cartulary validates a folder of Markdown files against a declarative YAML
schema. Plenty of tools check one file's frontmatter or heading structure;
cartulary's job is the part they don't: referential integrity across the
whole set. Declare a primary key, point ref: fields at it, and cartulary
resolves every cross-document reference, verifies the reciprocal link exists on
the other side, and flags dangling, malformed, and duplicate ids — across
different document types sharing one key namespace.
Think of it as JSON Schema for an interlinked corpus of Markdown — or, more plainly, foreign keys for your docs.
A cartulary is a medieval register that bound scattered charters into one cross-referenced, authoritative collection. This does the same for a folder of Markdown.
Who it's for
- Docs-as-code / knowledge bases kept as plain Markdown in a repo, where pages reference each other (specs → ADRs, services → owners, terms → glossary) and you want CI to catch a broken or one-sided link.
- Validating machine-generated Markdown — when an LLM fills a fixed document template, cartulary checks the output actually conforms: required fields and sections, typed tables, and that the ids it emitted resolve and reciprocate. (None of the incumbents below target this.)
- Anyone who has outgrown frontmatter-only validation and wants the body — tables, reference lists, logs — held to a schema too.
If you only need single-file frontmatter or heading checks, the established tools below are lighter and a better fit.
What's novel — and what already exists
Being honest about the landscape, because most of it is solved:
| Capability | Already well-served by |
|---|---|
| Frontmatter against a schema | remark-lint-frontmatter-schema, Astro + zod, JSON Schema |
| Required sections / heading order | markdownlint (rule MD043) |
| Cross-file link & heading existence | remark-validate-links |
| One file's body structure (headings/tables/lists) | jackchuka/mdschema |
| Cross-collection references (framework-bound) | Astro reference() |
| Reciprocal frontmatter links (one tool, one app) | Obsidian Sync Semantic Links |
The first four are table stakes — cartulary does them, but they're not the point. The last two are the closest prior art to cartulary's actual wedge, and both stop short:
- Astro reference() resolves references between content collections, but it's bound to a JS framework and build step, and as of Astro 5 it no longer checks that the referenced entry exists (withastro/astro#13268), the exact guarantee cartulary is built to provide.
- Obsidian's reciprocal-link tooling mirrors relationships, but only over frontmatter properties, inside a single vault, as an editor plugin, not a schema you can run in CI.
What I could not find packaged anywhere is cartulary's combination: a
standalone, language-agnostic, declarative validator that does typed
primary-key reference resolution + reciprocal (inverse) checking +
duplicate-key detection across heterogeneous document types in one shared
namespace, spanning both frontmatter and body constructs (tables,
reference lists) — driven by a single YAML file and runnable over any folder in
any CI. That's the gap this fills.
How it works
A small compiler pipeline:
markdown text
   │  marko (GFM)           lex + parse
   ▼
flat AST
   │  SectionTreeBuilder    impose the heading hierarchy marko doesn't
   ▼
section tree
   │  SchemaValidator       semantic analysis against the YAML schema
   ▼
list of (path, message, severity) findings
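The middle step, turning a flat run of headings into a nested tree, can be illustrated with a small stack-based sketch. This is not cartulary's SectionTreeBuilder: `Section`, `build_section_tree`, and the `(level, title)` input shape are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    level: int
    title: str
    children: list["Section"] = field(default_factory=list)


def build_section_tree(headings: list[tuple[int, str]]) -> Section:
    """Nest a flat list of (level, title) headings under a synthetic root."""
    root = Section(level=0, title="<root>")
    stack = [root]  # stack[-1] is the innermost section still open
    for level, title in headings:
        # Close every open section that is at the same depth or deeper.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(level, title)
        stack[-1].children.append(node)
        stack.append(node)
    return root


tree = build_section_tree(
    [(1, "Book"), (2, "Summary"), (2, "Written By"), (3, "Notes")]
)
```

The root has level 0, so it is never popped and every heading ends up with a parent.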
Cross-document checks run as a second and third pass over the whole file set: collect every primary key and reference, resolve refs against known keys, then verify reciprocity.
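As a rough illustration of the first two of those steps, here is a minimal sketch over an in-memory corpus. The data layout and the function names (`collect_keys`, `find_duplicates`, `find_unresolved`) are assumptions made for this example, not cartulary's internals.

```python
from collections import defaultdict

# Hypothetical corpus: file name -> (document type, primary key, outgoing refs).
corpus = {
    "dune.md": ("book", "dune", ["frank-herbert"]),
    "emma.md": ("book", "emma", ["jane-austen"]),
    "frank-herbert.md": ("author", "frank-herbert", ["dune"]),
    "dune-copy.md": ("book", "dune", []),
}


def collect_keys(corpus):
    """Collect step: map every primary key to the files that declare it."""
    owners = defaultdict(list)
    for filename, (_doc_type, key, _refs) in corpus.items():
        owners[key].append(filename)
    return owners


def find_duplicates(owners):
    """Keys declared by more than one file in the shared namespace."""
    return {key: files for key, files in owners.items() if len(files) > 1}


def find_unresolved(corpus, owners):
    """Resolve step: references that point at no known primary key."""
    return [
        (filename, ref)
        for filename, (_doc_type, _key, refs) in corpus.items()
        for ref in refs
        if ref not in owners
    ]


owners = collect_keys(corpus)
duplicates = find_duplicates(owners)
unresolved = find_unresolved(corpus, owners)
```

Keys have to be collected for the whole corpus before any reference is resolved, which is why the checks cannot be folded into the per-file pass.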
Install
pip install cartulary # from PyPI
pip install -e . # or, from a clone (for development)
# runtime deps: marko, PyYAML
Quickstart
# validate one or many files against a schema
python -m cartulary examples/library.schema.yaml tests/fixtures/*.md
# (after install) via the console script
cartulary examples/library.schema.yaml tests/fixtures/*.md
# machine-readable output for editors / CI
cartulary --json examples/library.schema.yaml docs/**/*.md
# SARIF 2.1.0 for GitHub code scanning (renders findings on the PR diff)
cartulary --sarif examples/library.schema.yaml docs/**/*.md > cartulary.sarif
Passing multiple files turns on cross-document reference checking.
from cartulary import validate_file, validate_files

errors = validate_file("schema.yaml", "doc.md")  # single file
results = validate_files("schema.yaml", ["a.md", "b.md"])  # corpus, with refs

for err in errors:
    print(err.severity, err.path, err.message)
The CLI prints a per-file report and exits non-zero if any error (as opposed to warning) is found.
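The exit-status rule can be sketched in a few lines. `Finding` below is a stand-in that models only the three attributes used in the loop above; representing severity as the strings "error" and "warning", and using 1 as the non-zero status, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    # Stand-in for a finding object; the field types are assumed.
    severity: str  # "error" or "warning"
    path: str
    message: str


def exit_status(findings: list[Finding]) -> int:
    """Return 1 if any finding is an error, else 0 (warnings do not fail)."""
    return 1 if any(f.severity == "error" for f in findings) else 0


only_warnings = [
    Finding("warning", "section[Children]", "Missing reciprocal reference"),
]
with_error = only_warnings + [
    Finding("error", "frontmatter.book_id", "Required field missing"),
]
```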
A taste of the schema
# a book references its authors; each author must list the book back
schemas:
  book:
    primary_key: book_id
    title_pattern: "{title} ({year})"
    frontmatter:
      fields:
        document_type: { value: book, required: true }
        book_id: { type: book_id_format, required: true }
        title: { required: true }
        year: { type: year, required: true }
    sections:
      - heading: Summary
        required: true
        content: { type: prose }
      - heading: Written By
        content:
          type: ref_list
          style: unlabeled
          ref: author_id   # foreign key → author primary key
          inverse: Books   # author's "Books" section must list this book
  author:
    frontmatter:
      fields:
        document_type: { value: author, required: true }
        author_id: { type: author_id_format, required: true, primary_key: true }
        name: { required: true }
    sections:
      - heading: Books
        content: { type: ref_list, style: unlabeled, ref: book_id, inverse: Written By }
See examples/ for three worked schemas — a multi-type library
catalogue, a single-schema note format, and a family tree
(the use case this began as: every person is a file, and every parent/child/
spouse link must reciprocate) — and SCHEMA.md for the complete
format reference.
The family-tree corpus is the quickest way to see the headline feature. It validates clean:
$ cartulary examples/family-tree.schema.yaml examples/family-tree/*.md
✓ Valid (×7)
…but make a one-sided edit — say, remove Bilbo from his father's Children
while he still lists his father under Parents — and cartulary catches the
dangling half of the relationship that a frontmatter or link checker can't:
⚠ [section[Children]] Missing reciprocal reference: 'bilbo-baggins' lists
'bungo-baggins' in Parents, but Children here does not reference 'bilbo-baggins'
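The reciprocity rule behind that warning can be sketched as follows. The in-memory `people` mapping and the `INVERSE` table are hypothetical stand-ins for what the schema's ref and inverse declarations express; this is not how cartulary stores documents.

```python
# Hypothetical corpus: person id -> section name -> referenced ids.
people = {
    "bilbo-baggins": {"Parents": ["bungo-baggins"], "Children": []},
    "bungo-baggins": {"Parents": [], "Children": []},  # Bilbo was removed here
}

# Each section names the section that must point back on the other side.
INVERSE = {"Parents": "Children", "Children": "Parents"}


def missing_reciprocals(people):
    """Return (target, inverse_section, source) for every one-sided link."""
    missing = []
    for source, sections in people.items():
        for section, targets in sections.items():
            inverse = INVERSE[section]
            for target in targets:
                # An unresolved target is a separate finding; skip it here.
                if target not in people:
                    continue
                if source not in people[target].get(inverse, []):
                    missing.append((target, inverse, source))
    return missing


findings = missing_reciprocals(people)
```

The finding is attributed to the target (the father), because that is the document whose section lacks the link.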
Features
- Frontmatter: required fields, exact-value, enums, regex/value_types, and list-valued fields (items, including any_of and object items).
- value_types: named, reusable pattern/enum/any_of definitions, plus on-disk exists checks.
- Title: {field} substitution with an optional ~ "circa" allowance.
- Filenames: filename_must_match (stem equals a field) or the general filename_pattern template (e.g. "{slug}.md", "{year}-{slug}.md").
- The schema is the contract: an invalid schema (unknown/misspelled or misplaced key, bad type reference) is a hard error (validate_files raises SchemaError and the CLI exits non-zero), so a malformed schema can't silently under-validate. x- prefixed keys are allowed for annotations.
- Sections: required/deprecated, strict ordering, position: last, unknown-section policy, and recursive subsections.
- Content types: prose, typed table (per-column types, nullable, min_rows), ref_list (labeled & unlabeled, with min_items/max_items cardinality), and log (regex per entry).
- Cross-document: reference resolution (refs must resolve and point at the right document type), reciprocal (inverse) checks, duplicate-primary-key detection, and multi-schema routing by document_type over a shared key namespace.
- Blast-radius scoping (--changed): validate the whole corpus but report only the findings a set of changed files is responsible for, ideal for gating a PR (see below).
- Output for CI & editors: a per-file human report, --json, or --sarif (SARIF 2.1.0; GitHub code scanning renders findings inline on the PR diff). Every finding carries a stable rule id (e.g. unresolved-reference, missing-reciprocal) so its identity survives message-wording changes.
- Every finding is an error or warning; many rules let you pick.
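To make the filename template idea concrete, here is a minimal sketch that fills a "{field}" template from frontmatter values and compares it with the file's base name. Plain `str.format_map` substitution is an assumption about how such a template could be checked, and `expected_filename` and `filename_matches` are hypothetical helpers, not part of cartulary.

```python
from pathlib import PurePath


def expected_filename(pattern: str, fields: dict) -> str:
    """Fill a "{field}" template from frontmatter values."""
    return pattern.format_map(fields)


def filename_matches(path: str, pattern: str, fields: dict) -> bool:
    """True when the file's base name equals the filled-in template."""
    return PurePath(path).name == expected_filename(pattern, fields)


fields = {"year": 1937, "slug": "the-hobbit"}
ok = filename_matches("docs/1937-the-hobbit.md", "{year}-{slug}.md", fields)
bad = filename_matches("docs/hobbit.md", "{year}-{slug}.md", fields)
```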
Validating a change against the whole corpus
Referential integrity is a property of the entire graph, so cartulary always reads the whole corpus — there's no correct way to check one file in isolation. Two consequences trip people up:
- Validating a single file on its own is not a cheaper subset of the work — it's wrong. Its references to other documents look dangling (the id namespace is just that one file), and one-sided links from elsewhere are invisible. Always pass the whole corpus.
- A problem you introduce by editing file A is often reported on a different file B. Remove bilbo from his father's Children while Bilbo still lists that father under Parents, and the missing reciprocal is reported on the father, not on Bilbo, because the father is the side now missing a link.
That second point is why gating CI on "findings in the file I changed" would
miss exactly the breakage that edit caused. --changed solves it by reporting
every finding whose blast radius touches a changed file — including ones
attributed to a counterpart — and nothing else:
# Validate the whole corpus, but only report (and fail on) findings that the
# files changed in this PR are responsible for:
cartulary schema.yaml docs/ --changed "$(git diff --name-only origin/main)"
The whole corpus is still validated; --changed only scopes the output and
the exit status, so a repo with pre-existing findings elsewhere won't fail a
PR that didn't touch them. Each finding's blast radius is also exposed as
caused_by in --json output (and on ValidationError) for editor/CI use.
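The scoping rule can be pictured as a filter over findings. In this sketch, `Finding` is a hypothetical stand-in, and treating caused_by as a set of file paths that includes both sides of a relationship is an assumption made for illustration; check the --json output for the real shape.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    file: str  # file the finding is reported on
    rule: str
    severity: str
    caused_by: frozenset[str] = frozenset()  # files responsible (assumed shape)


def in_scope(findings, changed):
    """Keep findings whose blast radius includes at least one changed file."""
    changed = set(changed)
    return [f for f in findings if f.caused_by & changed]


findings = [
    Finding(
        "people/bungo-baggins.md", "missing-reciprocal", "warning",
        frozenset({"people/bungo-baggins.md", "people/bilbo-baggins.md"}),
    ),
    Finding(
        "people/frodo-baggins.md", "unresolved-reference", "error",
        frozenset({"people/frodo-baggins.md"}),
    ),
]
reported = in_scope(findings, ["people/bilbo-baggins.md"])
```

Only the finding on Bungo's file survives, even though the changed file was Bilbo's; the pre-existing error on Frodo's file is out of scope.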
CI & pre-commit
pre-commit. cartulary ships a hook. In your .pre-commit-config.yaml:
repos:
  - repo: https://github.com/jdhorne/cartulary
    rev: v0.2.0
    hooks:
      - id: cartulary
        args: [schema.yaml, docs/]  # your schema, then the corpus path(s)
The hook validates the whole corpus (not just the staged files — referential integrity is a whole-graph property), and runs whenever a Markdown or YAML file changes.
GitHub Action. A composite action runs cartulary and uploads findings to code scanning, so they render inline on the PR diff:
# .github/workflows/cartulary.yml
permissions:
  security-events: write  # required for the SARIF upload
jobs:
  cartulary:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: jdhorne/cartulary@v0.2.0
        with:
          schema: schema.yaml
          files: docs/
          # optional: only fail on / report findings the PR is responsible for
          changed: ${{ github.event_name == 'pull_request' && 'docs/' || '' }}
The action installs cartulary from its own checkout (no PyPI release needed),
emits SARIF, uploads it, and fails the job if any error-level finding is in
scope. Set upload-sarif: "false" to skip the upload (e.g. on forks).
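For readers unfamiliar with SARIF, the sketch below builds a minimal SARIF 2.1.0 log from a list of findings. It shows the general shape code scanning consumes, not cartulary's actual output; the `to_sarif` helper and the tuple layout are hypothetical.

```python
import json


def to_sarif(findings):
    """Build a minimal SARIF 2.1.0 log from (file, rule, severity, message) tuples."""
    results = [
        {
            "ruleId": rule,
            "level": severity,  # "error" and "warning" are valid SARIF levels
            "message": {"text": message},
            "locations": [
                {"physicalLocation": {"artifactLocation": {"uri": file}}}
            ],
        }
        for file, rule, severity, message in findings
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "cartulary"}}, "results": results}],
    }


log = to_sarif([
    ("docs/bungo-baggins.md", "missing-reciprocal", "warning",
     "Children does not reference 'bilbo-baggins'"),
])
sarif_text = json.dumps(log, indent=2)
```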
Specification & conformance
cartulary is a specification, and this Python package is its reference
implementation. The schema format is documented in SCHEMA.md,
and its observable behaviour is pinned by a language-neutral conformance
suite — {schema, documents, expected-findings} cases that any
implementation in any language can run to prove it conforms. The portable
contract is the set of (document, path, severity) findings; exact wording is
implementation-private. If you port cartulary to another language, make it pass
conformance/.
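A conformance check along those lines could compare findings as sets of triples, ignoring message text. This is a sketch under that assumption; `conformance_diff` is a hypothetical helper and the case layout under conformance/ is not reproduced here. Comparing as sets also ignores duplicate findings, which a real runner may want to count.

```python
def conformance_diff(expected, actual):
    """Compare findings as sets of (document, path, severity) triples."""
    expected, actual = set(expected), set(actual)
    return {"missing": expected - actual, "unexpected": actual - expected}


expected = [("bungo-baggins.md", "section[Children]", "warning")]

# An implementation's raw findings carry a message too; it is dropped before
# comparing because wording is not part of the portable contract.
raw = [
    ("bungo-baggins.md", "section[Children]", "warning",
     "Missing reciprocal reference"),
]
actual = [(doc, path, severity) for doc, path, severity, _message in raw]

diff = conformance_diff(expected, actual)
```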
Develop / test
pip install -e ".[dev]"
pytest
Two test layers: tests/test_conformance.py runs the portable conformance suite (the spec contract), and tests/test_validator.py covers implementation-specific internals.
License