Abstract document AST for KAOS — Block/Inline content model, provenance, annotations, serialization
kaos-content
Part of Kelvin Agentic OS (KAOS) — open agentic infrastructure for legal work, built by 273 Ventures. See the full KAOS package map for the rest of the stack.
kaos-content is the canonical document model for KAOS — a typed
Block/Inline AST with provenance, annotations, views, and round-trip
serializers.
Every KAOS document processor (kaos-pdf, kaos-office, kaos-web)
produces a ContentDocument; every downstream consumer (search,
chunking, LLM programs, MCP resources) reads one. The shape is inspired
by Pandoc's Block/Inline discipline and Docling's provenance model.
kaos-content does not parse PDFs, fetch URLs, or call LLMs —
companion packages do. This package is what makes them interoperable.
To expose kaos-content parsers and serializers over MCP, add the
companion package kaos-mcp.
Install
uv add kaos-content
# or
pip install kaos-content
kaos-content requires Python 3.13 or newer. Optional extras unlock
specific capabilities:
| Extra | Pulls in | Unlocks |
|---|---|---|
| [markdown] | markdown-it-py[plugins] | parse_markdown round-trip |
| [html] | lxml>=6.1 | HTML parser to AST |
| [images] | Pillow>=12.2, numpy | KaosImage wrapper (PIL + DPI + provenance) |
| [layout] | numpy only | Layout primitives (X-Y cut, projection profiles, clustering, valley detection). Operates on numeric coordinate arrays: no PIL, no raster IO. Install [images] separately if you also need to load source images. |
| [polars] | polars>=1.0 | TabularDocument ↔ Polars DataFrame |
| [duckdb] | duckdb>=1.0 | DuckDB SQL bridge for tabular |
| [nlp] | kaos-nlp-core | BM25 search + sentence-level units + fuzzy_binary / minhash dedup levels |
| [mcp] | kaos-mcp | MCP tool registration |
| [dedup-perceptual] | imagehash | Perceptual page-image dedup (PerceptualHashLevel). For semantic embedding clustering, install kaos-nlp-transformers, which registers SemanticDedupLevel against this package's DedupLevel protocol. |
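To give a feel for what "projection profiles" and "valley detection" mean in the [layout] row, here is a rough pure-Python sketch of the idea: project box extents onto one axis, count coverage at each position, and treat empty runs as candidate cut positions. The function names are hypothetical and this is not the package's implementation, which is described as working on numeric coordinate arrays with numpy.

```python
def projection_profile(spans: list[tuple[int, int]], length: int) -> list[int]:
    """Count how many half-open spans [start, end) cover each integer position."""
    profile = [0] * length
    for start, end in spans:
        for pos in range(max(start, 0), min(end, length)):
            profile[pos] += 1
    return profile


def find_valleys(profile: list[int], min_width: int = 1) -> list[tuple[int, int]]:
    """Return half-open runs of zero coverage that are at least min_width wide."""
    valleys: list[tuple[int, int]] = []
    run_start = None
    for pos, count in enumerate(profile + [1]):  # sentinel closes a trailing run
        if count == 0 and run_start is None:
            run_start = pos
        elif count != 0 and run_start is not None:
            if pos - run_start >= min_width:
                valleys.append((run_start, pos))
            run_start = None
    return valleys


# Horizontal extents of text boxes in two columns on a 100-unit-wide page.
x_spans = [(5, 45), (8, 44), (55, 95), (56, 90)]
profile = projection_profile(x_spans, length=100)
gaps = find_valleys(profile, min_width=5)
```

Here `gaps` contains the two page margins and the gutter between the columns; an X-Y cut style splitter would typically recurse on the regions either side of the widest interior gap.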
Quick start
Build a small document, serialize it, search it, and walk its sections:
from kaos_content.model.document import ContentDocument
from kaos_content.model.metadata import DocumentMetadata
from kaos_content.shortcuts import bold, heading, link, paragraph, text
from kaos_content.search import search_document
from kaos_content.serializers.markdown import serialize_markdown
from kaos_content.serializers.html import serialize_html
from kaos_content.views.document_view import DocumentView

doc = ContentDocument(
    metadata=DocumentMetadata(title="Hello"),
    body=(
        heading(1, "Hello, KAOS"),
        paragraph(
            text("Built on a "),
            bold("typed AST"),
            text(" with "),
            link("https://kelvin.legal", "provenance"),
            text("."),
        ),
    ),
)

print(serialize_markdown(doc))
# # Hello, KAOS
#
# Built on a **typed AST** with [provenance](https://kelvin.legal).

print(serialize_html(doc, allow_raw_html=False))
# <h1>Hello, KAOS</h1><p>Built on a <strong>typed AST</strong> with
# <a href="https://kelvin.legal">provenance</a>.</p>

results = search_document(doc, "typed AST", top_k=5)
for r in results.results:
    print(r.score, r.block_ref, r.text[:80])

for section in DocumentView(doc).flat_sections:
    print(section.heading_text, section.heading_ref)
The AST is constructed entirely from frozen Pydantic models — every
field is type-validated at construction time, content-model
constraints (blocks contain blocks or inlines, never mixed) are
enforced before a node ever reaches a serializer, and the same
ContentDocument round-trips losslessly through JSON via
model_dump_json() / model_validate_json().
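The content-model rule described above (a block holds blocks or inlines, never a mix) and the JSON round-trip can be illustrated without the library. The sketch below uses standard-library frozen dataclasses as a stand-in for the Pydantic models; the classes `Text`, `Para`, and `Section` and the helpers `to_json` / `from_json` are hypothetical and are not kaos-content's API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Text:
    """Hypothetical inline node."""
    value: str


@dataclass(frozen=True)
class Para:
    """Hypothetical block node whose children must all be inlines."""
    children: tuple[Text, ...]

    def __post_init__(self) -> None:
        for child in self.children:
            if not isinstance(child, Text):
                raise TypeError("Para accepts inline children only")


@dataclass(frozen=True)
class Section:
    """Hypothetical block node whose children must all be blocks."""
    children: tuple[Para, ...]

    def __post_init__(self) -> None:
        for child in self.children:
            if not isinstance(child, Para):
                raise TypeError("Section accepts block children only")


def to_json(section: Section) -> str:
    """Serialize the tree; tuples become JSON arrays."""
    return json.dumps(asdict(section))


def from_json(payload: str) -> Section:
    """Rebuild the tree, re-running the construction-time checks."""
    data = json.loads(payload)
    return Section(
        children=tuple(
            Para(children=tuple(Text(**t) for t in p["children"]))
            for p in data["children"]
        )
    )


doc = Section(children=(Para(children=(Text("Built on a typed AST."),)),))
restored = from_json(to_json(doc))
```

Because the checks live in the constructors, a tree that violates the rule (for example an inline placed directly inside `Section`) fails when it is built, before any serializer sees it, and `restored == doc` holds after the round-trip.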
Concepts
The package is built around nine composable primitives.
| Concept | What it is |
|---|---|
| ContentDocument | Frozen Pydantic AST: metadata + body of Block nodes |
| Block / Inline | Strictly separated: blocks contain blocks or inlines, never mixed; enforced at construction |
| Provenance | Source file, page, bounding box, char span, confidence (on any node) |
| Attr | Pandoc-style (id, classes, key-value) triple; the universal extension mechanism |
| Annotation | Standoff layer for overlapping marks (redactions, defined terms, citations, NLP entities) |
| node_ref | JSON pointer addressing (e.g. #/body/5); a stable target for MCP resources |
| DocumentView | Dynamic hierarchical views (pages, sections, paragraphs, sentences) computed from the flat AST |
| TabularDocument | Universal tabular AST: peer to ContentDocument, 17-type column system, Polars/DuckDB bridges |
| KaosImage | PIL wrapper carrying DPI + provenance, with bomb-resistant load (100 MP cap) |
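The node_ref row describes JSON pointer addressing such as #/body/5. As a minimal sketch of how a pointer of that shape resolves, the function below follows RFC 6901 (in its URI-fragment form) over plain dicts and lists. `resolve_ref` and the sample `document` are hypothetical; how kaos-content resolves refs internally is not stated here.

```python
from typing import Any
from urllib.parse import unquote


def resolve_ref(root: Any, ref: str) -> Any:
    """Resolve a '#/a/0/b' style JSON pointer against nested dicts and lists."""
    if not ref.startswith("#"):
        raise ValueError("expected a fragment-style pointer starting with '#'")
    pointer = unquote(ref[1:])
    if pointer == "":
        return root
    if not pointer.startswith("/"):
        raise ValueError("pointer must be empty or start with '/'")
    node = root
    for raw in pointer[1:].split("/"):
        # RFC 6901 escapes: '~1' means '/', '~0' means '~' (decode in this order).
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]
        else:
            node = node[token]
    return node


document = {
    "metadata": {"title": "Hello"},
    "body": [
        {"type": "heading", "text": "Hello, KAOS"},
        {"type": "paragraph", "text": "Built on a typed AST."},
    ],
}
second_block = resolve_ref(document, "#/body/1")
```

A ref like this stays valid for as long as the document is immutable, which is one reason a frozen AST pairs well with pointer-style addressing.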
Compatibility & status
| Aspect | Details |
|---|---|
| Python | 3.13, 3.14 (CI runs both, including the 3.14t free-threaded build) |
| OS | Linux, macOS, Windows (pure-Python wheel; no native code) |
| Maturity | Alpha (0.1.0a1). The Block/Inline grammar is near-stable; serializer flags, traversal helpers, and the [dedup-perceptual] / [layout] APIs are subject to refinement during the alpha cycle. |
| Stability policy | Pre-1.0: minor bumps (0.x → 0.(x+1)) may break behaviour; patch bumps are additive only. Every change is documented in CHANGELOG.md. The MCP tool surface and the safe-by-default serializer/parser/SQL contracts are public API. |
| Test coverage | 2,068 unit tests + 42 Hypothesis property tests pass on Python 3.13. |
| Type checker | Validated with ty, Astral's Python type checker. |
Security model
kaos-content ships safe-by-default serializers and bridges. The
contract:
| Surface | Default | Override |
|---|---|---|
| serialize_html(allow_raw_html=False) | strips raw HTML blocks; neuters javascript: / data: / vbscript: / file: URLs to # with a data-unsafe-url forensic attribute | allow_raw_html=True for trusted content |
| serialize_markdown(allow_raw_html=False) | same URL-neutering; `<!-- raw {format} stripped -->` placeholder for raw blocks | allow_raw_html=True |
| parsers.html | URLs canonicalised through HTML-entity decode → percent-decode → whitespace removal before scheme checks; defeats `jav&#x09;ascript:` and `javascript%3A` style bypasses | none |
| bridges.duckdb.execute_query(untrusted_sql=True) | application-level deny-list rejects read_csv, read_parquet, attach, copy, install, load, pragma (and SQL-comment evasions); strips line/block comments before matching | untrusted_sql=False for fully trusted SQL |
| bridges.duckdb.create_safe_connection() | engine-level sandbox: enable_external_access=false, unsigned-extension loads disabled | use a raw duckdb.connect() if you need filesystem access |
| KaosImage.from_bytes / from_path | rejects images > 100 MP via ImageDecompressionBombError (PIL's warning promoted to a hard error) | kaos_content.images.model.MAX_IMAGE_PIXELS = N |
| images.artifacts.load_image(max_bytes=50_000_000) | rejects artifact bodies > 50 MB before decoding | max_bytes=None for trusted artifacts |
| BoundingBox / Provenance / Cell / Image | Pydantic Field constraints reject inverted boxes, page=0, confidence>1, zero/negative spans, zero/negative dimensions at construction time | none |
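The URL canonicalisation order in the parsers.html row (HTML-entity decode, then percent-decode, then whitespace removal, then the scheme check) can be sketched with the standard library alone. `canonicalise`, `is_unsafe`, and `UNSAFE_SCHEMES` are hypothetical names for illustration; the package's actual routine may differ in detail.

```python
import html
import re
from urllib.parse import unquote

# Schemes the serializer rows above describe as neutered.
UNSAFE_SCHEMES = {"javascript", "data", "vbscript", "file"}


def canonicalise(url: str) -> str:
    """Entity-decode, percent-decode, then drop whitespace and control chars."""
    decoded = unquote(html.unescape(url))
    return re.sub(r"[\s\x00-\x1f]+", "", decoded)


def is_unsafe(url: str) -> bool:
    """True when the canonical form starts with a deny-listed scheme."""
    match = re.match(r"([A-Za-z][A-Za-z0-9+.\-]*):", canonicalise(url))
    return bool(match) and match.group(1).lower() in UNSAFE_SCHEMES
```

Checking the scheme only after all three steps is what matters: `jav&#x09;ascript:alert(1)` and `javascript%3Aalert(1)` both look harmless to a naive prefix check, but both canonicalise to a `javascript:` URL.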
See SECURITY.md for the disclosure policy and threat model.
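For the application-level SQL deny-list, the general shape of "strip comments, then match" looks roughly like the sketch below. It is a simplified illustration using regular expressions: `check_untrusted_sql` and `strip_comments` are hypothetical names, string literals are not parsed, and the real check in bridges.duckdb may be stricter or structured differently.

```python
import re

# Keywords and functions the table above lists as rejected.
DENIED = ("read_csv", "read_parquet", "attach", "copy", "install", "load", "pragma")
_COMMENTS = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)
_DENIED_RE = re.compile(r"\b(" + "|".join(DENIED) + r")\b", re.IGNORECASE)


def strip_comments(sql: str) -> str:
    """Replace line and block comments with a space (string literals are not parsed)."""
    return _COMMENTS.sub(" ", sql)


def check_untrusted_sql(sql: str) -> None:
    """Raise ValueError when a deny-listed keyword or function appears."""
    hit = _DENIED_RE.search(strip_comments(sql))
    if hit:
        raise ValueError(f"rejected keyword: {hit.group(1).lower()}")
```

A deny-list like this is a first filter only; the engine-level sandbox from create_safe_connection() is the layer that actually prevents filesystem access if a query slips past it.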
MCP tools
kaos-content registers seven MCP tools through
kaos_content.tools.register_content_tools(runtime):
| Tool | What it does |
|---|---|
| kaos-content-parse-markdown | Parse markdown text into a ContentDocument artifact |
| kaos-content-serialize | Load an artifact and serialize to markdown / HTML / text |
| kaos-content-chunk-document | Split a document at heading boundaries, store chunks as artifacts |
| kaos-content-search-document | BM25 / term-frequency search with AST block_ref results |
| kaos-content-search-table | Case-insensitive substring search inside a TabularDocument |
| kaos-content-extract-section | Pull a section by heading ref into a standalone document |
| kaos-content-extract-page | Pull a single page (requires page provenance) |
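Splitting "at heading boundaries", as the chunk-document tool is described, amounts to starting a new chunk whenever a heading block appears in the flat body. The sketch below shows that idea on a hypothetical `Block` dataclass; it is not the tool's implementation and ignores heading levels and artifact storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical flat block: kind is 'heading' or 'paragraph'."""
    kind: str
    text: str


def chunk_at_headings(blocks: list[Block]) -> list[list[Block]]:
    """Start a new chunk at every heading; leading non-heading blocks form their own chunk."""
    chunks: list[list[Block]] = []
    for block in blocks:
        if block.kind == "heading" or not chunks:
            chunks.append([])
        chunks[-1].append(block)
    return chunks


blocks = [
    Block("paragraph", "Preamble."),
    Block("heading", "1. Definitions"),
    Block("paragraph", "Agreement means this document."),
    Block("heading", "2. Term"),
    Block("paragraph", "The term is one year."),
]
chunks = chunk_at_headings(blocks)
```

Each chunk after the preamble begins with its heading, so a downstream consumer can label the chunk with the heading text without looking back at the full document.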
Configuration
kaos-content has no module-level environment variables — its public
APIs are all in-process. Settings that affect behaviour are documented
inline at the call site (MAX_IMAGE_PIXELS, DEFAULT_LOAD_IMAGE_MAX_BYTES,
allow_raw_html, untrusted_sql).
Companion packages
kaos-content is one of the packages in the
Kelvin Agentic OS. The broader stack:
| Package | Layer | What it does |
|---|---|---|
| kaos-core | Core | Foundational runtime, MCP-native types, registries, execution engine, VFS |
| kaos-content | Core | Typed document AST: Block/Inline, provenance, views |
| kaos-mcp | Bridge | FastMCP server, kaos management CLI, MCP resource templates |
| kaos-pdf | Extraction | PDF → AST with provenance |
| kaos-web | Extraction | Web extraction, browser automation, search, domain intelligence |
| kaos-office | Extraction | DOCX / PPTX / XLSX readers + writers to AST |
| kaos-tabular | Extraction | DuckDB-powered SQL analytics |
| kaos-source | Data | Government + financial data connectors (Federal Register, eCFR, EDGAR, GovInfo, PACER, GLEIF) |
| kaos-llm-client | LLM | Multi-provider LLM transport |
| kaos-llm-core | LLM | Typed LLM programming (Signatures, Programs, Optimizers) |
| kaos-nlp-core | Primitives (Rust) | High-performance NLP primitives |
| kaos-nlp-transformers | ML | Dense embeddings + retrieval |
| kaos-graph | Primitives (Rust) | Graph algorithms + RDF/SPARQL |
| kaos-ml-core | Primitives (Rust) | Classical ML on the document AST |
| kaos-citations | Legal | Legal citation extraction, resolution, verification |
| kaos-agents | Agentic | Agent runtime, memory, recipes |
| kaos-reference | Sample | Reference module for module authors |
Packages depend on kaos-core; everything else is opt-in. Mix and match
the ones you need.
Development
git clone https://github.com/273v/kaos-content
cd kaos-content
uv sync --group dev
Install pre-commit hooks (recommended — they run the same checks as CI on every commit, scoped to staged files):
uvx pre-commit install
uvx pre-commit run --all-files # one-time full sweep
Manual QA commands (the same set CI runs):
uv run ruff format --check kaos_content tests
uv run ruff check kaos_content tests
uv run ty check kaos_content tests
uv run pytest -m "not live and not network and not slow"
Build from source
uv build
uv pip install dist/*.whl
Contributing
Issues and pull requests are welcome. By contributing you certify the
Developer Certificate of Origin v1.1 —
sign every commit with git commit -s. Please open an issue before starting
on a non-trivial change so we can align on scope.
Security
For security issues, please do not file a public issue. Report privately via GitHub Private Vulnerability Reporting or email security@273ventures.com. See SECURITY.md for the full disclosure policy.
License
Apache License 2.0 — see LICENSE and NOTICE.
Copyright 2026 273 Ventures LLC. Built for kelvin.legal.