PDF extraction and document processing for KAOS — structured AST output with provenance
kaos-pdf
Part of Kelvin Agentic OS (KAOS) — open agentic infrastructure for legal work, built by 273 Ventures. See the full KAOS package map for the rest of the stack.
kaos-pdf is the PDF-extraction layer of KAOS — it turns a PDF byte stream
into a typed kaos-content
ContentDocument AST with provenance (page numbers, bounding boxes,
extraction confidence) on every node, plus a small set of read-only MCP
tools for agentic workflows. The engine is
pypdfium2 (Apache-2.0)
and all PDFium calls are serialised through a global lock so the library
is safe to call from threaded executors. No raw text strings escape — every
result is an AST node, a typed dataclass, or a KaosImage.
The base install is intentionally small: three runtime dependencies
(kaos-content[images,layout,markdown], kaos-core, pypdfium2) and no
compiled native code beyond the PDFium wheel. Heavier capabilities are
opt-in extras: [ocr] adds pytesseract for scanned pages (and requires
a system tesseract binary), [tables] adds pdfplumber (MIT, pure
Python — no Java, no GPU) for borderless and multi-line tables, and
[nlp] adds kaos-nlp-core for BM25 sentence-level search. VLM page
programs (describe / classify / OCR-via-VLM) live in
kaos-llm-core[vision] ≥ 0.1.0a3; they were moved out of kaos-pdf to
keep the extraction → LLM dependency one-directional. We do not
and will not depend on AGPL or GPL libraries (this rules out Surya for
OCR and camelot-lattice / Tabula for tables).
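Because the extras are opt-in, calling code may want to check what is available before choosing an OCR or table mode. The sketch below uses only the standard library. The import names in `EXTRA_MODULES` are assumptions inferred from the distribution names above (in particular `kaos_nlp_core`), and both helper functions are hypothetical, not part of kaos-pdf.

```python
import shutil
from importlib.util import find_spec

# Assumed mapping of extra name -> importable module name (not confirmed by the source).
EXTRA_MODULES = {
    "ocr": "pytesseract",
    "tables": "pdfplumber",
    "nlp": "kaos_nlp_core",
}


def available_extras(modules: dict[str, str] = EXTRA_MODULES) -> dict[str, bool]:
    """Report which optional modules can be located without importing them."""
    return {extra: find_spec(name) is not None for extra, name in modules.items()}


def tesseract_binary_found() -> bool:
    """The [ocr] extra also needs a system tesseract binary on PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    print(available_extras())
    print("tesseract on PATH:", tesseract_binary_found())
```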
Install
uv add kaos-pdf
# or
pip install kaos-pdf
# OCR for scanned PDFs (requires system tesseract binary)
uv add 'kaos-pdf[ocr]'
# Structured table extraction via pdfplumber
uv add 'kaos-pdf[tables]'
# BM25 sentence-level search via kaos-nlp-core
uv add 'kaos-pdf[nlp]'
kaos-pdf requires Python 3.13 or newer (3.14 is supported). The
package is pure Python — the only native code is the PDFium wheel shipped
by pypdfium2, which has prebuilt wheels for Linux, macOS, and Windows
on x86_64 and arm64.
Quick start
Extract a PDF into the document AST, render a page, and search for a term:
from kaos_pdf import (
    extract_pdf,
    get_pdf_metadata,
    get_pdf_outline,
    render_page,
    search_document,
)
# Parse the whole document into a kaos-content ContentDocument
doc = extract_pdf("contract.pdf")
print(len(doc.body), "top-level blocks")
# Typed metadata (PdfMetadata dataclass; sparse to_dict() for JSON)
meta = get_pdf_metadata("contract.pdf")
print(meta.page_count, meta.title, meta.author)
# Outline / bookmarks (list[PdfOutlineEntry], also typed)
for entry in get_pdf_outline("contract.pdf"):
    print(" " * entry.level, entry.title, "p", entry.page)
# Render the first page as a 300-DPI PIL image (returned as KaosImage)
image = render_page("contract.pdf", page_number=0, dpi=300)
image.pil.save("page-1.png")
# AST-grounded search — paragraph-level by default
hits = search_document(doc, "indemnification", top_k=5)
for hit in hits.results:
    print(f"score={hit.score:.2f} :: {hit.text[:80]}")
Every node in the returned ContentDocument carries a Provenance
(source path, 1-based page, bounding box, extractor name, confidence)
so downstream consumers — citation verifiers, redaction tooling,
labelers — can ground answers back to the original PDF.
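To make the grounding idea concrete, here is a minimal sketch of how a consumer might use those provenance fields. `ProvenanceSketch` and `NodeSketch` are hypothetical stand-ins that mirror the fields listed above; they are not the kaos-content classes, and the bounding-box order and the 0.8 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceSketch:
    """Hypothetical stand-in mirroring the provenance fields named in the text."""

    source: str
    page: int  # 1-based, as described above
    bbox: tuple[float, float, float, float]  # assumed order: (x0, y0, x1, y1)
    extractor: str
    confidence: float


@dataclass(frozen=True)
class NodeSketch:
    """Hypothetical AST node: some text plus where it came from."""

    text: str
    provenance: ProvenanceSketch


def cite(node: NodeSketch) -> str:
    """Build a human-readable pointer back to the source page."""
    return f"{node.provenance.source}, p. {node.provenance.page}"


def needs_review(nodes: list[NodeSketch], threshold: float = 0.8) -> list[NodeSketch]:
    """Select nodes whose extraction confidence falls below the threshold."""
    return [n for n in nodes if n.provenance.confidence < threshold]


nodes = [
    NodeSketch("Indemnification.", ProvenanceSketch("contract.pdf", 4, (72, 90, 540, 110), "text", 1.0)),
    NodeSketch("lndemnity cap", ProvenanceSketch("contract.pdf", 9, (72, 300, 540, 320), "ocr", 0.55)),
]
flagged = needs_review(nodes)
```

A citation verifier could call `cite()` on each supporting node, while a labeling tool might route the `flagged` nodes to a human.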
Concepts
The package is a thin, typed surface over pypdfium2. The most important
entries:
| Concept | What it is |
|---|---|
| extract_pdf(path, *, pages=None, ocr="never", tables="geometric", extract_images=False, image_src_builder=...) | Primary entry point. Returns a ContentDocument. pages selects 0-based indices; ocr is "never" / "auto" / "always"; tables is "geometric" / "engine" / "disabled"; image_src_builder lets callers control the image URI policy (default inlines as data: URLs). |
| extract_pdf_bytes(data, ...) / extract_pdf_with_tables(path, ...) | Bytes-input variant and the sidecar form that returns (ContentDocument, TabularDocument) when you want tables out of the body. |
| render_page(path, page_number, *, dpi=300, grayscale=False) | Renders a single page (0-based) to a KaosImage (PIL + DPI + provenance). |
| extract_page_text(path, page_number) / get_page_count(path) | Lightweight per-page text + page-count helpers that skip full AST construction. |
| PdfMetadata / PdfOutlineEntry | @dataclass(frozen=True, slots=True) result types returned by get_pdf_metadata() and get_pdf_outline(). Sparse to_dict() (None fields omitted) preserves the historical wire format. page_count lives on PdfMetadata directly, so no extra get_page_count() call is needed. |
| classify_document(path) / classify_page(path, page_number) | Lightweight document/page-type heuristics (e.g. text, scanned, mixed). |
| search_document(doc, query, *, top_k=10, level="paragraph") | Re-exported from kaos-content. AST-grounded ranked search returning SearchResults with total_matches / has_more for pagination. level="sentence" requires the [nlp] extra. |
| OCRMode / OCREngine / TesseractEngine | OCR pluggability. OCRMode is the extract_pdf(ocr=...) setting; OCREngine is the engine ABC; TesseractEngine is the Apache-2.0 default (install with [ocr] + system tesseract). OCR paragraphs carry Provenance.confidence so verifiers can weight them. |
| TableMode / TableEngine / ExtractedTable / TableResult | Table pluggability. pdfplumber is the MIT default behind [tables]. Extracted tables become TabularDocument with typed columns and live in the body with Provenance.extractor = "kaos-pdf/tables/{engine}". |
| ParsePDFTool, GetPageTextTool, RenderPageTool, PDFMetadataTool, SearchDocumentTool, GetOutlineTool, ClassifyPageTool | The seven KaosTool subclasses exposed over MCP as kaos-pdf-extract-parse, -extract-page-text, -render-page, -metadata, -search-document, -get-outline, -classify-page. All seven are readOnly, idempotent, non-destructive, non-open-world. Register with register_pdf_tools(runtime). |
| Errors (KaosPdfError, PdfNotFoundError, PdfExtractionError, PdfRenderError) | Dedicated exception hierarchy. MCP tools translate these into ToolResult.create_error() with the documented three-part recovery hint (what / how to fix / alternative tool). |
CLI
kaos-pdf ships two entry-point scripts. Every structured command on
the admin CLI supports --json, which emits machine-readable output that
can be piped to other agents:
kaos-pdf --help # admin CLI
kaos-pdf-serve --help # MCP server
kaos-pdf info contract.pdf --json # metadata + page count + classification
kaos-pdf outline contract.pdf --json # PDF bookmarks (falls back to detected headings)
kaos-pdf page contract.pdf 3 --json # plain text from a single page (1-based)
kaos-pdf extract contract.pdf -f markdown -p 1-5 # full AST → markdown / text / json / html
kaos-pdf render contract.pdf 1 --dpi 300 -o p1.png # render a page as PNG
kaos-pdf classify contract.pdf --page 1 --json # document- or page-level type
kaos-pdf search contract.pdf "indemnification" -k 5 # AST-grounded ranked search
kaos-pdf-serve # stdio (Claude Code / Desktop)
kaos-pdf-serve --http --port 8000 # streamable HTTP
The admin CLI uses 1-based page numbers (consistent with how the file
opens in any PDF viewer) and translates internally to the 0-based indices
the Python API uses. kaos-pdf-serve exposes the seven MCP tools listed
in Concepts above.
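The 1-based to 0-based translation is easy to get wrong by one, so a sketch may help. `parse_page_spec` is a hypothetical helper, not the CLI's actual parser; it accepts a single page ("3") or an inclusive range ("1-5") as shown in the commands above, and the comma-separated form is an assumption added for illustration.

```python
def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Convert a 1-based page spec into sorted, de-duplicated 0-based indices."""
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start  # a bare number is a one-page range
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        # Inclusive 1-based [start, end] maps to 0-based range(start - 1, end).
        indices.update(range(start - 1, end))
    return sorted(indices)
```

With this helper, a spec like the `-p 1-5` shown above would correspond to the indices 0 through 4 passed as `pages` to the Python API.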
Compatibility & status
| Aspect | Details |
|---|---|
| Python | 3.13, 3.14 |
| OS | Linux, macOS, Windows (pure-Python wheel; the only native code is the PDFium wheel shipped by pypdfium2) |
| Maturity | Alpha (Development Status :: 3 - Alpha). The public API is documented in kaos_pdf.__all__. |
| Stability policy | Pre-1.0: minor bumps may change behaviour. Every change is documented in CHANGELOG.md. The MCP tool surface (kaos-pdf-* names) and the KAOS_PDF_* environment-variable namespace are public API and follow the same policy. |
| Test coverage | 340 unit tests plus a small integration tier hitting the MCP wire end-to-end. Bounded unit gate (pytest tests/unit -q --no-cov) finishes in ~35s. |
| Type checker | Validated with ty, Astral's Python type checker. |
Companion packages
kaos-pdf is one of the packages in the
Kelvin Agentic OS. The broader stack:
| Package | Layer | What it does |
|---|---|---|
| kaos-core | Core | Foundational runtime, MCP-native types, registries, execution engine, VFS |
| kaos-content | Core | Typed document AST: Block/Inline, provenance, views |
| kaos-mcp | Bridge | FastMCP server, kaos management CLI, MCP resource templates |
| kaos-pdf | Extraction | PDF → AST with provenance |
| kaos-web | Extraction | Web extraction, browser automation, search, domain intelligence |
| kaos-office | Extraction | DOCX / PPTX / XLSX readers + writers to AST |
| kaos-tabular | Extraction | DuckDB-powered SQL analytics |
| kaos-source | Data | Government + financial data connectors (Federal Register, eCFR, EDGAR, GovInfo, PACER, GLEIF) |
| kaos-llm-client | LLM | Multi-provider LLM transport |
| kaos-llm-core | LLM | Typed LLM programming (Signatures, Programs, Optimizers) |
| kaos-nlp-core | Primitives (Rust) | High-performance NLP primitives |
| kaos-nlp-transformers | ML | Dense embeddings + retrieval |
| kaos-graph | Primitives (Rust) | Graph algorithms + RDF/SPARQL |
| kaos-ml-core | Primitives (Rust) | Classical ML on the document AST |
| kaos-citations | Legal | Legal citation extraction, resolution, verification |
| kaos-agents | Agentic | Agent runtime, memory, recipes |
| kaos-reference | Sample | Reference module for module authors |
Packages depend on kaos-core; everything else is opt-in. Mix and match the
ones you need.
Development
git clone https://github.com/273v/kaos-pdf
cd kaos-pdf
uv sync --group dev
Install pre-commit hooks (recommended — they run the same checks as CI on every commit, scoped to staged files):
uvx pre-commit install
uvx pre-commit run --all-files # one-time full sweep
Manual QA commands (the same set CI runs):
uv run ruff format --check kaos_pdf tests
uv run ruff check kaos_pdf tests
uv run ty check kaos_pdf tests
uv run pytest tests/unit -q --no-cov
Build from source
uv build
uv pip install dist/*.whl
python -c "import kaos_pdf; print(kaos_pdf.__version__)" # smoke import
Contributing
Issues and pull requests are welcome. By contributing you certify the
Developer Certificate of Origin v1.1 —
sign every commit with git commit -s. Please open an issue before starting
on a non-trivial change so we can align on scope.
Security
For security issues, please do not file a public issue. Report privately via GitHub Private Vulnerability Reporting or email security@273ventures.com. See SECURITY.md for the full disclosure policy.
License
Apache License 2.0 — see LICENSE and NOTICE.
Copyright 2026 273 Ventures LLC. Built for kelvin.legal.