PDF extraction and document processing for KAOS — structured AST output with provenance

kaos-pdf

Part of Kelvin Agentic OS (KAOS) — open agentic infrastructure for legal work, built by 273 Ventures. See the full KAOS package map for the rest of the stack.

kaos-pdf is the PDF-extraction layer of KAOS — it turns a PDF byte stream into a typed kaos-content ContentDocument AST with provenance (page numbers, bounding boxes, extraction confidence) on every node, plus a small set of read-only MCP tools for agentic workflows. The engine is pypdfium2 (Apache-2.0) and all PDFium calls are serialised through a global lock so the library is safe to call from threaded executors. No raw text strings escape — every result is an AST node, a typed dataclass, or a KaosImage.

The base install is intentionally small: three runtime dependencies (kaos-content[images,layout,markdown], kaos-core, pypdfium2) and no compiled native code beyond the PDFium wheel. Heavier capabilities are opt-in extras: [ocr] adds pytesseract for scanned pages (and requires a system tesseract binary), [tables] adds pdfplumber (MIT, pure Python; no Java, no GPU) for borderless and multi-line tables, and [nlp] adds kaos-nlp-core for BM25 sentence-level search. VLM page programs (describe / classify / OCR-via-VLM) live in kaos-llm-core[vision] ≥ 0.1.0a3; they were moved out of kaos-pdf so that dependencies between the extraction and LLM layers run in one direction only. We do not and will not depend on AGPL or GPL libraries (this rules out Surya for OCR and camelot-lattice / Tabula for tables).

Install

uv add kaos-pdf
# or
pip install kaos-pdf

# OCR for scanned PDFs (requires system tesseract binary)
uv add 'kaos-pdf[ocr]'

# Structured table extraction via pdfplumber
uv add 'kaos-pdf[tables]'

# BM25 sentence-level search via kaos-nlp-core
uv add 'kaos-pdf[nlp]'

kaos-pdf requires Python 3.13 or newer (3.14 is supported). The package is pure Python — the only native code is the PDFium wheel shipped by pypdfium2, which has prebuilt wheels for Linux, macOS, and Windows on x86_64 and arm64.
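Because the extras are optional, calling code may want to check what is usable before choosing an OCR mode. The following is a rough sketch using only the standard library; `available_extras` is a hypothetical helper, and the import name `kaos_nlp_core` for the [nlp] extra is an assumption.

```python
import importlib.util
import shutil


def available_extras() -> dict[str, bool]:
    """Report which optional capabilities look usable in this environment."""

    def has_module(name: str) -> bool:
        # find_spec returns None when a top-level module is not installed.
        return importlib.util.find_spec(name) is not None

    return {
        # [ocr] needs both the Python wrapper and the system binary.
        "ocr": has_module("pytesseract") and shutil.which("tesseract") is not None,
        "tables": has_module("pdfplumber"),
        # Assumption: kaos-nlp-core is importable as `kaos_nlp_core`.
        "nlp": has_module("kaos_nlp_core"),
    }


extras = available_extras()
# Fall back to text-layer extraction when OCR is not available.
ocr_mode = "auto" if extras["ocr"] else "never"
print(extras, ocr_mode)
```

The resulting `ocr_mode` string could then be passed as the `ocr` argument of `extract_pdf`.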

Quick start

Extract a PDF into the document AST, render a page, and search for a term:

from kaos_pdf import (
    extract_pdf,
    get_pdf_metadata,
    get_pdf_outline,
    render_page,
    search_document,
)

# Parse the whole document into a kaos-content ContentDocument
doc = extract_pdf("contract.pdf")
print(len(doc.body), "top-level blocks")

# Typed metadata (PdfMetadata dataclass; sparse to_dict() for JSON)
meta = get_pdf_metadata("contract.pdf")
print(meta.page_count, meta.title, meta.author)

# Outline / bookmarks (list[PdfOutlineEntry], also typed)
for entry in get_pdf_outline("contract.pdf"):
    print("  " * entry.level, entry.title, "p", entry.page)

# Render the first page as a 300-DPI PIL image (returned as KaosImage)
image = render_page("contract.pdf", page_number=0, dpi=300)
image.pil.save("page-1.png")

# AST-grounded search — paragraph-level by default
hits = search_document(doc, "indemnification", top_k=5)
for hit in hits.results:
    print(f"score={hit.score:.2f} :: {hit.text[:80]}")

Every node in the returned ContentDocument carries a Provenance (source path, 1-based page, bounding box, extractor name, confidence) so downstream consumers — citation verifiers, redaction tooling, labelers — can ground answers back to the original PDF.
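To make the grounding idea concrete, here is a small self-contained sketch. The `Provenance` and `Node` classes below are stand-ins that mirror the fields listed above; they are not the kaos-content classes, and the field names, extractor strings, and sample values are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Stand-in mirroring the fields listed above; not the kaos-content class."""

    source: str
    page: int  # 1-based page number
    bbox: tuple[float, float, float, float]
    extractor: str
    confidence: float


@dataclass(frozen=True)
class Node:
    """Hypothetical AST node: some text plus where it came from."""

    text: str
    provenance: Provenance


def cite(node: Node) -> str:
    """Format a human-readable pointer back into the source PDF."""
    p = node.provenance
    return f"{p.source}, p. {p.page} [{p.extractor}, conf={p.confidence:.2f}]"


def trusted(nodes: list[Node], min_confidence: float = 0.9) -> list[Node]:
    """Keep only nodes whose extraction confidence meets the threshold."""
    return [n for n in nodes if n.provenance.confidence >= min_confidence]


# Illustrative data: one text-layer paragraph and one low-confidence OCR line.
nodes = [
    Node(
        "Indemnification. The Supplier shall ...",
        Provenance("contract.pdf", 7, (72.0, 540.0, 523.0, 610.0), "example/text", 1.0),
    ),
    Node(
        "lndemnificatlon (scanned)",
        Provenance("contract.pdf", 12, (70.0, 300.0, 520.0, 340.0), "example/ocr", 0.62),
    ),
]
for node in trusted(nodes):
    print(cite(node))
```

A citation verifier or redaction tool would use the page and bounding box in the same way, to locate the exact region of the original file that a node came from.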

Concepts

The package is a thin, typed surface over pypdfium2. The most important entries:

| Concept | What it is |
| --- | --- |
| extract_pdf(path, *, pages=None, ocr="never", tables="geometric", extract_images=False, image_src_builder=...) | Primary entry point. Returns a ContentDocument. pages selects 0-based indices; ocr is "never" / "auto" / "always"; tables is "geometric" / "engine" / "disabled"; image_src_builder lets callers control the image URI policy (default inlines as data: URLs). |
| extract_pdf_bytes(data, ...) / extract_pdf_with_tables(path, ...) | Bytes-input variant and the sidecar form that returns (ContentDocument, TabularDocument) when you want tables kept out of the body. |
| render_page(path, page_number, *, dpi=300, grayscale=False) | Renders a single page (0-based) to a KaosImage (PIL + DPI + provenance). |
| extract_page_text(path, page_number) / get_page_count(path) | Lightweight per-page text + page-count helpers that skip full AST construction. |
| PdfMetadata / PdfOutlineEntry | @dataclass(frozen=True, slots=True) result types returned by get_pdf_metadata() and get_pdf_outline(). Sparse to_dict() (None fields omitted) preserves the historical wire format. page_count lives on PdfMetadata directly, so no extra get_page_count() call is needed. |
| classify_document(path) / classify_page(path, page_number) | Lightweight document/page-type heuristics (e.g. text, scanned, mixed). |
| search_document(doc, query, *, top_k=10, level="paragraph") | Re-exported from kaos-content. AST-grounded ranked search returning SearchResults with total_matches / has_more for pagination. level="sentence" requires the [nlp] extra. |
| OCRMode / OCREngine / TesseractEngine | OCR pluggability. OCRMode is the extract_pdf(ocr=...) setting; OCREngine is the engine ABC; TesseractEngine is the Apache-2.0 default (install with [ocr] + system tesseract). OCR paragraphs carry Provenance.confidence so verifiers can weight them. |
| TableMode / TableEngine / ExtractedTable / TableResult | Table pluggability. pdfplumber is the MIT default behind [tables]. Extracted tables become TabularDocument with typed columns and live in the body with Provenance.extractor = "kaos-pdf/tables/{engine}". |
| ParsePDFTool, GetPageTextTool, RenderPageTool, PDFMetadataTool, SearchDocumentTool, GetOutlineTool, ClassifyPageTool | The seven KaosTool subclasses exposed over MCP as kaos-pdf-extract-parse, -extract-page-text, -render-page, -metadata, -search-document, -get-outline, -classify-page. All seven are readOnly, idempotent, non-destructive, non-open-world. Register with register_pdf_tools(runtime). |
| Errors (KaosPdfError, PdfNotFoundError, PdfExtractionError, PdfRenderError) | Dedicated exception hierarchy. MCP tools translate these into ToolResult.create_error() with the documented three-part recovery hint (what / how to fix / alternative tool). |

CLI

kaos-pdf ships two entry-point scripts: the kaos-pdf admin CLI and the kaos-pdf-serve MCP server. Every structured command on the admin CLI supports --json, which emits machine-readable output that can be piped to other agents:

kaos-pdf --help                                     # admin CLI
kaos-pdf-serve --help                               # MCP server

kaos-pdf info contract.pdf --json                   # metadata + page count + classification
kaos-pdf outline contract.pdf --json                # PDF bookmarks (falls back to detected headings)
kaos-pdf page contract.pdf 3 --json                 # plain text from a single page (1-based)
kaos-pdf extract contract.pdf -f markdown -p 1-5    # full AST → markdown / text / json / html
kaos-pdf render contract.pdf 1 --dpi 300 -o p1.png  # render a page as PNG
kaos-pdf classify contract.pdf --page 1 --json      # document- or page-level type
kaos-pdf search contract.pdf "indemnification" -k 5 # AST-grounded ranked search

kaos-pdf-serve                                      # stdio (Claude Code / Desktop)
kaos-pdf-serve --http --port 8000                   # streamable HTTP

The admin CLI uses 1-based page numbers (consistent with how the file opens in any PDF viewer) and translates internally to the 0-based indices the Python API uses. kaos-pdf-serve exposes the seven MCP tools listed in Concepts above.
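The 1-based to 0-based translation can be sketched as follows. `parse_page_spec` is a hypothetical helper, not the CLI's real parser; the source only shows the `1-5` form, so support for comma-separated lists here is an assumption.

```python
def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,7" into sorted 0-based indices."""
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range {part!r} is outside 1-{page_count}")
        # Shift by one: viewer page 1 is index 0 in the Python API.
        indices.update(range(start - 1, end))
    return sorted(indices)


print(parse_page_spec("1-5", page_count=10))  # [0, 1, 2, 3, 4]
print(parse_page_spec("3", page_count=10))    # [2]
```

The returned list has the shape that the `pages` argument of `extract_pdf` is described as accepting (0-based indices).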

Compatibility & status

| Aspect | Details |
| --- | --- |
| Python | 3.13, 3.14 |
| OS | Linux, macOS, Windows (pure-Python wheel; the only native code is the PDFium wheel shipped by pypdfium2) |
| Maturity | Alpha (Development Status :: 3 - Alpha). The public API is documented in kaos_pdf.__all__. |
| Stability policy | Pre-1.0: minor bumps may change behaviour. Every change is documented in CHANGELOG.md. The MCP tool surface (kaos-pdf-* names) and the KAOS_PDF_* environment-variable namespace are public API and follow the same policy. |
| Test coverage | 340 unit tests plus a small integration tier hitting the MCP wire end-to-end. Bounded unit gate (pytest tests/unit -q --no-cov) finishes in ~35s. |
| Type checker | Validated with ty, Astral's Python type checker. |

Companion packages

kaos-pdf is one of the packages in the Kelvin Agentic OS. The broader stack:

| Package | Layer | What it does |
| --- | --- | --- |
| kaos-core | Core | Foundational runtime, MCP-native types, registries, execution engine, VFS |
| kaos-content | Core | Typed document AST: Block/Inline, provenance, views |
| kaos-mcp | Bridge | FastMCP server, kaos management CLI, MCP resource templates |
| kaos-pdf | Extraction | PDF → AST with provenance |
| kaos-web | Extraction | Web extraction, browser automation, search, domain intelligence |
| kaos-office | Extraction | DOCX / PPTX / XLSX readers + writers to AST |
| kaos-tabular | Extraction | DuckDB-powered SQL analytics |
| kaos-source | Data | Government + financial data connectors (Federal Register, eCFR, EDGAR, GovInfo, PACER, GLEIF) |
| kaos-llm-client | LLM | Multi-provider LLM transport |
| kaos-llm-core | LLM | Typed LLM programming (Signatures, Programs, Optimizers) |
| kaos-nlp-core | Primitives (Rust) | High-performance NLP primitives |
| kaos-nlp-transformers | ML | Dense embeddings + retrieval |
| kaos-graph | Primitives (Rust) | Graph algorithms + RDF/SPARQL |
| kaos-ml-core | Primitives (Rust) | Classical ML on the document AST |
| kaos-citations | Legal | Legal citation extraction, resolution, verification |
| kaos-agents | Agentic | Agent runtime, memory, recipes |
| kaos-reference | Sample | Reference module for module authors |

Packages depend on kaos-core; everything else is opt-in. Mix and match the ones you need.

Development

git clone https://github.com/273v/kaos-pdf
cd kaos-pdf
uv sync --group dev

Install pre-commit hooks (recommended — they run the same checks as CI on every commit, scoped to staged files):

uvx pre-commit install
uvx pre-commit run --all-files     # one-time full sweep

Manual QA commands (the same set CI runs):

uv run ruff format --check kaos_pdf tests
uv run ruff check kaos_pdf tests
uv run ty check kaos_pdf tests
uv run pytest tests/unit -q --no-cov

Build from source

uv build
uv pip install dist/*.whl
python -c "import kaos_pdf; print(kaos_pdf.__version__)"  # smoke import

Contributing

Issues and pull requests are welcome. By contributing you certify the Developer Certificate of Origin v1.1; sign off every commit with git commit -s. Please open an issue before starting on a non-trivial change so we can align on scope.

Security

For security issues, please do not file a public issue. Report privately via GitHub Private Vulnerability Reporting or email security@273ventures.com. See SECURITY.md for the full disclosure policy.

License

Apache License 2.0 — see LICENSE and NOTICE.

Copyright 2026 273 Ventures LLC. Built for kelvin.legal.
