
Office document extraction for KAOS — DOCX, XLSX, PPTX to structured AST with provenance


kaos-office

Part of Kelvin Agentic OS (KAOS) — open agentic infrastructure for legal work, built by 273 Ventures. See the full KAOS package map for the rest of the stack.


kaos-office is the Office-document layer of KAOS — it turns Microsoft Office files (.docx, .pptx, .xlsx) into typed kaos-content AST models with provenance, and turns those models back into round-trip-fidelity Office files. DOCX and PPTX produce ContentDocument (Block / Inline flow content with headings, paragraphs, lists, tables, footnotes, annotations, tracked changes, and per-section page setup); XLSX produces TabularDocument (typed columns over a 13-type ColumnType system, one Table per sheet, formulas and merged ranges preserved as metadata). The package also ships 17 read / write MCP tools and a 12-subcommand admin CLI for agentic workflows.

The base install is intentionally small — three runtime dependencies (kaos-content[markdown], kaos-core, lxml) and no compiled native code beyond the lxml wheel. Everything OOXML-shaped is parsed and written with lxml directly so the read / write paths stay symmetric; the optional extras only kick in when you want a different engine. [pptx] adds python-pptx (MIT) for the PPTX writer; [xlsx] aggregates python-calamine (MIT, Rust — 7-28× faster XLSX read) and openpyxl (MIT, formula extraction) — pick [xlsx-calamine] or [xlsx-formulas] individually if you want only one. We do not and will not depend on AGPL or GPL libraries.
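To make the "parsed with lxml directly" point concrete: a .docx file is a ZIP archive of XML parts, and the body text lives in word/document.xml. The sketch below is not kaos-office code. It assumes the standard library's zipfile and xml.etree.ElementTree as a stand-in for lxml, builds a deliberately incomplete archive in memory, and reads its paragraphs back. The helper names build_minimal_docx and read_paragraph_texts are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (ECMA-376).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def build_minimal_docx(paragraphs: list[str]) -> bytes:
    """Build a tiny in-memory archive holding only word/document.xml.

    A real .docx also needs [Content_Types].xml and relationship parts;
    they are omitted because the reader below never touches them.
    """
    ET.register_namespace("w", W_NS)
    document = ET.Element(f"{{{W_NS}}}document")
    body = ET.SubElement(document, f"{{{W_NS}}}body")
    for text in paragraphs:
        # One paragraph (w:p) holding one run (w:r) holding one text node (w:t).
        p = ET.SubElement(body, f"{{{W_NS}}}p")
        r = ET.SubElement(p, f"{{{W_NS}}}r")
        t = ET.SubElement(r, f"{{{W_NS}}}t")
        t.text = text
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", ET.tostring(document))
    return buffer.getvalue()


def read_paragraph_texts(docx_bytes: bytes) -> list[str]:
    """Return the concatenated run text of each w:p element."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for p in root.iterfind(".//w:p", NS):
        texts.append("".join(t.text or "" for t in p.iterfind(".//w:t", NS)))
    return texts


print(read_paragraph_texts(build_minimal_docx(["Hello", "World"])))
```

Because reading and writing both operate on the same XML parts, a reader and a writer built this way can stay symmetric, which is the property the paragraph above describes.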

Install

uv add kaos-office
# or
pip install kaos-office

# PPTX writer (python-pptx)
uv add 'kaos-office[pptx]'

# Calamine XLSX fast-path + openpyxl formula extraction
uv add 'kaos-office[xlsx]'

# BM25 sentence-level search via kaos-nlp-core
uv add 'kaos-office[nlp]'

kaos-office requires Python 3.13 or newer (3.14 is supported). The package is pure Python — the only native code is the lxml wheel, which has prebuilt wheels for Linux, macOS, and Windows on x86_64 and arm64.

Quick start

Read a DOCX into the AST, render it as markdown, and search it; then read an XLSX as a typed tabular document:

from kaos_office import (
    extract_to_markdown,
    parse_docx,
    parse_pptx,
    search_document,
)
from kaos_office.xlsx import list_sheets, parse_xlsx

# DOCX → ContentDocument with Block/Inline + provenance on every node
doc = parse_docx("contract.docx")
print(len(doc.body), "top-level blocks")
print(doc.metadata.title, doc.metadata.source.uri)

# Same shape for PPTX (each slide becomes a Div(classes="slide"))
deck = parse_pptx("brief.pptx")
print(deck.metadata.extra.get("slide_count"), "slides")

# AST-grounded search — paragraph-level by default
hits = search_document(doc, "indemnification", top_k=5)
for hit in hits.results:
    print(f"score={hit.score:.2f} :: {hit.text[:80]}")

# XLSX → TabularDocument (one Table per sheet, typed columns)
tab = parse_xlsx("report.xlsx")
for table in tab.tables:
    print(f"{table.name}: {table.row_count} rows × {len(table.columns)} cols")
print(list_sheets("report.xlsx"))  # cheap workbook metadata, no parse

# Format-agnostic shortcut: any of the three → markdown
print(extract_to_markdown("contract.docx")[:200])

Every node in the returned ContentDocument carries a Provenance (source URI, page or slide number, char span, extractor name) so downstream consumers — citation verifiers, redaction tooling, labelers — can ground answers back to the original file.

Concepts

The package is a thin, typed surface over the OOXML wire format. The most important entries:

| Concept | What it is |
| --- | --- |
| `parse_docx(path, *, track_changes=False, image_src_builder=...)` | DOCX reader. Returns a ContentDocument with paragraphs, headings, lists, tables, footnotes, comments (as annotations), hyperlinks, embedded images, and per-section page setup. track_changes=True preserves w:ins / w:del / w:moveFrom / w:moveTo as Span / Div with rev-* classes plus TRACKED_CHANGE annotations. |
| `parse_pptx(path)` | PPTX reader. Each slide → Div(classes="slide", slide_number=N). Uses python-pptx for shape traversal and falls back to OPC/lxml for SmartArt text (the only Python tool that does). Charts linearize to Tables with category + series columns. Speaker notes land as Div(classes="speaker-notes"). |
| `parse_xlsx(path, *, sheets=None, max_rows=None, header_row=0, include_formulas=False, engine="native")` | XLSX reader. Returns a TabularDocument. Default engine="native" is pure lxml; engine="calamine" switches to the Rust fast-path ([xlsx-calamine]). include_formulas=True extracts cell formulas via openpyxl ([xlsx-formulas]). |
| `write_docx(doc, path)` / `write_docx_bytes(doc)` | DOCX writer (lxml). Round-trips the DOCX feature surface: paragraphs / headings, bullet + ordered lists with proper numbering.xml, tables with grid spans, hyperlinks (with proper rels), footnotes, endnotes, comments, headers, footers, page setup, multi-section documents, embedded images (data: / file:// URIs), and SDT / content-control wrappers. |
| `write_pptx(doc, path, *, template=None, overflow="warn"/"autofit"/"extend")` | PPTX writer (python-pptx, lazy-imported with [pptx] install hint at call time). Auto-segments at Heading(depth=1). overflow controls how text that may not fit a shape is handled: "warn" (default) emits a logger warning, "autofit" shrinks the font, "extend" grows the shape. |
| `write_xlsx(doc, path, *, bold_headers=True, auto_width=True, freeze_header=True)` / `write_xlsx_bytes(doc)` | XLSX writer (lxml, no extras needed). Native SpreadsheetML output with proper date formats, money formats, percentage / float / integer formats per ColumnType, auto-sized columns, bold header row, and frozen panes. |
| `search_document(doc, query, *, top_k=10, level="paragraph")` | Re-exported from kaos-content. AST-grounded ranked search returning SearchResults with total_matches / has_more for pagination. level="sentence" requires the [nlp] extra. |
| `extract_to_markdown(path, **kwargs)` | Format-agnostic convenience wrapper. Dispatches by extension to parse_docx + serialize_markdown, parse_pptx + serialize_markdown, or parse_xlsx + serialize_tabular_markdown. |
| 17 MCP tools | ParseDocxTool, GetDocxTextTool, GetDocxMarkdownTool, DocxMetadataTool, SearchDocxTool (5 DOCX) · ParsePptxTool, ListSlidesTool, GetSlideTool, GetSlideNotesTool, SearchPptxTool (5 PPTX) · ParseXlsxTool, ListSheetsXlsxTool, GetSheetXlsxTool, XlsxMetadataTool (4 XLSX) · WriteDocxTool, WritePptxTool, WriteXlsxTool (3 writers). All readers are readOnly + idempotent + non-destructive + non-open-world; writers refuse silent overwrites unless force=true. Register with register_office_tools(runtime). |
| Errors (`KaosOfficeError`, `DocxExtractionError`, `PptxExtractionError`, `XlsxExtractionError`) | Dedicated exception hierarchy. MCP tools translate these into ToolResult.create_error() with the documented three-part recovery hint (what / how to fix / alternative tool). |
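The error row can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the four exception classes are redefined locally so the example runs without kaos-office installed, the subclass relationships are inferred from the word "hierarchy", and describe_failure and run_with_hint are hypothetical helpers that produce a plain string rather than calling ToolResult.create_error().

```python
from collections.abc import Callable


# Local stand-ins; the real classes ship with kaos-office.
# Assumption: the three extraction errors subclass KaosOfficeError.
class KaosOfficeError(Exception):
    pass


class DocxExtractionError(KaosOfficeError):
    pass


class PptxExtractionError(KaosOfficeError):
    pass


class XlsxExtractionError(KaosOfficeError):
    pass


def describe_failure(error: KaosOfficeError, how_to_fix: str, alternative_tool: str) -> str:
    """Format the three parts: what happened / how to fix / alternative tool."""
    return " | ".join(
        [
            f"what: {type(error).__name__}: {error}",
            f"how to fix: {how_to_fix}",
            f"alternative: {alternative_tool}",
        ]
    )


def run_with_hint(action: Callable[[], str], *, how_to_fix: str, alternative_tool: str) -> str:
    """Run an action; turn any package error into a recovery hint string."""
    try:
        return action()
    except KaosOfficeError as exc:  # one handler covers all three formats
        return describe_failure(exc, how_to_fix, alternative_tool)


def broken_reader() -> str:
    # Simulates a reader failing on a corrupt file.
    raise DocxExtractionError("not a zip archive")


print(
    run_with_hint(
        broken_reader,
        how_to_fix="re-export the file as .docx",
        alternative_tool="GetDocxTextTool",
    )
)
```

If the hierarchy assumption holds, catching the base class keeps caller code format-agnostic, while the concrete class name still tells the agent which reader failed.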

CLI

kaos-office ships two entry-point scripts: the kaos-office admin CLI and the kaos-office-serve MCP server. Every structured command on the admin CLI supports --json, which emits machine-readable output that can be piped to other agents:

kaos-office --help                                  # admin CLI
kaos-office-serve --help                            # MCP server

# DOCX
kaos-office extract contract.docx -f markdown       # AST → markdown / text / json / html
kaos-office search contract.docx "indemnification"  # AST-grounded ranked search
kaos-office metadata contract.docx --json           # title, author, page setup, sections

# PPTX
kaos-office pptx-extract brief.pptx -f markdown
kaos-office pptx-slides brief.pptx --json           # slide inventory (number, title, layout)
kaos-office pptx-slide brief.pptx 3                 # text from a single slide (1-based)

# XLSX
kaos-office xlsx-extract report.xlsx -f markdown    # tabular markdown
kaos-office xlsx-sheets report.xlsx --json          # sheet names + dimensions
kaos-office xlsx-sheet report.xlsx Revenue          # one sheet as TSV

# Writers (JSON file or '-' for stdin)
kaos-office write-docx body.json out.docx --force
kaos-office write-pptx body.json out.pptx --template brand.pptx
kaos-office write-xlsx tabular.json out.xlsx

kaos-office-serve                                   # stdio (Claude Code / Desktop)
kaos-office-serve --http --port 8000                # streamable HTTP

The admin CLI uses 1-based slide / page numbers (consistent with how the file opens in any viewer) and translates internally to the 0-based indices the Python API uses. kaos-office-serve exposes the 17 MCP tools listed in Concepts above.
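The 1-based to 0-based translation described above can be sketched in a few lines. slide_index_from_cli is a hypothetical helper written for this example, not a function exported by kaos-office, and the range check is an assumption about reasonable behaviour.

```python
def slide_index_from_cli(slide_number: int, slide_count: int) -> int:
    """Convert a 1-based slide number (as typed on the CLI) to a 0-based index."""
    if not 1 <= slide_number <= slide_count:
        raise ValueError(f"slide {slide_number} is out of range 1..{slide_count}")
    return slide_number - 1


# `kaos-office pptx-slide brief.pptx 3` would address index 2 in a 10-slide deck.
print(slide_index_from_cli(3, 10))  # 2
```

Validating at the boundary means an off-by-one mistake surfaces as a clear error instead of silently returning the neighbouring slide.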

Compatibility & status

| Aspect | Status |
| --- | --- |
| Python | 3.13, 3.14 |
| OS | Linux, macOS, Windows (pure-Python wheel; the only native code is the lxml wheel) |
| Maturity | Alpha (Development Status :: 3 - Alpha). The public API is documented in `kaos_office.__all__`. |
| Stability policy | Pre-1.0: minor bumps may change behaviour. Every change is documented in CHANGELOG.md. The MCP tool surface (kaos-office-* names) and the KAOS_OFFICE_* environment-variable namespace are public API and follow the same policy. |
| Test coverage | 492 unit tests plus a 144-test integration tier covering DOCX / PPTX / XLSX round-trip fidelity against real-world fixtures. Bounded unit gate (pytest tests/unit -q --no-cov) finishes in ~30s. |
| Type checker | Validated with ty, Astral's Python type checker. |

Companion packages

kaos-office is one of the packages in the Kelvin Agentic OS. The broader stack:

| Package | Layer | What it does |
| --- | --- | --- |
| kaos-core | Core | Foundational runtime, MCP-native types, registries, execution engine, VFS |
| kaos-content | Core | Typed document AST: Block/Inline, provenance, views |
| kaos-mcp | Bridge | FastMCP server, kaos management CLI, MCP resource templates |
| kaos-pdf | Extraction | PDF → AST with provenance |
| kaos-web | Extraction | Web extraction, browser automation, search, domain intelligence |
| kaos-office | Extraction | DOCX / PPTX / XLSX readers + writers to AST |
| kaos-tabular | Extraction | DuckDB-powered SQL analytics |
| kaos-source | Data | Government + financial data connectors (Federal Register, eCFR, EDGAR, GovInfo, PACER, GLEIF) |
| kaos-llm-client | LLM | Multi-provider LLM transport |
| kaos-llm-core | LLM | Typed LLM programming (Signatures, Programs, Optimizers) |
| kaos-nlp-core | Primitives (Rust) | High-performance NLP primitives |
| kaos-nlp-transformers | ML | Dense embeddings + retrieval |
| kaos-graph | Primitives (Rust) | Graph algorithms + RDF/SPARQL |
| kaos-ml-core | Primitives (Rust) | Classical ML on the document AST |
| kaos-citations | Legal | Legal citation extraction, resolution, verification |
| kaos-agents | Agentic | Agent runtime, memory, recipes |
| kaos-reference | Sample | Reference module for module authors |

The other packages depend on kaos-core; everything else is opt-in. Mix and match the ones you need.

Development

git clone https://github.com/273v/kaos-office
cd kaos-office
uv sync --group dev

Install pre-commit hooks (recommended — they run the same checks as CI on every commit, scoped to staged files):

uvx pre-commit install
uvx pre-commit run --all-files     # one-time full sweep

Manual QA commands (the same set CI runs):

uv run ruff format --check kaos_office tests
uv run ruff check kaos_office tests
uv run ty check kaos_office tests
uv run pytest tests/unit -q --no-cov

Build from source

uv build
uv pip install dist/*.whl
python -c "import kaos_office; print(kaos_office.__version__)"  # smoke import

Contributing

Issues and pull requests are welcome. By contributing you certify the Developer Certificate of Origin v1.1 — sign every commit with git commit -s. Please open an issue before starting on a non-trivial change so we can align on scope.

Security

For security issues, please do not file a public issue. Report privately via GitHub Private Vulnerability Reporting or email security@273ventures.com. See SECURITY.md for the full disclosure policy.

License

Apache License 2.0 — see LICENSE and NOTICE.

Copyright 2026 273 Ventures LLC. Built for kelvin.legal.
