Document parsing that never loses provenance: every block of output knows its source page, section, and location.
AgentContext
Document parsing that never loses the plot — or the page number.
AgentContext converts documents into clean Markdown and structured JSON, and unlike other converters, every block of output carries provenance: source, page, hierarchical section path, and character span. When your agent cites something, you can prove where it came from.
PDF / DOCX / HTML / Markdown → Markdown + JSON, fully traceable
Why another parser?
Tools like MarkItDown and Docling produce good Markdown — and then throw away most of what an AI agent needs to be trustworthy:
| | MarkItDown | Docling | AgentContext |
|---|---|---|---|
| Clean Markdown | ✅ | ✅ | ✅ |
| Structured JSON model | ❌ | ✅ | ✅ |
| Page-level provenance on every block | ❌ | partial | ✅ |
| Hierarchical section path per block | ❌ | ❌ | ✅ |
| Inline citation anchors in Markdown | ❌ | ❌ | ✅ |
| Bounding boxes | ❌ | partial | 🔜 v0.2 |
| Built for downstream RAG citations | ❌ | ❌ | ✅ |
If your LLM answer says "revenue grew 12%", AgentContext lets you point at page 7, section "3. Financials > 3.2 Revenue" — automatically. (Bounding boxes land in v0.2 with layout analysis.)
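As a rough illustration of that claim, the sketch below turns one provenance record into a human-readable citation. It works on a plain dict whose keys (source, page, section_path) mirror the JSON output shown later in this README; format_citation is a hypothetical helper written for this example, not part of the agentcontext API.

```python
def format_citation(provenance: dict) -> str:
    """Render a provenance record as a short citation string.

    Hypothetical helper: it only reads the "source", "page" and
    "section_path" keys, and skips any that are null or missing.
    """
    parts = [provenance["source"]]
    page = provenance.get("page")
    if page is not None:
        parts.append(f"page {page}")
    section_path = provenance.get("section_path")
    if section_path:
        parts.append(f'section "{section_path}"')
    return ", ".join(parts)


example = {
    "source": "report.pdf",
    "page": 7,
    "section_path": "3. Financials > 3.2 Revenue",
}
citation = format_citation(example)
print(citation)  # report.pdf, page 7, section "3. Financials > 3.2 Revenue"
```

Formats without pages (Markdown, plain text) would simply produce a citation without the page part, since null fields are skipped.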
Install
# core: txt / md / html parsing, zero dependencies
pip install agentcontext-core
# with PDF + DOCX support
pip install "agentcontext-core[pdf,docx]"
(The PyPI name is agentcontext-core because plain agentcontext is blocked by
an unrelated existing project. The import name is unchanged: import agentcontext.)
No GPU. No torch. No API keys. Pure parsing.
Quickstart
from agentcontext import Document
doc = Document.parse("report.pdf")
print(doc.to_markdown())  # clean, structured markdown
print(doc.to_json())  # full document model, lossless
for block in doc.blocks:
    print(block.text[:60], "→ page", block.provenance.page)
for table in doc.tables:
    print(table.to_rows())  # structured cells, with provenance
Or from the command line:
agentcontext parse report.pdf # writes report.md next to the source
agentcontext parse report.pdf --json # writes report.json (full document model)
agentcontext parse report.pdf --cite inline # markdown with provenance anchors
What the output looks like
--cite inline gives you Markdown that renders normally but carries its receipts:
# Refund Policy <!-- src: policy.md | Refund Policy -->
Customers may request a full refund within 30 days
of purchase. <!-- src: policy.md | Refund Policy -->
## Exceptions <!-- src: policy.md | Refund Policy > Exceptions -->
Digital goods are excluded. <!-- src: policy.md | Refund Policy > Exceptions -->
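Because the anchors are ordinary HTML comments, downstream code can recover them with a regular expression. The sketch below is one possible way to do that, assuming the two-field `src: <source> | <section path>` form shown in the sample above; anchors for paginated formats may carry extra fields that this pattern does not handle. ANCHOR and extract_anchors are names invented for this example.

```python
import re

# Assumed anchor shape, inferred from the sample output only:
#   <!-- src: <source> | <section path> -->
ANCHOR = re.compile(
    r"<!--\s*src:\s*(?P<source>[^|]+?)\s*\|\s*(?P<section>.+?)\s*-->"
)


def extract_anchors(markdown: str) -> list:
    """Pair each anchor with the text between it and the previous anchor."""
    results = []
    cursor = 0
    for match in ANCHOR.finditer(markdown):
        # Collapse line breaks so a wrapped paragraph becomes one string.
        text = " ".join(markdown[cursor:match.start()].split())
        results.append(
            {
                "text": text,
                "source": match["source"],
                "section_path": match["section"],
            }
        )
        cursor = match.end()
    return results


SAMPLE = """\
# Refund Policy <!-- src: policy.md | Refund Policy -->
Customers may request a full refund within 30 days
of purchase. <!-- src: policy.md | Refund Policy -->
## Exceptions <!-- src: policy.md | Refund Policy > Exceptions -->
Digital goods are excluded. <!-- src: policy.md | Refund Policy > Exceptions -->
"""

anchors = extract_anchors(SAMPLE)
for anchor in anchors:
    print(anchor["section_path"], "|", anchor["text"])
```

The section path is kept as a single string rather than split on " > ", since a heading could itself contain that sequence.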
--json gives you the full Unified Document Model. Unknown provenance fields are explicit null, never omitted — a block without provenance is a bug:
{
  "udm_version": "0.1",
  "metadata": {
    "title": null, "author": null, "created": null,
    "source_path": "/abs/path/report.pdf",
    "sha256": "444cd23e4ba2b0a1…",
    "parser": "pdf", "parser_version": "pdf-parser/0.1"
  },
  "blocks": [
    {
      "type": "paragraph",
      "text": "Revenue grew 12% year over year...",
      "level": null,
      "provenance": {
        "source": "report.pdf",
        "page": 7,
        "section_path": "3. Financials > 3.2 Revenue",
        "bbox": null,
        "char_span": null,
        "confidence": 0.9,
        "parser": "pdf",
        "version": "pdf-parser/0.1"
      }
    }
  ],
  "tables": [ ... ]
}
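The "explicit null, never omitted" rule is easy to check from the consumer side. The sketch below loads a JSON document and reports any block whose provenance object is missing a key. The key set is copied from the sample above and is assumed, not confirmed, to be the complete list for udm_version 0.1; find_missing_provenance is a hypothetical helper, not a library function.

```python
import json

# Keys taken from the sample output; assumed to be the full set for UDM 0.1.
PROVENANCE_KEYS = frozenset(
    {"source", "page", "section_path", "bbox",
     "char_span", "confidence", "parser", "version"}
)


def find_missing_provenance(document: dict) -> list:
    """Return (block_index, missing_keys) for every incomplete block.

    A key whose value is null counts as present; only absent keys
    (or an absent provenance object) are reported.
    """
    problems = []
    for index, block in enumerate(document.get("blocks", [])):
        provenance = block.get("provenance")
        if not isinstance(provenance, dict):
            problems.append((index, sorted(PROVENANCE_KEYS)))
            continue
        missing = sorted(PROVENANCE_KEYS - set(provenance))
        if missing:
            problems.append((index, missing))
    return problems


sample = json.loads("""
{
  "blocks": [
    {
      "type": "paragraph",
      "text": "Revenue grew 12% year over year...",
      "level": null,
      "provenance": {
        "source": "report.pdf", "page": 7,
        "section_path": "3. Financials > 3.2 Revenue",
        "bbox": null, "char_span": null, "confidence": 0.9,
        "parser": "pdf", "version": "pdf-parser/0.1"
      }
    }
  ]
}
""")

problems = find_missing_provenance(sample)
print(problems)  # an empty list: every key is present, even the null ones
```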
Supported formats (v0.1)
- PDF (digital / text-layer)
- DOCX
- HTML
- Markdown (normalization + provenance) and plain text
OCR for scanned documents, PPTX, and XLSX are next on the roadmap.
Benchmarks
A public benchmark against MarkItDown and Docling on a golden corpus (papers, reports, contracts, invoices) — measuring text accuracy, structure accuracy, table cell accuracy, and provenance accuracy — is under construction: see BENCHMARKS.md.
We will publish the numbers even where we lose. Trust is the product.
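The benchmark definitions live in BENCHMARKS.md and are not reproduced here. Purely to make the idea of "provenance accuracy" concrete, the sketch below shows one simple way such a score could be computed: the fraction of blocks whose reported page matches a hand-labelled page. Both the definition and the page_accuracy name are assumptions for illustration, and the inputs are toy values, not benchmark results.

```python
def page_accuracy(gold_pages: list, predicted_pages: list) -> float:
    """Fraction of blocks whose predicted page equals the gold page.

    Assumes the two lists are already aligned block by block; a real
    benchmark would also need a block-matching step before this.
    """
    if len(gold_pages) != len(predicted_pages):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_pages:
        return 0.0
    matches = sum(1 for gold, pred in zip(gold_pages, predicted_pages) if gold == pred)
    return matches / len(gold_pages)


# Toy input: four blocks, one of which is attributed to the wrong page.
score = page_accuracy([1, 1, 2, 3], [1, 2, 2, 3])
print(score)  # 0.75
```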
Roadmap
- v0.1 (now): PDF/DOCX/HTML/MD → Markdown + JSON with full provenance. CLI + Python SDK.
- v0.2: OCR for scanned documents, PPTX/XLSX parsers, provenance-preserving chunking.
- v0.3: Embedding adapters, citation-aware retrieval helpers.
- Later: Context packages for agents — retrieval that returns not just chunks, but summaries, tables, entities, and citations in one structured payload.
The long-term vision is a full open context-engineering layer for AI agents (a working preview of the whole pipeline lives on the platform branch). The short-term promise is simpler: the most trustworthy parser you can put in a RAG pipeline.
Design principles
- Provenance is not optional. A block without a source location is a bug.
- Small core, pluggable edges. Parsers, OCR engines, and exporters implement a small protocol.
- No heavyweight dependencies in core. pip install and go.
- Honest benchmarks. Measured in CI, published publicly.
Contributing
The Parser protocol makes new formats easy to add:
from agentcontext import Document, Parser, register_parser
class EpubParser(Parser):
    name = "epub"
    version = "epub-parser/0.1"
    extensions = ("epub",)
    def parse(self, path: str) -> Document:
        ...
register_parser(EpubParser())
See CONTRIBUTING.md.
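For readers wondering how an extension-keyed plugin scheme like this tends to work, here is a self-contained toy version. It does not import agentcontext and is not the library's implementation: SketchParser, register, and parser_for are stand-ins invented for this example, and the returned dict is a placeholder for a real Document.

```python
from pathlib import Path

# Toy registry: maps a lowercase file extension to a parser instance.
_REGISTRY = {}


class SketchParser:
    """Stand-in base class with the three attributes used in the snippet above."""

    name = ""
    version = ""
    extensions = ()

    def parse(self, path: str) -> dict:
        raise NotImplementedError


def register(parser: SketchParser) -> None:
    """Register the parser under each extension it declares."""
    for extension in parser.extensions:
        _REGISTRY[extension.lower().lstrip(".")] = parser


def parser_for(path: str) -> SketchParser:
    """Look up a parser by the file's extension, case-insensitively."""
    extension = Path(path).suffix.lower().lstrip(".")
    try:
        return _REGISTRY[extension]
    except KeyError:
        raise ValueError(f"no parser registered for '.{extension}' files") from None


class EpubSketchParser(SketchParser):
    name = "epub"
    version = "epub-parser/0.1"
    extensions = ("epub",)

    def parse(self, path: str) -> dict:
        # Placeholder result: a real parser would return a full document model.
        return {"source": Path(path).name, "parser": self.name, "blocks": []}


register(EpubSketchParser())
result = parser_for("book.EPUB").parse("book.EPUB")
print(result["parser"])  # epub
```

Keying on a normalized extension keeps dispatch a single dictionary lookup, which is one reason a small protocol of name, version, and extensions is enough to plug in a new format.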
Author
Built by Harish — @harish-ai-engineer
License
Apache-2.0