Document parsing that never loses provenance: every block of output knows its source page, section, and location.

AgentContext

Document parsing that never loses the plot — or the page number.

AgentContext converts documents into clean Markdown and structured JSON, and unlike other converters, every block of output carries provenance: source, page, hierarchical section path, and character span. When your agent cites something, you can prove where it came from.

PDF / DOCX / HTML / Markdown  →  Markdown + JSON, fully traceable

Why another parser?

Tools like MarkItDown and Docling produce good Markdown — and then throw away most of what an AI agent needs to be trustworthy:

Feature	MarkItDown	Docling	AgentContext
Clean Markdown	✅	✅	✅
Structured JSON model	❌	✅	✅
Page-level provenance on every block	❌	partial	✅
Hierarchical section path per block	❌	❌	✅
Inline citation anchors in Markdown	❌	❌	✅
Bounding boxes	❌	partial	🔜 v0.2
Built for downstream RAG citations	❌	❌	✅

If your LLM answer says "revenue grew 12%", AgentContext lets you point at page 7, section "3. Financials > 3.2 Revenue" — automatically. (Bounding boxes land in v0.2 with layout analysis.)
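
As a rough illustration of what that enables downstream, the sketch below turns one provenance record into a human-readable citation. It works on a plain dict shaped like the `provenance` object in the JSON sample under "What the output looks like". The `format_citation` helper is hypothetical and is not part of AgentContext.

```python
from typing import Any, Mapping


def format_citation(prov: Mapping[str, Any]) -> str:
    """Render a provenance record as a short citation string.

    Hypothetical helper: field names follow the JSON sample in this README,
    and any field that is None (unknown) is left out of the citation.
    """
    parts = [str(prov["source"])]
    if prov.get("page") is not None:
        parts.append(f"page {prov['page']}")
    if prov.get("section_path"):
        parts.append(f'section "{prov["section_path"]}"')
    return ", ".join(parts)


provenance = {
    "source": "report.pdf",
    "page": 7,
    "section_path": "3. Financials > 3.2 Revenue",
    "bbox": None,
    "char_span": None,
}
citation = format_citation(provenance)
print(citation)
```

Skipping `None` fields keeps the citation honest: it only claims a page or section when the parser actually recorded one.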

Install

# core: txt / md / html parsing, zero dependencies
pip install agentcontext-core

# with PDF + DOCX support
pip install "agentcontext-core[pdf,docx]"

(The PyPI name is agentcontext-core — plain agentcontext is name-blocked by an unrelated existing project. The import is still import agentcontext.)

No GPU. No torch. No API keys. Pure parsing.

Quickstart

from agentcontext import Document

doc = Document.parse("report.pdf")

print(doc.to_markdown())              # clean, structured markdown
print(doc.to_json())                  # full document model, lossless

for block in doc.blocks:
    print(block.text[:60], "→ page", block.provenance.page)

for table in doc.tables:
    print(table.to_rows())            # structured cells, with provenance

Or from the command line:

agentcontext parse report.pdf                # writes report.md next to the source
agentcontext parse report.pdf --json         # writes report.json (full document model)
agentcontext parse report.pdf --cite inline  # markdown with provenance anchors
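
The block loop in the Quickstart extends naturally to per-page views. The sketch below groups block text by page and relies only on the two attributes the Quickstart uses, `block.text` and `block.provenance.page`. The `group_text_by_page` helper is hypothetical, and the sample blocks are stand-in objects so the code runs without a real document.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_text_by_page(blocks):
    """Map page number -> list of block texts, preserving block order.

    Assumes each block exposes .text and .provenance.page, as in the
    Quickstart. A page of None (unknown) simply becomes its own group.
    """
    pages = defaultdict(list)
    for block in blocks:
        pages[block.provenance.page].append(block.text)
    return dict(pages)


# Stand-in objects for illustration; with the library installed you would
# pass doc.blocks instead.
sample_blocks = [
    SimpleNamespace(text="3.2 Revenue", provenance=SimpleNamespace(page=7)),
    SimpleNamespace(text="Revenue grew 12% year over year...",
                    provenance=SimpleNamespace(page=7)),
    SimpleNamespace(text="3.3 Costs", provenance=SimpleNamespace(page=8)),
]
by_page = group_text_by_page(sample_blocks)
for page, texts in by_page.items():
    print(page, texts)
```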

What the output looks like

--cite inline gives you Markdown that renders normally but carries its receipts:

# Refund Policy <!-- src: policy.md | Refund Policy -->

Customers may request a full refund within 30 days
of purchase. <!-- src: policy.md | Refund Policy -->

## Exceptions <!-- src: policy.md | Refund Policy > Exceptions -->

Digital goods are excluded. <!-- src: policy.md | Refund Policy > Exceptions -->
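
Because the anchors are ordinary HTML comments, a downstream tool can recover them with a regular expression. The sketch below assumes the anchor format is exactly what the sample above shows (`src: <file> | <section path>`, with `>` separating section levels). Anchors for paginated sources may carry more fields, and a section title that itself contains `>` would be split incorrectly. `ANCHOR_RE` and `extract_anchors` are illustrative names, not library API.

```python
import re

# Matches an anchor such as:
#   <!-- src: policy.md | Refund Policy > Exceptions -->
ANCHOR_RE = re.compile(
    r"<!--\s*src:\s*(?P<source>[^|]+?)\s*\|\s*(?P<path>.*?)\s*-->"
)


def extract_anchors(markdown: str):
    """Return (source, section_path_parts) for every anchor, in order."""
    anchors = []
    for match in ANCHOR_RE.finditer(markdown):
        parts = [p.strip() for p in match.group("path").split(">")]
        anchors.append((match.group("source"), parts))
    return anchors


sample = """# Refund Policy <!-- src: policy.md | Refund Policy -->

Customers may request a full refund within 30 days
of purchase. <!-- src: policy.md | Refund Policy -->

## Exceptions <!-- src: policy.md | Refund Policy > Exceptions -->

Digital goods are excluded. <!-- src: policy.md | Refund Policy > Exceptions -->
"""
anchors = extract_anchors(sample)
for source, path in anchors:
    print(source, path)
```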

--json gives you the full Unified Document Model. Unknown provenance fields are explicit null, never omitted — a block without provenance is a bug:

{
  "udm_version": "0.1",
  "metadata": {
    "title": null, "author": null, "created": null,
    "source_path": "/abs/path/report.pdf",
    "sha256": "444cd23e4ba2b0a1…",
    "parser": "pdf", "parser_version": "pdf-parser/0.1"
  },
  "blocks": [
    {
      "type": "paragraph",
      "text": "Revenue grew 12% year over year...",
      "level": null,
      "provenance": {
        "source": "report.pdf",
        "page": 7,
        "section_path": "3. Financials > 3.2 Revenue",
        "bbox": null,
        "char_span": null,
        "confidence": 0.9,
        "parser": "pdf",
        "version": "pdf-parser/0.1"
      }
    }
  ],
  "tables": [ ... ]
}
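
The "explicit null, never omitted" rule is easy to check mechanically. The sketch below scans a document shaped like the sample above and reports any block that lacks a `provenance` object or any provenance key that was omitted. The key list is copied from the sample and may not match the full schema. `find_provenance_gaps` is a hypothetical helper, and the second block in the test data is deliberately broken.

```python
import json

# Provenance keys as they appear in the JSON sample in this README.
PROVENANCE_KEYS = (
    "source", "page", "section_path", "bbox",
    "char_span", "confidence", "parser", "version",
)


def find_provenance_gaps(udm: dict) -> list[str]:
    """List blocks lacking provenance, or provenance lacking a key.

    A key whose value is None (JSON null) counts as present; only an
    omitted key is reported.
    """
    problems = []
    for index, block in enumerate(udm.get("blocks", [])):
        prov = block.get("provenance")
        if not isinstance(prov, dict):
            problems.append(f"block {index}: no provenance")
            continue
        for key in PROVENANCE_KEYS:
            if key not in prov:
                problems.append(f"block {index}: missing '{key}'")
    return problems


udm = json.loads("""
{
  "udm_version": "0.1",
  "blocks": [
    {"type": "paragraph", "text": "Revenue grew 12% year over year...",
     "level": null,
     "provenance": {"source": "report.pdf", "page": 7,
                    "section_path": "3. Financials > 3.2 Revenue",
                    "bbox": null, "char_span": null, "confidence": 0.9,
                    "parser": "pdf", "version": "pdf-parser/0.1"}},
    {"type": "paragraph", "text": "Orphan block", "level": null}
  ]
}
""")
problems = find_provenance_gaps(udm)
print(problems)
```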

Supported formats (v0.1)

PDF (digital / text-layer)
DOCX
HTML
Markdown (normalization + provenance) and plain text

OCR for scanned documents, PPTX, and XLSX are next on the roadmap.

Benchmarks

A public benchmark against MarkItDown and Docling is under construction (see BENCHMARKS.md). It runs on a golden corpus of papers, reports, contracts, and invoices, and measures text accuracy, structure accuracy, table cell accuracy, and provenance accuracy.

We will publish the numbers even where we lose. Trust is the product.

Roadmap

v0.1 (now): PDF/DOCX/HTML/MD → Markdown + JSON with full provenance. CLI + Python SDK.
v0.2: OCR for scanned documents, PPTX/XLSX parsers, bounding boxes via layout analysis, provenance-preserving chunking.
v0.3: Embedding adapters, citation-aware retrieval helpers.
Later: Context packages for agents — retrieval that returns not just chunks, but summaries, tables, entities, and citations in one structured payload.

The long-term vision is a full open context-engineering layer for AI agents (a working preview of the whole pipeline lives on the platform branch). The short-term promise is simpler: the most trustworthy parser you can put in a RAG pipeline.

Design principles

Provenance is not optional. A block without a source location is a bug.
Small core, pluggable edges. Parsers, OCR engines, and exporters implement a small protocol.
No heavyweight dependencies in core. pip install and go.
Honest benchmarks. Measured in CI, published publicly.

Contributing

The Parser protocol makes new formats easy to add:

from agentcontext import Document, Parser, register_parser

class EpubParser(Parser):
    name = "epub"
    version = "epub-parser/0.1"
    extensions = ("epub",)

    def parse(self, path: str) -> Document:
        ...

register_parser(EpubParser())
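
This README does not show how `register_parser` selects a parser internally. As a purely illustrative sketch of one way extension-based dispatch could work, the code below keeps a small registry keyed by file extension. Every name in it (`ParserLike`, `register`, `parser_for`, `EchoParser`) is hypothetical, and it does not import or describe AgentContext's own implementation.

```python
from pathlib import Path
from typing import Protocol


class ParserLike(Protocol):
    """Structural type mirroring the attributes used in the example above."""
    name: str
    extensions: tuple[str, ...]

    def parse(self, path: str) -> object: ...


_REGISTRY: dict[str, ParserLike] = {}


def register(parser: ParserLike) -> None:
    """Index a parser under each extension it declares (case-insensitive)."""
    for ext in parser.extensions:
        _REGISTRY[ext.lower().lstrip(".")] = parser


def parser_for(path: str) -> ParserLike:
    """Pick a parser from the file suffix, or raise ValueError if none matches."""
    ext = Path(path).suffix.lower().lstrip(".")
    try:
        return _REGISTRY[ext]
    except KeyError:
        raise ValueError(f"no parser registered for '.{ext}'") from None


class EchoParser:
    """Toy parser that just returns the path it was given."""
    name = "echo"
    extensions = ("epub",)

    def parse(self, path: str) -> object:
        return {"parsed": path}


register(EchoParser())
result = parser_for("book.EPUB").parse("book.EPUB")
print(result)
```

Keying the registry by normalized extension keeps lookup trivial, at the cost of not handling formats that need content sniffing.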

See CONTRIBUTING.md.

Author

Built by Harish — @harish-ai-engineer

License

Apache-2.0

Release history

This version

0.1.0

Jul 3, 2026

Download files

Source Distribution

agentcontext_core-0.1.0.tar.gz (71.6 kB)

Uploaded Jul 3, 2026 Source

Built Distribution

agentcontext_core-0.1.0-py3-none-any.whl (20.8 kB)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file agentcontext_core-0.1.0.tar.gz.

File metadata

Download URL: agentcontext_core-0.1.0.tar.gz
Upload date: Jul 3, 2026
Size: 71.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for agentcontext_core-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f7bde97a8c0e3e7c0179bc6dafdf30e9c7b06cb811152f60a23dd33254cc387d`
MD5	`7ba32c1aa74479ae92ec8b27c085710e`
BLAKE2b-256	`2c743eeb0404ea0ad3cca3354a6c403fb3a1521b0581e25040472ea1d3bce745`

File details

Details for the file agentcontext_core-0.1.0-py3-none-any.whl.

File metadata

Download URL: agentcontext_core-0.1.0-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 20.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for agentcontext_core-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2b71ccb3dcbb1d41fe455ca6385fff4905a48ab02a3337f135770ebff35eb403`
MD5	`21582f17ad448187db19c571996b0c60`
BLAKE2b-256	`52b67d8a6a40c48974584e5f2282f755c21ba8888d3e0a8a9ad521fa72aafdc4`
