LLM-ready document understanding package.

Documa

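Overview

Documa is an LLM-oriented document understanding package with a Python package at its core. Its goal is not to replace the underlying PDF parser but to organize the low-level signals a parser emits into a stable, traceable intermediate document representation that agents, RAG, MCP, and tool-calling workflows can consume.

It is currently positioned as an alpha-stage developer package:

The core is a Python package; the core repo does not ship a product UI.
PDF parsing is plugged in through adapters, currently including PyMuPDFAdapter; the core does not depend directly on parser-native objects.
Internal text uses Python Unicode str; JSON and file output default to UTF-8, and JSON is written with ensure_ascii=False.
Both the original text and the normalized text are kept; the original is never silently overwritten.
The CLI, Python tool layer, OpenAI tool schema, and MCP wrapper share a single structured result contract.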
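
The two text-handling rules above can be sketched with the standard library alone. Only the raw_text / normalized_text field names and ensure_ascii=False come from this README; the record layout and the choice of NFKC are assumptions made for illustration, not Documa's actual schema or normalization policy.

```python
import json
import unicodedata


def make_text_record(raw_text: str) -> dict:
    """Keep the parser's text untouched and store a normalized copy next to it.

    NFKC is an assumed normalization form, chosen only for illustration.
    """
    return {
        "raw_text": raw_text,
        "normalized_text": unicodedata.normalize("NFKC", raw_text),
    }


def to_json(record: dict) -> str:
    # ensure_ascii=False keeps CJK characters readable instead of \uXXXX escapes.
    return json.dumps(record, ensure_ascii=False)


record = make_text_record("ＡＢＣ　風險")  # full-width letters and an ideographic space
payload = to_json(record)
encoded = payload.encode("utf-8")  # bytes as they would be written to a UTF-8 file
```

Because the raw string is stored alongside the normalized one, a downstream consumer can always compare the two and see exactly what normalization changed.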

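Why Documa

A typical PDF parser is mostly responsible for "getting text, coordinates, images, or table candidates out of a PDF." That matters, but it is usually not enough for LLM applications: natural reading order, paragraphs, headers and footers, table context, footnotes, image captions, RAG chunk provenance, and the progressive reading interface an agent needs all require an additional organizing layer.

Documa fills in that layer.

flowchart LR
    A["PDF / Markdown"] --> B["Parser adapter"]
    B --> C["Parser-neutral Documa IR"]
    C --> D["Understanding pipeline"]
    D --> E["Outputs: JSON / Markdown / RAG JSON / block JSON"]
    D --> F["Interfaces: CLI / Python tools / OpenAI schemas / MCP"]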

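How Documa differs from a typical PDF parser

Aspect	Typical PDF parser	Documa
Boundary	Exposes the parser API or parser-specific results directly	Converts to a parser-neutral `DocumentIR` through an adapter
Text handling	Often outputs only parser text	Keeps both `raw_text` and `normalized_text`
Layout understanding	Provides low-level signals such as coordinates, blocks, and spans	Builds reading order, paragraphs, layout classes, and table/image normalization in the pipeline
Traceability	Downstream code tracks page numbers and coordinates itself	The IR stores page refs, bbox refs, source refs, relations, and metadata
LLM ingestion	Usually outputs plain text or simple chunks	Produces RAG chunks, a block tree, keyword metadata, and provenance
Agent reading	Often requires feeding in the whole document at once	Provides `list/search/read block` progressive reading tools
Integration	Tied to a specific library	The same capabilities are available through the CLI, Python tool layer, OpenAI function schema, and MCP wrapper
Quality gates	Mostly relies on manual testing	Ships `doctor`, a fixture benchmark, and regression tests

Documa's stance toward PDF parsers is "adapter-based composition": the underlying parser focuses on extraction, and Documa focuses on turning the extraction result into a stable, maintainable document understanding data layer that LLMs can use.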

Core concepts

DocumentIR: Documa's parser-neutral source of truth, defined in src/documa/core/ir.py.
Adapter: converts an external format into DocumentIR, for example PyMuPDFAdapter and MarkdownAdapter.
Pipeline stage: applies a conservative, testable transformation to the IR, for example reading order, inline semantics, paragraph grouping, table normalization, relations, block tree, and chunking.
Relation: expresses traceable links such as footnote, TOC, caption, and provenance; when a link is uncertain, it keeps unresolved evidence rather than pretending to have succeeded.
Document block: an agent-facing progressive disclosure tree. An agent can look at metadata first, then read only the relevant section, paragraph, table, or image block.
Exporter: writes the IR out as json, markdown, rag-json, or block-json.
Tool layer: tools such as documa_parse, documa_process, and documa_search_blocks share one structured result.
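
The Relation idea, keeping unresolved evidence instead of guessing, can be illustrated with a small sketch. The Block and Relation classes and the "exactly one image on the same page" rule are hypothetical; the real models live in src/documa/core/ir.py and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    """Hypothetical IR block, reduced to what this example needs."""
    block_id: str
    kind: str  # e.g. "paragraph", "image", "caption"
    page: int
    text: str = ""


@dataclass
class Relation:
    """Hypothetical link between two blocks, with the evidence that supports it."""
    kind: str
    source_id: str
    target_id: Optional[str]  # None when no target could be resolved
    status: str  # "resolved" or "unresolved"
    evidence: dict = field(default_factory=dict)


def link_caption(caption: Block, images: list[Block]) -> Relation:
    """Link a caption to an image only when exactly one image shares its page."""
    candidates = [img.block_id for img in images if img.page == caption.page]
    if len(candidates) == 1:
        return Relation("caption", caption.block_id, candidates[0], "resolved",
                        {"rule": "single image on page"})
    # Ambiguous or missing target: record what was seen instead of guessing.
    return Relation("caption", caption.block_id, None, "unresolved",
                    {"candidates": candidates})


images = [Block("img-1", "image", 1), Block("img-2", "image", 2), Block("img-3", "image", 2)]
resolved = link_caption(Block("cap-1", "caption", 1, "Figure 1"), images)
unresolved = link_caption(Block("cap-2", "caption", 2, "Figure 2"), images)
```

The unresolved relation still carries its candidate list, so a later stage or a human reviewer has something concrete to work from.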

Quick start

The commands below use PowerShell. On macOS/Linux, activate the virtual environment with source .venv/bin/activate instead.

Prerequisites

Python 3.10 or later.
A checkout of this repo, or an installed documa package.
To process PDFs through PyMuPDFAdapter, install the pdf extra.
To use MCP or the demo integrations, install the mcp and demo extras as needed.

1. Install from a package index

Once Documa is published to PyPI or a compatible private index, the runtime package can be installed directly:

python -m pip install "documa[pdf]"

If you only need the Markdown adapter, IR, pipeline, exporters, and tool schema, omit the pdf extra:

python -m pip install documa

2. Install from the Git repo

If the package is not yet available from your package index, or you want a specific Git revision, use pip's direct URL install:

python -m pip install "documa[pdf] @ git+https://github.com/AllanYiin/Documa.git"

3. Install from a local wheel

Inside the repo checkout, build the distribution artifacts first:

python -m pip install --upgrade build
python -m build

Then install the resulting wheel:

python -m pip install ".\dist\documa-0.1.0-py3-none-any.whl[pdf]"

4. Development install from a repo checkout

git clone https://github.com/AllanYiin/Documa.git
cd Documa

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e ".[dev,pdf]"

To install the full integration surface at once:

python -m pip install -e ".[dev,pdf,mcp,demo]"

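Optional extras:

Extra	Purpose
`pdf`	Installs PyMuPDF so that `PyMuPDFAdapter` can process PDFs.
`mcp`	Installs the MCP dependencies that `documa-mcp` needs.
`demo`	Installs optional dependencies for the demos, such as token counting and OpenAI SDK support.
`dev`	Installs the test dependencies.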

5. Verify the environment

documa doctor

Expected result: structured JSON diagnostics. Missing optional integrations are reported as warnings, separate from core package failures.
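
As a rough illustration of keeping warnings separate from failures, the sketch below checks module availability with importlib. The report keys (ok, failures, warnings) and the module names are assumptions for this example; they are not the actual output format of documa doctor.

```python
import importlib.util
import json


def run_checks(core_modules: list[str], optional_modules: list[str]) -> dict:
    """Report missing core modules as failures and missing optional ones as warnings."""

    def missing(names: list[str]) -> list[str]:
        return [name for name in names if importlib.util.find_spec(name) is None]

    failures = missing(core_modules)
    warnings = missing(optional_modules)
    return {
        "ok": not failures,  # warnings alone do not fail the check
        "failures": failures,
        "warnings": warnings,
    }


# "json" stands in for a core dependency; the second name is a deliberately
# absent module standing in for an optional extra.
report = run_checks(["json"], ["hypothetical_optional_extra_xyz"])
print(json.dumps(report, ensure_ascii=False, indent=2))
```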

If the console script is not available yet, use the module entry point instead:

$env:PYTHONPATH="src"
python -m documa.cli doctor

6. Process a PDF

documa process "<path-to-report.pdf>" `
  --out ".\out\report" `
  --lang zh-Hant,en `
  --export-format block-json

Expected output:

.\out\report\documa.ir.json: the full Documa IR.
.\out\report\documa.rag.json: retrieval chunks with source metadata.
.\out\report\documa.blocks.json: the block tree for progressive reading.
.\out\report\assets\...: page previews or extracted assets, when available.
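
A script that drives documa process may want to confirm that the files listed above were actually written before handing them to a downstream step. The helper below is a hypothetical convenience, not part of Documa; it relies only on the three file names given in this section.

```python
import tempfile
from pathlib import Path

# File names taken from the expected-output list above.
EXPECTED_FILES = ("documa.ir.json", "documa.rag.json", "documa.blocks.json")


def missing_outputs(out_dir: str) -> list[str]:
    """Return the expected output files that are not present in out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration against a temporary directory instead of a real run.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_outputs(tmp)
    Path(tmp, "documa.ir.json").write_text("{}", encoding="utf-8")
    after = missing_outputs(tmp)
```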

process is the high-level ingestion command; it runs parse plus the default understanding pipeline. If you only need the adapter boundary, use parse:

documa parse "<path-to-report.pdf>" --out ".\out\parsed" --lang zh-Hant,en

7. Read the document like an agent

After process completes, you can inspect and search the logical blocks first, then read only the relevant content:

documa blocks ".\out\report\documa.ir.json"
documa search-blocks ".\out\report\documa.ir.json" --query "主要風險"
documa block ".\out\report\documa.ir.json" --id "<block-id>"
documa block ".\out\report\documa.ir.json" --id "<block-id>" --read

This is the primary LLM-facing usage pattern: list or search metadata first, then load the selected block body and provenance.
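
The list, search, then read pattern can be sketched in memory as follows. DocBlock and the three helper functions are hypothetical and use naive substring matching; they only mirror the shape of the workflow, not the block-json schema or Documa's search behavior.

```python
from dataclasses import dataclass, field


@dataclass
class DocBlock:
    """Hypothetical block node: light metadata up front, body read on demand."""
    block_id: str
    kind: str
    title: str
    page: int
    body: str = ""
    children: list["DocBlock"] = field(default_factory=list)


def walk(block: DocBlock):
    """Yield a block and all of its descendants, depth first."""
    yield block
    for child in block.children:
        yield from walk(child)


def list_blocks(root: DocBlock) -> list[dict]:
    """Return metadata only, so an agent can scan the tree cheaply."""
    return [{"id": b.block_id, "kind": b.kind, "title": b.title, "page": b.page}
            for b in walk(root)]


def search_blocks(root: DocBlock, query: str) -> list[str]:
    """Return ids of blocks whose title or body contains the query."""
    return [b.block_id for b in walk(root) if query in b.title or query in b.body]


def read_block(root: DocBlock, block_id: str) -> dict:
    """Load one block's body together with minimal provenance."""
    for b in walk(root):
        if b.block_id == block_id:
            return {"id": b.block_id, "body": b.body, "provenance": {"page": b.page}}
    raise KeyError(block_id)


root = DocBlock("doc", "document", "Annual report", 1, children=[
    DocBlock("sec-1", "section", "Overview", 1, "Revenue grew."),
    DocBlock("sec-2", "section", "Risks", 4, "主要風險 include supply delays."),
])
hits = search_blocks(root, "主要風險")
selected = read_block(root, hits[0])
```

Only the selected block's body enters the agent's context; everything else stays as compact metadata.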

8. Export for downstream systems

documa export ".\out\report\documa.ir.json" --format markdown --out ".\out\report\documa.md"
documa export ".\out\report\documa.ir.json" --format rag-json --out ".\out\report\documa.rag.json"
documa export ".\out\report\documa.ir.json" --format block-json --out ".\out\report\documa.blocks.json"

Supported formats:

json: the full Documa IR.
markdown: readable Markdown with page markers.
rag-json: chunk records for retrieval ingestion.
block-json: the document block tree for progressive reading workflows.
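
One way to picture format-based export is a dispatch table keyed by format name. The sketch below covers only json and markdown over a heavily simplified, hypothetical IR; the HTML-comment page marker is an assumed syntax, since this README does not specify what Documa's page markers look like.

```python
import json

# Hypothetical, heavily simplified IR: a list of pages, each with text blocks.
SAMPLE_IR = {
    "pages": [
        {"number": 1, "blocks": ["Overview", "Revenue grew."]},
        {"number": 2, "blocks": ["Risks"]},
    ]
}


def export_json(ir: dict) -> str:
    return json.dumps(ir, ensure_ascii=False, indent=2)


def export_markdown(ir: dict) -> str:
    """Emit blocks as paragraphs, preceded by a marker for each page."""
    parts = []
    for page in ir["pages"]:
        parts.append(f"<!-- page {page['number']} -->")  # assumed marker syntax
        parts.extend(page["blocks"])
    return "\n\n".join(parts)


EXPORTERS = {"json": export_json, "markdown": export_markdown}


def export(ir: dict, fmt: str) -> str:
    """Look up the exporter by format name and fail loudly on unknown names."""
    try:
        exporter = EXPORTERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return exporter(ir)


markdown_text = export(SAMPLE_IR, "markdown")
```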

9. Use tool schemas or MCP

List the Documa tools:

documa tools

Get OpenAI-compatible function tool descriptors in Python:

from documa.interfaces import openai_tool_schemas

tools = openai_tool_schemas(strict=True)
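
For orientation, this is roughly what a strict function tool descriptor looks like in the Chat Completions style. It is a hand-written sketch, not the output of openai_tool_schemas(); the parameter names mirror the call_documa_tool example in this README, while the description text and JSON types are assumptions.

```python
def function_tool(name: str, description: str, properties: dict, strict: bool = True) -> dict:
    """Build a Chat Completions style function tool descriptor.

    This sketch always lists every property as required and rejects extra
    keys, which is what OpenAI's strict mode expects of a schema.
    """
    parameters = {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
            "strict": strict,
        },
    }


process_tool = function_tool(
    "documa_process",
    "Parse a document and run the default understanding pipeline.",  # assumed wording
    {
        "source": {"type": "string"},
        "out": {"type": "string"},
        "languages": {"type": "array", "items": {"type": "string"}},
        "export_formats": {"type": "array", "items": {"type": "string"}},
    },
)
```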

Call the shared tool layer directly:

from documa.interfaces import call_documa_tool

result = call_documa_tool(
    "documa_process",
    {
        "source": "report.pdf",
        "out": "out/report",
        "languages": ["zh-Hant", "en"],
        "export_formats": ["rag-json", "block-json"],
    },
)

Start the optional MCP server:

python -m pip install -e ".[mcp]"
documa-mcp

10. Run the tests

python -m pytest

If pytest is not installed, unittest also works:

python -m unittest discover -s tests

Usage

CLI ingestion: use documa process to take the default parser-plus-pipeline path.
Python library usage: when embedding Documa in another package, call the adapters and pipeline stages directly.
Tool-calling usage: when the agent runtime is responsible for executing tools, use call_documa_tool() or openai_tool_schemas().
MCP usage: run documa-mcp when an MCP host needs to discover the Documa tools.
Example usage: to build a downstream app, start from examples/pdf_chat_like/ or examples/pdf_chat_like_web/.

Project structure

Path	Purpose
`src/documa/core/`	IR models, serialization, encoding, language, and text normalization utilities.
`src/documa/adapters/`	Parser adapters. External parser objects should not cross this boundary.
`src/documa/pipeline/`	Parser-neutral understanding stages and default pipeline orchestration.
`src/documa/exporters/`	JSON, Markdown, RAG JSON, and block JSON exporters.
`src/documa/interfaces/`	Shared tool functions, JSON schemas, the OpenAI schema wrapper, and the MCP server wrapper.
`src/documa/quality/`	Doctor checks and fixture benchmark support.
`examples/`	Runnable integration examples built on the public package surface. These are examples, not a core UI.
`fixtures/pdf/`	Fixture manifest for PDF parsing risk coverage.
`docs/documa/`	Architecture notes and the expert review log.
`tests/`	Unit and regression tests for the IR, pipeline, adapters, CLI, and tool interfaces.

Using Documa as a developer

Using the Python API

from documa.adapters.base import ParseOptions
from documa.adapters.pymupdf_adapter import PyMuPDFAdapter
from documa.pipeline.runner import run_default_pipeline

document = PyMuPDFAdapter().parse(
    "report.pdf",
    ParseOptions(languages=["zh-Hant", "en"]),
)
pipeline_run = run_default_pipeline(document)
processed_document = pipeline_run.document

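Adding a parser adapter

When adding a parser integration:

Implement ParserAdapter.
The adapter returns only Documa IR objects.
Parser-native objects must stay inside the adapter boundary.
Keep the original text and the normalized text separate.
Supply the page refs, bbox refs, source refs, and metadata that repair loops need.
Add tests for the adapter and for any new pipeline behavior it enables.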
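
The checklist above can be illustrated with a toy adapter. Everything here is hypothetical: TextBlock, Document, and Adapter stand in for Documa's real IR models and ParserAdapter interface, whose actual signatures this README does not show, and the "native parser" is a stub returning tuples.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TextBlock:
    """Hypothetical IR block carrying both text forms plus provenance."""
    raw_text: str
    normalized_text: str
    page: int
    bbox: tuple[float, float, float, float]
    source_ref: str


@dataclass
class Document:
    """Hypothetical stand-in for DocumentIR."""
    blocks: list[TextBlock] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class Adapter(Protocol):
    """Hypothetical adapter contract: path in, IR out."""
    def parse(self, path: str) -> Document: ...


class FakeParserAdapter:
    """Wraps an imaginary parser whose native result is a list of tuples."""

    def _native_parse(self, path: str) -> list[tuple]:
        # Stands in for a third-party parser call; tuples never leave this class.
        return [(1, (72.0, 72.0, 300.0, 90.0), "Total\u00a0revenue")]

    def parse(self, path: str) -> Document:
        blocks = [
            TextBlock(
                raw_text=text,  # untouched parser output
                normalized_text=" ".join(text.split()),  # assumed whitespace cleanup
                page=page,
                bbox=bbox,
                source_ref=f"{path}#page={page}",
            )
            for page, bbox, text in self._native_parse(path)
        ]
        return Document(blocks=blocks, metadata={"adapter": "fake", "source": path})


document = FakeParserAdapter().parse("report.pdf")
```

Callers only ever see Document and TextBlock, so swapping the underlying parser would not change any downstream code in this sketch.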

Adding a public tool or object

When adding a public capability, consider the full lifecycle together:

create/update/delete/state behavior, where applicable;
CLI surface;
Python tool function;
MCP wrapper;
tool-calling schema;
structured success and error payloads;
regression tests.
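
To make "structured success and error payloads" concrete, here is a minimal dispatcher in which every call returns the same envelope. The envelope keys, error codes, and the echo tool are all assumptions for illustration; Documa's real payload contract is defined by its tool layer, not by this sketch.

```python
from typing import Callable


def ok(tool: str, data: dict) -> dict:
    return {"ok": True, "tool": tool, "data": data}


def error(tool: str, code: str, message: str) -> dict:
    return {"ok": False, "tool": tool, "error": {"code": code, "message": message}}


def echo_tool(arguments: dict) -> dict:
    """Hypothetical tool body used only to exercise the dispatcher."""
    return {"echo": arguments["text"]}


TOOLS: dict[str, Callable[[dict], dict]] = {"echo": echo_tool}


def call_tool(name: str, arguments: dict) -> dict:
    """Single entry point that a CLI, schema wrapper, or MCP wrapper could share."""
    handler = TOOLS.get(name)
    if handler is None:
        return error(name, "unknown_tool", f"no tool named {name!r}")
    try:
        return ok(name, handler(arguments))
    except KeyError as exc:
        return error(name, "invalid_arguments", f"missing argument: {exc.args[0]}")


success = call_tool("echo", {"text": "hi"})
unknown = call_tool("nope", {})
invalid = call_tool("echo", {})
```

Returning errors as data rather than raising lets every surface report failures in the same shape.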

Examples

examples/pdf_chat_like/ demonstrates a CLI-first progressive reading workflow for PDFs:

$env:PYTHONPATH="src"
python examples\pdf_chat_like\pdf_chat_example.py "<path-to-report.pdf>" `
  --question "這份文件的主要風險是什麼？" `
  --out ".\out\documa-pdf-chat"

examples/pdf_chat_like_web/ demonstrates how a local browser UI can call Documa while the Documa core stays UI-free:

$env:PYTHONPATH="src"
python examples\pdf_chat_like_web\server.py --port 8765

Then open http://127.0.0.1:8765.

Quality gates

Before a release or a change to public behavior, it is recommended to run:

documa doctor
documa benchmark
python -m pytest

If the release gate requires every declared fixture PDF to exist, use --require-files:

documa benchmark --require-files --out ".\out\documa-benchmark.json"
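
The idea behind a strict file requirement can be sketched as a manifest check. The manifest shape and the result keys below are assumptions; the actual fixture manifest under fixtures/pdf/ and the benchmark report format may differ.

```python
import json
import tempfile
from pathlib import Path


def check_fixtures(manifest_path: str, require_files: bool) -> dict:
    """Compare a manifest's declared files against what exists on disk.

    Assumed manifest shape: {"fixtures": [{"file": "relative/path.pdf"}, ...]}.
    """
    manifest = Path(manifest_path)
    entries = json.loads(manifest.read_text(encoding="utf-8"))["fixtures"]
    declared = [entry["file"] for entry in entries]
    missing = [name for name in declared if not (manifest.parent / name).is_file()]
    return {
        "declared": len(declared),
        "missing": missing,
        # Missing files only fail the gate when the strict flag is set.
        "passed": not (require_files and missing),
    }


# Demonstration with a temporary manifest that declares two files, one present.
with tempfile.TemporaryDirectory() as tmp:
    manifest_file = Path(tmp, "manifest.json")
    manifest_file.write_text(
        json.dumps({"fixtures": [{"file": "a.pdf"}, {"file": "b.pdf"}]}),
        encoding="utf-8",
    )
    Path(tmp, "a.pdf").write_bytes(b"%PDF-1.4")
    lenient = check_fixtures(str(manifest_file), require_files=False)
    strict = check_fixtures(str(manifest_file), require_files=True)
```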

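Known limitations

The Documa core is not a UI product. UI code belongs in examples or downstream applications.
Documa does not implement a PDF parser from scratch. PDF extraction is delegated to the underlying parser through adapters.
The current pipeline stages are a conservative baseline, not a perfect document understanding model.
OCR-only or image-only PDFs need parser/OCR support beyond what the core assumes.
Optional LLM usage appears only in demos or downstream integrations. Core parsing and processing are deterministic and can run offline.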

Troubleshooting

Symptom	Likely cause	Fix
`PyMuPDF is required for PyMuPDFAdapter.`	The `pdf` extra is not installed.	Run `python -m pip install -e ".[pdf]"` inside the repo checkout.
The `documa` command is not found.	The virtual environment is not activated, or the package has not been installed in editable mode.	Activate `.venv` and rerun `python -m pip install -e ".[dev,pdf]"`, or run `$env:PYTHONPATH="src"; python -m documa.cli ...` instead.
`documa-mcp` cannot import the MCP modules.	The `mcp` extra is not installed.	Run `python -m pip install -e ".[mcp]"`.
The reading order of the PDF result is not as expected.	PDF text order depends on how the source PDF was generated.	Inspect `documa.ir.json`, use the block search/read tools, and add fixture coverage before changing the pipeline heuristics.

Upstream background

The following upstream concepts inform Documa's boundary design:

The PyMuPDF documentation notes that plain PDF text extraction does not necessarily preserve natural reading order, while block/word extraction carries position information that can be used for layout repair: PyMuPDF text recipes.
OpenAI function calling uses JSON-schema-defined tools, and the application side executes the tool call: OpenAI function calling guide.
MCP tools return content and structured results and support structured output schemas: MCP tools specification.
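
To show how position information can drive layout repair, the sketch below reorders blocks by their bounding boxes. The tuples follow the shape PyMuPDF documents for page.get_text("blocks"); the sample values are made up, and the heuristic is a naive single-column rule written for this example, not Documa's reading-order stage.

```python
# Tuple shape: (x0, y0, x1, y1, text, block_no, block_type).
blocks = [
    (72.0, 200.0, 300.0, 220.0, "Second paragraph", 0, 0),
    (72.0, 100.0, 300.0, 120.0, "Title", 1, 0),
    (320.0, 101.0, 540.0, 121.0, "Right of title", 2, 0),
]


def naive_reading_order(blocks, line_tolerance=5.0):
    """Order blocks top to bottom, then left to right within a row.

    Blocks are first sorted by top edge; a block joins the current row when its
    top edge lies within line_tolerance points of the row's first block.
    """
    rows = []
    for block in sorted(blocks, key=lambda b: b[1]):
        if rows and abs(block[1] - rows[-1][0][1]) <= line_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


ordered_text = [b[4] for b in naive_reading_order(blocks)]
```

A rule this simple breaks on multi-column pages, which is one reason to add fixture coverage before changing real heuristics.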

If you plan to change Documa's integration layer, recheck the current upstream docs before changing schemas or protocol behavior.


Version 0.1.0, released Jun 10, 2026.

