Agentic Purchase Order Intelligence — multi-agent document extraction with visual grounding and schema memory.
Phares — Agentic Purchase Order Intelligence
A production-grade, multi-agent document processing system that turns any PDF, scan, image, DOCX, or XLSX Purchase Order into a strict, validated JSON object with full detail preservation, visual grounding, and schema memory for repeat templates.
Built on CrewAI + LangChain + (optional) LangGraph, with PyMuPDF, pdfplumber, Tesseract, and optional HuggingFace (Donut / LayoutLMv3) and Ollama multimodal backends.
1. Architecture
```text
INPUT (pdf/png/jpg/docx/xlsx)
        │
        ▼
┌──────────────┐                              ┌─────────────────┐
│   Planner    │────── fingerprint hit? ─────▶│  Memory Agent   │
│    Agent     │◀─────── reuse schema ────────│     (FAISS)     │
└──────┬───────┘                              └─────────────────┘
       │
 ┌─────┴──────┐    ┌─────────────┐    ┌──────────────────┐
 │   Loader   │───▶│ OCR/Vision  │───▶│    Structure     │
 │ (pymupdf,  │    │ (Tesseract, │    │  (tables + KVs)  │
 │ pdfplumber │    │  Donut,     │    │                  │
 │ docx,xlsx) │    │ LayoutLMv3, │    │                  │
 │            │    │  llava)     │    │                  │
 └────────────┘    └──────┬──────┘    └─────────┬────────┘
                          │                     │
                          └──────────┬──────────┘
                                     ▼
                         ┌─────────────────────┐
                         │   Labeling Agent    │
                         │  (Ollama LLM, JSON  │
                         │   mode, few-shot)   │
                         └──────────┬──────────┘
                                    │ low-confidence?
                                    ▼
                          ┌────────────────┐
                          │  Web Research  │
                          │  (DuckDuckGo)  │
                          └────────┬───────┘
                                   ▼
                          ┌────────────────┐
                          │  Output Agent  │
                          │   (Pydantic    │
                          │   validation)  │
                          └────────┬───────┘
                                   ▼
                          output/<file>.json
                                   │
                                   ▼
                        Memory Agent persists
                       learned schema for reuse
```
Agent roster
| Agent | Role | Key tools |
|---|---|---|
| Planner | Decides per-file pipeline; rejects non-POs if `PO_ONLY=true` | classify_pdf |
| Loader | Extracts text + layout + images | pdf_extract, docx_extract, xlsx_extract, pdf_pages_to_images |
| OCR & Vision | Recovers text from scans/images | ocr_pdf, ocr_image, donut_parse, vision_describe |
| Structure Extractor | Tables + key/value pairs | extract_tables, extract_kv_pairs |
| Memory | Schema fingerprint → FAISS recall / persist | schema_fingerprint, memory_lookup, memory_store |
| Labeler | Strict-JSON PO field extraction | (LLM) |
| Web Researcher | Clarifies unknown labels / vendor formats | web_search, fetch_url |
| Output Validator | Pydantic-validates the final object | (LLM) |
LangSmith tracing
Set the following in .env to stream every LLM / chain / tool call to
LangSmith (https://smith.langchain.com):
```ini
LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_TRACING=true
LANGSMITH_PROJECT=email-parsh
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
```
`src/config.py` exports these into both the modern `LANGSMITH_*` and legacy `LANGCHAIN_*` environment variable names at import time, so every LangChain / CrewAI call is automatically traced without any code changes (sketched below). Turn it off by setting `LANGSMITH_TRACING=false`.
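For reference, a minimal sketch of what that alias export might look like; the actual code in `src/config.py` may differ:

```python
# Hypothetical sketch: mirror the modern LANGSMITH_* variables into the
# legacy LANGCHAIN_* names so older LangChain versions also pick them up.
import os

_ALIASES = [
    ("LANGSMITH_API_KEY", "LANGCHAIN_API_KEY"),
    ("LANGSMITH_TRACING", "LANGCHAIN_TRACING_V2"),
    ("LANGSMITH_PROJECT", "LANGCHAIN_PROJECT"),
    ("LANGSMITH_ENDPOINT", "LANGCHAIN_ENDPOINT"),
]

for modern, legacy in _ALIASES:
    value = os.getenv(modern)
    if value and not os.getenv(legacy):
        os.environ[legacy] = value  # runs at import time, before any LLM call
```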
LLM backend options
| Provider | .env keys | When to use |
|---|---|---|
| Ollama (local) | `LLM_PROVIDER=ollama`, `LLM_MODEL=llama3.1:8b` | Zero API cost, offline; slower on CPU |
| HuggingFace Inference | `LLM_PROVIDER=hf`, `HF_TOKEN=...`, `HF_INFERENCE_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct` | Fast, no local GPU needed; uses HF serverless |
| OpenAI-compatible | `LLM_PROVIDER=openai`, `OPENAI_API_KEY=...`, `LLM_MODEL=gpt-4o-mini` | Production SLAs; any OpenAI-compatible endpoint |
`HF_TOKEN` is also picked up automatically by transformers, huggingface_hub, and sentence-transformers for model downloads (gated Donut/LayoutLMv3, Nougat, private models). Set it once in `.env` and everything works.
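To make the table concrete, here is a minimal sketch of a provider switch; the factory name is hypothetical, the HF branch is omitted for brevity, and the real logic lives in `models/model_loader.py`:

```python
# Hypothetical provider switch driven by LLM_PROVIDER / LLM_MODEL from .env.
import os
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

def make_llm():
    provider = os.getenv("LLM_PROVIDER", "ollama")
    model = os.getenv("LLM_MODEL", "llama3.1:8b")
    if provider == "ollama":
        return ChatOllama(model=model, format="json")  # JSON mode for strict output
    if provider == "openai":
        # Works against any OpenAI-compatible endpoint via OPENAI_API_KEY.
        return ChatOpenAI(model=model)
    raise ValueError(f"unsupported LLM_PROVIDER: {provider}")
```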
Model selection — why these
- Donut (`naver-clova-ix/donut-base-finetuned-cord-v2`) — best open model for end-to-end document understanding without relying on OCR; handles noisy receipts/POs very well. Feature-flagged (`ENABLE_DONUT=true`) because of GPU weight.
- LayoutLMv3 — when you need token-level visual grounding for training/custom models.
- Tesseract — dependable baseline OCR with word-level confidences + boxes (sketched below).
- Ollama llama3.1/mistral — strong local reasoning for labeling; zero API cost; JSON mode gives us schema-valid output.
- Ollama llava — multimodal fallback for image-only POs.
- sentence-transformers MiniLM — fast, small embeddings for FAISS schema memory.
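As an illustration of the Tesseract point above, a minimal sketch that pulls word-level confidences and boxes via pytesseract (the input file name is hypothetical):

```python
# Hypothetical sketch: word-level text, confidence, and bounding boxes
# from Tesseract, which is what grounds low-level OCR output.
import pytesseract
from PIL import Image

data = pytesseract.image_to_data(
    Image.open("scanned_po.png"), output_type=pytesseract.Output.DICT
)
for text, conf, x, y, w, h in zip(
    data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
):
    if text.strip() and float(conf) >= 0:  # conf is -1 for non-word boxes
        print(f"{text!r} conf={conf} bbox=({x}, {y}, {w}, {h})")
```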
2. Installation
```bash
# 1) Python env
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

# 2) OCR engine (needed for scanned PDFs)
winget install --id UB-Mannheim.TesseractOCR -e
# Then set TESSERACT_CMD in .env to the install path.

# 3) (optional) Poppler — needed by pdf2image on Windows
#    Download from https://github.com/oschwartz10612/poppler-windows/releases
#    Unzip and set POPPLER_PATH in .env to the bin directory.

# 4) LLM backend — Ollama (recommended)
#    https://ollama.com/download
ollama pull llama3.1:8b
ollama pull llava:7b   # only if you want vision fallback

# 5) Configure
copy .env.example .env
# edit paths / LLM settings
```
3. Usage
```bash
# Process every file under samples/
python run.py

# Process one file
python run.py "samples\PO-TVP-548.pdf"

# Or via the full CrewAI agentic orchestrator (verbose trace)
python run.py --crew
```
Output JSON files land in output/ with the same basename. A summary table is
printed to the console.
4. Output schema
```json
{
  "document_type": "purchase_order",
  "document_type_confidence": 0.97,
  "is_purchase_order": true,
  "metadata": {
    "file_name": "PO-TVP-548.pdf",
    "file_path": "...",
    "file_size_bytes": 70011,
    "mime_type": "application/pdf",
    "page_count": 1,
    "is_scanned": false,
    "pipeline_path": "digital_pdf",
    "processing_seconds": 6.12
  },
  "fields": {
    "po_number": {"value": "PO-TVP-548", "raw": "PO No: PO-TVP-548", "confidence": 0.96},
    "po_date": {"value": "2025-04-12", "raw": "12-Apr-2025", "confidence": 0.93},
    "vendor": {"name": {"value": "Acme Traders", "confidence": 0.95}, "address": {...}},
    "buyer": {"name": {"value": "Lancer International", "confidence": 0.94}},
    "ship_to": {"address": {"value": "...", "confidence": 0.88}},
    "items": [
      {
        "description": {"value": "SS Flange 150#", "confidence": 0.91},
        "quantity": {"value": 10, "confidence": 0.95},
        "unit_price": {"value": 1250.0, "confidence": 0.93},
        "line_total": {"value": 12500.0, "confidence": 0.94}
      }
    ],
    "totals": {
      "subtotal": {"value": 12500.0, "confidence": 0.94},
      "tax_total": {"value": 2250.0, "confidence": 0.93},
      "grand_total": {"value": 14750.0, "confidence": 0.95},
      "currency": {"value": "INR", "confidence": 0.92}
    },
    "extras": {"incoterm": "FOB Chennai"}
  },
  "tables": [{...}],
  "raw_text": "…full document text…",
  "confidence_scores": {"po_number": 0.96, "totals.grand_total": 0.95, ...},
  "warnings": [],
  "schema_fingerprint": "acme po no date vendor… :: TBL:desc|qty|rate|amount",
  "reused_memory_template": null
}
```
Every leaf field is a `GroundedValue` with `value`, `raw`, an optional `bbox`, and `confidence` (sketched below). Unknown fields land in `fields.extras`, so no data is ever dropped.
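A minimal sketch of that contract, using the field names from this section (the actual class in `models/schemas.py` may differ):

```python
# Hypothetical sketch of the GroundedValue leaf type described above.
from typing import Any, Optional, Tuple
from pydantic import BaseModel

class GroundedValue(BaseModel):
    value: Any                    # normalized value, e.g. "2025-04-12" or 12500.0
    raw: Optional[str] = None     # verbatim source text, e.g. "12-Apr-2025"
    bbox: Optional[Tuple[float, float, float, float]] = None  # page-space box
    confidence: float = 0.0       # extraction confidence in [0, 1]
```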
5. Intelligence / learning loop
- For each new document we compute a layout fingerprint from top-of-page tokens, table headers, and KV keys.
- FAISS returns the nearest known template. If similarity ≥ 0.92 we inject that schema as few-shot context, skipping re-learning.
- After labeling, if no hit existed, the new schema is persisted — subsequent runs on the same template are faster and more consistent (see the sketch below).
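A minimal sketch of that lookup/persist cycle, assuming MiniLM embeddings and an inner-product FAISS index over normalized vectors (names are illustrative; the real code lives in `src/memory/vector_store.py`):

```python
# Hypothetical schema-memory sketch: embed the layout fingerprint, recall the
# nearest known template, and persist new schemas for future runs.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.IndexFlatIP(384)   # MiniLM-L6 embeddings are 384-dimensional
templates = []                   # schema records, parallel to index rows

def lookup(fingerprint, threshold=0.92):
    vec = model.encode([fingerprint], normalize_embeddings=True).astype(np.float32)
    if index.ntotal == 0:
        return None
    scores, ids = index.search(vec, 1)  # cosine similarity on unit vectors
    return templates[ids[0][0]] if scores[0][0] >= threshold else None

def store(fingerprint, schema):
    vec = model.encode([fingerprint], normalize_embeddings=True).astype(np.float32)
    index.add(vec)
    templates.append(schema)
```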
6. Extensibility
- Add a new labeled field → extend `models/schemas.py` and mention it in `agents/labeling_agent.LABELING_SYSTEM`.
- Add a new file type → add a tool in `src/tools/`, register it in `tools/__init__.py`, and teach the Planner (see the sketch below).
- Swap the LLM → change `LLM_PROVIDER` / `LLM_MODEL` in `.env`.
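A minimal sketch of what such a new tool could look like, using LangChain's `@tool` decorator (the CSV example is hypothetical; real tools live in `src/tools/`):

```python
# Hypothetical file-type tool; register it in tools/__init__.py and mention
# it in the Planner's instructions so it gets routed to.
import csv
from langchain_core.tools import tool

@tool
def csv_extract(path: str) -> str:
    """Extract raw text from a CSV purchase order."""
    with open(path, newline="", encoding="utf-8") as f:
        return "\n".join(", ".join(row) for row in csv.reader(f))
```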
7. Email ingestion (IMAP → extraction)
The email subsystem is an additive feature — nothing in the existing extraction pipeline is modified. An Email Ingestion Agent monitors the configured IMAP inbox, classifies each message as PO or not, downloads attachments, and forwards supported files into `pipeline.graph.run_graph`.
Security
- Credentials are read from `IMAP_USER` / `IMAP_PASS` in `.env` or the process environment. Never hardcoded. `.env` is in `.gitignore`.
- Connection is always `IMAP4_SSL` (default port 993).
- For Gmail, create an App Password (16 chars). Your normal password will not work with 2FA.
- For production, swap the `.env` source for AWS Secrets Manager / Azure Key Vault by overriding `IMAP_USER` / `IMAP_PASS` in the process environment before import; `src/config.py` will pick them up transparently.
.env keys
```ini
IMAP_USER=you@example.com
IMAP_PASS=xxxxxxxxxxxxxxxx
IMAP_HOST=imap.gmail.com
IMAP_PORT=993
IMAP_FOLDER=INBOX
IMAP_POLL_SECONDS=60
IMAP_MARK_SEEN=true        # marks \Seen on the server after processing
IMAP_SEARCH=UNSEEN         # IMAP SEARCH criterion
IMAP_LOOKBACK_DAYS=7
IMAP_MAX_BATCH=25
ATTACHMENT_DIR=<path to save attachments>
SEEN_STORE_PATH=<path to jsonl dedup log>
```
Classification pipeline
For each unread email:
- Keyword prefilter — subject/body regex + attachment presence. Strong matches (score ≥ 0.75) or the clearly-null case skip the LLM.
- LLM classifier — when the signal is ambiguous, an Ollama JSON call returns `{is_purchase_order, confidence, reason}`, which is blended with the keyword signal (sketched below).
- The result is appended to `memory_store/email_runs.jsonl` for audit.
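A minimal sketch of the prefilter-then-blend idea, with illustrative patterns and weights (the real regexes and scoring live in `src/email_ingestion/classifier.py`):

```python
# Hypothetical keyword prefilter: cheap regex signal first; only ambiguous
# scores fall through to the Ollama JSON classifier.
import re

PO_PATTERNS = [r"\bpurchase\s+order\b", r"\bp\.?o\.?\s*(?:no|number|#)", r"\bpo-\d+"]

def keyword_score(subject: str, body: str, has_attachment: bool) -> float:
    text = f"{subject}\n{body}".lower()
    hits = sum(bool(re.search(p, text)) for p in PO_PATTERNS)
    return min(1.0, 0.4 * hits + (0.2 if has_attachment else 0.0))

# score >= 0.75 (or an obvious null) skips the LLM entirely; anything in
# between is blended with the LLM's {is_purchase_order, confidence, reason}.
```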
Attachment handling
- Only `.pdf .png .jpg .jpeg .tif .tiff .docx .xlsx` are retained.
- Filenames are sanitized (`[^A-Za-z0-9._\- ]` → `_`); collisions get `__1`, `__2` (sketched below).
- Size cap: 25 MB per attachment.
- One folder per email: `attachments/uid-<UID>-<mid-hash>/`.
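A minimal sketch of the sanitize-and-deduplicate rule quoted above (the helper name is hypothetical; the real logic lives in `attachment_handler.py`):

```python
# Hypothetical filename sanitizer: [^A-Za-z0-9._\- ] -> "_", collisions
# get __1, __2, ... appended before the extension.
import re
from pathlib import Path

def safe_target(name: str, target_dir: Path) -> Path:
    safe = re.sub(r"[^A-Za-z0-9._\- ]", "_", name)
    candidate, n = target_dir / safe, 0
    while candidate.exists():
        n += 1
        candidate = target_dir / f"{Path(safe).stem}__{n}{Path(safe).suffix}"
    return candidate
```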
Dedup
Two guards ensure the same email is never processed twice:
- Server-side: `IMAP_MARK_SEEN=true` sets the `\Seen` flag after processing, so subsequent `UNSEEN` searches skip it.
- Client-side: a JSONL seen-store at `SEEN_STORE_PATH`, keyed by Message-ID (falling back to UID) — it survives even if the server flag is cleared (sketched below).
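A minimal sketch of that client-side guard: an append-only JSONL store keyed by Message-ID with a UID fallback (the record format is illustrative; see `seen_store.py` for the real one):

```python
# Hypothetical JSONL seen-store: survives even if the server's \Seen flag
# is cleared, because keys are persisted locally.
import json
from pathlib import Path
from typing import Optional

class SeenStore:
    def __init__(self, path: str):
        self.path = Path(path)
        self.seen = set()
        if self.path.exists():
            for line in self.path.read_text(encoding="utf-8").splitlines():
                self.seen.add(json.loads(line)["key"])

    @staticmethod
    def key(message_id: Optional[str], uid: str) -> str:
        return message_id or f"uid:{uid}"   # Message-ID preferred, UID fallback

    def mark(self, key: str) -> None:
        if key in self.seen:
            return
        self.seen.add(key)
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key}) + "\n")
```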
Usage
```bash
# one cycle, then exit (good for cron / Task Scheduler)
python run_email.py --once

# long-running poller
python run_email.py --poll 60

# widen the IMAP SINCE window to 30 days, skip the LLM classifier
python run_email.py --once --lookback 30 --no-llm

# verbose logging
python run_email.py --once -v
```
The runner prints a Rich summary table per cycle and appends full details to `memory_store/email_runs.jsonl`. Each PO attachment produces its own `output/<basename>.json` via the existing extraction graph — unchanged.
Files added
```text
src/email_ingestion/
├── __init__.py
├── imap_client.py          # IMAP4_SSL + parsing + retry
├── seen_store.py           # JSONL dedup
├── classifier.py           # keyword + LLM PO classifier
├── attachment_handler.py   # sanitize, validate, save
└── runner.py               # poll loop + wiring to run_graph

src/tools/email_tools.py    # LangChain tools for the Email Agent
src/agents/email_agent.py   # CrewAI "Email Ingestion Specialist"
run_email.py                # CLI entrypoint
```
8. Files
```text
Phares/
├── run.py                      # entry
├── requirements.txt
├── .env.example
├── README.md
├── samples/                    # input POs
├── output/                     # JSON results
├── memory_store/               # FAISS + records.jsonl
└── src/
    ├── main.py                 # CLI
    ├── config.py
    ├── models/
    │   ├── schemas.py          # Pydantic contract
    │   └── model_loader.py     # LLM + embedder factories
    ├── tools/                  # LangChain tools
    │   ├── pdf_tools.py
    │   ├── ocr_tools.py
    │   ├── vision_tools.py
    │   ├── layout_tools.py
    │   ├── office_tools.py
    │   ├── memory_tools.py
    │   └── search_tools.py
    ├── memory/vector_store.py  # FAISS schema memory
    ├── agents/                 # CrewAI agents
    │   ├── planner_agent.py
    │   ├── loader_agent.py
    │   ├── ocr_vision_agent.py
    │   ├── structure_agent.py
    │   ├── labeling_agent.py
    │   ├── web_research_agent.py
    │   ├── memory_agent.py
    │   └── output_agent.py
    ├── pipeline/
    │   ├── graph.py            # deterministic state machine (default)
    │   └── crew.py             # CrewAI orchestration (--crew)
    └── utils/file_utils.py
```