
Agentic Purchase Order Intelligence — multi-agent document extraction with visual grounding and schema memory.


Phares — Agentic Purchase Order Intelligence

A production-grade, multi-agent document processing system that turns any PDF, scan, image, DOCX, or XLSX Purchase Order into a strict, validated JSON object with full detail preservation, visual grounding, and schema memory for repeat templates.

Built on CrewAI + LangChain + (optional) LangGraph, with PyMuPDF, pdfplumber, Tesseract, and optional HuggingFace (Donut / LayoutLMv3) and Ollama multimodal backends.


1. Architecture

 INPUT (pdf/png/jpg/docx/xlsx)
        │
        ▼
  ┌──────────────┐                              ┌─────────────────┐
  │  Planner     │────── fingerprint hit? ─────▶│ Memory Agent    │
  │  Agent       │◀─────── reuse schema ────────│ (FAISS)         │
  └──────┬───────┘                              └─────────────────┘
         │
   ┌─────┴──────┐    ┌─────────────┐    ┌──────────────────┐
   │ Loader     │───▶│ OCR/Vision  │───▶│ Structure        │
   │ (pymupdf,  │    │ (Tesseract, │    │ (tables + KVs)   │
   │  pdfplumber│    │  Donut,     │    │                  │
   │  docx,xlsx)│    │  LayoutLMv3,│    │                  │
   │            │    │  llava)     │    │                  │
   └────────────┘    └──────┬──────┘    └─────────┬────────┘
                            │                     │
                            └─────────┬───────────┘
                                      ▼
                           ┌─────────────────────┐
                           │ Labeling Agent      │
                           │ (Ollama LLM, JSON   │
                           │  mode, few-shot)    │
                           └─────────┬───────────┘
                                     │  low-confidence?
                                     ▼
                            ┌────────────────┐
                            │ Web Research   │
                            │ (DuckDuckGo)   │
                            └────────┬───────┘
                                     ▼
                            ┌────────────────┐
                            │ Output Agent   │
                            │ (Pydantic      │
                            │  validation)   │
                            └────────┬───────┘
                                     ▼
                              output/<file>.json
                                     │
                                     ▼
                          Memory Agent persists
                          learned schema for reuse

Agent roster

| Agent | Role | Key tools |
|---|---|---|
| Planner | Decides per-file pipeline; rejects non-POs if PO_ONLY=true | classify_pdf |
| Loader | Extracts text + layout + images | pdf_extract, docx_extract, xlsx_extract, pdf_pages_to_images |
| OCR & Vision | Recovers text from scans/images | ocr_pdf, ocr_image, donut_parse, vision_describe |
| Structure Extractor | Tables + key/value pairs | extract_tables, extract_kv_pairs |
| Memory | Schema fingerprint → FAISS recall / persist | schema_fingerprint, memory_lookup, memory_store |
| Labeler | Strict-JSON PO field extraction | (LLM) |
| Web Researcher | Clarifies unknown labels / vendor formats | web_search, fetch_url |
| Output Validator | Pydantic-validates final object | (LLM) |

LangSmith tracing

Set the following in .env to stream every LLM / chain / tool call to LangSmith (https://smith.langchain.com):

LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_TRACING=true
LANGSMITH_PROJECT=email-parsh
LANGSMITH_ENDPOINT=https://api.smith.langchain.com

src/config.py exports these into both the modern LANGSMITH_* and legacy LANGCHAIN_* environment variable names at import time, so every LangChain / CrewAI call is automatically traced without any code changes. Turn it off by setting LANGSMITH_TRACING=false.
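The aliasing can be sketched roughly as follows. This is illustrative, not the actual src/config.py; the legacy variable names follow LangSmith's documented conventions, but the project's real mapping may differ:

```python
import os

# Mirror the modern LANGSMITH_* variables into the legacy LANGCHAIN_*
# names so both old and new LangChain versions pick up tracing.
_ALIASES = {
    "LANGSMITH_API_KEY": "LANGCHAIN_API_KEY",
    "LANGSMITH_TRACING": "LANGCHAIN_TRACING_V2",
    "LANGSMITH_PROJECT": "LANGCHAIN_PROJECT",
    "LANGSMITH_ENDPOINT": "LANGCHAIN_ENDPOINT",
}

def export_tracing_env() -> None:
    """Copy each LANGSMITH_* value into its legacy LANGCHAIN_* twin."""
    for modern, legacy in _ALIASES.items():
        value = os.environ.get(modern)
        if value is not None:
            # setdefault: an explicitly set legacy variable still wins
            os.environ.setdefault(legacy, value)
```

Because the copy uses setdefault, an operator who has already set a LANGCHAIN_* variable directly is not overridden.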

LLM backend options

| Provider | .env keys | When to use |
|---|---|---|
| Ollama (local) | LLM_PROVIDER=ollama, LLM_MODEL=llama3.1:8b | Zero API cost, offline, slower on CPU |
| HuggingFace Inference | LLM_PROVIDER=hf, HF_TOKEN=..., HF_INFERENCE_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct | Fast, no local GPU needed, uses HF serverless |
| OpenAI-compatible | LLM_PROVIDER=openai, OPENAI_API_KEY=..., LLM_MODEL=gpt-4o-mini | Production SLAs, any OpenAI-compatible endpoint |

HF_TOKEN is also picked up automatically by transformers, huggingface_hub, and sentence-transformers for model downloads (gated Donut/LayoutLMv3, Nougat, private models). Set it once in .env and everything works.
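The three-way switch can be sketched as a small resolver. This is a hypothetical sketch of the idea, not the project's actual models/model_loader.py:

```python
# Resolve the three documented backends from .env-style variables into
# one settings dict; defaults mirror the table above.
def resolve_llm_settings(env: dict) -> dict:
    provider = env.get("LLM_PROVIDER", "ollama")
    if provider == "ollama":
        return {"provider": "ollama",
                "model": env.get("LLM_MODEL", "llama3.1:8b")}
    if provider == "hf":
        return {"provider": "hf",
                "model": env["HF_INFERENCE_MODEL"],
                "token": env["HF_TOKEN"]}
    if provider == "openai":
        return {"provider": "openai",
                "model": env.get("LLM_MODEL", "gpt-4o-mini"),
                "api_key": env["OPENAI_API_KEY"]}
    raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}")
```

A missing required key (e.g. HF_TOKEN when LLM_PROVIDER=hf) fails fast with a KeyError rather than silently degrading.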

Model selection — why these

  • Donut (naver-clova-ix/donut-base-finetuned-cord-v2) — best open model for end-to-end document understanding without relying on OCR; handles noisy receipts/POs very well. Feature-flagged (ENABLE_DONUT=true) because of GPU weight.
  • LayoutLMv3 — when you need token-level visual grounding for training/custom models.
  • Tesseract — dependable baseline OCR with word-level confidences + boxes.
  • Ollama llama3.1/mistral — strong local reasoning for labeling; zero API cost; JSON mode gives us schema-valid output.
  • Ollama llava — multimodal fallback for image-only POs.
  • sentence-transformers MiniLM — fast, small embeddings for FAISS schema memory.

2. Installation

# 1) Python env
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

# 2) OCR engine (needed for scanned PDFs)
winget install --id UB-Mannheim.TesseractOCR -e
# Then set TESSERACT_CMD in .env to the install path.

# 3) (optional) Poppler — needed by pdf2image on Windows
#    Download from https://github.com/oschwartz10612/poppler-windows/releases
#    Unzip and set POPPLER_PATH in .env to the bin directory.

# 4) LLM backend — Ollama (recommended)
# https://ollama.com/download
ollama pull llama3.1:8b
ollama pull llava:7b       # only if you want vision fallback

# 5) Configure
copy .env.example .env
#   edit paths / LLM settings

3. Usage

# Process every file under samples/
python run.py

# Process one file
python run.py "samples\PO-TVP-548.pdf"

# Or via the full CrewAI agentic orchestrator (verbose trace)
python run.py --crew

Output JSON files land in output/ with the same basename. A summary table is printed to the console.


4. Output schema

{
  "document_type": "purchase_order",
  "document_type_confidence": 0.97,
  "is_purchase_order": true,
  "metadata": {
    "file_name": "PO-TVP-548.pdf",
    "file_path": "...",
    "file_size_bytes": 70011,
    "mime_type": "application/pdf",
    "page_count": 1,
    "is_scanned": false,
    "pipeline_path": "digital_pdf",
    "processing_seconds": 6.12
  },
  "fields": {
    "po_number":   {"value": "PO-TVP-548", "raw": "PO No: PO-TVP-548", "confidence": 0.96},
    "po_date":     {"value": "2025-04-12", "raw": "12-Apr-2025",       "confidence": 0.93},
    "vendor":   {"name": {"value": "Acme Traders", "confidence": 0.95}, "address": {...}},
    "buyer":    {"name": {"value": "Lancer International", "confidence": 0.94}},
    "ship_to":  {"address": {"value": "...", "confidence": 0.88}},
    "items": [
      {
        "description": {"value": "SS Flange 150#", "confidence": 0.91},
        "quantity":    {"value": 10, "confidence": 0.95},
        "unit_price":  {"value": 1250.0, "confidence": 0.93},
        "line_total":  {"value": 12500.0, "confidence": 0.94}
      }
    ],
    "totals": {
      "subtotal":   {"value": 12500.0, "confidence": 0.94},
      "tax_total":  {"value": 2250.0,  "confidence": 0.93},
      "grand_total":{"value": 14750.0, "confidence": 0.95},
      "currency":   {"value": "INR",    "confidence": 0.92}
    },
    "extras": { "incoterm": "FOB Chennai" }
  },
  "tables": [{...}],
  "raw_text": "…full document text…",
  "confidence_scores": {"po_number": 0.96, "totals.grand_total": 0.95, ...},
  "warnings": [],
  "schema_fingerprint": "acme po no date vendor… :: TBL:desc|qty|rate|amount",
  "reused_memory_template": null
}

Every leaf field is a GroundedValue with value, raw, optional bbox, and confidence. Unknown fields land in fields.extras so no data is ever dropped.
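For illustration, the GroundedValue shape can be sketched with a stdlib dataclass. The actual contract lives in models/schemas.py as Pydantic models and may carry more fields; the bbox tuple layout here is an assumption:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroundedValue:
    value: object                 # normalized value ("PO-TVP-548", 12500.0, ...)
    raw: Optional[str] = None     # verbatim source text ("PO No: PO-TVP-548")
    bbox: Optional[tuple] = None  # e.g. (page, x0, y0, x1, y1) when grounded
    confidence: float = 0.0       # 0.0 .. 1.0

po_number = GroundedValue(value="PO-TVP-548",
                          raw="PO No: PO-TVP-548",
                          confidence=0.96)
```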


5. Intelligence / learning loop

  1. For each new document we compute a layout fingerprint from top-of-page tokens, table headers, and KV keys.
  2. FAISS returns the nearest known template. If similarity ≥ 0.92 we inject that schema as few-shot context, skipping re-learning.
  3. After labeling, if no hit existed, the new schema is persisted — subsequent runs on the same template are faster and more consistent.
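The loop above can be sketched as follows. The real system embeds fingerprints with MiniLM and searches FAISS; this stand-in uses a plain string-similarity ratio purely to illustrate the ≥ 0.92 reuse threshold:

```python
from difflib import SequenceMatcher

REUSE_THRESHOLD = 0.92  # from the learning-loop description above

def fingerprint(top_tokens, table_headers):
    # Matches the documented shape: "<tokens> :: TBL:<col>|<col>|..."
    return " ".join(top_tokens) + " :: TBL:" + "|".join(table_headers)

def recall_schema(fp, store):
    """Return (schema, similarity) for the closest stored template,
    or (None, best_similarity) when nothing clears the threshold."""
    best, best_sim = None, 0.0
    for known_fp, schema in store.items():
        sim = SequenceMatcher(None, fp, known_fp).ratio()
        if sim > best_sim:
            best, best_sim = schema, sim
    return (best, best_sim) if best_sim >= REUSE_THRESHOLD else (None, best_sim)
```

On a hit, the returned schema is injected as few-shot context for the Labeler; on a miss, the new fingerprint/schema pair is stored.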

6. Extensibility

  • Add a new labeled field → extend models/schemas.py + mention it in agents/labeling_agent.LABELING_SYSTEM.
  • Add a new file type → add a tool in src/tools/, register it in tools/__init__.py, and teach the Planner.
  • Swap the LLM → change LLM_PROVIDER / LLM_MODEL in .env.

7. Email ingestion (IMAP → extraction)

The email subsystem is an additive feature — nothing in the existing extraction pipeline is modified. An Email Ingestion Agent monitors the configured IMAP inbox, classifies each message as PO or not, downloads attachments, and forwards supported files into pipeline.graph.run_graph.

Security

  • Credentials are read from IMAP_USER / IMAP_PASS in .env or the process environment. Never hardcoded. .env is in .gitignore.
  • Connection is always IMAP4_SSL (default port 993).
  • For Gmail, create an App Password (16 chars). Your normal password will not work with 2FA.
  • For production, swap the .env source for AWS Secrets Manager / Azure Key Vault by overriding IMAP_USER/IMAP_PASS in the process environment before import; src/config.py will pick them up transparently.

.env keys

IMAP_USER=you@example.com
IMAP_PASS=xxxxxxxxxxxxxxxx
IMAP_HOST=imap.gmail.com
IMAP_PORT=993
IMAP_FOLDER=INBOX
IMAP_POLL_SECONDS=60
IMAP_MARK_SEEN=true        # marks \Seen on the server after processing
IMAP_SEARCH=UNSEEN         # IMAP SEARCH criterion
IMAP_LOOKBACK_DAYS=7
IMAP_MAX_BATCH=25
ATTACHMENT_DIR=<path to save attachments>
SEEN_STORE_PATH=<path to jsonl dedup log>
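How IMAP_SEARCH and IMAP_LOOKBACK_DAYS combine can be sketched as a SEARCH-criterion builder. This is an illustrative helper, not the project's imap_client.py; month names are formatted by hand because RFC 3501 requires the English abbreviations regardless of locale:

```python
from datetime import date, timedelta

_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def imap_date(d):
    """Format a date the way IMAP SEARCH expects: 12-Apr-2025."""
    return f"{d.day:02d}-{_MONTHS[d.month - 1]}-{d.year}"

def build_search(criterion="UNSEEN", lookback_days=7, today=None):
    """Combine the criterion with a SINCE window into one SEARCH string."""
    since = (today or date.today()) - timedelta(days=lookback_days)
    return f"({criterion} SINCE {imap_date(since)})"
```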

Classification pipeline

For each unread email:

  1. Keyword prefilter — subject/body regex + attachment presence. Strong matches (score ≥ 0.75) and clearly-null messages skip the LLM entirely.
  2. LLM classifier — when the signal is ambiguous, an Ollama JSON call returns {is_purchase_order, confidence, reason}. Blended with the keyword signal.
  3. Result is appended to memory_store/email_runs.jsonl for audit.
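The two-stage decision above can be sketched as follows. The 0.75 short-circuit comes from the text; the regex, the individual scores, and the 50/50 blend are illustrative assumptions, not the project's classifier.py:

```python
import re

# Hypothetical prefilter pattern; the real keyword set is richer.
PO_PATTERN = re.compile(r"\b(purchase\s+order|p\.?o\.?\s*(no|number|#))\b", re.I)

def keyword_score(subject, body, has_attachment):
    score = 0.0
    if PO_PATTERN.search(subject):
        score += 0.5
    if PO_PATTERN.search(body):
        score += 0.25
    if has_attachment:
        score += 0.25
    return score

def classify(subject, body, has_attachment, llm=None):
    """Return (is_purchase_order, blended_confidence)."""
    kw = keyword_score(subject, body, has_attachment)
    if kw >= 0.75 or (kw == 0.0 and not has_attachment):
        return kw >= 0.75, kw           # strong match / clearly null: skip LLM
    llm_conf = llm(subject, body) if llm else 0.5  # ambiguous: ask the model
    blended = 0.5 * kw + 0.5 * llm_conf
    return blended >= 0.5, blended
```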

Attachment handling

  • Only .pdf .png .jpg .jpeg .tif .tiff .docx .xlsx are retained.
  • Filenames are sanitized: characters outside [A-Za-z0-9._\- ] are replaced with _; collisions get __1, __2.
  • Size cap 25 MB per attachment.
  • One folder per email: attachments/uid-<UID>-<mid-hash>/.
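The retention, sanitization, and collision rules above can be sketched like this (an illustrative condensation; the real attachment_handler.py may differ in detail):

```python
import re
from pathlib import PurePosixPath

ALLOWED = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".docx", ".xlsx"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB cap per attachment

def sanitize(name):
    """Replace any character outside the safe set with an underscore."""
    return re.sub(r"[^A-Za-z0-9._\- ]", "_", name)

def accept(name, size_bytes):
    """Keep only allowed extensions under the size cap."""
    return PurePosixPath(name.lower()).suffix in ALLOWED and size_bytes <= MAX_BYTES

def dedupe_name(name, taken):
    """Resolve collisions as name, name__1, name__2, ..."""
    stem, dot, ext = name.rpartition(".")
    candidate, n = name, 0
    while candidate in taken:
        n += 1
        candidate = f"{stem}__{n}.{ext}" if dot else f"{name}__{n}"
    return candidate
```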

Dedup

Two guards so the same email is never processed twice:

  • Server-side: IMAP_MARK_SEEN=true sets the \Seen flag after processing so subsequent UNSEEN searches skip it.
  • Client-side: a JSONL seen-store at SEEN_STORE_PATH keyed by Message-ID (fallback UID) — survives even if the server flag is cleared.
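The client-side guard can be sketched as an append-only JSONL store keyed by Message-ID with UID as fallback. Class and record names here are illustrative, not the project's seen_store.py:

```python
import json
from pathlib import Path

class SeenStore:
    """Append-only JSONL dedup log; survives restarts and cleared flags."""

    def __init__(self, path):
        self.path = Path(path)
        self._seen = set()
        if self.path.exists():
            for line in self.path.read_text().splitlines():
                self._seen.add(json.loads(line)["key"])

    @staticmethod
    def key(message_id, uid):
        """Prefer Message-ID; fall back to the server UID."""
        return message_id or f"uid:{uid}"

    def __contains__(self, key):
        return key in self._seen

    def mark(self, key):
        if key not in self._seen:
            with self.path.open("a") as f:
                f.write(json.dumps({"key": key}) + "\n")
            self._seen.add(key)
```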

Usage

# one cycle, then exit (good for cron / Task Scheduler)
python run_email.py --once

# long-running poller
python run_email.py --poll 60

# widen the IMAP SINCE window to 30 days, skip LLM classifier
python run_email.py --once --lookback 30 --no-llm

# verbose logging
python run_email.py --once -v

The runner prints a Rich summary table per cycle and appends full details to memory_store/email_runs.jsonl. Each PO attachment produces its own output/<basename>.json via the existing extraction graph — unchanged.

Files added

src/email_ingestion/
├── __init__.py
├── imap_client.py        # IMAP4_SSL + parsing + retry
├── seen_store.py         # JSONL dedup
├── classifier.py         # keyword + LLM PO classifier
├── attachment_handler.py # sanitize, validate, save
└── runner.py             # poll loop + wiring to run_graph

src/tools/email_tools.py  # LangChain tools for the Email Agent
src/agents/email_agent.py # CrewAI "Email Ingestion Specialist"
run_email.py              # CLI entrypoint

8. Files

Phares/
├── run.py                        # entry
├── requirements.txt
├── .env.example
├── README.md
├── samples/                      # input POs
├── output/                       # JSON results
├── memory_store/                 # FAISS + records.jsonl
└── src/
    ├── main.py                   # CLI
    ├── config.py
    ├── models/
    │   ├── schemas.py            # Pydantic contract
    │   └── model_loader.py       # LLM + embedder factories
    ├── tools/                    # LangChain tools
    │   ├── pdf_tools.py
    │   ├── ocr_tools.py
    │   ├── vision_tools.py
    │   ├── layout_tools.py
    │   ├── office_tools.py
    │   ├── memory_tools.py
    │   └── search_tools.py
    ├── memory/vector_store.py    # FAISS schema memory
    ├── agents/                   # CrewAI agents
    │   ├── planner_agent.py
    │   ├── loader_agent.py
    │   ├── ocr_vision_agent.py
    │   ├── structure_agent.py
    │   ├── labeling_agent.py
    │   ├── web_research_agent.py
    │   ├── memory_agent.py
    │   └── output_agent.py
    ├── pipeline/
    │   ├── graph.py              # deterministic state machine (default)
    │   └── crew.py               # CrewAI orchestration (--crew)
    └── utils/file_utils.py
