
Agentic Purchase Order Intelligence — multi-agent document extraction with visual grounding and schema memory.


Phares — Agentic Purchase Order Intelligence

A production-grade, multi-agent document processing system that turns any PDF, scan, image, DOCX, or XLSX Purchase Order into a strict, validated JSON object with full detail preservation, visual grounding, and schema memory for repeat templates.

Built on CrewAI + LangChain + (optional) LangGraph, with PyMuPDF, pdfplumber, Tesseract, and optional HuggingFace (Donut / LayoutLMv3) and Ollama multimodal backends.


1. Architecture

 INPUT (pdf/png/jpg/docx/xlsx)
        │
        ▼
  ┌──────────────┐                              ┌─────────────────┐
  │  Planner     │────── fingerprint hit? ─────▶│ Memory Agent    │
  │  Agent       │◀─────── reuse schema ────────│ (FAISS)         │
  └──────┬───────┘                              └─────────────────┘
         │
   ┌─────┴──────┐    ┌─────────────┐    ┌──────────────────┐
   │ Loader     │───▶│ OCR/Vision  │───▶│ Structure        │
   │ (pymupdf,  │    │ (Tesseract, │    │ (tables + KVs)   │
   │  pdfplumber│    │  Donut,     │    │                  │
   │  docx,xlsx)│    │  LayoutLMv3,│    │                  │
   │            │    │  llava)     │    │                  │
   └────────────┘    └──────┬──────┘    └─────────┬────────┘
                            │                     │
                            └─────────┬───────────┘
                                      ▼
                           ┌─────────────────────┐
                           │ Labeling Agent      │
                           │ (Ollama LLM, JSON   │
                           │  mode, few-shot)    │
                           └─────────┬───────────┘
                                     │  low-confidence?
                                     ▼
                            ┌────────────────┐
                            │ Web Research   │
                            │ (DuckDuckGo)   │
                            └────────┬───────┘
                                     ▼
                            ┌────────────────┐
                            │ Output Agent   │
                            │ (Pydantic      │
                            │  validation)   │
                            └────────┬───────┘
                                     ▼
                              output/<file>.json
                                     │
                                     ▼
                          Memory Agent persists
                          learned schema for reuse

Agent roster

| Agent | Role | Key tools |
|---|---|---|
| Planner | Decides per-file pipeline; rejects non-POs if PO_ONLY=true | classify_pdf |
| Loader | Extracts text + layout + images | pdf_extract, docx_extract, xlsx_extract, pdf_pages_to_images |
| OCR & Vision | Recovers text from scans/images | ocr_pdf, ocr_image, donut_parse, vision_describe |
| Structure Extractor | Tables + key/value pairs | extract_tables, extract_kv_pairs |
| Memory | Schema fingerprint → FAISS recall / persist | schema_fingerprint, memory_lookup, memory_store |
| Labeler | Strict-JSON PO field extraction | (LLM) |
| Web Researcher | Clarifies unknown labels / vendor formats | web_search, fetch_url |
| Output Validator | Pydantic-validates final object | (LLM) |

LangSmith tracing

Set the following in .env to stream every LLM / chain / tool call to LangSmith (https://smith.langchain.com):

LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_TRACING=true
LANGSMITH_PROJECT=email-parsh
LANGSMITH_ENDPOINT=https://api.smith.langchain.com

src/config.py exports these into both the modern LANGSMITH_* and legacy LANGCHAIN_* environment variable names at import time, so every LangChain / CrewAI call is automatically traced without any code changes. Turn it off by setting LANGSMITH_TRACING=false.
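The aliasing can be sketched roughly as follows. This is illustrative, not the actual src/config.py; the legacy variable names follow LangSmith's documented conventions, but the project's real mapping may differ:

```python
import os

# Mirror the modern LANGSMITH_* variables into the legacy LANGCHAIN_*
# names so both old and new LangChain versions pick up tracing.
_ALIASES = {
    "LANGSMITH_API_KEY": "LANGCHAIN_API_KEY",
    "LANGSMITH_TRACING": "LANGCHAIN_TRACING_V2",
    "LANGSMITH_PROJECT": "LANGCHAIN_PROJECT",
    "LANGSMITH_ENDPOINT": "LANGCHAIN_ENDPOINT",
}

def export_tracing_env() -> None:
    """Copy each LANGSMITH_* value into its legacy LANGCHAIN_* twin."""
    for modern, legacy in _ALIASES.items():
        value = os.environ.get(modern)
        if value is not None:
            # setdefault: an explicitly set legacy variable still wins
            os.environ.setdefault(legacy, value)
```

Because the copy uses setdefault, an operator who has already set a LANGCHAIN_* variable directly is not overridden.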

LLM backend options

| Provider | .env keys | When to use |
|---|---|---|
| Ollama (local) | LLM_PROVIDER=ollama, LLM_MODEL=llama3.1:8b | Zero API cost, offline, slower on CPU |
| HuggingFace Inference | LLM_PROVIDER=hf, HF_TOKEN=..., HF_INFERENCE_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct | Fast, no local GPU needed, uses HF serverless |
| OpenAI-compatible | LLM_PROVIDER=openai, OPENAI_API_KEY=..., LLM_MODEL=gpt-4o-mini | Production SLAs, any OpenAI-compatible endpoint |

HF_TOKEN is also picked up automatically by transformers, huggingface_hub, and sentence-transformers for model downloads (gated Donut/LayoutLMv3, Nougat, private models). Set it once in .env and everything works.
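The three-way switch can be sketched as a small resolver. This is a hypothetical sketch of the idea, not the project's actual models/model_loader.py:

```python
# Resolve the three documented backends from .env-style variables into
# one settings dict; defaults mirror the table above.
def resolve_llm_settings(env: dict) -> dict:
    provider = env.get("LLM_PROVIDER", "ollama")
    if provider == "ollama":
        return {"provider": "ollama",
                "model": env.get("LLM_MODEL", "llama3.1:8b")}
    if provider == "hf":
        return {"provider": "hf",
                "model": env["HF_INFERENCE_MODEL"],
                "token": env["HF_TOKEN"]}
    if provider == "openai":
        return {"provider": "openai",
                "model": env.get("LLM_MODEL", "gpt-4o-mini"),
                "api_key": env["OPENAI_API_KEY"]}
    raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}")
```

A missing required key (e.g. HF_TOKEN when LLM_PROVIDER=hf) fails fast with a KeyError rather than silently degrading.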

Model selection — why these

  • Donut (naver-clova-ix/donut-base-finetuned-cord-v2) — best open model for end-to-end document understanding without relying on OCR; handles noisy receipts/POs very well. Feature-flagged (ENABLE_DONUT=true) because of GPU weight.
  • LayoutLMv3 — when you need token-level visual grounding for training/custom models.
  • Tesseract — dependable baseline OCR with word-level confidences + boxes.
  • Ollama llama3.1/mistral — strong local reasoning for labeling; zero API cost; JSON mode gives us schema-valid output.
  • Ollama llava — multimodal fallback for image-only POs.
  • sentence-transformers MiniLM — fast, small embeddings for FAISS schema memory.

2. Installation

# 1) Python env
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

# 2) OCR engine (needed for scanned PDFs)
winget install --id UB-Mannheim.TesseractOCR -e
# Then set TESSERACT_CMD in .env to the install path.

# 3) (optional) Poppler — needed by pdf2image on Windows
#    Download from https://github.com/oschwartz10612/poppler-windows/releases
#    Unzip and set POPPLER_PATH in .env to the bin directory.

# 4) LLM backend — Ollama (recommended)
# https://ollama.com/download
ollama pull llama3.1:8b
ollama pull llava:7b       # only if you want vision fallback

# 5) Configure
copy .env.example .env
#   edit paths / LLM settings

3. Usage

# Process every file under samples/
python run.py

# Process one file
python run.py "samples\PO-TVP-548.pdf"

# Or via the full CrewAI agentic orchestrator (verbose trace)
python run.py --crew

Output JSON files land in output/ with the same basename. A summary table is printed to the console.


4. Output schema

{
  "document_type": "purchase_order",
  "document_type_confidence": 0.97,
  "is_purchase_order": true,
  "metadata": {
    "file_name": "PO-TVP-548.pdf",
    "file_path": "...",
    "file_size_bytes": 70011,
    "mime_type": "application/pdf",
    "page_count": 1,
    "is_scanned": false,
    "pipeline_path": "digital_pdf",
    "processing_seconds": 6.12
  },
  "fields": {
    "po_number":   {"value": "PO-TVP-548", "raw": "PO No: PO-TVP-548", "confidence": 0.96},
    "po_date":     {"value": "2025-04-12", "raw": "12-Apr-2025",       "confidence": 0.93},
    "vendor":   {"name": {"value": "Acme Traders", "confidence": 0.95}, "address": {...}},
    "buyer":    {"name": {"value": "Lancer International", "confidence": 0.94}},
    "ship_to":  {"address": {"value": "...", "confidence": 0.88}},
    "items": [
      {
        "description": {"value": "SS Flange 150#", "confidence": 0.91},
        "quantity":    {"value": 10, "confidence": 0.95},
        "unit_price":  {"value": 1250.0, "confidence": 0.93},
        "line_total":  {"value": 12500.0, "confidence": 0.94}
      }
    ],
    "totals": {
      "subtotal":   {"value": 12500.0, "confidence": 0.94},
      "tax_total":  {"value": 2250.0,  "confidence": 0.93},
      "grand_total":{"value": 14750.0, "confidence": 0.95},
      "currency":   {"value": "INR",    "confidence": 0.92}
    },
    "extras": { "incoterm": "FOB Chennai" }
  },
  "tables": [{...}],
  "raw_text": "…full document text…",
  "confidence_scores": {"po_number": 0.96, "totals.grand_total": 0.95, ...},
  "warnings": [],
  "schema_fingerprint": "acme po no date vendor… :: TBL:desc|qty|rate|amount",
  "reused_memory_template": null
}

Every leaf field is a GroundedValue with value, raw, optional bbox, and confidence. Unknown fields land in fields.extras so no data is ever dropped.
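For illustration, the GroundedValue shape can be sketched with a stdlib dataclass. The actual contract lives in models/schemas.py as Pydantic models and may carry more fields; the bbox tuple layout here is an assumption:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroundedValue:
    value: object                 # normalized value ("PO-TVP-548", 12500.0, ...)
    raw: Optional[str] = None     # verbatim source text ("PO No: PO-TVP-548")
    bbox: Optional[tuple] = None  # e.g. (page, x0, y0, x1, y1) when grounded
    confidence: float = 0.0       # 0.0 .. 1.0

po_number = GroundedValue(value="PO-TVP-548",
                          raw="PO No: PO-TVP-548",
                          confidence=0.96)
```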


5. Intelligence / learning loop

  1. For each new document we compute a layout fingerprint from top-of-page tokens, table headers, and KV keys.
  2. FAISS returns the nearest known template. If similarity ≥ 0.92 we inject that schema as few-shot context, skipping re-learning.
  3. After labeling, if no hit existed, the new schema is persisted — subsequent runs on the same template are faster and more consistent.
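The loop above can be sketched as follows. The real system embeds fingerprints with MiniLM and searches FAISS; this stand-in uses a plain string-similarity ratio purely to illustrate the ≥ 0.92 reuse threshold:

```python
from difflib import SequenceMatcher

REUSE_THRESHOLD = 0.92  # from the learning-loop description above

def fingerprint(top_tokens, table_headers):
    # Matches the documented shape: "<tokens> :: TBL:<col>|<col>|..."
    return " ".join(top_tokens) + " :: TBL:" + "|".join(table_headers)

def recall_schema(fp, store):
    """Return (schema, similarity) for the closest stored template,
    or (None, best_similarity) when nothing clears the threshold."""
    best, best_sim = None, 0.0
    for known_fp, schema in store.items():
        sim = SequenceMatcher(None, fp, known_fp).ratio()
        if sim > best_sim:
            best, best_sim = schema, sim
    return (best, best_sim) if best_sim >= REUSE_THRESHOLD else (None, best_sim)
```

On a hit, the returned schema is injected as few-shot context for the Labeler; on a miss, the new fingerprint/schema pair is stored.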

6. Extensibility

  • Add a new labeled field → extend models/schemas.py + mention it in agents/labeling_agent.LABELING_SYSTEM.
  • Add a new file type → add a tool in src/tools/, register it in tools/__init__.py, and teach the Planner.
  • Swap the LLM → change LLM_PROVIDER / LLM_MODEL in .env.

7. Email ingestion (IMAP → extraction)

The email subsystem is an additive feature — nothing in the existing extraction pipeline is modified. An Email Ingestion Agent monitors the configured IMAP inbox, classifies each message as PO or not, downloads attachments, and forwards supported files into pipeline.graph.run_graph.

Security

  • Credentials are read from IMAP_USER / IMAP_PASS in .env or the process environment. Never hardcoded. .env is in .gitignore.
  • Connection is always IMAP4_SSL (default port 993).
  • For Gmail, create an App Password (16 chars). Your normal password will not work with 2FA.
  • For production, swap the .env source for AWS Secrets Manager / Azure Key Vault by overriding IMAP_USER/IMAP_PASS in the process environment before import; src/config.py will pick them up transparently.

.env keys

IMAP_USER=you@example.com
IMAP_PASS=xxxxxxxxxxxxxxxx
IMAP_HOST=imap.gmail.com
IMAP_PORT=993
IMAP_FOLDER=INBOX
IMAP_POLL_SECONDS=60
IMAP_MARK_SEEN=true        # marks \Seen on the server after processing
IMAP_SEARCH=UNSEEN         # IMAP SEARCH criterion
IMAP_LOOKBACK_DAYS=7
IMAP_MAX_BATCH=25
ATTACHMENT_DIR=<path to save attachments>
SEEN_STORE_PATH=<path to jsonl dedup log>
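How IMAP_SEARCH and IMAP_LOOKBACK_DAYS combine can be sketched as a SEARCH-criterion builder. This is an illustrative helper, not the project's imap_client.py; month names are formatted by hand because RFC 3501 requires the English abbreviations regardless of locale:

```python
from datetime import date, timedelta

_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def imap_date(d):
    """Format a date the way IMAP SEARCH expects: 12-Apr-2025."""
    return f"{d.day:02d}-{_MONTHS[d.month - 1]}-{d.year}"

def build_search(criterion="UNSEEN", lookback_days=7, today=None):
    """Combine the criterion with a SINCE window into one SEARCH string."""
    since = (today or date.today()) - timedelta(days=lookback_days)
    return f"({criterion} SINCE {imap_date(since)})"
```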

Classification pipeline

For each unread email:

  1. Keyword prefilter — subject/body regex + attachment presence. Strong matches (score ≥ 0.75) and clearly-null messages skip the LLM entirely.
  2. LLM classifier — when the signal is ambiguous, an Ollama JSON call returns {is_purchase_order, confidence, reason}. Blended with the keyword signal.
  3. Result is appended to memory_store/email_runs.jsonl for audit.
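The two-stage decision above can be sketched as follows. The 0.75 short-circuit comes from the text; the regex, the individual scores, and the 50/50 blend are illustrative assumptions, not the project's classifier.py:

```python
import re

# Hypothetical prefilter pattern; the real keyword set is richer.
PO_PATTERN = re.compile(r"\b(purchase\s+order|p\.?o\.?\s*(no|number|#))\b", re.I)

def keyword_score(subject, body, has_attachment):
    score = 0.0
    if PO_PATTERN.search(subject):
        score += 0.5
    if PO_PATTERN.search(body):
        score += 0.25
    if has_attachment:
        score += 0.25
    return score

def classify(subject, body, has_attachment, llm=None):
    """Return (is_purchase_order, blended_confidence)."""
    kw = keyword_score(subject, body, has_attachment)
    if kw >= 0.75 or (kw == 0.0 and not has_attachment):
        return kw >= 0.75, kw           # strong match / clearly null: skip LLM
    llm_conf = llm(subject, body) if llm else 0.5  # ambiguous: ask the model
    blended = 0.5 * kw + 0.5 * llm_conf
    return blended >= 0.5, blended
```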

Attachment handling

  • Only .pdf .png .jpg .jpeg .tif .tiff .docx .xlsx are retained.
  • Filenames are sanitized: characters outside [A-Za-z0-9._\- ] are replaced with _; collisions get __1, __2.
  • Size cap 25 MB per attachment.
  • One folder per email: attachments/uid-<UID>-<mid-hash>/.
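The retention, sanitization, and collision rules above can be sketched like this (an illustrative condensation; the real attachment_handler.py may differ in detail):

```python
import re
from pathlib import PurePosixPath

ALLOWED = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".docx", ".xlsx"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB cap per attachment

def sanitize(name):
    """Replace any character outside the safe set with an underscore."""
    return re.sub(r"[^A-Za-z0-9._\- ]", "_", name)

def accept(name, size_bytes):
    """Keep only allowed extensions under the size cap."""
    return PurePosixPath(name.lower()).suffix in ALLOWED and size_bytes <= MAX_BYTES

def dedupe_name(name, taken):
    """Resolve collisions as name, name__1, name__2, ..."""
    stem, dot, ext = name.rpartition(".")
    candidate, n = name, 0
    while candidate in taken:
        n += 1
        candidate = f"{stem}__{n}.{ext}" if dot else f"{name}__{n}"
    return candidate
```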

Dedup

Two guards so the same email is never processed twice:

  • Server-side: IMAP_MARK_SEEN=true sets the \Seen flag after processing so subsequent UNSEEN searches skip it.
  • Client-side: a JSONL seen-store at SEEN_STORE_PATH keyed by Message-ID (fallback UID) — survives even if the server flag is cleared.
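The client-side guard can be sketched as an append-only JSONL store keyed by Message-ID with UID as fallback. Class and record names here are illustrative, not the project's seen_store.py:

```python
import json
from pathlib import Path

class SeenStore:
    """Append-only JSONL dedup log; survives restarts and cleared flags."""

    def __init__(self, path):
        self.path = Path(path)
        self._seen = set()
        if self.path.exists():
            for line in self.path.read_text().splitlines():
                self._seen.add(json.loads(line)["key"])

    @staticmethod
    def key(message_id, uid):
        """Prefer Message-ID; fall back to the server UID."""
        return message_id or f"uid:{uid}"

    def __contains__(self, key):
        return key in self._seen

    def mark(self, key):
        if key not in self._seen:
            with self.path.open("a") as f:
                f.write(json.dumps({"key": key}) + "\n")
            self._seen.add(key)
```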

Usage

# one cycle, then exit (good for cron / Task Scheduler)
python run_email.py --once

# long-running poller
python run_email.py --poll 60

# widen the IMAP SINCE window to 30 days, skip LLM classifier
python run_email.py --once --lookback 30 --no-llm

# verbose logging
python run_email.py --once -v

The runner prints a Rich summary table per cycle and appends full details to memory_store/email_runs.jsonl. Each PO attachment produces its own output/<basename>.json via the existing extraction graph — unchanged.

Files added

src/email_ingestion/
├── __init__.py
├── imap_client.py        # IMAP4_SSL + parsing + retry
├── seen_store.py         # JSONL dedup
├── classifier.py         # keyword + LLM PO classifier
├── attachment_handler.py # sanitize, validate, save
└── runner.py             # poll loop + wiring to run_graph

src/tools/email_tools.py  # LangChain tools for the Email Agent
src/agents/email_agent.py # CrewAI "Email Ingestion Specialist"
run_email.py              # CLI entrypoint

8. Files

Phares/
├── run.py                        # entry
├── requirements.txt
├── .env.example
├── README.md
├── samples/                      # input POs
├── output/                       # JSON results
├── memory_store/                 # FAISS + records.jsonl
└── src/
    ├── main.py                   # CLI
    ├── config.py
    ├── models/
    │   ├── schemas.py            # Pydantic contract
    │   └── model_loader.py       # LLM + embedder factories
    ├── tools/                    # LangChain tools
    │   ├── pdf_tools.py
    │   ├── ocr_tools.py
    │   ├── vision_tools.py
    │   ├── layout_tools.py
    │   ├── office_tools.py
    │   ├── memory_tools.py
    │   └── search_tools.py
    ├── memory/vector_store.py    # FAISS schema memory
    ├── agents/                   # CrewAI agents
    │   ├── planner_agent.py
    │   ├── loader_agent.py
    │   ├── ocr_vision_agent.py
    │   ├── structure_agent.py
    │   ├── labeling_agent.py
    │   ├── web_research_agent.py
    │   ├── memory_agent.py
    │   └── output_agent.py
    ├── pipeline/
    │   ├── graph.py              # deterministic state machine (default)
    │   └── crew.py               # CrewAI orchestration (--crew)
    └── utils/file_utils.py
