Paxman

Contract-driven deterministic normalization engine for Python.

License: MIT | Code style: ruff | Type checked: mypy --strict | py.typed

Paxman transforms arbitrary input (PDFs, scans, emails, spreadsheets, APIs, free text) into evidence-backed, replayable normalized artifacts conforming to caller-supplied contracts (Pydantic, JSON Schema, OpenAPI, or a built-in Dict DSL).

from decimal import Decimal
from pydantic import BaseModel

import paxman

# IMPORTANT: import the adapter(s) you need so they self-register.
# Pydantic is an optional extra; the core package ships the registry
# but not the adapters themselves.
import paxman.contract.adapters.pydantic  # noqa: F401  (triggers self-registration)
import paxman.contract.adapters.dict_dsl  # noqa: F401


# Caller-owned contract (Pydantic example)
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float


class Invoice(BaseModel):
    supplier_name: str
    total_amount: float
    currency_code: str
    line_items: list[LineItem]


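# Placeholder raw input; in practice this would be the bytes of a PDF, email, etc.
raw_invoice_bytes = b"ACME Corp\nInvoice #1234\nTotal: $1,234.56 USD\n"

# Normalize raw input against the contract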
result = paxman.normalize(
    input_data=raw_invoice_bytes,
    contract=Invoice,
    budget=paxman.Budget(max_total_cost_usd=Decimal("0.10")),  # Decimal per ADR-0004
    policy=paxman.Policy(allow_remote_inference=True),
)

# Inspect or consume
print(result.normalized_data)        # matches the Invoice shape
print(result.unresolved_fields)      # any fields Paxman could not resolve
print(result.replay_hash)            # deterministic signature for replay

# Replay later from the artifact alone
rehydrated = paxman.replay(result, contract=Invoice)
assert rehydrated == result  # byte-equal
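
The budget above is given as a Decimal rather than a float. A minimal sketch of why that matters when summing costs against a cap follows. The per-step costs and the 0.30 cap are hypothetical values chosen for illustration; nothing here is part of Paxman's API.

```python
from decimal import Decimal

# Hypothetical per-step costs in USD (illustrative only, not produced by Paxman).
step_costs = ["0.1", "0.2"]

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = sum(float(c) for c in step_costs)

# Decimal keeps the written values exact, so the sum is exact too.
decimal_total = sum((Decimal(c) for c in step_costs), Decimal("0"))

cap = Decimal("0.30")  # hypothetical budget cap
print(float_total == 0.3)    # False: the float sum is slightly above 0.3
print(decimal_total == cap)  # True: exact comparison against the cap
```

An exact comparison like this is what lets a budget check give the same answer on every run.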

Why Paxman?

  • Contract-driven. You bring the contract. Paxman doesn't own your schema.
  • Field-centric, deterministic planning. Each required field gets its own plan.
  • Evidence-backed. Every resolved value carries provenance and confidence.
  • Replayable. Rehydrate the artifact without recomputation.
  • Honest. Unresolved fields are explicit, never silent.
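
To make "each required field gets its own plan" concrete, here is a minimal sketch of field-centric planning. The `FieldPlan` class, the `plan_fields` function, and the rule for choosing steps are assumptions made for illustration; the step names only borrow the capability names listed in the project structure and do not reflect Paxman's actual planner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldPlan:
    """Hypothetical per-field plan: an ordered tuple of capability names."""
    field_name: str
    steps: tuple[str, ...]


def plan_fields(contract: dict[str, str]) -> list[FieldPlan]:
    """Build one plan per field, in sorted order so the output is deterministic."""
    plans = []
    for name in sorted(contract):
        # Illustrative rule: numeric fields start with a regex pass, others with text.
        first_step = "regex" if contract[name] == "number" else "text"
        plans.append(FieldPlan(name, (first_step, "validation")))
    return plans


plans = plan_fields({"total_amount": "number", "supplier_name": "string"})
for plan in plans:
    print(plan.field_name, plan.steps)
```

Sorting the field names means the same contract always yields the same plan list, regardless of dictionary insertion order.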

What Paxman is NOT

  • Not a workflow engine.
  • Not a general-purpose agent framework.
  • Not a RAG framework.
  • Not a persistence layer.
  • Not a schema registry.
  • Not a standard library.
  • Not a domain ontology.

If you need any of these, wrap Paxman from the outside (see §When to use Paxman vs When to wrap Paxman below).

When to use Paxman vs When to wrap Paxman

Paxman is a library that produces an evidence-backed, replayable normalized artifact. Use Paxman directly when your problem is one of the following:

  • You have arbitrary input (text, PDF, JSON, HTML) that needs to be normalized against a caller-owned contract (Pydantic / JSON Schema / OpenAPI / Dict DSL).
  • You need evidence-backed normalization — every resolved value carries provenance, and every step is auditable.
  • You need replay — the ability to rehydrate a stored artifact without re-running the pipeline.
  • You need field-centric confidence — different fields can have different confidence, and the Reconciler grades the candidates with a single, fixed rubric.
  • You are integrating into a service (or a SaaS) that needs auditable normalization without owning a normalization engine.
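
The idea of grading candidates with a single, fixed rubric can be sketched as below. This is an illustration under stated assumptions, not Paxman's Reconciler: the `Candidate` class, its fields, and the tie-breaking order are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate value with provenance and a confidence score."""
    value: str
    source: str        # where the value came from (provenance)
    confidence: float  # 0.0 to 1.0


def reconcile(candidates: list[Candidate]) -> Optional[Candidate]:
    """Pick one winner using a fixed rule; return None if nothing was found."""
    if not candidates:
        return None  # the field stays explicitly unresolved
    # Highest confidence wins; ties break on source, then value, so the
    # result never depends on the order in which candidates arrived.
    return min(candidates, key=lambda c: (-c.confidence, c.source, c.value))


winner = reconcile([
    Candidate("ACME Corp", source="regex", confidence=0.7),
    Candidate("ACME Corporation", source="inference", confidence=0.9),
])
print(winner)
```

Because the rule is fixed and the tie-break is total, the same set of candidates always produces the same winner.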

Wrap Paxman from the outside when your problem is one of the following:

  • You need a workflow engine (DAG of long-running tasks, retries, human-in-the-loop, …). Wrap Paxman in a workflow engine.
  • You need a general-purpose agent framework (multi-turn reasoning, tool use, planning across many turns). Wrap Paxman behind an agent's tool call.
  • You need a RAG framework (vector search, retrieval, ranking). Wrap Paxman behind a RAG pipeline; the contract becomes the structured extraction step.
  • You need a persistence layer (database, ORM, migration tooling). Wrap Paxman in a service that stores the artifact.
  • You need a schema registry (catalog of contracts, versioning of contracts, governance). Wrap Paxman in a registry.
  • You need a standard library (general-purpose data transformation). Paxman is opinionated about evidence, replay, and confidence; it is not a general-purpose library.
  • You need a domain ontology (taxonomy, classification, knowledge graph). Wrap Paxman behind an ontology lookup.

In short: Paxman is the normalization step in a larger system. It is not the larger system. If you find yourself wanting to add workflow, persistence, or agentic features to Paxman itself, that is a signal to wrap Paxman from the outside.
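
As one illustration of wrapping from the outside, the sketch below keeps persistence in caller code. `normalize_and_store` and its parameters are hypothetical, the normalization step is passed in as a plain callable, and the artifact is assumed to be a JSON-serializable dict, which may not match how a real Paxman artifact is serialized.

```python
import json
from pathlib import Path
from typing import Any, Callable


def normalize_and_store(
    raw: str,
    normalize_fn: Callable[[str], dict[str, Any]],
    out_dir: Path,
    name: str,
) -> Path:
    """Run the normalization step, then persist the result outside of it."""
    artifact = normalize_fn(raw)  # the only place normalization happens
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.json"
    # sort_keys gives a stable file layout from run to run.
    path.write_text(json.dumps(artifact, sort_keys=True), encoding="utf-8")
    return path
```

The storage concern lives entirely in the wrapper, so swapping the file system for a database would not touch the normalization step.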

Install

pip install paxman                          # core (no adapters)
pip install "paxman[pydantic]"              # + Pydantic adapter (quoted so shells like zsh do not expand the brackets)
pip install "paxman[all]"                   # + all V1 adapters

Paxman is in pre-release (v0.x). Public API may change between minor versions until 1.0.
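
Since adapters depend on optional extras, it can help to check for the extra's dependency before importing an adapter. The helper below is a small sketch using only the standard library; `extra_available` is a hypothetical name, not a Paxman function.

```python
import importlib.util


def extra_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    # Use top-level names only: for dotted names, find_spec imports the parent
    # package and raises ModuleNotFoundError if that parent is missing.
    return importlib.util.find_spec(module_name) is not None


if not extra_available("pydantic"):
    print('Pydantic is not installed; try: pip install "paxman[pydantic]"')
```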

Documentation

Doc                         Purpose
PRD.md                      Product vision, philosophy, V1 success metrics and acceptance criteria.
ARCHITECTURE.md             Subsystem design, sequence diagram, error model, versioning, observability.
PACKAGE_STRUCTURE.md        Module layout, dependency DAG, public/private API split, packaging.
GLOSSARY.md                 Single source of truth for Paxman vocabulary.
V1_ACCEPTANCE_CRITERIA.md   Definition of done for the 1.0 release.
REPLAY_AND_DETERMINISM.md   Deep dive on replay and determinism.
SECURITY.md                 Threat model, PII handling, provider secrets, vulnerability reporting.
TESTING_STRATEGY.md         Test seams, property tests, replay tests, fixtures.
docs/TEST_DATA.md           Test data policy, dataset catalog, licensing rules.
DEVELOPMENT.md              Local dev setup, common tasks, release process.
EXTENDING.md                How to add a new contract adapter, capability, or inference provider.
DEPENDENCIES.md             Core vs optional dependencies, packaging policy.
docs/adr/                   Architecture Decision Records.
docs/concepts/              Conceptual docs (contracts, capabilities, planning, reconciliation, replay, MIGRATION_GUIDE).
docs/howto/                 Quick-start how-tos (add adapter, add capability, add inference provider, replay artifact).
CONTRIBUTING.md             Contribution workflow + ADR-driven process.
CODE_OF_CONDUCT.md          Community standards (Contributor Covenant v2.1).
CHANGELOG.md                Release notes.

Quickstart (5 minutes)

Note: Paxman V1 is in pre-release. The quickstart below is verified end-to-end in CI (see .github/workflows/ci.yml). For a full migration walkthrough (e.g. from LlamaIndex, LangChain, or a hand-rolled pipeline), see docs/concepts/MIGRATION_GUIDE.md.

1. Install

pip install "paxman[pydantic]"

2. Define a contract (Pydantic)

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float


class Invoice(BaseModel):
    supplier_name: str = Field(..., description="The supplier's name.")
    total_amount: float = Field(..., description="Total invoice amount.")
    currency_code: str = Field(..., description="ISO-4217 currency code.")
    line_items: list[LineItem] = Field(default_factory=list)

3. Normalize raw input

import paxman

# IMPORTANT: import the adapter(s) you need so they self-register.
# Pydantic is an optional extra; the core package ships the registry
# but not the adapters themselves.
import paxman.contract.adapters.pydantic  # noqa: F401
import paxman.contract.adapters.dict_dsl  # noqa: F401

raw_invoice = """
ACME Corp
Invoice #1234
Total: $1,234.56 USD
- Widget: 2 @ $500.00
- Gadget: 1 @ $234.56
"""

artifact = paxman.normalize(
    input_data=raw_invoice,
    contract=Invoice,
)

print(artifact.status)               # Status.SUCCESS or Status.PARTIAL_SUCCESS
print(artifact.normalized_data)      # {"supplier_name": "ACME Corp", ...}
print(artifact.unresolved_fields)    # []  (or list of fields Paxman could not resolve)
print(artifact.replay_hash)          # "a3f8..."
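
The adapter imports in this step exist only for their side effect: importing the module registers the adapter. A minimal sketch of that self-registration pattern follows. The registry, the decorator, and the function names are hypothetical and are not Paxman's internals.

```python
from typing import Callable

# Hypothetical registry; in a real package it would live in the core module.
_ADAPTERS: dict[str, Callable[[dict], dict]] = {}


def register_adapter(name: str):
    """Decorator that records an adapter under a name when its module loads."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        _ADAPTERS[name] = fn
        return fn
    return decorator


# In a real package this function would sit in its own module, so that
# importing the module is what triggers the registration.
@register_adapter("dict_dsl")
def adapt_dict(contract: dict) -> dict:
    return dict(contract)


def adapt(name: str, contract: dict) -> dict:
    """Look up an adapter by name; fail loudly if it was never imported."""
    try:
        adapter = _ADAPTERS[name]
    except KeyError:
        raise LookupError(f"adapter {name!r} is not registered") from None
    return adapter(contract)


print(adapt("dict_dsl", {"supplier_name": "string"}))
```

With this pattern, forgetting the import surfaces as a lookup error at call time, which is why the quickstart stresses importing the adapter modules first.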

4. Replay

# Later, with just the artifact and the contract
rehydrated = paxman.replay(artifact, contract=Invoice)
assert rehydrated == artifact  # byte-equal
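
A deterministic signature of this kind depends on a stable serialization of the data being hashed. The sketch below shows the general technique with the standard library: sorted keys, fixed separators, then SHA-256. It is an assumption-laden illustration, not Paxman's replay-hash algorithm, and it is not a full RFC 8785 implementation (number formatting, for example, is left to Python's `json` module).

```python
import hashlib
import json
from typing import Any


def canonical_json(data: Any) -> bytes:
    """Serialize with sorted keys and no optional whitespace."""
    text = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def stable_hash(data: Any) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_json(data)).hexdigest()


first = {"supplier_name": "ACME Corp", "currency_code": "USD"}
second = {"currency_code": "USD", "supplier_name": "ACME Corp"}  # same data, different order
print(stable_hash(first) == stable_hash(second))
```

Key order no longer affects the digest, so two runs that produce the same data produce the same hash.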

Examples

Paxman ships with three reference examples, one for each of the three target personas. Each is a standalone mini-package: clone the repo, cd into the example directory, and run it.

Backend service (Persona A: backend developer)

A minimal FastAPI service exposing POST /normalize for contract-driven normalization. Accepts raw text input, returns structured evidence-backed JSON with a deterministic replay hash.

cd examples/backend_service
uv pip install -e "../../[pydantic]" -e ".[dev]"
uvicorn backend_service.app:app --reload --port 8000

AI agent ingest (Persona B: AI engineer)

A stdlib-only agent tool-calling loop that invokes paxman.normalize() as a tool. Zero framework dependencies. Port the NormalizeTool to LangChain, LlamaIndex, or any custom agent.

  • Path: examples/ai_agent_ingest/
  • What it demonstrates: Agent tool loop, framework-agnostic design, evidence-backed extraction

cd examples/ai_agent_ingest
uv pip install -e ".[dev]"
uv run python -m ai_agent_ingest
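
The shape of a stdlib-only tool-calling loop can be sketched as follows. This is not the code in examples/ai_agent_ingest: `fake_normalize` is a stand-in for paxman.normalize(), and the `TOOLS` table, `run_agent`, and the tool-call dict layout are hypothetical.

```python
from typing import Any, Callable


def fake_normalize(text: str) -> dict[str, Any]:
    """Stand-in for the real normalization call; returns the first non-empty line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {"supplier_name": lines[0] if lines else None}


# Hypothetical tool table: tool name -> callable taking the tool input.
TOOLS: dict[str, Callable[[str], dict[str, Any]]] = {"normalize": fake_normalize}


def run_agent(tool_calls: list[dict[str, str]]) -> list[dict[str, Any]]:
    """Dispatch each requested tool call and collect the results in order."""
    results = []
    for call in tool_calls:
        tool = TOOLS.get(call["tool"])
        if tool is None:
            # Unknown tools are reported explicitly instead of being skipped.
            results.append({"tool": call["tool"], "error": "unknown tool"})
            continue
        results.append({"tool": call["tool"], "output": tool(call["input"])})
    return results


print(run_agent([{"tool": "normalize", "input": "ACME Corp\nInvoice #1234"}]))
```

Because the loop only needs a name-to-callable table, the same wrapper could be registered with whichever agent framework is in use.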

SaaS procurement pipeline (Persona C: SaaS team)

A CSV-batch invoice/quotation pipeline. Reads a manifest of raw input files, normalizes each against a Pydantic contract, writes artifacts to disk, and verifies cross-run replay-hash reproducibility.

  • Path: examples/saas_procurement/
  • What it demonstrates: Batch normalization, on-disk artifact storage, replay-hash determinism (D10.7 fixture)

cd examples/saas_procurement
uv pip install -e ".[dev]"
uv run python -m saas_procurement data/manifest.csv output/
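
The batch flow described above (read a manifest, normalize each file, write artifacts, compare hashes across runs) can be sketched with the standard library. Everything here is an assumption for illustration: the manifest is assumed to have a single `path` column, `stub_normalize` stands in for the real normalization call, and the hash is a plain SHA-256 of the written bytes rather than Paxman's replay hash.

```python
import csv
import hashlib
import json
from pathlib import Path
from typing import Any


def stub_normalize(text: str) -> dict[str, Any]:
    """Stand-in for the real normalization step."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {"supplier_name": lines[0] if lines else None, "line_count": len(lines)}


def run_batch(manifest: Path, out_dir: Path) -> dict[str, str]:
    """Normalize every file listed in the manifest; return path -> content hash."""
    out_dir.mkdir(parents=True, exist_ok=True)
    hashes: dict[str, str] = {}
    with manifest.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            src = manifest.parent / row["path"]  # paths are relative to the manifest
            artifact = stub_normalize(src.read_text(encoding="utf-8"))
            payload = json.dumps(artifact, sort_keys=True).encode("utf-8")
            (out_dir / f"{src.stem}.json").write_bytes(payload)
            hashes[row["path"]] = hashlib.sha256(payload).hexdigest()
    return hashes
```

Running `run_batch` twice on the same manifest and comparing the two returned dicts is a simple cross-run reproducibility check.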

Use cases

Paxman is designed for:

  • Invoice/quotation/procurement normalization — compare offers across suppliers and currencies.
  • Agentic ingestion flows — auditable, evidence-backed extraction for RAG or agent pipelines.
  • Document understanding services — wrap Paxman inside a SaaS without giving up replay or evidence.
  • Multi-source data pipelines — normalize email, OCR, CSV, and API inputs into one canonical schema.

See PRD.md §7 Primary Use Cases for detailed examples.

Status

  • v0.0.0 (Sprint 6) — Shipped: Full pipeline — contract adaptation, planning, execution, reconciliation, artifact, and public API (paxman.normalize(), paxman.replay()).
  • v0.0.0 + Sprint 7 — Shipped: paxman.testing (Hypothesis strategies), golden artifacts, end-to-end integration tests, per-subsystem coverage thresholds.
  • v0.0.0 + Sprint 8 — In progress: Documentation site (docs/concepts/, docs/howto/), community files (CONTRIBUTING.md, CODE_OF_CONDUCT.md), CI hardening (pyright, interrogate, bandit, pip-audit), 9-check make ci.
  • v0.1.0 (initial preview): planner + one adapter + one capability work end-to-end. (Pending.)
  • v0.5.0 (feature-complete beta): 80% of V1 features. (Pending.)
  • 1.0.0: All V1 acceptance criteria met. (Pending.)

Install (developer setup, Sprint 1)

Paxman uses uv for package management. The first preview is not published to PyPI yet; developers install the project from a working tree.

# Clone the repository
git clone https://github.com/nexusnv/paxman.git
cd paxman

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the package + all dev dependencies (editable)
uv sync --all-extras --dev

# Verify the install
uv run python -c "import paxman; print(f'paxman {paxman.__version__}')"

Expected output: paxman 0.0.0.

Local CI

Run the full local-CI pipeline (the same checks run on GitHub Actions):

make ci

This runs, in order: install-frozen → lint → format-check → typecheck → typecheck-pyright → imports → docs-check → security → test-cov. All 9 checks must pass before opening a PR. Each check is also runnable individually (e.g. make lint, make typecheck, make docs-check, make security).

Project structure

paxman/
├── src/paxman/              # the package (src-layout)
│   ├── __init__.py          # exposes __version__ + public API
│   ├── py.typed             # PEP 561 marker
│   ├── errors.py            # PaxmanError hierarchy
│   ├── types.py             # Status, ConfidenceBand, FieldType enums
│   ├── protocols.py         # internal Protocol definitions
│   ├── versioning.py        # version constants and helpers
│   ├── logging.py           # structlog factory (no timestamps in replay)
│   ├── budget.py            # Budget, Policy, CurrencyPolicy
│   ├── clock.py             # injectable Clock + FakeClock
│   ├── ids.py               # prefixed ID helpers
│   ├── serialization.py     # stable JSON encoder (RFC 8785-style)
│   ├── contract/            # adapter + validation (4 formats → CanonicalContract)
│   ├── planner/             # rule-based field-centric planning
│   ├── capabilities/        # 5 V1 capabilities (text/regex/lookup/inference/validation)
│   ├── executor/            # sequential execution + budget tracking
│   ├── reconciler/          # truth resolution + confidence + MONEY
│   ├── artifact/            # ExecutionArtifact + replay hash + diagnostics
│   ├── api/                 # public API (normalize, replay, register_*)
│   └── testing/             # public Hypothesis strategies (paxman.testing)
├── tests/                   # pytest test suite (unit / property / integration / public_api)
├── docs/                    # design specs, ADRs, sprint plan, concepts, howtos
├── pyproject.toml           # PEP 621 metadata + tooling config
├── Makefile                 # `make ci`, `make test`, `make build`, …
├── .pre-commit-config.yaml
├── .github/                 # workflows + issue/PR templates
├── LICENSE                  # MIT (per ADR-0008)
├── CONTRIBUTING.md          # contribution workflow + ADR-driven process
├── CODE_OF_CONDUCT.md       # Contributor Covenant v2.1
└── CHANGELOG.md             # release notes
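
The tree lists an injectable Clock and a FakeClock. The sketch below shows the general pattern of injecting time so that tests and replays do not depend on the wall clock. The class and method names are assumptions and may differ from what clock.py actually defines.

```python
from datetime import datetime, timedelta, timezone
from typing import Protocol


class Clock(Protocol):
    """Anything that can report the current time."""
    def now(self) -> datetime: ...


class SystemClock:
    """Real clock for production use."""
    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FakeClock:
    """Test clock that only moves when told to."""
    def __init__(self, start: datetime) -> None:
        self._current = start

    def now(self) -> datetime:
        return self._current

    def advance(self, seconds: float) -> None:
        self._current += timedelta(seconds=seconds)


def stamp(clock: Clock) -> str:
    """Code under test takes the clock as a parameter instead of calling datetime.now()."""
    return clock.now().isoformat()


clock = FakeClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(stamp(clock))  # 2024-01-01T00:00:00+00:00
clock.advance(90)
print(stamp(clock))  # 2024-01-01T00:01:30+00:00
```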

See V1_ACCEPTANCE_CRITERIA.md for the full definition of done.

Contributing

We welcome contributions of all sizes — from typo fixes to new subsystems. See CONTRIBUTING.md for the contribution workflow and the ADR-driven process.

For local development setup, see DEVELOPMENT.md. For extension guides (adding a new contract adapter, capability, or inference provider), see EXTENDING.md.

Significant architectural changes require an ADR; see docs/adr/README.md. Community standards are in CODE_OF_CONDUCT.md.

License

MIT. See LICENSE. Per ADR-0008, MIT is the chosen license for V1. Apache-2.0 is the documented alternative if patent concerns emerge (see docs/specs/license-decision.md for the full trade-off analysis).
