Official Python SDK for ProofAgent™

ProofAgent™ Python SDK

Official Python SDK for ProofAgent™, the AI agent evaluation and certification platform.

This SDK is the supported Python client for running evaluations, retrieving reports, and integrating ProofAgent™ into production workflows.

Evaluation modes

ProofAgent supports two evaluation tiers. Judge-Led Evaluation is the default mental model for the SDK.

| Tier | Name | What it does | Best for |
| --- | --- | --- | --- |
| 1 | Judge-Led Evaluation | The AI Judge initiates and drives the conversation; your agent answers turn by turn (simulated user, multi-turn scenarios). | Pre-production validation and certification |
| 2 | Log-Based Evaluation | You submit historical customer↔agent conversation logs in one request for scoring. | Post-production validation, regression testing, and back-testing |

In one line: Judge-Led simulates interactions; Log-Based evaluates interactions you already recorded.

Platform status (beta)

ProofAgent™ is in beta, and new accounts are on the free tier for now. Judge evaluations run on models from your own LLM provider: pass llm_api_key, llm_provider, and llm_model to evaluate, evaluate_logs, or start_run so that the ProofAgent AI Judge runs on your chosen account. Model usage is charged by your provider and is not bundled into the free platform tier. APIs, limits, and pricing may change as we move toward general availability.

Installation

Package naming

PyPI distribution: proofagent-sdk
Import package: proofagent

From PyPI (recommended)

pip install proofagent-sdk

From GitHub (latest main without cloning)

pip install "git+https://github.com/ProofAgent-ai/proofagent-sdk.git"

From a local clone (editable)

git clone https://github.com/ProofAgent-ai/proofagent-sdk.git
cd proofagent-sdk
pip install -e .

Development install with extras (lint/tests/docs):

pip install -e ".[dev]"

After any install:

from proofagent import ProofAgent, TestedAgent  # recommended
from proofagent import ProofAgentClient  # low-level REST client
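
Because the distribution name (proofagent-sdk) differs from the import name (proofagent), it can help to confirm which version is actually installed. The following is a minimal sketch using only the standard library; the helper name installed_version is illustrative and is not part of the SDK.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "proofagent-sdk") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Note: this looks up the PyPI distribution name, not the import name.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"proofagent-sdk: {version or 'not installed'}")
```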

ProofAgent AI Agent Judge (domain scoring)

The ProofAgent AI Agent Judge is more than a generic LLM chat score. It combines:

  • Domain scoring techniques — rubrics and pipelines aligned to your project (tier, mode, configured metrics).
  • Domain vertical knowledge — evaluation context grounded in your project’s domain (e.g. customer support, finance, cybersecurity) so judge questions, traps, and scoring stay relevant to real workflows.
  • Structured Tier 1 metrics — every completed run can surface scores across dimensions such as:
| Metric key | What it captures |
| --- | --- |
| task_success | Completion of the intended objective |
| relevance | Response appropriateness to the user and context |
| hallucination_factuality | Accuracy and groundedness of claims |
| safety | Harmful or unsafe content |
| policy_compliance | Adherence to business / policy rules |
| tone_and_empathy | Communication quality and empathy |
| reasoning_quality | Logic and coherence |
| drift_memory_stability | Consistency and context retention across turns |
| manipulation_resistance | Resistance to prompt injection and coercion |
| coordination_quality | Multi-agent coordination (when applicable) |
| tool_picking_quality | Appropriate tool selection (when tools are in scope) |

Exact keys and aliases in API responses may vary slightly by API version; see your run report’s summary_scores / metric_evaluations.
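
Since metric keys may vary by API version, code that consumes summary_scores is safer when it does not hard-code the key set. The sketch below lists whichever metrics fall under a threshold, lowest first. The helper name weakest_metrics, the 7.0 threshold, and the sample values are assumptions made for illustration only.

```python
def weakest_metrics(summary_scores, threshold=7.0):
    """Return (metric, score) pairs scoring below `threshold`, lowest first.

    Works on whatever keys are present, so it tolerates keys or aliases
    that differ between API versions. Non-numeric values are skipped.
    """
    numeric = {
        key: float(value)
        for key, value in summary_scores.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    below = [(key, score) for key, score in numeric.items() if score < threshold]
    return sorted(below, key=lambda pair: pair[1])


# Sample values invented for this example (not real evaluation output).
scores = {
    "task_success": 8.5,
    "safety": 9.0,
    "policy_compliance": 6.5,
    "relevance": 5.0,
}
print(weakest_metrics(scores))
# -> [('relevance', 5.0), ('policy_compliance', 6.5)]
```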

ProofAgent’s proprietary domain scoring layer sits on top of whichever LLM provider you use for BYO: the Judge still applies domain rubrics and metrics regardless of provider support status below.

Supported BYO LLMs for the Judge

When you pass llm_api_key, llm_provider, and llm_model into evaluate / evaluate_logs / start_run, the Judge uses that model for planning, conducting, and scoring for that run. During beta, expect to supply BYO credentials; model usage is billed by your provider. Fully managed Judge hosting may be limited while we are in beta.

| LLM / provider | BYO in this SDK | Example models | Notes |
| --- | --- | --- | --- |
| OpenAI | Supported | gpt-4o-mini, gpt-4o, gpt-4-turbo, gpt-3.5-turbo | Use llm_provider="openai" and an OpenAI API key. |
| Anthropic (Claude) | Coming soon | | Roadmap |
| Google (Gemini) | Coming soon | | Roadmap |
| Mistral | Coming soon | | Roadmap |
| Azure OpenAI | Coming soon | | Roadmap |

Today, only OpenAI is supported for BYO through the public API/SDK; additional providers are on the roadmap.
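
One way to keep BYO settings in one place is to build the three keyword arguments once and fail early on an unsupported provider. This is a sketch under stated assumptions: the helper byo_llm_kwargs is hypothetical (not an SDK function), and it assumes the key lives in OPENAI_API_KEY as in the Quick Start.

```python
import os
from typing import Mapping, Optional

# Per the table above, only OpenAI is supported for BYO today.
SUPPORTED_PROVIDERS = {"openai"}


def byo_llm_kwargs(
    env: Optional[Mapping[str, str]] = None,
    provider: str = "openai",
    model: str = "gpt-4o-mini",
) -> dict:
    """Build the llm_api_key / llm_provider / llm_model keyword arguments.

    Hypothetical helper: raises ValueError on an unsupported provider or
    a missing key instead of letting the run fail later.
    """
    env = os.environ if env is None else env
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"provider {provider!r} is not supported for BYO yet")
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")
    return {"llm_api_key": api_key, "llm_provider": provider, "llm_model": model}


# Demonstration with an explicit mapping so no real key is needed.
print(sorted(byo_llm_kwargs({"OPENAI_API_KEY": "sk-example"})))
```

The resulting dict could then be unpacked into evaluate, evaluate_logs, or start_run, which the text above describes as accepting these three parameters.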

Quick Start — Judge-Led Evaluation (default)

Mental model: your tested agent (the product you ship) vs the AI Judge (ProofAgent’s evaluation system).

  1. Describe the tested agent as JSON (role, description, tools).
  2. Wire a small handler def your_agent_handler(message: str) -> str (or an HTTP endpoint instead).
  3. Run ProofAgent.evaluate_sync (or evaluate in async code).

Use a Judge-Led project API key.

export PROOFAGENT_API_KEY="apk_live_..."
export OPENAI_API_KEY="sk-..."   # optional BYO — reasoning/Judge LLM on your account

With verbose=True, you will see lines like:

[ProofAgent] Starting judge-led evaluation...
[Turn 1] AI Judge: ...
[Turn 1] Your Agent: ...

from proofagent import ProofAgent, TestedAgent

tested_agent_config = {
    "role": "customer_support",
    "description": "Helpful, policy-grounded support assistant",
    "tools": [
        {"name": "policy_lookup", "description": "Retrieve policy clauses"},
        {"name": "ticket_status", "description": "Ticket and escalation status"},
    ],
}

def your_agent_handler(message: str) -> str:
    return "I can help with that. Let me check the policy and status."

your_agent = TestedAgent.from_json(tested_agent_config, handler=your_agent_handler)

pa = ProofAgent.from_env(reasoning_provider="openai", reasoning_model="gpt-4o-mini")

result = pa.evaluate_sync(your_agent=your_agent, turns=3, verbose=True)
print(result.label, result.score)

Endpoint instead of a function: TestedAgent.from_json(tested_agent_config, endpoint="https://api.myagent.com/chat") — POST JSON {"message": "<judge question>"}; the SDK reads reply, response, text, answer, or agent_answer from the JSON body.
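
For the endpoint option, the contract described above (POST a JSON body with message, return a JSON body containing reply) can be served with the standard library alone. The sketch below is illustrative: answer, build_response, ChatHandler, and start_server are hypothetical names, and the 400 response for malformed input is a design choice made here, not something the SDK specifies.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def answer(message: str) -> str:
    """Placeholder agent logic; replace with a call into your real agent."""
    return f"You asked: {message}"


def build_response(raw_body: bytes):
    """Map a raw request body to (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body or b"{}")
        message = str(payload["message"])
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'message' field"}
    # "reply" is one of the keys the SDK is described as reading.
    return 200, {"reply": answer(message)}


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = build_response(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep the console quiet during evaluations


def start_server(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Start the endpoint on a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), ChatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


print(build_response(b'{"message": "Where is my refund?"}'))
```

Calling start_server() returns the server object; its server_address attribute gives the bound host and port, and shutdown() stops it.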

evaluate_sync / evaluate wrap start_run → poll_until_ready → turns → finalize → get_report. EvaluationResult exposes run_id, report, and shortcuts score / label.
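
To make the waiting step of that lifecycle concrete, here is a generic polling loop with a timeout. It is not the SDK's implementation: wait_for_status, the "ready" status string, and the injected get_status callable are all assumptions chosen so the sketch runs without a network connection.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    ready: str = "ready",          # hypothetical status value
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Call get_status() until it returns `ready` or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if get_status() == ready:
            return
        if clock() >= deadline:
            raise TimeoutError(f"status did not become {ready!r} within {timeout}s")
        sleep(interval)


# Simulated run that becomes ready on the third check; sleep is stubbed out.
states = iter(["queued", "preparing", "ready"])
wait_for_status(lambda: next(states), sleep=lambda seconds: None)
print("run is ready")
```

Injecting sleep and clock keeps the loop testable without real delays.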

Reports also appear in the app: https://www.proofagent.ai/dashboard.


Log-Based Evaluation

Log-Based Evaluation scores historical transcripts. Use a Log-Based project API key. The tested agent uses the same JSON config as in Judge-Led Evaluation but takes no handler, because the config serves as metadata only.

from proofagent import ProofAgent, TestedAgent

tested_agent_config = {
    "role": "billing_support",
    "description": "Billing assistant",
    "tools": [{"name": "invoice_lookup", "description": "Find invoices"}],
}

logs = [
    {"turn_index": 1, "user_message": "I was charged twice", "agent_answer": "Let me verify."},
]

your_agent = TestedAgent.from_json(tested_agent_config)
pa = ProofAgent.from_env(reasoning_provider="openai", reasoning_model="gpt-4o-mini")
result = pa.evaluate_logs_sync(logs, your_agent, verbose=True)
print(result.label, result.score)

evaluate_logs / evaluate_logs_sync call assert_project_supports_logs first. See LOG_BASED_PROJECT_MODES if your key is the wrong project type.
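
When transcripts come from another system, a small adapter can put them into the log shape shown above. This sketch assumes turn_index is 1-based and contiguous (as in the example entry) and rejects empty messages; both are choices made here for illustration, and build_logs is a hypothetical helper rather than an SDK function.

```python
from typing import Iterable, List, Tuple


def build_logs(exchanges: Iterable[Tuple[str, str]]) -> List[dict]:
    """Turn (user_message, agent_answer) pairs into numbered log entries."""
    logs = []
    for index, (user_message, agent_answer) in enumerate(exchanges, start=1):
        if not user_message.strip() or not agent_answer.strip():
            raise ValueError(f"turn {index} has an empty message")
        logs.append(
            {
                "turn_index": index,
                "user_message": user_message,
                "agent_answer": agent_answer,
            }
        )
    return logs


logs = build_logs(
    [
        ("I was charged twice", "Let me verify."),
        ("It was on March 3", "Thanks, I found the duplicate charge."),
    ]
)
print(logs[1]["turn_index"])
# -> 2
```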


CLI

proofagent init

Creates a starter proofagent.yaml. The Python client reads PROOFAGENT_API_KEY from the environment (the YAML file is onboarding only unless you load it yourself).

proofagent init --output custom-proofagent.yaml

Example report shape (GET /api/v1/runs/:id/report)

Exact fields depend on backend version and domain; typical data looks like:

{
  "result": {
    "final_score": 8.4,
    "certification_label": "CERTIFIED",
    "summary_scores": {
      "task_success": 8.5,
      "safety": 9.0,
      "policy_compliance": 8.0
    },
    "flags": [],
    "text_summary": "Short narrative from the AI Judge…"
  },
  "transcript": [
    {
      "turn": 1,
      "judge_question": "…",
      "agent_answer": "…"
    }
  ],
  "metadata": {
    "total_turns": 3,
    "evaluated_at": "2026-03-24T12:00:00Z"
  }
}
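
A report of this shape lends itself to a simple release gate in CI. The following sketch reads only the fields shown in the example; the function name certification_problems and the default 8.0 threshold are assumptions, and because exact fields depend on the backend version, missing values are reported rather than treated as a pass.

```python
def certification_problems(report, min_score=8.0, required_label="CERTIFIED"):
    """Return reasons a report fails the gate; an empty list means it passes."""
    result = report.get("result") or {}
    problems = []

    score = result.get("final_score")
    if not isinstance(score, (int, float)):
        problems.append("final_score is missing or not numeric")
    elif score < min_score:
        problems.append(f"final_score {score} is below {min_score}")

    label = result.get("certification_label")
    if label != required_label:
        problems.append(
            f"certification_label is {label!r}, expected {required_label!r}"
        )

    flags = result.get("flags") or []
    if flags:
        problems.append(f"{len(flags)} flag(s) raised: {flags}")
    return problems


# Trimmed copy of the example report above.
report = {
    "result": {
        "final_score": 8.4,
        "certification_label": "CERTIFIED",
        "flags": [],
    }
}
problems = certification_problems(report)
print("PASS" if not problems else "FAIL: " + "; ".join(problems))
# -> PASS
```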

View reports in the product: https://www.proofagent.ai/dashboard

Example report: [screenshot: example evaluation report in the ProofAgent dashboard]

Runnable copies: examples/judge_led_quickstart.py, examples/log_based_evaluation.py. Minimal notebooks are under notebooks/ (see docs/examples.md).

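The low-level client is asynchronous: call it with async / await, or drive it from synchronous code with asyncio.run(). The evaluate_sync and evaluate_logs_sync helpers shown above cover synchronous use.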

Why ProofAgent™?

ProofAgent™ is built to help teams evaluate AI agents before deployment by supporting:

  • Correctness and response quality checks
  • Refusal and safety validation
  • Tool usage and execution verification
  • Multi-turn evaluation flows
  • Production-oriented reporting and integration

Official SDK

This repository publishes the official proofagent-sdk package on PyPI.

Use this SDK when you want a maintained Python client aligned with the ProofAgent™ platform and API.

Documentation and examples

| Resource | Description |
| --- | --- |
| Documentation portal | Main product and SDK documentation |
| docs/python-sdk-guide.md | Python SDK guide |
| docs/quickstart.md | Quickstart snippets |
| examples/ | Runnable examples |

Build docs locally:

make docs-serve

Configuration

| Variable | Description |
| --- | --- |
| PROOFAGENT_API_KEY | API key used by ProofAgentClient.from_env() |
| PROOFAGENT_BASE_URL | API base URL. Defaults to https://api.proofagent.ai |

For advanced configuration such as retries and timeouts, see ProofAgentConfig.
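
The environment lookup described in the table can be mirrored in a few lines, which is handy for a preflight check in deployment scripts. This is a sketch of the documented behaviour, not the SDK's own code: Settings and load_settings are hypothetical names, and stripping a trailing slash from the base URL is a choice made here.

```python
import os
from typing import Mapping, NamedTuple, Optional

DEFAULT_BASE_URL = "https://api.proofagent.ai"  # documented default


class Settings(NamedTuple):
    api_key: str
    base_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read PROOFAGENT_API_KEY and PROOFAGENT_BASE_URL from a mapping.

    Raises RuntimeError when the API key is missing so a misconfigured
    environment is caught before any evaluation starts.
    """
    env = os.environ if env is None else env
    api_key = env.get("PROOFAGENT_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("PROOFAGENT_API_KEY is not set")
    base_url = env.get("PROOFAGENT_BASE_URL", "").strip() or DEFAULT_BASE_URL
    return Settings(api_key=api_key, base_url=base_url.rstrip("/"))


# Demonstration with an explicit mapping instead of the real environment.
print(load_settings({"PROOFAGENT_API_KEY": "apk_live_example"}).base_url)
# -> https://api.proofagent.ai
```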

Package layout

src/proofagent/ — main SDK package

| Module | Role |
| --- | --- |
| proof_agent.py | ProofAgent facade (evaluate_sync, reasoning defaults) |
| tested_agent.py | TestedAgent (JSON + handler or endpoint) |
| client.py | ProofAgentClient (evaluate, evaluate_logs, REST) |
| evaluation.py | EvaluationResult (score, label) and helpers |
| project_support.py | Log-Based project checks (assert_project_supports_logs) |
| config.py | Configuration handling |
| exceptions.py | SDK exceptions |
| types.py | Shared SDK types |
| cli.py | CLI entrypoint for the proofagent command |

Runtime requirements: Python 3.10+, httpx for async HTTP.

License

See the LICENSE file for details.
