pytest-based behavioral testing framework for AI agents

AgentProof

pytest-based behavioral testing for AI agents.

Requires Python 3.10+. License: MIT.

pip install agentproof

One line. No config. Works with any agent framework.


The Problem

You can observe your agents. You can trace them. You can log every token.

But can you test them?

89% of teams have agent observability. Only 52% have agent evaluation. Zero have behavioral testing in CI.

Your agent calls the wrong tool? Ships to production. Costs $40 on a $0.50 task? Ships to production. Hallucinates the answer? Ships. To. Production.

The Solution

AgentProof brings pytest-style behavioral testing to AI agents. Test what your agent does, not just what it outputs.

# test_booking_agent.py
from agentproof import assert_tool_called, assert_tool_order, assert_max_cost

def test_booking_agent_searches_before_booking():
    # make_booking_run() is a placeholder for your own agent execution;
    # it should return the recorded run to assert against.
    run = make_booking_run()

    # Did it use the right tools?
    assert_tool_called(run, "search_flights")
    assert_tool_called(run, "book_flight")

    # Did it use them in the right order?
    assert_tool_order(run, ["search_flights", "compare_prices", "book_flight"])

    # Did it stay within budget?
    assert_max_cost(run, max_usd=0.50)

$ pytest
========================= test session starts =========================
test_booking_agent.py::test_booking_agent_searches_before_booking PASSED
test_booking_agent.py::test_cost_stays_under_budget PASSED
test_booking_agent.py::test_no_hallucination PASSED
========================= 3 passed in 0.04s ============================

Quickstart

1. Install

pip install agentproof

2. Record an agent run

import agentproof

@agentproof.record
def my_agent(prompt: str) -> str:
    # Your agent code here
    agentproof.add_tool_call("search", arguments={"query": prompt})
    agentproof.add_llm_call("gpt-4o", prompt_tokens=500, completion_tokens=200)
    return "The answer is 42"

result = my_agent("What is the meaning of life?")
run = my_agent.last_run
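
If the recording pattern above is unfamiliar, the following toy version shows one way a decorator can collect calls made inside a function and expose them as `last_run`. It is only a sketch of the general idea, not AgentProof's implementation: `Run`, `record`, and `add_tool_call` here are hypothetical stand-ins written for this example.

```python
import functools
from contextvars import ContextVar
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical container for what happened during one call."""
    tool_calls: list = field(default_factory=list)
    output: object = None


# Holds the run being recorded, or None when nothing is recording.
_current_run: ContextVar = ContextVar("current_run", default=None)


def add_tool_call(name, arguments=None):
    """Append a tool call to the run currently being recorded."""
    run = _current_run.get()
    if run is None:
        raise RuntimeError("add_tool_call() used outside a recorded function")
    run.tool_calls.append({"name": name, "arguments": arguments or {}})


def record(func):
    """Wrap func so every call is captured and stored on wrapper.last_run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        run = Run()
        token = _current_run.set(run)
        try:
            run.output = func(*args, **kwargs)
        finally:
            _current_run.reset(token)
            wrapper.last_run = run
        return run.output

    wrapper.last_run = None
    return wrapper


@record
def toy_agent(prompt):
    add_tool_call("search", {"query": prompt})
    return "The answer is 42"


toy_agent("What is the meaning of life?")
print(toy_agent.last_run.tool_calls)
```

The point to notice is that the decorated function still returns its normal output; the recorded run lives on the function object.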

3. Write tests

from agentproof import (
    assert_tool_called,
    assert_tool_order,
    assert_max_cost,
    assert_max_steps,
    assert_no_hallucination,
)

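def test_my_agent():
    # my_agent is the recorded function from step 2; call it so last_run is set.
    my_agent("What is the meaning of life?")
    run = my_agent.last_run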

    assert_tool_called(run, "search")
    assert_max_cost(run, 0.10)
    assert_max_steps(run, 5)
    assert_no_hallucination(run, ground_truth="42")

Core Assertions

| Assertion | What it tests |
| --- | --- |
| assert_tool_called(run, "search") | Tool was used (optionally N times) |
| assert_tool_order(run, ["a", "b", "c"]) | Tools called in correct sequence |
| assert_max_cost(run, 0.50) | Total cost under budget (200+ models) |
| assert_max_steps(run, 10) | Agent didn't spin out |
| assert_no_hallucination(run, truth) | Output grounded in source (TF-IDF, no API) |
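
To make "tools called in correct sequence" concrete, here is a minimal sketch of one plausible reading: the expected tools must appear in the run in the given order, with unrelated calls allowed in between. This is an illustration only. `is_ordered_subsequence` is a hypothetical helper, and the real `assert_tool_order` may be stricter (for example, requiring an exact match).

```python
from typing import Iterable, Sequence


def is_ordered_subsequence(called: Iterable[str], expected: Sequence[str]) -> bool:
    """Return True if `expected` appears in `called` in order (gaps allowed)."""
    remaining = iter(called)
    # `name in remaining` advances the iterator until it finds `name`,
    # so each expected tool must occur after the previous match.
    return all(name in remaining for name in expected)


calls = ["search_flights", "get_weather", "compare_prices", "book_flight"]

# Search, compare, book happen in that order, with one extra call in between.
print(is_ordered_subsequence(calls, ["search_flights", "compare_prices", "book_flight"]))

# Booking before searching never happens in this run.
print(is_ordered_subsequence(calls, ["book_flight", "search_flights"]))
```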

Framework Adapters

Works with any agent framework. Install the adapter you need:

pip install "agentproof[langchain]"
pip install "agentproof[crewai]"
pip install "agentproof[openai]"
pip install "agentproof[otel]"      # Any OTEL-instrumented framework

# LangChain
from agentproof.adapters.langchain import from_langchain_run
run = from_langchain_run(agent_executor.invoke({"input": "..."}))

# CrewAI
from agentproof.adapters.crewai import from_crewai_result
run = from_crewai_result(crew.kickoff())

# OpenAI Agents SDK
from agents import Runner
from agentproof.adapters.openai_agents import from_openai_response
# Runner.run is a coroutine; run_sync is the blocking variant.
run = from_openai_response(Runner.run_sync(agent, "..."))

Replay Testing

Record once, test forever. Capture a real agent run and replay it in CI without making API calls:

from agentproof import save_trace, load_trace, assert_tool_called, assert_max_cost

# Record
save_trace(run, "tests/fixtures/booking_success.jsonl")

# Replay in tests
def test_booking_replay():
    run = load_trace("tests/fixtures/booking_success.jsonl")
    assert_tool_called(run, "book_flight")
    assert_max_cost(run, 0.50)
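
For readers new to JSONL, the sketch below shows the round trip that makes replay possible: one JSON object per line on disk, read back later without touching any API. The step records and the `write_jsonl` / `read_jsonl` helpers are hypothetical and use only the standard library; the schema that `save_trace` actually writes may differ.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(steps: list[dict], path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as fh:
        for step in steps:
            fh.write(json.dumps(step) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read the steps back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical step records; the real trace schema may differ.
steps = [
    {"type": "tool_call", "name": "search_flights", "arguments": {"to": "NYC"}},
    {"type": "llm_call", "model": "gpt-4o", "prompt_tokens": 500, "completion_tokens": 200},
    {"type": "tool_call", "name": "book_flight", "arguments": {"flight": "XY123"}},
]

with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "booking_success.jsonl"
    write_jsonl(steps, trace_path)
    replayed = read_jsonl(trace_path)

# The replayed steps can be inspected without re-running the agent.
tool_names = [s["name"] for s in replayed if s["type"] == "tool_call"]
print(tool_names)
```

Because each line is an independent JSON object, a trace file diffs cleanly in code review and can be appended to while a run is still in progress.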

Snapshot Testing

Like Jest snapshots but for agent behavior. Detect behavioral regressions automatically:

from agentproof import assert_snapshot

def test_agent_behavior_stable(snapshot_dir):
    my_agent("book a flight to NYC")
    run = my_agent.last_run  # the decorated function returns its output, not the run
    assert_snapshot(run, snapshot_dir / "booking_agent.json")
    # First run: creates snapshot
    # Subsequent runs: compares against saved snapshot

Update snapshots: pytest --agentproof-update-snapshots
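
The create-then-compare cycle can be sketched in a few lines of standard-library Python. This is a simplified illustration of the general snapshot technique, not AgentProof's code: `check_snapshot`, its `update` flag, and the behavior dictionaries are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_snapshot(behavior: dict, path: Path, update: bool = False) -> None:
    """Create the snapshot on first use, otherwise compare against it."""
    if update or not path.exists():
        path.write_text(json.dumps(behavior, indent=2, sort_keys=True), encoding="utf-8")
        return
    saved = json.loads(path.read_text(encoding="utf-8"))
    if behavior != saved:
        raise AssertionError(f"behavior changed: expected {saved}, got {behavior}")


# Hypothetical summaries of what an agent did on two different runs.
baseline = {"tools": ["search_flights", "book_flight"], "steps": 3}
regressed = {"tools": ["book_flight"], "steps": 1}

with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "booking_agent.json"
    check_snapshot(baseline, snap)       # first run: writes the file
    check_snapshot(baseline, snap)       # same behavior: passes silently
    try:
        check_snapshot(regressed, snap)  # changed behavior: raises
        detected = False
    except AssertionError:
        detected = True

print(detected)
```

Passing `update=True` plays the role of an update flag: it overwrites the saved file instead of comparing, which is what you want after an intentional behavior change.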

Bundled Cost Database

200+ LLM models with up-to-date pricing. No API calls needed.

from agentproof import get_model_pricing, calculate_run_cost

pricing = get_model_pricing("gpt-4o")
# {"provider": "openai", "prompt": 2.50, "completion": 10.00}

cost = calculate_run_cost(run)
# CostBreakdown(total_cost_usd=0.0325, ...)

Models: GPT-4o, Claude 4.5, Gemini 2.5, Llama 4, DeepSeek V3, Mistral Large, and 190+ more.
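
The README does not state the unit of the `prompt` and `completion` prices. The sketch below assumes they are USD per 1,000,000 tokens, which is consistent with the gpt-4o figures shown above, and works through the arithmetic for the call recorded in the Quickstart (500 prompt tokens, 200 completion tokens). `llm_call_cost` is a hypothetical helper, not part of AgentProof.

```python
def llm_call_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price: float,
    completion_price: float,
) -> float:
    """Cost in USD, assuming prices are quoted in USD per 1,000,000 tokens."""
    return (prompt_tokens * prompt_price + completion_tokens * completion_price) / 1_000_000


# 500 * 2.50 = 1250 and 200 * 10.00 = 2000, so (1250 + 2000) / 1,000,000 USD.
cost = llm_call_cost(500, 200, prompt_price=2.50, completion_price=10.00)
print(f"{cost:.5f}")  # prints 0.00325
```

Under that assumption, a single call of this size costs well under one cent, so a `assert_max_cost(run, 0.10)` budget leaves room for roughly thirty such calls.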

CI/CD Integration

GitHub Actions

- uses: praxiumlabs/agentproof@v1
  with:
    test-path: tests/

Or manually:

- run: |
    pip install agentproof
    pytest tests/ -v

Comparison

| Feature | AgentProof | DeepEval | Braintrust |
| --- | --- | --- | --- |
| pytest native | Yes | Yes | No |
| Behavioral assertions | Yes | No | No |
| Tool sequence testing | Yes | No | No |
| Cost assertions | Yes | No | No |
| No API key needed | Yes | No | No |
| JSONL replay | Yes | No | No |
| Snapshot testing | Yes | No | No |
| Framework adapters | 4 | 2 | 1 |
| Package size | <500KB | ~50MB | Cloud |
| Price | Free | Freemium | Paid |

Honest Limitations

  • Hallucination detection uses TF-IDF — it catches obvious fabrication, not subtle inaccuracies. For LLM-as-judge evaluation, use DeepEval or Braintrust.
  • Token counts must be provided — we don't intercept API calls (by design). Use framework adapters or log tokens manually.
  • No real-time monitoring — AgentProof is a testing tool, not an observability platform. Use it alongside your existing tracing.
  • Cost DB needs community updates — model pricing changes. Submit a PR to update agentproof/data/models.yaml.
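
The first limitation is easy to see with a generic TF-IDF cosine similarity. The sketch below is a textbook-style version using only the standard library (with a smoothed IDF so two-document comparisons stay meaningful); it is not AgentProof's scoring code, and the example sentences are made up. An unrelated answer scores near zero, while an answer with one wrong detail still scores high.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity of the TF-IDF vectors of two texts."""
    docs = [Counter(tokenize(a)), Counter(tokenize(b))]
    vocab = set(docs[0]) | set(docs[1])
    n_docs = len(docs)
    # Smoothed IDF so that terms shared by both texts keep a non-zero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1
        for term in vocab
    }
    vecs = [{t: d[t] * idf[t] for t in d} for d in docs]
    dot = sum(vecs[0][t] * vecs[1].get(t, 0.0) for t in vecs[0])
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0
    return dot / (norms[0] * norms[1])


truth = "The flight departs from Boston at 9 AM and costs 320 dollars"
fabricated = "Your hotel in Paris includes free breakfast and a spa"
subtle = "The flight departs from Boston at 9 PM and costs 320 dollars"

print(round(tfidf_cosine(truth, fabricated), 2))  # low: obvious fabrication
print(round(tfidf_cosine(truth, subtle), 2))      # high: AM/PM error slips through
```

Any threshold loose enough to accept paraphrases of the ground truth will also accept the AM/PM mistake, which is why an LLM-as-judge tool is the better fit for subtle inaccuracies.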

Contributing

git clone https://github.com/praxiumlabs/agentproof
cd agentproof
pip install -e ".[dev]"
pytest

Adding a model to the cost DB:

Edit agentproof/data/models.yaml:

new-model-name:
  provider: provider-name
  prompt: 1.50
  completion: 5.00

License

MIT License. See LICENSE.


Built by Sarak Dahal at Praxium Labs. Star the repo if AgentProof saves your agents.
