Skip to main content

A testing and evaluation framework for software agents

Project description

AgentProbe

CI Coverage Python 3.11+ License PyPI

A testing and evaluation framework for software agents.

AgentProbe gives you the tools to test autonomous software agents the same way you'd test any critical software --- with structured assertions, regression detection, safety scans, cost tracking, and full execution tracing. If your agent calls tools, makes decisions, or produces non-deterministic output, AgentProbe helps you verify it works correctly and keeps working.

agentprobe test


Who Is This For?

  • You build agents with LangChain, CrewAI, AutoGen, or MCP and need to verify they work correctly after every change
  • You're shipping agents to production and need cost guardrails, safety checks, and regression detection before each deploy
  • You want pytest-native testing instead of custom scripts, notebooks, or SaaS dashboards
  • Your team needs reproducible agent tests with structured traces, not ad-hoc print statements and manual spot-checks

If your agent calls tools, makes decisions, or produces non-deterministic output, AgentProbe helps you test it like any other critical software.


How It Compares

AgentProbe DeepEval RAGAS Promptfoo LangSmith
pytest native fixture custom assert -- -- --
Self-hosted yes yes yes yes SaaS
Trace recording yes -- -- -- yes
Cost tracking built-in (5 providers) -- -- yes yes
Safety scanning 6 built-in suites red teaming -- red teaming --
Framework adapters 4 (LC, CrewAI, AutoGen, MCP) -- LangChain provider-based LangChain
Regression detection yes via cloud -- snapshots yes
Language Python Python Python JS/YAML Python
License Apache 2.0 Apache 2.0 Apache 2.0 MIT Proprietary

Install

pip install agentprobe-framework

Optional extras:

pip install agentprobe-framework[postgres]     # PostgreSQL storage
pip install agentprobe-framework[eval]         # Embedding evaluators (numpy)
pip install agentprobe-framework[dashboard]    # REST API dashboard (FastAPI)

Quick Start

pytest Plugin (Recommended)

AgentProbe ships as a native pytest plugin. Install it and the agentprobe fixture is automatically available:

pytest with agentprobe

# test_my_agent.py
from agentprobe.testing import assert_trace, assert_score, assert_cost
from agentprobe.eval.rules import RuleBasedEvaluator, RuleSpec

evaluator = RuleBasedEvaluator(rules=[
    RuleSpec(rule_type="max_length", params={"max": 3000}),
])

async def test_greeting(agentprobe):
    trace = await agentprobe.invoke("Say hello", adapter=my_adapter)

    assert_trace(trace).has_output().contains("hello")
    await assert_score(trace, evaluator, min_score=0.8)
    assert_cost(trace, max_usd=0.01)

Run with standard pytest:

pytest tests/ -v

Standalone CLI

agentprobe init

agentprobe init          # create agentprobe.yaml
agentprobe test -d tests/  # discover and run @scenario tests

Scenario-Based Tests

from agentprobe import scenario, expect

@scenario(name="greeting_test", input_text="Say hello", tags=["smoke"])
def test_greeting():
    """Agent should produce a friendly greeting."""
    pass

Core Concepts

Trace Recording

Every agent execution is captured as a Trace --- a structured record of every decision, tool call, and model invocation. Traces are immutable, storable, and replayable.

from agentprobe.trace.recorder import TraceRecorder

recorder = TraceRecorder(agent_name="my-agent", model="claude-sonnet-4-5-20250929")

async with recorder.recording() as ctx:
    ctx.record_llm_call(model="claude-sonnet-4-5-20250929", input_tokens=150, output_tokens=80)
    ctx.record_tool_call(tool_name="search", tool_input={"query": "weather"})

trace = recorder.finalize(input_text="What's the weather?", output="It's sunny.")
# trace.llm_calls, trace.tool_calls, trace.total_input_tokens, etc.

Fluent Assertions

Validate agent output with a chainable API that gives clear error messages on failure:

from agentprobe import expect, expect_tool_calls

# Output assertions
expect(output).to_contain("Paris").to_not_contain("error").to_match(r"\d+ degrees")

# JSON validation
expect(output).to_be_valid_json()

# Tool call assertions
expect_tool_calls(trace.tool_calls).to_contain("search").to_have_count(3)
expect_tool_calls(trace.tool_calls).to_have_sequence(["search", "calculate", "respond"])

Rule-Based Evaluation

Define evaluation rules that score agent output on multiple dimensions:

from agentprobe.eval.rules import RuleBasedEvaluator, RuleSpec

evaluator = RuleBasedEvaluator(
    rules=[
        RuleSpec(rule_type="contains_any", params={"values": ["Paris"]}, weight=3.0),
        RuleSpec(rule_type="max_length", params={"max": 500}, weight=1.0),
        RuleSpec(rule_type="json_valid"),
        RuleSpec(rule_type="regex", params={"pattern": r"\d{4}"}),
        RuleSpec(rule_type="not_contains", params={"values": ["error", "fail"]}),
    ]
)

result = await evaluator.evaluate(test_case, trace)
# result.verdict: PASS | PARTIAL | FAIL
# result.score: 0.0 - 1.0 (weighted average)

Cost Tracking

Track token usage and costs across providers. Supports Anthropic, OpenAI, Google, Mistral, and Cohere pricing out of the box:

from agentprobe.cost.calculator import CostCalculator

calculator = CostCalculator()  # loads built-in pricing data
summary = calculator.calculate_trace_cost(trace)

print(f"Total: ${summary.total_cost_usd:.6f}")
for model, breakdown in summary.breakdown_by_model.items():
    print(f"  {model}: ${breakdown.total_cost_usd:.6f} ({breakdown.call_count} calls)")

Enforce budgets per-test or per-suite:

from agentprobe.cost.budget import BudgetEnforcer

enforcer = BudgetEnforcer(test_budget_usd=0.10, suite_budget_usd=1.00)
check = enforcer.check_test(summary)
# check.within_budget, check.utilization_pct, check.remaining_usd

Safety Scanning

agentprobe safety scan

Run built-in suites that probe for common agent vulnerabilities:

from agentprobe.safety.scanner import SafetyScanner

scanner = SafetyScanner()  # loads all built-in suites
result = await scanner.scan(adapter)
# result.total_passed, result.total_failed, result.suite_results

Built-in suites: prompt injection, data leakage, jailbreak, role confusion, hallucination, tool abuse.

Regression Detection

Save baselines and detect when agent behavior degrades:

from agentprobe.regression.baseline import BaselineManager
from agentprobe.regression.detector import RegressionDetector

manager = BaselineManager(baseline_dir="baselines/")
manager.save("v1.0", test_results)

# Later, after changes...
baseline = manager.load("v1.0")
detector = RegressionDetector(threshold=0.05)
report = detector.compare("v1.0", baseline, current_results)

print(f"Regressions: {report.regressions}, Improvements: {report.improvements}")
for comp in report.comparisons:
    if comp.is_regression:
        print(f"  {comp.test_name}: {comp.baseline_score:.2f} -> {comp.current_score:.2f}")

Trace Replay and Diff

Replay recorded traces with mock overrides, or compare any two traces structurally:

from agentprobe.trace.replay import ReplayEngine
from agentprobe.trace.diff import TraceDiffer

# Replay with mock tool
engine = ReplayEngine(mock_tools={"search": lambda inp: "mocked result"})
replayed = engine.replay(original_trace)
diff = engine.diff(original_trace, replayed)

# Structural comparison between any two traces
differ = TraceDiffer(similarity_threshold=0.8)
report = differ.diff(trace_a, trace_b)
# report.output_matches, report.token_delta, report.overall_similarity

Security

Scan for PII, hash sensitive fields, and log security events:

from agentprobe.security import PIIRedactor, FieldEncryptor, AuditLogger

# PII detection and redaction
redactor = PIIRedactor()
clean = redactor.redact("Email john@example.com, SSN 123-45-6789")
# "Email [EMAIL], SSN [SSN]"

# Field-level hashing and masking
encryptor = FieldEncryptor()
encryptor.hash_value("sensitive-data")     # deterministic SHA-256
encryptor.mask_value("4111111111111111")   # "************1111"

# Structured audit logging
from agentprobe.security.audit import AuditLogger, AuditEventType
logger = AuditLogger()
logger.log_event(AuditEventType.PII_REDACTION, details={"matches": 3})

CLI Reference

Command Description
agentprobe init Create agentprobe.yaml configuration file
agentprobe test -d <dir> Discover and run test scenarios
agentprobe trace list List recorded execution traces
agentprobe trace show <id> Inspect a specific trace
agentprobe safety scan Run safety test suites
agentprobe cost report Generate cost report
agentprobe cost budget Check budget utilization
agentprobe baseline save Save current results as baseline
agentprobe baseline compare Detect regressions against baseline
agentprobe snapshot update Update golden-file snapshots
agentprobe metrics list List collected metrics
agentprobe metrics summary Show aggregated metric statistics
agentprobe dashboard Start the REST API dashboard

Configuration

agentprobe.yaml controls all behavior:

project_name: my-project
test_dir: tests

runner:
  parallel: true
  max_workers: 4
  default_timeout: 30.0

trace:
  enabled: true
  storage_backend: sqlite        # or "postgres"
  database_path: .agentprobe/traces.db

cost:
  enabled: true
  budget_limit_usd: 10.00

safety:
  enabled: true
  suites:
    - prompt-injection
    - data-exfiltration
    - jailbreak
    - tool-abuse

reporting:
  formats: [terminal, json, junit]
  output_dir: agentprobe-report

regression:
  threshold: 0.05
  baseline_dir: baselines/

metrics:
  enabled: true
  collect: [latency, tokens, cost, score]

plugins:
  enabled: true
  search_paths: [./plugins]

Framework Adapters

Built-in adapters for popular agent frameworks:

Framework Adapter Import
LangChain LangChainAdapter agentprobe.adapters.langchain
CrewAI CrewAIAdapter agentprobe.adapters.crewai
AutoGen AutoGenAdapter agentprobe.adapters.autogen
MCP MCPAdapter agentprobe.adapters.mcp

Or implement the AdapterProtocol:

class MyAdapter:
    @property
    def name(self) -> str:
        return "my-agent"

    async def invoke(self, input_text: str, **kwargs: object) -> Trace:
        # Call your agent and return a Trace
        ...

Storage Backends

Backend Use Case Install
SQLite (default) Local development, single-user Built-in
PostgreSQL Teams, concurrent access, production pip install agentprobe-framework[postgres]

Both backends support the same API: save_trace, load_trace, list_traces, save_result, load_result, load_results, save_metrics, load_metrics.

Project Structure

src/agentprobe/
    core/          Test runner, discovery, assertions, config, models
    eval/          Evaluators: rules, embedding, judge, statistical, trace compare
    trace/         Recorder, replay, diff, time travel
    cost/          Calculator, pricing data (5 providers), budgets
    safety/        Scanner, 6 built-in test suites, payloads
    regression/    Detector, baselines, behavioral diff
    security/      PII redaction, field encryption, audit logging
    adapters/      LangChain, CrewAI, AutoGen, MCP
    metrics/       Collection, aggregation, trending
    storage/       SQLite + PostgreSQL backends with migrations
    reporting/     Terminal, HTML, JUnit XML, JSON, Markdown, CSV
    plugins/       Loader, registry, base classes
    dashboard/     FastAPI REST API
    cli/           Click-based CLI commands
tests/             1090+ tests, 94% coverage
examples/          9 runnable example scripts
docs/              MkDocs Material documentation

Development

git clone https://github.com/dyrach1o/agentprobe-framework.git
cd agentprobe-framework
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,test,docs]"

make check        # lint + type-check + test
make test         # pytest with coverage
make lint         # ruff check
make format       # ruff format
make type-check   # mypy strict
make docs-serve   # local docs at localhost:8000

Contributing

See CONTRIBUTING.md for development setup, coding standards, and PR workflow.

Changelog

See CHANGELOG.md for release history.

Security

See SECURITY.md for vulnerability reporting.

License

Apache License 2.0. See LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentprobe_framework-1.1.0.tar.gz (224.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

agentprobe_framework-1.1.0-py3-none-any.whl (149.0 kB view details)

Uploaded Python 3

File details

Details for the file agentprobe_framework-1.1.0.tar.gz.

File metadata

  • Download URL: agentprobe_framework-1.1.0.tar.gz
  • Upload date:
  • Size: 224.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for agentprobe_framework-1.1.0.tar.gz
Algorithm Hash digest
SHA256 2e62c1158d8c8166b274b7c46de4cda7cae93c9b58965295c1ca5b0f12af6c6c
MD5 6b43f56745a193cd114d53630c8fa98d
BLAKE2b-256 61011c119d0d6d04fba99ccb7d30d92220c9389503907966e245da4a60d6b4b3

See more details on using hashes here.

File details

Details for the file agentprobe_framework-1.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for agentprobe_framework-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 eddbf6de8e25ffec01dfaed6091c3aadd08f93dec61bd5d06eb594830ff8ccdb
MD5 1779dbc91eda67650d4c32797e0ccf65
BLAKE2b-256 c111d51062e033e953b79597cb90bd5e3fb051a828254cebf8dfd1ac3aa82c54

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page