
AgentProbe

CI Coverage Python 3.11+ License PyPI

A testing and evaluation framework for software agents.

AgentProbe gives you the tools to test autonomous software agents the same way you'd test any critical software --- with structured assertions, regression detection, safety scans, cost tracking, and full execution tracing. If your agent calls tools, makes decisions, or produces non-deterministic output, AgentProbe helps you verify it works correctly and keeps working.

$ agentprobe test -d tests/

  [PASS] test_greeting ............. score=1.00  42ms
  [PASS] test_tool_usage ........... score=0.95 180ms
  [FAIL] test_cost_limit ........... score=0.60 220ms
  [PASS] test_no_data_leakage ...... score=1.00  95ms

  4 tests | 3 passed | 1 failed | 537ms total
  Cost: $0.0042 | Budget: $0.01 (42% used)

Install

pip install agentprobe-framework

Optional extras:

pip install agentprobe-framework[postgres]     # PostgreSQL storage
pip install agentprobe-framework[eval]         # Embedding evaluators (numpy)
pip install agentprobe-framework[dashboard]    # REST API dashboard (FastAPI)

Quick Start

pytest Plugin (Recommended)

AgentProbe ships as a native pytest plugin. Install it and the agentprobe fixture is automatically available:

# test_my_agent.py
from agentprobe.testing import assert_trace, assert_score, assert_cost
from agentprobe.eval.rules import RuleBasedEvaluator, RuleSpec

evaluator = RuleBasedEvaluator(rules=[
    RuleSpec(rule_type="max_length", params={"max": 3000}),
])

async def test_greeting(agentprobe):
    trace = await agentprobe.invoke("Say hello", adapter=my_adapter)

    assert_trace(trace).has_output().contains("hello")
    await assert_score(trace, evaluator, min_score=0.8)
    assert_cost(trace, max_usd=0.01)

Run with standard pytest:

pytest tests/ -v

Standalone CLI

agentprobe init          # create agentprobe.yaml
agentprobe test -d tests/  # discover and run @scenario tests

Scenario-Based Tests

from agentprobe import scenario

@scenario(name="greeting_test", input_text="Say hello", tags=["smoke"])
def test_greeting():
    """Agent should produce a friendly greeting."""
    pass

Core Concepts

Trace Recording

Every agent execution is captured as a Trace --- a structured record of every decision, tool call, and model invocation. Traces are immutable, storable, and replayable.

from agentprobe.trace.recorder import TraceRecorder

recorder = TraceRecorder(agent_name="my-agent", model="claude-sonnet-4-5-20250929")

async with recorder.recording() as ctx:
    ctx.record_llm_call(model="claude-sonnet-4-5-20250929", input_tokens=150, output_tokens=80)
    ctx.record_tool_call(tool_name="search", tool_input={"query": "weather"})

trace = recorder.finalize(input_text="What's the weather?", output="It's sunny.")
# trace.llm_calls, trace.tool_calls, trace.total_input_tokens, etc.

Fluent Assertions

Validate agent output with a chainable API that gives clear error messages on failure:

from agentprobe import expect, expect_tool_calls

# Output assertions
expect(output).to_contain("Paris").to_not_contain("error").to_match(r"\d+ degrees")

# JSON validation
expect(output).to_be_valid_json()

# Tool call assertions
expect_tool_calls(trace.tool_calls).to_contain("search").to_have_count(3)
expect_tool_calls(trace.tool_calls).to_have_sequence(["search", "calculate", "respond"])
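The chainable style above is a standard fluent-interface pattern: each assertion method raises on failure and returns `self` so calls compose left to right. A minimal sketch of the pattern (illustrative only, not AgentProbe's actual implementation):

```python
# Minimal fluent-assertion sketch -- illustrative, not AgentProbe's code.
import re


class Expectation:
    def __init__(self, value: str):
        self._value = value

    def to_contain(self, needle: str) -> "Expectation":
        # Raise on failure, return self on success so calls chain.
        assert needle in self._value, f"expected output to contain {needle!r}"
        return self

    def to_not_contain(self, needle: str) -> "Expectation":
        assert needle not in self._value, f"expected output to omit {needle!r}"
        return self

    def to_match(self, pattern: str) -> "Expectation":
        assert re.search(pattern, self._value), f"expected match for {pattern!r}"
        return self


def expect(value: str) -> Expectation:
    return Expectation(value)


expect("It is 21 degrees in Paris").to_contain("Paris").to_match(r"\d+ degrees")
```

Returning `self` is what makes the one-line chains in the example above possible; a failed assertion short-circuits the chain with a descriptive error.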

Rule-Based Evaluation

Define evaluation rules that score agent output on multiple dimensions:

from agentprobe.eval.rules import RuleBasedEvaluator, RuleSpec

evaluator = RuleBasedEvaluator(
    rules=[
        RuleSpec(rule_type="contains_any", params={"values": ["Paris"]}, weight=3.0),
        RuleSpec(rule_type="max_length", params={"max": 500}, weight=1.0),
        RuleSpec(rule_type="json_valid"),
        RuleSpec(rule_type="regex", params={"pattern": r"\d{4}"}),
        RuleSpec(rule_type="not_contains", params={"values": ["error", "fail"]}),
    ]
)

result = await evaluator.evaluate(test_case, trace)
# result.verdict: PASS | PARTIAL | FAIL
# result.score: 0.0 - 1.0 (weighted average)
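The weighted-average scoring described above can be sketched in a few lines (illustrative only, not AgentProbe's internal code; here each rule is reduced to a pass/fail plus its weight):

```python
# Sketch of weighted-average rule scoring -- illustrative only.
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """results: (passed, weight) pairs, one per rule."""
    total_weight = sum(w for _, w in results)
    # Each passing rule contributes its full weight; failures contribute 0.
    return sum((1.0 if ok else 0.0) * w for ok, w in results) / total_weight


# Example: a weight-3.0 rule passes and a weight-1.0 rule fails
# -> 3.0 / 4.0 = 0.75
print(weighted_score([(True, 3.0), (False, 1.0)]))  # 0.75
```

This is why the `contains_any` rule with `weight=3.0` in the example above dominates the overall score.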

Cost Tracking

Track token usage and costs across providers. Supports Anthropic, OpenAI, Google, Mistral, and Cohere pricing out of the box:

from agentprobe.cost.calculator import CostCalculator

calculator = CostCalculator()  # loads built-in pricing data
summary = calculator.calculate_trace_cost(trace)

print(f"Total: ${summary.total_cost_usd:.6f}")
for model, breakdown in summary.breakdown_by_model.items():
    print(f"  {model}: ${breakdown.total_cost_usd:.6f} ({breakdown.call_count} calls)")

Enforce budgets per-test or per-suite:

from agentprobe.cost.budget import BudgetEnforcer

enforcer = BudgetEnforcer(test_budget_usd=0.10, suite_budget_usd=1.00)
check = enforcer.check_test(summary)
# check.within_budget, check.utilization_pct, check.remaining_usd

Safety Scanning

Run built-in suites that probe for common agent vulnerabilities:

from agentprobe.safety.scanner import SafetyScanner

scanner = SafetyScanner()  # loads all built-in suites
result = await scanner.scan(adapter)
# result.total_passed, result.total_failed, result.suite_results

Built-in suites: prompt injection, data leakage, jailbreak, role confusion, hallucination, tool abuse.

Regression Detection

Save baselines and detect when agent behavior degrades:

from agentprobe.regression.baseline import BaselineManager
from agentprobe.regression.detector import RegressionDetector

manager = BaselineManager(baseline_dir="baselines/")
manager.save("v1.0", test_results)

# Later, after changes...
baseline = manager.load("v1.0")
detector = RegressionDetector(threshold=0.05)
report = detector.compare("v1.0", baseline, current_results)

print(f"Regressions: {report.regressions}, Improvements: {report.improvements}")
for comp in report.comparisons:
    if comp.is_regression:
        print(f"  {comp.test_name}: {comp.baseline_score:.2f} -> {comp.current_score:.2f}")
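The core comparison behind a threshold-based detector is simple: a test regresses when its score drops by more than the threshold relative to the baseline. A sketch of that logic (illustrative only, not the `RegressionDetector` internals):

```python
# Sketch of threshold-based regression detection -- illustrative only.
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.05,
) -> list[str]:
    """Return test names whose score dropped by more than `threshold`."""
    return [
        name
        for name, base_score in baseline.items()
        if name in current and base_score - current[name] > threshold
    ]


print(find_regressions({"test_a": 0.95, "test_b": 0.90},
                       {"test_a": 0.80, "test_b": 0.89}))  # ['test_a']
```

A drop of 0.15 exceeds the 0.05 threshold, so only `test_a` is flagged; `test_b`'s 0.01 dip is within tolerance.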

Trace Replay and Diff

Replay recorded traces with mock overrides, or compare any two traces structurally:

from agentprobe.trace.replay import ReplayEngine
from agentprobe.trace.diff import TraceDiffer

# Replay with mock tool
engine = ReplayEngine(mock_tools={"search": lambda inp: "mocked result"})
replayed = engine.replay(original_trace)
diff = engine.diff(original_trace, replayed)

# Structural comparison between any two traces
differ = TraceDiffer(similarity_threshold=0.8)
report = differ.diff(trace_a, trace_b)
# report.output_matches, report.token_delta, report.overall_similarity
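A structural diff of this kind boils down to comparing outputs, token counts, and tool usage between two traces. A sketch using a hypothetical stand-in trace type (illustrative only; `MiniTrace` and the Jaccard tool-overlap metric are assumptions, not the `TraceDiffer` internals):

```python
# Sketch of structural trace comparison -- illustrative only.
from dataclasses import dataclass


@dataclass
class MiniTrace:  # hypothetical stand-in for AgentProbe's Trace
    output: str
    total_tokens: int
    tool_names: list[str]


def diff_traces(a: MiniTrace, b: MiniTrace) -> dict:
    """Compare two traces on output, token usage, and tool overlap."""
    shared = set(a.tool_names) & set(b.tool_names)
    union = set(a.tool_names) | set(b.tool_names)
    return {
        "output_matches": a.output == b.output,
        "token_delta": b.total_tokens - a.total_tokens,
        # Jaccard similarity of the tool sets, 1.0 when both are empty.
        "tool_overlap": len(shared) / len(union) if union else 1.0,
    }
```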

Security

Scan for PII, hash sensitive fields, and log security events:

from agentprobe.security import PIIRedactor, FieldEncryptor, AuditLogger

# PII detection and redaction
redactor = PIIRedactor()
clean = redactor.redact("Email john@example.com, SSN 123-45-6789")
# "Email [EMAIL], SSN [SSN]"

# Field-level hashing and masking
encryptor = FieldEncryptor()
encryptor.hash_value("sensitive-data")     # deterministic SHA-256
encryptor.mask_value("4111111111111111")   # "************1111"

# Structured audit logging
from agentprobe.security.audit import AuditLogger, AuditEventType
logger = AuditLogger()
logger.log_event(AuditEventType.PII_REDACTION, details={"matches": 3})
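The redact / hash / mask behaviors shown above can be sketched with the standard library (illustrative only, not AgentProbe's implementation; the regexes are simplified):

```python
# Sketch of redaction, deterministic hashing, and masking -- illustrative only.
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    # Replace each match with a typed placeholder.
    return SSN.sub("[SSN]", EMAIL.sub("[EMAIL]", text))


def hash_value(value: str) -> str:
    # Deterministic: the same input always yields the same digest,
    # so hashed fields remain joinable without exposing the plaintext.
    return hashlib.sha256(value.encode()).hexdigest()


def mask_value(value: str, visible: int = 4) -> str:
    # Keep only the trailing characters visible.
    return "*" * (len(value) - visible) + value[-visible:]


print(redact("Email john@example.com, SSN 123-45-6789"))
# Email [EMAIL], SSN [SSN]
print(mask_value("4111111111111111"))  # ************1111
```

Deterministic hashing is what makes hashed fields usable as stable identifiers (e.g. for deduplication) while masking is a display-only transform.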

CLI Reference

Command                       Description
agentprobe init               Create agentprobe.yaml configuration file
agentprobe test -d <dir>      Discover and run test scenarios
agentprobe trace list         List recorded execution traces
agentprobe trace show <id>    Inspect a specific trace
agentprobe safety scan        Run safety test suites
agentprobe cost report        Generate cost report
agentprobe cost budget        Check budget utilization
agentprobe baseline save      Save current results as baseline
agentprobe baseline compare   Detect regressions against baseline
agentprobe snapshot update    Update golden-file snapshots
agentprobe metrics list       List collected metrics
agentprobe metrics summary    Show aggregated metric statistics
agentprobe dashboard          Start the REST API dashboard

Configuration

agentprobe.yaml controls all behavior:

project_name: my-project
test_dir: tests

runner:
  parallel: true
  max_workers: 4
  default_timeout: 30.0

trace:
  enabled: true
  storage_backend: sqlite        # or "postgres"
  database_path: .agentprobe/traces.db

cost:
  enabled: true
  budget_limit_usd: 10.00

safety:
  enabled: true
  suites:
    - prompt-injection
    - data-exfiltration
    - jailbreak
    - tool-abuse

reporting:
  formats: [terminal, json, junit]
  output_dir: agentprobe-report

regression:
  threshold: 0.05
  baseline_dir: baselines/

metrics:
  enabled: true
  collect: [latency, tokens, cost, score]

plugins:
  enabled: true
  search_paths: [./plugins]

Framework Adapters

Built-in adapters for popular agent frameworks:

Framework   Adapter           Import
LangChain   LangChainAdapter  agentprobe.adapters.langchain
CrewAI      CrewAIAdapter     agentprobe.adapters.crewai
AutoGen     AutoGenAdapter    agentprobe.adapters.autogen
MCP         MCPAdapter        agentprobe.adapters.mcp

Or implement the AdapterProtocol:

class MyAdapter:
    @property
    def name(self) -> str:
        return "my-agent"

    async def invoke(self, input_text: str, **kwargs: object) -> Trace:
        # Call your agent and return a Trace
        ...
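One way to picture AdapterProtocol is as a structural type: any object with a `name` property and an async `invoke` method satisfies it. The sketch below renders that idea with `typing.Protocol` (the signatures are assumptions based on the example above, not the real definition from agentprobe):

```python
# Hypothetical rendering of AdapterProtocol -- signatures assumed from
# the example above, not the real agentprobe definition.
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class AdapterProtocol(Protocol):
    @property
    def name(self) -> str: ...

    async def invoke(self, input_text: str, **kwargs: object) -> object: ...


class EchoAdapter:
    """Toy adapter used only to show the protocol is satisfied."""

    @property
    def name(self) -> str:
        return "echo"

    async def invoke(self, input_text: str, **kwargs: object) -> object:
        # A real adapter would call your agent and return a Trace.
        return {"output": input_text}


assert isinstance(EchoAdapter(), AdapterProtocol)
print(asyncio.run(EchoAdapter().invoke("hello")))  # {'output': 'hello'}
```

Because the check is structural, no inheritance from AgentProbe types is required: your adapter just needs the right attributes.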

Storage Backends

Backend            Use Case                               Install
SQLite (default)   Local development, single-user         Built-in
PostgreSQL         Teams, concurrent access, production   pip install agentprobe-framework[postgres]

Both backends support the same API: save_trace, load_trace, list_traces, save_result, load_result, load_results, save_metrics, load_metrics.

Project Structure

src/agentprobe/
    core/          Test runner, discovery, assertions, config, models
    eval/          Evaluators: rules, embedding, judge, statistical, trace compare
    trace/         Recorder, replay, diff, time travel
    cost/          Calculator, pricing data (5 providers), budgets
    safety/        Scanner, 6 built-in test suites, payloads
    regression/    Detector, baselines, behavioral diff
    security/      PII redaction, field encryption, audit logging
    adapters/      LangChain, CrewAI, AutoGen, MCP
    metrics/       Collection, aggregation, trending
    storage/       SQLite + PostgreSQL backends with migrations
    reporting/     Terminal, HTML, JUnit XML, JSON, Markdown, CSV
    plugins/       Loader, registry, base classes
    dashboard/     FastAPI REST API
    cli/           Click-based CLI commands
tests/             1090+ tests, 94% coverage
examples/          9 runnable example scripts
docs/              MkDocs Material documentation

Development

git clone https://github.com/dyrach1o/agentprobe-framework.git
cd agentprobe-framework
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,test,docs]"

make check        # lint + type-check + test
make test         # pytest with coverage
make lint         # ruff check
make format       # ruff format
make type-check   # mypy strict
make docs-serve   # local docs at localhost:8000

Contributing

See CONTRIBUTING.md for development setup, coding standards, and PR workflow.

License

Apache License 2.0. See LICENSE for details.
