Formal regression testing for non-deterministic AI agent workflows

These details have not been verified by PyPI

Project links

Project description

AgentAssay

Test More. Spend Less. Ship Confident.

The first agent testing framework that delivers statistical guarantees WITHOUT burning your token budget.

The Problem

Every time you change a prompt, swap a model, or update a tool, you need to know: does my agent still work?

Today, answering that question is painfully expensive. Run 100 trials across 20 scenarios, and you've burned thousands of tokens just to check for a regression. Most teams either:

Over-test: Run fixed-N trials and waste budget on scenarios that don't need it.
Under-test: Skip testing because the cost is too high, and ship broken agents.
Guess: Run a few trials, eyeball the results, and hope for the best.

None of these are engineering. They are gambling.

The Solution

AgentAssay introduces token-efficient agent testing -- three techniques that deliver the same statistical confidence at a fraction of the cost:

1. Behavioral Fingerprinting

Instead of comparing raw text outputs (high-dimensional, noisy, expensive), AgentAssay extracts behavioral fingerprints -- compact representations of what the agent did rather than what it said. Tool sequences, state transitions, decision patterns. Low-dimensional signals need fewer samples to detect change.

2. Adaptive Budget Optimization

No more guessing how many trials to run. AgentAssay runs a small calibration set (5-10 runs), measures behavioral variance, and computes the exact minimum number of trials needed for your target confidence level. High-variance scenarios get more trials. Stable scenarios get fewer. Zero waste.

3. Trace-First Offline Analysis

Coverage metrics, contract checks, metamorphic relations, and mutation analysis can all run on production traces you already have -- at zero additional token cost. Why re-run your agent when you can analyze runs that already happened?

Result: Same confidence. 83% less cost.

Install

pip install agentassay

# With framework adapters
pip install agentassay[all]

# Development
pip install agentassay[dev]

Quick Example: Token-Efficient Testing

from agentassay.efficiency import BehavioralFingerprint, AdaptiveBudgetOptimizer
from agentassay.core.runner import TrialRunner
from agentassay.verdicts import VerdictFunction

# Step 1: Calibrate -- run just 10 trials to measure variance
optimizer = AdaptiveBudgetOptimizer(alpha=0.05, beta=0.10)
estimate = optimizer.calibrate(calibration_traces)

print(f"Recommended trials: {estimate.recommended_n}")   # e.g., 17 (not 100)
print(f"Estimated cost: ${estimate.estimated_cost_usd:.2f}")  # e.g., $0.34
print(f"Savings vs fixed-100: {estimate.savings_vs_fixed_100:.0%}")  # e.g., 83%

# Step 2: Run only the trials you need
runner = TrialRunner(agent_fn=my_agent, config=config)
results = runner.run_trials(scenario, n=estimate.recommended_n)

# Step 3: Compare fingerprints for regression detection
baseline_fp = BehavioralFingerprint.from_traces(baseline_traces)
current_fp = BehavioralFingerprint.from_traces(current_traces)
drift = baseline_fp.distance(current_fp)

# Step 4: Get a statistically-backed verdict
verdict = VerdictFunction(alpha=0.05).evaluate(results)
print(f"Verdict: {verdict.status}")  # PASS / FAIL / INCONCLUSIVE
print(f"Pass rate: {verdict.pass_rate:.1%} [{verdict.ci_lower:.1%}, {verdict.ci_upper:.1%}]")

How It Works

                      Token-Efficient Testing Pipeline
  +-----------------------------------------------------------------+
  |                                                                   |
  |  Production Traces -----> Trace Store -----> Offline Analysis     |
  |  (already paid for)                          (coverage, contracts,|
  |                                               metamorphic -- FREE)|
  |                                                     |             |
  |  New Agent Version --> Calibration (5-10 runs) --> Budget Estimate|
  |                                                     |             |
  |                   Targeted Testing (optimal N) --> Fingerprint    |
  |                                                    Comparison     |
  |                                                     |             |
  |                                          Statistical Verdict      |
  |                                          (5-20x cheaper)          |
  +-----------------------------------------------------------------+

The core insight: most of the information you need to test an agent is already in traces you have collected. AgentAssay extracts maximum signal from minimum runs.

Feature Matrix

Feature	Description
Behavioral fingerprinting	Detect regression from behavioral patterns, not raw text. Fewer samples needed.
Adaptive budget optimization	Calibrate variance, compute exact minimum N. No over-testing.
Trace-first offline analysis	Run coverage, contracts, and metamorphic checks on existing traces. Zero token cost.
Multi-fidelity proxy testing	Use cheaper models for initial screening, expensive models only for confirmation.
Warm-start sequential testing	Incorporate prior results to reach verdicts faster.
Three-valued verdicts	PASS, FAIL, or INCONCLUSIVE -- never a misleading binary answer.
Confidence intervals	Know the true pass rate range, not a point estimate.
Statistical regression detection	Hypothesis tests catch regressions before production.
5D coverage metrics	Measure tool, path, state, boundary, and model coverage.
Mutation testing	Perturb your agent to validate test sensitivity.
Metamorphic testing	Verify behavioral invariants across input transformations.
Contract oracle	Check behavioral specifications from AgentAssert contracts.
Deployment gates	Block broken deployments in CI/CD with statistical evidence.
Framework adapters	Works with popular agent frameworks out of the box.
pytest integration	Use familiar pytest conventions with statistical assertions.
CLI	Five commands: `run`, `compare`, `mutate`, `coverage`, `report`.

Comparison

Feature	AgentAssay	deepeval	agentrial	LangSmith
Statistical regression testing	:white_check_mark:	:x:	:warning:	:x:
Three-valued verdicts	:white_check_mark:	:x:	:x:	:x:
Token-efficient testing	:white_check_mark:	:x:	:x:	:x:
Behavioral fingerprinting	:white_check_mark:	:x:	:x:	:x:
Adaptive budget optimization	:white_check_mark:	:x:	:x:	:x:
Trace-first offline analysis	:white_check_mark:	:x:	:x:	:x:
5D coverage metrics	:white_check_mark:	:x:	:x:	:x:
Mutation testing	:white_check_mark:	:x:	:x:	:x:
Metamorphic testing	:white_check_mark:	:x:	:x:	:x:
CI/CD deployment gates	:white_check_mark:	:x:	:white_check_mark:	:x:
Published research paper	:white_check_mark:	:x:	:x:	:x:

Architecture

+-------------------------------------------------------------------+
|  Layer 6: Efficiency                                               |
|  Fingerprinting | Budget Optimization | Trace Analysis             |
|  Multi-Fidelity | Warm-Start Sequential                           |
+-------------------------------------------------------------------+
|  Layer 5: Integration                                              |
|  Framework Adapters | pytest Plugin | CLI | Reporting              |
+-------------------------------------------------------------------+
|  Layer 4: Analysis                                                 |
|  Coverage (5D) | Mutation | Metamorphic | Contract Oracle          |
+-------------------------------------------------------------------+
|  Layer 3: Verdicts                                                 |
|  Stochastic Verdicts | Deployment Gates                           |
+-------------------------------------------------------------------+
|  Layer 2: Statistics                                               |
|  Hypothesis Tests | Confidence Intervals | SPRT | Effect Size      |
+-------------------------------------------------------------------+
|  Layer 1: Core                                                     |
|  Data Models | Execution Engine | Trace Format                    |
+-------------------------------------------------------------------+

Layer 6 (Efficiency) is the differentiator. It sits atop the full statistical testing stack, optimizing how many runs are needed while Layers 1-5 ensure every run produces rigorous results.

Usage with pytest

import pytest

@pytest.mark.agentassay(n=30, threshold=0.80)
def test_agent_booking_flow(trial_runner):
    runner = trial_runner(my_agent)
    scenario = TestScenario(
        scenario_id="booking",
        name="Flight booking",
        input_data={"task": "Book a flight from NYC to London"},
        expected_properties={"max_steps": 10, "must_use_tools": ["search", "book"]},
    )
    results = runner.run_trials(scenario)
    assert_pass_rate(results, threshold=0.80, confidence=0.95)

python -m pytest tests/ -v --agentassay

CLI

# Run trials with adaptive budget
agentassay run --scenario booking.yaml --budget-mode adaptive

# Compare two versions for regression
agentassay compare --baseline v1.json --current v2.json

# Analyze coverage from existing traces
agentassay coverage --traces production-traces/ --tools search,book,cancel

# Mutation testing
agentassay mutate --scenario booking.yaml --operators prompt,tool,model

# Generate report
agentassay report --results trials.json --output report.html

Documentation

Installation
Quickstart
Token-Efficient Testing -- The core concept
Stochastic Testing -- Why agent testing needs statistics
Coverage Metrics -- Five-dimensional coverage model
Architecture Overview
CLI Reference

Research

AgentAssay is backed by a peer-reviewed research paper with formal definitions, theorems, and proofs.

@article{bhardwaj2026agentassay,
  title={AgentAssay: Formal Regression Testing for Non-Deterministic AI Agent Workflows},
  author={Bhardwaj, Varun Pratap},
  year={2026},
  note={Preprint}
}

Contributing

Contributions welcome. See CONTRIBUTING.md for guidelines.

License

Apache-2.0. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.3

Apr 17, 2026

0.2.2

Apr 17, 2026

0.2.1

Apr 17, 2026

0.2.0

Apr 10, 2026

0.1.2

Mar 6, 2026

0.1.1

Mar 6, 2026

This version

0.1.0

Mar 3, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentassay-0.1.0.tar.gz (2.5 MB view details)

Uploaded Mar 3, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

agentassay-0.1.0-py3-none-any.whl (262.6 kB view details)

Uploaded Mar 3, 2026 Python 3

File details

Details for the file agentassay-0.1.0.tar.gz.

File metadata

Download URL: agentassay-0.1.0.tar.gz
Upload date: Mar 3, 2026
Size: 2.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for agentassay-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`09a4ac44f2d92ed1066c1668f1280a349c3cf2b6211f6379f08179981f33baa9`
MD5	`0875523f00e2f738b1362dc55f8422d1`
BLAKE2b-256	`8e1ad67b66023cb867f1f056ec3e22f403682fe0260aced47650df5296ec89c4`

See more details on using hashes here.

File details

Details for the file agentassay-0.1.0-py3-none-any.whl.

File metadata

Download URL: agentassay-0.1.0-py3-none-any.whl
Upload date: Mar 3, 2026
Size: 262.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for agentassay-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bf9c2e68ad34624324230fc85d72459eda7adde2c0fc4359419d612667af48d0`
MD5	`31b4cac5ac3ac16c823cb8aa3cb6d6d6`
BLAKE2b-256	`a2d6184ed90b22b7ff591fc63d4eb82f415c210f260b2ac2bd5b3d0631023e7e`

See more details on using hashes here.

agentassay 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

AgentAssay

The Problem

The Solution

1. Behavioral Fingerprinting

2. Adaptive Budget Optimization

3. Trace-First Offline Analysis

Install

Quick Example: Token-Efficient Testing

How It Works

Feature Matrix

Comparison

Architecture

Usage with pytest

CLI

Documentation

Research

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes