Pytest-style behavioral regression testing for AI agents.

AgentCheck

AgentCheck is pytest for AI agents. Test behavior, not exact text.

  • GitHub: https://github.com/ashutosh-rath02/pygent-test/
  • PyPI: https://pypi.org/project/pygent-test/

Install

pip install pygent-test

Optional framework extras:

pip install "pygent-test[openai]"
pip install "pygent-test[langgraph]"
pip install "pygent-test[crewai]"

Quickstart (5 minutes)

Run these from a local checkout of the repository:

pip install -e .
python -m agentcheck.cli test examples
python -m agentcheck.cli bless examples
python -m agentcheck.cli test regression_examples

Together, these commands run a passing test, save a baseline, and then catch an intentional regression with a clear behavior diff.

What It Tests

AgentCheck checks observable agent behavior:

  • which tools were called, and how many times
  • whether tools ran in the expected order
  • whether the agent stayed within a step budget
  • whether the agent claimed success without tool evidence
  • whether any of the above regressed against a saved baseline
  • whether output matched or avoided specific content or patterns
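
To make these checks concrete, here is a minimal, library-free sketch of how such checks could be computed over a recorded run. The `Trace` and `ToolCall` shapes and all function names are hypothetical illustrations, not AgentCheck's internal data model, and counting one step per tool call is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical shape, not AgentCheck's)."""
    name: str
    ok: bool = True


@dataclass
class Trace:
    """A recorded agent run: ordered tool calls plus the final text."""
    calls: list = field(default_factory=list)
    final_output: str = ""


def tool_count(trace: Trace, name: str) -> int:
    """How many times a tool was called."""
    return sum(1 for call in trace.calls if call.name == name)


def tools_in_order(trace: Trace, expected: list) -> bool:
    """True if `expected` appears as a subsequence of the called tool names."""
    names = iter(call.name for call in trace.calls)
    # `tool in names` consumes the iterator, so order is enforced.
    return all(tool in names for tool in expected)


def within_step_budget(trace: Trace, budget: int) -> bool:
    """True if the run used fewer than `budget` steps (assumed: one step per call)."""
    return len(trace.calls) < budget


def claimed_without_evidence(trace: Trace, claim: str, tool: str) -> bool:
    """True if the output contains `claim` although `tool` never succeeded."""
    succeeded = any(c.name == tool and c.ok for c in trace.calls)
    return claim.lower() in trace.final_output.lower() and not succeeded


trace = Trace(
    calls=[ToolCall("restaurant_search"), ToolCall("booking_tool", ok=False)],
    final_output="Your table is confirmed.",
)
print(tool_count(trace, "restaurant_search"))
print(tools_in_order(trace, ["restaurant_search", "booking_tool"]))
print(within_step_budget(trace, 5))
print(claimed_without_evidence(trace, "confirmed", "booking_tool"))
```

The last check is the interesting one: the agent said "confirmed" while the booking tool failed, which is the kind of unsupported success claim a text-only comparison would miss.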

Write a Test

from agentcheck import agent_test, expect

# MyAgent is a placeholder: define or import your own agent class before use.
@agent_test(runs=5, agent_factory=MyAgent)
def test_booking_agent(agent):
    result = agent.run("Book a table for 2 tonight")

    check = expect(result, collect=True)
    check.used_tool("restaurant_search")
    check.used_tool("booking_tool")
    check.steps_less_than(5)
    check.did_not_claim_confirmation_without_tool("booking_tool")
    check.verify()
    return result

Assertions

expect(result).used_tool("search")
expect(result).used_tool_times("search", 2)
expect(result).used_tool_at_least("search", 1)
expect(result).used_tool_at_most("search", 3)
expect(result).did_not_use_tool("forbidden_tool")
expect(result).used_tools_in_order(["search", "summarize"])
expect(result).used_any_tool()
expect(result).tool_succeeded("book")
expect(result).steps_less_than(10)
expect(result).finished_successfully()
expect(result).did_not_error()
expect(result).final_output_contains("confirmed")
expect(result).final_output_does_not_contain("error")
expect(result).final_output_matches_pattern(r"Order #\d+")
expect(result).did_not_claim_confirmation_without_tool("booking_tool")

Chain multiple checks with collect=True to get all failures at once:

check = expect(result, collect=True)
check.used_tool("search")
check.steps_less_than(5)
check.verify()
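
The collect-then-verify pattern can be illustrated with a small stand-in class. This is a sketch of the general idea only: `CollectingChecker` is a hypothetical name, and AgentCheck's real `expect` object may be implemented quite differently. The failure labels reuse the category names from the Failure Categories section.

```python
class CollectingChecker:
    """Hypothetical stand-in showing the collect-then-verify pattern."""

    def __init__(self, tools_used, steps):
        self.tools_used = tools_used
        self.steps = steps
        self.failures = []  # failures are recorded here instead of raised

    def used_tool(self, name):
        if name not in self.tools_used:
            self.failures.append(f"missing_required_tool: {name}")
        return self  # returning self allows chaining

    def steps_less_than(self, limit):
        if self.steps >= limit:
            self.failures.append(f"step_budget_exceeded: {self.steps} >= {limit}")
        return self

    def verify(self):
        """Raise once, reporting every recorded failure together."""
        if self.failures:
            raise AssertionError("; ".join(self.failures))


check = CollectingChecker(tools_used=["summarize"], steps=7)
check.used_tool("search").steps_less_than(5)

try:
    check.verify()
except AssertionError as exc:
    message = str(exc)
    print(message)
```

Deferring the raise until `verify()` means a single run reports both the missing tool and the blown step budget, rather than stopping at the first problem.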

CLI Commands

# Run tests
agentcheck test [path] [-k filter_pattern] [--html report.html] [--fail-on-regression]

# Save baseline
agentcheck bless [path]

# Re-compare last run against baseline
agentcheck compare

# Print last report
agentcheck report [--html report.html]

# Baseline management
agentcheck baseline list
agentcheck baseline inspect .agentcheck/baselines/latest.json
agentcheck baseline delete .agentcheck/baselines/old.json --yes

# Agent contracts
agentcheck contract init my_agent
agentcheck contract validate agent_contract.json

# Scenario generation
agentcheck generate scenarios agent_contract.json --stub tests/generated_tests.py

# Config file
agentcheck config init

# Run history
agentcheck history list
agentcheck history show <run-id>

HTML Report

Every agentcheck test run automatically writes a self-contained HTML report to .agentcheck/reports/latest.html. Open it in any browser; no server is needed.

To write it to a custom path:

agentcheck test examples --html reports/run.html

Failure Categories

Every failed assertion is labeled with a category so you know exactly what type of failure occurred:

Category                    Triggered by
missing_required_tool       used_tool, used_any_tool, used_tool_times, etc.
wrong_tool_order            used_tools_in_order
step_budget_exceeded        steps_less_than
unsupported_success_claim   did_not_claim_confirmation_without_tool
runtime_error               finished_successfully, did_not_error
output_mismatch             final_output_contains, final_output_matches_pattern
tool_failure                tool_succeeded

Flakiness Detection

When a test runs multiple times and produces mixed results, AgentCheck computes a flakiness_score (0–1) and flags unstable_tool_paths when tool sequences vary between runs. Both appear in CLI output and the HTML/Markdown reports.
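
The exact formula behind flakiness_score is not stated here, so the following is only one plausible way to compute such a metric. It assumes the score is 0 when all runs agree and 1 at an even pass/fail split, and that a tool path is unstable when runs did not all follow the same sequence. Both function bodies are assumptions, not AgentCheck's implementation.

```python
from collections import Counter


def flakiness_score(passed):
    """Assumed formula: 0.0 when all runs agree, 1.0 at an even pass/fail split."""
    if not passed:
        return 0.0
    pass_rate = sum(passed) / len(passed)
    return 2 * min(pass_rate, 1 - pass_rate)


def has_unstable_tool_paths(paths):
    """True when the runs did not all follow the same tool sequence."""
    return len(Counter(paths)) > 1


# Four runs of the same test: one failed and took a shorter tool path.
runs_passed = [True, True, False, True]
runs_paths = [
    ("search", "book"),
    ("search", "book"),
    ("search",),
    ("search", "book"),
]

print(flakiness_score(runs_passed))
print(has_unstable_tool_paths(runs_paths))
```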

Agent Contracts

Define expected agent behavior in a reusable file:

agentcheck contract init booking_agent

This creates agent_contract.json:

{
  "name": "booking_agent",
  "expected_tools": ["search", "summarize"],
  "required_tool_order": [],
  "step_budget": 10,
  "success_conditions": ["answer provided"],
  "forbidden_claims": ["reservation complete"],
  "scenario_tags": ["happy_path"]
}

Validate it:

agentcheck contract validate agent_contract.json
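
What validation checks is not spelled out above. As a rough illustration, a validator for this file could look like the sketch below. The field names come from the example contract; the type rules and the two cross-field rules are assumptions, and `validate_contract` is a hypothetical helper rather than part of the package.

```python
import json

# Field names are taken from the example contract; the types are assumed.
FIELD_TYPES = {
    "name": str,
    "expected_tools": list,
    "required_tool_order": list,
    "step_budget": int,
    "success_conditions": list,
    "forbidden_claims": list,
    "scenario_tags": list,
}


def validate_contract(contract):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, expected_type in FIELD_TYPES.items():
        if key not in contract:
            problems.append(f"missing field: {key}")
        elif not isinstance(contract[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    if problems:
        return problems  # skip cross-field checks on malformed input

    # Assumed rule: every ordered tool must also be an expected tool.
    for tool in contract["required_tool_order"]:
        if tool not in contract["expected_tools"]:
            problems.append(f"required_tool_order names unexpected tool: {tool}")
    # Assumed rule: a step budget below 1 can never be satisfied.
    if contract["step_budget"] < 1:
        problems.append("step_budget must be at least 1")
    return problems


contract = json.loads("""
{
  "name": "booking_agent",
  "expected_tools": ["search", "summarize"],
  "required_tool_order": [],
  "step_budget": 10,
  "success_conditions": ["answer provided"],
  "forbidden_claims": ["reservation complete"],
  "scenario_tags": ["happy_path"]
}
""")
broken = {**contract, "required_tool_order": ["book"], "step_budget": 0}

print(validate_contract(contract))
print(validate_contract(broken))
```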

Scenario Generation

Generate starter test scenarios from a contract:

agentcheck generate scenarios agent_contract.json --stub tests/generated.py

This writes a JSON scenario pack and a ready-to-edit Python test file covering six scenario types: happy_path, missing_information, ambiguous_request, tool_failure, over_step, and unsupported_success.

HTTP Endpoint Testing

Test a deployed agent without importing any local code:

from agentcheck import agent_test, expect, HttpAdapter

adapter = HttpAdapter(
    "https://my-agent.example.com/run",
    auth_env_var="AGENT_API_KEY",
)

@agent_test(runs=3)
def test_deployed_agent():
    result = adapter.run_input("What is the weather in Tokyo?")
    return expect(result).used_any_tool().finished_successfully().verify()

Or fully environment-driven:

adapter = HttpAdapter.from_env(
    url_env_var="AGENT_ENDPOINT",
    auth_env_var="AGENT_API_KEY",
)

Config File

Create agentcheck.json in your project root to set defaults:

agentcheck config init

Example agentcheck.json:

{
  "path": ".",
  "runs": 3,
  "fail_on_regression": false
}

CLI flags always override config file values.
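
The precedence rule can be sketched with argparse and a dictionary merge. This is an illustration of the layering (defaults, then config file, then CLI flags), not the package's actual code. `resolve_settings` is a hypothetical helper, and only the path argument and the --fail-on-regression flag shown in CLI Commands are modeled.

```python
import argparse
import json

# Built-in defaults, mirroring the example agentcheck.json above.
DEFAULTS = {"path": ".", "runs": 3, "fail_on_regression": False}


def resolve_settings(config_text, argv):
    """Merge defaults < config file < CLI flags, in increasing priority."""
    parser = argparse.ArgumentParser()
    # default=None marks "not given on the command line".
    parser.add_argument("path", nargs="?", default=None)
    parser.add_argument("--fail-on-regression", action="store_true", default=None)
    cli = vars(parser.parse_args(argv))

    settings = {**DEFAULTS, **json.loads(config_text)}
    # Only values the user actually passed override the config file.
    settings.update({key: value for key, value in cli.items() if value is not None})
    return settings


config_text = '{"path": "tests", "runs": 5, "fail_on_regression": false}'

print(resolve_settings(config_text, []))
print(resolve_settings(config_text, ["examples", "--fail-on-regression"]))
```

Using None as the "not provided" sentinel matters: a plain store_true flag defaults to False, which would silently overwrite a true value set in the config file.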

Run History

Every test run is automatically recorded locally:

agentcheck history list
agentcheck history show abc123

History is stored at .agentcheck/history.json and capped at 200 entries.
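
A capped, append-only log like this can be sketched in a few lines. The entry fields (`run_id`, `passed`), the flat JSON list layout, and the oldest-first eviction are all assumptions made for illustration; only the file path and the 200-entry cap come from the text above.

```python
import json
import tempfile
from pathlib import Path

MAX_ENTRIES = 200  # cap stated in the text above


def record_run(history_file, entry, limit=MAX_ENTRIES):
    """Append one entry, then keep only the newest `limit` entries."""
    entries = json.loads(history_file.read_text()) if history_file.exists() else []
    entries.append(entry)
    entries = entries[-limit:]  # assumed policy: drop the oldest first
    history_file.parent.mkdir(parents=True, exist_ok=True)
    history_file.write_text(json.dumps(entries, indent=2))
    return entries


# Demonstrate the cap in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / ".agentcheck" / "history.json"
    for i in range(205):
        kept = record_run(path, {"run_id": f"run-{i}", "passed": True})

print(len(kept), kept[0]["run_id"], kept[-1]["run_id"])
# prints: 200 run-5 run-204
```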

Adapters

Adapter               Install                  Usage
PythonAdapter         built-in                 any Python callable
OpenAIAgentsAdapter   pygent-test[openai]      OpenAI Agents SDK
LangGraphAdapter      pygent-test[langgraph]   LangGraph StateGraph
CrewAIAdapter         pygent-test[crewai]      CrewAI Crew / Agent
HttpAdapter           built-in                 any HTTP endpoint

Regression Detection

When a baseline exists, agentcheck test compares the current run and reports:

  • success rate change per test
  • step drift, latency drift, cost drift
  • tool coverage drops
  • primary tool path changes
  • failure category breakdown

# Save a baseline
agentcheck bless examples

# Future runs compare automatically
agentcheck test examples --fail-on-regression
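
The comparison itself can be pictured as a diff between two summaries. The sketch below covers three of the reported signals (success rate, step drift, tool coverage). `RunSummary`, `find_regressions`, and the drift threshold are hypothetical, and the real baseline format is not documented here.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Aggregated results for one test (hypothetical shape)."""
    success_rate: float
    avg_steps: float
    tools_used: frozenset


def find_regressions(baseline, current, max_step_drift=1.0):
    """Compare a current summary to a baseline and describe what got worse."""
    findings = []
    if current.success_rate < baseline.success_rate:
        findings.append(
            f"success rate dropped: {baseline.success_rate:.0%} -> {current.success_rate:.0%}"
        )
    drift = current.avg_steps - baseline.avg_steps
    if drift > max_step_drift:  # assumed tolerance before drift is reported
        findings.append(f"step drift: +{drift:.1f}")
    dropped = baseline.tools_used - current.tools_used
    if dropped:
        findings.append("tool coverage dropped: " + ", ".join(sorted(dropped)))
    return findings


baseline = RunSummary(1.0, 3.0, frozenset({"search", "book"}))
current = RunSummary(0.8, 5.0, frozenset({"search"}))

for finding in find_regressions(baseline, current):
    print(finding)
```

A flag such as --fail-on-regression would then amount to exiting with a non-zero status whenever the findings list is non-empty.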

Test Filtering

Run a subset of tests by name:

agentcheck test -k booking
agentcheck test -k "research or booking"

CI Integration

- name: Run AgentCheck
  run: agentcheck test . --fail-on-regression --html reports/agentcheck.html

- name: Upload report
  uses: actions/upload-artifact@v4
  with:
    name: agentcheck-report
    path: reports/agentcheck.html

The Markdown report is automatically written to the GitHub Actions step summary when GITHUB_STEP_SUMMARY is set.
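
GitHub Actions exposes GITHUB_STEP_SUMMARY as the path of a file; Markdown appended to it appears on the job summary page. A minimal sketch of that mechanism follows. `publish_step_summary` is a hypothetical helper, not the package's function, and the sample Markdown is made up for the demonstration.

```python
import os
import tempfile
from pathlib import Path


def publish_step_summary(markdown):
    """Append Markdown to the step summary file; return False outside Actions."""
    target = os.environ.get("GITHUB_STEP_SUMMARY")
    if not target:
        return False
    # Append rather than overwrite: other steps may have written already.
    with open(target, "a", encoding="utf-8") as handle:
        handle.write(markdown.rstrip("\n") + "\n")
    return True


# Local demonstration: point the variable at a temporary file, then restore it.
previous = os.environ.get("GITHUB_STEP_SUMMARY")
with tempfile.TemporaryDirectory() as tmp:
    summary_path = Path(tmp) / "summary.md"
    os.environ["GITHUB_STEP_SUMMARY"] = str(summary_path)
    written = publish_step_summary("## AgentCheck\n\n3 passed, 0 failed")
    contents = summary_path.read_text(encoding="utf-8")

if previous is None:
    del os.environ["GITHUB_STEP_SUMMARY"]
else:
    os.environ["GITHUB_STEP_SUMMARY"] = previous

print(written)
print(contents)
```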

pytest

AgentCheck tests also run through pytest:

pytest examples -q
pytest tests -q

Artifacts Written Per Run

File                              Contents
.agentcheck/reports/latest.json   Full session report (JSON)
.agentcheck/reports/latest.md     Markdown report
.agentcheck/reports/latest.html   Self-contained HTML report
.agentcheck/traces/latest.json    Raw per-run traces
.agentcheck/history.json          Append-only run log
