
AgentHarness

Open-source test harness for AI agents that take real-world actions.

Problem

Teams can observe agents in traces and score outputs with existing tools, but they lack a shared, pytest-friendly way to assert the sequence of tool calls, arguments, and safety properties on a run. AgentHarness provides trace-oriented assertions and a CLI so those checks can run in CI without calling real APIs by default. It is complementary to observability and LLM evaluation stacks.

Install

Core package (editable, from repo root):

pip install -e "."

LangGraph adapter, dev tooling, and optional resource/cost helpers (matches CI and typical agent projects):

pip install -e ".[langgraph,dev]"

[dev] includes pytest-asyncio and typing stubs, and pulls in [resource] (tokencost) per pyproject.toml. Other extras: [openai], [anthropic], [crewai], [live], [compliance], [langfuse], [arize], or [all].

Quickstart

Behavioral checks use @scenario, the run fixture, and assertions on run.trace. For a trace that includes tool arguments (not only tool names), run the bundled LangGraph example test from the repo root:

from agentharness import (
    assert_approval_gate,
    assert_arg_lte,
    assert_called_before,
    scenario,
)


@scenario("examples/01_customer_support_langgraph/scenarios/happy_path.yaml")
def test_happy_path(run):
    assert_called_before(run.trace, "lookup_order", "issue_refund")
    assert_arg_lte(run.trace, tool="issue_refund", arg="amount", value=100)
    assert_approval_gate(run.trace, tool="issue_refund")

Install the adapter extras and run the test:

pip install -e ".[langgraph,dev]"
python -m pytest examples/01_customer_support_langgraph/test_refund_agent.py::test_happy_path -q

The example package overrides the run fixture so YAML steps execute under LangGraph with recorded args. More detail: examples/01_customer_support_langgraph/README.md.
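For orientation, a scenario file of this general shape might look like the sketch below. The field names (name, steps, mock_tools, returns) are illustrative assumptions, not the documented schema; see the example directory's YAML files for the actual format.

```yaml
# Hypothetical scenario sketch; field names are assumptions, not
# AgentHarness's documented schema. Check the bundled example's
# scenarios/ directory for the real format.
name: happy_path
description: Customer requests a refund under the approval threshold.
steps:
  - user: "I'd like a refund for order A-17."
mock_tools:
  lookup_order:
    returns: {order_id: "A-17", total: 42.00}
  issue_refund:
    returns: {status: "ok"}
```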

What AgentHarness is not

  • Not a monitoring or observability platform (use LangFuse or Arize Phoenix for that)
  • Not a full LLMOps platform
  • Not framework-specific (not a LangChain product)
  • Not an LLM benchmark (not SWE-bench or WebArena)
  • Not a replacement for LangSmith, DeepEval, or TruLens — complementary behavioral testing over traces

Available assertions

| Function | What it checks | Regulatory reference (from REFS_* in assertions/base.py) |
| --- | --- | --- |
| assert_called_before | First occurrence of earlier_tool before first occurrence of later_tool | EU AI Act Article 9; NIST AI RMF TEVV Verify |
| assert_call_count | Tool appears exactly expected times | EU AI Act Article 9 |
| assert_completion | No ERROR status on tool spans / no errors on ToolCallRecords | EU AI Act Article 9; NIST AI RMF TEVV Validate |
| assert_mutual_exclusion | Two tools are not both invoked in the same run | EU AI Act Article 9 |
| assert_arg_lte | Every call to tool has args[arg] <= value (numeric) | EU AI Act Article 15; Colorado SB 24-205 |
| assert_arg_pattern | Args match a regex | EU AI Act Article 15 |
| assert_arg_schema | Args validate against a JSON Schema | EU AI Act Article 9; NIST AI RMF TEVV Verify |
| assert_arg_not_contains | Args do not contain forbidden substrings | EU AI Act Article 15; OWASP LLM06:2025 |
| assert_approval_gate | Each call to tool has approved or approval_id in args | EU AI Act Article 14; Colorado SB 24-205; OWASP LLM06:2025 |
| assert_no_loop | Tool call count for tool does not exceed max_calls | EU AI Act Article 9; OWASP LLM10:2025 |
| assert_cost_under | Estimated trace cost at most max_usd (via trace attrs and optional tokencost) | OWASP LLM10:2025; EU AI Act Article 9 |
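To illustrate the kind of check these assertions perform, here is a standalone sketch of an ordering check over a recorded trace. The ToolCall record and its fields are hypothetical stand-ins, not AgentHarness's actual trace types, and the real assert_called_before raises on failure rather than returning a bool.

```python
# Illustrative sketch only: a minimal ordering check over a recorded
# trace, mirroring what assert_called_before verifies. The ToolCall
# record and field names are hypothetical, not AgentHarness internals.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


def called_before(trace, earlier_tool, later_tool):
    """True if the first call to earlier_tool precedes the first call
    to later_tool; both tools must appear at least once."""
    names = [call.tool for call in trace]
    if earlier_tool not in names or later_tool not in names:
        return False
    return names.index(earlier_tool) < names.index(later_tool)


trace = [
    ToolCall("lookup_order", {"order_id": "A-17"}),
    ToolCall("issue_refund", {"amount": 42, "approved": True}),
]
print(called_before(trace, "lookup_order", "issue_refund"))  # True
print(called_before(trace, "issue_refund", "lookup_order"))  # False
```

The library version wraps checks like this in assertion helpers that raise with a readable diff of the trace, so a failing CI run points at the offending tool call.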

CLI

agentharness run <scenario.yaml>
python -m agentharness run <scenario.yaml>

Both forms accept an optional --mode mock|live flag (default: mock).

Roadmap

Phase 0 (foundation, pytest plugin, assertions, LangGraph adapter, CLI run, example agent) is complete. Phase 1 adds record/replay, OpenAI and CrewAI adapters, multi-run statistical mode, and a PyPI release. A public roadmap will be published before the Phase 1 launch.

License

Apache License 2.0 — see LICENSE and pyproject.toml.

Contributing

See CONTRIBUTING.md for workflow and review expectations.
