agenteval

A lightweight, framework-agnostic toolkit for evaluating and observing LLM agents.

[IMAGE TO ADD] Run agenteval run tests/ against your demo agent and take a screenshot of the terminal output showing the results table. Save it as assets/demo.png and replace this block with: ![agenteval terminal output](assets/demo.png)


The problem this solves

Standard unit tests don't work for agents. If you write assert result == "expected answer", you've already lost — because the same prompt can produce a different tool-calling sequence, different wording, or even a different conclusion on the next run. Flaky by design.

The correct mental model for agent reliability is statistical. Your agent doesn't "pass" or "fail" — it passes 80% of the time or 60% of the time. That's a meaningful number you can track, regress against, and improve. agenteval is built around that idea.

You define a test, tell it how many times to run (n=20) and what pass rate you'll accept (threshold=0.85), and get back a proper reliability score — along with per-run traces, timing, step counts, and failure breakdowns you can actually debug with.

test_find_restaurants      18/20  ✅ 90%   avg 1.4s   2.8 steps
test_multi_step_reasoning   9/20  ⚠️  45%   avg 4.1s   6.3 steps
test_no_hallucination      15/20  ✅ 75%   avg 2.0s   3.5 steps
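
Concretely, the pass/fail decision behind each row is just the run-level pass rate compared against that test's threshold. Using the numbers from the first row:

# The decision behind each row above: run-level pass rate vs. threshold.
n_passed, n_runs, threshold = 18, 20, 0.85
pass_rate = n_passed / n_runs           # 0.90
met_threshold = pass_rate >= threshold  # True, so the test as a whole passes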

Install

pip install agenteval-py

If you're using a specific framework, grab the extras:

pip install "agenteval-py[openai]"       # OpenAI function calling
pip install "agenteval-py[anthropic]"    # Anthropic tool use
pip install "agenteval-py[langchain]"    # LangChain callback integration
pip install "agenteval-py[all]"          # everything

Requires Python 3.11+.

The PyPI package name is agenteval-py, but the Python import and CLI stay the same:

agenteval --help
import agenteval

Quick start

import agenteval
from agenteval import Tracer

# Your real agent and tools go here.
# This is a stand-in for demonstration.
async def web_search(query: str) -> str:
    return f"search results for: {query}"

async def my_agent(prompt: str, search) -> str:
    results = await search(query=prompt)
    return f"Based on my research: {results}"


# This test runs 20 times. It must pass at least 85% of those runs.
@agenteval.test(n=20, threshold=0.85)
async def test_agent_uses_search(tracer: Tracer) -> None:
    # Wrap your tools. The tracer records every call — name, args, result, timing.
    search = tracer.wrap(web_search)

    # Mark the run boundary. Duration is captured automatically.
    async with tracer.run(input="What is agenteval?") as run:
        result = await my_agent("What is agenteval?", search=search)
        run.set_output(result)

    # Assert on what happened. Failures are collected, not raised one-by-one.
    (tracer.assert_that()
        .called_tool("web_search")       # did the agent actually search?
        .completed_within_steps(5)       # not wandering through 10 tool calls?
        .completed_within_seconds(10.0)  # not timing out?
        .response_contains("research")   # did it produce a real response?
        .no_errors()                     # no unhandled exceptions?
        .check())                        # raise once with all failures listed

Run it with the CLI:

agenteval run tests/

Or directly in Python:

result = agenteval.run(test_agent_uses_search, n=10)
print(f"{result.n_passed}/{result.n_runs} passed ({result.pass_rate:.0%})")

# Drill into failures
for trace in result.failed_traces:
    print(f"Run {trace.run_id}: {trace.error or trace.assertion_errors}")

How it works

[IMAGE TO ADD] A simple box diagram showing the flow: Test Function → Tracer (wraps tools) → Runner (N concurrent runs) → Reporter Save it as assets/architecture.png and replace this block with: ![Architecture](assets/architecture.png)

There are four moving parts:

Tracer — handed to your test function fresh each run. You wrap your tools with it (tracer.wrap(fn)), and it records every call: the function name, the arguments it received, the result it returned, how long it took, and whether it threw. Nothing in your actual agent code needs to change.

Runner — executes your test function N times concurrently (bounded by concurrency, default 4). It handles both sync and async test functions, catches any unhandled exceptions, and aggregates everything into a TestResult.

Assertions — a fluent chainable API (tracer.assert_that().called_tool(...).no_errors().check()) that collects failures instead of raising at the first one. When you call .check(), you get a single AssertionError listing every problem from that run.

Reporter — renders a color-coded summary table in the terminal with pass rates, average timing, step counts, and optional per-run trace details. Exports JSON for CI pipelines.


Writing tests

The test decorator

@agenteval.test(n=20, threshold=0.85, tags=["search", "integration"])
async def test_my_agent(tracer: Tracer) -> None:
    ...

  • n — how many times to run this test (default: 20)
  • threshold — minimum pass rate to consider this test passing (default: 0.8)
  • tags — optional labels for filtering with agenteval run --tag

Sync test functions work too — agenteval wraps them automatically:

@agenteval.test(n=10)
def test_sync_agent(tracer: Tracer) -> None:
    tool = tracer.wrap(my_sync_tool)
    with tracer.run(input="hello") as run:
        result = my_sync_agent("hello", tool=tool)
        run.set_output(result)
    tracer.assert_that().no_errors().check()

Wrapping tools

# Method form — returns a wrapped version of the function
search = tracer.wrap(my_search_fn)
search = tracer.wrap(my_search_fn, name="web_search")  # custom name in traces

# Decorator form — useful when defining tools inline
@tracer.tool
def calculator(x: int, y: int) -> int:
    return x + y

@tracer.tool(name="math")
async def calculator(x: int, y: int) -> int:
    return x + y

Both sync and async functions are supported. The wrapper is transparent — it preserves return values and re-raises exceptions exactly as before, just with recording on the side.
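
A minimal sketch of what that transparency means in practice; flaky_lookup and the fallback string are stand-ins for this example:

# The wrapper re-raises the tool's exception unchanged, so the agent's own
# error handling keeps working, and the failed call is still recorded in the trace.
def flaky_lookup(query: str) -> str:
    raise TimeoutError("upstream search timed out")

@agenteval.test(n=5)
async def test_failed_tool_call_is_traced(tracer: Tracer) -> None:
    lookup = tracer.wrap(flaky_lookup)
    async with tracer.run(input="anything") as run:
        try:
            run.set_output(lookup("anything"))
        except TimeoutError:
            run.set_output("fell back to a canned answer")  # agent-level recovery
    tracer.assert_that().called_tool("flaky_lookup").check()  # the call was still traced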

Recording the run

async with tracer.run(input="user prompt here") as run:
    result = await my_agent("user prompt here")
    run.set_output(result)
    run.set_token_usage({"input_tokens": 400, "output_tokens": 120})
    run.add_metadata(model="gpt-4o", temperature=0.7)

The context manager captures start/end time automatically. Any unhandled exception inside the block is recorded as trace.error (AssertionErrors from .check() are handled separately by the runner).
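
If a "step" in your agent means more than a tool call, you can report your own count; run.set_steps() is the manual override mentioned under total_steps in the data models below. A sketch, where the agent returning a (result, llm_round_trips) pair is purely illustrative:

# Overriding the step count when steps aren't just tool calls.
async with tracer.run(input="summarize the quarterly report") as run:
    result, llm_round_trips = await my_agent("summarize the quarterly report")
    run.set_output(result)
    run.set_steps(llm_round_trips)  # otherwise total_steps defaults to the tool-call count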


Assertions

All assertion methods return self so you can chain them. Failures are collected — .check() raises once at the end listing every issue, not just the first one.

(tracer.assert_that()
    # Tool usage
    .called_tool("search")                              # called at least once
    .never_called_tool("delete_record")                 # must never be called
    .tool_call_count("search", min=1, max=3)            # call count range
    .tool_called_before("search", "summarize")          # ordering check
    .tool_called_with_args("search", {"q": "python"})   # argument subset match
    .tool_called_with_args("search", {"q": "python"}, match="exact")  # exact match

    # Performance
    .completed_within_steps(8)          # at most 8 tool calls
    .completed_within_seconds(15.0)     # wall-clock time limit

    # Output quality
    .response_contains("Python")                        # substring match
    .response_contains("python", case_sensitive=False)  # case-insensitive
    .response_matches_schema(MyPydanticModel)           # structured output validation

    # Errors
    .no_errors()                                        # no unhandled exceptions

    # Custom escape hatch
    .custom(lambda t: len(t.tool_calls) >= 2)
    .custom(lambda t: "gpt-4" in str(t.metadata), message="wrong model used")

    .check())  # raise AssertionError if anything above failed

You can also inspect without raising:

aset = tracer.assert_that().called_tool("search").no_errors()
if not aset.passed:
    print(aset.failures)  # list[str] — all failure messages

Framework adapters

OpenAI

pip install "agenteval-py[openai]"
from agenteval.adapters.openai_adapter import wrap_tools, extract_token_usage
import json

@agenteval.test(n=15, threshold=0.8)
async def test_openai_agent(tracer: Tracer) -> None:
    tools = wrap_tools({"search": search_fn, "calculator": calc_fn}, tracer)

    async with tracer.run(input=prompt) as run:
        messages = [{"role": "user", "content": prompt}]
        while True:
            response = await client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=openai_tool_schemas,
            )
            run.set_token_usage(extract_token_usage(response))
            choice = response.choices[0]

            if choice.finish_reason == "tool_calls":
                tool_results = []
                for tc in choice.message.tool_calls:
                    args = json.loads(tc.function.arguments)
                    result = await tools[tc.function.name](**args)
                    tool_results.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": str(result),
                    })
                messages.append(choice.message)
                messages.extend(tool_results)
            else:
                run.set_output(choice.message.content)
                break

    tracer.assert_that().called_tool("search").no_errors().check()

Anthropic

pip install "agenteval-py[anthropic]"
from agenteval.adapters.anthropic_adapter import wrap_tools, extract_token_usage

@agenteval.test(n=15, threshold=0.8)
async def test_anthropic_agent(tracer: Tracer) -> None:
    tools = wrap_tools({"web_search": search_fn}, tracer)

    async with tracer.run(input=prompt) as run:
        messages = [{"role": "user", "content": prompt}]
        while True:
            response = await client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                tools=anthropic_tool_schemas,
                messages=messages,
            )
            run.set_token_usage(extract_token_usage(response))

            if response.stop_reason == "tool_use":
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        result = await tools[block.name](**block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": str(result),
                        })
                messages.append({"role": "assistant", "content": response.content})
                messages.append({"role": "user", "content": tool_results})
            else:
                text = next(b.text for b in response.content if b.type == "text")
                run.set_output(text)
                break

    tracer.assert_that().called_tool("web_search").no_errors().check()

LangChain

pip install "agenteval-py[langchain]"
from agenteval.adapters.langchain_adapter import AgentEvalCallbackHandler

@agenteval.test(n=10, threshold=0.75)
async def test_langchain_agent(tracer: Tracer) -> None:
    handler = AgentEvalCallbackHandler()
    # The handler auto-connects to the active tracer via ContextVar.
    # Multiple concurrent runs each get their own isolated tracer — no locking needed.

    async with tracer.run(input=prompt) as run:
        result = await agent.ainvoke(
            {"input": prompt},
            config={"callbacks": [handler]},
        )
        run.set_output(result.get("output", ""))

    tracer.assert_that().called_tool("my_tool").no_errors().check()

CLI reference

# Discover and run all test_*.py files
agenteval run tests/

# Adjust runs and threshold for this session
agenteval run tests/ --n 10 --threshold 0.9

# Filter by tag
agenteval run tests/ --tag integration

# Show per-trace details (useful when debugging failures)
agenteval run tests/ --traces

# Save a JSON report for CI or later inspection
agenteval run tests/ --output report.json

# Pretty-print a saved report
agenteval report report.json
agenteval report report.json --traces

Exit codes: 0 if all tests met their threshold, 1 if any failed, 2 on error (missing file, import failure, etc.). This makes it straightforward to gate on in CI/CD pipelines.

Flag                         Description                         Default
--n                          Override run count for all tests    test's own n
--threshold                  Override threshold for all tests    test's own threshold
--concurrency                Max concurrent agent runs           4
--tag                        Filter tests by tag (repeatable)    all tests
--pattern                    File glob for test discovery        test_*.py
--output                     Write JSON report to a file         none
--traces                     Show per-run trace details          off
--failures / --no-failures   Show failure reasons                on

Data models

AgentTrace(
    run_id,            # unique UUID for this run
    input,             # what was passed to the agent
    output,            # what the agent returned
    tool_calls,        # list[ToolCall] — every recorded tool invocation
    duration_seconds,  # wall-clock time for the entire agent run
    total_steps,       # number of tool calls (or manual override via run.set_steps())
    token_usage,       # dict from run.set_token_usage() — None if not provided
    error,             # string if the agent raised an unhandled exception
    assertion_errors,  # list of failure messages from .check()
    passed,            # True only if error is None and assertion_errors is empty
    metadata,          # dict from run.add_metadata()
)

TestResult(
    test_name, n_runs, n_passed,
    pass_rate,          # n_passed / n_runs
    threshold,
    traces,             # all AgentTrace objects for this test
    passed_traces,      # computed — only traces where passed=True
    failed_traces,      # computed — only traces where passed=False
    avg_duration,       # mean duration across all runs
    avg_steps,          # mean step count across all runs
    met_threshold,      # True if pass_rate >= threshold
    tags,
)

SuiteResult(
    results,            # list[TestResult], one per test function
    start_time, end_time,
    total_tests, passed_tests, failed_tests,
    all_passed,
    duration_seconds,
)
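
A short sketch of walking these objects; suite here is whatever SuiteResult you got back, and the field names are exactly the ones listed above:

# Drill from the suite summary down to individual failing traces.
for test in suite.results:
    status = "PASS" if test.met_threshold else "FAIL"
    print(f"{status} {test.test_name}: {test.n_passed}/{test.n_runs} "
          f"({test.pass_rate:.0%}), avg {test.avg_duration:.1f}s, {test.avg_steps:.1f} steps")
    for trace in test.failed_traces:
        reason = trace.error or "; ".join(trace.assertion_errors)
        print(f"  run {trace.run_id}: {reason}")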

Using in CI

The exit code makes this drop-in ready. A GitHub Actions example:

- name: Run agent evals
  run: agenteval run tests/ --n 5 --output eval-report.json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Upload eval report
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: eval-report.json
  if: always()

Keep n lower in CI to control API costs — 5 or 10 runs is usually enough to catch regressions.
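
If you want more than the exit code, say a custom gate or a summary comment, you can read the JSON report back in a follow-up step. A sketch that assumes the report mirrors the SuiteResult/TestResult fields above; check the exact keys in your own eval-report.json before relying on them:

# CI helper: summarize eval-report.json and fail if anything missed its threshold.
# NOTE: the key names here are an assumption based on the data models above.
import json
import sys

with open("eval-report.json") as f:
    report = json.load(f)

below = [t for t in report["results"] if not t["met_threshold"]]
for t in below:
    print(f"below threshold: {t['test_name']} ({t['pass_rate']:.0%} < {t['threshold']:.0%})")

sys.exit(1 if below else 0)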


Project structure

agenteval/
├── src/agenteval/
│   ├── __init__.py          # public API surface
│   ├── tracer.py            # Tracer + RunContext — the core recording layer
│   ├── assertions.py        # AssertionSet — fluent assertion API
│   ├── runner.py            # runs a test function N times concurrently
│   ├── registry.py          # @agenteval.test decorator + global test registry
│   ├── suite.py             # discovers and runs test files
│   ├── reporter.py          # Rich terminal output + JSON export
│   ├── cli.py               # Typer CLI (agenteval run / agenteval report)
│   ├── models.py            # AgentTrace, TestResult, SuiteResult, ToolCall
│   ├── exceptions.py        # TracerError
│   └── adapters/
│       ├── openai_adapter.py      # wrap_tools + extract_token_usage for OpenAI
│       ├── anthropic_adapter.py   # same for Anthropic
│       └── langchain_adapter.py   # AgentEvalCallbackHandler for LangChain
├── tests/                   # full test suite for the framework itself
├── docs/                    # detailed documentation per topic
│   ├── quickstart.md
│   ├── tracer.md
│   ├── assertions.md
│   ├── adapters.md
│   └── cli.md
└── pyproject.toml

Development setup

git clone https://github.com/awesome-pro/agenteval
cd agenteval

# Install in editable mode with all dev dependencies
pip install -e ".[dev]"

# Run the full test suite (no API keys needed)
pytest tests/ -v

# Lint
ruff check src/ tests/ --select F,I

The test suite runs completely offline — all agent runs inside the tests use mock functions. Three Python versions are tested in CI (3.11, 3.12, 3.13).


Contributing

See CONTRIBUTING.md for how to get set up, the branching strategy, and what to include in a pull request.


License

MIT — see LICENSE for details.
