
Evaldeck

The evaluation framework for AI agents. Pytest for agents.

Available on PyPI · Apache 2.0 license · Requires Python 3.10+


Evaldeck helps you answer one question: "Is my agent actually working?"

Unlike LLM evaluation tools that focus on input→output scoring, Evaldeck evaluates the entire agent execution—how it reasons, which tools it selects, and whether it achieves the goal.
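The difference is easiest to see in code. The toy check below (illustrative only, not Evaldeck's API) asserts over a whole trace — which tools ran, in what order, and how many steps it took — rather than just comparing the final output:

```python
# Illustrative only: a toy trace-level check, not Evaldeck internals.
# A trace records every step the agent took, not just the final answer.
trace = {
    "steps": [
        {"type": "tool_call", "tool_name": "search_flights"},
        {"type": "tool_call", "tool_name": "book_flight"},
    ],
    "output": "Booked. Confirmation: ABC123",
}

tools_used = [s["tool_name"] for s in trace["steps"] if s["type"] == "tool_call"]

# Trace-level assertions: which tools ran, in what order, how many steps.
assert tools_used == ["search_flights", "book_flight"]
assert len(trace["steps"]) <= 5
assert "Confirmation" in trace["output"]
```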

Why Evaldeck?

  • Agent-native: Evaluates multi-step traces, not just final outputs
  • Framework-agnostic: Works with LangChain, CrewAI, AutoGen, or custom agents
  • Developer-friendly: CLI-first, CI/CD ready, 5-minute setup
  • Comprehensive metrics: Tool correctness, step efficiency, plan adherence, and more
  • Flexible grading: Code-based, model-based (BYOK), or combine both

Installation

pip install evaldeck

With framework integrations:

pip install evaldeck[langchain]  # LangChain/LangGraph support
pip install evaldeck[openai]     # OpenAI model graders
pip install evaldeck[all]        # Everything

Quick Start

1. Initialize your project

evaldeck init

This creates:

evaldeck.yaml          # Configuration
tests/evals/         # Test directory
  example.yaml       # Example test case

2. Define test cases

# tests/evals/booking.yaml
name: book_flight_basic
input: "Book me a flight from NYC to LA on March 15th"

expected:
  tools_called:
    - search_flights
    - book_flight
  output_contains:
    - "confirmation"
    - "March 15"
  max_steps: 5

3. Run evaluations

evaldeck run

Output:

Running 3 tests...

  ✓ book_flight_basic (1.2s)
  ✓ book_flight_roundtrip (2.1s)
  ✗ book_flight_with_preferences (1.8s)
    └─ FAIL at step 3: Wrong tool called
       Expected: search_flights_with_filters
       Got: search_flights

Results: 2/3 passed (66.7%)

Configuration

# evaldeck.yaml
version: 1

agent:
  module: my_agent
  function: run

test_dir: tests/evals

defaults:
  timeout: 30
  retries: 2

graders:
  llm:
    model: gpt-4o-mini
    # Uses OPENAI_API_KEY from environment

thresholds:
  min_pass_rate: 0.9
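To make the threshold concrete, here is the arithmetic behind `min_pass_rate` applied to the example run above (how Evaldeck signals a failed run, e.g. via exit code, is an assumption here):

```python
# Illustrative sketch of how a min_pass_rate threshold gates a run.
passed, total = 2, 3          # from the example run output above
min_pass_rate = 0.9           # from evaldeck.yaml

pass_rate = passed / total
print(f"pass rate: {pass_rate:.1%}")   # pass rate: 66.7%

run_ok = pass_rate >= min_pass_rate
print("PASS" if run_ok else "FAIL")    # FAIL
```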

Test Case Format

Basic test case

name: test_name
input: "User message to the agent"

expected:
  # What tools should be called?
  tools_called:
    - tool_name_1
    - tool_name_2

  # What tools should NOT be called?
  tools_not_called:
    - dangerous_tool

  # What should the output contain?
  output_contains:
    - "expected phrase"

  # What should the output NOT contain?
  output_not_contains:
    - "error"

  # Maximum steps allowed
  max_steps: 10

  # Must complete successfully?
  task_completed: true
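For intuition, the expectation fields above amount to a set of simple checks against a recorded trace. The sketch below mirrors the YAML field names, but the checking code is illustrative, not Evaldeck's implementation:

```python
# Hypothetical checker mirroring the YAML expectation fields above.
expected = {
    "tools_called": ["search_flights", "book_flight"],
    "tools_not_called": ["dangerous_tool"],
    "output_contains": ["confirmation"],
    "output_not_contains": ["error"],
    "max_steps": 10,
}

# A trace recorded from an agent run (toy data).
trace_tools = ["search_flights", "book_flight"]
output = "Here is your confirmation: ABC123"

failures = []
for tool in expected["tools_called"]:
    if tool not in trace_tools:
        failures.append(f"missing tool call: {tool}")
for tool in expected["tools_not_called"]:
    if tool in trace_tools:
        failures.append(f"forbidden tool called: {tool}")
for phrase in expected["output_contains"]:
    if phrase not in output:
        failures.append(f"output missing: {phrase!r}")
for phrase in expected["output_not_contains"]:
    if phrase in output:
        failures.append(f"output contains: {phrase!r}")
if len(trace_tools) > expected["max_steps"]:
    failures.append("too many steps")

print("PASS" if not failures else failures)  # PASS
```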

Using model-based grading

name: helpful_response
input: "Explain quantum computing"

graders:
  - type: llm
    prompt: |
      Rate this response for helpfulness and accuracy.
      Response: {{ output }}

      Score from 1-5, where 5 is excellent.
    threshold: 4
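A model grader like this ultimately reduces a judge model's free-text reply to a pass/fail decision against the threshold. The parsing logic below is an assumption for illustration, not Evaldeck's implementation:

```python
import re

# Illustrative: extract a 1-5 score from a judge reply and compare it
# to the YAML threshold. Parsing strategy is an assumption.
def grade(judge_reply: str, threshold: int) -> bool:
    match = re.search(r"\b([1-5])\b", judge_reply)
    if not match:
        return False  # treat an unparseable reply as a failure
    return int(match.group(1)) >= threshold

print(grade("Score: 4. Clear and accurate.", threshold=4))  # True
print(grade("I'd rate this a 2 out of 5.", threshold=4))    # False
```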

Framework Integration

LangChain

Copy the reference tracer from examples/langchain_tracer.py to your project:

from langchain_tracer import EvaldeckTracer
from langchain.agents import AgentExecutor

tracer = EvaldeckTracer()
agent = AgentExecutor(...)

result = agent.invoke(
    {"input": "Book a flight"},
    config={"callbacks": [tracer]}
)

# Get trace for evaluation
trace = tracer.get_trace()

Manual trace construction

from evaldeck import Trace, Step, Evaluator

trace = Trace(
    input="Book a flight from NYC to LA",
    steps=[
        Step(
            type="tool_call",
            tool_name="search_flights",
            tool_args={"from": "NYC", "to": "LA"},
            tool_result=[{"flight": "AA123", "price": 299}]
        ),
        Step(
            type="tool_call",
            tool_name="book_flight",
            tool_args={"flight_id": "AA123"},
            tool_result={"confirmation": "ABC123"}
        ),
    ],
    output="Your flight AA123 is booked. Confirmation: ABC123",
    status="success"
)

evaluator = Evaluator()
result = evaluator.evaluate(trace, test_case)

CI/CD Integration

GitHub Actions

# .github/workflows/evaldeck.yaml
name: Agent Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - run: pip install evaldeck[all]

      - run: evaldeck run --output junit --output-file results.xml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: results.xml

Metrics

Metric                 Description
--------------------   -------------------------------------------
task_completion        Did the agent achieve the goal?
tool_correctness       Were the right tools selected?
argument_correctness   Were correct arguments passed to tools?
step_efficiency        Did it complete without unnecessary steps?
tool_call_ordering     Were tools called in the right sequence?
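As a rough intuition for step_efficiency, one natural formulation is the ratio of the minimal number of steps needed to the steps actually taken. The formula below is an assumption for illustration, not Evaldeck's exact definition:

```python
# Illustrative arithmetic for a step_efficiency-style metric.
def step_efficiency(actual_steps: int, minimal_steps: int) -> float:
    if actual_steps == 0:
        return 0.0
    return min(1.0, minimal_steps / actual_steps)

print(step_efficiency(actual_steps=2, minimal_steps=2))  # 1.0 (no wasted steps)
print(step_efficiency(actual_steps=4, minimal_steps=2))  # 0.5 (two redundant steps)
```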

Graders

Code-based (deterministic)

from evaldeck.graders import ContainsGrader, ToolCalledGrader

graders = [
    ContainsGrader(values=["confirmation"]),
    ToolCalledGrader(required=["book_flight"]),
]

Model-based (LLM-as-judge)

from evaldeck.graders import LLMGrader

grader = LLMGrader(
    prompt="Did the agent complete the booking? Answer: pass or fail",
    model="gpt-4o-mini",  # Uses your API key
)

Roadmap

  • Core evaluation engine
  • CLI interface
  • LangChain integration
  • CrewAI integration
  • AutoGen integration
  • VS Code extension
  • Historical result tracking
  • Team dashboard (cloud)

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Setup development environment
git clone https://github.com/tantra-run/evaldeck-py.git
cd evaldeck-py
pip install -e ".[dev]"
pre-commit install

# Run tests
pytest

# Run linting
ruff check .
mypy src/

License

Apache 2.0 - See LICENSE for details.
