
Evaldeck

The evaluation framework for AI agents. Pytest for agents.

Requires Python 3.10+ · Apache 2.0 license


Evaldeck helps you answer one question: "Is my agent actually working?"

Unlike LLM evaluation tools that score a single input/output pair, Evaldeck evaluates the entire agent execution: how it reasons, which tools it selects, and whether it achieves the goal.

Why Evaldeck?

  • Agent-native: Evaluates multi-step traces, not just final outputs
  • Framework-agnostic: Works with LangChain, CrewAI, AutoGen, or custom agents
  • Developer-friendly: CLI-first, CI/CD ready, 5-minute setup
  • Comprehensive metrics: Tool correctness, step efficiency, plan adherence, and more
  • Flexible grading: Code-based, model-based (BYOK), or combine both

Installation

pip install evaldeck

With framework integrations:

pip install evaldeck[langchain]  # LangChain/LangGraph support
pip install evaldeck[openai]     # OpenAI model graders
pip install evaldeck[all]        # Everything

Quick Start

1. Initialize your project

evaldeck init

This creates:

evaldeck.yaml        # Configuration
tests/evals/         # Test directory
  example.yaml       # Example test case

2. Define test cases

# tests/evals/booking.yaml
name: book_flight_basic
input: "Book me a flight from NYC to LA on March 15th"

expected:
  tools_called:
    - search_flights
    - book_flight
  output_contains:
    - "confirmation"
    - "March 15"
  max_steps: 5

3. Run evaluations

evaldeck run

Output:

Running 3 tests...

  ✓ book_flight_basic (1.2s)
  ✓ book_flight_roundtrip (2.1s)
  ✗ book_flight_with_preferences (1.8s)
    └─ FAIL at step 3: Wrong tool called
       Expected: search_flights_with_filters
       Got: search_flights

Results: 2/3 passed (66.7%)

Configuration

# evaldeck.yaml
version: 1

agent:
  module: my_agent
  function: run

test_dir: tests/evals

defaults:
  timeout: 30
  retries: 2

graders:
  llm:
    model: gpt-4o-mini
    # Uses OPENAI_API_KEY from environment

thresholds:
  min_pass_rate: 0.9
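The `min_pass_rate` threshold turns an eval run into a pass/fail gate, which is what makes it useful in CI. A minimal sketch of that logic in plain Python (the function name and exit-code convention here are illustrative, not Evaldeck's internal API):

```python
# Hypothetical sketch: gate a run on min_pass_rate, as CI would.
# Names below are illustrative, not Evaldeck's implementation.

def run_gate(results: list[bool], min_pass_rate: float) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold."""
    if not results:
        return 1  # no tests ran; treat as failure
    pass_rate = sum(results) / len(results)
    print(f"Results: {sum(results)}/{len(results)} passed ({pass_rate:.1%})")
    return 0 if pass_rate >= min_pass_rate else 1

# Two of three tests passing fails a 0.9 threshold.
exit_code = run_gate([True, True, False], min_pass_rate=0.9)
```

With `min_pass_rate: 0.9`, a 2/3 run exits non-zero and fails the build.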

Test Case Format

Basic test case

name: test_name
input: "User message to the agent"

expected:
  # What tools should be called?
  tools_called:
    - tool_name_1
    - tool_name_2

  # What tools should NOT be called?
  tools_not_called:
    - dangerous_tool

  # What should the output contain?
  output_contains:
    - "expected phrase"

  # What should the output NOT contain?
  output_not_contains:
    - "error"

  # Maximum steps allowed
  max_steps: 10

  # Must complete successfully?
  task_completed: true
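Each `expected` field maps to a deterministic assertion over the recorded trace. A sketch of that mapping, assuming a trace is a dict with `steps` and `output` as in the manual-construction example below (the checking logic is illustrative, not Evaldeck's implementation):

```python
# Illustrative checker: how the `expected` YAML fields could be applied
# to a recorded trace. Field names follow the YAML above; the logic is
# a sketch, not Evaldeck's actual evaluator.

def check_expected(trace: dict, expected: dict) -> list[str]:
    """Return a list of failure messages; an empty list means pass."""
    failures = []
    called = [s["tool_name"] for s in trace["steps"] if s["type"] == "tool_call"]
    for tool in expected.get("tools_called", []):
        if tool not in called:
            failures.append(f"expected tool not called: {tool}")
    for tool in expected.get("tools_not_called", []):
        if tool in called:
            failures.append(f"forbidden tool called: {tool}")
    out = trace.get("output", "")
    for phrase in expected.get("output_contains", []):
        if phrase not in out:
            failures.append(f"output missing phrase: {phrase!r}")
    for phrase in expected.get("output_not_contains", []):
        if phrase in out:
            failures.append(f"output contains forbidden phrase: {phrase!r}")
    max_steps = expected.get("max_steps")
    if max_steps is not None and len(trace["steps"]) > max_steps:
        failures.append(f"took {len(trace['steps'])} steps (max {max_steps})")
    return failures
```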

Using model-based grading

name: helpful_response
input: "Explain quantum computing"

graders:
  - type: llm
    prompt: |
      Rate this response for helpfulness and accuracy.
      Response: {{ output }}

      Score from 1-5, where 5 is excellent.
    threshold: 4
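Conceptually, a model-based grader renders the prompt template with the agent's output, asks a judge model for a score, and compares it against `threshold`. A minimal sketch of that flow, with the judge stubbed out (in practice it would call your configured model via your own API key):

```python
# Sketch of the model-grading flow. `judge` is a stand-in callable here;
# the real grader would call the configured model (e.g. gpt-4o-mini).

def render_prompt(template: str, output: str) -> str:
    """Substitute the agent output into the {{ output }} placeholder."""
    return template.replace("{{ output }}", output)

def llm_grade(template: str, output: str, threshold: float, judge) -> bool:
    """Render the prompt, obtain a numeric score, apply the threshold."""
    score = judge(render_prompt(template, output))
    return score >= threshold
```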

Framework Integration

LangChain

Copy the reference tracer from examples/langchain_tracer.py to your project:

from langchain_tracer import EvaldeckTracer
from langchain.agents import AgentExecutor

tracer = EvaldeckTracer()
agent = AgentExecutor(...)

result = agent.invoke(
    {"input": "Book a flight"},
    config={"callbacks": [tracer]}
)

# Get trace for evaluation
trace = tracer.get_trace()

Manual trace construction

from evaldeck import Trace, Step, Evaluator

trace = Trace(
    input="Book a flight from NYC to LA",
    steps=[
        Step(
            type="tool_call",
            tool_name="search_flights",
            tool_args={"from": "NYC", "to": "LA"},
            tool_result=[{"flight": "AA123", "price": 299}]
        ),
        Step(
            type="tool_call",
            tool_name="book_flight",
            tool_args={"flight_id": "AA123"},
            tool_result={"confirmation": "ABC123"}
        ),
    ],
    output="Your flight AA123 is booked. Confirmation: ABC123",
    status="success"
)

evaluator = Evaluator()
result = evaluator.evaluate(trace, test_case)

CI/CD Integration

GitHub Actions

# .github/workflows/evaldeck.yaml
name: Agent Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - run: pip install evaldeck[all]

      - run: evaldeck run --output junit --output-file results.xml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: results.xml

Metrics

Metric                 Description
task_completion        Did the agent achieve the goal?
tool_correctness       Were the right tools selected?
argument_correctness   Were the correct arguments passed to tools?
step_efficiency        Did it complete without unnecessary steps?
tool_call_ordering     Were tools called in the right sequence?

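As a rough intuition for how such metrics can be scored, here are back-of-the-envelope formulas for two of them; the exact formulas Evaldeck uses may differ:

```python
# Illustrative metric formulas only; Evaldeck's actual scoring may differ.

def tool_correctness(called: list[str], expected: list[str]) -> float:
    """Fraction of expected tools that were actually called."""
    if not expected:
        return 1.0
    return sum(1 for t in expected if t in called) / len(expected)

def step_efficiency(actual_steps: int, minimal_steps: int) -> float:
    """Ratio of the minimal step count to steps taken, capped at 1.0."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, minimal_steps / actual_steps)
```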
Graders

Code-based (deterministic)

from evaldeck.graders import ContainsGrader, ToolCalledGrader

graders = [
    ContainsGrader(values=["confirmation"]),
    ToolCalledGrader(required=["book_flight"]),
]
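To make the pattern concrete, here is a plain-Python approximation of what deterministic graders like these check; the real `ContainsGrader`/`ToolCalledGrader` classes live in `evaldeck.graders` and may differ in detail:

```python
# Plain-Python approximation of deterministic graders, for illustration.
# Not the evaldeck.graders implementation.

class ContainsCheck:
    """Pass if every required phrase appears in the final output."""
    def __init__(self, values: list[str]):
        self.values = values

    def grade(self, trace: dict) -> bool:
        return all(v in trace["output"] for v in self.values)

class ToolCalledCheck:
    """Pass if every required tool appears among the trace's tool calls."""
    def __init__(self, required: list[str]):
        self.required = required

    def grade(self, trace: dict) -> bool:
        called = {s["tool_name"] for s in trace["steps"]
                  if s["type"] == "tool_call"}
        return all(t in called for t in self.required)
```

Because these graders are deterministic, they cost nothing per run and make good CI gates; model-based graders complement them for fuzzier criteria.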

Model-based (LLM-as-judge)

from evaldeck.graders import LLMGrader

grader = LLMGrader(
    prompt="Did the agent complete the booking? Answer: pass or fail",
    model="gpt-4o-mini",  # Uses your API key
)

Roadmap

  • Core evaluation engine
  • CLI interface
  • LangChain integration
  • CrewAI integration
  • AutoGen integration
  • VS Code extension
  • Historical result tracking
  • Team dashboard (cloud)

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Setup development environment
git clone https://github.com/tantra-run/evaldeck-py.git
cd evaldeck-py
pip install -e ".[dev]"
pre-commit install

# Run tests
pytest

# Run linting
ruff check .
mypy src/

License

Apache 2.0 - See LICENSE for details.
