Evaldeck
The evaluation framework for AI agents. Pytest for agents.
Evaldeck helps you answer one question: "Is my agent actually working?"
Unlike LLM evaluation tools that focus on input→output scoring, Evaldeck evaluates the entire agent execution—how it reasons, which tools it selects, and whether it achieves the goal.
Why Evaldeck?
- Agent-native: Evaluates multi-step traces, not just final outputs
- Framework-agnostic: Works with LangChain, CrewAI, AutoGen, or custom agents
- Developer-friendly: CLI-first, CI/CD ready, 5-minute setup
- Comprehensive metrics: Tool correctness, step efficiency, plan adherence, and more
- Flexible grading: Code-based, model-based (BYOK: bring your own key), or a mix of both
Installation
```bash
pip install evaldeck
```

With framework integrations:

```bash
pip install evaldeck[langchain]   # LangChain/LangGraph support
pip install evaldeck[openai]      # OpenAI model graders
pip install evaldeck[all]         # Everything
```
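On zsh, quote the extras so the brackets aren't expanded as a glob pattern: `pip install "evaldeck[all]"`.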
Quick Start
1. Initialize your project
```bash
evaldeck init
```

This creates:

```
evaldeck.yaml        # Configuration
tests/evals/         # Test directory
  example.yaml       # Example test case
```
2. Define test cases
```yaml
# tests/evals/booking.yaml
name: book_flight_basic
input: "Book me a flight from NYC to LA on March 15th"
expected:
  tools_called:
    - search_flights
    - book_flight
  output_contains:
    - "confirmation"
    - "March 15"
  max_steps: 5
```
3. Run evaluations
```bash
evaldeck run
```

Output:

```
Running 3 tests...

✓ book_flight_basic (1.2s)
✓ book_flight_roundtrip (2.1s)
✗ book_flight_with_preferences (1.8s)
  └─ FAIL at step 3: Wrong tool called
     Expected: search_flights_with_filters
     Got:      search_flights

Results: 2/3 passed (66.7%)
```
Configuration
```yaml
# evaldeck.yaml
version: 1

agent:
  module: my_agent
  function: run

test_dir: tests/evals

defaults:
  timeout: 30
  retries: 2

graders:
  llm:
    model: gpt-4o-mini
    # Uses OPENAI_API_KEY from environment

thresholds:
  min_pass_rate: 0.9
```
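The `agent` block tells Evaldeck how to import and call your agent. The exact call contract isn't spelled out in this README, but given the `Trace` API shown under "Manual trace construction" below, a plausible entry point looks like this (a sketch, assuming Evaldeck passes the test case's input string and evaluates the returned `Trace`):

```python
# my_agent.py -- hypothetical module matching the config above.
# Assumption: Evaldeck imports `my_agent`, calls `run(...)` with the test
# input, and evaluates the Trace it returns.
from evaldeck import Step, Trace

def run(user_input: str) -> Trace:
    # ... invoke your real agent or framework here ...
    return Trace(
        input=user_input,
        steps=[
            Step(
                type="tool_call",
                tool_name="search_flights",
                tool_args={"from": "NYC", "to": "LA"},
                tool_result=[{"flight": "AA123", "price": 299}],
            ),
        ],
        output="Found flight AA123 for $299.",
        status="success",
    )
```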
Test Case Format
Basic test case
```yaml
name: test_name
input: "User message to the agent"
expected:
  # What tools should be called?
  tools_called:
    - tool_name_1
    - tool_name_2

  # What tools should NOT be called?
  tools_not_called:
    - dangerous_tool

  # What should the output contain?
  output_contains:
    - "expected phrase"

  # What should the output NOT contain?
  output_not_contains:
    - "error"

  # Maximum steps allowed
  max_steps: 10

  # Must complete successfully?
  task_completed: true
```
Using model-based grading
```yaml
name: helpful_response
input: "Explain quantum computing"
graders:
  - type: llm
    prompt: |
      Rate this response for helpfulness and accuracy.
      Response: {{ output }}
      Score from 1-5, where 5 is excellent.
    threshold: 4
```
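Here `{{ output }}` is a template placeholder substituted with the agent's final output before the prompt is sent to the grading model, and `threshold: 4` means the test passes when the judge's 1-5 score is at least 4.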
Framework Integration
LangChain
Copy the reference tracer from examples/langchain_tracer.py to your project:
```python
from langchain_tracer import EvaldeckTracer
from langchain.agents import AgentExecutor

tracer = EvaldeckTracer()
agent = AgentExecutor(...)

result = agent.invoke(
    {"input": "Book a flight"},
    config={"callbacks": [tracer]},
)

# Get trace for evaluation
trace = tracer.get_trace()
```
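If you want to see the idea before copying the file, here is a minimal sketch of what such a tracer involves. This is not the contents of `examples/langchain_tracer.py`; it assumes LangChain's standard callback hooks and the `Trace`/`Step` API shown in the next section:

```python
# Minimal sketch of an Evaldeck-style LangChain tracer.
# NOT the reference implementation -- see examples/langchain_tracer.py.
from langchain_core.callbacks import BaseCallbackHandler

from evaldeck import Step, Trace


class MinimalTracer(BaseCallbackHandler):
    """Records each tool invocation as an Evaldeck Step."""

    def __init__(self) -> None:
        self.steps: list[Step] = []
        self._pending: tuple[str, str] | None = None

    def on_tool_start(self, serialized, input_str, **kwargs):
        # LangChain reports the tool's name and raw input when a tool starts.
        self._pending = (serialized.get("name", "unknown"), input_str)

    def on_tool_end(self, output, **kwargs):
        # Pair the tool's result with the call recorded in on_tool_start.
        name, raw_input = self._pending or ("unknown", "")
        self.steps.append(
            Step(
                type="tool_call",
                tool_name=name,
                tool_args={"input": raw_input},
                tool_result=output,
            )
        )

    def build_trace(self, input: str, output: str) -> Trace:
        return Trace(input=input, steps=self.steps, output=output, status="success")
```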
Manual trace construction
```python
from evaldeck import Trace, Step, Evaluator

trace = Trace(
    input="Book a flight from NYC to LA",
    steps=[
        Step(
            type="tool_call",
            tool_name="search_flights",
            tool_args={"from": "NYC", "to": "LA"},
            tool_result=[{"flight": "AA123", "price": 299}],
        ),
        Step(
            type="tool_call",
            tool_name="book_flight",
            tool_args={"flight_id": "AA123"},
            tool_result={"confirmation": "ABC123"},
        ),
    ],
    output="Your flight AA123 is booked. Confirmation: ABC123",
    status="success",
)

evaluator = Evaluator()
# test_case: the parsed test definition (see "Test Case Format" above)
result = evaluator.evaluate(trace, test_case)
```
CI/CD Integration
GitHub Actions
```yaml
# .github/workflows/evaldeck.yaml
name: Agent Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install evaldeck[all]
      - run: evaldeck run --output junit --output-file results.xml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - uses: mikepenz/action-junit-report@v4
        if: always()
        with:
          report_paths: results.xml
```
Metrics
| Metric | Description |
|---|---|
| `task_completion` | Did the agent achieve the goal? |
| `tool_correctness` | Were the right tools selected? |
| `argument_correctness` | Were correct arguments passed to tools? |
| `step_efficiency` | Did it complete without unnecessary steps? |
| `tool_call_ordering` | Were tools called in the right sequence? |
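As a concrete illustration of what a metric like `tool_correctness` measures, here is one plausible scoring function. The formula is an assumption for illustration, not necessarily how Evaldeck itself scores it:

```python
# Hypothetical scoring sketch -- illustrates the idea behind tool_correctness,
# not Evaldeck's actual implementation.
def tool_correctness(tools_called: list[str], tools_expected: list[str]) -> float:
    """Fraction of expected tools that actually appear in the trace."""
    if not tools_expected:
        return 1.0
    return sum(tool in tools_called for tool in tools_expected) / len(tools_expected)

# tool_correctness(["search_flights"], ["search_flights", "book_flight"]) == 0.5
```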
Graders
Code-based (deterministic)
```python
from evaldeck.graders import ContainsGrader, ToolCalledGrader

graders = [
    ContainsGrader(values=["confirmation"]),
    ToolCalledGrader(required=["book_flight"]),
]
```
Model-based (LLM-as-judge)
```python
from evaldeck.graders import LLMGrader

grader = LLMGrader(
    prompt="Did the agent complete the booking? Answer: pass or fail",
    model="gpt-4o-mini",  # Uses your API key
)
```
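Under the hood, an LLM-as-judge grader amounts to a single chat completion over the agent's output. A rough sketch of the general pattern, using the OpenAI client directly (this shows the technique, not Evaldeck's internals):

```python
# Illustrative LLM-as-judge pattern, not Evaldeck's internal code.
# Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def judge(agent_output: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Did the agent complete the booking? Answer: pass or fail\n\n"
                f"Agent output:\n{agent_output}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("pass")
```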
Roadmap
- Core evaluation engine
- CLI interface
- LangChain integration
- CrewAI integration
- AutoGen integration
- VS Code extension
- Historical result tracking
- Team dashboard (cloud)
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# Set up the development environment
git clone https://github.com/tantra-run/evaldeck-py.git
cd evaldeck-py
pip install -e ".[dev]"
pre-commit install

# Run tests
pytest

# Run linting
ruff check .
mypy src/
```
License
Apache 2.0 - See LICENSE for details.