space-evals

Build and validate agentic AI chat experiences.

A Python framework for testing and validating agentic AI chat experiences. Define conversation tests in YAML, run them against any LLM, and get structured results with pass/fail grading.

Why space-evals?

Existing eval platforms give you scores and metrics, but when a test fails, you're left reading raw JSON to figure out what went wrong. space-evals is built for multi-turn conversations from the ground up:

  • Scripted tests for rigid, turn-by-turn validation of deterministic flows
  • Task-based tests where an LLM plays the customer and talks to your target until a goal is reached
  • Event-driven architecture so a future UI can render conversations in real-time as tests execute
  • Plugin system for LLM providers, graders, and runners so the core stays lightweight and extensible

Installation

pip install space-evals

Then install the client plugin(s) for your LLM provider:

pip install space-evals-client-openai
pip install space-evals-client-anthropic

Quick Start

1. Create a config file

Create space-evals.yaml in your project:

clients:
  target:
    provider: openai
    model: gpt-4o

  customer:                  # needed for task-based tests
    provider: anthropic
    model: claude-sonnet-4-20250514

runner:
  max_concurrency: 10

output:
  dir: ./results

2. Write tests

Scripted -- every turn is predefined with per-turn grading:

id: greeting-flow
type: scripted
scenario: Validate greeting and handoff

turns:
  - user_message: Hi there
    graders:
      - type: llm_judge
        assertions:
          - type: Response is a friendly greeting
          - type: Response asks how it can help

  - user_message: I need help with my order
    graders:
      - type: tool_calls
        expected_calls:
          - tool_name: lookup_order
            arguments:
              source: conversation

Task-based -- an LLM simulates a customer working toward a goal:

id: order-status-task
type: task_based
scenario: User checks order status
goal: Find out the delivery date for order #12345
initial_user_message: Can you check on my order #12345?
max_turns: 5
end_conditions:
  - type: keyword_match
    keyword: delivery date
transcript_graders:
  - type: llm_judge
    assertions:
      - type: The bot provided the delivery date

3. Run

space-evals run ./tests/

Or programmatically:

import asyncio
from space_evals.engine import run_tests
from space_evals.events import EventBus
from space_evals.reporters.console import ConsoleReporter

event_bus = EventBus()
event_bus.subscribe(ConsoleReporter())

# my_target / my_customer are client instances from your installed
# client plugins (see Plugins below)
result = asyncio.run(run_tests(
    path="./tests/",
    target_client=my_target,
    customer_client=my_customer,
    event_bus=event_bus,
))

print(f"{result.passed}/{result.total} passed")

Test Spec Schema

Common fields

Field     Type                        Required  Description
id        string                      yes       Unique test identifier
type      "scripted" or "task_based"  yes       Determines which runner executes the test
scenario  string                      yes       Human-readable description

Scripted

Field                 Type    Required  Description
turns                 list    yes       Ordered conversation turns
turns[].user_message  string  yes       Message sent to the target
turns[].graders       list    no        Graders for this turn's response

Task-based

Field                 Type     Required  Description
goal                  string   yes       What the simulated customer is trying to accomplish
initial_user_message  string   yes       First message to the target
max_turns             integer  yes       Max turns before stopping
end_conditions        list     no        Conditions that stop the conversation early
transcript_graders    list     no        Graders run on the full transcript after the conversation

Built-in Graders

exact_match

- type: exact_match
  expected_response: "Hello! How can I help?"
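
Graders attach to a scripted turn's graders list (or to transcript_graders in task-based tests), so in context this reads:

turns:
  - user_message: Hi there
    graders:
      - type: exact_match
        expected_response: "Hello! How can I help?"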

llm_judge

- type: llm_judge
  assertions:
    - type: Response is helpful and on-topic
    - type: Response does not hallucinate

Requires a judge client configured in space-evals.yaml.
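
For example (the provider and model choices here are illustrative):

clients:
  judge:
    provider: openai
    model: gpt-4o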

tool_calls

- type: tool_calls
  expected_calls:
    - tool_name: get_weather
      arguments:
        location: "New York"

End Conditions (task-based)

tool_calls

Stops when the target makes a specific tool call.
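
The spec fields for this condition aren't documented above; a plausible sketch, assuming it takes the same tool_name field as the tool_calls grader:

end_conditions:
  - type: tool_calls
    tool_name: lookup_order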

keyword_match

Stops when a keyword appears in the target's response.

The conversation also stops if the customer LLM signals [GOAL_COMPLETE].

Configuration

Field                   Type     Default           Description
clients.target          object   --                LLM being tested (required)
clients.customer        object   --                Simulated user for task-based tests
clients.judge           object   --                LLM for llm_judge graders
clients.*.provider      string   --                Matches an installed client plugin
clients.*.model         string   --                Model identifier
clients.*.api_key_env   string   provider default  Env var for API key
clients.*.params        object   {}                Extra provider parameters
runner.max_concurrency  integer  10                Max parallel tests
output.dir              string   ./results         Result output directory
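
For example, pointing a client at a custom key variable and forwarding extra provider parameters (temperature is just one illustration of what params might carry):

clients:
  target:
    provider: openai
    model: gpt-4o
    api_key_env: MY_OPENAI_KEY
    params:
      temperature: 0.0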

Plugins

space-evals uses a plugin architecture. The core has no LLM provider dependencies -- you install only what you need.

Official client plugins:

Package                       Provider   Requires
space-evals-client-openai     openai     OPENAI_API_KEY
space-evals-client-anthropic  anthropic  ANTHROPIC_API_KEY

Plugins are discovered automatically via Python entry points. Install a plugin, use its provider name in your config, done.

Building your own plugin

Client plugin:

from space_evals.clients.base import BaseClient, ClientResponse, Message, register_client

@register_client("my_provider")
class MyClient(BaseClient):
    def __init__(self, model: str, api_key_env: str | None = None, **params):
        # Store the model name and extra params; resolve the API key from
        # the api_key_env environment variable.
        ...

    async def send(self, messages: list[Message], system_prompt: str = "") -> ClientResponse:
        # Call the provider's API and wrap its reply in a ClientResponse.
        ...

Then register it as an entry point so space-evals can discover it:

# pyproject.toml
[project.entry-points."space_evals.clients"]
my_provider = "my_package:MyClient"
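
Once installed, the name passed to @register_client is what goes in the provider field of space-evals.yaml (model value illustrative):

clients:
  target:
    provider: my_provider
    model: some-model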

Grader plugin:

from space_evals.graders.base import BaseGrader, register_grader
from space_evals.models.results import GraderResult

@register_grader("my_grader")
class MyGrader(BaseGrader):
    async def grade(self, turn, spec) -> GraderResult:
        # Evaluate the turn (or transcript) against the spec and return
        # a GraderResult with the pass/fail outcome.
        ...

And the matching entry point:

# pyproject.toml
[project.entry-points."space_evals.graders"]
my_grader = "my_package:MyGrader"
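
Then reference it by name wherever a grader list appears in a test spec:

graders:
  - type: my_grader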

Events

The framework emits events during execution for progress reporting and future UI integration:

Event              When
RUN_STARTED        Test suite begins
TEST_STARTED       Individual test begins
TURN_STARTED       Conversation turn begins
RESPONSE_RECEIVED  LLM responds
GRADER_COMPLETED   Grader finishes
TEST_COMPLETED     Individual test finishes
RUN_COMPLETED      All tests complete

Subscribe a listener to receive them:

from space_evals.events import EventBus, Event

class MyListener:
    def on_event(self, event: Event) -> None:
        print(f"{event.event_type.value}: {event.data}")

bus = EventBus()
bus.subscribe(MyListener())
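
A listener can also filter on a single event type; this sketch assumes the EventType member names match the table above, and the shape of event.data is not documented here:

from space_evals.events import Event

class CompletionLogger:
    # Print a line only when an individual test finishes.
    def on_event(self, event: Event) -> None:
        if event.event_type.name == "TEST_COMPLETED":
            print(f"test finished: {event.data}")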

Result Persistence

Results are saved as JSON when output.dir is configured:

results/<run_id>/<test_id>/result.json

Each file contains the full transcript with per-turn grader results and timing data.
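
Because the layout is predictable, a few lines of Python can summarize a run; note that the "passed" key below is an assumed field name, not taken from a documented schema:

import json
from pathlib import Path

# Walk results/<run_id>/<test_id>/result.json and print one line per test.
for result_file in Path("results").glob("*/*/result.json"):
    data = json.loads(result_file.read_text())
    print(result_file.parent.name, data.get("passed", "<unknown>"))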

License

MIT
