# space-evals

Build and validate agentic AI chat experiences.
A Python framework for testing and validating agentic AI chat experiences. Define conversation tests in YAML, run them against any LLM, and get structured results with pass/fail grading.
## Why space-evals?

Existing eval platforms give you scores and metrics, but when a test fails you're left reading raw JSON to figure out what went wrong. space-evals is built for multi-turn conversations from the ground up:
- Scripted tests for rigid, turn-by-turn validation of deterministic flows
- Task-based tests where an LLM plays the customer and talks to your target until a goal is reached
- Event-driven architecture so a future UI can render conversations in real-time as tests execute
- Plugin system for LLM providers, graders, and runners so the core stays lightweight and extensible
## Installation

```bash
pip install space-evals
```

Then install the client plugin(s) for your LLM provider:

```bash
pip install space-evals-client-openai
pip install space-evals-client-anthropic
```
## Quick Start

### 1. Create a config file

Create `space-evals.yaml` in your project:

```yaml
clients:
  target:
    provider: openai
    model: gpt-4o
  customer:  # needed for task-based tests
    provider: anthropic
    model: claude-sonnet-4-20250514

runner:
  max_concurrency: 10

output:
  dir: ./results
```
### 2. Write tests

**Scripted** -- every turn is predefined, with per-turn grading:

```yaml
id: greeting-flow
type: scripted
scenario: Validate greeting and handoff
turns:
  - user_message: Hi there
    graders:
      - type: llm_judge
        assertions:
          - type: Response is a friendly greeting
          - type: Response asks how it can help
  - user_message: I need help with my order
    graders:
      - type: tool_calls
        expected_calls:
          - tool_name: lookup_order
            arguments:
              source: conversation
```
**Task-based** -- an LLM simulates a customer working toward a goal (note the quotes: an unquoted `#` would start a YAML comment):

```yaml
id: order-status-task
type: task_based
scenario: User checks order status
goal: "Find out the delivery date for order #12345"
initial_user_message: "Can you check on my order #12345?"
max_turns: 5
end_conditions:
  - type: keyword_match
    keyword: delivery date
transcript_graders:
  - type: llm_judge
    assertions:
      - type: The bot provided the delivery date
```
### 3. Run

```bash
space-evals run ./tests/
```
Or programmatically:

```python
import asyncio

from space_evals.engine import run_tests
from space_evals.events import EventBus
from space_evals.reporters.console import ConsoleReporter

event_bus = EventBus()
event_bus.subscribe(ConsoleReporter())

result = asyncio.run(run_tests(
    path="./tests/",
    target_client=my_target,
    customer_client=my_customer,
    event_bus=event_bus,
))
print(f"{result.passed}/{result.total} passed")
## Test Spec Schema

### Common fields

| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | yes | Unique test identifier |
| `type` | `"scripted"` or `"task_based"` | yes | Determines which runner executes the test |
| `scenario` | string | yes | Human-readable description |

### Scripted

| Field | Type | Required | Description |
|---|---|---|---|
| `turns` | list | yes | Ordered conversation turns |
| `turns[].user_message` | string | yes | Message sent to the target |
| `turns[].graders` | list | no | Graders for this turn's response |

### Task-based

| Field | Type | Required | Description |
|---|---|---|---|
| `goal` | string | yes | What the simulated customer is trying to accomplish |
| `initial_user_message` | string | yes | First message sent to the target |
| `max_turns` | integer | yes | Maximum number of turns before stopping |
| `end_conditions` | list | no | Conditions that stop the conversation early |
| `transcript_graders` | list | no | Graders run on the full transcript after the conversation ends |
## Built-in Graders

### exact_match

```yaml
- type: exact_match
  expected_response: "Hello! How can I help?"
```

### llm_judge

```yaml
- type: llm_judge
  assertions:
    - type: Response is helpful and on-topic
    - type: Response does not hallucinate
```

Requires a `judge` client configured in `space-evals.yaml`.

### tool_calls

```yaml
- type: tool_calls
  expected_calls:
    - tool_name: get_weather
      arguments:
        location: "New York"
```
## End Conditions (task-based)

### tool_calls

Stops the conversation when the target makes a specific tool call.

### keyword_match

Stops the conversation when a keyword appears in the target's response.

The conversation also stops early if the customer LLM signals `[GOAL_COMPLETE]`.
## Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| `clients.target` | object | -- | LLM being tested (required) |
| `clients.customer` | object | -- | Simulated user for task-based tests |
| `clients.judge` | object | -- | LLM used by `llm_judge` graders |
| `clients.*.provider` | string | -- | Matches an installed client plugin |
| `clients.*.model` | string | -- | Model identifier |
| `clients.*.api_key_env` | string | provider default | Env var holding the API key |
| `clients.*.params` | object | `{}` | Extra provider parameters |
| `runner.max_concurrency` | integer | `10` | Maximum number of tests run in parallel |
| `output.dir` | string | `./results` | Directory for result output |
## Plugins

space-evals uses a plugin architecture. The core has no LLM provider dependencies -- you install only what you need.

Official client plugins:

| Package | Provider | Requires |
|---|---|---|
| `space-evals-client-openai` | `openai` | `OPENAI_API_KEY` |
| `space-evals-client-anthropic` | `anthropic` | `ANTHROPIC_API_KEY` |

Plugins are discovered automatically via Python entry points. Install a plugin, use its provider name in your config, done.
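Entry-point discovery is standard library machinery, and space-evals does it for you at startup. A sketch of the mechanism, shown only to demystify it (the group name comes from the pyproject snippets below; the rest is illustrative):

```python
# Illustration of entry-point discovery (Python 3.10+ standard library).
from importlib.metadata import entry_points

for ep in entry_points(group="space_evals.clients"):
    client_cls = ep.load()  # imports the plugin's registered client class
    print(ep.name, "->", client_cls)
```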
### Building your own plugin

**Client plugin:**

```python
from space_evals.clients.base import BaseClient, ClientResponse, Message, register_client

@register_client("my_provider")
class MyClient(BaseClient):
    def __init__(self, model: str, api_key_env: str | None = None, **params):
        ...

    async def send(self, messages: list[Message], system_prompt: str = "") -> ClientResponse:
        ...
```

Then register it in your package's `pyproject.toml`:

```toml
[project.entry-points."space_evals.clients"]
my_provider = "my_package:MyClient"
```
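As a worked example, here is a sketch of an offline "echo" client, useful for dry-running test specs without network calls. The `ClientResponse(content=...)` field and the `Message.content` attribute are assumptions; check `space_evals.clients.base` for the real signatures:

```python
from space_evals.clients.base import BaseClient, ClientResponse, Message, register_client

@register_client("echo")
class EchoClient(BaseClient):
    """Echoes the last user message back; handy for offline dry runs."""

    def __init__(self, model: str, api_key_env: str | None = None, **params):
        self.model = model

    async def send(self, messages: list[Message], system_prompt: str = "") -> ClientResponse:
        # `content` on Message and ClientResponse is an assumed field name.
        last = messages[-1].content if messages else ""
        return ClientResponse(content=f"echo: {last}")
```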
**Grader plugin:**

```python
from space_evals.graders.base import BaseGrader, register_grader
from space_evals.models.results import GraderResult

@register_grader("my_grader")
class MyGrader(BaseGrader):
    async def grade(self, turn, spec):
        ...
```

```toml
[project.entry-points."space_evals.graders"]
my_grader = "my_package:MyGrader"
```
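For instance, a sketch of a deterministic keyword grader. The shapes of `turn`, `spec`, and `GraderResult` used below (`turn.response`, `spec["keyword"]`, `passed`/`reason`) are assumptions about the base interfaces:

```python
from space_evals.graders.base import BaseGrader, register_grader
from space_evals.models.results import GraderResult

@register_grader("keyword")
class KeywordGrader(BaseGrader):
    """Passes when a required keyword appears in the turn's response."""

    async def grade(self, turn, spec):
        keyword = spec.get("keyword", "")              # assumed spec shape
        response = str(getattr(turn, "response", ""))  # assumed turn shape
        found = keyword.lower() in response.lower()
        return GraderResult(                           # assumed result fields
            passed=found,
            reason=f"keyword {keyword!r} {'found' if found else 'missing'}",
        )
```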
## Events

The framework emits events during execution, for progress reporting and future UI integration:

| Event | When |
|---|---|
| `RUN_STARTED` | Test suite begins |
| `TEST_STARTED` | Individual test begins |
| `TURN_STARTED` | Conversation turn begins |
| `RESPONSE_RECEIVED` | LLM responds |
| `GRADER_COMPLETED` | Grader finishes |
| `TEST_COMPLETED` | Individual test finishes |
| `RUN_COMPLETED` | All tests complete |

Subscribe a listener to the event bus:

```python
from space_evals.events import Event, EventBus

class MyListener:
    def on_event(self, event: Event) -> None:
        print(f"{event.event_type.value}: {event.data}")

bus = EventBus()
bus.subscribe(MyListener())
```
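A listener can be anything with an `on_event` method. For example, a sketch that logs every event to a JSONL file for later replay, assuming `event.data` is JSON-serializable (the docs don't guarantee this, hence `default=str`):

```python
import json

from space_evals.events import Event, EventBus

class JsonlLogger:
    """Appends each event as one JSON line; useful for replaying a run."""

    def __init__(self, path: str = "events.jsonl"):
        self.path = path

    def on_event(self, event: Event) -> None:
        record = {"type": event.event_type.value, "data": event.data}
        with open(self.path, "a") as f:
            f.write(json.dumps(record, default=str) + "\n")

bus = EventBus()
bus.subscribe(JsonlLogger())
```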
## Result Persistence

Results are saved as JSON when `output.dir` is configured:

```text
results/<run_id>/<test_id>/result.json
```

Each file contains the full transcript, with per-turn grader results and timing data.
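Since the layout is predictable, post-run analysis is a glob away. A sketch, assuming each `result.json` carries a top-level `passed` boolean (the exact schema isn't documented here):

```python
import json
from pathlib import Path

# results/<run_id>/<test_id>/result.json -- layout from the docs above.
for result_file in sorted(Path("./results").glob("*/*/result.json")):
    data = json.loads(result_file.read_text())
    status = "PASS" if data.get("passed") else "FAIL"  # `passed` key is assumed
    print(f"{status}  {result_file.parent.name}")
```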
## License
MIT