space-evals

Build and validate agentic AI chat experiences.

A Python framework for testing and validating agentic AI chat experiences. Define conversation tests in YAML, run them against any LLM, and get structured results with pass/fail grading.

Why space-evals?

Existing eval platforms give you scores and metrics, but when a test fails, you're left reading raw JSON to figure out what went wrong. space-evals is built for multi-turn conversations from the ground up:

  • Scripted tests for rigid, turn-by-turn validation of deterministic flows
  • Task-based tests where an LLM plays the customer and talks to your target until a goal is reached
  • Event-driven architecture so a future UI can render conversations in real-time as tests execute
  • Plugin system for LLM providers, graders, and runners so the core stays lightweight and extensible

Installation

pip install space-evals

Then install the client plugin(s) for your LLM provider:

pip install space-evals-client-openai
pip install space-evals-client-anthropic

Quick Start

1. Create a config file

Create space-evals.yaml in your project:

clients:
  target:
    provider: openai
    model: gpt-4o

  customer:                  # needed for task-based tests
    provider: anthropic
    model: claude-sonnet-4-20250514

runner:
  max_concurrency: 10

output:
  dir: ./results

2. Write tests

Scripted -- every turn is predefined with per-turn grading:

id: greeting-flow
type: scripted
scenario: Validate greeting and handoff

turns:
  - user_message: Hi there
    graders:
      - type: llm_judge
        assertions:
          - type: Response is a friendly greeting
          - type: Response asks how it can help

  - user_message: I need help with my order
    graders:
      - type: tool_calls
        expected_calls:
          - tool_name: lookup_order
            arguments:
              source: conversation

Task-based -- an LLM simulates a customer working toward a goal:

id: order-status-task
type: task_based
scenario: User checks order status
goal: Find out the delivery date for order #12345
initial_user_message: Can you check on my order #12345?
max_turns: 5
end_conditions:
  - type: keyword_match
    keyword: delivery date
transcript_graders:
  - type: llm_judge
    assertions:
      - type: The bot provided the delivery date

3. Run

space-evals run ./tests/

Or programmatically:

import asyncio
from space_evals.engine import run_tests
from space_evals.events import EventBus
from space_evals.reporters.console import ConsoleReporter

event_bus = EventBus()
event_bus.subscribe(ConsoleReporter())

# my_target / my_customer are client instances from your installed
# client plugins (see Plugins below)
result = asyncio.run(run_tests(
    path="./tests/",
    target_client=my_target,
    customer_client=my_customer,
    event_bus=event_bus,
))

print(f"{result.passed}/{result.total} passed")

Test Spec Schema

Common fields

Field     Type                        Required  Description
id        string                      yes       Unique test identifier
type      "scripted" or "task_based"  yes       Determines which runner executes the test
scenario  string                      yes       Human-readable description

Scripted

Field                 Type    Required  Description
turns                 list    yes       Ordered conversation turns
turns[].user_message  string  yes       Message sent to the target
turns[].graders       list    no        Graders for this turn's response

Task-based

Field                 Type     Required  Description
goal                  string   yes       What the simulated customer is trying to accomplish
initial_user_message  string   yes       First message to the target
max_turns             integer  yes       Max turns before stopping
end_conditions        list     no        Conditions that stop the conversation early
transcript_graders    list     no        Graders run on the full transcript after the conversation

Built-in Graders

exact_match

- type: exact_match
  expected_response: "Hello! How can I help?"
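
Graders attach to a scripted turn's graders list (or to transcript_graders in task-based tests), so in context this reads:

turns:
  - user_message: Hi there
    graders:
      - type: exact_match
        expected_response: "Hello! How can I help?"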

llm_judge

- type: llm_judge
  assertions:
    - type: Response is helpful and on-topic
    - type: Response does not hallucinate

Requires a judge client configured in space-evals.yaml.
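
For example (the provider and model choices here are illustrative):

clients:
  judge:
    provider: openai
    model: gpt-4o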

tool_calls

- type: tool_calls
  expected_calls:
    - tool_name: get_weather
      arguments:
        location: "New York"

End Conditions (task-based)

tool_calls

Stops when the target makes a specific tool call.
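
The spec fields for this condition aren't documented above; a plausible sketch, assuming it takes the same tool_name field as the tool_calls grader:

end_conditions:
  - type: tool_calls
    tool_name: lookup_order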

keyword_match

Stops when a keyword appears in the target's response.

The conversation also stops if the customer LLM signals [GOAL_COMPLETE].

Configuration

Field                   Type     Default           Description
clients.target          object   --                LLM being tested (required)
clients.customer        object   --                Simulated user for task-based tests
clients.judge           object   --                LLM for llm_judge graders
clients.*.provider      string   --                Matches an installed client plugin
clients.*.model         string   --                Model identifier
clients.*.api_key_env   string   provider default  Env var for API key
clients.*.params        object   {}                Extra provider parameters
runner.max_concurrency  integer  10                Max parallel tests
output.dir              string   ./results         Result output directory
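
For example, pointing a client at a custom key variable and forwarding extra provider parameters (temperature is just one illustration of what params might carry):

clients:
  target:
    provider: openai
    model: gpt-4o
    api_key_env: MY_OPENAI_KEY
    params:
      temperature: 0.0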

Plugins

space-evals uses a plugin architecture. The core has no LLM provider dependencies -- you install only what you need.

Official client plugins:

Package                       Provider   Requires
space-evals-client-openai     openai     OPENAI_API_KEY
space-evals-client-anthropic  anthropic  ANTHROPIC_API_KEY

Plugins are discovered automatically via Python entry points. Install a plugin, use its provider name in your config, done.

Building your own plugin

Client plugin:

from space_evals.clients.base import BaseClient, ClientResponse, Message, register_client

@register_client("my_provider")
class MyClient(BaseClient):
    def __init__(self, model: str, api_key_env: str | None = None, **params):
        # Store the model name and extra params; resolve the API key from
        # the api_key_env environment variable.
        ...

    async def send(self, messages: list[Message], system_prompt: str = "") -> ClientResponse:
        # Call the provider's API and wrap its reply in a ClientResponse.
        ...

Then register it as an entry point so space-evals can discover it:

# pyproject.toml
[project.entry-points."space_evals.clients"]
my_provider = "my_package:MyClient"
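
Once installed, the name passed to @register_client is what goes in the provider field of space-evals.yaml (model value illustrative):

clients:
  target:
    provider: my_provider
    model: some-model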

Grader plugin:

from space_evals.graders.base import BaseGrader, register_grader
from space_evals.models.results import GraderResult

@register_grader("my_grader")
class MyGrader(BaseGrader):
    async def grade(self, turn, spec) -> GraderResult:
        # Evaluate the turn (or transcript) against the spec and return
        # a GraderResult with the pass/fail outcome.
        ...

And the matching entry point:

# pyproject.toml
[project.entry-points."space_evals.graders"]
my_grader = "my_package:MyGrader"
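
Then reference it by name wherever a grader list appears in a test spec:

graders:
  - type: my_grader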

Events

The framework emits events during execution for progress reporting and future UI integration:

Event              When
RUN_STARTED        Test suite begins
TEST_STARTED       Individual test begins
TURN_STARTED       Conversation turn begins
RESPONSE_RECEIVED  LLM responds
GRADER_COMPLETED   Grader finishes
TEST_COMPLETED     Individual test finishes
RUN_COMPLETED      All tests complete

Subscribe a listener to receive them:

from space_evals.events import EventBus, Event

class MyListener:
    def on_event(self, event: Event) -> None:
        print(f"{event.event_type.value}: {event.data}")

bus = EventBus()
bus.subscribe(MyListener())
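
A listener can also filter on a single event type; this sketch assumes the EventType member names match the table above, and the shape of event.data is not documented here:

from space_evals.events import Event

class CompletionLogger:
    # Print a line only when an individual test finishes.
    def on_event(self, event: Event) -> None:
        if event.event_type.name == "TEST_COMPLETED":
            print(f"test finished: {event.data}")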

Result Persistence

Results are saved as JSON when output.dir is configured:

results/<run_id>/<test_id>/result.json

Each file contains the full transcript with per-turn grader results and timing data.
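
Because the layout is predictable, a few lines of Python can summarize a run; note that the "passed" key below is an assumed field name, not taken from a documented schema:

import json
from pathlib import Path

# Walk results/<run_id>/<test_id>/result.json and print one line per test.
for result_file in Path("results").glob("*/*/result.json"):
    data = json.loads(result_file.read_text())
    print(result_file.parent.name, data.get("passed", "<unknown>"))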

License

MIT
