Evaluateur
Synthetic evals for agents.
Generate diverse, realistic test queries for LLM applications. Define your evaluation space as dimensions, sample combinations, and convert them to natural language -- with optional goal-guided optimization to target specific failure modes.
Why
Evaluations require test data. Early on, you don't have any.
If you ask an LLM to "generate 50 test queries," you get repetitive inputs. The model gravitates toward the same phrasing, the same scenarios, the same level of complexity. Manual test cases fare no better: they reflect what the author thought to test, not what actually breaks.
Evaluateur solves this with structure. You define dimensions -- the axes along which your system's behavior varies -- and the library generates combinations that cover the space systematically, including edge cases that neither a human nor an LLM would produce on its own.
The approach follows the dimensions → tuples → queries pattern described in Hamel Husain's evaluation FAQ.
How it works
```
  Dimensions        Options       Tuples               Queries
                                  (combinations)       (natural language)
┌────────────┐    ┌────────┐    ┌───────────────┐    ┌──────────────────────┐
│ payer      │───▶│ Cigna  │    │ Cigna, adult, │    │ "Does Cigna cover    │
│ age        │    │ Aetna  │───▶│ off-label, TX │───▶│ off-label Dupixent   │
│ complexity │    │ BCBS   │    │               │    │ for adults in TX?"   │
│ geography  │    │ ...    │    │ ...           │    │ ...                  │
└────────────┘    └────────┘    └───────────────┘    └──────────────────────┘
```
- Dimensions → Options. Define a Pydantic model with the axes of variation. The LLM generates diverse values for each field.
- Options → Tuples. Sample combinations. The default cross-product strategy uses Farthest Point Sampling to maximize diversity across dimensions. An AI strategy is also available for semantically coherent combinations.
- Tuples → Queries. Each combination is converted into a natural language query, ready to feed to your agent.
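The sampling step in the middle can be sketched in plain Python. This is an illustration of the idea, not evaluateur's implementation: build the cross product of options, then greedily pick tuples whose Hamming distance (number of differing dimensions) to the already-chosen set is largest.

```python
import itertools
import random


def hamming(a: tuple, b: tuple) -> int:
    """Number of dimensions on which two tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def farthest_point_sample(candidates: list[tuple], k: int, seed: int = 0) -> list[tuple]:
    """Greedily pick k tuples, each maximizing distance to those already chosen."""
    rng = random.Random(seed)
    chosen = [rng.choice(candidates)]
    while len(chosen) < k:
        # Pick the candidate whose nearest chosen tuple is farthest away.
        best = max(candidates, key=lambda c: min(hamming(c, s) for s in chosen))
        chosen.append(best)
    return chosen


options = {
    "payer": ["Cigna", "Aetna", "BCBS"],
    "age": ["adult", "pediatric"],
    "complexity": ["off-label", "comorbidities"],
}
grid = list(itertools.product(*options.values()))
sample = farthest_point_sample(grid, k=4)
```

Because each pick maximizes the minimum distance to the chosen set, the sample spreads across the space instead of clustering around a few similar combinations.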
Installation
```
uv add evaluateur
```
or with pip:
```
pip install evaluateur
```
Quick start
```python
import asyncio

from pydantic import BaseModel, Field

from evaluateur import Evaluator, TupleStrategy


class Query(BaseModel):
    payer: str = Field(..., description="insurance payer, like Cigna")
    age: str = Field(..., description="patient age category, like 'adult' or 'pediatric'")
    complexity: str = Field(
        ...,
        description="query complexity, like 'off-label', 'comorbidities', etc",
    )
    geography: str = Field(..., description="geography indicator, like a state or zip code")


async def main() -> None:
    evaluator = Evaluator(Query)
    async for q in evaluator.run(
        tuple_strategy=TupleStrategy.CROSS_PRODUCT,
        tuple_count=50,
        seed=0,
        instructions="Focus on common US payers and edge-case clinical scenarios.",
    ):
        print(q.source_tuple.model_dump(), "->", q.query)


asyncio.run(main())
```
The run() method handles the full pipeline: generating options, sampling tuples, and converting each tuple to a natural language query. For step-by-step control, call evaluator.options(), evaluator.tuples(), and evaluator.queries() separately.
Goal-guided optimization
The first batch of queries gives you a baseline. After running them through your agent and analyzing the failures, you can feed those observations back as goals to bias the next round of query generation toward specific failure modes.
Goals can be categorized using the CTO framework:
- Components -- system internals: retrieval freshness, citation accuracy, tool reliability.
- Trajectories -- decision sequences: tool selection order, conflict resolution, retry behavior.
- Outcomes -- what the user sees: output format, actionability, appropriate uncertainty.
Pass goals as free-form text. Structured lists with Components:, Trajectories:, and Outcomes: headers are parsed directly without an LLM call:
```python
import asyncio

from pydantic import BaseModel, Field

from evaluateur import Evaluator


class Query(BaseModel):
    payer: str = Field(..., description="insurance payer, like Cigna")
    age: str = Field(..., description="patient age category")
    complexity: str = Field(..., description="query complexity, like 'off-label'")
    geography: str = Field(..., description="geography indicator, like a state")


async def main() -> None:
    evaluator = Evaluator(Query)
    async for q in evaluator.run(
        seed=0,
        goals="""
        Components:
        - The system must cite current policy versions; stale guidelines are a compliance risk
        - Every clinical claim needs a traceable source from retrieved documents
        Trajectories:
        - Prefer formulary API over generic web search for drug lists
        - Surface conflicts between sources instead of silently picking one
        Outcomes:
        - Produce structured checklists that reviewers can sign off on
        - Flag uncertainty instead of guessing
        """,
        instructions="Write realistic questions from a doctor's perspective.",
    ):
        print(f"[{q.metadata.goal_focus}] {q.query}")


asyncio.run(main())
```
Each generated query targets a single goal by default (cycling through them), so one run produces a mix of stress-test styles. You can also pass goals as a GoalSpec with structured Goal objects for programmatic control. See the custom goals guide for goal modes (sample, cycle, full) and advanced usage.
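To illustrate why the Components:/Trajectories:/Outcomes: header format needs no LLM call, plain string handling is enough. A minimal parser sketch (my own illustration, not the library's code): it groups each "- " bullet under the most recent header line.

```python
def parse_goals(text: str) -> dict[str, list[str]]:
    """Group '- ' bullets under their most recent 'Header:' line."""
    goals: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(":"):
            current = line[:-1]
            goals[current] = []
        elif line.startswith("- ") and current is not None:
            goals[current].append(line[2:])
    return goals


parsed = parse_goals("""
Components:
- Cite current policy versions
Trajectories:
- Prefer the formulary API over web search
Outcomes:
- Flag uncertainty instead of guessing
""")
# parsed["Trajectories"] == ["Prefer the formulary API over web search"]
```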
The iteration loop
The core workflow is a feedback loop:
- Generate queries across your dimensions.
- Run them through your agent and collect traces.
- Analyze failures -- write freeform notes about what went wrong.
- Turn notes into goals -- group observations into Components, Trajectories, and Outcomes.
- Generate again with those goals to stress-test the failure modes you found.
Each cycle tightens coverage. The first round catches obvious failures. By the third, you're stress-testing edge cases that real traffic won't hit for months. When production traffic arrives, feed those traces back into the loop.
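In code, one cycle of the loop has roughly this shape. Here run_agent, analyze_failures, and notes_to_goals are hypothetical placeholders for your own agent and analysis, not evaluateur APIs; only the overall data flow is the point.

```python
def run_agent(query: str) -> str:
    """Placeholder: call your agent and return its trace or answer."""
    return f"trace for: {query}"


def analyze_failures(traces: list[str]) -> list[str]:
    """Placeholder: your freeform notes about what went wrong."""
    return ["stale policy citations", "guessed instead of flagging uncertainty"]


def notes_to_goals(notes: list[str]) -> str:
    """Group freeform notes under a CTO header for the next round."""
    return "Components:\n" + "\n".join(f"- {n}" for n in notes)


queries = ["Does Cigna cover off-label Dupixent for adults in TX?"]
traces = [run_agent(q) for q in queries]
goals = notes_to_goals(analyze_failures(traces))
# `goals` is the text you feed back into the next generation round.
```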
Features
- Pydantic-based dimensions. Define your evaluation space with standard Pydantic models. Field descriptions guide option generation.
- Farthest Point Sampling. When sampling from the cross product, tuples are selected to maximize Hamming distance, ensuring broad coverage instead of clustered combinations.
- Seeded, reproducible sampling. Set seed= to get deterministic results. Change the seed for a different subset.
- Goal-guided generation. Bias queries toward specific failure modes using the CTO framework or custom categories.
- Async streaming. All generators yield results as async iterators for memory-efficient processing.
- Provider-agnostic. Works with any LLM provider supported by Instructor -- OpenAI, Anthropic, and others.
- Traceability. Every generated query links back to its source tuple via q.source_tuple, making it easy to understand why a query was generated.
- Mixed options. Fixed lists (state: list[str] = ["CA", "NY", "TX"]) coexist with LLM-generated options in the same model.
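The mixed-options bullet can be pictured with a standard Pydantic model: a field with a fixed list of values sits next to a described field whose options would be LLM-generated. This is a sketch of the model shape only; how evaluateur consumes each kind of field is up to the library.

```python
from pydantic import BaseModel, Field


class Query(BaseModel):
    # Fixed options: the listed values are used as-is.
    state: list[str] = ["CA", "NY", "TX"]
    # LLM-generated options: the description guides generation.
    payer: str = Field(..., description="insurance payer, like Cigna")


q = Query(payer="Cigna")
```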
Configuration
By default, evaluateur reads the EVALUATEUR_MODEL environment variable
(defaults to openai/gpt-4.1-mini). You can override this per evaluator:
```python
from evaluateur import Evaluator

evaluator = Evaluator(Query, llm="anthropic/claude-haiku-4-5")
```
For advanced setups (observability wrappers, custom providers), pass a pre-configured Instructor client directly:
```python
import instructor
from openai import AsyncOpenAI

from evaluateur import Evaluator

client = instructor.from_openai(AsyncOpenAI())
evaluator = Evaluator(Query, client=client, model_name="gpt-4.1-mini")
```
See the provider configuration guide for details.
Documentation
Full documentation is available at evaluateur.aptford.com.
- Getting started -- installation and environment setup
- Dimensions, tuples, and queries -- core concepts
- Goal-guided optimization -- the CTO framework
- Walkthrough notebook -- end-to-end example
- API reference -- full API docs