
Evaluateur

Synthetic evaluation helper for LLM applications, built around the dimensions → tuples → queries flow described in Hamel Husain's FAQ.

Installation

The project is packaged as a normal Python library. With uv:

uv add evaluateur
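
The package is published on PyPI, so plain pip works as well:

pip install evaluateur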

Basic usage

Define a Pydantic model that represents the dimensions of your evaluation space, then use the Evaluator to generate options and queries:

import asyncio
from pydantic import BaseModel, Field

from evaluateur import Evaluator, QueryMode, TupleStrategy
from evaluateur.configs import QueryConfig, TupleConfig


class Query(BaseModel):
    payer: str = Field(..., description="insurance payer, like Cigna")
    age: str = Field(..., description="patient age category, like 'adult' or 'pediatric'")
    complexity: str = Field(
        ...,
        description="complexity of the query to account for the edge cases, like 'off-label', 'comorbidities', etc",
    )
    geography: str = Field(..., description="geography indicator, like a zip code, specific state or county")


async def main() -> None:
    evaluator = Evaluator(Query, context="Healthcare prior authorization")

    # Step 1: generate options for each dimension using Instructor
    options = await evaluator.options(
        config=TupleConfig(
            options_instructions="Focus on common US payers and edge-case clinical scenarios.",
            options_per_field=5,
        ),
    )

    # Step 2: stream tuples -> natural language queries
    async for q in evaluator.run(
        options=options,
        tuple_config=TupleConfig(strategy=TupleStrategy.CROSS_PRODUCT, count=50, seed=0),
        query_config=QueryConfig(
            mode=QueryMode.INSTRUCTOR,
            instructions="""
Write realistic user questions.
Keep them short but specific.
Don't include any extra explanation outside the query itself.
""",
        ),
    ):
        print(q.source_tuple.values, "->", q.query)


asyncio.run(main())

The evaluator reads API credentials from environment variables (for example OPENAI_API_KEY) and works with any provider that Instructor supports. You can customise the provider and model via the LLMClient helper if needed.

If your input model already uses list fields (for example payer: list[str] = ["Cigna", "Aetna"]), those lists are treated as fixed options and are not modified by Evaluator.options(). Scalar fields of any basic type (str, int, float, and so on) are turned into lists of options automatically.
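
A minimal sketch of mixing fixed and generated options, using the same API as the basic usage above:

import asyncio
from pydantic import BaseModel, Field

from evaluateur import Evaluator
from evaluateur.configs import TupleConfig


class Query(BaseModel):
    # List field: treated as fixed options and left untouched.
    payer: list[str] = ["Cigna", "Aetna"]
    # Scalar field: options are generated automatically.
    age: str = Field(..., description="patient age category, like 'adult' or 'pediatric'")


async def main() -> None:
    evaluator = Evaluator(Query, context="Healthcare prior authorization")
    options = await evaluator.options(config=TupleConfig(options_per_field=3))
    # "payer" keeps exactly ["Cigna", "Aetna"]; "age" gets generated options.


asyncio.run(main())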

Instructions: options vs queries

Evaluateur uses two different instruction strings:

  • Option generation: use TupleConfig.options_instructions (either via Evaluator.options(config=...) or via Evaluator.run(..., tuple_config=...)) to guide what dimension values to propose (e.g. “Focus on common US payers.”).
  • Query generation: use QueryConfig.instructions (either via Evaluator.run(..., query_config=...) or Evaluator.queries(..., config=...)) to guide how the natural language query should be written (e.g. “Keep the question short and specific.”).

Tuple generation: seeded sampling for cross product

When TupleStrategy.CROSS_PRODUCT is used and 0 < count < total_combinations, Evaluateur returns a seeded random sample of the Cartesian product (uniform, without replacement). This avoids always taking the “first N” combinations when the space is large.

  • To get reproducible results, set TupleConfig(seed=...).
  • Changing the seed gives you a different randomized subset.
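
To make the sampling concrete, here is a plain-Python sketch of the same semantics (illustrative only; this is not Evaluateur's internal code):

import itertools
import random

options = {
    "payer": ["Cigna", "Aetna"],
    "age": ["adult", "pediatric"],
    "geography": ["CA", "TX", "NY"],
}
combos = list(itertools.product(*options.values()))  # 2 * 2 * 3 = 12 combinations
# With count=5 and seed=0: a reproducible uniform sample without replacement.
sample = random.Random(0).sample(combos, k=5)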

Goal-guided query optimization (Components / Trajectories / Outcomes)

You can guide query generation using the three-layer framework by providing a GoalSpec (structured) or free-form text (which is normalized into a GoalSpec).

Goals are used to condition queries per run, so you can iterate quickly. If you provide GoalItem.examples, they are included in the internal goal prompt passed to the query generator.

When you pass free-form text for goals, Evaluateur will ask an LLM to convert it into a structured GoalSpec. For best results, include concrete example user questions and any measurable acceptance criteria (e.g. “cite policy section,” “ask a clarifying question if payer is missing,” “return a checklist”). This helps produce low-overlap goals across components vs trajectories vs outcomes.
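
For instance, attaching examples to a goal item (the question text here is illustrative):

GoalItem(
    name="freshness checks",
    examples=["Is Cigna's 2024 prior-auth policy for GLP-1s still the latest version?"],
)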

Sampling goals per query (diversity mode)

By default, Evaluateur picks a single focus area (components, trajectories, or outcomes) per generated query. This helps ensure one run produces a mix of different stress-test styles.

import asyncio
from pydantic import BaseModel, Field

from evaluateur import Evaluator, GoalItem, GoalLayer, GoalSpec, QueryMode
from evaluateur.configs import QueryConfig


class Query(BaseModel):
    payer: str = Field(...)
    age: str = Field(...)
    complexity: str = Field(...)
    geography: str = Field(...)


async def main() -> None:
    evaluator = Evaluator(Query, context="Healthcare prior authorization")

    goals = GoalSpec(
        components=GoalLayer(items=[GoalItem(name="freshness checks")]),
        trajectories=GoalLayer(items=[GoalItem(name="conflict handling")]),
        outcomes=GoalLayer(items=[GoalItem(name="checklist-ready")]),
    )

    async for q in evaluator.run(
        query_config=QueryConfig(
            mode=QueryMode.INSTRUCTOR,
            goal_seed=0,
            instructions="Make the question sound like a real user.",
        ),
        goals=goals,
    ):
        print(q.metadata.goal_focus_area, "->", q.query)
        break


asyncio.run(main())

Context builders (advanced)

A context builder is a callable used by query generators to vary the prompt per tuple, instead of using one shared context string for the whole run.

It returns two things:

  • A context string to include in the prompt for that tuple
  • Optional per-query metadata (extra keys are allowed) that will be merged into q.metadata

Evaluateur uses a context builder internally when you enable goal sampling (QueryConfig(goal_mode="sample")) so that each generated query can focus on a different goal area (components vs trajectories vs outcomes).
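
For example, enabling sampling with a reproducible seed (goal_mode and goal_seed both appear elsewhere in this README; combining them like this is a sketch, not a prescribed recipe):

from evaluateur import QueryMode
from evaluateur.configs import QueryConfig

config = QueryConfig(mode=QueryMode.INSTRUCTOR, goal_mode="sample", goal_seed=0)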

If you write a custom query generator, accept context_builder and fall back to the base context when it is not provided.

from __future__ import annotations

from collections.abc import AsyncIterator

from evaluateur.generators.query.protocols import ContextBuilder
from evaluateur.models import GeneratedQuery, GeneratedTuple


class MyQueryGenerator:
    async def generate(
        self,
        tuples: AsyncIterator[GeneratedTuple],
        context: str,
        *,
        context_builder: ContextBuilder | None = None,
    ) -> AsyncIterator[GeneratedQuery]:
        async for t in tuples:
            if context_builder is None:
                effective_context, meta = context, {}
            else:
                effective_context, meta = context_builder(t)

            # Use effective_context to build your prompt, and attach meta if you want.
            yield GeneratedQuery(query=f"ctx={effective_context}", source_tuple=t, metadata=meta)
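
For completeness, a hypothetical context builder to pair with it (assuming, per the protocol above, that it takes a GeneratedTuple and returns a (context, metadata) pair, and that GeneratedTuple.values is a mapping of field names to values):

def regional_context(t: GeneratedTuple) -> tuple[str, dict]:
    # Vary the prompt per tuple; "geography" matches the Query model used earlier.
    region = t.values.get("geography", "unspecified")
    return f"Healthcare prior authorization in {region}", {"region": region}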

Structured goals:

import asyncio
from pydantic import BaseModel, Field

from evaluateur import Evaluator, GoalItem, GoalLayer, GoalSpec, QueryMode
from evaluateur.configs import QueryConfig


class Query(BaseModel):
    payer: str = Field(..., description="insurance payer, like Cigna")
    age: str = Field(..., description="patient age category, like 'adult' or 'pediatric'")
    complexity: str = Field(..., description="complexity bucket, e.g. comorbidities, off-label")
    geography: str = Field(..., description="geography indicator, like state or zip code")


async def main() -> None:
    evaluator = Evaluator(Query, context="Healthcare prior authorization")

    goals = GoalSpec(
        components=GoalLayer(
            summary="Stress freshness, missing-document detection, and citation traceability.",
            items=[
                GoalItem(
                    name="freshness checks",
                    must_include=["effective date", "latest policy", "as of"],
                    avoid=["undated", "last year"],
                ),
                GoalItem(
                    name="grounded claims",
                    must_include=["cite", "policy section"],
                ),
            ],
        ),
        trajectories=GoalLayer(
            items=[
                GoalItem(
                    name="conflict integration",
                    must_include=["conflicting", "payer policy", "FDA label"],
                )
            ]
        ),
        outcomes=GoalLayer(
            items=[
                GoalItem(
                    name="checklist-ready",
                    must_include=["payer", "age", "diagnosis"],
                )
            ]
        ),
    )

    async for q in evaluator.run(
        query_config=QueryConfig(
            mode=QueryMode.INSTRUCTOR,
        ),
        goals=goals,
    ):
        print(q.metadata.query_goals.model_dump() if q.metadata.query_goals else None)
        break


asyncio.run(main())

Free-form goals (normalized with Instructor):

import asyncio
from pydantic import BaseModel, Field

from evaluateur import Evaluator, QueryMode
from evaluateur.configs import QueryConfig


class Query(BaseModel):
    payer: str = Field(...)
    age: str = Field(...)
    complexity: str = Field(...)
    geography: str = Field(...)


async def main() -> None:
    evaluator = Evaluator(Query, context="Healthcare prior authorization")

    i = 0
    async for q in evaluator.run(
        query_config=QueryConfig(
            mode=QueryMode.INSTRUCTOR,
        ),
        goals="""
Components: prioritize freshness checks, grounded citations, and missing-source detection (don’t proceed silently).
Trajectories: include conflict handling and recovery behavior (re-try, switch tools, or escalate when evidence conflicts).
Outcomes: produce checklist-ready outputs that are easy to review and hard to misuse.
""",
    ):
        print(q.query)
        i += 1
        if i >= 3:
            break


asyncio.run(main())
