Skip to main content

Self-improving evals framework for AI agents.

Project description

selfevals

Self-improving evals framework for AI agents.

Point selfevals at your agent and it runs a structured experiment: it feeds eval cases through an adapter, grades each trace, sweeps the parameters you expose, and renders a report that tells you which configuration to keep. CLI-first, multi-tenant from day one, and agnostic to the agent framework underneath — selfevals never calls your provider; your agent does, and selfevals grades the result.

Status: v0.3.0 — runtime functional. The CLI works end-to-end: load an experiment spec → run cases through an adapter → grade traces → persist iterations → render a report. Adapters and graders are async, with concurrent repetitions and grading. See docs/spec/ for the canonical and operational specs that drive design, and docs/STATUS.md for an honest what-works / what-doesn't snapshot.

Install

pip install selfevals

The distribution is selfevals; the import name and the CLI command are both selfevals (import selfevals, selfevals --help).

To run or trace an agent backed by a real provider, install the matching extra — each one bundles the provider's SDK and the tracing integration, so a single install is enough:

pip install 'selfevals[openai]'      # or [anthropic], [bedrock], [vertex],
                                      #    [langchain], [crewai]
pip install 'selfevals[all]'         # every provider + the web API

The core install depends only on pydantic and pyyaml; no provider SDK is pulled until you ask for an extra.

60-second quickstart

pip install selfevals
selfevals examples copy pingpong     # writes evals/ into the current dir
selfevals run evals/experiments/example_pingpong.yaml --no-persist

Expected output: a markdown report showing two iterations, the best one selected, and a top failure-modes table — end-to-end in under a second against the bundled EmbeddedAdapter echo agent. No API key needed.

To persist results to SQLite and inspect them afterwards (note: --db is a global flag, so it goes before the subcommand):

selfevals --db ./selfevals.sqlite run evals/experiments/example_pingpong.yaml
selfevals --db ./selfevals.sqlite experiment list <workspace_id>
selfevals --db ./selfevals.sqlite report <workspace_id> <experiment_id>

The run command prints the workspace and experiment ids you need for the follow-up commands.

Concepts

The five nouns you'll meet everywhere:

Term What it is
EvalCase One test: an input (a validated multi-turn messages conversation, or any opaque payload), the expected outcome, and which graders apply.
Adapter The bridge to your agent — embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls it, never the provider directly.
Grader Scores a trace. DeterministicGrader (rules: substrings, tools, JSON schema) or LLMJudgeGrader (a rubric-driven judge).
Proposer Picks the next parameter configuration to try — manual, grid, or random.
DecisionMatrix Turns each iteration's metrics into a verdict: keep, reject, investigate, spawn sub-experiment, or require a tradeoff review.

An experiment is a YAML spec wiring these together; a run executes it, producing iterations the reporter ranks.

Try it with a real LLM agent

Two parallel examples live in examples/ — same three eval cases (sentiment classification, structured extraction, open-ended support reply), same graders, same temperature sweep, differing only in the provider call. Both fall back to deterministic fakes when the API key is unset, so they're runnable offline.

Anthropic (examples/hello_llm/):

pip install 'selfevals[anthropic]'
export ANTHROPIC_API_KEY=sk-ant-...        # optional; falls back to a fake
uv run selfevals run examples/hello_llm/experiment.yaml --no-persist

OpenAI (examples/hello_openai/):

pip install 'selfevals[openai]'
export OPENAI_API_KEY=sk-...               # optional; falls back to a fake
uv run selfevals run examples/hello_openai/experiment.yaml --no-persist

Each combines a DeterministicGrader (sentiment + extraction) with an LLMJudgeGrader (the open-ended reply). The GridProposer sweeps temperature ∈ {0.0, 0.5, 1.0}; the report ranks them and the DecisionMatrix selects the winner. Against the real models the coolest temperature typically wins pass@1 while warmer settings degrade on the structured-output case.

See examples/README.md for a walk-through of the file layout and how to adapt them to your own agent.

The example specs and datasets reference examples.hello_*.agent import paths, so they run from a source checkout (clone the repo). The pip-installable selfevals examples copy pingpong flow ships only the dependency-free pingpong example today.

Adapters

selfevals ships three concrete AgentAdapter implementations so you can point the loop at any agent:

  • EmbeddedAdapter — a Python callable in-process. Best for quick tests.
  • CliCommandAdapter — invokes a subprocess and reads JSON on stdout.
  • HttpEndpointAdapter — POSTs each case to an HTTP endpoint and reads JSON.

See src/selfevals/runner/adapters.py for the contract and docs/adapters.md for usage examples, per-adapter YAML/code snippets, and a comparison table.

CLI reference

selfevals --help lists every command; selfevals <command> --help shows its arguments. The surface:

Command Purpose
init <slug> Create a workspace and seed the default failure-mode taxonomy.
run <spec.yaml> Run an experiment spec end-to-end.
report <ws> <exp> Render a stored experiment as markdown (--format json for JSON).
compare <ws> <itr_a> <itr_b> Diff two iterations side by side.
estimate Dry-run cost estimate for a search space × cases × reps.
workspace show <ws> Inspect a workspace.
experiment list/show <ws> [exp] List or inspect experiments.
iteration list <ws> <exp> List recorded iterations.
analyze pull/push <ws> <exp> The error-analysis handshake (see below).
failuremode list/promote/retire/merge/edit Manage the failure-mode taxonomy.
skills list / path <name> Locate the agent skills bundled with the install.
examples copy <name> Copy a runnable example into the current project.

--db <path> is a global flag (default ./selfevals.sqlite) and goes before the subcommand.

Error analysis (closed loop)

selfevals grows a per-workspace failure-mode taxonomy and drives the next experiment from it — it never calls an LLM itself. analyze pull emits the failed traces plus the live taxonomy; an external coding agent does the open/axial coding and analyze pushes the result back; a human promotes candidate modes via failuremode promote. The bundled error-analysis skill (discoverable via selfevals skills list) encodes the method.

Layout

src/selfevals/        # the SDK package
  schemas/            # Pydantic v2 entities + contractual validators
  storage/            # SQLite + filesystem object store (interface abstracted)
  trace/              # native SDK decorators + OTel importer
  runner/             # agent adapters + executor + sandbox modes
  graders/            # deterministic + LLM-judge + calibration
  optimization/       # OptimizationLoop + proposers (manual/grid/random)
  decision/           # decision matrix → DecisionRecord
  reporter/           # markdown + JSON reports
  analysis/           # error-analysis handshake (pull/push, bundles)
  cli/                # argparse entrypoint
examples/             # runnable examples (pingpong, hello_llm, hello_openai)
docs/spec/            # canonical + operational specs (source of truth)
tests/                # pytest, mirrors src/selfevals layout

Development

uv sync --all-extras --dev        # venv + every extra + dev tooling
uv run pytest                     # tests
uv run mypy src/selfevals         # types (strict)
uv run ruff check .               # lint

See CONTRIBUTING.md for the test layout, the optional telemetry/web extras some tests require, and PR conventions.

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

selfevals-0.3.0.tar.gz (265.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

selfevals-0.3.0-py3-none-any.whl (170.8 kB view details)

Uploaded Python 3

File details

Details for the file selfevals-0.3.0.tar.gz.

File metadata

  • Download URL: selfevals-0.3.0.tar.gz
  • Upload date:
  • Size: 265.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for selfevals-0.3.0.tar.gz
Algorithm Hash digest
SHA256 90cc2e40aeec61bebef341e3c6fd9e349e1f74dddbd4a15ab5d745900d21e998
MD5 7a0fdd60bd0b4d37a532dbb127daed4c
BLAKE2b-256 687842a71873449ec758cd7c90621d9f37c5c6389a68cbf9e311e9eb87d1d57d

See more details on using hashes here.

File details

Details for the file selfevals-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: selfevals-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 170.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for selfevals-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 efc8cc22ff1e3fc0e5ada242900ef5608ee5248927b0137930170633dcc1b2aa
MD5 1175be18c73e696f621ae5f90f73c5bd
BLAKE2b-256 696eb6dfd726065ccbb413fd5553493f4637ff892af7214bc7ba2f89e1ef901a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page