selfevals

Self-improving evals framework for AI agents.

Point selfevals at your agent and it runs a structured experiment: it feeds eval cases through an adapter, grades each trace, sweeps the parameters you expose, and renders a report that tells you which configuration to keep. CLI-first, multi-tenant from day one, and agnostic to the agent framework underneath — selfevals never calls your provider; your agent does, and selfevals grades the result.

Status: v0.2.1 — runtime functional. The CLI works end-to-end: load an experiment spec → run cases through an adapter → grade traces → persist iterations → render a report. See docs/spec/ for the canonical and operational specs that drive design, and docs/STATUS.md for an honest what-works / what-doesn't snapshot.

Install

pip install selfevals

The distribution is selfevals; the import name and the CLI command are both selfevals (import selfevals, selfevals --help).

To run or trace an agent backed by a real provider, install the matching extra — each one bundles the provider's SDK and the tracing integration, so a single install is enough:

pip install 'selfevals[openai]'      # or [anthropic], [bedrock], [vertex],
                                      #    [langchain], [crewai]
pip install 'selfevals[all]'         # every provider + the web API

The core install depends only on pydantic and pyyaml; no provider SDK is pulled until you ask for an extra.

60-second quickstart

pip install selfevals
selfevals examples copy pingpong     # writes evals/ into the current dir
selfevals run evals/experiments/example_pingpong.yaml --no-persist

Expected output: a markdown report showing two iterations, the best one selected, and a top failure-modes table — end-to-end in under a second against the bundled EmbeddedAdapter echo agent. No API key needed.

To persist results to SQLite and inspect them afterwards (note: --db is a global flag, so it goes before the subcommand):

selfevals --db ./selfevals.sqlite run evals/experiments/example_pingpong.yaml
selfevals --db ./selfevals.sqlite experiment list <workspace_id>
selfevals --db ./selfevals.sqlite report <workspace_id> <experiment_id>

The run command prints the workspace and experiment ids you need for the follow-up commands.
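The ordering rule for --db is what you would expect from argparse (the Layout section lists cli/ as an argparse entrypoint) if the flag is declared on the top-level parser rather than on each subcommand. The sketch below is a hypothetical parser that only mirrors the shape of the commands above; it is not the selfevals source, and the program name is made up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: --db lives on the top-level parser,
    # "run" is a subcommand with one positional argument.
    parser = argparse.ArgumentParser(prog="selfevals-sketch")
    parser.add_argument("--db", default="./selfevals.sqlite")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("spec")
    return parser


parser = build_parser()

# Flag before the subcommand: the top-level parser consumes it.
ok, ok_extras = parser.parse_known_args(
    ["--db", "./my.sqlite", "run", "spec.yaml"]
)

# Flag after the subcommand: everything after "run" is handed to the
# subparser, which does not know --db, so it is left over as unrecognized
# and the default path stays in place. parse_args() would exit with an
# error here instead of returning.
late, late_extras = parser.parse_known_args(
    ["run", "spec.yaml", "--db", "./my.sqlite"]
)
```

In the second call the database path silently stays at its default, which is why a strict parser rejects the command instead of running against the wrong file.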

Concepts

The five nouns you'll meet everywhere:

| Term | What it is |
| --- | --- |
| EvalCase | One test: an input, the expected outcome, and which graders apply. |
| Adapter | The bridge to your agent: embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls it, never the provider directly. |
| Grader | Scores a trace. DeterministicGrader (rules: substrings, tools, JSON schema) or LLMJudgeGrader (a rubric-driven judge). |
| Proposer | Picks the next parameter configuration to try: manual, grid, or random. |
| DecisionMatrix | Turns each iteration's metrics into a verdict: keep, reject, investigate, spawn sub-experiment, or require a tradeoff review. |

An experiment is a YAML spec wiring these together; a run executes it, producing iterations the reporter ranks.
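To make the relationship between a case, an adapter, and a deterministic grader concrete, here is a minimal plain-Python sketch. Every name in it (Case, grade, run_cases, echo_agent) is hypothetical and chosen for illustration; none of it is the selfevals API, and the real entities are Pydantic models with more fields.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """Hypothetical stand-in for an eval case: input plus a substring rule."""
    case_id: str
    prompt: str
    must_contain: list[str]


# Stand-in for an embedded adapter: a callable mapping input to output.
Agent = Callable[[str], str]


def grade(output: str, case: Case) -> bool:
    """Deterministic rule: every required substring appears in the output."""
    text = output.lower()
    return all(s.lower() in text for s in case.must_contain)


def run_cases(agent: Agent, cases: list[Case]) -> dict[str, bool]:
    """Feed each case through the agent and grade what comes back."""
    return {c.case_id: grade(agent(c.prompt), c) for c in cases}


def echo_agent(prompt: str) -> str:
    """Toy agent used only for this sketch."""
    return f"pong: {prompt}"


cases = [
    Case("c1", "ping", ["pong"]),
    Case("c2", "hello", ["goodbye"]),
]
results = run_cases(echo_agent, cases)
pass_rate = sum(results.values()) / len(results)
```

The point of the split is that the grader never sees the agent, only its output, so the same cases and rules can be reused across adapters.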

Try it with a real LLM agent

Two parallel examples live in examples/ — same three eval cases (sentiment classification, structured extraction, open-ended support reply), same graders, same temperature sweep, differing only in the provider call. Both fall back to deterministic fakes when the API key is unset, so they're runnable offline.

Anthropic (examples/hello_llm/):

pip install 'selfevals[anthropic]'
export ANTHROPIC_API_KEY=sk-ant-...        # optional; falls back to a fake
uv run selfevals run examples/hello_llm/experiment.yaml --no-persist

OpenAI (examples/hello_openai/):

pip install 'selfevals[openai]'
export OPENAI_API_KEY=sk-...               # optional; falls back to a fake
uv run selfevals run examples/hello_openai/experiment.yaml --no-persist

Each combines a DeterministicGrader (sentiment + extraction) with an LLMJudgeGrader (the open-ended reply). The GridProposer sweeps temperature ∈ {0.0, 0.5, 1.0}; the report ranks them and the DecisionMatrix selects the winner. Against the real models the coolest temperature typically wins pass@1 while warmer settings degrade on the structured-output case.
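A grid sweep of this kind amounts to enumerating the Cartesian product of the values listed for each parameter. The sketch below shows that idea with the standard library; the grid function and the search_space dictionary are illustrative assumptions, not the GridProposer implementation or the spec format.

```python
from itertools import product
from typing import Any, Iterator


def grid(space: dict[str, list[Any]]) -> Iterator[dict[str, Any]]:
    """Yield one configuration per combination of parameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


# The temperature sweep from the examples: one parameter, three values.
search_space = {"temperature": [0.0, 0.5, 1.0]}
configs = list(grid(search_space))

# Adding a second (hypothetical) parameter multiplies the grid size.
wider = list(grid({"temperature": [0.0, 0.5, 1.0], "max_tokens": [256, 512]}))
```

With one parameter the grid has three configurations; with the second parameter it has six, which is why the number of iterations grows quickly as more parameters are exposed.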

See examples/README.md for a walk-through of the file layout and how to adapt them to your own agent.

The example specs and datasets reference examples.hello_*.agent import paths, so they run from a source checkout (clone the repo). The pip-installable selfevals examples copy pingpong flow ships only the dependency-free pingpong example today.

Adapters

selfevals ships three concrete AgentAdapter implementations so you can point the loop at any agent:

  • EmbeddedAdapter — a Python callable in-process. Best for quick tests.
  • CliCommandAdapter — invokes a subprocess and reads JSON on stdout.
  • HttpEndpointAdapter — POSTs each case to an HTTP endpoint and reads JSON.

See src/selfevals/runner/adapters.py for the contract and docs/adapters.md for usage examples, per-adapter YAML/code snippets, and a comparison table.
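The subprocess pattern behind a CLI adapter can be sketched with the standard library alone. This is an illustration under assumptions, not the CliCommandAdapter contract: the source only states that JSON is read from stdout, so passing the case on stdin and the field names "input" and "output" are made up here. Check docs/adapters.md for the real contract.

```python
import json
import subprocess
import sys

# A tiny stand-in agent, run as a separate Python process.
# It reads one JSON case from stdin and prints one JSON result to stdout.
AGENT_SOURCE = """
import json, sys
case = json.load(sys.stdin)
print(json.dumps({"output": case["input"].upper()}))
"""


def call_cli_agent(case: dict) -> dict:
    """Run the agent as a subprocess and parse the JSON it prints."""
    proc = subprocess.run(
        [sys.executable, "-c", AGENT_SOURCE],
        input=json.dumps(case),   # assumption: case is delivered on stdin
        capture_output=True,
        text=True,
        check=True,               # raise if the agent exits non-zero
        timeout=30,
    )
    return json.loads(proc.stdout)


result = call_cli_agent({"input": "ping"})
```

Because the boundary is just a process and a JSON document, the agent behind it can be written in any language.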

CLI reference

selfevals --help lists every command; selfevals <command> --help shows its arguments. The surface:

| Command | Purpose |
| --- | --- |
| init <slug> | Create a workspace and seed the default failure-mode taxonomy. |
| run <spec.yaml> | Run an experiment spec end-to-end. |
| report <ws> <exp> | Render a stored experiment as markdown (--format json for JSON). |
| compare <ws> <itr_a> <itr_b> | Diff two iterations side by side. |
| estimate | Dry-run cost estimate for a search space × cases × reps. |
| workspace show <ws> | Inspect a workspace. |
| experiment list/show <ws> [exp] | List or inspect experiments. |
| iteration list <ws> <exp> | List recorded iterations. |
| analyze pull/push <ws> <exp> | The error-analysis handshake (see below). |
| failuremode list/promote/retire/merge/edit | Manage the failure-mode taxonomy. |
| skills list / path <name> | Locate the agent skills bundled with the install. |
| examples copy <name> | Copy a runnable example into the current project. |

--db <path> is a global flag (default ./selfevals.sqlite) and goes before the subcommand.
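The "search space × cases × reps" product behind estimate is simple arithmetic and worth checking by hand before a sweep against a paid provider. The sketch below assumes a full grid over the search space; the function name and the repetition count are illustrative, and it does not reproduce what the estimate command actually prints.

```python
from math import prod


def count_agent_calls(search_space: dict[str, list], n_cases: int, reps: int) -> int:
    """Agent calls for a full grid: configurations x cases x repetitions."""
    n_configs = prod(len(values) for values in search_space.values())
    return n_configs * n_cases * reps


# Three temperatures, three eval cases, two repetitions each (illustrative).
calls = count_agent_calls({"temperature": [0.0, 0.5, 1.0]}, n_cases=3, reps=2)
```

Multiplying the call count by your own per-call price gives a rough upper bound on what a run will cost.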

Error analysis (closed loop)

selfevals grows a per-workspace failure-mode taxonomy and drives the next experiment from it, without ever calling an LLM itself. The loop has three steps: analyze pull emits the failed traces plus the live taxonomy; an external coding agent does the open/axial coding and returns its result with analyze push; and a human promotes candidate modes via failuremode promote. The bundled error-analysis skill (discoverable via selfevals skills list) encodes the method.

Layout

src/selfevals/        # the SDK package
  schemas/            # Pydantic v2 entities + contractual validators
  storage/            # SQLite + filesystem object store (interface abstracted)
  trace/              # native SDK decorators + OTel importer
  runner/             # agent adapters + executor + sandbox modes
  graders/            # deterministic + LLM-judge + calibration
  optimization/       # OptimizationLoop + proposers (manual/grid/random)
  decision/           # decision matrix → DecisionRecord
  reporter/           # markdown + JSON reports
  analysis/           # error-analysis handshake (pull/push, bundles)
  cli/                # argparse entrypoint
examples/             # runnable examples (pingpong, hello_llm, hello_openai)
docs/spec/            # canonical + operational specs (source of truth)
tests/                # pytest, mirrors src/selfevals layout

Development

uv sync --all-extras --dev        # venv + every extra + dev tooling
uv run pytest                     # tests
uv run mypy src/selfevals         # types (strict)
uv run ruff check .               # lint

See CONTRIBUTING.md for the test layout, the optional telemetry/web extras some tests require, and PR conventions.

License

Apache-2.0
