selfevals
Self-improving evals framework for AI agents.
Point selfevals at your agent and it runs a structured experiment: it feeds eval cases through an adapter, grades each trace, sweeps the parameters you expose, and renders a report that tells you which configuration to keep. CLI-first, multi-tenant from day one, and agnostic to the agent framework underneath — selfevals never calls your provider; your agent does, and selfevals grades the result.
Status: v0.2.1 — runtime functional. The CLI works end-to-end: load an experiment spec → run cases through an adapter → grade traces → persist iterations → render a report. See
docs/spec/ for the canonical and operational specs that drive design, and docs/STATUS.md for an honest what-works / what-doesn't snapshot.
Install
pip install selfevals
The distribution is selfevals; the import name and the CLI command are
both selfevals (import selfevals, selfevals --help).
To run or trace an agent backed by a real provider, install the matching extra — each one bundles the provider's SDK and the tracing integration, so a single install is enough:
pip install 'selfevals[openai]' # or [anthropic], [bedrock], [vertex],
# [langchain], [crewai]
pip install 'selfevals[all]' # every provider + the web API
The core install depends only on pydantic and pyyaml; no provider SDK
is pulled until you ask for an extra.
60-second quickstart
pip install selfevals
selfevals examples copy pingpong # writes evals/ into the current dir
selfevals run evals/experiments/example_pingpong.yaml --no-persist
Expected output: a markdown report showing two iterations, the best one
selected, and a top failure-modes table — end-to-end in under a second
against the bundled EmbeddedAdapter echo agent. No API key needed.
To persist results to SQLite and inspect them afterwards (note: --db is a
global flag, so it goes before the subcommand):
selfevals --db ./selfevals.sqlite run evals/experiments/example_pingpong.yaml
selfevals --db ./selfevals.sqlite experiment list <workspace_id>
selfevals --db ./selfevals.sqlite report <workspace_id> <experiment_id>
The run command prints the workspace and experiment ids you need for the
follow-up commands.
Concepts
The five nouns you'll meet everywhere:
| Term | What it is |
|---|---|
| EvalCase | One test: an input, the expected outcome, and which graders apply. |
| Adapter | The bridge to your agent — embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls it, never the provider directly. |
| Grader | Scores a trace. DeterministicGrader (rules: substrings, tools, JSON schema) or LLMJudgeGrader (a rubric-driven judge). |
| Proposer | Picks the next parameter configuration to try — manual, grid, or random. |
| DecisionMatrix | Turns each iteration's metrics into a verdict: keep, reject, investigate, spawn sub-experiment, or require a tradeoff review. |
An experiment is a YAML spec wiring these together; a run executes it, producing iterations the reporter ranks.
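To make the relationship between these nouns concrete, here is a minimal, self-contained sketch of the loop they describe. It is not the selfevals API: `Case`, `echo_agent`, `grade`, `grid`, and `run` are hypothetical names invented for illustration, and the real framework adds traces, persistence, and the decision matrix on top.

```python
from dataclasses import dataclass
from itertools import product

# NOTE: every name in this block is hypothetical; none of it is the selfevals API.


@dataclass(frozen=True)
class Case:
    prompt: str
    must_contain: str  # deterministic rule: substring expected in the output


def echo_agent(prompt: str, params: dict) -> str:
    """Stand-in agent: upper-cases the prompt only when 'shout' is set."""
    return prompt.upper() if params["shout"] else prompt


def grade(output: str, case: Case) -> bool:
    """Deterministic grader: pass when the expected substring is present."""
    return case.must_contain in output


def grid(space: dict) -> list:
    """Grid proposer: every combination of the exposed parameter values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def run(cases: list, agent, space: dict) -> list:
    """Run every configuration over every case and rank by pass rate."""
    results = []
    for params in grid(space):
        passed = sum(grade(agent(c.prompt, params), c) for c in cases)
        results.append((passed / len(cases), params))
    # Highest pass rate first; ties keep grid order because sorted() is stable.
    return sorted(results, key=lambda r: r[0], reverse=True)


cases = [Case("ping", "PING"), Case("pong", "PONG")]
ranking = run(cases, echo_agent, {"shout": [False, True]})
best_rate, best_params = ranking[0]
print(best_rate, best_params)  # 1.0 {'shout': True}
```

The agent is passed in as a plain callable, which mirrors the point made above: the loop only grades what the agent returns and never talks to a provider itself.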
Try it with a real LLM agent
Two parallel examples live in examples/ — same three eval
cases (sentiment classification, structured extraction, open-ended support
reply), same graders, same temperature sweep, differing only in the
provider call. Both fall back to deterministic fakes when the API key is
unset, so they're runnable offline.
Anthropic (examples/hello_llm/):
pip install 'selfevals[anthropic]'
export ANTHROPIC_API_KEY=sk-ant-... # optional; falls back to a fake
uv run selfevals run examples/hello_llm/experiment.yaml --no-persist
OpenAI (examples/hello_openai/):
pip install 'selfevals[openai]'
export OPENAI_API_KEY=sk-... # optional; falls back to a fake
uv run selfevals run examples/hello_openai/experiment.yaml --no-persist
Each combines a DeterministicGrader (sentiment + extraction) with an
LLMJudgeGrader (the open-ended reply). The GridProposer sweeps
temperature ∈ {0.0, 0.5, 1.0}; the report ranks them and the
DecisionMatrix selects the winner. Against the real models the coolest
temperature typically wins pass@1 while warmer settings degrade on the
structured-output case.
See examples/README.md for a walk-through of the
file layout and how to adapt them to your own agent.
The example specs and datasets reference
examples.hello_*.agent import paths, so they run from a source checkout (clone the repo). The pip-installable selfevals examples copy pingpong flow ships only the dependency-free pingpong example today.
Adapters
selfevals ships three concrete AgentAdapter implementations so you can
point the loop at any agent:
- EmbeddedAdapter: a Python callable in-process. Best for quick tests.
- CliCommandAdapter: invokes a subprocess and reads JSON on stdout.
- HttpEndpointAdapter: POSTs each case to an HTTP endpoint and reads JSON.
See src/selfevals/runner/adapters.py for the contract and
docs/adapters.md for usage examples, per-adapter
YAML/code snippets, and a comparison table.
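As a rough illustration of the subprocess shape, the sketch below runs a command, hands it one case, and parses JSON from stdout. It is not the CliCommandAdapter implementation: `call_cli_agent` is a hypothetical helper, and passing the case as JSON on stdin is an assumption made for this example (the actual contract is in the files referenced above).

```python
import json
import subprocess
import sys


# Hypothetical helper, not the selfevals CliCommandAdapter. It only shows the
# "run a subprocess, read JSON on stdout" shape. Sending the case on stdin is
# an assumption made for this sketch.
def call_cli_agent(command: list, case_input: dict, timeout: float = 30.0) -> dict:
    """Send one case as JSON on stdin and parse the JSON reply from stdout."""
    completed = subprocess.run(
        command,
        input=json.dumps(case_input),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


# A tiny stand-in agent: a Python one-liner that echoes the input back as JSON.
AGENT = [
    sys.executable,
    "-c",
    "import json, sys; d = json.load(sys.stdin); print(json.dumps({'output': d['input']}))",
]

reply = call_cli_agent(AGENT, {"input": "ping"})
print(reply)  # {'output': 'ping'}
```

Using `check=True` and a timeout means a crashed or hung agent surfaces as an exception rather than as an empty or partial result that a grader would then score.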
CLI reference
selfevals --help lists every command; selfevals <command> --help shows
its arguments. The surface:
| Command | Purpose |
|---|---|
| init <slug> | Create a workspace and seed the default failure-mode taxonomy. |
| run <spec.yaml> | Run an experiment spec end-to-end. |
| report <ws> <exp> | Render a stored experiment as markdown (--format json for JSON). |
| compare <ws> <itr_a> <itr_b> | Diff two iterations side by side. |
| estimate | Dry-run cost estimate for a search space × cases × reps. |
| workspace show <ws> | Inspect a workspace. |
| experiment list/show <ws> [exp] | List or inspect experiments. |
| iteration list <ws> <exp> | List recorded iterations. |
| analyze pull/push <ws> <exp> | The error-analysis handshake (see below). |
| failuremode list/promote/retire/merge/edit | Manage the failure-mode taxonomy. |
| skills list / path <name> | Locate the agent skills bundled with the install. |
| examples copy <name> | Copy a runnable example into the current project. |
--db <path> is a global flag (default ./selfevals.sqlite) and goes
before the subcommand.
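The ordering rule follows from how argparse handles subcommands. The sketch below is an illustrative parser only, not the real selfevals entrypoint: it defines just `--db` and a cut-down `run` command, and assumes the flag is registered on the top-level parser alone.

```python
import argparse

# Illustrative parser only; the real selfevals CLI defines more commands and
# options. The assumption here is that --db lives on the top-level parser.
parser = argparse.ArgumentParser(prog="selfevals")
parser.add_argument("--db", default="./selfevals.sqlite")
subcommands = parser.add_subparsers(dest="command", required=True)
run_parser = subcommands.add_parser("run")
run_parser.add_argument("spec")
run_parser.add_argument("--no-persist", action="store_true")

# Global flag before the subcommand: handled by the top-level parser.
args = parser.parse_args(["--db", "./my.sqlite", "run", "spec.yaml"])
print(args.db, args.command, args.spec)

# Flag omitted: the default path applies.
args_default = parser.parse_args(["run", "spec.yaml", "--no-persist"])
print(args_default.db, args_default.no_persist)

# Flag after the subcommand: the "run" sub-parser does not know --db, so
# argparse reports unrecognized arguments and exits.
try:
    parser.parse_args(["run", "spec.yaml", "--db", "./my.sqlite"])
    rejected = False
except SystemExit:
    rejected = True
print(rejected)
```

Once argparse hands control to a sub-parser, everything to the right of the subcommand name is interpreted by that sub-parser, which is why a top-level flag has to appear on the left.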
Error analysis (closed loop)
selfevals grows a per-workspace failure-mode taxonomy and drives the next
experiment from it, without ever calling an LLM itself. The loop has three
steps: analyze pull emits the failed traces plus the live taxonomy; an
external coding agent does the open/axial coding and sends the result back
with analyze push; a human promotes candidate modes via failuremode promote.
The bundled error-analysis skill (discoverable via selfevals skills list)
encodes the method.
Layout
src/selfevals/      # the SDK package
  schemas/          # Pydantic v2 entities + contractual validators
  storage/          # SQLite + filesystem object store (interface abstracted)
  trace/            # native SDK decorators + OTel importer
  runner/           # agent adapters + executor + sandbox modes
  graders/          # deterministic + LLM-judge + calibration
  optimization/     # OptimizationLoop + proposers (manual/grid/random)
  decision/         # decision matrix → DecisionRecord
  reporter/         # markdown + JSON reports
  analysis/         # error-analysis handshake (pull/push, bundles)
  cli/              # argparse entrypoint
examples/           # runnable examples (pingpong, hello_llm, hello_openai)
docs/spec/          # canonical + operational specs (source of truth)
tests/              # pytest, mirrors src/selfevals layout
Development
uv sync --all-extras --dev # venv + every extra + dev tooling
uv run pytest # tests
uv run mypy src/selfevals # types (strict)
uv run ruff check . # lint
See CONTRIBUTING.md for the test layout, the optional
telemetry/web extras some tests require, and PR conventions.
License
Apache-2.0