selfevals

Self-improving evals framework for AI agents.

Point selfevals at your agent and it runs a structured experiment: it feeds eval cases through an adapter, grades each trace, sweeps the parameters you expose, and renders a report that tells you which configuration to keep. CLI-first, multi-tenant from day one, and agnostic to the agent framework underneath — selfevals never calls your provider; your agent does, and selfevals grades the result.

Status: v0.3.0 — runtime functional. The CLI works end-to-end: load an experiment spec → run cases through an adapter → grade traces → persist iterations → render a report. Adapters and graders are async, with concurrent repetitions and grading. See docs/spec/ for the canonical and operational specs that drive design, and docs/STATUS.md for an honest what-works / what-doesn't snapshot.
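The concurrency described above can be pictured with plain asyncio. This is a minimal sketch, not the selfevals runtime: `echo_adapter`, `contains_grader`, and `run_case` are hypothetical stand-ins, and the real adapter and grader interfaces may look different.

```python
import asyncio


# Hypothetical stand-ins: selfevals' real adapter and grader interfaces
# live in the package and may differ from these signatures.
async def echo_adapter(case_input: str) -> str:
    """Pretend agent: yields control once, then returns the input unchanged."""
    await asyncio.sleep(0)
    return case_input


async def contains_grader(trace: str, expected: str) -> bool:
    """Pretend grader: passes when the expected substring appears in the trace."""
    await asyncio.sleep(0)
    return expected in trace


async def run_case(case_input: str, expected: str, repetitions: int) -> list[bool]:
    # Launch every repetition concurrently, then grade all traces concurrently.
    traces = await asyncio.gather(
        *(echo_adapter(case_input) for _ in range(repetitions))
    )
    grades = await asyncio.gather(
        *(contains_grader(trace, expected) for trace in traces)
    )
    return list(grades)


results = asyncio.run(run_case("ping pong", "pong", repetitions=3))
print(results)
```

With a real agent the adapter call is where network latency lives, so gathering repetitions is what keeps wall-clock time close to that of a single call.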

Install

pip install selfevals

The distribution is selfevals; the import name and the CLI command are both selfevals (import selfevals, selfevals --help).

To run or trace an agent backed by a real provider, install the matching extra — each one bundles the provider's SDK and the tracing integration, so a single install is enough:

pip install 'selfevals[openai]'      # or [anthropic], [bedrock], [vertex],
                                      #    [langchain], [crewai]
pip install 'selfevals[all]'         # every provider + the web API

The core install depends only on pydantic and pyyaml; no provider SDK is pulled until you ask for an extra.

60-second quickstart

pip install selfevals
selfevals examples copy pingpong     # writes evals/ into the current dir
selfevals run evals/experiments/example_pingpong.yaml --no-persist

Expected output: a markdown report showing two iterations, the best one selected, and a top failure-modes table — end-to-end in under a second against the bundled EmbeddedAdapter echo agent. No API key needed.

To persist results to SQLite and inspect them afterwards (note: --db is a global flag, so it goes before the subcommand):

selfevals --db ./selfevals.sqlite run evals/experiments/example_pingpong.yaml
selfevals --db ./selfevals.sqlite experiment list <workspace_id>
selfevals --db ./selfevals.sqlite report <workspace_id> <experiment_id>

The run command prints the workspace and experiment ids you need for the follow-up commands.

Concepts

The five nouns you'll meet everywhere:

Term	What it is
EvalCase	One test: an input (a validated multi-turn `messages` conversation, or any opaque payload), the expected outcome, and which graders apply.
Adapter	The bridge to your agent — embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls it, never the provider directly.
Grader	Scores a trace. `DeterministicGrader` (rules: substrings, tools, JSON schema) or `LLMJudgeGrader` (a rubric-driven judge).
Proposer	Picks the next parameter configuration to try — `manual`, `grid`, or `random`.
DecisionMatrix	Turns each iteration's metrics into a verdict: keep, reject, investigate, spawn sub-experiment, or require a tradeoff review.

An experiment is a YAML spec wiring these together; a run executes it, producing iterations the reporter ranks.
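To make the relationship between these nouns concrete, here is a toy model in plain Python. It is a sketch under stated assumptions: `ToyCase`, `toy_grader`, `run_iteration`, `decide`, and the `Verdict` member names are invented for illustration and are not the selfevals classes, and the keep/reject rule is a deliberately simple placeholder for whatever the real DecisionMatrix does.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    # Member names are assumptions; the five outcomes come from the table above.
    KEEP = "keep"
    REJECT = "reject"
    INVESTIGATE = "investigate"
    SPAWN_SUB_EXPERIMENT = "spawn sub-experiment"
    TRADEOFF_REVIEW = "require a tradeoff review"


@dataclass
class ToyCase:
    input: str
    expected_substring: str


def toy_grader(trace: str, case: ToyCase) -> bool:
    """Deterministic rule: the expected substring must appear in the trace."""
    return case.expected_substring in trace


def run_iteration(
    agent: Callable[[str, dict], str], cases: list[ToyCase], params: dict
) -> float:
    """Run every case through the agent with one configuration; return pass rate."""
    passed = sum(toy_grader(agent(case.input, params), case) for case in cases)
    return passed / len(cases)


def decide(pass_rate: float, baseline: float) -> Verdict:
    # Placeholder rule: keep a configuration only if it beats the baseline.
    return Verdict.KEEP if pass_rate > baseline else Verdict.REJECT


def shouting_agent(text: str, params: dict) -> str:
    """Stand-in agent with a single tunable parameter."""
    return text.upper() if params["shout"] else text


cases = [ToyCase("hello", "HELLO"), ToyCase("bye", "BYE")]
configs = [{"shout": False}, {"shout": True}]  # what a manual proposer might emit
rates = [run_iteration(shouting_agent, cases, params) for params in configs]
verdict = decide(rates[1], baseline=rates[0])
print(rates, verdict)
```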

Try it with a real LLM agent

Two parallel examples live in examples/ — same three eval cases (sentiment classification, structured extraction, open-ended support reply), same graders, same temperature sweep, differing only in the provider call. Both fall back to deterministic fakes when the API key is unset, so they're runnable offline.

Anthropic (examples/hello_llm/):

pip install 'selfevals[anthropic]'
export ANTHROPIC_API_KEY=sk-ant-...        # optional; falls back to a fake
uv run selfevals run examples/hello_llm/experiment.yaml --no-persist

OpenAI (examples/hello_openai/):

pip install 'selfevals[openai]'
export OPENAI_API_KEY=sk-...               # optional; falls back to a fake
uv run selfevals run examples/hello_openai/experiment.yaml --no-persist

Each example combines a DeterministicGrader (sentiment + extraction) with an LLMJudgeGrader (the open-ended reply). The GridProposer sweeps temperature ∈ {0.0, 0.5, 1.0}; the report ranks the three settings and the DecisionMatrix selects the winner. Against the real models, the lowest temperature typically wins pass@1, while warmer settings degrade on the structured-output case.
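The sweep and ranking can be sketched in a few lines. Two assumptions are baked in: `grid` is a generic Cartesian-product expansion rather than the GridProposer source, and pass@1 is read here as the per-case pass rate over repetitions, averaged across cases. The pass/fail values are placeholders chosen for illustration, not measurements.

```python
from itertools import product

search_space = {"temperature": [0.0, 0.5, 1.0]}


def grid(space: dict[str, list]) -> list[dict]:
    """Expand a search space into every combination of parameter values."""
    names = list(space)
    return [
        dict(zip(names, combo))
        for combo in product(*(space[name] for name in names))
    ]


def pass_at_1(outcomes: list[list[bool]]) -> float:
    """outcomes holds one list per case, with one boolean per repetition."""
    per_case = [sum(reps) / len(reps) for reps in outcomes]
    return sum(per_case) / len(per_case)


configs = grid(search_space)

# Placeholder outcomes keyed by temperature (2 cases x 2 repetitions each).
# These are invented for the example and are NOT real results.
placeholder_outcomes = {
    0.0: [[True, True], [True, True]],
    0.5: [[True, True], [True, False]],
    1.0: [[True, False], [False, False]],
}

scores = {temp: pass_at_1(runs) for temp, runs in placeholder_outcomes.items()}
best = max(scores, key=scores.get)
print(configs, scores, best)
```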

See examples/README.md for a walk-through of the file layout and how to adapt them to your own agent.

The example specs and datasets reference examples.hello_*.agent import paths, so they only run from a source checkout (clone the repo). The pip-installed selfevals examples copy command currently ships just the dependency-free pingpong example.

Adapters

selfevals ships three concrete AgentAdapter implementations so you can point the loop at any agent:

EmbeddedAdapter — a Python callable in-process. Best for quick tests.
CliCommandAdapter — invokes a subprocess and reads JSON on stdout.
HttpEndpointAdapter — POSTs each case to an HTTP endpoint and reads JSON.

See src/selfevals/runner/adapters.py for the contract and docs/adapters.md for usage examples, per-adapter YAML/code snippets, and a comparison table.
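The subprocess transport boils down to "run a command, parse the JSON it prints". The sketch below shows that general technique only; it is not CliCommandAdapter. Passing the case on stdin, the `input`/`output` key names, and the `call_cli_agent` helper are all assumptions, so check docs/adapters.md for the actual contract.

```python
import json
import subprocess
import sys

# A stand-in "agent" program: reads one JSON object on stdin and prints one
# JSON object on stdout. The key names are assumptions for this sketch.
AGENT_SOURCE = (
    "import json, sys\n"
    "case = json.load(sys.stdin)\n"
    "print(json.dumps({'output': case['input'].upper()}))\n"
)


def call_cli_agent(case: dict) -> dict:
    """Run the agent as a subprocess and parse the JSON it writes to stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", AGENT_SOURCE],
        input=json.dumps(case),
        capture_output=True,
        text=True,
        check=True,   # raise if the agent exits with a non-zero status
        timeout=30,   # do not let a hung agent stall the whole run
    )
    return json.loads(completed.stdout)


reply = call_cli_agent({"input": "ping"})
print(reply)
```

An agent invoked this way should keep logs on stderr, since anything extra on stdout would break the JSON parse.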

CLI reference

selfevals --help lists every command; selfevals <command> --help shows its arguments. The surface:

Command	Purpose
`init <slug>`	Create a workspace and seed the default failure-mode taxonomy.
`run <spec.yaml>`	Run an experiment spec end-to-end.
`report <ws> <exp>`	Render a stored experiment as markdown (`--format json` for JSON; the JSON now includes per-iteration `cache` hit counts and deduplicated `failure_reasons`).
`compare <ws> <itr_a> <itr_b>`	Diff two iterations side by side.
`estimate`	Dry-run cost estimate for a search space × cases × reps.
`workspace show <ws>`	Inspect a workspace.
`experiment list/show <ws> [exp]`	List or inspect experiments.
`iteration list <ws> <exp>`	List recorded iterations.
`analyze pull/push <ws> <exp>`	The error-analysis handshake (see below).
`failuremode list/promote/retire/merge/edit`	Manage the failure-mode taxonomy.
`skills list / path <name>`	Locate the agent skills bundled with the install.
`examples copy <name>`	Copy a runnable example into the current project.
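The "search space × cases × reps" product behind `estimate` is simple arithmetic. The helper below is a hypothetical back-of-the-envelope version that assumes a full grid over the search space; `count_agent_calls` and the tokens-per-call figure are invented for the example and say nothing about how selfevals prices a run.

```python
from math import prod


def count_agent_calls(
    search_space: dict[str, list], n_cases: int, repetitions: int
) -> int:
    """Number of agent invocations for a full grid over the search space."""
    n_configs = prod(len(values) for values in search_space.values())
    return n_configs * n_cases * repetitions


calls = count_agent_calls(
    {"temperature": [0.0, 0.5, 1.0]}, n_cases=3, repetitions=5
)

ASSUMED_TOKENS_PER_CALL = 800  # placeholder figure, not a measured value
estimated_tokens = calls * ASSUMED_TOKENS_PER_CALL
print(calls, estimated_tokens)
```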

--db <path> is a global flag (default ./selfevals.sqlite) and goes before the subcommand.
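Why the position matters is easiest to see with argparse itself, which the layout below lists as the CLI entrypoint. This is a generic sketch with a hypothetical `toycli` parser, not the selfevals parser: an option defined on the top-level parser is not known to the subcommand's parser, so it is rejected when it appears after the subcommand.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="toycli")
    # Global option: defined on the top-level parser only.
    parser.add_argument("--db", default="./selfevals.sqlite")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("spec")
    run.add_argument("--no-persist", action="store_true")
    return parser


parser = build_parser()

# Global flag before the subcommand: accepted.
ok = parser.parse_args(["--db", "./my.sqlite", "run", "spec.yaml"])
print(ok.db, ok.command, ok.spec)

# Global flag after the subcommand: the "run" subparser does not know --db,
# so argparse reports unrecognized arguments and exits with status 2.
rejected = None
try:
    parser.parse_args(["run", "spec.yaml", "--db", "./my.sqlite"])
except SystemExit as exc:
    rejected = exc.code
print(rejected)
```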

Error analysis (closed loop)

selfevals grows a per-workspace failure-mode taxonomy and drives the next experiment from it, without ever calling an LLM itself. The handshake has three steps: analyze pull emits the failed traces plus the live taxonomy; an external coding agent does the open/axial coding and sends the result back with analyze push; a human then promotes candidate modes via failuremode promote. The bundled error-analysis skill (discoverable via selfevals skills list) encodes the method.
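The division of labour in that loop (an agent proposes, a human promotes) can be modelled with a small in-memory structure. Everything here is hypothetical: `ToyTaxonomy`, its methods, and the mode names are invented for illustration and do not reflect the stored schema or the bundle format.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTaxonomy:
    candidates: set[str] = field(default_factory=set)
    promoted: set[str] = field(default_factory=set)

    def push(self, proposed: list[str]) -> None:
        """Record modes proposed by the external agent; never auto-promote."""
        self.candidates.update(
            mode for mode in proposed if mode not in self.promoted
        )

    def promote(self, mode: str) -> None:
        """A human decision: move one candidate into the live taxonomy."""
        if mode not in self.candidates:
            raise KeyError(f"{mode!r} is not a candidate")
        self.candidates.remove(mode)
        self.promoted.add(mode)


# Mode names below are made up for the example.
taxonomy = ToyTaxonomy(promoted={"hallucinated_tool"})
taxonomy.push(["hallucinated_tool", "truncated_json"])  # agent's coding result
taxonomy.promote("truncated_json")                      # human sign-off
print(sorted(taxonomy.promoted), sorted(taxonomy.candidates))
```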

Documentation

Doc	What it covers
`docs/eval_config.md`	The YAML experiment spec: top-level keys, `EvalCase`/`Expected` fields (including recall-based `must_include` via `min_recall`), graders, agent transports, and proposers.
`docs/api_reference.md`	The canonical HTTP API reference — every endpoint, response schema, and error codes.
`docs/json_report_schema.md`	The `report --format json` output shape, including the per-iteration `cache` and `failure_reasons` keys.
`docs/adapters.md`	Adapter contract and per-transport YAML/code snippets.
`docs/FRONTEND.md`	The web UI spec (views, endpoints, roadmap).
`docs/STATUS.md`	Honest what-works / what-doesn't snapshot.
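As a rough illustration of what a recall-based `must_include` check could mean, the sketch below computes the fraction of required substrings found in an output and compares it with a threshold. The semantics are an assumption; docs/eval_config.md is the authority on how `min_recall` is actually evaluated, and both function names are hypothetical.

```python
def must_include_recall(output: str, must_include: list[str]) -> float:
    """Fraction of required substrings that appear in the output."""
    if not must_include:
        return 1.0  # nothing required, so nothing can be missing
    hits = sum(1 for needle in must_include if needle in output)
    return hits / len(must_include)


def passes(output: str, must_include: list[str], min_recall: float = 1.0) -> bool:
    """Pass when recall reaches the threshold; 1.0 means every item is required."""
    return must_include_recall(output, must_include) >= min_recall


output = "Order 1234 ships Tuesday to Berlin"
required = ["1234", "Tuesday", "Paris"]

recall = must_include_recall(output, required)      # 2 of 3 items found
lenient = passes(output, required, min_recall=0.6)  # tolerates one miss
strict = passes(output, required)                   # demands all three
print(recall, lenient, strict)
```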

Layout

src/selfevals/        # the SDK package
  schemas/            # Pydantic v2 entities + contractual validators
  storage/            # SQLite + filesystem object store (interface abstracted)
  trace/              # native SDK decorators + OTel importer
  runner/             # agent adapters + executor + sandbox modes
  graders/            # deterministic + LLM-judge + calibration
  optimization/       # OptimizationLoop + proposers (manual/grid/random)
  decision/           # decision matrix → DecisionRecord
  reporter/           # markdown + JSON reports
  analysis/           # error-analysis handshake (pull/push, bundles)
  cli/                # argparse entrypoint
examples/             # runnable examples (pingpong, hello_llm, hello_openai)
docs/spec/            # canonical + operational specs (source of truth)
tests/                # pytest, mirrors src/selfevals layout

Development

uv sync --all-extras --dev        # venv + every extra + dev tooling
uv run pytest                     # tests
uv run mypy src/selfevals         # types (strict)
uv run ruff check .               # lint

See CONTRIBUTING.md for the test layout, the optional telemetry/web extras some tests require, and PR conventions.

License

Apache-2.0

Release history

Version	Released
0.5.0 (this version)	May 29, 2026
0.4.2	May 28, 2026
0.4.0	May 28, 2026
0.3.0	May 27, 2026
0.2.2	May 27, 2026

Download files

Source Distribution

selfevals-0.5.0.tar.gz (392.5 kB)

Uploaded May 29, 2026 Source

Built Distribution

selfevals-0.5.0-py3-none-any.whl (240.4 kB)

Uploaded May 29, 2026 Python 3

File details

Details for the file selfevals-0.5.0.tar.gz.

File metadata

Download URL: selfevals-0.5.0.tar.gz
Upload date: May 29, 2026
Size: 392.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.12

File hashes

Hashes for selfevals-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`becf15b7e2493bce194b4f6732fdaef5e61a88693fcfccbd7a27038a0da5240d`
MD5	`3627225950cdc97bba16c4f466986731`
BLAKE2b-256	`90f1b3532e91552732a1485f95ffb5485762baf25f441aeec9691f843bc9f5e8`
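A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch: `sha256_of` and `matches_published_hash` are hypothetical helper names, and the file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for selfevals-0.5.0.tar.gz.
EXPECTED_SHA256 = "becf15b7e2493bce194b4f6732fdaef5e61a88693fcfccbd7a27038a0da5240d"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash(Path("selfevals-0.5.0.tar.gz"))` should return True for an intact download of the source distribution.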

File details

Details for the file selfevals-0.5.0-py3-none-any.whl.

File metadata

Download URL: selfevals-0.5.0-py3-none-any.whl
Upload date: May 29, 2026
Size: 240.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.12

File hashes

Hashes for selfevals-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`70e19859e15a92b819cd82f3e3fd630967bf4e66fe05fdee53734189046510ef`
MD5	`f27c2bcb69c934f96c122d6526fc0468`
BLAKE2b-256	`77934f9cc812f7d0994bc2008ec128f6e3a4e2aabc29b521d7b0ee0a43c2f502`
