CLI for evaluating Claude Code skills and AI agents

Caliper - Agent Skill Evaluation Harness

Evaluate AI agent skills with repeatable tasks, automated judging, and pass@k scoring.

Caliper is a local-first evaluation harness for Claude Code skills, Codex skills, and API-backed agents. It runs a skill against one or more task specs, records every attempt, judges the result with an LLM and/or deterministic Python assertions, and saves reproducible result files you can inspect later.

Use Caliper when you want to answer practical questions like:

Did this skill actually get better after my prompt edit?
Does it still pass the workflows it passed last week?
Does Codex or Claude Code run this skill more reliably for my use case?
Is the skill doing the work, or would the baseline agent pass without it?
Can contributors change a skill without relying on subjective manual testing?

Caliper is especially useful for agent skills because skills are hard to review with ordinary unit tests. A good skill is part prompt, part workflow, part tool contract. Caliper turns that behavior into versioned eval specs, repeatable runs, pass/fail judgments, and saved transcripts.

Highlights

Skill-first evaluation for Claude Code, Codex, Anthropic API, and OpenAI API backends.
Independent agent and judge backends, so you can test a Codex skill with a Claude judge, a Claude Code skill with a Codex judge, or keep everything on one provider.
Natural-language and deterministic checks through expect: and assert:.
pass@k scoring for measuring reliability across repeated attempts.
Baseline runs to show whether the skill improves over an unassisted agent.
Attempt isolation with fresh temporary homes and no session history.
Reproducible result files that snapshot the skill content, referenced local files, and git SHA when available.
Agent-installable evaluator skill so Claude Code or Codex can help create, validate, run, and interpret evals.

When To Use It

Caliper works well for:

evaluating Claude Code slash-command skills
evaluating Codex skills
comparing agent backends on the same task suite
regression-testing prompt and workflow changes
checking coding, review, refactor, summarization, screenshot, and file-writing behaviors
mixing LLM judgment with exact checks for files, JSON, command output, images, and repository state

It is not a replacement for normal unit tests. Use unit tests for deterministic library behavior. Use Caliper for agent behavior where the output depends on a model following instructions, using tools, and completing a workflow.

Install

The fastest way to install Caliper from GitHub is:

pipx install git+https://github.com/edonadei/caliper.git

Alternatively, install the same CLI from PyPI:

pipx install caliper-eval

Both install methods expose the caliper command:

caliper --help

For local development from the repository root:

pip install -e .

To also install the development tools and optional OpenAI API support:

pip install -e ".[dev,openai]"

Caliper requires Python 3.10 or newer.

Backend Setup

Caliper can run the agent under test and the judge through different backends.

Role	Claude Code CLI	Codex CLI	API backends
Agent under test	`skill.backend: claude-code`	`skill.backend: codex`	`skill.backend: claude-api` or `openai-api`
LLM judge	`judge.backend: claude-code`	`judge.backend: codex`	`judge.backend: claude-api` or `openai-api`
Auth/billing	Claude Code subscription/auth	Codex CLI subscription/auth	Provider API key/billing
Transcript	Claude `stream-json` tool-call transcript	Final Codex text output	Final API response text

Claude Code

Install and authenticate the claude CLI. Setting backend: claude-code uses your normal Claude Code CLI auth.

If you explicitly use backend: claude-api, set:

export ANTHROPIC_API_KEY=...

Codex

Install and authenticate the Codex CLI:

npm install -g @openai/codex
codex login
codex --version

With backend: codex, Caliper calls Codex through codex exec and does not fall back to the OpenAI API. If the CLI is unavailable or cannot authenticate, Caliper reports a backend configuration error.

When the Codex desktop app is installed, Caliper prefers the app-bundled Codex CLI over an older codex found on PATH. Set CODEX_CLI_PATH to force a specific CLI binary.
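
The lookup order described above can be pictured with a small sketch. It is an illustration only, not Caliper's actual code: the function name `resolve_codex_cli` and the `bundled_candidates` parameter are hypothetical, and because this README does not state where the app-bundled CLI lives, the caller has to supply candidate paths.

```python
import shutil
from pathlib import Path
from typing import Iterable, Mapping, Optional


def resolve_codex_cli(
    env: Mapping[str, str],
    bundled_candidates: Iterable[str] = (),
) -> Optional[str]:
    """Return the Codex CLI path to use, or None if nothing is found.

    Hypothetical helper mirroring the order described in this README:
    explicit override, then app-bundled CLI, then PATH.
    """
    # 1. An explicit CODEX_CLI_PATH always wins.
    override = env.get("CODEX_CLI_PATH")
    if override:
        return override

    # 2. Prefer an app-bundled CLI if one of the candidate files exists.
    #    The real bundle location is an assumption left to the caller.
    for candidate in bundled_candidates:
        if Path(candidate).is_file():
            return candidate

    # 3. Fall back to whatever `codex` is found on PATH.
    return shutil.which("codex", path=env.get("PATH"))
```

Calling `resolve_codex_cli(os.environ)` with no candidates reduces to the override followed by a plain PATH lookup.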

If you explicitly use backend: openai-api, set:

export OPENAI_API_KEY=...

Quick Start

Create an eval spec:

# my-skill.eval.yaml
skill:
  path: ./SKILL.md
  backend: codex

judge:
  backend: codex

tasks:
  - name: Produces the expected answer
    prompt: "Use this skill to answer: what is 2 + 2?"
    expect: "The assistant answers 4."

Run it:

caliper run my-skill.eval.yaml --k 3 --baseline

Example output:

CALIPER  -  my-skill  -  k=3  -  codex

ID      Task                           k (3)   pass@k
task-1  Produces the expected answer   2/3     96.3%    PARTIAL

With skill    96.3%   ###################-
No skill      70.4%   ##############------
Delta         +25.9%  up
Results saved to .caliper/results/my-skill/2026-05-22T14-23-01Z.json

Browse results:

caliper list
caliper report my-skill

Validate a spec before running:

caliper validate my-skill.eval.yaml

Recommended Workflow

Create a small eval spec for one behavior you care about.
Run it with --k 1 while iterating on the spec.
Add deterministic assert: checks for facts an LLM judge should not guess.
Run with --k 3 or higher once the task is stable.
Use --baseline to measure whether the skill helps over the raw agent.
Commit the spec beside the skill so future contributors can run the same evaluation before changing behavior.

caliper run path/to/skill.eval.yaml --k 3 --baseline --verbose

Install The Evaluator Skill

The repository includes an evaluate-skill agent skill. Installing it lets Claude Code or Codex help you create eval specs, validate them, run Caliper, and summarize results from inside your normal agent workflow.

If you installed the CLI, use the bundled installer:

caliper install-skill codex
caliper install-skill claude-code

Preview the destination without writing files:

caliper install-skill codex --dry-run

Use --force to overwrite an existing installed copy.

Claude Code

Without the CLI installer, copy the skill into Claude Code commands:

mkdir -p ~/.claude/commands
curl -fsSL https://raw.githubusercontent.com/edonadei/caliper/main/skills/evaluate-skill/SKILL.md \
  -o ~/.claude/commands/evaluate-skill.md

Then use it in Claude Code:

/evaluate-skill validate my-skill.eval.yaml
/evaluate-skill run my-skill.eval.yaml --k 3

Codex

Without the CLI installer, install the skill in Codex:

mkdir -p ~/.codex/skills/evaluate-skill
curl -fsSL https://raw.githubusercontent.com/edonadei/caliper/main/skills/evaluate-skill/SKILL.md \
  -o ~/.codex/skills/evaluate-skill/SKILL.md

Make sure caliper is on PATH for Codex sessions. If you installed Caliper in editable mode, the generated console script is usually enough.

Then ask Codex:

Use the evaluate-skill skill to validate my-skill.eval.yaml.
Use the evaluate-skill skill to run my-skill.eval.yaml with k=3 and summarize the result.

Examples

Codex Agent, Codex Judge

skill:
  path: ./SKILL.md
  backend: codex

judge:
  backend: codex

tasks:
  - name: Validates a spec
    prompt: "Use caliper to validate ./example.eval.yaml and summarize the result."
    expect: "The assistant runs caliper validate and reports whether the spec is valid."

caliper run my-codex-skill.eval.yaml --k 1 --verbose

Claude Code Agent, Claude Judge

skill:
  path: ~/.claude/commands/review.md
  backend: claude-code
  model: claude-sonnet-4-6

judge:
  backend: claude-code
  model: claude-haiku-4-5-20251001

tasks:
  - name: Finds a null dereference
    prompt: "/review the staged changes in /tmp/eval-repo"
    expect: "The review identifies a possible null pointer dereference."

Mix Backends

The agent backend and judge backend are independent:

skill:
  path: ./SKILL.md
  backend: codex

judge:
  backend: claude-code

Or opt into API billing explicitly:

skill:
  path: ./SKILL.md
  backend: openai-api
  model: gpt-4o-mini

judge:
  backend: openai-api
  model: gpt-4o-mini

Deterministic Assertions

Use assert: when success can be verified with Python. This is usually better than asking an LLM to judge files, JSON, command output, or screenshots.

tasks:
  - name: Writes an output file
    cleanup: rm -f /tmp/out.txt
    prompt: "Write hello world to /tmp/out.txt"
    expect: "A file is written at /tmp/out.txt."
    assert: |
      from pathlib import Path

      path = Path("/tmp/out.txt")
      assert path.exists(), "Output file was not created"
      assert path.read_text().strip() == "hello world"

When both expect and assert are present, both must pass.
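
As a further illustration of a deterministic check, the sketch below validates a JSON file instead of plain text. The helper name `check_report` and the `status` and `items` keys are hypothetical, chosen only for the example. It assumes, as the inline example above suggests, that a failing `assert` statement is what marks the attempt as failed.

```python
import json
from pathlib import Path


def check_report(path: Path) -> None:
    """Raise AssertionError unless `path` holds the expected JSON report."""
    assert path.exists(), f"{path} was not created"

    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        # Surface malformed JSON as an assertion failure with a clear reason.
        raise AssertionError(f"{path} is not valid JSON: {exc}") from exc

    assert isinstance(data, dict), "Report must be a JSON object"
    # Hypothetical schema: a status string and a non-empty list of items.
    assert data.get("status") == "ok", f"Unexpected status: {data.get('status')!r}"
    items = data.get("items")
    assert isinstance(items, list) and items, "items must be a non-empty list"
```

Inside an assert: block, the body would end with a call such as `check_report(Path("/tmp/report.json"))`, where the path is whatever file the task prompt asked the agent to write.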

Screenshot Skill Eval

The repo includes a Codex-backed screenshot eval:

caliper validate skills/evaluate-skill/references/evals/screenshot/screenshot.eval.yaml
caliper run skills/evaluate-skill/references/evals/screenshot/screenshot.eval.yaml --k 1 --judge script --verbose

On macOS, the process running the eval must have Screen Recording permission. If running screencapture -x /tmp/test.png directly fails, this eval will also fail until that permission is granted.

Spec Format

skill:
  path: ./SKILL.md              # optional path to the skill file
  backend: codex                # claude-code | codex | claude-api | openai-api
  model: <model-name>           # optional backend-specific model override

judge:
  backend: codex                # claude-code | codex | claude-api | openai-api
  model: <model-name>           # optional backend-specific model override

sandbox:
  extra_path:
    - ./bin                     # optional paths prepended to PATH
  forbidden_files:
    - ".*\\.eval\\.yaml$"       # agent cannot read the spec file
    - "./.caliper/.*"           # agent cannot read saved results

tasks:
  - name: Short task name
    setup: <shell command>      # optional, runs before each attempt
    cleanup: <shell command>    # optional, always runs after each attempt
    prompt: <prompt sent to the agent>
    expect: <natural-language success condition>
    assert: |
      # optional inline Python assertion
      assert True

  - name: Task with external assertion script
    prompt: "Generate a report"
    assert: ./assertions/check_report.py

Each task must define at least one of expect or assert. Task ids are assigned automatically as task-001, task-002, and so on.
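
The two rules above can be sketched in a few lines of Python operating on an already-parsed spec. This is not Caliper's validator (caliper validate remains the authoritative check); the function names are hypothetical and the id format simply follows the task-001 pattern stated above.

```python
from typing import Any, Dict, List


def assign_task_ids(tasks: List[Dict[str, Any]]) -> List[str]:
    """Return ids task-001, task-002, ... in declaration order."""
    return [f"task-{index:03d}" for index, _ in enumerate(tasks, start=1)]


def tasks_missing_checks(tasks: List[Dict[str, Any]]) -> List[str]:
    """Return the names of tasks that define neither expect nor assert."""
    return [
        task.get("name", "<unnamed>")
        for task in tasks
        if not task.get("expect") and not task.get("assert")
    ]


# Example tasks, written as the dicts a YAML parser would produce.
tasks = [
    {"name": "Short task name", "prompt": "...", "expect": "It works."},
    {"name": "Report task", "prompt": "...", "assert": "./assertions/check_report.py"},
    {"name": "No checks", "prompt": "..."},
]
print(assign_task_ids(tasks))
print(tasks_missing_checks(tasks))
```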

Commands

Command	Description
`caliper run <spec>`	Run an evaluation spec
`caliper new [name]`	Create a new evaluation spec with the wizard
`caliper validate <spec>`	Validate a spec file
`caliper list [spec]`	List specs and saved runs
`caliper report <spec-or-result>`	Re-render saved results

`caliper run` Flags

Flag	Default	Description
`--k INT`	`3`	Attempts per task
`--baseline`	off	Also run each task without the skill
`--judge autorater`	`autorater`	LLM judge gives a direct pass/fail
`--judge script`		Run static assertions and, if `expect` exists, an LLM judge
`--judge autorater-sdk`		Legacy alias for Anthropic SDK judging; prefer `judge.backend: claude-api`
`--workers INT`	`4`	Parallel task workers
`--timeout INT`	`120`	Seconds per attempt
`--model MODEL`		Override `skill.model` for the agent under test
`--verbose`	off	Show per-attempt judge reasoning
`--output PATH`		Also save results JSON to a specific path
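
When Caliper is driven from a script or CI job, the flags in this table can be assembled programmatically. The helper below is a hypothetical sketch, not part of Caliper: it only builds the argument list from flags documented above and does not run anything.

```python
import shlex
from typing import List, Optional


def build_run_command(
    spec: str,
    k: int = 3,
    baseline: bool = False,
    judge: Optional[str] = None,
    timeout: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build a `caliper run` argument list from the documented flags."""
    cmd = ["caliper", "run", spec, "--k", str(k)]
    if baseline:
        cmd.append("--baseline")
    if judge is not None:
        cmd += ["--judge", judge]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


cmd = build_run_command("my-skill.eval.yaml", baseline=True, output="results.json")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`. This README does not document Caliper's exit codes, so a caller that needs a pass/fail signal would likely read the JSON written to the `--output` path instead.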

Judging

Autorater

With --judge autorater, Caliper asks the configured judge backend to decide whether the transcript satisfies expect.

judge:
  backend: codex

Script Judge

With --judge script, Caliper runs a task's static assert: checks whenever the task defines them.

If the task also has expect, Caliper additionally asks the configured judge backend for an LLM verdict. With judge.backend: codex, that LLM check is performed by the Codex CLI. With judge.backend: claude-code, it is performed by the Claude Code CLI. Use claude-api or openai-api only when API billing is intended.
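
How the two verdicts combine can be sketched as follows, based on the rule stated earlier that both checks must pass when both are present. The function name `attempt_passes` is hypothetical and the real judge code may be structured differently.

```python
from typing import Optional


def attempt_passes(assert_ok: Optional[bool], llm_ok: Optional[bool]) -> bool:
    """Combine the static and LLM verdicts for one attempt.

    None means the task does not define that kind of check.
    """
    verdicts = [verdict for verdict in (assert_ok, llm_ok) if verdict is not None]
    if not verdicts:
        # A valid spec always defines at least one of expect or assert.
        raise ValueError("task defines neither expect nor assert")
    # Every check that exists has to pass.
    return all(verdicts)


print(attempt_passes(True, None))   # assert only
print(attempt_passes(True, False))  # assert passes, LLM judge rejects
```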

Static Assertions

Static assertions run locally with Python. They are ideal for verifying:

files exist
exact file contents
JSON/schema validity
command output
images or screenshots
repository state

Isolation And Reproducibility

Each attempt runs with a fresh temporary HOME directory. For Claude Code, Caliper installs a temporary slash-command skill in that isolated home. For Codex, Caliper injects the skill body directly into the prompt passed to codex exec.
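
The fresh-HOME idea can be illustrated with the standard library alone. This is a minimal sketch of the general technique, not Caliper's harness: the helper name and the `caliper-home-` prefix are hypothetical, and the real implementation also has to install the skill and capture a transcript.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def run_with_fresh_home(command: List[str]) -> subprocess.CompletedProcess:
    """Run `command` with HOME pointing at a new, empty temporary directory."""
    with tempfile.TemporaryDirectory(prefix="caliper-home-") as home:
        # Copy the current environment but override HOME for the child only.
        env = {**os.environ, "HOME": home}
        return subprocess.run(
            command, env=env, capture_output=True, text=True, check=False
        )
    # The temporary directory is deleted when the `with` block exits.


# Probe: a child Python process reports the HOME it was given.
probe = [sys.executable, "-c", "import os; print(os.environ['HOME'])"]
result = run_with_fresh_home(probe)
print(result.stdout.strip())
```

Because the child never sees the real home directory, any session history or user-level configuration stored there cannot leak into the attempt.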

Results are saved next to the spec:

.caliper/results/<spec-name>/<timestamp>.json

Each result includes a skill snapshot: the skill file content, referenced local files, and git SHA when available.
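
For scripts that post-process results, the newest file for a spec can be located from the path layout above. The sketch below is hypothetical and makes one assumption: timestamps follow the form shown in the example output (such as 2026-05-22T14-23-01Z), so sorting file names alphabetically also sorts them chronologically. It does not assume anything about the JSON schema.

```python
from pathlib import Path
from typing import Optional


def latest_result(spec_dir: Path, spec_name: str) -> Optional[Path]:
    """Return the most recent result file for a spec, or None if there is none."""
    results_dir = spec_dir / ".caliper" / "results" / spec_name
    if not results_dir.is_dir():
        return None
    # File names are timestamps, so the last name in sorted order is the newest.
    candidates = sorted(results_dir.glob("*.json"))
    return candidates[-1] if candidates else None
```

For example, `latest_result(Path("."), "my-skill")` looks under `.caliper/results/my-skill/` in the current directory.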

Scoring

For each task:

pass@k = 1 - (1 - successes / k)^k

The aggregate score is the average task pass@k. With --baseline, Caliper also runs the same tasks without the skill and reports the delta.
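
The formula is easy to check by hand or in a few lines of Python. The sketch below implements it exactly as written above and reproduces the figures from the Quick Start output. The function names are hypothetical, and the baseline of 1 success out of 3 is an assumption that happens to match the 70.4% shown there.

```python
from statistics import mean
from typing import Dict


def pass_at_k(successes: int, k: int) -> float:
    """pass@k = 1 - (1 - successes / k)^k, as defined in this README."""
    if k <= 0 or not 0 <= successes <= k:
        raise ValueError("need k > 0 and 0 <= successes <= k")
    return 1 - (1 - successes / k) ** k


def aggregate(successes_by_task: Dict[str, int], k: int) -> float:
    """Average the per-task pass@k values."""
    return mean(pass_at_k(s, k) for s in successes_by_task.values())


with_skill = pass_at_k(2, 3)  # 2 of 3 attempts passed
baseline = pass_at_k(1, 3)    # assumed: 1 of 3 attempts passed
print(f"{with_skill:.1%}")             # 96.3%
print(f"{baseline:.1%}")               # 70.4%
print(f"{with_skill - baseline:+.1%}")  # +25.9%
```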

Project Layout

caliper/
  commands/       Typer command implementations
  harness/        Claude, Codex, and API execution backends
  judge/          LLM and script judging implementations
  schema/         Eval spec and result models
  runner.py       Evaluation orchestration
skills/
  evaluate-skill/ Agent skill for running Caliper from Claude Code or Codex
tests/            Pytest coverage for harnesses, judges, and runner behavior

Contributing

Contributions are welcome when they keep Caliper focused on repeatable, maintainable skill evaluation.

Good first contribution areas:

add example evals for real skills
improve backend error messages
add deterministic assertion helpers
expand tests for harness and judge behavior
improve result reporting and summaries
document common setup problems for Claude Code and Codex

Before opening a pull request:

pip install -e ".[dev,openai]"
pytest
ruff check .
caliper validate skills/evaluate-skill/evaluate-skill.eval.yaml

When changing behavior, include either a test or an eval fixture that demonstrates the expected outcome. Keep backend-specific behavior isolated to the relevant module under caliper/harness/ or caliper/judge/ when possible.

Troubleshooting

`codex judge failed: model ... is not supported`

The model name in skill.model or judge.model is not available to your Codex account. Use a model that codex exec --model <name> supports.

`codex CLI not found`

Install the Codex CLI and ensure it is on PATH:

npm install -g @openai/codex

`claude` command not found

Install and authenticate Claude Code, or switch the relevant backend to codex, claude-api, or openai-api.

A task passes only because of `assert:`

When a task has only assert:, no LLM judge is required. Add expect: if you also want an LLM to judge the transcript.

Release history

0.2.0: May 24, 2026
0.1.0: May 24, 2026 (this version)

Download files

Source Distribution

caliper_eval-0.1.0.tar.gz (2.1 MB view details)

Uploaded May 24, 2026 Source

Built Distribution

caliper_eval-0.1.0-py3-none-any.whl (48.1 kB view details)

Uploaded May 24, 2026 Python 3

File details

Details for the file caliper_eval-0.1.0.tar.gz.

File metadata

Download URL: caliper_eval-0.1.0.tar.gz
Upload date: May 24, 2026
Size: 2.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for caliper_eval-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e968c85119bd46c82b4312c10c66a35dcc4ba1aa9ee217bf0aa78b5a572be591`
MD5	`fb5b4e3da48fe82cfef1e00c8a7a0c06`
BLAKE2b-256	`8a7b7e901b12e5fc9d2760588fd2c670ccf540b664657eb29e26a14a0c46b960`

File details

Details for the file caliper_eval-0.1.0-py3-none-any.whl.

File metadata

Download URL: caliper_eval-0.1.0-py3-none-any.whl
Upload date: May 24, 2026
Size: 48.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for caliper_eval-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4ef6a90be8f370d94f254c8d9ee9f91f4e64fe51c4fad985f048529900b84fb0`
MD5	`b6f4032c6741a366175502e3e623bc41`
BLAKE2b-256	`9d6d016c1eabc6508b0293dd7681f44f2a26b58bb5357e4cc2dc89635719c708`
