Prompt eval CLI to stress test agents and generate robust prompts.

ev

ev is an agent evaluation and prompt refinement tool designed to stress-test AI agents and make prompts more robust.

It does three main things:

Runs a suite of JSON test cases against a prompt pair (system_prompt.j2 + user_prompt.j2)
Evaluates results against explicit criteria defined in eval.md
Iteratively improves the prompts, only accepting new versions that perform better

Everything is plain files. No external services beyond the LLM APIs you already use.

Key Features

Multi-criteria evals: Test prompts against any number of criteria defined in eval.md.
Deterministic scoring: Cases × cycles ensure stable, noise-resistant pass rates.
Iterative refinement: Automatically proposes and tests improved prompt versions.
Version gating: Only snapshots a new version when it clearly outperforms the current one.
File-native: Everything is plain text and folders; no databases, no external infra.
Model-flexible: Use any provider/model via simple provider[name] notation.

Table of contents

Core concepts
Installation and requirements
Configuration and API keys
Project layout
Creating and setting up a test
Running evaluations
- ev run for optimization
- ev eval for evaluation only
Understanding the outputs
- summary.json
- versions/ and log.json
Other CLI commands
- ev list
- ev copy
- ev delete
- ev version
Models and cycles

Core concepts

Eval
An eval (also called a test) is a folder under evals (for example evals/myAgent). It contains:
- JSON cases in cases/
- Criteria definitions in eval.md
- A Pydantic schema in schema.py
- Prompt templates in system_prompt.j2 and user_prompt.j2
Case
A single JSON input file under cases/, i.e. the data you want to test. One eval should have many cases.
Eval criteria
Each # heading in eval.md defines one criterion.
You can have many criteria, and each is judged independently.
Cycles
A cycle means evaluating all cases once.
More cycles reduce randomness and stabilize the score.

Total evaluations per iteration:
cases × cycles
Iterations
Each iteration:
1. Evaluate the current prompts
2. Generate improved prompts
3. Re-evaluate the candidate
4. Compare pass rates
Total model calls per run:
cases × cycles × iterations

Think of it as a contest: each iteration tries to produce a better prompt.
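
A rough sketch of the arithmetic above, useful for estimating the size of a run before starting it. The function name and the numbers are illustrative and not part of ev.

```python
def planned_calls(cases: int, cycles: int, iterations: int) -> dict[str, int]:
    """Apply the two formulas above to estimate the size of a run."""
    per_iteration = cases * cycles        # total evaluations per iteration
    per_run = per_iteration * iterations  # total model calls per run
    return {"per_iteration": per_iteration, "per_run": per_run}


# Example: 10 cases, 2 cycles, 3 iterations
print(planned_calls(cases=10, cycles=2, iterations=3))
# {'per_iteration': 20, 'per_run': 60}
```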
Pass rate
Criteria scores are averaged across cases, then averaged across criteria.
This avoids one noisy criterion dominating the result.
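
The two-step averaging can be sketched as follows. This is a minimal illustration, not ev's actual code: the function name, the input shape (case name to criterion scores in the range 0 to 1), and the sample numbers are assumptions, and every case is assumed to report every criterion.

```python
from statistics import mean


def pass_rate(scores: dict[str, dict[str, float]]) -> float:
    """scores maps case name -> {criterion -> score in [0, 1]}."""
    criteria = sorted({name for case in scores.values() for name in case})
    # Step 1: average each criterion across all cases.
    per_criterion = [mean(case[name] for case in scores.values()) for name in criteria]
    # Step 2: average the per-criterion means.
    return mean(per_criterion)


scores = {
    "case1": {"classification": 1.0, "explanation": 0.5},
    "case2": {"classification": 1.0, "explanation": 1.0},
}
print(pass_rate(scores))  # 0.875
```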
Versions
A new version is created only if the best candidate from the run beats the active version.
At most one new version is created per ev run.

Installation and requirements

pip install evx

or with uv:

uv tool install evx

Requires Python >=3.12

Verify installation

ev --help

Configuration and API keys

The CLI reads configuration from:

.env file by default
Or environment variables, if you pass --key env

`.env` based config

By default, keys are loaded from a .env file, which must be in the repository root.

Currently supported .env variables:

OPENAI_API_KEY=sk-...
GROQ_API_KEY=gk-...

Key source flag

You can control where keys are loaded from using the --key flag:

--key file or -k file (default) loads from .env
--key env or -k env loads from os.environ

Examples:

# use keys from .env
ev create myAgent
ev run myAgent -i 3 --cycles 2 --key file

# use keys from environment variables (set is Windows cmd syntax; use export on macOS/Linux)
set OPENAI_API_KEY=sk-...
set GROQ_API_KEY=gk-...
ev run myAgent -i 3 --cycles 2 --key env
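
To make the two key sources concrete, here is a minimal sketch of how a tool could implement this switch using only the standard library. The function names and the simple KEY=VALUE parser are assumptions for illustration; ev's own loader may behave differently.

```python
import os
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    if not path.exists():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_keys(source: str = "file", env_path: Path = Path(".env")) -> dict[str, str]:
    """Mimic --key file / --key env for the two documented variables."""
    names = ("OPENAI_API_KEY", "GROQ_API_KEY")
    if source == "file":
        pool = parse_env_file(env_path)
    elif source == "env":
        pool = dict(os.environ)
    else:
        raise ValueError(f"unknown key source: {source!r}")
    return {name: pool[name] for name in names if name in pool}
```

Calling load_keys("env") reads only from the process environment, while load_keys("file") ignores the environment and reads the .env file.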

Project layout

At the top level, the tool expects an evals directory.

<repo-root>/
  evals/
    myAgent/
      cases/
        example.json
      eval.md
      schema.py
      system_prompt.j2
      user_prompt.j2
      versions/
        log.json
        base - <timestamp>/
          system_prompt.j2
          user_prompt.j2
          summary.json
        <other versions>/
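
If you want to check a folder against this layout yourself, a small helper like the one below can list what is missing. It is a sketch under the assumption that the five entries shown above are the required ones; the helper is hypothetical and not an ev command.

```python
from pathlib import Path

# Entries every eval folder contains according to the layout above.
EXPECTED = ("cases", "eval.md", "schema.py", "system_prompt.j2", "user_prompt.j2")


def missing_entries(eval_dir: Path) -> list[str]:
    """Return the expected files/folders that are absent from an eval folder."""
    return [name for name in EXPECTED if not (eval_dir / name).exists()]


# Prints an empty list when the folder is complete.
print(missing_entries(Path("evals") / "myAgent"))
```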

Creating and setting up a test

1) Scaffold a new eval

ev create myAgent

This will:

Create evals/myAgent
Add cases/example.json
Add a blank eval.md
Add a minimal schema.py
Add basic system_prompt.j2 and user_prompt.j2

All of these files are created under evals/myAgent/.

2) Define your response schema

Open evals/myAgent/schema.py and define the expected model. For example:

from pydantic import BaseModel

class Response(BaseModel):
    risk_class: str
    recommendation: str
    explanation: str

This schema defines the structured response the model is expected to return for each case.

3) Define your eval criteria

Edit evals/myAgent/eval.md and declare your criteria:

# classification
The classification should be one of ["low", "medium", "high"] and should match the scenario.

# use_of_data
The answer should use the provided input fields and not ignore key details.

# explanation
The explanation should be honest, clear, and concise.

Each # heading becomes a separate criterion that the eval agent scores.
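
The heading-to-criterion mapping can be pictured with a small parser. This is an illustrative sketch, not ev's implementation: parse_criteria is a hypothetical name, and it assumes only top-level "# " headings start a criterion.

```python
EVAL_MD = """
# classification
The classification should be one of ["low", "medium", "high"] and should match the scenario.

# use_of_data
The answer should use the provided input fields and not ignore key details.

# explanation
The explanation should be honest, clear, and concise.
"""


def parse_criteria(markdown: str) -> dict[str, str]:
    """Map each '# heading' to the text that follows it."""
    criteria: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            criteria[current] = []
        elif current is not None:
            criteria[current].append(line)
    return {name: "\n".join(body).strip() for name, body in criteria.items()}


print(list(parse_criteria(EVAL_MD)))
# ['classification', 'use_of_data', 'explanation']
```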

4) Add cases

Add JSON files under evals/myAgent/cases/. One file per test case.

evals/myAgent/cases/case1.json:

{
  "business_name": "Acme Widgets",
  "sector": "Manufacturing",
  "revenue": 5000000
}

evals/myAgent/cases/case2.json:

{
  "business_name": "Beta Health",
  "sector": "Healthcare",
  "revenue": 12000000
}
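
Because each case must be valid JSON, it can help to load the whole folder once before a run and catch syntax errors early. The helper below is a hypothetical pre-flight check, not part of ev.

```python
import json
from pathlib import Path


def load_cases(cases_dir: Path) -> dict[str, dict]:
    """Load every *.json file in cases_dir, keyed by file stem, in name order."""
    cases = {}
    for path in sorted(cases_dir.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            # json.load raises json.JSONDecodeError if the file is not valid JSON.
            cases[path.stem] = json.load(fh)
    return cases
```

For example, load_cases(Path("evals/myAgent/cases")) would return one dictionary per case file, which is the data the templates see as data.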

5) Refine your prompts

Edit:

evals/myAgent/system_prompt.j2
evals/myAgent/user_prompt.j2

You can access test case JSON fields via {{ data.<field> }}.

Example user_prompt.j2:

A business owner is applying for a loan.

Business name: {{ data.business_name }}
Sector: {{ data.sector }}
Revenue: {{ data.revenue }}

Classify the credit risk and tell the business owner what you recommend they do next.
Respond using the JSON schema described in your system instructions.

Running evaluations

`ev run` for optimization

ev run runs the whole loop:

Evaluates the current active version across all cases
Lets an agent propose changes to the prompts
Evaluates the candidate version
Only accepts and snapshots the candidate if the pass rate is higher than the current best

Basic usage:

ev run myAgent

Common options:

# Run 3 optimization iterations, single cycle per case
ev run myAgent -i 3

# Run 5 iterations, 2 cycles per case
ev run myAgent -i 5 -c 2

# Use a specific shared model for both generation and eval
ev run myAgent -m "groq[moonshotai/kimi-k2-instruct]"

# Different models for generation and eval
ev run myAgent \
  --gen-model "groq[moonshotai/kimi-k2-instruct]" \
  --eval-model "openai[gpt-5]"

New versions are only generated if the run beats the active version.

`ev run` Flags

A simple list of all flags supported by ev run:

-i, --iterations

Number of self-improvement loops to run.
Each iteration proposes improved prompts and accepts them only if pass rate increases.

-c, --cycles

Number of evaluation cycles per case.
Scores are averaged across cycles to reduce randomness.

-m, --model

Sets a single model for both generation and evaluation.

--gen-model

Overrides only the generation model.
Takes precedence over --model.

--eval-model

Overrides only the evaluation model.
Takes precedence over --model.

-k, --key

Where to load API keys from.
file (default, loads from .env) or env (loads from environment variables).
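
The precedence between --model, --gen-model, and --eval-model described above can be summarized in a few lines. This is a sketch of the documented rule only; resolve_models is a hypothetical name, and DEFAULT_MODEL is a placeholder because the tool's real default is not stated here.

```python
from typing import Optional

# Placeholder only: the default model used by ev is an assumption.
DEFAULT_MODEL = "openai[gpt-5]"


def resolve_models(
    model: Optional[str] = None,
    gen_model: Optional[str] = None,
    eval_model: Optional[str] = None,
) -> tuple[str, str]:
    """Return (generation_model, evaluation_model) following the documented precedence."""
    base = model or DEFAULT_MODEL
    # The specific flags win over the shared --model value.
    return (gen_model or base, eval_model or base)


print(resolve_models(model="openai[gpt-5-mini]", eval_model="openai[gpt-5]"))
# ('openai[gpt-5-mini]', 'openai[gpt-5]')
```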

`ev eval` for evaluation only

ev eval runs the test suite against the current active version without changing any prompts or creating new versions.

ev eval myAgent

With options:

# Multiple cycles for stability checking
ev eval myAgent -c 3

# Custom model overrides
ev eval myAgent -m "groq[moonshotai/kimi-k2-instruct]"

`ev eval` flags

--eval-model

Overrides only the evaluation model.
Takes precedence over --model.

-k, --key

Where to load API keys from.
file (default, loads from .env) or env (loads from environment variables).

Understanding the active version

Each test has one active version: the best-performing prompt pair so far.

A new version is created only if a candidate from the current ev run achieves a higher pass rate than the active version.
If no candidate beats it, no new version is saved.

Only one new version can be created per ev run (the best candidate of that run).
This keeps history clean and ensures every version is a strict improvement.
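
The gating rule can be sketched as a single comparison. The names below are hypothetical and the strict greater-than check is an assumption based on "higher pass rate"; ev may apply its own threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    name: str
    pass_rate: float


def select_new_version(active: Version, candidates: list[Version]) -> Optional[Version]:
    """Return the best candidate if it strictly beats the active version, else None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda v: v.pass_rate)
    # At most one version per run: only the single best candidate is considered.
    return best if best.pass_rate > active.pass_rate else None


active = Version("base", 0.83)
winner = select_new_version(active, [Version("iter1", 0.80), Version("iter2", 0.95)])
print(winner.name if winner else "no new version")  # iter2
```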

Understanding the outputs

Summary table (console)

At the end of an eval, you will see something like:

=== SUMMARY TABLE ===
Version: base - 18 Nov 2025 14-22-10
Pass rate: 96.0 percent
Cycles: 1

Case                 | Criteria            | Score     
-------------------- | ------------------- | ----------
1                    | classification      | 100 percent  
                     | use_of_data         | 67 percent
                     | explanation         | 100 percent  
-------------------- | ------------------- | ----------
2                    | classification      | 100 percent  
                     | use_of_data         | 100 percent  
                     | explanation         | 100 percent  
-------------------- | ------------------- | ----------

Notes:

Pass rate is the average across criteria, not just the number of fully passing cases.
Score is per criterion, expressed in percent.
Each score is averaged across cycles when --cycles > 1.

`summary.json`

For each version, summary.json is written under:

evals/<test>/versions/<version-id>/summary.json

It contains:

version - the version identifier
total_cases
passed_cases - cases where all criteria passed
pass_rate - overall criteria-based pass rate
cycles - number of cycles used in this run
cases - per case metrics

You can use this file for dashboards or CI integration.
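
As one possible CI integration, a script can read pass_rate from summary.json and fail the build when it drops below a threshold. This is a sketch: check_summary is a hypothetical helper, and it assumes pass_rate is stored as a fraction between 0 and 1, as in the log.json example, which you should confirm against your own files.

```python
import json
from pathlib import Path


def check_summary(summary_path: Path, threshold: float) -> bool:
    """Return True when the recorded pass_rate meets the threshold."""
    summary = json.loads(summary_path.read_text(encoding="utf-8"))
    # Assumes pass_rate and threshold use the same unit (fraction of 1).
    return summary["pass_rate"] >= threshold
```

In a CI job you could call sys.exit(0 if check_summary(path, 0.9) else 1) so that a regression fails the pipeline.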

`versions/log.json`

evals/<test>/versions/log.json tracks versions:

[
  {
    "version": "base - 18 Nov 2025 14-22-10",
    "pass_rate": 0.83,
    "is_active": false,
    "date": "2025-11-18T14:22:10.123456",
    "cycles": 1
  },
  {
    "version": "abcd1234 - 18 Nov 2025 15-01-42",
    "pass_rate": 0.95,
    "is_active": true,
    "date": "2025-11-18T15:01:42.789012",
    "cycles": 1
  }
]

The is_active flag marks which version will be used when you run ev run or ev eval.
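
Since log.json is plain JSON, the active version can also be looked up from a script. The helper below is a hypothetical sketch that relies only on the fields shown in the example above.

```python
import json
from pathlib import Path
from typing import Optional


def active_version(log_path: Path) -> Optional[dict]:
    """Return the log entry whose is_active flag is true, or None if there is none."""
    entries = json.loads(log_path.read_text(encoding="utf-8"))
    return next((entry for entry in entries if entry.get("is_active")), None)
```

For the example log above, active_version(Path("evals/myAgent/versions/log.json")) would return the entry for "abcd1234 - 18 Nov 2025 15-01-42".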

Other CLI commands

`ev list` - list tests

Lists tests under evals:

ev list

Example output:

› Available tests
  myAgent
  creditRisk_v2
  onboarding_bot

`ev copy` - copy a test

Duplicates an existing test folder:

ev copy myAgent

This creates evals/myAgent_copy.

`ev delete` - delete a test

Deletes a test and everything inside it:

ev delete myAgent

You can add -y to skip confirmation:

ev delete myAgent -y

Use with care.

`ev version` - show active version

Displays the active version for a test:

ev version myAgent

Output:

› Fetching active version for 'myAgent'
  path: <repo>/evals/myAgent
✓ Active version: abcd1234 - 18 Nov 2025 15-01-42

Models and cycles

### Models

You can control which LLMs are used for generation and evaluation.

* `-m, --model` sets both generation and eval model.
* `--gen-model` overrides only the generation model.
* `--eval-model` overrides only the eval model.

The format is:

```text
provider[identifier]
```

Examples:

ev run myAgent -m "openai[gpt-5]"
ev run myAgent --gen-model "groq[moonshotai/kimi-k2-instruct]" --eval-model "openai[gpt-5]"

Resolution is handled by the resolve_model_config helper.
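
To illustrate the provider[identifier] notation, here is a minimal parser. It is a stand-in written for this example, not the resolve_model_config helper itself, whose behavior is not shown here; parse_model_spec and the regular expression are assumptions.

```python
import re

# provider, then the identifier wrapped in a single pair of square brackets.
MODEL_SPEC = re.compile(r"^(?P<provider>[^\[\]]+)\[(?P<identifier>[^\[\]]+)\]$")


def parse_model_spec(spec: str) -> tuple[str, str]:
    """Split 'provider[identifier]' into (provider, identifier)."""
    match = MODEL_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"expected provider[identifier], got {spec!r}")
    return match["provider"], match["identifier"]


print(parse_model_spec("groq[moonshotai/kimi-k2-instruct]"))
# ('groq', 'moonshotai/kimi-k2-instruct')
```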

Supported models

| Provider | Model identifier |
| --- | --- |
| openai | gpt-5 |
| openai | gpt-5-mini |
| openai | gpt-5-nano |
| groq | openai/gpt-oss-120b |
| groq | qwen/qwen3-32b |
| groq | moonshotai/kimi-k2-instruct |

Cycles

--cycles or -c repeats the eval multiple times per case to check stability.

cycles = 1 (default) - single pass
cycles = N - each criterion score is averaged across N runs

Example:

ev eval myAgent -c 3

If a criterion is flaky, its score will come out below 100 percent.
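
A small sketch of how averaging across cycles surfaces flakiness. It assumes each cycle yields a simple pass or fail for a criterion, which is an assumption about ev's scoring made for illustration; criterion_score is a hypothetical name.

```python
def criterion_score(results: list[bool]) -> float:
    """Percent of cycles in which the criterion passed."""
    return 100 * sum(results) / len(results)


# A criterion that passes in 2 of 3 cycles shows up as roughly 67 percent.
print(round(criterion_score([True, False, True])))  # 67
```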

Release history

0.1.3.3 (this version): Nov 19, 2025
0.1.2: Nov 19, 2025
