Prompt eval CLI to stress test agents and generate robust prompts.
ev
ev is an agent evaluation and prompt refinement tool designed to stress-test AI agents and make prompts more robust.
It does three main things:
- Runs a suite of JSON test cases against a prompt pair (system_prompt.j2 + user_prompt.j2)
- Evaluates results against explicit criteria defined in eval.md
- Iteratively improves the prompts, only accepting new versions that perform better
Everything is plain files. No external services beyond the LLM APIs you already use.
Key Features
- Multi-criteria evals: Test prompts against any number of criteria defined in eval.md.
- Deterministic scoring: Cases × cycles ensure stable, noise-resistant pass rates.
- Iterative refinement: Automatically proposes and tests improved prompt versions.
- Version gating: Only snapshots a new version when it clearly outperforms the current one.
- File-native: Everything is plain text and folders; no databases, no external infra.
- Model-flexible: Use any provider/model via simple provider[name] notation.
Table of contents
- Core concepts
- Installation and requirements
- Configuration and API keys
- Project layout
- Creating and setting up a test
- Running evaluations
- Understanding the outputs
  - summary.json
  - versions/ and log.json
- Other CLI commands
  - ev list
  - ev copy
  - ev delete
  - ev version
- Models and cycles
Core concepts
- Eval
  A test is a folder under evals (for example evals/myAgent). It contains:
  - JSON cases in cases/
  - Criteria definitions in eval.md
  - A Pydantic schema in schema.py
  - Prompt templates in system_prompt.j2 and user_prompt.j2
- Case
  A single JSON input file under cases/, e.g. the data you want to test. One eval should have many cases.
- Eval criteria
  Each # heading in eval.md defines one criterion.
  You can have many criteria, and each is judged independently.
- Cycles
  A cycle means evaluating all cases once.
  More cycles reduce randomness and stabilize the score.
  Total evaluations per iteration: cases × cycles
- Iterations
  Each iteration:
  - Evaluate the current prompts
  - Generate improved prompts
  - Re-evaluate the candidate
  - Compare pass rates
  Total model calls per run: cases × cycles × iterations
  Think of it as a contest: each iteration tries to produce a better prompt.
- Pass rate
  Criteria scores are averaged across cases, then averaged across criteria.
  This avoids one noisy criterion dominating the result.
- Versions
  A new version is created only if the best candidate from the run beats the active version.
  One new version max per ev run.
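The pass-rate rule described above (average each criterion across cases, then average the criterion means) can be sketched as follows. This is a minimal illustration, not ev's actual implementation: the function name pass_rate and the input shape (one dict of criterion scores per case, each score between 0 and 1) are assumptions made for the example.

```python
from statistics import mean


def pass_rate(case_scores: list[dict[str, float]]) -> float:
    """Average each criterion across cases, then average the criterion means.

    `case_scores` holds one dict per case, mapping criterion name -> score
    in [0, 1]. This shape is hypothetical; ev's internals may differ.
    """
    criteria = sorted({name for scores in case_scores for name in scores})
    per_criterion = [
        mean(scores[name] for scores in case_scores if name in scores)
        for name in criteria
    ]
    return mean(per_criterion)


scores = [
    {"classification": 1.0, "use_of_data": 0.5, "explanation": 1.0},
    {"classification": 1.0, "use_of_data": 1.0, "explanation": 1.0},
]
print(round(pass_rate(scores), 4))  # 0.9167
```

Because every criterion contributes one mean to the final average, a single weak criterion lowers the result by at most its own share.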
Installation and requirements
pip install evx
or with uv:
uv tool install evx
Requires Python >=3.12
Verify installation
ev --help
Configuration and API keys
The CLI reads configuration from:
- .env file by default
- Or environment variables if you request it
.env based config
By default, keys are loaded from .env (must be in the repo root).
Currently supported .env vars:
OPENAI_API_KEY=sk-...
GROQ_API_KEY=gk-...
Key source flag
You can control where keys are loaded from using the --key flag:
- --key file or -k file (default) loads from .env
- --key env or -k env loads from os.environ
Examples:
# use keys from .env
ev create myAgent
ev run myAgent -i 3 --cycles 2 --key file
# use keys from environment variables (Windows cmd syntax shown; on macOS/Linux use export instead of set)
set OPENAI_API_KEY=sk-...
set GROQ_API_KEY=gk-...
ev run myAgent -i 3 --cycles 2 --key env
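To make the two key sources concrete, here is a small sketch of how a loader could honor the file/env choice. It is not ev's actual loader: load_keys, its parameters, and the simple NAME=value parsing are all assumptions for illustration, using only the standard library.

```python
import os
from pathlib import Path

# The two variables the README lists as supported.
KEY_NAMES = ("OPENAI_API_KEY", "GROQ_API_KEY")


def load_keys(source: str = "file", env_path: str = ".env") -> dict[str, str]:
    """Return supported API keys from a .env file or from os.environ.

    Hypothetical helper mirroring the documented `--key file` / `--key env`
    choice; ev's real parsing rules may differ.
    """
    if source == "env":
        return {k: os.environ[k] for k in KEY_NAMES if k in os.environ}
    if source != "file":
        raise ValueError(f"unknown key source: {source!r}")

    keys: dict[str, str] = {}
    path = Path(env_path)
    if not path.is_file():
        return keys
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not NAME=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() in KEY_NAMES:
            keys[name.strip()] = value.strip().strip('"').strip("'")
    return keys
```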
Project layout
At the top level, the tool expects an evals directory.
<repo-root>/
  evals/
    myAgent/
      cases/
        example.json
      eval.md
      schema.py
      system_prompt.j2
      user_prompt.j2
      versions/
        base - <timestamp>/
          system_prompt.j2
          user_prompt.j2
          summary.json
        <other versions>/
        log.json
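If you want to sanity-check an eval folder against the layout above before running anything, a short script along these lines may help. The helper name missing_entries is hypothetical and is not part of the ev CLI; it only checks the entries named in the layout.

```python
from pathlib import Path

# Entries every eval folder is expected to contain, per the layout above.
EXPECTED = ("cases", "eval.md", "schema.py", "system_prompt.j2", "user_prompt.j2")


def missing_entries(eval_dir: str) -> list[str]:
    """Return the expected files/folders that are absent from `eval_dir`.

    Hypothetical helper for illustration; not provided by ev itself.
    """
    root = Path(eval_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty list means the folder has everything the layout names, for example missing_entries("evals/myAgent").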
Creating and setting up a test
1) Scaffold a new eval
ev create myAgent
This will:
- Create evals/myAgent
- Add cases/example.json
- Add a blank eval.md
- Add a minimal schema.py
- Add basic system_prompt.j2 and user_prompt.j2
2) Define your response schema
Open evals/myAgent/schema.py and define the expected model. For example:
from pydantic import BaseModel

class Response(BaseModel):
    risk_class: str
    recommendation: str
    explanation: str
This schema defines the structure of the response generated for each case during an eval.
3) Define your eval criteria
Edit evals/myAgent/eval.md and declare your criteria:
# classification
The classification should be one of ["low", "medium", "high"] and should match the scenario.
# use_of_data
The answer should use the provided input fields and not ignore key details.
# explanation
The explanation should be honest, clear, and concise.
Each # heading becomes a separate criterion that the eval agent scores.
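As an illustration of the heading-per-criterion convention, the sketch below splits an eval.md string into named criteria. The function parse_criteria is hypothetical, and ev's own parser may treat edge cases (nested headings, blank sections) differently.

```python
def parse_criteria(markdown: str) -> dict[str, str]:
    """Map each top-level `# heading` to the text that follows it.

    Hypothetical sketch of the eval.md convention; not ev's actual parser.
    """
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


eval_md = """# classification
The classification should be one of ["low", "medium", "high"] and should match the scenario.
# use_of_data
The answer should use the provided input fields and not ignore key details.
# explanation
The explanation should be honest, clear, and concise.
"""
print(list(parse_criteria(eval_md)))  # ['classification', 'use_of_data', 'explanation']
```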
4) Add cases
Add JSON files under evals/myAgent/cases/. One file per test case.
// evals/myAgent/cases/case1.json
{
  "business_name": "Acme Widgets",
  "sector": "Manufacturing",
  "revenue": 5000000
}
// evals/myAgent/cases/case2.json
{
  "business_name": "Beta Health",
  "sector": "Healthcare",
  "revenue": 12000000
}
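For reference, reading one-file-per-case folders like this takes only a few lines of standard-library Python. The helper load_cases below is a hypothetical sketch, not ev's loader; it keys each case by its file name without the .json extension.

```python
import json
from pathlib import Path


def load_cases(cases_dir: str) -> dict[str, dict]:
    """Load every *.json file in `cases_dir`, keyed by file stem.

    Hypothetical helper for illustration; ev's own loading may differ.
    """
    cases: dict[str, dict] = {}
    for path in sorted(Path(cases_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            cases[path.stem] = json.load(f)
    return cases
```

Note that the path lines shown with the sample cases are labels only. Standard JSON has no comment syntax, so the files themselves should contain just the object.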
5) Refine your prompts
Edit:
- evals/myAgent/system_prompt.j2
- evals/myAgent/user_prompt.j2
You can access test case JSON fields via {{ data.<field> }}.
Example user_prompt.j2:
A business owner is applying for a loan.
Business name: {{ data.business_name }}
Sector: {{ data.sector }}
Revenue: {{ data.revenue }}
Classify the credit risk and tell the business owner what you recommend they do next.
Respond using the JSON schema described in your system instructions.
Running evaluations
ev run for optimization
ev run runs the whole loop:
- Evaluates the current active version across all cases
- Lets an agent propose changes to the prompts
- Evaluates the candidate version
- Only accepts and snapshots the candidate if the pass rate is higher than the current best
Basic usage:
ev run myAgent
Common options:
# Run 3 optimization iterations, single cycle per case
ev run myAgent -i 3
# Run 5 iterations, 2 cycles per case
ev run myAgent -i 5 -c 2
# Use a specific shared model for both generation and eval
ev run myAgent -m "groq[moonshotai/kimi-k2-instruct]"
# Different models for generation and eval
ev run myAgent \
--gen-model "groq[moonshotai/kimi-k2-instruct]" \
--eval-model "openai[gpt-5]"
New versions are only generated if the run beats the active version.
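The accept-only-if-better loop can be sketched in a few lines. This is a simplified model under stated assumptions, not ev's code: optimize, evaluate, and propose are hypothetical names, prompts are represented as a plain dict, and each candidate is proposed from the best prompts seen so far in the run.

```python
from typing import Callable

Prompts = dict[str, str]


def optimize(
    active: Prompts,
    evaluate: Callable[[Prompts], float],
    propose: Callable[[Prompts, float], Prompts],
    iterations: int = 1,
) -> tuple[Prompts, float, bool]:
    """Run a gated refinement loop and report whether the active version was beaten.

    `evaluate` returns a pass rate; `propose` returns candidate prompts.
    Both are stand-ins (assumptions) for the LLM-backed steps.
    """
    baseline = evaluate(active)
    best, best_rate = active, baseline
    for _ in range(iterations):
        candidate = propose(best, best_rate)
        rate = evaluate(candidate)
        # Strict comparison: ties do not replace the current best.
        if rate > best_rate:
            best, best_rate = candidate, rate
    # At most one new version per run: the best candidate, if it improved.
    return best, best_rate, best_rate > baseline
```

The strict comparison is what makes every saved version an improvement over the one before it.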
ev run Flags
A simple list of all flags supported by ev run:
-i, --iterations
- Number of self-improvement loops to run.
- Each iteration proposes improved prompts and accepts them only if pass rate increases.
-c, --cycles
- Number of evaluation cycles per case.
- Scores are averaged across cycles to reduce randomness.
-m, --model
- Sets a single model for both generation and evaluation.
--gen-model
- Overrides only the generation model.
- Takes precedence over --model.
--eval-model
- Overrides only the evaluation model.
- Takes precedence over --model.
-k, --key
- Where to load API keys from.
- file (default, loads from .env) or env (loads from environment variables).
ev eval for evaluation only
ev eval runs the test suite against the current active version without changing any prompts or creating new versions.
ev eval myAgent
With options:
# Multiple cycles for stability checking
ev eval myAgent -c 3
# Custom model overrides
ev eval myAgent -m "groq[moonshotai/kimi-k2-instruct]"
ev eval flags
--eval-model
- Overrides only the evaluation model.
- Takes precedence over --model.
-k, --key
- Where to load API keys from.
- file (default, loads from .env) or env (loads from environment variables).
Understanding the active version
Each test has one active version: the best-performing prompt pair so far.
A new version is created only if a candidate from the current ev run achieves a higher pass rate than the active version.
If no candidate beats it, no new version is saved.
Only one new version can be created per ev run (the best candidate of that run).
This keeps history clean and ensures every version is a strict improvement.
Understanding the outputs
Summary table (console)
At the end of an eval, you will see something like:
=== SUMMARY TABLE ===
Version: base - 18 Nov 2025 14-22-10
Pass rate: 96.0 percent
Cycles: 1
Case | Criteria | Score
-------------------- | ------------------- | ----------
1 | classification | 100 percent
| use_of_data | 67 percent
| explanation | 100 percent
-------------------- | ------------------- | ----------
2 | classification | 100 percent
| use_of_data | 100 percent
| explanation | 100 percent
-------------------- | ------------------- | ----------
Notes:
- Pass rate is the average across criteria, not just number of fully passing cases.
- Score is per criterion, expressed in percent.
- Each score is averaged across cycles when --cycles > 1.
summary.json
For each version, summary.json is written under:
evals/<test>/versions/<version-id>/summary.json
It contains:
- version - the version identifier
- total_cases
- passed_cases - cases where all criteria passed
- pass_rate - overall criteria-based pass rate
- cycles - number of cycles used in this run
- cases - per case metrics
You can use this file for dashboards or CI integration.
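A CI gate on summary.json could look like the sketch below. It relies only on the pass_rate field listed above and assumes it is stored as a fraction between 0 and 1, as in log.json; the function name check_summary and the 0.9 threshold are illustrative choices, not part of ev.

```python
import json
from pathlib import Path


def check_summary(summary_path: str, min_pass_rate: float = 0.9) -> bool:
    """Return True when the recorded pass rate meets the threshold.

    Assumes `pass_rate` is a fraction in [0, 1]; adjust the threshold
    if your summary.json stores percentages instead.
    """
    summary = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    return summary["pass_rate"] >= min_pass_rate
```

In a pipeline, you might exit non-zero when check_summary returns False so that a regression fails the build.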
versions/log.json
evals/<test>/versions/log.json tracks versions:
[
  {
    "version": "base - 18 Nov 2025 14-22-10",
    "pass_rate": 0.83,
    "is_active": false,
    "date": "2025-11-18T14:22:10.123456",
    "cycles": 1
  },
  {
    "version": "abcd1234 - 18 Nov 2025 15-01-42",
    "pass_rate": 0.95,
    "is_active": true,
    "date": "2025-11-18T15:01:42.789012",
    "cycles": 1
  }
]
The is_active flag marks which version will be used when you run ev run or ev eval.
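If you need the active version in a script, reading log.json directly is straightforward. The helper below is a hypothetical sketch based on the log format shown above; it returns the first entry whose is_active flag is true, or None if there is none.

```python
import json
from pathlib import Path
from typing import Optional


def active_version(log_path: str) -> Optional[dict]:
    """Return the log entry marked active, or None when no entry is.

    Hypothetical helper; relies only on the fields shown in the sample log.
    """
    entries = json.loads(Path(log_path).read_text(encoding="utf-8"))
    return next((entry for entry in entries if entry.get("is_active")), None)
```

For the sample log, active_version("evals/myAgent/versions/log.json") would return the entry for "abcd1234 - 18 Nov 2025 15-01-42".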
Other CLI commands
ev list - list tests
Lists tests under evals:
ev list
Example output:
› Available tests
myAgent
creditRisk_v2
onboarding_bot
ev copy - copy a test
Duplicates an existing test folder:
ev copy myAgent
This creates evals/myAgent_copy.
ev delete - delete a test
Deletes a test and everything inside it:
ev delete myAgent
You can add -y to skip confirmation:
ev delete myAgent -y
Use with care.
ev version - show active version
Displays the active version for a test:
ev version myAgent
Output:
› Fetching active version for 'myAgent'
path: <repo>/evals/myAgent
✓ Active version: abcd1234 - 18 Nov 2025 15-01-42
Models and cycles
Models
You can control which LLMs are used for generation and evaluation.
- -m, --model sets both generation and eval model.
- --gen-model overrides only the generation model.
- --eval-model overrides only the eval model.
The format is:
provider[identifier]
Examples:
ev run myAgent -m "openai[gpt-5]"
ev run myAgent --gen-model "groq[moonshotai/kimi-k2-instruct]" --eval-model "openai[gpt-5]"
Resolution is handled by the resolve_model_config helper.
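To illustrate the provider[identifier] notation and the flag precedence, here is a small sketch. It is not the resolve_model_config helper itself, whose behavior is not shown in this README: parse_model_spec, pick_models, and the regular expression are assumptions made for the example.

```python
import re
from typing import NamedTuple, Optional

# provider, then the identifier inside one pair of square brackets.
_SPEC = re.compile(r"^(?P<provider>[A-Za-z0-9_-]+)\[(?P<identifier>[^\[\]]+)\]$")


class ModelSpec(NamedTuple):
    provider: str
    identifier: str


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'provider[identifier]' into its two parts (hypothetical helper)."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"expected provider[identifier], got {spec!r}")
    return ModelSpec(match["provider"], match["identifier"])


def pick_models(
    model: Optional[str] = None,
    gen_model: Optional[str] = None,
    eval_model: Optional[str] = None,
) -> tuple[Optional[str], Optional[str]]:
    """Return (generation, evaluation) specs; specific flags win over --model."""
    return (gen_model or model, eval_model or model)


print(parse_model_spec("groq[moonshotai/kimi-k2-instruct]"))
# ModelSpec(provider='groq', identifier='moonshotai/kimi-k2-instruct')
```

A None result from pick_models stands for "no flag given"; whatever default ev applies in that case is not covered by this sketch.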
Supported models
| Provider | Model Identifier |
|---|---|
| openai | gpt-5 |
| openai | gpt-5-mini |
| openai | gpt-5-nano |
| groq | openai/gpt-oss-120b |
| groq | qwen/qwen3-32b |
| groq | moonshotai/kimi-k2-instruct |
Cycles
--cycles or -c repeats the eval multiple times per case to check stability.
- cycles = 1 (default) - single pass
- cycles = N - each criterion score is averaged across N runs
Example:
ev eval myAgent -c 3
If a criterion is flaky, its score will come out below 100 percent.
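As a worked example of how cycles expose flakiness, the sketch below averages pass/fail outcomes for one criterion on one case. It assumes each cycle yields a simple pass or fail for the criterion; the function name criterion_score is hypothetical and ev's scoring may be more granular.

```python
def criterion_score(passes: list[bool]) -> float:
    """Percent of cycles in which a criterion passed for a single case.

    Assumes one boolean outcome per cycle (an illustrative simplification).
    """
    if not passes:
        raise ValueError("need at least one cycle")
    return 100 * sum(passes) / len(passes)


# Three cycles, one failure: the criterion passes two times out of three.
print(round(criterion_score([True, False, True])))  # 67
```

With a single cycle the same criterion would show either 0 or 100 percent, which is why more cycles give a steadier picture.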