Lightweight aspect-based evaluation framework with YAML check definitions.
Eval Banana
Aspect-based evaluation framework - deterministic checks + LLM judges. Score anything (agentic outputs, workflows, banana!) with simple YAML check definitions.
What it does
Eval Banana discovers YAML check definitions from eval_checks/ directories, runs them, and produces a report. Every check scores 0 or 1 with equal weight.
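As a rough illustration of equal-weight binary scoring, the sketch below computes a pass ratio from a list of 0/1 scores and compares it to a threshold. The function names, the `>=` comparison, and the handling of an empty run are assumptions made for illustration, not eval-banana's actual implementation.

```python
def pass_ratio(scores: list[int]) -> float:
    """Fraction of checks that scored 1; every check has equal weight."""
    if not scores:
        return 0.0  # assumption: an empty run counts as a ratio of 0
    if any(s not in (0, 1) for s in scores):
        raise ValueError("each check must score 0 or 1")
    return sum(scores) / len(scores)


def run_passes(scores: list[int], pass_threshold: float = 1.0) -> bool:
    """Assumed rule: the run passes when the ratio meets the threshold."""
    return pass_ratio(scores) >= pass_threshold
```

With a threshold of 1.0, a single failing check fails the whole run; lowering the threshold tolerates a proportion of failures.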
Eval Banana can also drive an AI coding agent (Claude Code, Codex CLI, Gemini CLI, etc.) as a harness before running checks. The harness executes a task prompt, then eval-banana scores the resulting workspace. For llm_judge checks — which evaluate qualitative aspects of generated outputs — this harness → judge pairing is the typical end-to-end flow.
Two check types:
| Type | Purpose | How it works |
|---|---|---|
| deterministic | File existence, content assertions, data validation | Runs a Python script via subprocess; exit 0 = pass |
| llm_judge | Qualitative evaluation (coherence, accuracy, tone) | Sends target files + instructions to an LLM; expects {"score": 0} or {"score": 1} |
Quick start
# Install
uv sync
# Initialize project config and example check
eval-banana init
# Run all discovered checks
eval-banana run
# List discovered checks without running
eval-banana list
# Validate YAML definitions without running
eval-banana validate
Installation
# Using uv (recommended)
uv add eval-banana
# Using pip
pip install eval-banana
# From source (development)
git clone https://github.com/writeitai/eval-banana.git
cd eval-banana
uv sync --extra dev
After installation, two CLI commands are available: eval-banana and eb (short alias).
Writing checks
Create a directory called eval_checks/ anywhere in your project. Add YAML files -- one per check.
Deterministic check
schema_version: 1
id: output_file_exists
type: deterministic
description: Verify that output.json was generated.
target_paths:
  - output.json
script: |
  import json, sys
  from pathlib import Path
  ctx = json.loads(Path(sys.argv[1]).read_text())
  target = ctx["targets"][0]
  assert target["exists"], f"{target['path']} not found"
The script receives a context.json path as sys.argv[1] with this shape:
{
  "check_id": "output_file_exists",
  "description": "...",
  "project_root": "/abs/path",
  "targets": [
    {"path": "output.json", "resolved_path": "/abs/path/output.json", "exists": true, "is_dir": false}
  ]
}
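Building on the context shape above, a deterministic check can go beyond existence and inspect file contents through `resolved_path`. The sketch below is one possible check body; the function name `check_targets` is hypothetical, and the rule that every target must parse as JSON is only an example assertion.

```python
import json
from pathlib import Path


def check_targets(context_path: str) -> None:
    """Raise unless every target exists, is a file, and parses as JSON."""
    ctx = json.loads(Path(context_path).read_text())
    for target in ctx["targets"]:
        assert target["exists"], f"{target['path']} not found"
        assert not target["is_dir"], f"{target['path']} is a directory"
        # resolved_path is the absolute location supplied in context.json;
        # json.loads raises an error if the file is not valid JSON
        json.loads(Path(target["resolved_path"]).read_text())
```

Inside a check's `script` field this would be called as `check_targets(sys.argv[1])`; any uncaught exception makes the subprocess exit non-zero, which counts as a failed check.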
LLM judge check
schema_version: 1
id: summary_is_accurate
type: llm_judge
description: The generated summary accurately reflects source data.
target_paths:
  - summary.txt
  - source_data.json
instructions: |
  Compare the summary against the source data.
  Score 1 if accurate, 0 if it contains fabricated claims.
Requires an API key: set OPENROUTER_API_KEY, or configure one in .eval-banana/config.toml.
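The judge is expected to reply with a JSON object whose `score` is 0 or 1. A minimal sketch of validating such a reply follows; `parse_judge_score` is a hypothetical helper, and how eval-banana itself handles malformed replies is not documented here.

```python
import json


def parse_judge_score(raw: str) -> int:
    """Return 0 or 1 from a reply shaped like {"score": 0} or {"score": 1}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject True/False explicitly
    if isinstance(score, bool) or not isinstance(score, int) or score not in (0, 1):
        raise ValueError(f"expected score 0 or 1, got {score!r}")
    return score
```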
Harness support
eval-banana can drive an AI coding agent before running checks. The agent receives a task prompt, works on the project, and then eval-banana scores the result.
Built-in agent templates: codex, gemini, claude, openhands, opencode, pi.
Inline prompt
eval-banana run --harness-agent codex --harness-prompt "Fix all failing tests"
Prompt from file
eval-banana run --harness-agent claude --harness-prompt-file prompts/task.md --harness-model claude-sonnet-4-6
TOML configuration
# .eval-banana/config.toml
[harness]
agent = "codex"
prompt_file = "prompts/task.md"
model = "gpt-5.4"
# reasoning_effort = "high"
Harness behavior
- The harness runs once before any checks execute.
- Install bundled skills explicitly with eb install before harness-driven work in a target project.
- If the harness fails (non-zero exit, missing binary), checks are not run and the eval run is marked as failed.
- If any llm_judge check is present, a harness must be configured. eval-banana aborts early with a configuration error otherwise.
- Harness artifacts (stdout, stderr, prompt, result) are written to <run_id>/harness/.
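The run-once and fail-fast rules above can be pictured with the following sketch. It is an illustration only: `run_with_harness`, the result dictionary, and the representation of checks as callables are all hypothetical and do not reflect eval-banana's internals.

```python
import subprocess
from typing import Callable


def run_with_harness(harness_cmd: list[str], checks: list[Callable[[], int]]) -> dict:
    """Run the harness once; only score checks if it exits with 0."""
    try:
        proc = subprocess.run(harness_cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # Missing binary: treated as a harness failure, checks are skipped
        return {"status": "failed", "reason": "missing binary", "scores": []}
    if proc.returncode != 0:
        # Non-zero exit: checks are skipped and the run is marked failed
        return {"status": "failed", "reason": f"exit {proc.returncode}", "scores": []}
    return {"status": "completed", "reason": None, "scores": [c() for c in checks]}
```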
Skills
eval-banana ships two bundled skills inside the wheel package:
src/eval_banana/skills/
  eval-banana/
  gemini_media_use/
Install them into a target project's native agent directories with:
eb install
eb install --target-agents codex
eb install --skills gemini_media_use --dry-run
eb install is the only supported way to move bundled skills out of the wheel
and into a project. eval-banana run does not install them automatically.
Supported target agents and their destination directories:
| Agent | Destination |
|---|---|
| claude | .claude/skills/ |
| codex | .codex/skills/ |
| openhands | .agents/skills/ |
| opencode | .agents/skills/ |
| gemini | .gemini/skills/ |
The legacy eval-banana distribute-skills command was deprecated in 0.2.x and
will be removed no earlier than 0.3.0. Use eb install instead.
If a project has custom skills, place them directly in the agent-native
directories above. eval-banana no longer copies custom repo-local skills/
directories at runtime.
The bundled gemini_media_use helper scripts depend on the optional google-genai package. They authenticate via GEMINI_API_KEY, then GOOGLE_API_KEY, then Application Default Credentials with GOOGLE_CLOUD_PROJECT (Vertex AI mode -- requires gcloud auth application-default login, not just gcloud auth login). The scripts print targeted setup instructions when auth is misconfigured, distinguishing between missing ADC, missing project, and nothing configured at all.
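The credential order described above can be sketched as a small resolver. This is not the bundled scripts' code: `pick_auth_mode`, its return labels, and the `adc_available` flag (standing in for a real Application Default Credentials lookup) are hypothetical.

```python
from typing import Mapping


def pick_auth_mode(env: Mapping[str, str], adc_available: bool) -> str:
    """Choose an auth mode in the documented order, or explain what is missing."""
    if env.get("GEMINI_API_KEY"):
        return "gemini_api_key"
    if env.get("GOOGLE_API_KEY"):
        return "google_api_key"
    has_project = bool(env.get("GOOGLE_CLOUD_PROJECT"))
    if adc_available and has_project:
        return "vertex_adc"
    if has_project:
        raise RuntimeError("missing ADC: run `gcloud auth application-default login`")
    if adc_available:
        raise RuntimeError("missing project: set GOOGLE_CLOUD_PROJECT")
    raise RuntimeError("nothing configured: set GEMINI_API_KEY or GOOGLE_API_KEY")
```

Passing the environment as a mapping keeps the function easy to exercise without touching `os.environ`.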
Generated skill directories such as .claude/skills/, .codex/skills/, .agents/skills/, and .gemini/skills/ should usually be added to .gitignore and treated as install artifacts.
Custom agent templates
Add [agents.<name>] sections to override built-in templates or define new ones:
[agents.myagent]
command = ["my-cli", "run"]
shared_flags = ["--headless"]
prompt_flag = "--prompt"
model_flag = "--model"
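To make the template fields concrete, the sketch below shows one plausible way they could be combined into a command line. The ordering of arguments and the helper name `build_agent_argv` are assumptions; the actual assembly logic in eval-banana may differ.

```python
from typing import Optional


def build_agent_argv(template: dict, prompt: str, model: Optional[str] = None) -> list:
    """Assumed order: command, shared flags, optional model flag, prompt flag."""
    argv = list(template["command"]) + list(template.get("shared_flags", []))
    if model is not None and template.get("model_flag"):
        argv += [template["model_flag"], model]
    argv += [template["prompt_flag"], prompt]
    return argv


# Mirrors the [agents.myagent] example above
myagent = {
    "command": ["my-cli", "run"],
    "shared_flags": ["--headless"],
    "prompt_flag": "--prompt",
    "model_flag": "--model",
}
```

Keeping the command as a list of arguments rather than a single string avoids shell-quoting problems when the prompt contains spaces or quotes.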
Configuration
eval-banana uses a single project-level TOML config at .eval-banana/config.toml.
Create it with eval-banana init.
Config precedence (highest to lowest)
- CLI arguments (--model, --provider, etc.)
- Environment variables (EVAL_BANANA_*)
- OPENROUTER_API_KEY / OPENAI_API_KEY (provider-aware)
- Project config (.eval-banana/config.toml)
- Built-in defaults
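Layered precedence like this is commonly modelled as a chain of mappings searched from highest to lowest priority. The sketch below uses `collections.ChainMap` for that; `resolve_settings` is a hypothetical helper, the provider-aware API key layer is left out for brevity, and environment values are kept as raw strings.

```python
from collections import ChainMap

ENV_PREFIX = "EVAL_BANANA_"


def resolve_settings(cli: dict, env: dict, project: dict, defaults: dict) -> dict:
    """Merge layers so that earlier layers win over later ones."""
    # Unset CLI options (None) must not shadow lower layers
    cli_layer = {k: v for k, v in cli.items() if v is not None}
    # EVAL_BANANA_MODEL -> "model", EVAL_BANANA_OUTPUT_DIR -> "output_dir"
    env_layer = {
        k[len(ENV_PREFIX):].lower(): v
        for k, v in env.items()
        if k.startswith(ENV_PREFIX)
    }
    return dict(ChainMap(cli_layer, env_layer, project, defaults))
```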
Key settings
| Setting | Default | Env var |
|---|---|---|
| output_dir | .eval-banana/results | EVAL_BANANA_OUTPUT_DIR |
| pass_threshold | 1.0 | EVAL_BANANA_PASS_THRESHOLD |
| provider | openai_compat | EVAL_BANANA_PROVIDER |
| model | openai/gpt-5.4 | EVAL_BANANA_MODEL |
| api_base | https://openrouter.ai/api/v1 | EVAL_BANANA_API_BASE |
LLM provider setup
OpenRouter (default):
export OPENROUTER_API_KEY=your-key
OpenAI direct:
export EVAL_BANANA_API_BASE=https://api.openai.com/v1
export OPENAI_API_KEY=your-key
Codex (local ChatGPT subscription):
# Run `codex login` first, then:
eval-banana run --provider codex
CLI reference
eval-banana init [--force]                Create config + example check
eval-banana run [OPTIONS]                 Run all discovered checks
eval-banana list [OPTIONS]                List discovered checks
eval-banana validate [OPTIONS]            Validate YAML without running
eval-banana install [OPTIONS]             Install bundled skills into agent dirs
eval-banana distribute-skills [OPTIONS]   Deprecated alias for install
Options for run/list/validate:
  --check-dir PATH          Scan only this directory
  --check-id TEXT           Run only this check ID
  --output-dir TEXT         Override output directory
  --provider TEXT           LLM provider (openai_compat or codex)
  --model TEXT              LLM model name
  --pass-threshold FLOAT    Minimum pass ratio (0.0-1.0)
  --verbose                 Enable debug logging
  --cwd TEXT                Working directory
Harness options (run only):
  --harness-agent TEXT               Agent CLI to run before checks
  --harness-prompt TEXT              Task prompt for the agent
  --harness-prompt-file PATH         File containing the task prompt
  --harness-model TEXT               Model override for the agent
  --harness-reasoning-effort TEXT    Reasoning effort level
Output
Each run creates a timestamped directory under the configured output_dir:
.eval-banana/results/<run_id>/
  report.json                # Machine-readable full report
  report.md                  # Human-readable Markdown report
  harness/                   # Only when a harness was executed
    prompt.txt               # Resolved prompt sent to the agent
    stdout.txt               # Agent stdout
    stderr.txt               # Agent stderr
    result.json              # Harness result metadata
  checks/
    <check_id>.json          # Per-check result
    <check_id>.stdout.txt    # Captured stdout (if any)
    <check_id>.stderr.txt    # Captured stderr (if any)
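Given that layout, per-check result files can be located without knowing anything about their contents. The sketch below relies only on the directory structure shown above; `list_check_results` is a hypothetical helper, and the schema of the JSON files is deliberately not assumed.

```python
from pathlib import Path


def list_check_results(run_dir: str) -> dict:
    """Map each check ID to its <check_id>.json file under <run_dir>/checks/."""
    checks_dir = Path(run_dir) / "checks"
    # "*.json" skips the .stdout.txt / .stderr.txt capture files
    return {p.stem: p for p in sorted(checks_dir.glob("*.json"))}
```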
Development
uv sync --extra dev
make test # Run tests
make fix # Auto-fix lint + format
make pyright # Type check
make all-check # Lint + format + types + tests (matches CI)
make install-skills # Install bundled skills into the current project
Inspiration
eval-banana's binary 0/1 scoring philosophy draws directly on two earlier bodies of work:
- Hamel Husain's Creating LLM-as-a-Judge that drives business results — argues that binary pass/fail judgments produce more reliable, actionable evals than Likert-style 1-5 scales.
- RAGAS's Aspect Critic metric — evaluates outputs against a natural-language aspect definition and returns a binary verdict.
The llm_judge check type is essentially an Aspect Critic: you describe what "good" looks like in plain language, and the judge returns {"score": 0|1}.
Contributing
Issues and pull requests are welcome. Please run make all-check before opening a PR.
Changelog
See CHANGELOG.md for release notes.
License
Apache License 2.0 — see LICENSE for details.
Copyright 2026 WriteIt.ai s.r.o.