Lightweight aspect-based evaluation framework with YAML check definitions.


Eval Banana

CI · License: Apache 2.0 · Python 3.12+

Aspect-based evaluation framework - deterministic checks + harness judges. Score anything (agentic outputs, workflows, banana!) with simple YAML check definitions.

Eval Banana logo

What it does

Eval Banana discovers YAML check definitions from eval_checks/ directories, runs them, and produces a report. Every check scores 0 or 1 with equal weight.
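Because every check contributes equally, a run's overall result reduces to a simple pass ratio compared against the configured pass threshold. A minimal sketch of that aggregation (the `run_passes` helper is illustrative, not part of the public API):

```python
def run_passes(scores: list[int], pass_threshold: float = 1.0) -> bool:
    """Each check scores 0 or 1 with equal weight; the run passes when
    the ratio of passing checks meets the configured threshold."""
    if not scores:
        return False
    ratio = sum(scores) / len(scores)
    return ratio >= pass_threshold

# With the default threshold of 1.0, a single failing check fails the run.
print(run_passes([1, 1, 1]))       # True
print(run_passes([1, 0, 1]))       # False
print(run_passes([1, 0, 1], 0.5))  # True
```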

Two check types:

| Type | Purpose | How it works |
| --- | --- | --- |
| deterministic | Objective assertions (file existence, content, structure) | Runs a Python script via subprocess; exit 0 = pass |
| harness_judge | LLM-as-a-judge (coherence, accuracy, tone) | Invokes the configured AI agent to score target files; expects {"score": 0\|1} |

The harness judge uses one of the following agent CLIs: codex, gemini, claude, openhands, opencode, or pi.

Writing checks

Create a directory called eval_checks/ anywhere in your project. Add YAML files -- one per check.

Deterministic check

schema_version: 1
id: output_file_exists
type: deterministic
description: Verify that output.json was generated.
target_paths:
  - output.json
script: |
  import json, sys
  from pathlib import Path
  ctx = json.loads(Path(sys.argv[1]).read_text())
  target = ctx["targets"][0]
  assert target["exists"], f"{target['path']} not found"
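The script receives the path to a JSON context file as its first argument. The full context schema isn't shown above; the shape below is inferred only from the fields the example script reads (a `targets` list with `path` and `exists`). A hedged simulation of the deterministic-check contract, where exit code 0 means pass:

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical context file covering only the fields the script reads;
# the real context may contain additional keys.
ctx = {"targets": [{"path": "output.json", "exists": True}]}

script = """\
import json, sys
from pathlib import Path
ctx = json.loads(Path(sys.argv[1]).read_text())
target = ctx["targets"][0]
assert target["exists"], f"{target['path']} not found"
"""

with tempfile.TemporaryDirectory() as tmp:
    ctx_path = Path(tmp) / "context.json"
    ctx_path.write_text(json.dumps(ctx))
    # Run the check script as a subprocess, passing the context path as argv[1].
    result = subprocess.run(
        [sys.executable, "-c", script, str(ctx_path)],
        capture_output=True,
    )

print(result.returncode)  # 0 = pass
```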

Harness judge check

schema_version: 1
id: summary_is_accurate
type: harness_judge
description: The generated summary accurately reflects source data.
target_paths:
  - summary.txt
  - source_data.json
instructions: |
  Compare the summary against the source data.
  Score 1 if accurate, 0 if it contains fabricated claims.

Requires a configured harness agent. Set [harness] agent in config or pass --harness-agent.
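The framework expects the judge's reply to contain {"score": 0|1}. A sketch of how such a reply might be validated; `parse_judge_score` is illustrative, not part of Eval Banana's public API:

```python
import json

def parse_judge_score(raw: str) -> int:
    """Parse a harness judge reply and enforce the {"score": 0|1} contract."""
    payload = json.loads(raw)
    score = payload["score"]
    if score not in (0, 1):
        raise ValueError(f"judge returned non-binary score: {score!r}")
    return score

print(parse_judge_score('{"score": 1}'))  # 1
print(parse_judge_score('{"score": 0}'))  # 0
```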

Quick start

# Install
uv sync

# Initialize project config
eb init

# Run all discovered checks
eb run

# List discovered checks without running
eb list

# Validate YAML definitions without running
eb validate

Installation

# Using uv (recommended)
uv add eval-banana

# Using pip
pip install eval-banana

# From source (development)
git clone https://github.com/writeitai/eval-banana.git
cd eval-banana
uv sync --extra dev

After installation the CLI is available as eb.

Harness configuration

harness_judge checks require a configured harness agent. Configure it via TOML or CLI flags.

TOML

# .eval-banana/config.toml
[harness]
agent = "codex"
model = "gpt-5.4"
# reasoning_effort = "high"

Custom agent templates

Add [agents.<name>] sections to override built-in templates or define new ones:

[agents.myagent]
command = ["my-cli", "run"]
shared_flags = ["--headless"]
prompt_flag = "--prompt"
model_flag = "--model"
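Assuming the template fields compose in the obvious order (base command, shared flags, then the model and prompt flags), the [agents.myagent] entry above might produce an invocation like the one below. The exact assembly is an assumption, not documented behavior:

```python
def build_command(command, shared_flags, prompt_flag, model_flag, model, prompt):
    """Hypothetical assembly of an agent command line from template fields."""
    argv = [*command, *shared_flags]
    if model:
        argv += [model_flag, model]
    argv += [prompt_flag, prompt]
    return argv

print(build_command(
    command=["my-cli", "run"],
    shared_flags=["--headless"],
    prompt_flag="--prompt",
    model_flag="--model",
    model="gpt-5.4",
    prompt="Score the target files.",
))
# ['my-cli', 'run', '--headless', '--model', 'gpt-5.4', '--prompt', 'Score the target files.']
```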

Skills

Install bundled skills into a target project's native agent directories:

eb install
eb install --target-agents codex
eb install --skills gemini_media_use --dry-run
| Agent | Destination |
| --- | --- |
| claude | .claude/skills/ |
| codex | .codex/skills/ |
| openhands | .agents/skills/ |
| opencode | .agents/skills/ |
| gemini | .gemini/skills/ |

See docs/configuration.md for details on bundled skills, authentication, and the deprecated distribute-skills command.

Configuration

Eval Banana uses a single project-level TOML config at .eval-banana/config.toml.

Create it with eb init.

Config precedence (highest to lowest)

  1. CLI arguments (--output-dir, --harness-model, etc.)
  2. Environment variables (EVAL_BANANA_*)
  3. Project config (.eval-banana/config.toml)
  4. Built-in defaults
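That precedence chain can be sketched as a first-defined-value lookup. The setting names and env vars are real; the resolver itself is illustrative:

```python
import os

def resolve(cli_value, env_var, config, key, default):
    """Return the first defined value: CLI > env var > project config > default."""
    if cli_value is not None:
        return cli_value
    if env_var in os.environ:
        return os.environ[env_var]
    if key in config:
        return config[key]
    return default

# As if parsed from .eval-banana/config.toml:
config = {"pass_threshold": 0.9}
print(resolve(None, "EVAL_BANANA_PASS_THRESHOLD", config, "pass_threshold", 1.0))
```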

Key settings

| Setting | Default | Env var |
| --- | --- | --- |
| output_dir | .eval-banana/results | EVAL_BANANA_OUTPUT_DIR |
| pass_threshold | 1.0 | EVAL_BANANA_PASS_THRESHOLD |
| llm_max_input_chars | 0 | EVAL_BANANA_LLM_MAX_INPUT_CHARS |
| harness.agent | unset | EVAL_BANANA_HARNESS_AGENT |
| harness.model | unset | EVAL_BANANA_HARNESS_MODEL |

CLI reference

eb init [--force]                Create project config
eb run [OPTIONS]                  Run all discovered checks
eb list [OPTIONS]                 List discovered checks
eb validate [OPTIONS]             Validate YAML without running
eb install [OPTIONS]              Install bundled skills into agent dirs

Options for run/list/validate:
  --check-dir PATH              Scan only this directory
  --check-id TEXT               Run only this check ID
  --output-dir TEXT             Override output directory
  --pass-threshold FLOAT        Minimum pass ratio (0.0-1.0)
  --verbose                     Enable debug logging
  --cwd TEXT                    Working directory

Harness options (run only):
  --harness-agent TEXT          Agent CLI used by harness_judge checks
  --harness-model TEXT          Model override for the agent
  --harness-reasoning-effort TEXT  Reasoning effort level

Output

Each run creates a timestamped directory under the configured output_dir:

.eval-banana/results/<run_id>/
  report.json       # Machine-readable full report
  report.md         # Human-readable Markdown report
  checks/
    <check_id>.json       # Per-check result
    <check_id>.stdout.txt # Captured stdout (if any)
    <check_id>.stderr.txt # Captured stderr (if any)
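The exact report.json schema isn't documented here; assuming it lists per-check results with their 0/1 scores, a post-run summary might look like this sketch (the "checks", "id", and "score" fields are assumptions):

```python
# Hypothetical report shape; in practice this dict would come from
# json.loads(Path(".eval-banana/results/<run_id>/report.json").read_text()).
report = {
    "checks": [
        {"id": "output_file_exists", "score": 1},
        {"id": "summary_is_accurate", "score": 0},
    ],
}

failed = [c["id"] for c in report["checks"] if c["score"] == 0]
ratio = sum(c["score"] for c in report["checks"]) / len(report["checks"])
print(f"pass ratio: {ratio:.2f}, failed: {failed}")
# pass ratio: 0.50, failed: ['summary_is_accurate']
```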

Development

uv sync --extra dev
make test         # Run tests
make fix          # Auto-fix lint + format
make pyright      # Type check
make all-check    # Lint + format + types + tests (matches CI)

Inspiration

Eval Banana's binary 0/1 scoring philosophy draws directly on two earlier bodies of work:

The harness_judge check type is essentially an Aspect Critic: you describe what "good" looks like in plain language, and the judge returns {"score": 0|1}.

Contributing

Issues and pull requests are welcome. Please run make all-check before opening a PR.

Changelog

See CHANGELOG.md for release notes.

License

Apache License 2.0 — see LICENSE for details.

Copyright 2026 WriteIt.ai s.r.o.



Download files

Download the file for your platform.

Source Distribution

eval_banana-0.0.8.tar.gz (2.2 MB)

Uploaded Source

Built Distribution

eval_banana-0.0.8-py3-none-any.whl (56.1 kB)

Uploaded Python 3

File details

Details for the file eval_banana-0.0.8.tar.gz.

File metadata

  • Download URL: eval_banana-0.0.8.tar.gz
  • Upload date:
  • Size: 2.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for eval_banana-0.0.8.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 269681eb3ebd856a5db18755a38ed1c6d55a1337882635fcb4c929eeed358b58 |
| MD5 | 1eb48e6d207dcea150ea8c5bb28dabeb |
| BLAKE2b-256 | 4920bd6a1e3bfb8682d95bffbaa5deca0d7ccb8e38048faf8b923764fb07b6b5 |

See more details on using hashes here.

Provenance

The following attestation bundles were made for eval_banana-0.0.8.tar.gz:

Publisher: release.yml on writeitai/eval-banana

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file eval_banana-0.0.8-py3-none-any.whl.

File metadata

  • Download URL: eval_banana-0.0.8-py3-none-any.whl
  • Upload date:
  • Size: 56.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for eval_banana-0.0.8-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 75b0920146d9b285519300f0fa0b61b6972bc380eac9c2b1f784e56a9b8662ba |
| MD5 | 4b71e9248c2ac2376623a02a87f4adf6 |
| BLAKE2b-256 | 5afbe4394c4561c1b437c0063865b28320370c67ffe8f1bec08d1b0a818718af |


Provenance

The following attestation bundles were made for eval_banana-0.0.8-py3-none-any.whl:

Publisher: release.yml on writeitai/eval-banana

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
