CLI tool that evaluates LLM outputs from production logs against a dual-dimension rubric.

Maintainers

onicarps

Project description

eval-harness

Evaluate LLM outputs from production logs against a dual-dimension rubric (faithfulness + task completion).

eval-harness ingests JSONL/CSV logs, scores each record using LLM-as-judge via OpenRouter, and produces a rich terminal report — or exports JSON/CSV for CI pipelines. Built for teams who need fast, repeatable quality checks on production LLM traffic.

Features

Dual-dimension scoring — faithfulness + task completion, combined 50/50
Multi-judge support — round-robin fallback across free OpenRouter models
Regression detection — trend command tracks score changes over time
CI/CD ready — gate command with pass/fail exit codes and threshold suggestions
Rubric templates — manage reusable rubric definitions (faithfulness, safety, accuracy, conciseness)
Response caching — judge responses cached in SQLite; skip re-evaluation of unchanged records
Calibration mode — measure inter-judge agreement across all available models
Degraded mode — local heuristic fallback when the judge API is unreachable
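
The 50/50 combination in the feature list can be read as a plain average of the two dimension scores. The sketch below only illustrates that arithmetic, assuming both dimensions are scored on a 0.0–1.0 scale (the same range as --pass-threshold); the function names `combined_score` and `passes` are hypothetical and are not part of the eval-harness API.

```python
def combined_score(faithfulness: float, task_completion: float) -> float:
    """Average two rubric scores with equal (50/50) weight.

    Assumption: each dimension is already normalized to [0.0, 1.0].
    """
    for name, value in (("faithfulness", faithfulness),
                        ("task_completion", task_completion)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
    return 0.5 * faithfulness + 0.5 * task_completion


def passes(score: float, threshold: float = 0.7) -> bool:
    """A record passes when its score is at or above the threshold."""
    return score >= threshold


if __name__ == "__main__":
    score = combined_score(faithfulness=1.0, task_completion=0.5)
    print(score, passes(score))
```

With equal weights, a record that is fully faithful but only half completes the task lands at 0.75, which clears the default 0.7 threshold.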

Install

# From PyPI
pip install llm-eval-harness

# From source (development)
pip install -e ".[dev]"

Quickstart

# 1. Set your API key
export OPENROUTER_API_KEY=sk-or-...

# 2. Run an evaluation
eval-harness run path/to/logs.jsonl --judge meta-llama/llama-3.1-8b-instruct:free

# 3. View a previous run
eval-harness report --run-id <RUN_ID>

# 4. Export results
eval-harness export --run-id <RUN_ID> --format json --output-file results.json

Input Format

JSONL (one JSON object per line) or CSV. Field names default to input, output, and reference; reference is optional:

{"input": "user prompt", "output": "model response", "reference": "optional ground truth"}
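
If your logs live in another system, a small script can emit records in this shape. The following is a minimal sketch using only the standard library; the helper names `to_jsonl_line` and `records_to_jsonl` are illustrative, not something eval-harness provides.

```python
import json
from typing import Iterable, Optional, Tuple


def to_jsonl_line(prompt: str, response: str,
                  reference: Optional[str] = None) -> str:
    """Serialize one record using the default field names."""
    record = {"input": prompt, "output": response}
    if reference is not None:
        # The reference field is optional, so it is omitted when absent.
        record["reference"] = reference
    # json.dumps escapes embedded newlines, keeping one record per line.
    return json.dumps(record, ensure_ascii=False)


def records_to_jsonl(rows: Iterable[Tuple[str, str, Optional[str]]]) -> str:
    """Join (prompt, response, reference) tuples into JSONL text."""
    return "".join(to_jsonl_line(*row) + "\n" for row in rows)


if __name__ == "__main__":
    rows = [
        ("What is 2 + 2?", "4", "4"),
        ("Summarize the ticket.", "Customer cannot log in.", None),
    ]
    with open("logs.jsonl", "w", encoding="utf-8") as fh:
        fh.write(records_to_jsonl(rows))
```

The resulting file can then be passed to the run command as shown in the Quickstart.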

Commands

`run` — Ingest, evaluate, and report

eval-harness run <file> [options]

Option	Default	Description
`--format`	`jsonl`	Input format: `jsonl` or `csv`
`--input-col`	`input`	Column name for the user prompt
`--output-col`	`output`	Column name for the model response
`--reference-col`	`reference`	Column name for the ground truth
`--sample`	—	Randomly sample N records for evaluation
`--since`	—	Only evaluate records after this date (ISO-8601, e.g. `2026-06-01`)
`--limit`	—	Maximum number of records to evaluate
`--judge`	—	Judge model ID (e.g. `meta-llama/llama-3.1-8b-instruct:free`). If omitted, uses all available free models
`--no-fallback`	`false`	Disable round-robin fallback to other judges
`--max-fallbacks`	`3`	Maximum number of fallback judges to try
`--pass-threshold`	`0.7`	Score threshold for pass/fail (0.0–1.0)
`--output`	`table`	Output format: `table` (rich terminal) or `json`
`--output-file`	—	Write output to a file instead of stdout
`--dry-run`	`false`	Parse input and show record count without evaluating
`--resume`	`false`	Skip records that were already evaluated in a previous run
`--timeout`	`60`	Per-request timeout in seconds
`--rpm-limit`	—	Rate limit (requests per minute) for judge API calls
`--yes` / `-y`	`false`	Skip confirmation prompt
`--quiet`	`false`	Suppress progress output
`--feedback`	`false`	Generate improvement suggestions for low-scoring records
`--compare-judges`	`false`	Show side-by-side judge comparison table
`--degrade`	`false`	Use local heuristic fallback when judge API is unreachable
`--db`	`~/.eval-harness/eval.db`	Path to SQLite database
`--judges-cache`	—	Path to judge registry cache file
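
To make the interaction between --no-fallback and --max-fallbacks concrete, here is a simplified sketch of judge fallback. It is not the tool's implementation: it tries judges in list order rather than rotating between them, and the `Judge` callable type and `score_with_fallback` function are hypothetical stand-ins for whatever eval-harness does internally.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge type: takes a prompt and returns a score.
Judge = Callable[[str], float]


def score_with_fallback(prompt: str,
                        judges: Sequence[Tuple[str, Judge]],
                        max_fallbacks: int = 3,
                        no_fallback: bool = False) -> Tuple[str, float]:
    """Try the first judge, then up to max_fallbacks further judges.

    Returns the name of the judge that answered and its score.
    """
    attempts = 1 if no_fallback else 1 + max_fallbacks
    last_error: Optional[Exception] = None
    for name, judge in list(judges)[:attempts]:
        try:
            return name, judge(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError("all judges failed") from last_error


if __name__ == "__main__":
    def flaky(_prompt: str) -> float:
        raise TimeoutError("judge timed out")

    def steady(_prompt: str) -> float:
        return 0.9

    print(score_with_fallback("rate this", [("flaky", flaky), ("steady", steady)]))
```

Under these assumptions, disabling fallback means a single failing judge surfaces as an error instead of being retried elsewhere.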

Examples:

# Basic evaluation with default settings
eval-harness run logs.jsonl

# Use a specific judge with a higher pass threshold
eval-harness run logs.jsonl --judge meta-llama/llama-3.1-8b-instruct:free --pass-threshold 0.8

# Quick dry-run to validate input
eval-harness run logs.jsonl --dry-run

# Evaluate a random sample of 50 records
eval-harness run logs.jsonl --sample 50

# CSV with custom column names
eval-harness run logs.csv --format csv --input-col prompt --output-col response --reference-col expected

# Export JSON results to a file
eval-harness run logs.jsonl --output json --output-file results.json

# Resume a previously interrupted run
eval-harness run logs.jsonl --resume

# Generate improvement suggestions for failed records
eval-harness run logs.jsonl --feedback

# Compare scores across multiple judges
eval-harness run logs.jsonl --compare-judges

# Pipe from stdin
cat logs.jsonl | eval-harness run -

`judges` — List free judge models

eval-harness judges [--refresh] [--json]

Lists available free judge models from OpenRouter. Results are cached locally; use --refresh to force an update.

eval-harness judges                  # list models in a table
eval-harness judges --json           # output as JSON
eval-harness judges --refresh        # force refresh from API

`list-runs` — List previous evaluation runs

eval-harness list-runs [--limit N] [--json]

Option	Default	Description
`--limit`	`20`	Maximum number of runs to show
`--json`	`false`	Output as JSON
`--db`	`~/.eval-harness/eval.db`	Path to SQLite database

eval-harness list-runs               # show last 20 runs
eval-harness list-runs --limit 50    # show last 50 runs
eval-harness list-runs --json        # output as JSON

`report` — Show results for a previous run

eval-harness report --run-id <ID> [--output table|json|csv] [--output-file <path>]

Option	Default	Description
`--run-id`	required	Run ID to display
`--output`	`table`	Output format: `table`, `json`, or `csv`
`--output-file`	—	Write output to a file
`--db`	`~/.eval-harness/eval.db`	Path to SQLite database

eval-harness report --run-id abc123
eval-harness report --run-id abc123 --output json
eval-harness report --run-id abc123 --output csv --output-file report.csv

`export` — Export run results

eval-harness export --run-id <ID> --format json|csv --output-file <path>

Option	Default	Description
`--run-id`	required	Run ID to export
`--format`	`json`	Export format: `json` or `csv`
`--output-file`	required	Output file path
`--db`	`~/.eval-harness/eval.db`	Path to SQLite database

eval-harness export --run-id abc123 --format json --output-file results.json
eval-harness export --run-id abc123 --format csv --output-file results.csv

`cache` — Manage the judge response cache

eval-harness cache [--stats] [--clear]

Option	Description
`--stats`	Show cache statistics (entry count, hit rate, size)
`--clear`	Remove all cached judge responses

eval-harness cache --stats            # show cache stats
eval-harness cache --clear            # clear all cached responses
eval-harness cache                    # default: show stats
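
For intuition about how a SQLite-backed judge cache can skip unchanged records, consider the sketch below. The table name `judge_cache`, the key scheme (a SHA-256 over the judge ID and the record), and all function names are assumptions made for illustration; the actual schema in eval.db may differ.

```python
import hashlib
import json
import sqlite3
from typing import Optional


def cache_key(judge: str, record: dict) -> str:
    """Derive a stable key: same judge and same record give the same key."""
    payload = json.dumps({"judge": judge, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS judge_cache ("
        "key TEXT PRIMARY KEY, score REAL NOT NULL)"
    )
    return conn


def get_cached(conn: sqlite3.Connection, key: str) -> Optional[float]:
    """Return the cached score, or None on a cache miss."""
    row = conn.execute(
        "SELECT score FROM judge_cache WHERE key = ?", (key,)
    ).fetchone()
    return None if row is None else row[0]


def put_cached(conn: sqlite3.Connection, key: str, score: float) -> None:
    """Store or overwrite the score for a key."""
    conn.execute(
        "INSERT OR REPLACE INTO judge_cache (key, score) VALUES (?, ?)",
        (key, score),
    )
    conn.commit()


if __name__ == "__main__":
    conn = open_cache()
    key = cache_key("some-judge", {"input": "hi", "output": "hello"})
    if get_cached(conn, key) is None:
        put_cached(conn, key, 0.8)  # stand-in for a real judge call
    print(get_cached(conn, key))
```

Because the key covers the record contents, editing a record's output produces a different key and therefore a fresh evaluation.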

`trend` — Score timeline and regression detection

eval-harness trend [--rubric <id>] [--judge <model>] [--since <date>] [--json]

Option	Description
`--rubric`	Filter by rubric template ID
`--judge`	Filter by judge model ID
`--since`	Only show runs after this date (ISO-8601, e.g. `2026-06-01`)
`--json`	Output as JSON
`--db`	Path to SQLite database

Requires at least 2 completed runs to display trends.

eval-harness trend                              # show all runs
eval-harness trend --rubric faithfulness-v1     # filter by rubric
eval-harness trend --since 2026-06-01           # recent runs only
eval-harness trend --json                       # output as JSON
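
One simple way to think about regression detection over a score timeline is to compare each run's mean score with the previous run's. The sketch below does exactly that; the `tolerance` parameter and the function name are hypothetical, and the trend command may use a different rule.

```python
from typing import List, Tuple


def detect_regressions(runs: List[Tuple[str, float]],
                       tolerance: float = 0.05) -> List[Tuple[str, float]]:
    """Flag runs whose mean score dropped by more than `tolerance`.

    `runs` holds (run_id, mean_score) pairs ordered oldest first.
    Returns (run_id, delta) pairs, where delta is negative for a drop.
    """
    if len(runs) < 2:
        raise ValueError("need at least 2 completed runs")
    flagged = []
    for (_, previous), (run_id, current) in zip(runs, runs[1:]):
        delta = current - previous
        if delta < -tolerance:
            flagged.append((run_id, delta))
    return flagged


if __name__ == "__main__":
    history = [("r1", 0.80), ("r2", 0.82), ("r3", 0.70)]
    for run_id, delta in detect_regressions(history):
        print(f"{run_id}: mean score changed by {delta:+.2f}")
```

A comparison like this needs two data points, which is consistent with the two-run minimum stated above.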

`rubric` — Manage rubric templates

eval-harness rubric [--list] [--show <id>] [--create-name <n> --create-file <path>] [--delete <id>] [--json]

Option	Description
`--list`	List all rubric templates
`--show`	Show a specific template by ID
`--create-name`	Name for a new template (use with `--create-file`)
`--create-file`	Path to a YAML file defining the template (use with `--create-name`)
`--delete`	Delete a template by ID (built-in templates cannot be deleted)
`--json`	Output as JSON
`--db`	Path to SQLite database

eval-harness rubric --list                              # list all templates
eval-harness rubric --show faithfulness-v1              # show a template
eval-harness rubric --list --json                       # list as JSON
eval-harness rubric --create-name "my-rubric" --create-file rubric.yaml
eval-harness rubric --delete my-rubric                  # delete a custom template

`calibrate` — Measure inter-judge agreement

eval-harness calibrate <file> [--format jsonl|csv] [--sample N] [--since DATE] [--limit N] [--output-file <path>] [--json]

Runs every record through every available judge model and reports disagreement statistics. Useful for validating that your chosen judge model agrees with others.

Option	Default	Description
`--format`	`jsonl`	Input format: `jsonl` or `csv`
`--input-col`	`input`	Column name for the user prompt
`--output-col`	`output`	Column name for the model response
`--reference-col`	`reference`	Column name for the ground truth
`--sample`	—	Randomly sample N records
`--since`	—	Only evaluate records after this date
`--limit`	—	Maximum number of records
`--output-file`	—	Write output to a file
`--json`	`false`	Output as JSON
`--db`	`~/.eval-harness/eval.db`	Path to SQLite database
`--judges-cache`	—	Path to judge registry cache file

eval-harness calibrate logs.jsonl
eval-harness calibrate logs.jsonl --sample 20 --json
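
As a rough illustration of a disagreement statistic, the sketch below computes the mean absolute score difference for every pair of judges. This metric and the `pairwise_disagreement` name are assumptions chosen for clarity; the calibrate command may report different statistics.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple


def pairwise_disagreement(
    scores: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Mean absolute difference between each pair of judges.

    scores[judge][i] is that judge's score for record i, so every
    judge must have scored the same records in the same order.
    """
    lengths = {len(values) for values in scores.values()}
    if len(lengths) > 1:
        raise ValueError("every judge must score the same records")
    return {
        (a, b): mean(abs(x - y) for x, y in zip(scores[a], scores[b]))
        for a, b in combinations(sorted(scores), 2)
    }


if __name__ == "__main__":
    example = {
        "judge-a": [1.0, 0.5, 0.75],
        "judge-b": [0.5, 0.5, 0.75],
    }
    for pair, gap in pairwise_disagreement(example).items():
        print(pair, round(gap, 3))
```

A value near 0.0 means two judges score records almost identically, while larger values suggest the choice of judge materially affects results.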

`gate` — CI/CD quality gate

eval-harness gate --run-id <ID> [--threshold 0.7] [--suggest-baseline] [--json] [--output-file <path>]

Checks whether a run's pass rate meets a threshold. Returns exit code 0 (pass) or 1 (fail), making it suitable for CI pipelines.

Option	Default	Description
`--run-id`	—	Run ID to check (required unless `--suggest-baseline`)
`--threshold`	`0.7`	Pass rate threshold (0.0–1.0)
`--suggest-baseline`	`false`	Analyze run history and suggest a threshold
`--json`	`false`	Output as JSON
`--output-file`	—	Write output to a file
`--db`	`~/.eval-harness/eval.db`	Path to SQLite database

# Check a run against the default threshold (0.7)
eval-harness gate --run-id abc123

# Use a custom threshold
eval-harness gate --run-id abc123 --threshold 0.8

# Get a suggested baseline from historical runs
eval-harness gate --suggest-baseline

# Use in CI
eval-harness gate --run-id abc123 --json --output-file gate-result.json
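
The gate decision can be pictured as a pass-rate check. The sketch below assumes the pass rate is the fraction of records whose score is at or above the per-record pass threshold; that definition, and the `gate_exit_code` function itself, are illustrative assumptions rather than the tool's code.

```python
from typing import Sequence


def gate_exit_code(scores: Sequence[float],
                   pass_threshold: float = 0.7,
                   gate_threshold: float = 0.7) -> int:
    """Return 0 when the pass rate meets the gate threshold, else 1."""
    if not scores:
        raise ValueError("no scores to gate on")
    passed = sum(1 for score in scores if score >= pass_threshold)
    pass_rate = passed / len(scores)
    return 0 if pass_rate >= gate_threshold else 1


if __name__ == "__main__":
    import sys

    run_scores = [0.9, 0.8, 0.6, 0.75]  # made-up scores for illustration
    sys.exit(gate_exit_code(run_scores))
```

Here three of four records pass, giving a pass rate of 0.75, so the sketch exits with 0 at the default threshold and with 1 at --threshold 0.8.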

Exit Codes

Code	Meaning
`0`	All records passed (score >= threshold)
`1`	One or more records failed
`2`	Evaluator error (API key missing, file not found, etc.)

Configuration

Configuration is handled via environment variables. See .env.example for all options.

Variable	Required	Description
`OPENROUTER_API_KEY`	Yes	Your OpenRouter API key
`OPENROUTER_ENV_PATH`	No	Path to a `.env` file (default: `.env` in current directory)

CI/CD Example

# .github/workflows/eval.yml
name: LLM Evaluation
on:
  push:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install llm-eval-harness
      - run: eval-harness run eval/cases.jsonl --pass-threshold 0.7 --yes
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
      - run: eval-harness gate --run-id $(eval-harness list-runs --json | jq -r '.[0].run_id') --threshold 0.7
        env:
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
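
Where jq is not available on the runner, the run ID lookup in the last step could be done in Python instead. This sketch relies on the same assumption the jq expression makes, namely that list-runs --json prints an array whose first element is the most recent run and carries a run_id field; the `latest_run_id` helper is hypothetical.

```python
import json


def latest_run_id(list_runs_json: str) -> str:
    """Extract run_id from the first entry, mirroring jq '.[0].run_id'."""
    runs = json.loads(list_runs_json)
    if not runs:
        raise ValueError("no runs found")
    return runs[0]["run_id"]


if __name__ == "__main__":
    import subprocess

    # Requires eval-harness on PATH and at least one stored run.
    completed = subprocess.run(
        ["eval-harness", "list-runs", "--json"],
        check=True, capture_output=True, text=True,
    )
    print(latest_run_id(completed.stdout))
```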

Troubleshooting

`OPENROUTER_API_KEY is not set`

Set the environment variable before running any command:

export OPENROUTER_API_KEY=sk-or-...

Or create a .env file in your project root (see .env.example).

`no judges available`

The judge registry cache may be empty or corrupt. Force a refresh:

eval-harness judges --refresh

If the API is unreachable, the built-in judge list is used as a fallback.

`file not found` or `no records to evaluate`

Check that the file path is correct.
For stdin input, use - as the file argument: cat data.jsonl | eval-harness run -
Verify the input format matches --format (jsonl or csv).
If using CSV, confirm column names match --input-col / --output-col.

`evaluator error: ...` (exit code 2)

Check that your API key is valid and has available credits.
Try --degrade to use a local heuristic fallback.
Use --verbose (-v) for debug-level logging.

Slow evaluations

Use --sample N to evaluate a subset.
Use --rpm-limit N to rate-limit API calls.
Use --limit N to cap the total number of records.
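
To see why a requests-per-minute cap slows a run down, note that N requests per minute implies a minimum spacing of 60/N seconds between calls. The class below is a minimal sketch of that idea, not the limiter eval-harness ships; `RpmLimiter` and its injectable `clock` and `sleep` parameters are hypothetical.

```python
import time
from typing import Callable, Optional


class RpmLimiter:
    """Space calls at least 60/rpm seconds apart."""

    def __init__(self, rpm: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay


if __name__ == "__main__":
    limiter = RpmLimiter(rpm=120)  # at most one call every 0.5 seconds
    for i in range(3):
        waited = limiter.wait()
        print(f"call {i} after waiting {waited:.2f}s")
```

At 30 requests per minute, for example, 300 records need roughly ten minutes of judge calls, so combining --rpm-limit with --sample or --limit keeps runs short.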

`need at least 2 completed runs for trend display`

The trend command requires 2+ completed runs. Run more evaluations first.

Development

# Setup
git clone https://github.com/onicarps/eval-harness.git
cd eval-harness
pip install -e ".[dev]"

# Lint and format
ruff check src tests
ruff format --check src tests

# Type check
mypy --config-file pyproject.toml src

# Test
pytest tests/ -v --cov=src --cov-report=term-missing

# Run the CLI locally
python -m src.cli run sample.jsonl --dry-run

Project details

Release history

0.2.1 (this version): Jun 25, 2026

0.1.0: May 30, 2026

Download files

Source Distribution

llm_eval_harness-0.2.1.tar.gz (136.6 kB)

Uploaded Jun 25, 2026 Source

Built Distribution

llm_eval_harness-0.2.1-py3-none-any.whl (64.7 kB)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file llm_eval_harness-0.2.1.tar.gz.

File metadata

Download URL: llm_eval_harness-0.2.1.tar.gz
Upload date: Jun 25, 2026
Size: 136.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_eval_harness-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`dba28d411e719af8f3002540e70c8b5fcf5d8dc61ad854d60e45a98ec48f22d1`
MD5	`1a24bcfe4ea6d7857cbf09a216b0b9f8`
BLAKE2b-256	`fcd8d4821c92b87df0be7518d3d97ac9b2c0fe41fd321c6085e54b42f7e1b999`

Provenance

The following attestation bundles were made for llm_eval_harness-0.2.1.tar.gz:

Publisher: publish.yml on onicarps/eval-harness

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_eval_harness-0.2.1.tar.gz
- Subject digest: dba28d411e719af8f3002540e70c8b5fcf5d8dc61ad854d60e45a98ec48f22d1
- Sigstore transparency entry: 1945566764
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: onicarps/eval-harness@6d83f583c74bf98de68ac2f557cdd69c33494b58
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/onicarps
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6d83f583c74bf98de68ac2f557cdd69c33494b58
- Trigger Event: push

File details

Details for the file llm_eval_harness-0.2.1-py3-none-any.whl.

File metadata

Download URL: llm_eval_harness-0.2.1-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 64.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_eval_harness-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2181d17fe02e455907293a184781a8bdc94f3b767ecc839963535c92f14144a5`
MD5	`ae65d80087f9a71eedb28c66be6138ef`
BLAKE2b-256	`182b0268951823f297902a83933189dea903c5c6cf32a76e8f1e14b7afc02248`

Provenance

The following attestation bundles were made for llm_eval_harness-0.2.1-py3-none-any.whl:

Publisher: publish.yml on onicarps/eval-harness

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_eval_harness-0.2.1-py3-none-any.whl
- Subject digest: 2181d17fe02e455907293a184781a8bdc94f3b767ecc839963535c92f14144a5
- Sigstore transparency entry: 1945566787
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: onicarps/eval-harness@6d83f583c74bf98de68ac2f557cdd69c33494b58
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/onicarps
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6d83f583c74bf98de68ac2f557cdd69c33494b58
- Trigger Event: push

llm-eval-harness 0.2.1
