
eval-harness

A Python CLI that evaluates LLM outputs from production logs against a dual-dimension rubric (faithfulness + task completion).

Install

pip install -e ".[dev]"

Quickstart

export OPENRIXER_API_KEY=sk-or-...
eval-harness run path/to/logs.jsonl --judge meta-llama/llama-3.1-8b-instruct:free

Input JSONL schema:

{"input": "user prompt", "output": "model response", "reference": "optional ground truth"}

Commands

  • eval-harness run <file> — ingest, evaluate, and report
  • eval-harness judges — list free judge models (cached in ~/.eval-harness/judges.json)
  • eval-harness report --run-id UUID — show a stored run
  • eval-harness export --run-id UUID --format json|csv --output-file PATH
  • eval-harness cache [--stats] [--clear]

Exit codes: 0 all pass, 1 any failures, 2 evaluator error.
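A wrapper script can branch on these exit codes. The sketch below is illustrative only: `describe_exit` and `run_eval` are hypothetical helper names, and the only thing taken from this README is the mapping of 0, 1, and 2 and the `eval-harness run <file>` invocation.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "all cases passed",
    1: "one or more cases failed",
    2: "evaluator error",
}


def describe_exit(code: int) -> str:
    """Translate an eval-harness exit code into a readable status."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_eval(log_path: str, extra_args: tuple[str, ...] = ()) -> int:
    """Run `eval-harness run <file>` and return its exit code.

    check=False keeps subprocess from raising on non-zero codes,
    so the caller can tell failures (1) apart from errors (2).
    """
    cmd = ["eval-harness", "run", log_path, *extra_args]
    return subprocess.run(cmd, check=False).returncode
```

For example, `describe_exit(run_eval("eval/cases.jsonl"))` would give a one-line status for a run, which lets a script retry on evaluator errors while treating failed cases as a real regression.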

CI/CD example

- run: pip install eval-harness
- run: OPENRIXER_API_KEY=${{ secrets.OPENRIXER_API_KEY }} eval-harness run eval/cases.jsonl --pass-threshold 0.7
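This README does not say how the two rubric scores are combined against `--pass-threshold`. As a rough mental model only, the sketch below assumes each dimension is scored in the range 0 to 1 and that a case passes when both scores meet the threshold. The names `RubricScores`, `passes`, and `exit_code_for` are hypothetical and are not taken from the eval-harness source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScores:
    """Hypothetical container for the two rubric dimensions."""
    faithfulness: float
    task_completion: float


def passes(scores: RubricScores, threshold: float = 0.7) -> bool:
    # Assumption: the weaker of the two dimensions decides the outcome.
    return min(scores.faithfulness, scores.task_completion) >= threshold


def exit_code_for(results: list[RubricScores], threshold: float = 0.7) -> int:
    # Mirrors the documented exit codes: 0 when every case passes, 1 otherwise.
    return 0 if all(passes(s, threshold) for s in results) else 1
```

Under this assumed rule, a response that is faithful but does not complete the task would still fail the CI step, which is the usual reason for scoring the two dimensions separately.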

Development

pip install -e ".[dev]"
ruff check src tests && ruff format --check src tests
pytest tests/ -v --cov=src
