eval-harness
A Python CLI that evaluates LLM outputs from production logs against a dual-dimension rubric (faithfulness + task completion).
Install
pip install -e ".[dev]"
Quickstart
export OPENRIXER_API_KEY=sk-or-...
eval-harness run path/to/logs.jsonl --judge meta-llama/llama-3.1-8b-instruct:free
Input JSONL schema:
{"input": "user prompt", "output": "model response", "reference": "optional ground truth"}
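The schema above can be checked before a run with a small pre-flight script. The following is a minimal sketch, not part of eval-harness: the helper names (`validate_record`, `write_jsonl`, `check_jsonl`) are hypothetical, and it assumes that `input` and `output` are required strings while `reference` is an optional string, which is inferred from the example line rather than confirmed by the tool.

```python
import json
from pathlib import Path

# Assumption: "input" and "output" are required; "reference" is optional.
REQUIRED_KEYS = ("input", "output")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not isinstance(record.get(key), str):
            problems.append(f"missing or non-string field: {key}")
    if "reference" in record and not isinstance(record["reference"], str):
        problems.append("reference must be a string when present")
    return problems


def write_jsonl(records: list[dict], path: Path) -> None:
    """Write one JSON object per line, as the JSONL layout expects."""
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def check_jsonl(path: Path) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems; clean lines are omitted."""
    issues: dict[int, list[str]] = {}
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                issues[lineno] = [f"invalid JSON: {exc.msg}"]
                continue
            if not isinstance(record, dict):
                issues[lineno] = ["line is not a JSON object"]
                continue
            problems = validate_record(record)
            if problems:
                issues[lineno] = problems
    return issues
```

Calling `check_jsonl(Path("path/to/logs.jsonl"))` and getting an empty dict back suggests the file matches the shape shown above; the tool itself may apply stricter or looser rules.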
Commands
- eval-harness run <file>: ingest, evaluate, and report
- eval-harness judges: list free judge models (cached in ~/.eval-harness/judges.json)
- eval-harness report --run-id UUID: show a stored run
- eval-harness export --run-id UUID --format json|csv --output-file PATH
- eval-harness cache [--stats] [--clear]
Exit codes: 0 all pass, 1 any failures, 2 evaluator error.
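When calling the CLI from another script, the exit codes above can be turned into readable statuses. This is a rough sketch under stated assumptions: `describe_exit` and `run_eval` are hypothetical helper names, and the command line simply mirrors the Quickstart invocation; only the three code meanings come from this README.

```python
import subprocess

# Meanings copied from the exit-code line in this README.
EXIT_MEANINGS = {0: "all pass", 1: "any failures", 2: "evaluator error"}


def describe_exit(code: int) -> str:
    """Translate an exit code into the meaning documented above."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_eval(log_path: str, judge: str) -> int:
    """Invoke the CLI as in the Quickstart and return its exit code."""
    result = subprocess.run(
        ["eval-harness", "run", log_path, "--judge", judge],
        check=False,  # a non-zero code is a result here, not an exception
    )
    return result.returncode


# Example use (requires the CLI on PATH and an API key in the environment):
#   code = run_eval("path/to/logs.jsonl", "meta-llama/llama-3.1-8b-instruct:free")
#   print(describe_exit(code))
```

Keeping `check=False` lets the caller tell a failing evaluation (1) apart from an evaluator error (2) instead of collapsing both into one exception.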
CI/CD example
- run: pip install evalguide
- run: OPENRIXER_API_KEY=${{ secrets.OPENRIXER_API_KEY }} eval-harness run eval/cases.jsonl --pass-threshold 0.7
Development
pip install -e ".[dev]"
ruff check src tests && ruff format --check src tests
pytest tests/ -v --cov=src