# ai-eval-forge
Zero-dependency eval harness for LLM and agent regression testing. Score outputs with exact, contains, regex, token_f1, json_valid, json_field, and citation_coverage checks. Ships a CLI and a small library API. No runtime dependencies — pure stdlib.
Python port of the `@mukundakatta/ai-eval-forge` npm package. Same check types, same output shape; case files you wrote for the npm version work here unchanged.
## Install
```bash
pip install ai-eval-forge
```
## Run the CLI
```bash
aef score cases.jsonl
# or
ai-eval-forge score cases.jsonl --format markdown
```
Exits `0` when all cases pass, `1` on any failure, `2` on bad input.
## Case file format
Each case is a JSON object. The file can be either a JSON array or JSONL (one object per line).
{"id": "greeting", "actual": "hello world", "expected": "hello world"}
{"id": "json-output", "actual": "{\"user\":{\"name\":\"Alice\"}}", "checks":[{"type":"json_field","path":"user.name","value":"Alice"}]}
{"id": "cited", "actual": "See [src1] and [src2].", "sources":[{"id":"src1"},{"id":"src2"}], "checks":[{"type":"citation_coverage","min":1}]}
## Check types
| Type | What it does |
|---|---|
| `exact` | Normalized (lowercased, whitespace-collapsed) string equality. |
| `contains` | All listed substrings present in `actual`. Optional `caseSensitive`. |
| `regex` | Python regex match against `actual`. `flags` accepts `i`, `m`, `s`. |
| `token_f1` | F1 over lowercase alphanumeric tokens. The default check if none is specified. |
| `json_valid` | `actual` parses as valid JSON. |
| `json_field` | Parse JSON, drill into `path`, deep-equal against `value`. |
| `citation_coverage` | Fraction of source IDs from `sources` that appear in `actual`. |
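As an illustration of the scoring, here is a minimal sketch of an F1-over-tokens check like the `token_f1` row describes. It assumes "lowercase alphanumeric tokens" means runs of `[a-z0-9]` after lowercasing; the library's exact tokenizer may differ:

```python
import re
from collections import Counter

def token_f1_sketch(actual: str, expected: str) -> float:
    """F1 overlap over lowercase alphanumeric tokens (hypothetical re-implementation)."""
    tokenize = lambda s: Counter(re.findall(r"[a-z0-9]+", s.lower()))
    a, e = tokenize(actual), tokenize(expected)
    overlap = sum((a & e).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(a.values()), overlap / sum(e.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1_sketch("Hello, world!", "hello world"))  # 1.0
```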
Every check accepts `required` (default `true`) and `min` (default `1`). A case passes iff every required check scores at least its `min`. The case's overall score is the average across all checks.
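The pass/score rule above can be sketched as a hypothetical helper operating on already-scored check results:

```python
def score_case(checks: list) -> tuple:
    """Pass iff every required check meets its min; overall score is the mean."""
    passed = all(c["score"] >= c.get("min", 1)          # min defaults to 1
                 for c in checks if c.get("required", True))
    score = sum(c["score"] for c in checks) / len(checks)
    return passed, score

print(score_case([{"score": 1.0}, {"score": 0.5, "min": 0.4}]))  # (True, 0.75)
```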
## Library API
```python
from pathlib import Path

from ai_eval_forge import evaluate_suite, parse_cases, render_markdown

cases = parse_cases(Path("cases.jsonl").read_text())
suite = evaluate_suite(cases)
print(render_markdown(suite))
print(f"Pass rate: {suite.summary.passRate:.0%}")
```
## Output shape (JSON)
```json
{
  "summary": {
    "total": 2,
    "passed": 1,
    "failed": 1,
    "passRate": 0.5,
    "averageScore": 0.82,
    "totalCostUsd": 0.0,
    "averageLatencyMs": 0
  },
  "cases": [
    {
      "id": "greeting",
      "passed": true,
      "score": 1.0,
      "checks": [
        {"type": "token_f1", "required": true, "passed": true, "score": 1.0, "min": 0.65, "detail": "token_f1=1.0"}
      ],
      "meta": {"input": null, "tags": [], "costUsd": 0, "latencyMs": 0}
    }
  ]
}
```
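Because the report is plain JSON, downstream tooling needs nothing beyond the stdlib. For example, listing the IDs of failing cases from a saved report (field names taken from the shape above; the report literal here is illustrative):

```python
import json

report_json = '''
{"summary": {"total": 2, "passed": 1, "failed": 1, "passRate": 0.5},
 "cases": [{"id": "greeting",    "passed": true,  "score": 1.0},
           {"id": "json-output", "passed": false, "score": 0.4}]}
'''
report = json.loads(report_json)
failing = [case["id"] for case in report["cases"] if not case["passed"]]
print(failing)  # ['json-output']
```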
## Differences from the npm version
The `js_expression` check type is dropped. The JS version lets you run a JavaScript expression against the case context; Python's equivalent (`eval`) is harder to sandbox, so the Python port omits this check type rather than ship a half-sandbox. If you need custom logic, use `regex` or `json_field`, or extend the library via your own `run_check` wrapper.
Everything else matches the npm package 1:1: same check types, same scoring formulas, same summary fields, same exit codes, same CLI flags.
## Development
```bash
pip install -e '.[dev]'
pytest
```
## License
MIT.
## File details

Details for the file `ai_eval_forge-0.1.0.tar.gz`.

### File metadata

- Download URL: ai_eval_forge-0.1.0.tar.gz
- Upload date:
- Size: 9.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `ce4fa62d0f19f8f47e9613f1d97f0330a3dc9ec7b9ddaf078a32f7463d4182d6` |
| MD5 | `942ebf05a46c1400ec58524e6f038882` |
| BLAKE2b-256 | `0c9238f393fcadd01b3d1610fdedd31b22b6eb11d35f8392228d3e3d05dc142b` |
## File details

Details for the file `ai_eval_forge-0.1.0-py3-none-any.whl`.

### File metadata

- Download URL: ai_eval_forge-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `8f9c595cb7ee8d0fef5126c971a29022e24cc0fc67fffb82910acbc898b889dd` |
| MD5 | `398d43b8f12c3e404d711b682d487fa4` |
| BLAKE2b-256 | `a4461a99a68098eee1ae6506a27c97afec228c0eec738271fdb701af117363d6` |