
ai-eval-forge


Zero-dependency eval harness for LLM and agent regression testing. Score outputs with exact, contains, regex, token_f1, json_valid, json_field, and citation_coverage checks. Ships a CLI and a small library API. No runtime dependencies — pure stdlib.

Python port of the @mukundakatta/ai-eval-forge npm package. Same check types, same output shape: case files you wrote for the npm version work here unchanged.

Install

pip install ai-eval-forge

Run the CLI

aef score cases.jsonl
# or
ai-eval-forge score cases.jsonl --format markdown

Exits 0 on all pass, 1 on any failures, 2 on bad input.

Case file format

Each case is a JSON object. The file can be either a JSON array or JSONL (one object per line).

{"id": "greeting", "actual": "hello world", "expected": "hello world"}
{"id": "json-output", "actual": "{\"user\":{\"name\":\"Alice\"}}", "checks":[{"type":"json_field","path":"user.name","value":"Alice"}]}
{"id": "cited", "actual": "See [src1] and [src2].", "sources":[{"id":"src1"},{"id":"src2"}], "checks":[{"type":"citation_coverage","min":1}]}
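For illustration, the dual array-or-JSONL format can be read with nothing but the stdlib. This helper is a sketch of the format, not the library's own parse_cases:

```python
import json

def load_cases(text: str) -> list[dict]:
    """Parse a cases file: a JSON array, or JSONL (one object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)  # whole file is a single JSON array
    # JSONL: one JSON object per non-empty line
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

cases = load_cases(
    '{"id": "greeting", "actual": "hello world", "expected": "hello world"}\n'
    '{"id": "cited", "actual": "See [src1]."}\n'
)
```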

Check types

exact: Normalized (lowercase, whitespace-collapsed) string equality.
contains: All listed substrings present in actual. Optional caseSensitive.
regex: Python regex match against actual. flags accepts i, m, s.
token_f1: F1 over lowercase alphanumeric tokens. Default check if none specified.
json_valid: actual parses as valid JSON.
json_field: Parse JSON, drill into path, deep-equal against value.
citation_coverage: Fraction of source IDs from sources that appear inside actual.

Every check accepts required (default true) and min (default 1). A case passes iff every required check scores at least its min. The case's overall score is the average of its check scores.
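As a concrete example of the metric, token-level F1 over lowercase alphanumeric tokens is commonly computed as a multiset overlap. This is a minimal sketch of that computation; the package's exact tokenization and edge-case handling may differ:

```python
import re
from collections import Counter

def token_f1(actual: str, expected: str) -> float:
    """F1 over lowercase alphanumeric tokens (multiset overlap)."""
    a = Counter(re.findall(r"[a-z0-9]+", actual.lower()))
    e = Counter(re.findall(r"[a-z0-9]+", expected.lower()))
    overlap = sum((a & e).values())  # count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(e.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, token_f1("hello there world", "hello world") gives precision 2/3 and recall 1, hence F1 = 0.8.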

Library API

from ai_eval_forge import evaluate_suite, parse_cases, render_markdown
from pathlib import Path

cases = parse_cases(Path("cases.jsonl").read_text())
suite = evaluate_suite(cases)
print(render_markdown(suite))
print(f"Pass rate: {suite.summary.passRate:.0%}")

Output shape (JSON)

{
  "summary": {
    "total": 2,
    "passed": 1,
    "failed": 1,
    "passRate": 0.5,
    "averageScore": 0.82,
    "totalCostUsd": 0.0,
    "averageLatencyMs": 0
  },
  "cases": [
    {
      "id": "greeting",
      "passed": true,
      "score": 1.0,
      "checks": [{"type": "token_f1", "required": true, "passed": true, "score": 1.0, "min": 0.65, "detail": "token_f1=1.0"}],
      "meta": {"input": null, "tags": [], "costUsd": 0, "latencyMs": 0}
    }
  ]
}
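The summary fields are plain aggregates over the per-case results. A sketch under that assumption (not the library's internals), using the two cases from the example above:

```python
cases = [
    {"id": "greeting", "passed": True, "score": 1.0},
    {"id": "json-output", "passed": False, "score": 0.64},
]

passed = sum(c["passed"] for c in cases)
summary = {
    "total": len(cases),
    "passed": passed,
    "failed": len(cases) - passed,
    "passRate": passed / len(cases),
    "averageScore": sum(c["score"] for c in cases) / len(cases),
}
```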

Differences from the npm version

  • js_expression check type is dropped. The JS version lets you run a JavaScript expression against case context. Python's equivalent (eval) is harder to sandbox, so the Python port omits this check type rather than ship a half-sandbox. If you need custom logic, use regex or json_field — or extend the library via your own run_check wrapper.
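A custom-logic escape hatch might look like the following. Every name here (run_custom_check, the starts_with check type, the prefix field) is hypothetical and not part of the ai-eval-forge API; it only illustrates the shape a wrapper could take:

```python
def run_custom_check(case: dict, check: dict) -> dict:
    """Hypothetical wrapper: handle a project-specific check type.

    Names and fields here are illustrative, not ai-eval-forge API.
    """
    if check["type"] == "starts_with":
        passed = case["actual"].startswith(check["prefix"])
        return {"type": check["type"], "passed": passed, "score": float(passed)}
    raise ValueError(f"unknown check type: {check['type']}")

result = run_custom_check(
    {"id": "greeting", "actual": "hello world"},
    {"type": "starts_with", "prefix": "hello"},
)
```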

Everything else matches the npm package 1:1: same check types, same scoring formulas, same summary fields, same exit codes, same CLI flags.

Development

pip install -e '.[dev]'
pytest

License

MIT.
