Zero-dependency eval harness for LLM and agent regression testing. Scores outputs with exact, contains, regex, JSON, citation, and token-F1 checks. Compares two runs to flag regressions.

Project description

ai-eval-forge

Zero-dependency eval harness for LLM and agent regression testing. Score outputs with exact, contains, regex, token_f1, json_valid, json_field, and citation_coverage checks. Ships a CLI and a small library API. No runtime dependencies — pure stdlib.

Python port of the @mukundakatta/ai-eval-forge npm package. Same check types, same output shape — cases files you wrote for the npm version work here unchanged.

Install

pip install ai-eval-forge

Run the CLI

aef score cases.jsonl
# or
ai-eval-forge score cases.jsonl --format markdown

Exits 0 on all pass, 1 on any failures, 2 on bad input.

Case file format

Each case is a JSON object. The file can be either a JSON array or JSONL (one object per line).

{"id": "greeting", "actual": "hello world", "expected": "hello world"}
{"id": "json-output", "actual": "{\"user\":{\"name\":\"Alice\"}}", "checks":[{"type":"json_field","path":"user.name","value":"Alice"}]}
{"id": "cited", "actual": "See [src1] and [src2].", "sources":[{"id":"src1"},{"id":"src2"}], "checks":[{"type":"citation_coverage","min":1}]}
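A loader for this dual format is easy to sketch with the stdlib alone. The snippet below mirrors what a parser for this format has to do, but it is an illustration, not the library's `parse_cases` implementation:

```python
import json

def load_cases(text: str) -> list[dict]:
    """Parse a cases file that may be a JSON array or JSONL."""
    stripped = text.strip()
    if stripped.startswith("["):  # whole file is a single JSON array
        return json.loads(stripped)
    # otherwise treat each non-empty line as one JSON object (JSONL)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]
```

Either file style round-trips through the same function, which is why the two formats can be used interchangeably.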

Check types

Type                What it does
exact               Normalized (lowercase, whitespace-collapsed) string equality.
contains            All listed substrings present in actual. Optional caseSensitive.
regex               Python regex match against actual. flags accepts i, m, s.
token_f1            F1 over lowercase alphanumeric tokens. Default check if none specified.
json_valid          actual parses as valid JSON.
json_field          Parse JSON, drill into path, deep-equal against value.
citation_coverage   Fraction of source IDs from sources that appear inside actual.
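The two fuzzier checks follow directly from their definitions above. Here is a stdlib sketch of how token_f1 and citation_coverage could be computed — an illustration of the stated formulas, not the package's actual code:

```python
import re

def token_f1(actual: str, expected: str) -> float:
    """F1 overlap over lowercase alphanumeric tokens."""
    tok = lambda s: re.findall(r"[a-z0-9]+", s.lower())
    a, e = tok(actual), tok(expected)
    if not a or not e:
        return 1.0 if a == e else 0.0
    # count shared tokens, respecting multiplicity
    overlap = sum(min(a.count(t), e.count(t)) for t in set(a))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(e)
    return 2 * precision * recall / (precision + recall)

def citation_coverage(actual: str, sources: list[dict]) -> float:
    """Fraction of source IDs that appear anywhere in actual."""
    if not sources:
        return 0.0
    return sum(1 for s in sources if s["id"] in actual) / len(sources)
```

With min set to 1, citation_coverage requires every source to be cited; lowering min tolerates partial coverage.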

Every check accepts required (default true) and min (default 1). A case passes iff every required check has score >= min; its overall score is the average of all check scores.
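That pass/score rule is compact enough to state in code. A hedged sketch of the aggregation, with check results assumed to be plain dicts as in the JSON output shape (not the library's internals):

```python
def aggregate(checks: list[dict]) -> tuple[bool, float]:
    """Case passes iff every required check meets its min;
    overall score is the plain average of all check scores."""
    passed = all(
        c["score"] >= c.get("min", 1)
        for c in checks if c.get("required", True)
    )
    score = sum(c["score"] for c in checks) / len(checks)
    return passed, score
```

Note that non-required checks still pull the average down even though they cannot fail the case.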

Library API

from ai_eval_forge import evaluate_suite, parse_cases, render_markdown
from pathlib import Path

cases = parse_cases(Path("cases.jsonl").read_text())
suite = evaluate_suite(cases)
print(render_markdown(suite))
print(f"Pass rate: {suite.summary.passRate:.0%}")

Output shape (JSON)

{
  "summary": {
    "total": 2,
    "passed": 1,
    "failed": 1,
    "passRate": 0.5,
    "averageScore": 0.82,
    "totalCostUsd": 0.0,
    "averageLatencyMs": 0
  },
  "cases": [
    {
      "id": "greeting",
      "passed": true,
      "score": 1.0,
      "checks": [{"type": "token_f1", "required": true, "passed": true, "score": 1.0, "min": 0.65, "detail": "token_f1=1.0"}],
      "meta": {"input": null, "tags": [], "costUsd": 0, "latencyMs": 0}
    }
  ]
}
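Because the report is plain JSON, downstream tooling can consume it directly. For example, a small script that pulls the failing case IDs out of a report (field names taken from the shape above; the report string here is a made-up example):

```python
import json

report = json.loads("""{
  "summary": {"total": 2, "passed": 1, "failed": 1, "passRate": 0.5},
  "cases": [
    {"id": "greeting", "passed": true, "score": 1.0},
    {"id": "json-output", "passed": false, "score": 0.4}
  ]
}""")

failed = [case["id"] for case in report["cases"] if not case["passed"]]
print(f"{report['summary']['failed']} failing: {', '.join(failed)}")
# → 1 failing: json-output
```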

Differences from the npm version

  • js_expression check type is dropped. The JS version lets you run a JavaScript expression against case context. Python's equivalent (eval) is harder to sandbox, so the Python port omits this check type rather than ship a half-sandbox. If you need custom logic, use regex or json_field — or extend the library via your own run_check wrapper.
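One way such a wrapper could look — run_check here is a hypothetical name for your own dispatch function, and library_run_check stands in for however you invoke the library's built-in checks:

```python
# Hypothetical wrapper: handle custom check types, defer the rest.
def run_check(check: dict, case: dict, library_run_check=None):
    if check["type"] == "word_count_min":  # custom check, not in the package
        n = len(case["actual"].split())
        score = 1.0 if n >= check["value"] else n / check["value"]
        return {"type": check["type"], "score": score,
                "passed": score >= check.get("min", 1)}
    # anything else: fall through to the library's checker
    return library_run_check(check, case)
```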

Everything else matches the npm package 1:1: same check types, same scoring formulas, same summary fields, same exit codes, same CLI flags.

Development

pip install -e '.[dev]'
pytest

License

MIT.

Project details


Download files

Download the file for your platform.

Source Distribution

ai_eval_forge-0.2.0.tar.gz (12.1 kB)

Uploaded Source

Built Distribution

ai_eval_forge-0.2.0-py3-none-any.whl (11.7 kB)

Uploaded Python 3

File details

Details for the file ai_eval_forge-0.2.0.tar.gz.

File metadata

  • Download URL: ai_eval_forge-0.2.0.tar.gz
  • Upload date:
  • Size: 12.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ai_eval_forge-0.2.0.tar.gz
Algorithm Hash digest
SHA256 04d44e6bc424e632a0f8e4b5ad32db4d37bd64ac7ffa24f4312de433406ebe1e
MD5 cc7cc88357f27a43579d10a28e5e6df3
BLAKE2b-256 7ea890b9f5e6302fef023647f0b1bfa8394fa9e7177a118904dbbf3f73bb08d0

Provenance

The following attestation bundles were made for ai_eval_forge-0.2.0.tar.gz:

Publisher: publish.yml on MukundaKatta/ai-eval-forge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ai_eval_forge-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: ai_eval_forge-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 11.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ai_eval_forge-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0d49a86fb62c0624794087b15a31b91a21d4b1d09696fb78dc35685e25e833e0
MD5 552bb5f99d196dfcaf6b51e1ebb69cbd
BLAKE2b-256 b6e2055dcfa03858596b3147108c37f9bb2193b0c1f50a8628bf9eef5ea7bd44

Provenance

The following attestation bundles were made for ai_eval_forge-0.2.0-py3-none-any.whl:

Publisher: publish.yml on MukundaKatta/ai-eval-forge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
