
ai-eval-forge


Zero-dependency eval harness for LLM and agent regression testing. Score outputs with exact, contains, regex, token_f1, json_valid, json_field, and citation_coverage checks. Ships a CLI and a small library API. No runtime dependencies — pure stdlib.

Python port of the @mukundakatta/ai-eval-forge npm package. Same check types, same output shape: case files you wrote for the npm version work here unchanged.

Install

pip install ai-eval-forge

Run the CLI

aef score cases.jsonl
# or
ai-eval-forge score cases.jsonl --format markdown

Exits 0 on all pass, 1 on any failures, 2 on bad input.

Case file format

Each case is a JSON object. The file can be either a JSON array or JSONL (one object per line).

{"id": "greeting", "actual": "hello world", "expected": "hello world"}
{"id": "json-output", "actual": "{\"user\":{\"name\":\"Alice\"}}", "checks":[{"type":"json_field","path":"user.name","value":"Alice"}]}
{"id": "cited", "actual": "See [src1] and [src2].", "sources":[{"id":"src1"},{"id":"src2"}], "checks":[{"type":"citation_coverage","min":1}]}
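For illustration, the dual array-or-JSONL format can be read with nothing but the stdlib. This helper is a sketch of the format, not the library's own parse_cases:

```python
import json

def load_cases(text: str) -> list[dict]:
    """Parse a cases file: a JSON array, or JSONL (one object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)  # whole file is a single JSON array
    # JSONL: one JSON object per non-empty line
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

cases = load_cases(
    '{"id": "greeting", "actual": "hello world", "expected": "hello world"}\n'
    '{"id": "cited", "actual": "See [src1]."}\n'
)
```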

Check types

exact: Normalized (lowercase, whitespace-collapsed) string equality.
contains: All listed substrings present in actual. Optional caseSensitive.
regex: Python regex match against actual. flags accepts i, m, s.
token_f1: F1 over lowercase alphanumeric tokens. Default check if none specified.
json_valid: actual parses as valid JSON.
json_field: Parse JSON, drill into path, deep-equal against value.
citation_coverage: Fraction of source IDs from sources that appear inside actual.

Every check accepts required (default true) and min (default 1). A case passes iff every required check scores at least its min. The case's overall score is the average of its check scores.
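As a concrete example of the metric, token-level F1 over lowercase alphanumeric tokens is commonly computed as a multiset overlap. This is a minimal sketch of that computation; the package's exact tokenization and edge-case handling may differ:

```python
import re
from collections import Counter

def token_f1(actual: str, expected: str) -> float:
    """F1 over lowercase alphanumeric tokens (multiset overlap)."""
    a = Counter(re.findall(r"[a-z0-9]+", actual.lower()))
    e = Counter(re.findall(r"[a-z0-9]+", expected.lower()))
    overlap = sum((a & e).values())  # count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(e.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, token_f1("hello there world", "hello world") gives precision 2/3 and recall 1, hence F1 = 0.8.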

Library API

from ai_eval_forge import evaluate_suite, parse_cases, render_markdown
from pathlib import Path

cases = parse_cases(Path("cases.jsonl").read_text())
suite = evaluate_suite(cases)
print(render_markdown(suite))
print(f"Pass rate: {suite.summary.passRate:.0%}")

Output shape (JSON)

{
  "summary": {
    "total": 2,
    "passed": 1,
    "failed": 1,
    "passRate": 0.5,
    "averageScore": 0.82,
    "totalCostUsd": 0.0,
    "averageLatencyMs": 0
  },
  "cases": [
    {
      "id": "greeting",
      "passed": true,
      "score": 1.0,
      "checks": [{"type": "token_f1", "required": true, "passed": true, "score": 1.0, "min": 0.65, "detail": "token_f1=1.0"}],
      "meta": {"input": null, "tags": [], "costUsd": 0, "latencyMs": 0}
    }
  ]
}
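The summary fields are plain aggregates over the per-case results. A sketch under that assumption (not the library's internals), using the two cases from the example above:

```python
cases = [
    {"id": "greeting", "passed": True, "score": 1.0},
    {"id": "json-output", "passed": False, "score": 0.64},
]

passed = sum(c["passed"] for c in cases)
summary = {
    "total": len(cases),
    "passed": passed,
    "failed": len(cases) - passed,
    "passRate": passed / len(cases),
    "averageScore": sum(c["score"] for c in cases) / len(cases),
}
```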

Differences from the npm version

  • js_expression check type is dropped. The JS version lets you run a JavaScript expression against case context. Python's equivalent (eval) is harder to sandbox, so the Python port omits this check type rather than ship a half-sandbox. If you need custom logic, use regex or json_field — or extend the library via your own run_check wrapper.
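A custom-logic escape hatch might look like the following. Every name here (run_custom_check, the starts_with check type, the prefix field) is hypothetical and not part of the ai-eval-forge API; it only illustrates the shape a wrapper could take:

```python
def run_custom_check(case: dict, check: dict) -> dict:
    """Hypothetical wrapper: handle a project-specific check type.

    Names and fields here are illustrative, not ai-eval-forge API.
    """
    if check["type"] == "starts_with":
        passed = case["actual"].startswith(check["prefix"])
        return {"type": check["type"], "passed": passed, "score": float(passed)}
    raise ValueError(f"unknown check type: {check['type']}")

result = run_custom_check(
    {"id": "greeting", "actual": "hello world"},
    {"type": "starts_with", "prefix": "hello"},
)
```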

Everything else matches the npm package 1:1: same check types, same scoring formulas, same summary fields, same exit codes, same CLI flags.

Development

pip install -e '.[dev]'
pytest

License

MIT.
