
evalforge

Eval runner for LLM applications. CI-native AI quality.

Your code has tests. Your prompts don't. evalforge brings the same quality gates you have for code to your LLM features — run on every PR, catch regressions before they reach production.

from evalforge import EvalDataset, run_eval
from evalforge.scorers import llm_judge, exact_match

dataset = EvalDataset.from_json("evals/rag_quality.json")

report = run_eval(
    dataset=dataset,
    model="claude-3-5-sonnet",
    scorers=[llm_judge(rubric="Is the answer accurate and grounded?"), exact_match()],
)

report.assert_pass(threshold=0.85)  # fails CI if score drops below 85%
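Under the hood, a gate like `assert_pass` presumably reduces to comparing a mean score against the threshold and raising on failure so the CI job exits non-zero. A minimal standalone sketch of that idea, in plain Python with no evalforge dependency (the function name and report shape here are illustrative assumptions, not the library's actual internals):

```python
# Hypothetical sketch of threshold gating, independent of the evalforge API.
# An eval report is modeled as a list of per-example scores in [0, 1].

def assert_pass(scores: list[float], threshold: float) -> float:
    """Raise AssertionError if the mean score falls below the threshold.

    An uncaught exception makes the process exit non-zero, which is what
    fails the CI job.
    """
    mean = sum(scores) / len(scores)
    if mean < threshold:
        raise AssertionError(
            f"Eval score {mean:.2f} is below threshold {threshold:.2f}"
        )
    return mean

print(assert_pass([0.9, 0.8, 1.0], threshold=0.85))  # mean ≈ 0.9, passes
```

The point is that the quality gate is just an exception: no plugin protocol is needed for CI integration beyond a non-zero exit code.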

Status

🚧 Early development. Star to follow progress.

What it does

  • Eval runner — LLM-as-judge, exact match, ROUGE, semantic similarity, custom Python scorers
  • CI integration — GitHub Actions native, fails the build if quality drops
  • Dataset versioning — store, version, and sample eval sets
  • Comparison mode — A/B test prompts and models against a baseline
  • Shadow traffic — route production traffic to a new model and compare live
  • Vertical eval packs — pre-built datasets and scorers for RAG, customer support, code generation, legal, medical
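The "custom Python scorers" bullet suggests a scorer is just a callable mapping a model output (and some reference) to a score in [0, 1]. A standalone keyword-coverage scorer in that spirit, written as plain Python with no evalforge imports (the signature is an assumption about what such a hook might look like, not the library's documented interface):

```python
# Hypothetical custom scorer, independent of the evalforge API:
# the fraction of expected keywords that appear in the model's answer.

def keyword_coverage(prediction: str, expected_keywords: list[str]) -> float:
    """Score in [0, 1]: share of expected keywords found (case-insensitive)."""
    if not expected_keywords:
        return 1.0  # nothing required, trivially satisfied
    text = prediction.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

print(keyword_coverage("Paris is the capital of France.", ["paris", "capital"]))
```

A deterministic scorer like this pairs naturally with an LLM judge: it is cheap, reproducible across runs, and catches gross regressions even when judge scores drift.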

Roadmap

  • Python SDK
  • CLI
  • GitHub Actions action
  • Hosted control plane (mawlaia.com)
  • Vertical eval packs

License

MIT
