evalforge
Eval runner for LLM applications. CI-native AI quality.
Your code has tests. Your prompts don't. evalforge brings the same quality gates you have for code to your LLM features — run on every PR, catch regressions before they reach production.
from evalforge import EvalDataset, run_eval
from evalforge.scorers import llm_judge, exact_match

dataset = EvalDataset.from_json("evals/rag_quality.json")
report = run_eval(
    dataset=dataset,
    model="claude-3-5-sonnet",
    scorers=[llm_judge(rubric="Is the answer accurate and grounded?"), exact_match()],
)
report.assert_pass(threshold=0.85)  # fails CI if score drops below 85%
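To make the quality gate concrete, here is a minimal sketch of what a threshold check of this kind could look like. It is not evalforge's implementation: `check_threshold` is a hypothetical standalone helper, and averaging per-example scores with a plain mean is an assumption.

```python
from statistics import mean


def check_threshold(scores: list[float], threshold: float = 0.85) -> float:
    """Return the mean score, or raise AssertionError if it is below `threshold`.

    Hypothetical helper for illustration; evalforge's own aggregation may differ.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    aggregate = mean(scores)
    if aggregate < threshold:
        raise AssertionError(
            f"eval score {aggregate:.3f} is below threshold {threshold:.3f}"
        )
    return aggregate
```

An uncaught `AssertionError` makes the Python process exit with a nonzero status, which CI systems treat as a failed step.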
Status
🚧 Early development. Star to follow progress.
What it does
- Eval runner — LLM-as-judge, exact match, ROUGE, semantic similarity, custom Python scorers
- CI integration — GitHub Actions native, fails the build if quality drops
- Dataset versioning — store, version, and sample eval sets
- Comparison mode — A/B test prompts and models against a baseline
- Shadow traffic — route production traffic to a new model and compare live
- Vertical eval packs — pre-built datasets and scorers for RAG, customer support, code generation, legal, medical
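As an illustration of the scorer and comparison ideas listed above, the sketch below implements two simple scorers in plain Python and compares a candidate run against a baseline. Everything here is an assumption made for the example: the `Scorer` signature, the function names, and the whitespace tokenization are hypothetical and are not taken from evalforge's API. The overlap scorer is a unigram F1 in the style of ROUGE-1, without the stemming or tokenization options that full ROUGE implementations offer.

```python
from collections import Counter
from typing import Callable

# Assumed scorer shape: (prediction, reference) -> score in [0, 1].
Scorer = Callable[[str, str], float]


def exact_match_score(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming whitespace, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def unigram_f1_score(prediction: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller side.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_all(
    pairs: list[tuple[str, str]], scorers: dict[str, Scorer]
) -> dict[str, float]:
    """Average each scorer over (prediction, reference) pairs."""
    if not pairs:
        raise ValueError("no examples to score")
    return {
        name: sum(scorer(pred, ref) for pred, ref in pairs) / len(pairs)
        for name, scorer in scorers.items()
    }


def regressions(
    baseline: dict[str, float], candidate: dict[str, float], tolerance: float = 0.0
) -> dict[str, float]:
    """Score deltas for scorers where the candidate is worse than the baseline."""
    deltas = {name: candidate[name] - baseline[name] for name in baseline}
    return {name: delta for name, delta in deltas.items() if delta < -tolerance}
```

A custom scorer in this sketch is any function with the same two-string signature, so it can be added to the `scorers` dictionary next to the built-in ones.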
Roadmap
- Python SDK
- CLI
- GitHub Actions action
- Hosted control plane (mawlaia.com)
- Vertical eval packs
License
MIT
File details
Details for the file mawlaia_evalforge-0.1.0.tar.gz.
File metadata
- Download URL: mawlaia_evalforge-0.1.0.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3fa549a8588f463f0d6e101d53717997a9c66ebd5673667550f997ac3ada3e8c |
| MD5 | 2354f296f43893ac6c472b56f036bed6 |
| BLAKE2b-256 | c9c903ecd1d8c396571111558a49155ecc8652a913000d86098abea0118ff4ad |
File details
Details for the file mawlaia_evalforge-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mawlaia_evalforge-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f87c5fa3bdab3c8fc0f85c0a502b1874efe2bf6b225343055a00df781a2ba170 |
| MD5 | 81e775e0bea803dac9cca4501b00d981 |
| BLAKE2b-256 | b32a3b1b3e5da22fb8e26cf96d86bbe9c145da4f40c42fd3f4b5c4e173322beb |
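The published digests can be checked against a downloaded file before installing it. The following is a minimal sketch using only the Python standard library; `sha256_of_file` and `verify_sha256` are hypothetical helper names chosen for this example.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: str, expected_hex: str) -> bool:
    """True if the file's SHA256 digest equals the expected hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `verify_sha256("mawlaia_evalforge-0.1.0.tar.gz", "3fa549a8588f463f0d6e101d53717997a9c66ebd5673667550f997ac3ada3e8c")` should return `True` for an unmodified copy of the source distribution listed above.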