LLM-judge evaluators (G-Eval) for AgentForge
agentforge-eval-geval
LLM-judge evaluators for AgentForge (feat-006).
This package ships the G-Eval engine — an LLM-as-judge grader that scores agent outputs against a rubric YAML — plus six named graders that wrap G-Eval with reference rubrics:
| Grader | Scores |
|---|---|
| Correctness | Output matches the ground-truth answer (rubric-tunable for binary / Likert / ordinal) |
| Faithfulness | Output is supported by the retrieved evidence (no claims beyond what was retrieved) |
| Groundedness | Output stays inside the provided sources (no off-source content) |
| Hallucination | Output contains content not derivable from inputs (faithfulness + groundedness combined into a single risk score) |
| Relevance | Output addresses the user's question vs going off-topic |
| Helpfulness | Output is useful: actionable, complete, well-structured |
Each grader takes a judge LLMClient at construction. The judge is
typically a cheaper model than the agent's primary model (e.g. Haiku
judging Sonnet output), which cuts cost and reduces the bias of a
model agreeing with its own output.
Quick start
import asyncio

from agentforge import Agent
from agentforge_bedrock import BedrockClient
from agentforge_eval_geval import Correctness, Faithfulness

# A cheaper model acts as the judge for every grader.
judge = BedrockClient(model_id="us.anthropic.claude-haiku-4-5")

agent = Agent(
    model="bedrock:us.anthropic.claude-sonnet-4-5-20250929",
    evaluators=[
        Correctness(judge=judge, ground_truth_field="expected"),
        Faithfulness(judge=judge, sources_field="retrieved_docs"),
    ],
)

async def main():
    # agent.run is awaited, so it has to be called from inside a coroutine.
    result = await agent.run("Summarise PR #42")
    print(result.eval_scores)

asyncio.run(main())
Cost-bounded: each judge call bills against the run's BudgetPolicy
(feat-007). When the remaining budget falls below the grader's
cost_estimate_usd, the agent skips the grader and logs a WARN.
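The skip rule above can be sketched in plain Python. This is only an illustration of the described behaviour, not AgentForge's implementation: the `Grader` dataclass and `select_graders` function are hypothetical names, and charging each grader its full estimate up front is an assumption.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("budget_sketch")


@dataclass
class Grader:
    """Hypothetical stand-in for a grader; only the cost estimate matters here."""
    name: str
    cost_estimate_usd: float


def select_graders(graders, remaining_usd):
    """Return (graders that fit the budget, budget left after reserving them).

    Assumption: a grader that runs is charged its full estimate up front.
    """
    runnable = []
    for grader in graders:
        if remaining_usd < grader.cost_estimate_usd:
            # Mirrors the documented behaviour: skip the grader and log a warning.
            logger.warning(
                "skipping %s: remaining $%.4f is below estimate $%.4f",
                grader.name, remaining_usd, grader.cost_estimate_usd,
            )
            continue
        runnable.append(grader)
        remaining_usd -= grader.cost_estimate_usd
    return runnable, remaining_usd


graders = [
    Grader("correctness", 0.002),
    Grader("faithfulness", 0.004),
    Grader("relevance", 0.001),
]
runnable, left = select_graders(graders, remaining_usd=0.005)
print([g.name for g in runnable])
```

With a budget of $0.005, the second grader no longer fits once the first has been reserved, so it is skipped while the cheaper third grader still runs.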
Custom rubrics with G-Eval
from agentforge_eval_geval import GEval

grader = GEval(
    judge=judge,
    name="code-review-quality",
    rubric={
        "criteria": "Score the PR description's accuracy and completeness.",
        "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
        "examples": [
            {"output": "...", "score": 0.9, "reasoning": "..."},
        ],
    },
)
Or load from a YAML file:
grader = GEval.from_rubric_file("./rubrics/code-review.yaml", judge=judge)
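To make the rubric fields concrete, here is a rough sketch of how an LLM-as-judge grader might turn a rubric into a judge prompt and read a score back from the reply. The package's actual prompt format and parsing are not documented here, so `build_judge_prompt`, `parse_score`, and the "Score: <number>" reply convention are all assumptions made for illustration.

```python
import re


def build_judge_prompt(rubric: dict, output: str) -> str:
    """Render the rubric keys (criteria, scoring, examples) into a prompt string."""
    lines = [
        "You are grading an agent output.",
        f"Criteria: {rubric['criteria']}",
        f"Scoring: {rubric['scoring']}",
    ]
    # Few-shot examples are optional; each one shows an output with its score.
    for example in rubric.get("examples", []):
        lines.append(
            f"Example output: {example['output']}\n"
            f"Score: {example['score']}\n"
            f"Reasoning: {example['reasoning']}"
        )
    lines.append(f"Output to grade: {output}")
    lines.append("Reply with 'Score: <number between 0.0 and 1.0>'.")
    return "\n".join(lines)


def parse_score(reply: str) -> float:
    """Extract the first 'Score: x' value and clamp it to [0.0, 1.0]."""
    match = re.search(r"Score:\s*([0-9]*\.?[0-9]+)", reply)
    if match is None:
        raise ValueError("no score found in judge reply")
    return min(1.0, max(0.0, float(match.group(1))))


rubric = {
    "criteria": "Score the PR description's accuracy and completeness.",
    "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
    "examples": [],
}
prompt = build_judge_prompt(rubric, "Adds retry logic to the HTTP client.")
score = parse_score("Score: 0.8, mostly accurate but misses the config change.")
```

Clamping the parsed value keeps a judge reply that drifts outside the rubric's range from producing an out-of-range score.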