LLM-judge evaluators (G-Eval) for AgentForge

agentforge-eval-geval

LLM-judge evaluators for AgentForge (feat-006).

This package ships the G-Eval engine, an LLM-as-judge grader that scores agent outputs against a YAML rubric, plus six named graders that wrap G-Eval with reference rubrics:

| Grader | Scores |
|---|---|
| Correctness | Output matches the ground-truth answer (rubric-tunable for binary / Likert / ordinal) |
| Faithfulness | Output is supported by the retrieved evidence (no claims beyond what was retrieved) |
| Groundedness | Output stays inside the provided sources (no off-source content) |
| Hallucination | Output contains content not derivable from inputs (faithfulness + groundedness combined into a single risk score) |
| Relevance | Output addresses the user's question rather than going off-topic |
| Helpfulness | Output is useful: actionable, complete, well-structured |

Each grader takes a judge LLMClient at construction. The judge is typically a cheaper model than the agent's primary model (e.g. Haiku judging Sonnet output), which cuts cost and reduces "judge agrees with itself" bias.

Quick start

import asyncio

from agentforge import Agent
from agentforge_bedrock import BedrockClient
from agentforge_eval_geval import Correctness, Faithfulness

# A cheaper judge model grades the primary model's output.
judge = BedrockClient(model_id="us.anthropic.claude-haiku-4-5")

agent = Agent(
    model="bedrock:us.anthropic.claude-sonnet-4-5-20250929",
    evaluators=[
        Correctness(judge=judge, ground_truth_field="expected"),
        Faithfulness(judge=judge, sources_field="retrieved_docs"),
    ],
)


async def main() -> None:
    # agent.run is awaited, so it has to run inside an event loop.
    result = await agent.run("Summarise PR #42")
    print(result.eval_scores)


asyncio.run(main())

Judge calls are cost-bounded: each one is billed against the run's BudgetPolicy (feat-007). When the remaining budget falls below a grader's cost_estimate_usd, the agent skips that grader and logs a message at WARN level.
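
The skip rule can be illustrated with a small standalone sketch. It does not use the AgentForge API: SketchGrader and select_affordable are hypothetical names, and the sketch assumes that each grader that runs reserves its full estimate before the next grader is considered. The real BudgetPolicy may account for cost differently.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("budget_sketch")


@dataclass
class SketchGrader:
    """Hypothetical stand-in for a grader; only the fields the rule needs."""
    name: str
    cost_estimate_usd: float


def select_affordable(graders, remaining_usd):
    """Split graders into (to_run, skipped) under a remaining budget.

    Assumption: a grader that runs reserves its estimate up front, so
    later graders are checked against a smaller remaining budget.
    """
    to_run, skipped = [], []
    for grader in graders:
        if remaining_usd < grader.cost_estimate_usd:
            # Mirrors the documented behaviour: skip and log at WARN level.
            logger.warning(
                "skipping %s: estimate %.4f USD exceeds remaining %.4f USD",
                grader.name, grader.cost_estimate_usd, remaining_usd,
            )
            skipped.append(grader)
            continue
        remaining_usd -= grader.cost_estimate_usd
        to_run.append(grader)
    return to_run, skipped


graders = [
    SketchGrader("correctness", 0.002),
    SketchGrader("faithfulness", 0.004),
    SketchGrader("helpfulness", 0.001),
]
run, skipped = select_affordable(graders, remaining_usd=0.004)
print([g.name for g in run], [g.name for g in skipped])
```

With these made-up estimates, the second grader is skipped because only 0.002 USD remains after the first one, while the cheaper third grader still fits.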

Custom rubrics with G-Eval

from agentforge_eval_geval import GEval

grader = GEval(
    judge=judge,
    name="code-review-quality",
    rubric={
        "criteria": "Score the PR description's accuracy and completeness.",
        "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
        "examples": [
            {"output": "...", "score": 0.9, "reasoning": "..."},
        ],
    },
)

Or load from a YAML file:

grader = GEval.from_rubric_file("./rubrics/code-review.yaml", judge=judge)
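
To make the rubric fields more concrete, here is a minimal sketch of how a rubric with criteria, scoring, and examples could be turned into a judge prompt, and how a judge reply could be reduced to a score in the 0.0 to 1.0 range. This is not the package's implementation: build_judge_prompt, parse_score, the prompt wording, and the "Score: <number>" reply format are all assumptions made for illustration, and the example rubric values are invented.

```python
import re


def build_judge_prompt(rubric: dict, output: str) -> str:
    """Assemble a judge prompt from a rubric dict (hypothetical layout)."""
    parts = [
        "You are grading an agent output.",
        f"Criteria: {rubric['criteria']}",
        f"Scoring: {rubric['scoring']}",
    ]
    # Few-shot examples are optional; each one anchors the scale.
    for ex in rubric.get("examples", []):
        parts.append(
            f"Example output: {ex['output']}\n"
            f"Score: {ex['score']}\n"
            f"Reasoning: {ex['reasoning']}"
        )
    parts.append(f"Output to grade:\n{output}")
    parts.append("Reply with 'Score: <number between 0.0 and 1.0>' and a short reasoning.")
    return "\n\n".join(parts)


# Assumed reply format: a 'Score: <number>' field somewhere in the text.
_SCORE_RE = re.compile(r"score\s*[:=]\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_score(reply: str) -> float:
    """Extract the first score from a judge reply and clamp it to [0.0, 1.0]."""
    match = _SCORE_RE.search(reply)
    if match is None:
        raise ValueError("judge reply contains no 'Score: <number>' field")
    return min(1.0, max(0.0, float(match.group(1))))


# Invented rubric values, for illustration only.
rubric = {
    "criteria": "Score the PR description's accuracy and completeness.",
    "scoring": "0.0 = incorrect/incomplete; 1.0 = accurate and complete",
    "examples": [
        {"output": "Adds retry logic.", "score": 0.5, "reasoning": "Accurate but terse."},
    ],
}
prompt = build_judge_prompt(rubric, "Adds retry logic to the uploader and covers it with tests.")
score = parse_score("Score: 0.85\nReasoning: accurate, mentions tests.")
print(score)
```

Clamping the parsed value keeps an out-of-range judge reply from leaking outside the scale the rubric declares; raising on a missing score makes a malformed reply visible instead of silently scoring it as zero.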
